By Yu Zhuang
Alibaba Cloud Container Service for Kubernetes (ACK) provides a wide range of mature product capabilities for scalability, scheduling, observability, cost management, and security compliance. However, if you have resources in data centers or third-party resources that cannot be migrated to ACK in the short term, and you still need these capabilities, you can consider using ACK One.
At the Apsara Conference three years ago, we released ACK One, a distributed cloud container platform. Over three years of development, we are glad to see that ACK One has helped more and more customers in the hybrid cloud and distributed cloud fields. Today, we will review how ACK One has evolved over the past three years and how it helps customers solve the challenges of multi-cloud and multi-cluster management.
First, let's summarize the challenges enterprises face in the field of distributed cloud containers. For self-managed clusters in data centers, there is a conflict between the fixed capacity of data center resources and the uncertainty of business traffic: during peak hours, data center resources are insufficient, degrading online services and forcing offline jobs to queue; during off-peak hours, those resources sit idle. As the business grows, the number of clusters increases and they come from different vendors, so O&M technologies vary greatly and the workload is heavy. In addition, as enterprises raise their requirements for high availability, building a disaster recovery system for multi-cluster applications is another difficulty. Finally, for servers spread across locations, managing edge servers and GPU services through Kubernetes clusters while unifying the technology stacks of the center and the edge is also a problem.
ACK One allows you to quickly build a distributed cloud-native infrastructure to address these challenges. Registered clusters let you connect and manage self-managed Kubernetes clusters in data centers and third-party public cloud clusters.
Edge clusters let you manage servers in IDC/ENS in a containerized manner, as well as distributed servers located in factories and stores.
After the connection, you can attach ECS or ECI instances on the cloud to IDC clusters to expand elastic computing power, and then use the auto-scaling and intelligent scheduling provided by ACK to remove concerns about IDC resource limits and reduce overall costs. Registered clusters also provide the same O&M experience as ACK clusters, including the cloud-native AI suite, O&M, observability, and security compliance capabilities.
Finally, multi-cluster fleets of ACK One provide a unified control plane for managing multiple Kubernetes clusters, with capabilities such as application distribution, job scheduling, traffic management, and O&M management.
The figure shows the architecture of ACK One registered clusters. You can connect Kubernetes clusters in third-party public clouds and self-managed Kubernetes clusters in data centers to ACK One registered clusters.
First, you need an existing Kubernetes cluster, in which you install the registered cluster's agent pod. The agent pod makes a reverse connection to the ACK One registered cluster server and establishes a control channel.
After that, the ACK One registered cluster can receive Kubernetes API requests and forward them to the API server of the self-managed Kubernetes cluster through the agent pod. Through the registered cluster, you gain the extended capabilities of ACK clusters, including Prometheus monitoring, Simple Log Service, FinOps, the backup center, security, Knative, KServe for AI scenarios, Fluid, image acceleration, and ASM. All of these cloud capabilities can thus be applied to Kubernetes clusters in third-party public clouds and self-managed Kubernetes clusters.
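Once the channel is established, the registered cluster behaves like any other Kubernetes endpoint. Below is a minimal sketch with the official Kubernetes Python client, assuming you have exported the registered cluster's kubeconfig from the ACK One console (the file name is hypothetical):

```python
# Minimal sketch: talk to an on-premises cluster through its ACK One
# registered-cluster endpoint. Requests issued here reach the on-premises
# API server via the reverse tunnel held open by the agent pod.
# Assumes `pip install kubernetes` and a kubeconfig exported from the
# ACK One console (the file name below is hypothetical).
from kubernetes import client, config

config.load_kube_config(config_file="registered-cluster.kubeconfig")
v1 = client.CoreV1Api()

# List nodes of the self-managed cluster -- the response is served by the
# on-premises API server, relayed through the agent's control channel.
for node in v1.list_node().items:
    print(node.metadata.name, node.status.node_info.kubelet_version)
```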
In addition, to address the limited elasticity of data center resources, ACK One registered clusters support elasticity on the cloud: they can add ECS and ECI computing resources, including CPU and GPU resources, to self-managed Kubernetes clusters. The self-managed Kubernetes cluster then schedules these cloud resources to absorb business traffic growth.
In other words, you retain full control over the self-managed Kubernetes cluster while gaining the same elasticity as ACK clusters on the cloud.
First, through the registered cluster, a Kubernetes cluster deployed in a data center can attach ECS nodes on the cloud, or attach serverless computing power (ACS or ECI) through ACK Virtual Node. For ECS nodes on the cloud, custom VM images are supported and are fully compatible with on-premises images, and auto-scaling is supported as well.
If you choose serverless computing power (ACS or ECI), you get optimal elasticity, with a one-to-one mapping between Kubernetes pods and serverless instances. This setup allows rapid startup and eliminates node management. In addition, to make serverless computing power easier to reach, you can install the ACK Virtual Node Helm chart directly, without creating a registered cluster, so that clusters in data centers can quickly obtain ACS or ECI computing power.
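To make this concrete, here is a hedged sketch of a pod pinned to a virtual node. ACK's virtual nodes are built on virtual-kubelet; the nodeSelector and toleration values below follow common virtual-kubelet conventions and should be verified against the current ACK Virtual Node documentation:

```python
# Hedged sketch: schedule a pod onto a virtual node so it runs as a
# serverless instance on the cloud. The selector/toleration values follow
# common virtual-kubelet conventions; verify them against the ACK Virtual
# Node docs for your version.
from kubernetes import client, config

config.load_kube_config()
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="burst-worker"),
    spec=client.V1PodSpec(
        containers=[client.V1Container(name="app", image="nginx:alpine")],
        # Target the virtual node rather than on-premises nodes.
        node_selector={"type": "virtual-kubelet"},
        # Virtual nodes are usually tainted so that only pods that
        # explicitly tolerate them land there.
        tolerations=[client.V1Toleration(
            key="virtual-kubelet.io/provider", operator="Exists")],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```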
The ACK collaborative scheduler spans on-premises and cloud resources. It supports GPU sharing and scheduling to improve GPU utilization, prioritizes resources in data centers, and bursts to cloud computing power when data center resources are insufficient. After peak hours, cloud resources are scaled down first, improving resource utilization and reducing costs.
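The priority ordering itself is handled by the ACK scheduler, but the intent can be approximated with standard Kubernetes preferred node affinity, as in this generic sketch (the `node-pool` label key and values are hypothetical):

```python
# Generic sketch of "prefer on-premises, spill to cloud": standard preferred
# node affinity approximates what the ACK collaborative scheduler does
# natively. The `node-pool` label key/values are hypothetical.
from kubernetes import client

prefer_on_prem = client.V1Affinity(
    node_affinity=client.V1NodeAffinity(
        preferred_during_scheduling_ignored_during_execution=[
            client.V1PreferredSchedulingTerm(
                weight=100,  # strongly prefer data center nodes
                preference=client.V1NodeSelectorTerm(
                    match_expressions=[client.V1NodeSelectorRequirement(
                        key="node-pool", operator="In", values=["on-prem"])]
                ),
            )
        ]
    )
)
# Attach via V1PodSpec(affinity=prefer_on_prem, ...); when on-premises nodes
# are full, the scheduler falls back to cloud nodes the pod tolerates.
```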
Through registered clusters, self-managed clusters in data centers can flexibly obtain cloud computing power at the scale of tens of thousands of cores, eliminating anxiety about IDC resources and reducing overall costs.
Through registered clusters, you can integrate the observability features of ACK clusters, such as monitoring, logging, events, and alerts, including the out-of-the-box Grafana dashboards, into self-managed Kubernetes clusters. The same observability stack then serves ACK clusters, self-managed clusters, and third-party public cloud clusters, delivering a consistent experience on the cloud and on premises. For cost management, ACK provides a complete solution supporting cost insight from multiple perspectives, such as clusters, namespaces, node pools, and applications.
Through registered clusters, you can use the cost insight capability of ACK clusters to reduce the costs of self-managed Kubernetes clusters and third-party public cloud Kubernetes clusters.
Through registered clusters, you can use the cloud backup and recovery capabilities to back up applications and data in self-managed Kubernetes clusters and third-party public cloud clusters, and restore them in ACK clusters, implementing cross-region disaster recovery for applications and data. Backup and restoration cover both stateless and stateful applications, including their data, and a variety of backup policies and storage backends are supported.
Moreover, backup and restoration enable application migration to the cloud, and configuration overrides bridge application configuration differences between clouds.
Next, let's look at the architecture of edge clusters. The upper half is the control plane of the ACK edge cluster. On Alibaba Cloud, the control plane (API server, etcd, and so on) of the ACK Pro cluster is reused and integrated with open-source OpenYurt to provide edge node autonomy and cloud-edge network collaboration. On top of that, it integrates the AI suite, elasticity on the cloud, observability, and other Alibaba Cloud capabilities.
The lower half is the edge part, which connects and manages servers in data centers, including ENS, IDC, and cross-region ECS, as well as servers distributed in factories, venues, and campuses.
Together they form an integrated cloud-edge collaboration architecture. You create edge clusters on the cloud and connect distributed servers and devices to them, which lets you schedule and use those servers and devices through containerized Kubernetes clusters along with the extended capabilities of ACK Pro.
If you instead created self-managed clusters in each location to manage these servers and devices, you would end up with a large number of clusters and high maintenance costs. Currently, ACK Edge serves various industries, including online audio and video live streaming, cloud gaming, online education, and logistics.
ACK Edge also targets on-premises data centers. Suppose you have servers in your data center but, to save labor costs, do not want to build self-managed Kubernetes clusters to manage them. In that case, you can use ACK edge clusters to manage these servers and run your core businesses in a containerized manner.
To this end, ACK Edge has built a series of product capabilities. The ACK Pro control plane is adopted on the cloud for stability. Edge node autonomy ensures that control plane network exceptions do not affect edge data plane services; when such exceptions occur, emergency O&M still allows on-premises business pods to be modified. Cloud-edge hybrid elasticity makes it easy to scale out cloud ECS during peak hours. ACK's observability capabilities are integrated to unify the technology stack, and cloud-edge traffic reuse supports access at large edge node scale.
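Node autonomy comes from OpenYurt: an edge node is marked autonomous with an annotation so that its pods are not evicted and can restart locally when the cloud-edge network is interrupted. A minimal sketch, with the annotation key taken from OpenYurt's documentation (verify it for the OpenYurt version bundled with your ACK Edge release):

```python
# Hedged sketch: mark an edge node autonomous via OpenYurt's node annotation,
# so its pods keep running and restart locally even if the connection to the
# cloud control plane is lost. Annotation key per OpenYurt docs; verify it
# for the OpenYurt version bundled with your ACK Edge release.
from kubernetes import client, config

config.load_kube_config()
patch = {"metadata": {"annotations": {"node.beta.openyurt.io/autonomy": "true"}}}
client.CoreV1Api().patch_node(name="edge-node-01", body=patch)  # node name is hypothetical
```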
With the large-scale adoption of AI applications, GPU cards are in short supply, and it is hard to obtain enough of them from a single supplier. During AI development and operation, there are also problems such as low GPU utilization, the lack of a visual interface and monitoring, and slow data loading.
Edge clusters allow you to manage distributed GPU computing power. For example, a single cluster can manage GPU servers both in IDCs and on the cloud to offset the shortage of GPU resources. You can also use the ACK cloud-native AI suite in data centers to improve AI engineering efficiency, raise GPU utilization, and accelerate the loading of images, models, and data.
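With GPU sharing enabled, a pod requests a slice of GPU memory instead of a whole card. A hedged sketch follows; the extended resource name `aliyun.com/gpu-mem` is recalled from ACK's shared GPU scheduling docs and should be treated as an assumption to verify:

```python
# Hedged sketch: request a slice of GPU memory instead of a whole card.
# ACK's shared GPU scheduling exposes an extended resource for GPU memory;
# the name `aliyun.com/gpu-mem` is recalled from ACK docs -- verify it
# against your cluster before use.
from kubernetes import client

inference = client.V1Container(
    name="inference",
    image="my-inference:latest",  # hypothetical image
    resources=client.V1ResourceRequirements(
        limits={"aliyun.com/gpu-mem": "4"},  # 4 GiB of GPU memory, not 4 GPUs
    ),
)
```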
Beyond AI, ACK Edge can manage HPC resources at the edge, support Slurm scheduling, and run HPC jobs in a containerized manner.
Edge clusters support underlay networks, improving network performance by 30%, and support serverless computing power on the cloud. On-premises and cloud pods and nodes share the same network plane, and pods can be accessed directly from outside the cluster. This suits containerized management of large-scale data center servers and unlocks more business scenarios.
First, let's look at the architecture of multi-cluster fleets. Fleets integrate open-source Open Cluster Management to uniformly manage multiple Kubernetes clusters, which can be ACK clusters in the public cloud, ACK edge clusters, Kubernetes clusters in data centers, or third-party public cloud clusters.
Multi-cluster fleets provide users with a unified control plane, offering capabilities such as application distribution, job scheduling, multi-cluster ingress gateways, multi-cluster services, global observability, and multi-cluster component management.
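Because the fleet builds on Open Cluster Management, distribution can be pictured with OCM's own APIs. A minimal sketch of a ManifestWork, which places a manifest into one managed cluster; the fleet's distribution rules are layered on top of primitives like this (cluster and object names are placeholders):

```python
# Minimal OCM sketch: a ManifestWork places raw manifests into one managed
# cluster. On the hub, the ManifestWork lives in the namespace named after
# the target cluster. Names below are placeholders.
from kubernetes import client, config

config.load_kube_config()  # kubeconfig of the fleet/hub cluster
work = {
    "apiVersion": "work.open-cluster-management.io/v1",
    "kind": "ManifestWork",
    "metadata": {"name": "demo-config", "namespace": "cluster1"},
    "spec": {"workload": {"manifests": [{
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": "demo", "namespace": "default"},
        "data": {"greeting": "hello from the fleet"},
    }]}},
}
client.CustomObjectsApi().create_namespaced_custom_object(
    group="work.open-cluster-management.io", version="v1",
    namespace="cluster1", plural="manifestworks", body=work)
```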
Multi-cluster fleets of ACK One provide a fully managed multi-cluster application release system, supporting multi-cluster GitOps and multi-cluster application resource distribution with custom distribution rules, so you can quickly build a multi-cluster application release platform at low integration cost. Fleets can release batch applications globally across regions with one click, doubling release efficiency and ensuring rapid business iteration. They support integrated canary release, verification, and rollback processes to ensure release quality, can quickly convert single-cluster applications into multi-cluster deployments for high availability, and support multi-tenancy for multi-team isolation.
Finally, application release is a fully managed, O&M-free product whose control plane is optimized for large-scale releases, supporting the simultaneous release of thousands of applications.
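The multi-cluster GitOps follows the Argo CD model, so an application can be declared roughly as below; repo URL, path, and destination cluster name are placeholders, and the exact resource shape should be checked against the ACK One GitOps documentation:

```python
# Hedged GitOps sketch in the Argo CD model that ACK One's multi-cluster
# GitOps follows: one Application tracks a Git path and syncs it to a
# destination cluster. Repo URL, path, and cluster name are placeholders.
from kubernetes import client, config

config.load_kube_config()  # fleet control plane hosting the GitOps service
app = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Application",
    "metadata": {"name": "web", "namespace": "argocd"},
    "spec": {
        "project": "default",
        "source": {"repoURL": "https://example.com/org/web.git",
                    "targetRevision": "main", "path": "manifests"},
        "destination": {"name": "prod-cluster-1", "namespace": "web"},
        "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
    },
}
client.CustomObjectsApi().create_namespaced_custom_object(
    group="argoproj.io", version="v1alpha1",
    namespace="argocd", plural="applications", body=app)
```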
The huge demand for computing resources in scenarios such as large model training and autonomous driving simulation is hitting the service capacity ceiling of a single cloud region or data center. ACK One provides a Fleet + Clusters scheduling system that schedules tasks uniformly across clusters, efficiently utilizing computing resources in multiple regions to meet users' demands for resource volume, cost, and performance.
Now let's look at multi-cluster traffic management. For application disaster recovery, the same application is usually deployed in two clusters in different zones, with a load balancer in front. Traditional load balancing generally uses intelligent DNS or a Layer-4 load balancer. The problem with DNS is client-side caching, which delays switchover when failures occur and causes business losses; the problem with a Layer-4 load balancer is that it cannot do Layer-7 request forwarding, which limits flexibility.
To solve this, we launched the multi-cluster gateway, which provides a global ingress with Layer-7 load balancing. The multi-cluster gateway itself is highly available across zones, and because it is a Layer-7 gateway, routing can be flexibly customized with policies based on weights, HTTP headers, and automatic fallback.
With multi-cluster gateways, we can implement multi-active disaster recovery for applications in multiple zones in the same city.
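The routing policies can be pictured with the standard Kubernetes Gateway API; the sketch below is a generic illustration of weight-based splitting across two clusters' backends, not necessarily the exact resource ACK One's gateway consumes:

```python
# Generic Gateway API sketch of Layer-7 multi-cluster routing: split traffic
# 90/10 across backends representing two clusters. This illustrates the
# policy model only; treat gateway and backend names as placeholders.
from kubernetes import client, config

config.load_kube_config()
route = {
    "apiVersion": "gateway.networking.k8s.io/v1",
    "kind": "HTTPRoute",
    "metadata": {"name": "web-split", "namespace": "default"},
    "spec": {
        "parentRefs": [{"name": "multi-cluster-gateway"}],
        "rules": [{
            "backendRefs": [
                {"name": "web-zone-a", "port": 80, "weight": 90},
                {"name": "web-zone-b", "port": 80, "weight": 10},
            ],
        }],
    },
}
client.CustomObjectsApi().create_namespaced_custom_object(
    group="gateway.networking.k8s.io", version="v1",
    namespace="default", plural="httproutes", body=route)
```

Header-based matches and fallback rules follow the same pattern, which is what makes zone-level failover fast compared with DNS-based switching.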
In addition, the fleet supports multi-cluster services and cross-cluster communication of services, facilitating cross-cluster migration of microservices.
Multi-cluster fleets of ACK One support global monitoring of multiple clusters: the status of clusters and applications can be viewed on one monitoring dashboard. They also support global FinOps: the costs of multiple clusters can be viewed on a single FinOps dashboard, simplifying daily O&M.
Since ACK ships a large number of components, version upgrades, configuration changes, and multi-cluster deployments increase O&M complexity. ACK One is about to release multi-cluster component management, which defines component versions through component baselines and implements canary releases and exception rollbacks across clusters through batch deployment. O&M personnel can then release ACK components with one click, greatly improving O&M efficiency and quality.
First of all, let's look at the open-source Argo Workflows project. Argo Workflows is a CNCF graduated project designed specifically for Kubernetes. It is an offline task orchestration and execution engine suited to data processing, scientific computing, simulation computing, and continuous integration, and it is the de facto standard for Kubernetes workflows.
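For readers new to Argo, a minimal workflow looks like the sketch below, submitted here through the generic Kubernetes Python client (Argo also ships the `argo` CLI and language SDKs); image and namespace are placeholders:

```python
# Minimal Argo Workflow sketch: one container step, submitted through the
# generic Kubernetes custom-objects API.
from kubernetes import client, config

config.load_kube_config()
wf = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "hello-"},
    "spec": {
        "entrypoint": "main",
        "templates": [{
            "name": "main",
            "container": {"image": "alpine:3.20",
                          "command": ["echo"], "args": ["hello argo"]},
        }],
    },
}
client.CustomObjectsApi().create_namespaced_custom_object(
    group="argoproj.io", version="v1alpha1",
    namespace="default", plural="workflows", body=wf)
```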
Running an open-source self-managed Argo Workflows cluster is relatively simple, but using it in production often surfaces many problems:
Stability: Large numbers of pods running on nodes lead to resource contention, OOM kills, and full disks, causing frequent downtime. Running workflows at large scale, or running complex workflows, often causes exceptions in the Kubernetes API server and the Argo controller.
Cost and Scale: A fixed resource pool cannot be cost-effectively shared, and large-scale workflows cannot run normally.
Security and O&M: Version iteration and upgrading introduce new issues, requiring additional design for login, authentication, and CVE patching.
In short, open-source self-managed Argo struggles to support large-scale workflows in production.
The above figure shows the architecture of the fully managed Serverless Argo workflow cluster. It is compatible with open-source Argo Workflows, so existing workflows migrate seamlessly.
In addition, you can wrap the workflow cluster with its API/SDK to build your own workflow submission system adapted to specific business scenarios; many customers in simulation, autonomous driving, and scientific computing use workflow clusters as their backend engine. An event-driven programming model is also supported: Git, Alibaba Cloud Object Storage Service (OSS), MNS, and Function/EventBridge can trigger workflows automatically, running offline tasks such as CI and data processing.
The distributed Argo workflow cluster runs workflows in a serverless mode, with no worker nodes to maintain. By leveraging computing power on the cloud, it can run large-scale workflows with tens of thousands of pods in parallel.
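Large fan-outs are expressed directly in the workflow spec. In the hedged sketch below, `withItems` expands one template into many parallel pods; in serverless mode each pod maps to one elastic cloud instance, so scale is bounded by quota rather than by worker nodes (image and shard count are illustrative):

```python
# Fan-out sketch: `withItems` expands one template into many parallel pods,
# and `parallelism` caps how many run at once. Image name is hypothetical.
fanout_wf = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "simulate-"},
    "spec": {
        "entrypoint": "main",
        "parallelism": 500,  # cap concurrent pods
        "templates": [
            {"name": "main", "steps": [[{
                "name": "shard",
                "template": "run-shard",
                "arguments": {"parameters": [{"name": "shard", "value": "{{item}}"}]},
                "withItems": [str(i) for i in range(2000)],  # 2,000 shards
            }]]},
            {"name": "run-shard",
             "inputs": {"parameters": [{"name": "shard"}]},
             "container": {"image": "my-sim:latest",
                           "command": ["run-shard"],
                           "args": ["{{inputs.parameters.shard}}"]}},
        ],
    },
}
```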
Compared with the open-source self-managed workflow, the fully managed Serverless Argo workflow is significantly superior in usability, operational efficiency, cost, stability, performance, and scale. It also provides 24/7 after-sales support.
For example, stability and performance issues are fixed promptly. A single workflow can run 20,000 subtask pods in parallel, and 40,000 subtask pods can run stably overall, 10 times the scale of the open-source version. Subtask pods run in serverless mode, using optimal elasticity on the cloud to absorb business peaks; no nodes need to be reserved or maintained, cutting costs by 30% compared with the open-source self-managed approach.
At present, Argo workflow clusters are widely used in multiple industries, including autonomous driving simulation, scientific computing, genomics computing, financial analysis, and digital media.
Finally, let's briefly summarize today's sharing:
Distributed clouds involve a variety of scenarios and challenges, and ACK One addresses them with four products.
Registered clusters connect non-ACK clusters to ACK One, achieving unified management of clusters on-premises and on the cloud and providing elasticity on the cloud.
Edge clusters enable clusters on the cloud to manage servers in data centers, with support for edge node autonomy and collaborative AI.
Multi-cluster fleets provide multi-cluster GitOps, application resource distribution, and O&M management, making multi-cluster management easy.
Fully managed Argo workflow clusters orchestrate workflows and run them in a serverless mode to reduce costs, covering scenarios such as continuous integration, scientific computing, simulation computing, and data processing.
You can select the corresponding capabilities based on your business scenarios to achieve simple cross-cloud collaboration and improve business management efficiency.