ACK Lingjun managed clusters are built on Intelligent Computing LINGJUN. This type of cluster provides fully managed, highly available control planes and allows you to deploy Lingjun computing nodes. This topic describes ACK Lingjun managed clusters and their features and advantages.
Usage notes
To use ACK Lingjun managed clusters, you must first create a Lingjun cluster with ACK activated in the Intelligent Computing LINGJUN console.
For more information about the operations that you can perform on ACK Lingjun managed clusters and the features that they provide, see the following sections.
Introduction
ACK Lingjun managed clusters provide fully managed, highly available control planes, and support efficient management and scheduling of heterogeneous resources and tasks. This type of cluster can serve as the cloud-native foundation of Platform for AI, and provides enhanced cloud-native capabilities suitable for AI and high-performance computing (HPC) scenarios. The following figure shows the architecture of an ACK Lingjun managed cluster. The architecture decouples software from hardware and integrates with various Alibaba Cloud services to provide stable, reliable, efficient, and secure infrastructure for cloud-native AI workloads.
Features
Cluster management
ACK Lingjun managed clusters and ACK Pro clusters provide the same cluster management capabilities. ACK creates and manages the control planes of ACK Lingjun managed clusters. By default, the control planes of an ACK Lingjun managed cluster are deployed across three zones to ensure high availability. You can manage the lifecycle of an ACK Lingjun managed cluster. For example, you can grant permissions on the cluster, monitor the cluster, update the cluster, and manage the components in the cluster.
Node management
ACK Lingjun managed clusters provide Lingjun node pools in which you can deploy Lingjun computing nodes. Lingjun node pools support lifecycle management and provide the same management and O&M features as Elastic Compute Service (ECS) node pools. For example, you can add or remove nodes in batches, configure and maintain nodes, use fully managed nodes, schedule applications to specified nodes, monitor and diagnose nodes, and run automatic node O&M tasks.
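As one illustration of scheduling applications to specified nodes, a workload can target the nodes of a particular node pool with a standard Kubernetes nodeSelector. The sketch below is a minimal example; the node label key and its value are placeholders to verify against your cluster (for example, with `kubectl get nodes --show-labels`), and the image name is hypothetical.

```yaml
# Sketch: pin a Deployment's pods to the nodes of one node pool.
# The nodeSelector label key/value are placeholders; verify the exact
# labels that your node pool applies to its nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: training-worker
  template:
    metadata:
      labels:
        app: training-worker
    spec:
      nodeSelector:
        alibabacloud.com/nodepool-id: np-xxxxxxxx   # placeholder node pool ID
      containers:
      - name: worker
        image: registry.example.com/training:latest  # placeholder image
```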
Cloud-native AI
By default, ACK Lingjun managed clusters provide components to enhance cloud-native capabilities. For example, ACK Lingjun managed clusters support topology-aware multi-GPU scheduling, and enable GPU scheduling and isolation based on eGPU, which is a GPU virtualization component for GPU-accelerated containers. ACK Lingjun managed clusters provide gang scheduling and capacity scheduling, and support the binpack scheduling policy. In addition, ACK Lingjun managed clusters support dataset orchestration and access acceleration.
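For instance, gang scheduling makes a group of pods schedulable only when a minimum number of them can be placed at the same time, which prevents a distributed training job from starting partially and deadlocking on resources. The sketch below declares such a pod group by using the coscheduling-style label keys from the Kubernetes scheduler-plugins convention; the Job definition and image name are hypothetical, so confirm the label keys against your cluster's scheduler documentation before use.

```yaml
# Sketch: a distributed training job whose 4 workers are scheduled
# all-or-nothing via coscheduling-style pod-group labels (assumption:
# the cluster scheduler recognizes these label keys).
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  parallelism: 4
  completions: 4
  template:
    metadata:
      labels:
        pod-group.scheduling.sigs.k8s.io/name: distributed-training
        pod-group.scheduling.sigs.k8s.io/min-available: "4"
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: registry.example.com/trainer:latest  # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 1   # one GPU per worker
```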
Competitive advantages
Security and stability
ACK Lingjun managed clusters provide the same enterprise-class features as ACK Pro clusters, including highly available, fully managed control planes. This eliminates the need to manually build and maintain clusters. ACK Lingjun managed clusters ensure cluster stability, reliability, and security, and are backed by service-level agreements (SLAs) that include compensation clauses. These clusters can meet the requirements of enterprises in large-scale production environments.
Simplified O&M
ACK Lingjun managed clusters provide Kubernetes-native services and are deeply integrated with Intelligent Computing LINGJUN and related Alibaba Cloud services. They simplify operations and automate O&M for clusters and Lingjun computing nodes, provide the same management experience as ECS nodes, and significantly reduce adaptation and O&M costs.
Improved efficiency and acceleration
ACK Lingjun managed clusters provide GPU sharing, GPU scheduling, and topology-aware GPU scheduling to improve the efficiency and performance of heterogeneous resources. ACK Lingjun managed clusters provide rich scheduling policies and priority-based job queue management for AI and HPC tasks. These features can improve the execution efficiency of AI training jobs and inference tasks, and provide a unified and standard method to manage and deliver AI resources and workloads.
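As a sketch of GPU sharing, a pod can request a slice of GPU memory instead of an entire device, which lets several inference workloads share one GPU. The `aliyun.com/gpu-mem` resource name is the one commonly used by ACK's shared GPU scheduling (measured in GiB), but treat both the resource name and its unit as assumptions to confirm against the extended resources actually exposed in your cluster.

```yaml
# Sketch: request 4 GiB of GPU memory on a shared GPU rather than a
# whole device. Resource name and unit are assumptions to verify.
apiVersion: v1
kind: Pod
metadata:
  name: inference-shared-gpu
spec:
  containers:
  - name: inference
    image: registry.example.com/inference:latest  # placeholder image
    resources:
      limits:
        aliyun.com/gpu-mem: 4   # assumed: 4 GiB of shared GPU memory
```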