Container Service for Kubernetes (ACK) allows you to centrally schedule, manage, and maintain heterogeneous computing resources, which significantly improves the utilization of these resources in ACK clusters. This topic describes the features that ACK provides to manage heterogeneous resources in clusters for heterogeneous computing.
Background information
With the emergence of 5G, AI, high performance computing (HPC), and edge computing services, the demand for computing power keeps increasing. General-purpose computing based on CPUs cannot keep up with this demand, whereas heterogeneous computing based on domain-specific architectures (DSAs) can. As a result, various heterogeneous computing resources, such as GPUs and field-programmable gate arrays (FPGAs), are widely used in the preceding services.
However, enterprises find it difficult to manage large numbers of heterogeneous resources. Alibaba Cloud provides an all-in-one solution that allows you to schedule and manage heterogeneous resources in a unified manner.
Introduction to ACK clusters for heterogeneous computing
ACK allows you to centrally schedule, manage, and maintain heterogeneous resources in ACK clusters, such as GPUs, FPGAs, application-specific integrated circuits (ASICs), and remote direct memory access (RDMA) devices. This improves resource utilization in ACK clusters for heterogeneous computing. The following table describes the features that ACK provides to manage heterogeneous resources in clusters for heterogeneous computing.
Heterogeneous resource | Description |
GPU | ACK provides the following GPU management features:
- Cluster creation: ACK allows you to create clusters that contain NVIDIA T4, P100, V100, and A100 GPUs. For more information, see Create an ACK cluster with GPU-accelerated nodes and Create an ACK dedicated cluster with GPU-accelerated nodes.
- GPU requests: ACK supports resource requests for individual GPUs.
- Auto scaling: ACK supports automatic scaling of GPU-accelerated nodes. For more information, see Enable auto scaling based on GPU metrics.
- GPU sharing and isolation: ACK supports GPU sharing, GPU scheduling, and computing power isolation. The GPU sharing and scheduling capability provided by Alibaba Cloud allows you to schedule one GPU to multiple model inference applications, which significantly reduces costs. The cGPU solution provided by Alibaba Cloud isolates the GPU memory and computing power allocated to different applications without requiring you to modify application configurations, which improves application stability. The following GPU allocation policies are supported. For more information, see GPU sharing overview and Allocate computing power by scheduling shared GPU.
  - GPU sharing and memory isolation on a one-pod-one-GPU basis: This policy is commonly used in model inference scenarios.
  - GPU sharing and memory isolation on a one-pod-multi-GPU basis: This policy is commonly used to build the code to train distributed models.
  - GPU allocation by using the binpack or spread algorithm: The binpack algorithm preferentially shares one GPU among multiple pods and is suitable for scenarios that require high GPU utilization. The spread algorithm attempts to allocate a separate GPU to each pod and is suitable for scenarios that require high GPU availability.
- Topology-aware GPU scheduling: This feature retrieves the topology of heterogeneous resources on nodes and enables the scheduler to make scheduling decisions based on node topology information, such as NVLink connections, Peripheral Component Interconnect Express (PCIe) switches, QuickPath Interconnect (QPI), and RDMA network interface controllers (NICs). This optimizes scheduling and improves performance. For more information, see Overview of topology-aware GPU scheduling.
- GPU resource monitoring: This feature collects the metrics of nodes and applications, detects and sends alerts on software and hardware exceptions of devices, and can monitor both dedicated GPUs and shared GPUs. For more information, see Monitor GPU errors and Use Prometheus Service to monitor the GPU resources of a Kubernetes cluster.
|
FPGA | ACK allows you to create clusters that contain FPGA devices. For more information, see Create an ACK cluster with FPGA-accelerated nodes. |
ASIC | ACK allows you to create clusters that contain NETINT ASIC devices and supports resource requests for individual NETINT ASIC cards. For more information, see Create an ASIC-accelerated cluster. |
RDMA | ACK allows you to create ACK clusters that contain RDMA devices. For more information, see eRDMA. You can use Arena to submit distributed deep learning jobs that use RDMA devices. This allows you to run training jobs that require high bandwidth, such as distributed deep learning jobs.
|
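As a concrete illustration of a resource request for an individual GPU, the following pod spec is a minimal sketch. It assumes the standard NVIDIA device plugin resource name `nvidia.com/gpu`; the pod name and container image are placeholders.

```yaml
# Sketch: request one dedicated GPU for a pod.
# The resource name nvidia.com/gpu assumes the standard NVIDIA device plugin.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example          # placeholder name
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:11.8.0-base-ubuntu22.04   # placeholder image
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1    # one whole GPU is allocated exclusively to this pod
```

Because the GPU is requested through `resources.limits`, the scheduler places the pod only on a GPU-accelerated node with a free device.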
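For GPU sharing, a pod requests a slice of GPU memory instead of a whole device. The following sketch assumes the shared-GPU resource name `aliyun.com/gpu-mem` with GiB units; confirm the exact resource name and units in GPU sharing overview before use.

```yaml
# Sketch: request 4 GiB of memory on a shared GPU (model inference scenario).
# The resource name aliyun.com/gpu-mem and its GiB unit are assumptions;
# see GPU sharing overview for the authoritative names.
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-inference   # placeholder name
spec:
  containers:
  - name: inference
    image: registry.example.com/model-server:v1   # placeholder image
    resources:
      limits:
        aliyun.com/gpu-mem: 4  # 4 GiB of GPU memory; the GPU can be shared with other pods
```

With memory isolation enabled through cGPU, multiple such pods can share one physical GPU without interfering with each other's memory.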
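Submitting a distributed deep learning job with Arena can be sketched as the following command. This is a command fragment that requires a cluster with Arena installed; the job name, image, script, and flag values are placeholders, and the exact flags may differ across Arena versions.

```shell
# Sketch: submit a distributed TensorFlow training job with Arena.
# Job name, image, worker count, and script are placeholders.
arena submit tfjob \
  --name=distributed-train \
  --gpus=1 \
  --workers=2 \
  --image=registry.example.com/tf-train:v1 \
  "python train.py"
```

On RDMA-capable nodes, the high-bandwidth network reduces the communication overhead between workers during distributed training.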