Container Service for Kubernetes (ACK) allows you to centrally schedule, manage, and maintain heterogeneous computing resources, which significantly improves the utilization of these resources in ACK clusters. This topic describes the features that ACK provides to manage heterogeneous resources in clusters for heterogeneous computing.
Background information
With the emergence of 5G, AI, high performance computing (HPC), and edge computing services, the demand for computing power keeps increasing. General-purpose computing based on CPUs cannot keep up with this demand, whereas heterogeneous computing based on domain-specific architectures (DSAs) can. As a result, various heterogeneous computing resources, such as GPUs and field-programmable gate arrays (FPGAs), are widely used in the preceding services.
However, enterprises find it difficult to manage large numbers of heterogeneous resources. Alibaba Cloud provides an all-in-one solution that allows you to schedule and manage heterogeneous resources in a unified manner.
Introduction to ACK clusters for heterogeneous computing
ACK allows you to centrally schedule, manage, and maintain heterogeneous resources in ACK clusters, such as GPUs, FPGAs, application-specific integrated circuits (ASICs), and remote direct memory access (RDMA) devices. This improves resource utilization in ACK clusters for heterogeneous computing. The following table describes the features that ACK provides to manage heterogeneous resources in clusters for heterogeneous computing.
Heterogeneous resource | Description |
GPU | ACK provides the following GPU management features:
- Cluster creation: ACK allows you to create clusters that contain NVIDIA T4, P100, V100, and A100 GPUs. For more information, see Create an ACK cluster with GPU-accelerated nodes and Create an ACK dedicated cluster with GPU-accelerated nodes.
- GPU requests: ACK supports resource requests for individual GPUs.
- Auto scaling: ACK supports automatic scaling of GPU-accelerated nodes. For more information, see Enable auto scaling based on GPU metrics.
- GPU sharing and isolation: ACK supports GPU sharing, GPU scheduling, and computing power isolation. The GPU sharing and scheduling capability provided by Alibaba Cloud allows you to schedule one GPU to multiple model inference applications, which significantly reduces costs. The cGPU solution provided by Alibaba Cloud isolates the GPU memory and computing power allocated to different applications without requiring you to modify application configurations, which improves application stability. The following GPU allocation policies are supported. For more information, see GPU sharing overview and Allocate computing power by scheduling shared GPU.
  - GPU sharing and memory isolation on a one-pod-one-GPU basis: This policy is commonly used in model inference scenarios.
  - GPU sharing and memory isolation on a one-pod-multi-GPU basis: This policy is commonly used to build the code to train distributed models.
  - GPU allocation by using the binpack or spread algorithm: The binpack algorithm preferentially shares one GPU among multiple pods and is suitable for scenarios that require high GPU utilization. The spread algorithm attempts to allocate a separate GPU to each pod and is suitable for scenarios that require high GPU availability.
- Topology-aware GPU scheduling: This feature retrieves the topology of heterogeneous resources on nodes and enables the scheduler to make scheduling decisions based on node topology information, such as NVLink connections, Peripheral Component Interconnect Express (PCIe) switches, QuickPath Interconnect (QPI), and RDMA network interface controllers (NICs). This optimizes scheduling and improves performance. For more information, see Overview of topology-aware GPU scheduling.
- GPU resource monitoring: This feature collects the metrics of nodes and applications, detects and sends alerts on software and hardware exceptions of devices, and can monitor both dedicated GPUs and shared GPUs. For more information, see Monitor GPU errors and Use Prometheus Service to monitor the GPU resources of a Kubernetes cluster.
|
FPGA | ACK allows you to create clusters that contain FPGA devices. For more information, see Create an ACK cluster with FPGA-accelerated nodes. |
ASIC | ACK allows you to create clusters that contain NETINT ASIC devices and supports resource requests for individual NETINT ASIC cards. For more information, see Create an ASIC-accelerated cluster. |
RDMA | ACK allows you to create ACK clusters that contain RDMA devices. For more information, see eRDMA. You can use Arena to submit distributed deep learning jobs that use RDMA devices. This allows you to run training jobs that require high bandwidth, such as distributed deep learning jobs.
|
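As a concrete illustration of a resource request for an individual GPU, the following pod spec is a minimal sketch. It assumes the standard NVIDIA device plugin resource name `nvidia.com/gpu`; the pod name and container image are placeholders.

```yaml
# Sketch: request one dedicated GPU for a pod.
# The resource name nvidia.com/gpu assumes the standard NVIDIA device plugin.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example          # placeholder name
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:11.8.0-base-ubuntu22.04   # placeholder image
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1    # one whole GPU is allocated exclusively to this pod
```

Because the GPU is requested through `resources.limits`, the scheduler places the pod only on a GPU-accelerated node with a free device.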
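For GPU sharing, a pod requests a slice of GPU memory instead of a whole device. The following sketch assumes the shared-GPU resource name `aliyun.com/gpu-mem` with GiB units; confirm the exact resource name and units in GPU sharing overview before use.

```yaml
# Sketch: request 4 GiB of memory on a shared GPU (model inference scenario).
# The resource name aliyun.com/gpu-mem and its GiB unit are assumptions;
# see GPU sharing overview for the authoritative names.
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-inference   # placeholder name
spec:
  containers:
  - name: inference
    image: registry.example.com/model-server:v1   # placeholder image
    resources:
      limits:
        aliyun.com/gpu-mem: 4  # 4 GiB of GPU memory; the GPU can be shared with other pods
```

With memory isolation enabled through cGPU, multiple such pods can share one physical GPU without interfering with each other's memory.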
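Submitting a distributed deep learning job with Arena can be sketched as the following command. This is a command fragment that requires a cluster with Arena installed; the job name, image, script, and flag values are placeholders, and the exact flags may differ across Arena versions.

```shell
# Sketch: submit a distributed TensorFlow training job with Arena.
# Job name, image, worker count, and script are placeholders.
arena submit tfjob \
  --name=distributed-train \
  --gpus=1 \
  --workers=2 \
  --image=registry.example.com/tf-train:v1 \
  "python train.py"
```

On RDMA-capable nodes, the high-bandwidth network reduces the communication overhead between workers during distributed training.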