Before enabling topology-aware GPU scheduling, you must install and configure the topology-aware GPU scheduling component. This topic describes how to install the topology-aware GPU scheduling component and enable topology-aware GPU scheduling for your cluster.
Prerequisites
An ACK managed cluster whose nodes use Elastic GPU Service instance types is created in the Container Service for Kubernetes (ACK) console.
The kubeconfig file of the cluster is obtained and a kubectl client is connected to the cluster.
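As a quick sanity check before you proceed, you can confirm that kubectl is installed and can reach the cluster. The snippet below is a minimal sketch; the cluster-facing commands are shown as comments because their output depends on your cluster.

```shell
# Check whether kubectl is on the PATH (prints one of two messages).
if command -v kubectl >/dev/null 2>&1; then
  echo "kubectl found"
else
  echo "kubectl not found"
fi

# Against your cluster, the following should succeed and list the worker nodes:
#   kubectl cluster-info
#   kubectl get nodes -o wide
```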
The versions of the system components must meet the following requirements.

Component | Required version
--------- | ----------------
Kubernetes | 1.18.8 or later
NVIDIA driver | 418.87.01 or later
NVIDIA Collective Communications Library (NCCL) | 2.7 or later
Operating system | CentOS 7.6, CentOS 7.7, Ubuntu 16.04, Ubuntu 18.04, Alibaba Cloud Linux 2, or Alibaba Cloud Linux 3
GPU | V100
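You can check the versions installed on your nodes against the requirements above. The commands and the `sort -V` comparison below are a sketch; the sample version string is a placeholder, not a value from your nodes.

```shell
# On a GPU node, these report the relevant versions (they require a live
# node and driver, so they are shown as comments):
#   kubectl version --short                                    # Kubernetes version
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader  # driver version

# A portable way to compare a reported version against the required minimum:
min_driver="418.87.01"
driver_version="460.32.03"   # placeholder; substitute the nvidia-smi output
if [ "$(printf '%s\n' "$min_driver" "$driver_version" | sort -V | head -n1)" = "$min_driver" ]; then
  echo "driver version OK"
else
  echo "driver version too old"
fi
```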
Procedure
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, navigate to the Cloud-native AI Suite page.
On the Cloud-native AI Suite page, click Deploy.
In the Scheduling section of the page that appears, select Scheduling Policy Extension (Batch Task Scheduling, GPU Sharing, Topology-aware GPU Scheduling) and click Deploy Cloud-native AI Suite at the bottom. For more information about the parameters, see Install the cloud-native AI suite.
After the deployment is complete, you can find the installed topology-aware GPU scheduling component ack-ai-installer in the Components list.

Note: If you have already installed other components of the cloud-native AI suite, find ack-ai-installer in the Components list and click Deploy in the Actions column to install the component.
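Besides the console's Components list, you can verify from the command line that the component's pods are running. The namespace below is an assumption based on common ACK conventions, not confirmed by this topic; adjust it if your cluster differs.

```shell
# Requires a live cluster, so shown as a comment (namespace is an assumption):
#   kubectl get pods -n kube-system | grep ack-ai

# Filtering the pod list with grep works like this (sample data for illustration):
printf 'kube-system ack-ai-installer-abc Running\nkube-system coredns-xyz Running\n' | grep ack-ai
```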