Before you enable topology-aware GPU scheduling, you must install and configure the topology-aware GPU scheduling component. This topic describes how to install the component and enable topology-aware GPU scheduling for your cluster.
Prerequisites
A Container Service for Kubernetes (ACK) Pro cluster is created, and the instance type of the cluster nodes is Elastic GPU Service. For more information, see Create an ACK managed cluster.
The kubeconfig file of the cluster is obtained and a kubectl client is connected to the cluster.
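Before continuing, it helps to confirm that kubectl can actually reach the cluster. The sketch below assumes the kubeconfig downloaded from the ACK console has been saved to the default path; substitute the real location of your file:

```shell
# Point kubectl at the kubeconfig file obtained from the ACK console.
# The default path used here is an assumption; substitute the actual
# location of the downloaded file.
export KUBECONFIG="$HOME/.kube/config"

# With the variable set, these commands (run manually against the
# cluster) confirm that the client can reach the API server and list
# the cluster's nodes:
#   kubectl cluster-info
#   kubectl get nodes -o wide
echo "using kubeconfig: $KUBECONFIG"
```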
The versions of the system components meet the following requirements:
Kubernetes: 1.18.8 and later
NVIDIA driver: 418.87.01 and later
NVIDIA Collective Communications Library (NCCL): 2.7 and later
Operating system: CentOS 7.6, CentOS 7.7, Ubuntu 16.04, Ubuntu 18.04, Alibaba Cloud Linux 2, or Alibaba Cloud Linux 3
GPU: V100
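One way to check that a GPU node meets the driver requirement is to compare the version reported by nvidia-smi against the minimum. The `ver_ge` helper below is an illustrative sketch, not part of any ACK tooling, and the sample driver value is made up for the example:

```shell
# ver_ge A B: succeeds when version A >= version B, using GNU
# coreutils' version sort (sort -V) to order the two strings.
ver_ge() {
  [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# On a GPU node, the installed driver version can be read with:
#   driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader)
# A sample value is used here for illustration.
driver="450.102.04"

if ver_ge "$driver" "418.87.01"; then
  echo "driver OK: $driver"
else
  echo "driver too old: $driver (need 418.87.01 or later)"
fi
```

The same helper can be reused for the NCCL and Kubernetes minimums listed above.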
Procedure
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the name of the cluster that you want to manage, and then choose Applications > Cloud-native AI Suite in the left-side navigation pane.
On the Cloud-native AI Suite page, click Deploy.
In the Scheduling section of the page that appears, select Scheduling Component (Batch Task Scheduling, GPU Sharing, Topology-aware GPU scheduling, and NPU scheduling) and click Deploy Cloud-native AI Suite at the bottom. For more information about the parameters, see Install the cloud-native AI suite.
After the cloud-native AI suite is installed, you can find the topology-aware GPU scheduling component named ack-ai-installer in the Components list on the Cloud-native AI Suite page.
Note: If you have already installed other components of the cloud-native AI suite, find ack-ai-installer in the component list and click Deploy in the Actions column to install it.