How to install the GPU sharing component - Container Service for Kubernetes

Container Service for Kubernetes (ACK) provides the GPU sharing feature that allows multiple models to share one GPU and supports GPU memory isolation based on the NVIDIA kernel mode driver. This topic describes how to install the GPU sharing component and GPU inspection tool on a GPU-accelerated node to support GPU sharing and GPU memory isolation.

Prerequisites

The cloud-native AI suite is activated before you use GPU sharing. For more information about the cloud-native AI suite and how it is billed, see Overview of the cloud-native AI suite and Billing of the cloud-native AI suite.
An ACK Pro cluster is created. When creating the ACK Pro cluster, select the GPU-accelerated architecture for the Instance Type in the Node Pool Configurations step. For more information, see Create an ACK managed cluster.
The kubeconfig file of the cluster is obtained and used to connect to the cluster by using kubectl.

Limits

Do not set the CPU policy to static for nodes for which GPU sharing is enabled.
cGPU does not support the CUDA API cudaMallocManaged(), so you cannot request GPU memory through Unified Virtual Memory (UVM). Use another method, such as cudaMalloc(), to request GPU memory. For more information, visit the NVIDIA official website.
The pods managed by the GPU sharing DaemonSets do not run at the highest priority by default. As a result, resources may be allocated to higher-priority pods, and the node may evict the pods managed by the DaemonSet. To prevent this, modify the relevant DaemonSet. For example, edit the gpushare-device-plugin-ds DaemonSet, which is used to share GPU memory, and specify priorityClassName: system-node-critical so that its pods are given priority.
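The priority change described in the last item could be applied with a merge patch, as in the minimal sketch below. The kube-system namespace is an assumption, not something this topic states, so check where the DaemonSet runs in your cluster first. The helper name set_daemonset_priority is hypothetical.

```shell
# Assumption: the DaemonSet runs in kube-system; change this if yours differs.
NAMESPACE="kube-system"
DAEMONSET="gpushare-device-plugin-ds"

# Merge patch that sets priorityClassName on the pod template.
PATCH='{"spec":{"template":{"spec":{"priorityClassName":"system-node-critical"}}}}'

# Hypothetical helper: applies the patch with kubectl.
set_daemonset_priority() {
  kubectl -n "$NAMESPACE" patch daemonset "$DAEMONSET" --type merge -p "$PATCH"
}
```

Run set_daemonset_priority after kubectl is connected to the cluster. Patching the pod template triggers a rolling restart of the DaemonSet pods.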

You can install the GPU sharing component in any region. However, GPU memory isolation is supported only in the regions listed in the following table. If you need GPU memory isolation, make sure that your ACK cluster is deployed in one of these regions.

Regions

Region	Region ID
China (Beijing)	cn-beijing
China (Shanghai)	cn-shanghai
China (Hangzhou)	cn-hangzhou
China (Zhangjiakou)	cn-zhangjiakou
China (Ulanqab)	cn-wulanchabu
China (Shenzhen)	cn-shenzhen
China (Chengdu)	cn-chengdu
China (Heyuan)	cn-heyuan
China (Hong Kong)	cn-hongkong
Japan (Tokyo)	ap-northeast-1
Indonesia (Jakarta)	ap-southeast-5
Singapore	ap-southeast-1
US (Virginia)	us-east-1
US (Silicon Valley)	us-west-1
Germany (Frankfurt)	eu-central-1
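If you script your cluster setup, you can check the region before deployment. The sketch below hard-codes the region IDs from the table above. The function name is_isolation_region is hypothetical, and the list must be updated by hand if the table changes.

```shell
# Returns 0 if the given region ID supports GPU memory isolation
# according to the table above, and 1 otherwise.
is_isolation_region() {
  case "$1" in
    cn-beijing|cn-shanghai|cn-hangzhou|cn-zhangjiakou|cn-wulanchabu|\
    cn-shenzhen|cn-chengdu|cn-heyuan|cn-hongkong|\
    ap-northeast-1|ap-southeast-5|ap-southeast-1|\
    us-east-1|us-west-1|eu-central-1)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Example usage:
#   is_isolation_region cn-hangzhou && echo "GPU memory isolation is supported"
```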

Version requirements

Item	Version requirement
Kubernetes version	1.18.8 or later
NVIDIA driver version	418.87.01 or later
Container runtime version	Docker: 19.03.5 or later; containerd: 1.4.3 or later
Operating system	Alibaba Cloud Linux 3.x, Alibaba Cloud Linux 2.x, CentOS 7.6, CentOS 7.7, and CentOS 7.9
GPU model	NVIDIA P, NVIDIA T, NVIDIA V, NVIDIA A, and NVIDIA H series
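You can compare the minimum versions in the table with the versions installed on a node. The sketch below assumes a sort that supports the -V (version sort) option, as GNU coreutils does. The helper name version_ge is hypothetical.

```shell
# Returns 0 when version $1 is greater than or equal to version $2.
# Relies on "sort -V" (version sort), available in GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example usage on a GPU-accelerated node (requires nvidia-smi):
#   driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
#   version_ge "$driver" "418.87.01" || echo "NVIDIA driver is too old: $driver"
```

The same helper can be used for the Kubernetes, Docker, and containerd minimums once you have extracted the version string from the corresponding tool.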

Step 1: Install the GPU sharing component

The cloud-native AI suite is not deployed

Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Applications > Cloud-native AI Suite.
On the Cloud-native AI Suite page, click Deploy.
On the Cloud-native AI Suite page, select Scheduling Component (Batch Task Scheduling, GPU Sharing, Topology-aware GPU scheduling, and NPU scheduling).
Optional. Click Advanced to the right of Scheduling Component (Batch Task Scheduling, GPU Sharing, Topology-aware GPU scheduling, and NPU scheduling). In the Parameters panel, modify the policy parameter of cGPU. Click OK.
If you have no specific requirements for the computing power sharing feature provided by cGPU, we recommend that you keep the default setting, policy: 5. For more information about the policies supported by cGPU, see Install and use cGPU on a Docker container.
In the lower part of the Cloud-native AI Suite page, click Deploy Cloud-native AI Suite.
After the cloud-native AI suite is installed, ack-ai-installer appears in the Deployed state on the Cloud-native AI Suite page.

The cloud-native AI suite is deployed

Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Applications > Cloud-native AI Suite.
Find ack-ai-installer and click Deploy in the Actions column.
Optional. In the Parameters panel, modify the policy parameter of cGPU.
If you have no specific requirements for the computing power sharing feature provided by cGPU, we recommend that you keep the default setting, policy: 5. For more information about the policies supported by cGPU, see Install and use cGPU on a Docker container.
After you complete the configuration, click OK.
After ack-ai-installer is installed, the state of the component changes to Deployed.

Step 2: Enable GPU sharing and GPU memory isolation

On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Nodes > Node Pools.
In the upper-right corner of the Node Pools page, click Create Node Pool.

In the Create Node Pool dialog box, configure the parameters to create a node pool and click Confirm Order.

The following table describes the key parameters. For more information about other parameters, see Create and manage a node pool.

Parameter	Description
Expected Nodes	The initial number of nodes in the node pool. If you do not want to create nodes in the node pool, set this parameter to 0.
Node Label	The labels that you want to add to the node pool based on your business requirements. For more information about node labels, see Labels for enabling GPU scheduling policies and methods for changing label values. In this example, the value of the label is set to cgpu, which indicates that GPU sharing is enabled for the node. The pods on the node need to request only GPU memory. Multiple pods can share the same GPU to implement GPU memory isolation and computing power sharing. Click the Node Label icon next to the Node Label parameter, set the Key field to ack.node.gpu.schedule, and then set the Value field to cgpu.

For more information about some common issues when you use the memory isolation capability provided by cGPU, see Usage notes for the memory isolation capability of cGPU.

Important

After you add the label for enabling GPU sharing to a node, do not change the label value by running the kubectl label nodes command or by using the label management feature on the Nodes page in the ACK console. Changing the label in either way can cause issues. For more information, see Issues that may occur if you use the kubectl label nodes command or use the label management feature to change label values in the ACK console. We recommend that you configure GPU sharing based on node pools. For more information, see Configure GPU scheduling policies for node pools.
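To confirm which nodes carry the GPU sharing label without modifying it, a read-only query is enough. The sketch below only lists nodes and uses the key and value from this example. The helper name list_gpu_sharing_nodes is hypothetical.

```shell
# Label key and value used in this example to enable GPU sharing.
LABEL_SELECTOR="ack.node.gpu.schedule=cgpu"

# Hypothetical helper: lists the nodes that carry the label (read-only).
list_gpu_sharing_nodes() {
  kubectl get nodes -l "$LABEL_SELECTOR" -o name
}
```

Run list_gpu_sharing_nodes after the node pool is created. An empty result means that no node in the cluster currently has the label.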

Step 3: Add GPU-accelerated nodes

Note

If you already added GPU-accelerated nodes to the node pool when you created it, skip this step.

After the node pool is created, you can add GPU-accelerated nodes to the node pool. To add GPU-accelerated nodes, you need to select ECS instances that use the GPU-accelerated architecture. For more information, see Add existing ECS instances to an ACK cluster or Create and manage a node pool.

Step 4: Install and use the GPU inspection tool

Download kubectl-inspect-cgpu. The executable file must be downloaded to a directory included in the PATH environment variable. This section uses /usr/local/bin/ as an example.
- If you use Linux, run the following command to download kubectl-inspect-cgpu:
```
wget http://aliacs-k8s-cn-beijing.oss-cn-beijing.aliyuncs.com/gpushare/kubectl-inspect-cgpu-linux -O /usr/local/bin/kubectl-inspect-cgpu
```
- If you use macOS, run the following command to download kubectl-inspect-cgpu:
```
wget http://aliacs-k8s-cn-beijing.oss-cn-beijing.aliyuncs.com/gpushare/kubectl-inspect-cgpu-darwin -O /usr/local/bin/kubectl-inspect-cgpu
```
Run the following command to grant execute permissions on kubectl-inspect-cgpu:
```
chmod +x /usr/local/bin/kubectl-inspect-cgpu
```
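The download and permission steps above can be combined into one script that picks the binary based on the operating system. This is a minimal sketch that reuses the URLs and the /usr/local/bin/ path from this section. The function names are hypothetical, and writing to /usr/local/bin/ may require elevated permissions on your machine.

```shell
BASE_URL="http://aliacs-k8s-cn-beijing.oss-cn-beijing.aliyuncs.com/gpushare"
TARGET="/usr/local/bin/kubectl-inspect-cgpu"

# Prints the download URL for the OS name reported by "uname -s".
cgpu_download_url() {
  case "$1" in
    Linux)  echo "$BASE_URL/kubectl-inspect-cgpu-linux" ;;
    Darwin) echo "$BASE_URL/kubectl-inspect-cgpu-darwin" ;;
    *)      echo "unsupported OS: $1" >&2; return 1 ;;
  esac
}

# Downloads the matching binary and makes it executable.
install_inspect_cgpu() {
  url="$(cgpu_download_url "$(uname -s)")" || return 1
  wget "$url" -O "$TARGET" && chmod +x "$TARGET"
}
```

Call install_inspect_cgpu to perform the installation. The script assumes that wget is already installed.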

Run the following command to query the GPU usage of the cluster:

```
kubectl inspect cgpu
```

Expected output:

```
NAME                       IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
cn-shanghai.192.168.6.104  192.168.6.104  0/15                   0/15
----------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
0/15 (0%)
```
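For monitoring scripts, you can extract the cluster-wide totals from this output. The sketch below assumes that the summary keeps the layout shown in the expected output: a header line followed by a line such as 0/15 (0%). The function name summarize_cgpu is hypothetical.

```shell
# Reads "kubectl inspect cgpu" output on stdin and prints the
# cluster-wide allocated and total GPU memory (GiB).
summarize_cgpu() {
  awk '/^Allocated\/Total GPU Memory In Cluster:/ {
         getline                      # move to the "0/15 (0%)" line
         split($1, m, "/")            # m[1] = allocated, m[2] = total
         printf "allocated=%s total=%s\n", m[1], m[2]
       }'
}

# Example usage:
#   kubectl inspect cgpu | summarize_cgpu
```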

References

For more information about the release notes of the GPU sharing component, see ack-ai-installer.
To update the GPU sharing component, see Update the GPU sharing component.