When you use Container Service for Kubernetes (ACK) clusters for GPU computing, you can use labels to schedule applications to specific GPU-accelerated nodes. This topic describes the labels that specify GPU models, how to schedule applications to specific GPU models, and how to keep applications off specific GPU models.
Labels used to specify GPU models
After a GPU-accelerated node is added to an ACK cluster, the following labels are automatically added to the node:
Label | Description |
aliyun.accelerator/nvidia_name | The GPU model. |
aliyun.accelerator/nvidia_mem | The memory size of each GPU. |
aliyun.accelerator/nvidia_count | The number of GPUs provided by the node. |
You can run the nvidia-smi command-line tool on a GPU-accelerated node to query the values of the preceding labels: the GPU model, the memory size of each GPU, and the number of GPUs provided by the node.
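The following is a minimal sketch of such queries, not necessarily the exact commands ACK uses to compute the labels. It assumes the label values are derived from nvidia-smi output with spaces replaced or removed, which is inferred from sample values such as Tesla-V100-SXM2-32GB. The helper names model_to_label and mem_to_label are hypothetical.

```shell
# Hypothetical helpers: normalize nvidia-smi output into a label-style value.
# Replace spaces with hyphens, e.g. "Tesla V100-SXM2-32GB" -> "Tesla-V100-SXM2-32GB".
model_to_label() { sed -e 's/ /-/g'; }
# Remove spaces, e.g. "32510 MiB" -> "32510MiB" (assumed label format).
mem_to_label() { sed -e 's/ //g'; }

# Run the queries only where nvidia-smi is installed (a GPU-accelerated node).
if command -v nvidia-smi >/dev/null 2>&1; then
  # GPU model of the first GPU (index 0).
  nvidia-smi --id=0 --query-gpu=name --format=csv,noheader | model_to_label
  # Total memory of the first GPU.
  nvidia-smi --id=0 --query-gpu=memory.total --format=csv,noheader | mem_to_label
  # Number of GPUs: "nvidia-smi -L" prints one line per GPU.
  nvidia-smi -L | wc -l
fi
```

Querying only GPU 0 assumes that all GPUs on the node are the same model, which is the usual case for a single instance type.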
Run the following command to query the GPU models provided by all nodes in a cluster:
kubectl get nodes -L aliyun.accelerator/nvidia_name
Expected output:
NAME                        STATUS   ROLES    AGE   VERSION            NVIDIA_NAME
cn-shanghai.192.XX.XX.176   Ready    <none>   17d   v1.26.3-aliyun.1   Tesla-V100-SXM2-32GB
cn-shanghai.192.XX.XX.177   Ready    <none>   17d   v1.26.3-aliyun.1   Tesla-V100-SXM2-32GB
cn-shanghai.192.XX.XX.130   Ready    <none>   18d   v1.26.3-aliyun.1   Tesla-V100-SXM2-32GB
cn-shanghai.192.XX.XX.131   Ready    <none>   17d   v1.26.3-aliyun.1   Tesla-V100-SXM2-32GB
cn-shanghai.192.XX.XX.132   Ready    <none>   17d   v1.26.3-aliyun.1   Tesla-V100-SXM2-32GB
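To narrow the listing to nodes with one particular model, a label selector can be combined with extra label columns. The sketch below wraps this in a helper function; the function name nodes_with_gpu_model is hypothetical, while the -l and -L flags are standard kubectl options.

```shell
# Hypothetical helper: list only the nodes whose GPU model label equals "$1",
# and show the GPU memory and GPU count labels as additional columns.
nodes_with_gpu_model() {
  kubectl get nodes \
    -l "aliyun.accelerator/nvidia_name=$1" \
    -L aliyun.accelerator/nvidia_mem,aliyun.accelerator/nvidia_count
}
```

For example, calling nodes_with_gpu_model Tesla-V100-SXM2-32GB against the cluster above should list the five nodes shown in the sample output.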
Schedule applications to specific GPU models
You can schedule applications to specific GPU models by using the preceding labels. This section provides an example.
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the name of the cluster that you want to manage, and then open the Jobs page from the left-side navigation pane.
On the Jobs page, click Create from YAML in the upper-right corner, and then create a Job from a manifest that selects nodes by the aliyun.accelerator/nvidia_name label.
After you create the application, open the Pods page from the left-side navigation pane of the cluster details page. On the Pods page, you can find that a pod is scheduled to a node equipped with the specified GPU model.
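As an illustration of the manifest used in the Create from YAML step, the sketch below writes a Job that pins its pod to one GPU model with a nodeSelector. It is an assumed example rather than the original sample: the Job name, the file name, and the container image are placeholders, and the label value is taken from the sample kubectl output earlier in this topic.

```shell
# Write a sample Job manifest to a file. Names and image are placeholders.
cat > gpu-model-job.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-model-demo              # hypothetical Job name
spec:
  parallelism: 1
  template:
    spec:
      nodeSelector:
        # Only nodes that carry this exact label value are eligible.
        aliyun.accelerator/nvidia_name: "Tesla-V100-SXM2-32GB"
      containers:
      - name: cuda
        image: nvidia/cuda:12.2.0-base-ubuntu22.04   # placeholder image, replace with your own
        command: ["nvidia-smi", "-L"]
        resources:
          limits:
            nvidia.com/gpu: 1       # request one GPU
      restartPolicy: Never
EOF
```

The YAML between the EOF markers can be pasted into the console editor, or the file can be applied with kubectl apply -f gpu-model-job.yaml. Running kubectl get pods -o wide afterwards shows which node the pod landed on.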
Avoid scheduling applications to specific GPU models
You can avoid scheduling applications to specific GPU models by configuring node affinity and anti-affinity. This section provides an example.
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the name of the cluster that you want to manage, and then open the Jobs page from the left-side navigation pane.
On the Jobs page, click Create from YAML in the upper-right corner, and then create a Job from a manifest whose node affinity rules exclude the GPU models that you want to avoid.
After you create the application, open the Pods page from the left-side navigation pane of the cluster details page. On the Pods page, you can find that a pod is scheduled to a node that does not have the aliyun.accelerator/nvidia_name label.
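One way to express such an exclusion is a required node affinity rule with the NotIn operator, sketched below under assumptions: the Job name, file name, and container image are placeholders, and the excluded value is the model from the sample output earlier in this topic. In Kubernetes, NotIn also matches nodes that lack the label entirely, which is consistent with the pod landing on a node without the aliyun.accelerator/nvidia_name label.

```shell
# Write a sample Job manifest that avoids one GPU model. Names and image are placeholders.
cat > avoid-gpu-model-job.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: avoid-gpu-model-demo        # hypothetical Job name
spec:
  parallelism: 1
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              # Exclude nodes whose GPU model label has one of the listed values.
              - key: aliyun.accelerator/nvidia_name
                operator: NotIn
                values:
                - "Tesla-V100-SXM2-32GB"
      containers:
      - name: demo
        image: busybox:1.36         # placeholder image that does not need a GPU
        command: ["echo", "scheduled"]
      restartPolicy: Never
EOF
```

This sketch requests no GPU, so the pod may run on any node that passes the rule. Adding an nvidia.com/gpu limit to the container would further restrict it to GPU-accelerated nodes of other models.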
References
After you install the scheduling component ack-ai-installer provided by the cloud-native AI suite, you can add a label to a GPU-accelerated node to enable a scheduling policy, such as GPU sharing or topology-aware GPU scheduling. For more information, see Labels for enabling GPU scheduling policies and methods for changing label values.