Labels used to specify GPU models

When you use Container Service for Kubernetes (ACK) clusters for GPU computing, you can use labels to schedule applications to specific GPU-accelerated nodes. This topic describes the labels that are used to specify GPU models. This topic also describes how to schedule applications to specific GPU models and how to avoid scheduling applications to specific GPU models.

After a GPU-accelerated node is added to an ACK cluster, the following labels are automatically added to the node:

Label	Description
aliyun.accelerator/nvidia_name	The GPU model.
aliyun.accelerator/nvidia_mem	The memory size of each GPU.
aliyun.accelerator/nvidia_count	The number of GPUs provided by the node.

You can run the nvidia-smi command-line tool on a GPU-accelerated node to query the values of the preceding labels.

Query type	Command
Query the GPU model	`nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 \| sed -e 's/ /-/g'`
Query the memory size of each GPU	`nvidia-smi --id=0 --query-gpu=memory.total --format=csv,noheader \| sed -e 's/ //g'`
Query the number of GPUs provided by the node	`nvidia-smi -L \| wc -l`

Run the following command to query the GPU models provided by all nodes in a cluster:

kubectl get nodes -L  aliyun.accelerator/nvidia_name
NAME                        STATUS   ROLES    AGE   VERSION            NVIDIA_NAME
cn-shanghai.192.XX.XX.176   Ready    <none>   17d   v1.26.3-aliyun.1   Tesla-V100-SXM2-32GB
cn-shanghai.192.XX.XX.177   Ready    <none>   17d   v1.26.3-aliyun.1   Tesla-V100-SXM2-32GB
cn-shanghai.192.XX.XX.130   Ready    <none>   18d   v1.26.3-aliyun.1   Tesla-V100-SXM2-32GB
cn-shanghai.192.XX.XX.131   Ready    <none>   17d   v1.26.3-aliyun.1   Tesla-V100-SXM2-32GB
cn-shanghai.192.XX.XX.132   Ready    <none>   17d   v1.26.3-aliyun.1   Tesla-V100-SXM2-32GB

Schedule applications to specific GPU models

You can schedule applications to specific GPU models by using the preceding labels. This section provides an example.

Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the name of the cluster that you want to manage and choose Workloads > Jobs in the left-side navigation pane.

On the Jobs page, click Create from YAML in the upper-right corner. The following page appears.

Show sample code

apiVersion: batch/v1
kind: Job
metadata:
  name: tensorflow-mnist
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: tensorflow-mnist
    spec:
      nodeSelector:
        aliyun.accelerator/nvidia_name: "Tesla-V100-SXM2-32GB" # Schedule the application to Tesla V100-SXM2-32GB. 
      containers:
      - name: tensorflow-mnist
        image: registry.cn-beijing.aliyuncs.com/acs/tensorflow-mnist-sample:v1.5
        command:
        - python
        - tensorflow-sample-code/tfjob/docker/mnist/main.py
        - --max_steps=1000
        - --data_dir=tensorflow-sample-code/data
        resources:
          limits:
            nvidia.com/gpu: 1
        workingDir: /root
      restartPolicy: Never

After you create the application, choose Workloads > Pods in the left-side navigation pane of the cluster details page. On the Pods page, you can find that a pod is scheduled to a node equipped with the specified GPU model.

Avoid scheduling applications to specific GPU models

You can avoid scheduling applications to specific GPU models by configuring node affinity and anti-affinity. This section provides an example.

Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the name of the cluster that you want to manage and choose Workloads > Jobs in the left-side navigation pane.

On the Jobs page, click Create from YAML in the upper-right corner. The following page appears.

Show sample code

apiVersion: batch/v1
kind: Job
metadata:
  name: tensorflow-mnist
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: tensorflow-mnist
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: aliyun.accelerator/nvidia_name  # The application is not scheduled to GPU-accelerated nodes that have the aliyun.accelerator/nvidia_name label. 
                operator: DoesNotExist
      containers:
      - name: tensorflow-mnist
        image: registry.cn-beijing.aliyuncs.com/acs/tensorflow-mnist-sample:v1.5
        command:
        - python
        - tensorflow-sample-code/tfjob/docker/mnist/main.py
        - --max_steps=1000
        - --data_dir=tensorflow-sample-code/data
        resources:
          limits:
            nvidia.com/gpu: 1
        workingDir: /root
      restartPolicy: Never

References

After you install the scheduling component ack-ai-installer provided by the cloud-native AI suite, you can add a label to a GPU-accelerated node to enable a scheduling policy, such as GPU sharing or topology-aware GPU scheduling. For more information, see Labels for enabling GPU scheduling policies and methods for changing label values.