
Container Service for Kubernetes: Labels used to specify GPU models

Last Updated: Apr 22, 2024

When you use Container Service for Kubernetes (ACK) clusters for GPU computing, you can use labels to schedule applications to specific GPU-accelerated nodes. This topic describes the labels that are used to specify GPU models, and explains how to schedule applications to nodes with specific GPU models and how to keep applications away from them.

Labels used to specify GPU models

After a GPU-accelerated node is added to an ACK cluster, the following labels are automatically added to the node:

Label                              Description
aliyun.accelerator/nvidia_name     The GPU model.
aliyun.accelerator/nvidia_mem      The memory size of each GPU.
aliyun.accelerator/nvidia_count    The number of GPUs provided by the node.
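
For example, to check these labels on a single node, you can run the following command (a minimal check; replace <node-name> with the name of one of your nodes):

kubectl describe node <node-name> | grep aliyun.accelerator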

You can run the nvidia-smi command-line tool on a GPU-accelerated node to query the values of the preceding labels:

Query type                                      Command
Query the GPU model                             nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 | sed -e 's/ /-/g'
Query the memory size of each GPU               nvidia-smi --id=0 --query-gpu=memory.total --format=csv,noheader | sed -e 's/ //g'
Query the number of GPUs provided by the node   nvidia-smi -L | wc -l
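
The following sketch combines the three queries into one script that prints the value corresponding to each label on the current node (it assumes nvidia-smi is on the PATH):

# Print the values that correspond to the three node labels.
echo "aliyun.accelerator/nvidia_name:  $(nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 | sed -e 's/ /-/g')"
echo "aliyun.accelerator/nvidia_mem:   $(nvidia-smi --id=0 --query-gpu=memory.total --format=csv,noheader | sed -e 's/ //g')"
echo "aliyun.accelerator/nvidia_count: $(nvidia-smi -L | wc -l)"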

Run the following command to query the GPU models provided by all nodes in a cluster:

kubectl get nodes -L aliyun.accelerator/nvidia_name
NAME                        STATUS   ROLES    AGE   VERSION            NVIDIA_NAME
cn-shanghai.192.XX.XX.176   Ready    <none>   17d   v1.26.3-aliyun.1   Tesla-V100-SXM2-32GB
cn-shanghai.192.XX.XX.177   Ready    <none>   17d   v1.26.3-aliyun.1   Tesla-V100-SXM2-32GB
cn-shanghai.192.XX.XX.130   Ready    <none>   18d   v1.26.3-aliyun.1   Tesla-V100-SXM2-32GB
cn-shanghai.192.XX.XX.131   Ready    <none>   17d   v1.26.3-aliyun.1   Tesla-V100-SXM2-32GB
cn-shanghai.192.XX.XX.132   Ready    <none>   17d   v1.26.3-aliyun.1   Tesla-V100-SXM2-32GB
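
Because -L (--label-columns) accepts a comma-separated list of label keys, you can display all three labels in one command, for example:

kubectl get nodes -L aliyun.accelerator/nvidia_name,aliyun.accelerator/nvidia_mem,aliyun.accelerator/nvidia_count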

Schedule applications to specific GPU models

You can schedule applications to specific GPU models by using the preceding labels. This section provides an example.

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster that you want to manage and choose Workloads > Jobs in the left-side navigation pane.

  3. On the Jobs page, click Create from YAML in the upper-right corner, then create the Job from the following sample YAML file:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: tensorflow-mnist
    spec:
      parallelism: 1
      template:
        metadata:
          labels:
            app: tensorflow-mnist
        spec:
          nodeSelector:
            aliyun.accelerator/nvidia_name: "Tesla-V100-SXM2-32GB" # Schedule the pod only to nodes that provide Tesla V100-SXM2-32GB GPUs.
          containers:
          - name: tensorflow-mnist
            image: registry.cn-beijing.aliyuncs.com/acs/tensorflow-mnist-sample:v1.5
            command:
            - python
            - tensorflow-sample-code/tfjob/docker/mnist/main.py
            - --max_steps=1000
            - --data_dir=tensorflow-sample-code/data
            resources:
              limits:
                nvidia.com/gpu: 1
            workingDir: /root
          restartPolicy: Never

    After you create the Job, choose Workloads > Pods in the left-side navigation pane of the cluster details page. On the Pods page, you can see that the pod has been scheduled to a node equipped with the specified GPU model.
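
    You can also verify this with kubectl. A minimal check, assuming the Job was created from the sample manifest above (the app=tensorflow-mnist label comes from that manifest):

    kubectl get pods -l app=tensorflow-mnist -o wide

    The NODE column shows where the pod landed; that node should carry the aliyun.accelerator/nvidia_name=Tesla-V100-SXM2-32GB label.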

Avoid scheduling applications to specific GPU models

You can use node affinity and anti-affinity to keep applications away from specific GPU models or, as in the following example, away from all GPU-accelerated nodes that carry the aliyun.accelerator/nvidia_name label.

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster that you want to manage and choose Workloads > Jobs in the left-side navigation pane.

  3. On the Jobs page, click Create from YAML in the upper-right corner, then create the Job from the following sample YAML file:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: tensorflow-mnist
    spec:
      parallelism: 1
      template:
        metadata:
          labels:
            app: tensorflow-mnist
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: aliyun.accelerator/nvidia_name  # The application is not scheduled to GPU-accelerated nodes that have the aliyun.accelerator/nvidia_name label. 
                    operator: DoesNotExist
          containers:
          - name: tensorflow-mnist
            image: registry.cn-beijing.aliyuncs.com/acs/tensorflow-mnist-sample:v1.5
            command:
            - python
            - tensorflow-sample-code/tfjob/docker/mnist/main.py
            - --max_steps=1000
            - --data_dir=tensorflow-sample-code/data
            resources:
              limits:
                nvidia.com/gpu: 1
            workingDir: /root
          restartPolicy: Never

    After you create the Job, choose Workloads > Pods in the left-side navigation pane of the cluster details page. On the Pods page, you can see that the pod has been scheduled to a node that does not have the aliyun.accelerator/nvidia_name label.
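
The DoesNotExist operator in this sample keeps the pod off every node that has the aliyun.accelerator/nvidia_name label. If you want to avoid only one particular GPU model while still allowing other GPU models, the standard Kubernetes NotIn operator can be used instead. The following is a minimal sketch, not part of the original sample; note that nodes without the label also match NotIn:

    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: aliyun.accelerator/nvidia_name
              operator: NotIn          # Exclude only the listed values; nodes without the label also match.
              values:
              - Tesla-V100-SXM2-32GB   # The GPU model to avoid.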

References

After you install the scheduling component ack-ai-installer provided by the cloud-native AI suite, you can add a label to a GPU-accelerated node to enable a scheduling policy, such as GPU sharing or topology-aware GPU scheduling. For more information, see Labels for enabling GPU scheduling policies and methods for changing label values.