By default, the version of the NVIDIA driver installed in a Container Service for Kubernetes (ACK) cluster varies based on the type and version of the cluster. If the Compute Unified Device Architecture (CUDA) toolkit that you use requires an NVIDIA driver update, you need to manually install the NVIDIA driver on cluster nodes. This topic describes how to specify an NVIDIA driver version for GPU-accelerated nodes in a node pool by adding a label to the node pool.
Precautions
ACK does not guarantee the compatibility of NVIDIA drivers with the CUDA toolkit. You need to verify their compatibility.
For custom OS images that have the NVIDIA driver and GPU components, such as the NVIDIA Container Runtime, preinstalled, ACK does not guarantee that the NVIDIA driver is compatible with other GPU components, such as the monitoring components.
If you add a label to a node pool to specify an NVIDIA driver version, the specified driver is installed only on nodes that are newly added to the node pool. The driver is not installed on the existing nodes in the node pool. To install the specified driver on existing nodes, remove the nodes from the node pool and then add them back (see the sketch after this list for a way to list the existing nodes in a node pool). For more information, see Remove a node and Add existing ECS instances to an ACK cluster.
The ecs.gn7.xxxxx and ecs.ebmgn7.xxxx instance types are incompatible with NVIDIA driver versions 510.xxx and 515.xxx. For these instance types, we recommend that you use either a driver version earlier than 510.xxx that has the GPU System Processor (GSP) disabled, such as 470.xxx.xxxx, or driver version 525.125.06 or later.
For more information about the NVIDIA driver versions required by different GPU models, such as P100, T4, V100, and A10, see the official NVIDIA documentation.
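If you need to remove and re-add existing nodes as described above, the following command is an optional sketch for listing the nodes that belong to a node pool. It assumes that the nodes carry the alibabacloud.com/nodepool-id node label that ACK adds to node pool nodes; replace <your-nodepool-id> with the ID of your node pool.
# List the nodes that belong to a specific node pool (label name assumed).
kubectl get nodes -l alibabacloud.com/nodepool-id=<your-nodepool-id> -o wide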
Step 1: Query the NVIDIA driver version
Select an NVIDIA driver version that is compatible with your applications from the list of NVIDIA driver versions supported by ACK. For more information, see Select an NVIDIA driver version for nodes.
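Before you select a version, you can optionally check the driver that is currently installed on an existing GPU-accelerated node. The following command is a sketch that you can run on the node itself or inside the NVIDIA device plugin pod, as shown in Step 3:
# Print only the installed NVIDIA driver version.
nvidia-smi --query-gpu=driver_version --format=csv,noheader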
Step 2: Create a node pool and specify an NVIDIA driver version
In this example, the version of the NVIDIA driver is 418.181.07.
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the name of the cluster that you want to manage. In the left-side navigation pane, choose Nodes > Node Pools.
Click Create Node Pool in the upper-right corner. In the Create Node Pool dialog box, configure node pool parameters.
Configure the key parameters as described in the following steps. For more information about the other parameters, see Create an ACK managed cluster.
Click Show Advanced Options.
In the Node Label section, click the + icon. Set Key to ack.aliyun.com/nvidia-driver-version and set Value to 418.181.07. For more information about the NVIDIA driver versions supported by ACK, see NVIDIA driver versions supported by ACK.
Important: The Elastic Compute Service (ECS) instance types ecs.ebmgn7 and ecs.ebmgn7e support only NVIDIA driver versions later than 460.32.03.
After you set the parameters, click Confirm Order.
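After new nodes are added to the node pool, you can optionally confirm that the label was applied to them. The following command is a sketch that assumes the node pool label is propagated to the nodes as a Kubernetes node label, which is the default behavior for node pool labels:
# Show the driver-version label value for each node; newly added GPU-accelerated nodes should display 418.181.07.
kubectl get nodes -L ack.aliyun.com/nvidia-driver-version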
Step 3: Check whether the specified NVIDIA driver version is installed
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and connect to it, for example, by choosing More > Open Cloud Shell in the Actions column.
Run the following command to query the pods that have the component: nvidia-device-plugin label:
kubectl get po -n kube-system -l component=nvidia-device-plugin -o wide
Expected output:
NAME                                            READY   STATUS    RESTARTS   AGE   IP              NODE                       NOMINATED NODE   READINESS GATES
nvidia-device-plugin-cn-beijing.192.168.1.127   1/1     Running   0          6d    192.168.1.127   cn-beijing.192.168.1.127   <none>           <none>
nvidia-device-plugin-cn-beijing.192.168.1.128   1/1     Running   0          17m   192.168.1.128   cn-beijing.192.168.1.128   <none>           <none>
nvidia-device-plugin-cn-beijing.192.168.8.12    1/1     Running   0          9d    192.168.8.12    cn-beijing.192.168.8.12    <none>           <none>
nvidia-device-plugin-cn-beijing.192.168.8.13    1/1     Running   0          9d    192.168.8.13    cn-beijing.192.168.8.13    <none>           <none>
nvidia-device-plugin-cn-beijing.192.168.8.14    1/1     Running   0          9d    192.168.8.14    cn-beijing.192.168.8.14    <none>           <none>
You can check the NODE column to find the node that is newly added to the cluster. The name of the pod that runs on the node is nvidia-device-plugin-cn-beijing.192.168.1.128.
Run the following command to query the NVIDIA driver version of the node:
kubectl exec -ti nvidia-device-plugin-cn-beijing.192.168.1.128 -n kube-system -- nvidia-smi
Expected output:
Sun Feb  7 04:09:01 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.181.07   Driver Version: 418.181.07   CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:07.0 Off |                    0 |
| N/A   27C    P0    40W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:08.0 Off |                    0 |
| N/A   27C    P0    40W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:09.0 Off |                    0 |
| N/A   31C    P0    39W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:0A.0 Off |                    0 |
| N/A   27C    P0    41W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
The output shows that the NVIDIA driver version is 418.181.07, which indicates that the specified driver version is installed.
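To further confirm that the installed driver works with the CUDA toolkit that your applications use, you can optionally run a short test pod that requests a GPU. The following commands are a sketch; the pod name and image tag are examples, and you should replace the image with a CUDA image whose CUDA version is supported by the installed driver.
# Create a one-off test pod that requests one GPU and runs nvidia-smi (pod name and image are examples).
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-driver-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-driver-test
    image: nvidia/cuda:11.4.3-base-ubuntu20.04   # replace with a CUDA version that the installed driver supports
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# After the pod completes, check its output and then delete the test pod.
kubectl logs cuda-driver-test
kubectl delete pod cuda-driver-test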
Other methods
When you call the API to create or scale out an ACK cluster, you can also specify an NVIDIA driver version by adding the following key-value pair to the tags field in the request. Sample code:
{
// Other fields are not shown.
......
"tags": [
{
"key": "ack.aliyun.com/nvidia-driver-version",
"value": "418.181.07"
}
],
// Other fields are not shown.
......
}