If your CUDA libraries require a new NVIDIA driver version, you can update the NVIDIA driver of the nodes by uninstalling the current version and then installing the new version. This topic describes how to update the NVIDIA driver of a node.
Prerequisites
The kubeconfig file of your cluster is obtained and a kubectl client is connected to your cluster.
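For example, you can run the following command to confirm that kubectl can reach the cluster and that the GPU-accelerated node appears in the node list. This check is optional and not part of the required procedure:
kubectl get nodes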
Procedure
Step 1: Disconnect the node from the cluster and drain the node
Run the following command to set the GPU-accelerated node whose driver you want to update to unschedulable:
kubectl cordon <NODE_NAME>
Replace <NODE_NAME> with the name of the node.
Expected output:
node/<NODE_NAME> cordoned
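Optionally, confirm that the node is now unschedulable. In the output of the following command, the node status is expected to include SchedulingDisabled:
kubectl get node <NODE_NAME>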
Run the following command to evict the pods on the GPU-accelerated node:
kubectl drain <NODE_NAME> --grace-period=120 --ignore-daemonsets=true # Set the grace period for evicting pods to 120 seconds.
Expected output:
There are pending nodes to be drained: <NODE_NAME>
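Optionally, list the pods that remain on the node. DaemonSet pods are not evicted by kubectl drain and are expected to remain:
kubectl get pods --all-namespaces --field-selector spec.nodeName=<NODE_NAME>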
Step 2: Uninstall the current NVIDIA driver of the GPU-accelerated node
Log on to the node and run the following command to stop the kubelet and containerd services on the node. Some DaemonSet pods use GPU resources, and the kubectl drain command cannot evict DaemonSet pods. Therefore, you need to stop the kubelet and containerd services to stop these pods.
sudo systemctl stop kubelet containerd
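Optionally, confirm that both services are stopped. The following command is expected to print inactive for each service:
sudo systemctl is-active kubelet containerd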
Log on to the node and run the fuser command to check whether any processes are using GPU resources. If such processes exist, run the kill command to terminate them. No process is allowed to use GPU resources during the NVIDIA driver update.
sudo fuser -v /dev/nvidia*
If no output is displayed, no process is using GPU resources. In the following example output, process 3781 is using GPU resources:
                     USER        PID ACCESS COMMAND
/dev/nvidia0:        root       3781 F....  dcgm-exporter
/dev/nvidiactl:      root       3781 F...m  dcgm-exporter
In this case, you need to terminate the process:
sudo kill 3781
Run the fuser command again to check whether any processes are still using GPU resources. Repeat the preceding step until no process is using GPU resources.
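If you prefer not to repeat the check manually, the following Bash sketch waits until fuser no longer reports any process that holds a /dev/nvidia* device file. This is an optional convenience, not part of the required procedure:
# Wait until no process holds a /dev/nvidia* device file.
while sudo fuser -s /dev/nvidia*; do
    echo "GPU devices are still in use. Checking again in 5 seconds..."
    sleep 5
done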
Log on to the node and uninstall the NVIDIA driver.
sudo nvidia-uninstall
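Optionally, confirm that the driver was removed. The NVIDIA kernel modules are typically unloaded after the uninstaller completes, so the following command is expected to produce no output:
lsmod | grep nvidia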
(Optional) Uninstall NVIDIA Fabric Manager.
Run the following command to check whether NVIDIA Fabric Manager is installed on the node:
sudo rpm -qa | grep ^nvidia-fabric-manager
If no output is displayed, NVIDIA Fabric Manager is not installed and you can skip this step. Otherwise, run the following command to uninstall NVIDIA Fabric Manager:
sudo yum remove nvidia-fabric-manager
Step 3: Install the new NVIDIA driver version on the node
Download the new driver version from the official NVIDIA website to the node and run the following command to install it. In this example, the NVIDIA-Linux-x86_64-510.108.03.run installation package is used.
sudo bash NVIDIA-Linux-x86_64-510.108.03.run -a -s -q
Run the following command to check whether the new driver version is installed:
sudo nvidia-smi
The output indicates that the driver version is 510.108.03.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   35C    P0    40W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Configure the following settings:
sudo nvidia-smi -pm 1 || true                      # Enable the Persistence mode.
sudo nvidia-smi -acp 0 || true                     # Set the permission requirement to UNRESTRICTED.
sudo nvidia-smi --auto-boost-default=0 || true     # Disable the auto boost mode. A value of 0 indicates that the auto boost mode is disabled.
sudo nvidia-smi --auto-boost-permission=0 || true  # Allow users other than administrators to control the auto boost mode. A value of 0 allows users other than administrators to control the auto boost mode. A value of 1 does not.
sudo nvidia-modprobe -u -c=0 -m || true            # Load the NVIDIA kernel modules and create the device file with the specified minor number.
(Optional) If you want the NVIDIA driver to be loaded and these settings to be reapplied automatically when the node restarts, make sure that the /etc/rc.d/rc.local file contains the following configuration:
sudo nvidia-smi -pm 1 || true
sudo nvidia-smi -acp 0 || true
sudo nvidia-smi --auto-boost-default=0 || true
sudo nvidia-smi --auto-boost-permission=0 || true
sudo nvidia-modprobe -u -c=0 -m || true
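On systems where /etc/rc.d/rc.local is run by rc-local.service, the file is executed at boot only if it is executable. If that applies to your node, you may also need to run the following command:
sudo chmod +x /etc/rc.d/rc.local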
Run the following command to check whether you need to install NVIDIA Fabric Manager on the node:
sudo lspci | grep -i 'Bridge:.*NVIDIA'
If no output is displayed, you do not need to install NVIDIA Fabric Manager and you can skip this step. Otherwise, download the NVIDIA Fabric Manager package from the NVIDIA YUM repository to the node. Make sure that the NVIDIA Fabric Manager version is the same as the new NVIDIA driver version.
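If you are unsure which NVIDIA Fabric Manager version to download, you can print the installed driver version first. The following query is one way to do this:
nvidia-smi --query-gpu=driver_version --format=csv,noheader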
Run the following commands to install and start NVIDIA Fabric Manager:
# Install NVIDIA Fabric Manager.
sudo yum localinstall nvidia-fabric-manager-510.108.03-1.x86_64.rpm
# Enable NVIDIA Fabric Manager to start on boot.
sudo systemctl enable nvidia-fabricmanager.service
# Start NVIDIA Fabric Manager.
sudo systemctl start nvidia-fabricmanager.service
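Optionally, confirm that NVIDIA Fabric Manager is running:
sudo systemctl status nvidia-fabricmanager.service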
Restart the kubelet and containerd services.
sudo systemctl restart containerd kubelet
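Optionally, confirm that both services are running again. The following command is expected to print active for each service:
sudo systemctl is-active containerd kubelet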
Step 4: Connect the node to the cluster
Run the following command to reconnect the node to the cluster by setting it to schedulable. Replace <NODE_NAME> with the name of the node.
kubectl uncordon <NODE_NAME>
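Optionally, confirm that the node is schedulable again and that GPU resources are advertised. The resource name nvidia.com/gpu assumes that the NVIDIA device plugin is deployed in your cluster:
kubectl get node <NODE_NAME>
kubectl describe node <NODE_NAME> | grep nvidia.com/gpu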