If your CUDA libraries require a new NVIDIA driver version, you can update the NVIDIA driver of the nodes by uninstalling the current version and then installing the new version. This topic describes how to update the NVIDIA driver of a node.
Prerequisites
The kubeconfig file of your cluster is obtained and a kubectl client is connected to your cluster.
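For example, you can run the following command to confirm that kubectl can reach the cluster and that the GPU-accelerated node appears in the node list. This check is optional and not part of the required procedure:
kubectl get nodes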
Procedure
Step 1: Disconnect the node from the cluster and drain the node
Run the following command to set the GPU-accelerated node whose driver you want to update to unschedulable:
kubectl cordon <NODE_NAME>
Replace <NODE_NAME> with the name of the node.
Expected output:
node/<NODE_NAME> cordoned
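Optionally, confirm that the node is now unschedulable. In the output of the following command, the node status is expected to include SchedulingDisabled:
kubectl get node <NODE_NAME>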
Run the following command to evict the pods on the GPU-accelerated node:
kubectl drain <NODE_NAME> --grace-period=120 --ignore-daemonsets=true # Set the grace period for evicting pods to 120 seconds.
Expected output:
There are pending nodes to be drained: <NODE_NAME>
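Optionally, list the pods that remain on the node. DaemonSet pods are not evicted by kubectl drain and are expected to remain:
kubectl get pods --all-namespaces --field-selector spec.nodeName=<NODE_NAME>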
Step 2: Uninstall the current NVIDIA driver of the GPU-accelerated node
Log on to the node and run the following command to stop the kubelet and containerd services on the node. Some DaemonSet pods use GPU resources, and the kubectl drain command cannot evict DaemonSet pods. Therefore, you need to stop the kubelet and containerd services to stop these pods.
sudo systemctl stop kubelet containerd
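Optionally, confirm that both services are stopped. The following command is expected to print inactive for each service:
sudo systemctl is-active kubelet containerd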
Log on to the node and run the fuser command to check whether any processes are using GPU resources. If such processes exist, run the kill command to terminate them. No process is allowed to use GPU resources during the NVIDIA driver update.
sudo fuser -v /dev/nvidia*
If no output is displayed, no process is using GPU resources. In the following example output, process 3781 is using GPU resources:
                     USER        PID ACCESS COMMAND
/dev/nvidia0:        root       3781 F....  dcgm-exporter
/dev/nvidiactl:      root       3781 F...m  dcgm-exporter
In this case, you need to terminate the process:
sudo kill 3781
Run the fuser command again to check whether any processes are still using GPU resources. Repeat the preceding step until no process is using GPU resources.
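If you prefer not to repeat the check manually, the following Bash sketch waits until fuser no longer reports any process that holds a /dev/nvidia* device file. This is an optional convenience, not part of the required procedure:
# Wait until no process holds a /dev/nvidia* device file.
while sudo fuser -s /dev/nvidia*; do
    echo "GPU devices are still in use. Checking again in 5 seconds..."
    sleep 5
done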
Log on to the node and uninstall the NVIDIA driver.
sudo nvidia-uninstall
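Optionally, confirm that the driver was removed. The NVIDIA kernel modules are typically unloaded after the uninstaller completes, so the following command is expected to produce no output:
lsmod | grep nvidia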
(Optional) Uninstall NVIDIA Fabric Manager.
Run the following command to check whether NVIDIA Fabric Manager is installed on the node:
sudo rpm -qa | grep ^nvidia-fabric-manager
If no output is displayed, NVIDIA Fabric Manager is not installed and you can skip this step. Otherwise, run the following command to uninstall NVIDIA Fabric Manager:
sudo yum remove nvidia-fabric-manager
Step 3: Install the new NVIDIA driver version on the node
Download the new driver version from the official NVIDIA website to the node and run the following command to install it. In this example, the NVIDIA-Linux-x86_64-510.108.03.run installation package is used.
sudo bash NVIDIA-Linux-x86_64-510.108.03.run -a -s -q
Run the following command to check whether the new driver version is installed:
sudo nvidia-smi
The output indicates that the driver version is 510.108.03.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   35C    P0    40W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Configure the following settings:
sudo nvidia-smi -pm 1 || true                      # Enable the Persistence mode.
sudo nvidia-smi -acp 0 || true                     # Set the permission requirement to UNRESTRICTED.
sudo nvidia-smi --auto-boost-default=0 || true     # Disable the auto boost mode. A value of 0 indicates that the auto boost mode is disabled.
sudo nvidia-smi --auto-boost-permission=0 || true  # Allow users other than administrators to control the auto boost mode. A value of 0 allows users other than administrators to control the auto boost mode. A value of 1 does not.
sudo nvidia-modprobe -u -c=0 -m || true            # Load the NVIDIA kernel modules and create the device file with the specified minor number.
(Optional) If you want the NVIDIA driver to be loaded and these settings to be reapplied automatically when the node restarts, make sure that the /etc/rc.d/rc.local file contains the following configuration:
sudo nvidia-smi -pm 1 || true
sudo nvidia-smi -acp 0 || true
sudo nvidia-smi --auto-boost-default=0 || true
sudo nvidia-smi --auto-boost-permission=0 || true
sudo nvidia-modprobe -u -c=0 -m || true
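On systems where /etc/rc.d/rc.local is run by rc-local.service, the file is executed at boot only if it is executable. If that applies to your node, you may also need to run the following command:
sudo chmod +x /etc/rc.d/rc.local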
Run the following command to check whether you need to install NVIDIA Fabric Manager on the node:
sudo lspci | grep -i 'Bridge:.*NVIDIA'
If no output is displayed, you do not need to install NVIDIA Fabric Manager and you can skip this step. Otherwise, download the NVIDIA Fabric Manager package from the NVIDIA YUM repository to the node. Make sure that the NVIDIA Fabric Manager version is the same as the new NVIDIA driver version.
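If you are unsure which NVIDIA Fabric Manager version to download, you can print the installed driver version first. The following query is one way to do this:
nvidia-smi --query-gpu=driver_version --format=csv,noheader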
Run the following commands to install and start NVIDIA Fabric Manager:
# Install NVIDIA Fabric Manager.
sudo yum localinstall nvidia-fabric-manager-510.108.03-1.x86_64.rpm
# Enable NVIDIA Fabric Manager to start on boot.
sudo systemctl enable nvidia-fabricmanager.service
# Start NVIDIA Fabric Manager.
sudo systemctl start nvidia-fabricmanager.service
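Optionally, confirm that NVIDIA Fabric Manager is running:
sudo systemctl status nvidia-fabricmanager.service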
Restart the kubelet and containerd services.
sudo systemctl restart containerd kubelet
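Optionally, confirm that both services are running again. The following command is expected to print active for each service:
sudo systemctl is-active containerd kubelet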
Step 4: Connect the node to the cluster
Run the following command to reconnect the node to the cluster by setting it to schedulable. Replace <NODE_NAME> with the name of the node.
kubectl uncordon <NODE_NAME>
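Optionally, confirm that the node is schedulable again and that GPU resources are advertised. The resource name nvidia.com/gpu assumes that the NVIDIA device plugin is deployed in your cluster:
kubectl get node <NODE_NAME>
kubectl describe node <NODE_NAME> | grep nvidia.com/gpu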