The GPU Device Plugin is a component that manages GPU-equipped nodes in Kubernetes clusters and enables Kubernetes to manage GPU resources more conveniently and efficiently. This topic describes how to restart the GPU Device Plugin on a node, isolate GPU devices, and query and update the GPU Device Plugin version in exclusive GPU scheduling scenarios.
Restart GPU Device Plugin
In Container Service for Kubernetes (ACK) exclusive GPU scheduling scenarios, the GPU Device Plugin is deployed as a static pod by default. Therefore, you must restart it on the node where the static pod runs. Run the following commands on that node to restart the GPU Device Plugin:
mv /etc/kubernetes/manifests/nvidia-device-plugin.yml /etc/kubernetes/
# Wait a few seconds for the system to delete the original pod.
mv /etc/kubernetes/nvidia-device-plugin.yml /etc/kubernetes/manifests/
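After you move the manifest back, the kubelet recreates the static pod automatically. As a quick check, you can confirm that the pod is running again. The following command is only a sketch: the pod is typically in the kube-system namespace and its name is derived from the manifest name and the node name, which may differ in your cluster.
# Confirm that the GPU Device Plugin static pod is running again.
kubectl get pods -n kube-system | grep nvidia-device-plugin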
Isolate GPU devices
GPU device isolation is supported only by nvidia-device-plugin 0.9.1 and later. For more information about how to query the nvidia-device-plugin version, see Query and update the GPU Device Plugin version.
In ACK exclusive GPU scheduling scenarios, you may need to isolate a specific GPU device on a node, for example, because the device is faulty. ACK provides a mechanism that allows you to manually isolate a faulty GPU device on a node so that new pods are not scheduled to it. To isolate GPU devices, perform the following operations.
Find the unhealthyDevices.json file in the /etc/nvidia-device-plugin/ directory of the desired node. If the file does not exist, create it. The unhealthyDevices.json file must use the following JSON format:
{
  "index": ["x", "x", ...],
  "uuid": ["xxx", "xxx", ...]
}
You can specify the index or the uuid parameter of the device to be isolated in the JSON string as needed. For each device, you need to specify only one of the two parameters. After you save the file, it takes effect automatically.
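For example, to isolate a single GPU by its index, you can create the file directly on the node. The following commands are only a sketch: the index value 1 is an example, and nvidia-smi is used here to look up the index and UUID of each GPU on the node.
# List the index and UUID of each GPU on the node.
nvidia-smi --query-gpu=index,uuid --format=csv
# Isolate the GPU with index 1 (example value).
cat <<'EOF' > /etc/nvidia-device-plugin/unhealthyDevices.json
{
  "index": ["1"]
}
EOF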
You can check the isolation result by querying the amount of nvidia.com/gpu resources reported to the cluster.
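For example, the following command (a sketch that assumes kubectl access to the cluster; replace <NODE-NAME> with the name of the node) prints the number of allocatable GPUs reported by the node, which decreases after a device is isolated:
# Query the amount of nvidia.com/gpu resources reported by the node.
kubectl get node <NODE-NAME> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'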
Query and update the GPU Device Plugin version
You can find the image tag of the GPU Device Plugin in the /etc/kubernetes/manifests/nvidia-device-plugin.yml file on the desired node. The version number in this tag is the version of the GPU Device Plugin.
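For example, you can print the image line of the static pod manifest directly on the node with standard shell tools:
# Print the image and its tag from the static pod manifest.
grep "image:" /etc/kubernetes/manifests/nvidia-device-plugin.yml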
The latest version supported by ACK is v0.9.3-0dd4d5f5-aliyun. If you want to update the nvidia-device-plugin on a node to the latest version, modify the static YAML file /etc/kubernetes/manifests/nvidia-device-plugin.yml based on the following content (including the image tag, startup command, resources, volumeMounts, and volumes):
apiVersion: v1
kind: Pod
... ...
  hostNetwork: true
  containers:
  - image: registry-<REGION-ID>-vpc.ack.aliyuncs.com/acs/k8s-device-plugin:v0.9.3-0dd4d5f5-aliyun
    # Replace <REGION-ID> in the image address with the ID of the region where your node resides, such as cn-beijing or cn-hangzhou.
    name: nvidia-device-plugin-ctr
    args: ["--fail-on-init-error=false", "--pass-device-specs=true", "--device-id-strategy=index"]
    resources:
      requests:
        memory: "1Mi"
        cpu: "1m"
      limits:
        memory: "200Mi"
        cpu: "500m"
    ... ...
    volumeMounts:
    - name: device-plugin
      mountPath: /var/lib/kubelet/device-plugins
    - name: device-plugin-config
      mountPath: /etc/nvidia-device-plugin
  volumes:
  - name: device-plugin
    hostPath:
      path: /var/lib/kubelet/device-plugins
  - name: device-plugin-config
    hostPath:
      path: /etc/nvidia-device-plugin
      type: DirectoryOrCreate
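After you save the modified manifest, the kubelet recreates the static pod with the new configuration. As a rough check (the namespace and pod name pattern below are assumptions and may differ in your cluster), you can verify the image that the pod uses:
# Verify that the recreated static pod uses the new image tag.
kubectl get pods -n kube-system | grep nvidia-device-plugin
kubectl describe pod <NVIDIA-DEVICE-PLUGIN-POD-NAME> -n kube-system | grep "Image:"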
Modify the key of the device checkpoint in the GPU Device Plugin
The Device Plugin creates a checkpoint file on the node during device allocation to record the devices that have been allocated to pods and the corresponding pod information. By default, the NVIDIA GPU Device Plugin uses the universally unique identifier (UUID) of each GPU device as the unique identifier (key) in the checkpoint file. Perform the following steps to change this key to the device index, which resolves issues such as UUID loss during VM cold migration.
Check the Device Plugin version:
View the image tag of the Device Plugin in the /etc/kubernetes/manifests/nvidia-device-plugin.yml file of the target node. This tag indicates the version of the Device Plugin. You do not need to modify the version number if it is 0.9.3 or later. Otherwise, change it to the latest version v0.9.3-0dd4d5f5-aliyun.
Modify the environment variables of the static pod:
Use the following code to add a new environment variable CHECKPOINT_DEVICE_ID_STRATEGY to the static pod configuration by editing the /etc/kubernetes/manifests/nvidia-device-plugin.yml file:
env:
- name: CHECKPOINT_DEVICE_ID_STRATEGY
  value: index
Restart the GPU Device Plugin:
Restart the GPU Device Plugin on the node for the modifications to take effect. For more information, see Restart GPU Device Plugin.
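After the restart, you can check whether newly allocated devices are recorded by index instead of UUID. The following command is a sketch that assumes the kubelet's default checkpoint path; NVIDIA GPU UUIDs start with the prefix GPU-, so entries without that prefix suggest that device indexes are being used:
# Inspect the device checkpoint on the node (default kubelet path).
cat /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint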
References
For more information about GPU-related issues, see Diagnose GPU-accelerated nodes and FAQ.
For more information about GPU sharing, see GPU sharing overview.