The GPU Device Plugin is a component that manages GPU-equipped nodes in Kubernetes clusters and enables Kubernetes to manage GPU resources more conveniently and efficiently. This topic describes how to restart the GPU Device Plugin on a node, isolate GPU devices, and query and update the GPU Device Plugin version in exclusive GPU scheduling scenarios.
Restart GPU Device Plugin
In Container Service for Kubernetes (ACK) exclusive GPU scheduling scenarios, the GPU Device Plugin is deployed as a static pod by default. Therefore, you must restart it on the node where the static pod runs. Run the following commands on that node to restart the GPU Device Plugin:
mv /etc/kubernetes/manifests/nvidia-device-plugin.yml /etc/kubernetes/
# Wait a few seconds for the system to delete the original pod.
mv /etc/kubernetes/nvidia-device-plugin.yml /etc/kubernetes/manifests/
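After you move the manifest back, the kubelet recreates the static pod automatically. As a quick check, you can confirm that the pod is running again. The following command is only a sketch: the pod is typically in the kube-system namespace and its name is derived from the manifest name and the node name, which may differ in your cluster.
# Confirm that the GPU Device Plugin static pod is running again.
kubectl get pods -n kube-system | grep nvidia-device-plugin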
Isolate GPU devices
GPU device isolation is supported only by nvidia-device-plugin 0.9.1 and later. For more information about how to query the nvidia-device-plugin version, see Query and update the GPU Device Plugin version.
In ACK exclusive GPU scheduling scenarios, you may need to isolate a specific GPU device on a node, for example, because the device is faulty. ACK provides a mechanism that allows you to manually isolate a faulty GPU device on a node so that new pods are not scheduled to it. To isolate GPU devices, perform the following operations.
Find the unhealthyDevices.json file in the /etc/nvidia-device-plugin/ directory of the desired node. If the file does not exist, create it. The unhealthyDevices.json file must use the following JSON format:
{
  "index": ["x", "x", ...],
  "uuid": ["xxx", "xxx", ...]
}
You can specify the index or the uuid parameter of the device to be isolated in the JSON string as needed. For each device, you need to specify only one of the two parameters. After you save the file, it takes effect automatically.
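For example, to isolate a single GPU by its index, you can create the file directly on the node. The following commands are only a sketch: the index value 1 is an example, and nvidia-smi is used here to look up the index and UUID of each GPU on the node.
# List the index and UUID of each GPU on the node.
nvidia-smi --query-gpu=index,uuid --format=csv
# Isolate the GPU with index 1 (example value).
cat <<'EOF' > /etc/nvidia-device-plugin/unhealthyDevices.json
{
  "index": ["1"]
}
EOF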
You can check the isolation result by querying the amount of nvidia.com/gpu resources reported to the cluster.
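For example, the following command (a sketch that assumes kubectl access to the cluster; replace <NODE-NAME> with the name of the node) prints the number of allocatable GPUs reported by the node, which decreases after a device is isolated:
# Query the amount of nvidia.com/gpu resources reported by the node.
kubectl get node <NODE-NAME> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'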
Query and update the GPU Device Plugin version
You can find the image tag of the GPU Device Plugin in the /etc/kubernetes/manifests/nvidia-device-plugin.yml file on the desired node. The version number in this tag is the version of the GPU Device Plugin.
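For example, you can print the image line of the static pod manifest directly on the node with standard shell tools:
# Print the image and its tag from the static pod manifest.
grep "image:" /etc/kubernetes/manifests/nvidia-device-plugin.yml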
The latest version supported by ACK is v0.9.3-0dd4d5f5-aliyun. If you want to update the nvidia-device-plugin on a node to the latest version, modify the static YAML file /etc/kubernetes/manifests/nvidia-device-plugin.yml based on the following content (including the image tag, startup command, resources, volumeMounts, and volumes):
apiVersion: v1
kind: Pod
... ...
  hostNetwork: true
  containers:
  - image: registry-<REGION-ID>-vpc.ack.aliyuncs.com/acs/k8s-device-plugin:v0.9.3-0dd4d5f5-aliyun
    # Replace <REGION-ID> in the image address with the ID of the region where your node resides, such as cn-beijing or cn-hangzhou.
    name: nvidia-device-plugin-ctr
    args: ["--fail-on-init-error=false", "--pass-device-specs=true", "--device-id-strategy=index"]
    resources:
      requests:
        memory: "1Mi"
        cpu: "1m"
      limits:
        memory: "200Mi"
        cpu: "500m"
    ... ...
    volumeMounts:
    - name: device-plugin
      mountPath: /var/lib/kubelet/device-plugins
    - name: device-plugin-config
      mountPath: /etc/nvidia-device-plugin
  volumes:
  - name: device-plugin
    hostPath:
      path: /var/lib/kubelet/device-plugins
  - name: device-plugin-config
    hostPath:
      path: /etc/nvidia-device-plugin
      type: DirectoryOrCreate
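After you save the modified manifest, the kubelet recreates the static pod with the new configuration. As a rough check (the namespace and pod name pattern below are assumptions and may differ in your cluster), you can verify the image that the pod uses:
# Verify that the recreated static pod uses the new image tag.
kubectl get pods -n kube-system | grep nvidia-device-plugin
kubectl describe pod <NVIDIA-DEVICE-PLUGIN-POD-NAME> -n kube-system | grep "Image:"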
Modify the key of the device checkpoint in the GPU Device Plugin
The Device Plugin creates a checkpoint file on the node during device allocation to record the devices that have been allocated to pods and the corresponding pod information. By default, the NVIDIA GPU Device Plugin uses the universally unique identifier (UUID) of each GPU device as the unique identifier (key) in the checkpoint file. Perform the following steps to change this key to the device index, which resolves issues such as UUID loss during VM cold migration.
Check the Device Plugin version:
View the image tag of the Device Plugin in the /etc/kubernetes/manifests/nvidia-device-plugin.yml file of the target node. This tag indicates the version of the Device Plugin. You do not need to modify the version number if it is 0.9.3 or later. Otherwise, change it to the latest version v0.9.3-0dd4d5f5-aliyun.
Modify the environment variables of the static pod:
Use the following code to add a new environment variable CHECKPOINT_DEVICE_ID_STRATEGY to the static pod configuration by editing the /etc/kubernetes/manifests/nvidia-device-plugin.yml file:
env:
- name: CHECKPOINT_DEVICE_ID_STRATEGY
  value: index
Restart the GPU Device Plugin:
Restart the GPU Device Plugin on the node for the modifications to take effect. For more information, see Restart GPU Device Plugin.
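After the restart, you can check whether newly allocated devices are recorded by index instead of UUID. The following command is a sketch that assumes the kubelet's default checkpoint path; NVIDIA GPU UUIDs start with the prefix GPU-, so entries without that prefix suggest that device indexes are being used:
# Inspect the device checkpoint on the node (default kubelet path).
cat /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint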
References
For more information about GPU-related issues, see Diagnose GPU-accelerated nodes and FAQ.
For more information about GPU sharing, see GPU sharing overview.