After a GPU-accelerated instance fails, the IDs of the GPUs on the instance may change. If the GPU IDs change, the containers on the instance cannot be launched as normal. GPUOps detects whether the IDs of the GPUs on a GPU-accelerated instance are the same as those stored in the /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint file. If the GPU IDs are not the same, GPUOps deletes the checkpoint file and kubelet generates a new checkpoint file. This ensures that the GPU IDs stored in the checkpoint file are the same as the actual GPU IDs. This topic describes how to fix the issue that the IDs of the GPUs on a GPU-accelerated instance change after the instance fails.
Prerequisites
- An ACK managed cluster or ACK dedicated cluster with GPU-accelerated nodes is created. For more information, see Create an ACK managed cluster with GPU-accelerated nodes or Create an ACK dedicated cluster with GPU-accelerated nodes.
- A jump server that uses SSH is added to the cluster. The jump server has access to the Internet. For more information, see Configure SNAT entries for existing ACK clusters.
Background information
- After a failed GPU-accelerated instance is restarted or replaced, the IDs of the GPUs on the instance may be changed. If the GPU IDs are different from those stored in the /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint file, containers on the GPU-accelerated instance cannot be launched as normal.
Note
- GPU-accelerated instances refer to Elastic Compute Service (ECS) instances that are equipped with GPUs.
- The ID of a GPU may be changed in the following scenarios:
- A failed GPU-accelerated instance is restarted or replaced.
- You manually restart a GPU-accelerated instance.
- GPUOps is a binary program that can run on Linux. For more information about how to download GPUOps, see GPUOps.
- A GPU-accelerated instance can be equipped with multiple GPUs. Each GPU has a unique ID.
Step 1: Deploy GPUOps
Deploy GPUOps on a single node
- Run the following command to copy GPUOps to the /usr/local/bin/ directory and grant executable permissions to GPUOps:
cp ./gpuops /usr/local/bin/
chmod +x gpuops
- Run the following command to enable GPUOps to automatically start up with the instance:
cat > /etc/systemd/system/gpuops.service <<EOF
[Unit]
Description=Gpuops: check kubelet checkpoint gpu status
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/bin/gpuops check

[Install]
WantedBy=multi-user.target
EOF
systemctl enable gpuops.service
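You can also start the service immediately to run the check without restarting the instance, and then view its log. The following verification commands are optional:
systemctl start gpuops.service
journalctl -u gpuops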
Deploy GPUOps on multiple nodes at a time
In the Kubernetes cluster with GPU-accelerated instances, deploy a DaemonSet by using the following YAML template:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpuops-deploy
  labels:
    k8s-app: gpuops-deploy
spec:
  selector:
    matchLabels:
      name: gpuops-deploy
  template:
    metadata:
      labels:
        name: gpuops-deploy
    spec:
      hostPID: true
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: aliyun.accelerator/nvidia_name
                operator: Exists
      containers:
      - name: gpuops
        image: registry.cn-beijing.aliyuncs.com/acs/gpuops:latest
        command:
        securityContext:
          privileged: true
        volumeMounts:
        - name: hostbin
          mountPath: /workspace/host/usr/local/bin
        - name: hostsystem
          mountPath: /workspace/host/etc/systemd/system
      terminationGracePeriodSeconds: 30
      volumes:
      - name: hostbin
        hostPath:
          path: /usr/local/bin
      - name: hostsystem
        hostPath:
          path: /etc/systemd/system
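To deploy the DaemonSet, save the template to a file and apply it with kubectl. The file name gpuops-ds.yaml in the following command is only an example:
kubectl apply -f gpuops-ds.yaml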
After the DaemonSet is deployed, it performs the following operations on each GPU-accelerated worker node in the cluster:
- Copies GPUOps to the /usr/local/bin/ directory and makes GPUOps executable.
- Enables GPUOps to automatically start up with the instance.
When a failed GPU-accelerated instance in the cluster is restarted or replaced, GPUOps ensures that the GPU IDs stored in the checkpoint file are the same as the actual GPU IDs.
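To verify the deployment, you can check that a GPUOps pod is running on each GPU-accelerated node and that GPUOps is installed on the nodes. The following commands are a suggested check and assume that the DaemonSet was created in the current namespace:
kubectl get ds gpuops-deploy
kubectl get pods -l name=gpuops-deploy -o wide
# Run the following commands on a GPU-accelerated node:
ls -l /usr/local/bin/gpuops
systemctl is-enabled gpuops.service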
Step 2: Verify that GPUOps has fixed the GPU ID issue
- After GPUOps is deployed, create a pod in the Kubernetes cluster with GPU-accelerated instances.
The following YAML template is used to create the pod:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: app-3g-v1
  labels:
    app: app-3g-v1
spec:
  replicas: 1
  serviceName: "app-3g-v1"
  podManagementPolicy: "Parallel"
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: app-3g-v1
  template: # define the pods specifications
    metadata:
      labels:
        app: app-3g-v1
    spec:
      containers:
      - name: app-3g-v1
        image: registry.cn-shanghai.aliyuncs.com/tensorflow-samples/cuda-malloc:3G
        resources:
          limits:
            nvidia.com/gpu: 1
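You can save the template to a file and create the StatefulSet with kubectl. The file name app-3g-v1.yaml in the following command is only an example:
kubectl apply -f app-3g-v1.yaml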
- Run the following command to query the status of the pod:
kubectl get pod
Expected output:
NAME          READY   STATUS    RESTARTS   AGE
app-3g-v1-0   1/1     Running   0          22s
- Log on to the GPU-accelerated instance where the pod runs and run the following command to restart the instance.
Note For more information about how to log on to a GPU-accelerated instance, see Connect to a Linux instance by using a password.
sudo reboot
- Run the following command to print the log of GPUOps:
journalctl -u gpuops
- If the following output is returned, the GPU ID stored in the checkpoint file is the same as the actual GPU ID. You do not need to perform any actions.
June 28 14:49:00 iZ2vc1mysgx8bqdv3oyji9Z gpuops[21976]: {"level":"info","msg":"check gpu start...","time":"2020-06-28T14:49:00+08:00"}
June 28 14:49:00 iZ2vc1mysgx8bqdv3oyji9Z gpuops[21976]: {"level":"info","msg":"find nvidia gpu: [\"GPU-383cd6a5-00a6-b455-62c4-c163d164b837\"]","time":"2020-06-28T14:49:00+08:00"}
June 28 14:49:00 iZ2vc1mysgx8bqdv3oyji9Z gpuops[21976]: {"level":"info","msg":"read checkpoint: {\"Data\":{\"PodDeviceEntries\":[{\"PodUID\":\"300f47eb-e5c3-4d0b-9a87-925837b67732\",\"ContainerName\":\"app-3g-v1\",\"ResourceName\":\"aliyun.com/g
June 28 14:49:00 iZ2vc1mysgx8bqdv3oyji9Z systemd[1]: Started Gpuops: check kubelet checkpoint gpu status.
June 28 14:49:00 iZ2vc1mysgx8bqdv3oyji9Z gpuops[21976]: {"level":"info","msg":"find registered gpu: [\"GPU-383cd6a5-00a6-b455-62c4-c163d164b837\"]","time":"2020-06-28T14:49:00+08:00"}
June 28 14:49:00 iZ2vc1mysgx8bqdv3oyji9Z gpuops[21976]: {"level":"info","msg":"find pod gpu: null","time":"2020-06-28T14:49:00+08:00"}
June 28 14:49:00 iZ2vc1mysgx8bqdv3oyji9Z gpuops[21976]: {"level":"info","msg":"cached gpu info in checkpoint is up to date, check gpu finished","time":"2020-06-28T14:49:00+08:00"}
- If the output contains warning messages, the GPU ID is changed. In this case, GPUOps deletes the checkpoint file and kubelet generates a new checkpoint file to store the new GPU ID.
June 28 14:41:16 iZ2vc1mysgx8bqdv3oyji9Z systemd[1]: Starting Gpuops: check kubelet checkpoint gpu status...
June 28 14:41:16 iZ2vc1mysgx8bqdv3oyji9Z gpuops[951]: {"level":"info","msg":"check gpu start...","time":"2020-06-28T14:41:16+08:00"}
June 28 14:41:17 iZ2vc1mysgx8bqdv3oyji9Z gpuops[951]: {"level":"info","msg":"find nvidia gpu: [\"GPU-68ce64fb-ea68-5ad6-7e72-31f3f07378df\"]","time":"2020-06-28T14:41:17+08:00"}
June 28 14:41:17 iZ2vc1mysgx8bqdv3oyji9Z gpuops[951]: {"level":"info","msg":"read checkpoint: {\"Data\":{\"PodDeviceEntries\":[{\"PodUID\":\"300f47eb-e5c3-4d0b-9a87-925837b67732\",\"ContainerName\":\"app-3g-v1\",\"ResourceName\":\"aliyun.com/gpu
June 28 14:41:17 iZ2vc1mysgx8bqdv3oyji9Z gpuops[951]: {"level":"info","msg":"find registered gpu: [\"GPU-7fd52bfc-364d-1129-a142-c3f10e053ccb\"]","time":"2020-06-28T14:41:17+08:00"}
June 28 14:41:17 iZ2vc1mysgx8bqdv3oyji9Z gpuops[951]: {"level":"info","msg":"find pod gpu: [\"GPU-7fd52bfc-364d-1129-a142-c3f10e053ccb\"]","time":"2020-06-28T14:41:17+08:00"}
June 28 14:41:17 iZ2vc1mysgx8bqdv3oyji9Z gpuops[951]: {"level":"warning","msg":"the registered gpu uuid not equal with nvidia gpu","time":"2020-06-28T14:41:17+08:00"}
June 28 14:41:17 iZ2vc1mysgx8bqdv3oyji9Z gpuops[951]: {"level":"warning","msg":"cached gpu info in checkpoint is out of date","time":"2020-06-28T14:41:17+08:00"}
June 28 14:41:17 iZ2vc1mysgx8bqdv3oyji9Z gpuops[951]: {"level":"info","msg":"delete checkpoint file success","time":"2020-06-28T14:41:17+08:00"}
June 28 14:41:17 iZ2vc1mysgx8bqdv3oyji9Z gpuops[951]: {"level":"info","msg":"kubelet restart success","time":"2020-06-28T14:41:17+08:00"}
June 28 14:41:17 iZ2vc1mysgx8bqdv3oyji9Z gpuops[951]: {"level":"info","msg":"check gpu finished","time":"2020-06-28T14:41:17+08:00"}
June 28 14:41:17 iZ2vc1mysgx8bqdv3oyji9Z systemd[1]: Started Gpuops: check kubelet checkpoint gpu status.
- Run the following command to query the actual GPU ID:
nvidia-smi -L
Expected output:
GPU 0: Tesla T4 (UUID: GPU-0650a168-e770-3ea8-8ac3-8a1d419763e0)
- Run the following command to query the GPU ID stored in the checkpoint file:
cat /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint
Expected output:
{"Data":{"PodDeviceEntries":null,"RegisteredDevices":{"nvidia.com/gpu":["GPU-0650a168-e770-3ea8-8ac3-8a1d419763e0"]}},"Checksum":3952659280}
The preceding output shows that the GPU ID stored in the checkpoint file is the same as the actual GPU ID. This indicates that GPUOps has fixed the issue of inconsistent GPU IDs.