Container Service for Kubernetes: Fix device ID changes after a GPU instance is restarted or replaced

Updated: Jun 19, 2024

After a GPU instance goes down, its GPU device IDs may change, which prevents containers from starting properly. GPUOps checks whether the GPU device IDs of a GPU instance are consistent with the GPU device IDs stored in /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint. If they are not, GPUOps deletes the checkpoint file so that the kubelet generates a new one, which ensures that the stored GPU device IDs match the actual GPU device IDs. This topic describes how to fix the device ID change issue caused by a GPU instance breakdown.
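
The check that GPUOps performs can be illustrated with the following shell sketch. This is a simplified illustration of the behavior described above, not the actual GPUOps source code; the checkpoint path and the delete-and-restart step follow the description in this topic (the kubelet restart also appears in the GPUOps log shown in Step 2).

#!/bin/bash
# Simplified sketch of the GPUOps consistency check (illustration only).
CHECKPOINT=/var/lib/kubelet/device-plugins/kubelet_internal_checkpoint

# GPU device IDs (UUIDs) currently reported by the NVIDIA driver.
real_ids=$(nvidia-smi --query-gpu=uuid --format=csv,noheader | sort)

# GPU device IDs cached in the kubelet checkpoint file.
cached_ids=$(grep -o 'GPU-[0-9a-f-]*' "$CHECKPOINT" | sort -u)

if [ "$real_ids" != "$cached_ids" ]; then
  # The cached IDs are stale: delete the checkpoint so that the kubelet
  # regenerates it, then restart the kubelet to re-register the devices.
  rm -f "$CHECKPOINT"
  systemctl restart kubelet
fi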

Background information

  • After a GPU instance goes down, its GPU device IDs may change when the instance is restarted or replaced. If these IDs no longer match the GPU device IDs stored in /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint, containers cannot start properly.

    Note
    • Restarting or replacing a GPU instance means restarting or replacing an ECS instance that has GPU devices.

    • All of the following scenarios may cause GPU device IDs to change:

      • The GPU instance is restarted or replaced after it goes down.

      • The GPU instance is manually restarted.

  • GPUOps is an executable binary that runs directly on Linux. For the download link, see GPUOps.

  • A GPU instance can have multiple GPU cards, and each GPU card has its own GPU device ID. The snippet after this list shows where these IDs are recorded on a node.
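
For reference, the device IDs that the kubelet has registered can be read directly from the checkpoint file. The following command assumes that the jq tool is installed on the node; the file layout matches the example output in Step 2.

sudo jq '.Data.RegisteredDevices' /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint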

Step 1: Deploy GPUOps

Deploy GPUOps on a single node

  1. Run the following commands to copy GPUOps to the /usr/local/bin/ directory and make it executable.

    sudo cp ./gpuops /usr/local/bin/
    sudo chmod +x /usr/local/bin/gpuops
  2. Run the following commands to configure GPUOps as a service that starts automatically at boot.

    sudo tee /etc/systemd/system/gpuops.service > /dev/null <<EOF
    [Unit]
    Description=Gpuops: check kubelet checkpoint gpu status
    After=network-online.target
    Wants=network-online.target
    
    [Service]
    Type=oneshot
    RemainAfterExit=yes
    ExecStart=/usr/local/bin/gpuops check
    
    [Install]
    WantedBy=multi-user.target
    EOF
    
    sudo systemctl enable gpuops.service
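
    Optionally, run the check once and inspect its log to confirm that the service works as expected (the same journalctl command is used again in Step 2):

    sudo systemctl start gpuops.service
    sudo journalctl -u gpuops --no-pager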

Batch deploy GPUOps

In a GPU cluster, use the following sample YAML file to deploy GPUOps as a DaemonSet.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpuops-deploy
  labels:
    k8s-app: gpuops-deploy
spec:
  selector:
    matchLabels:
      name: gpuops-deploy
  template:
    metadata:
      labels:
        name: gpuops-deploy
    spec:
      hostPID: true
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: aliyun.accelerator/nvidia_name
                    operator: Exists
      containers:
        - name: gpuops
          image: registry.cn-beijing.aliyuncs.com/acs/gpuops:latest
          securityContext:
            privileged: true
          volumeMounts:
            - name: hostbin
              mountPath: /workspace/host/usr/local/bin
            - name: hostsystem
              mountPath: /workspace/host/etc/systemd/system
      terminationGracePeriodSeconds: 30
      volumes:
        - name: hostbin
          hostPath:
            path: /usr/local/bin
        - name: hostsystem
          hostPath:
            path: /etc/systemd/system
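
To deploy the DaemonSet, save the YAML to a file and apply it with kubectl. The file name below is illustrative:

kubectl apply -f gpuops-daemonset.yaml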

After the DaemonSet is deployed, the GPU cluster automatically performs the following operations on all worker nodes:

  • Copies GPUOps to the /usr/local/bin/ directory and makes it executable.

  • Configures GPUOps as a service that starts automatically at boot.

From then on, whenever a GPU instance in the cluster goes down and is restarted or replaced, GPUOps ensures that the GPU device IDs stored in the checkpoint file are consistent with the actual GPU device IDs.
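
You can confirm that a GPUOps pod is scheduled on every GPU worker node by querying the labels defined in the YAML above:

kubectl get daemonset gpuops-deploy
kubectl get pods -l name=gpuops-deploy -o wide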

Step 2: Verify that GPUOps fixes GPU device ID changes

  1. After GPUOps is installed, deploy a pod in the GPU cluster.

    The following sample YAML file is used to deploy the pod:

    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: app-3g-v1
      labels:
        app: app-3g-v1
    spec:
      replicas: 1
      serviceName: "app-3g-v1"
      podManagementPolicy: "Parallel"
      selector: # define how the StatefulSet finds the pods it manages
        matchLabels:
          app: app-3g-v1
      template: # define the pod specification
        metadata:
          labels:
            app: app-3g-v1
        spec:
          containers:
          - name: app-3g-v1
            image: registry.cn-shanghai.aliyuncs.com/tensorflow-samples/cuda-malloc:3G
            resources:
              limits:
                nvidia.com/gpu: 1
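
    Save the YAML to a file and apply it with kubectl. The file name below is illustrative:

    kubectl apply -f app-3g-v1.yaml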
  2. Run the following command to check whether the pod is deployed.

    kubectl get pod

    Expected output:

    NAME                                  READY   STATUS       RESTARTS   AGE
    app-3g-v1-0                           1/1     Running      0          22s
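
    Optionally, you can check which GPU device ID the kubelet assigned to this pod by reading the PodDeviceEntries section of the checkpoint file on the node. This assumes that jq is installed; the field names match the GPUOps log output shown in the following steps.

    sudo jq '.Data.PodDeviceEntries' /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint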
  3. Log on to the GPU instance and run the following command to restart the GPU node.

    Note

    For more information about how to log on to a GPU instance, see Connect to an instance by using VNC.

    sudo reboot
  4. Run the following command to view the GPUOps execution log.

    sudo journalctl -u gpuops
    • If the following log is output, the GPU device IDs in the checkpoint file are consistent with the actual GPU device IDs, and no action is required.

      Jun 28 14:49:00 iZ2vc1mysgx8bqdv3oyji9Z gpuops[21976]: {"level":"info","msg":"check gpu start...","time":"2020-06-28T14:49:00+08:00"}
      Jun 28 14:49:00 iZ2vc1mysgx8bqdv3oyji9Z gpuops[21976]: {"level":"info","msg":"find nvidia gpu: [\"GPU-383cd6a5-00a6-b455-62c4-c163d164b837\"]","time":"2020-06-28T14:49:00+08:00"}
      Jun 28 14:49:00 iZ2vc1mysgx8bqdv3oyji9Z gpuops[21976]: {"level":"info","msg":"read checkpoint: {\"Data\":{\"PodDeviceEntries\":[{\"PodUID\":\"300f47eb-e5c3-4d0b-9a87-925837b67732\",\"ContainerName\":\"app-3g-v1\",\"ResourceName\":\"aliyun.com/g
      Jun 28 14:49:00 iZ2vc1mysgx8bqdv3oyji9Z systemd[1]: Started Gpuops: check kubelet checkpoint gpu status.
      Jun 28 14:49:00 iZ2vc1mysgx8bqdv3oyji9Z gpuops[21976]: {"level":"info","msg":"find registered gpu: [\"GPU-383cd6a5-00a6-b455-62c4-c163d164b837\"]","time":"2020-06-28T14:49:00+08:00"}
      Jun 28 14:49:00 iZ2vc1mysgx8bqdv3oyji9Z gpuops[21976]: {"level":"info","msg":"find pod gpu: null","time":"2020-06-28T14:49:00+08:00"}
      Jun 28 14:49:00 iZ2vc1mysgx8bqdv3oyji9Z gpuops[21976]: {"level":"info","msg":"cached gpu info in checkpoint is up to date, check gpu finished","time":"2020-06-28T14:49:00+08:00"}
    • If the output contains the following warning-level entries, the GPU device IDs have changed. GPUOps deletes the checkpoint file, and Kubernetes re-creates it so that the stored device information stays in sync with the actual devices.

      Jun 28 14:41:16 iZ2vc1mysgx8bqdv3oyji9Z systemd[1]: Starting Gpuops: check kubelet checkpoint gpu status...
      Jun 28 14:41:16 iZ2vc1mysgx8bqdv3oyji9Z gpuops[951]: {"level":"info","msg":"check gpu start...","time":"2020-06-28T14:41:16+08:00"}
      Jun 28 14:41:17 iZ2vc1mysgx8bqdv3oyji9Z gpuops[951]: {"level":"info","msg":"find nvidia gpu: [\"GPU-68ce64fb-ea68-5ad6-7e72-31f3f07378df\"]","time":"2020-06-28T14:41:17+08:00"}
      Jun 28 14:41:17 iZ2vc1mysgx8bqdv3oyji9Z gpuops[951]: {"level":"info","msg":"read checkpoint: {\"Data\":{\"PodDeviceEntries\":[{\"PodUID\":\"300f47eb-e5c3-4d0b-9a87-925837b67732\",\"ContainerName\":\"app-3g-v1\",\"ResourceName\":\"aliyun.com/gpu
      Jun 28 14:41:17 iZ2vc1mysgx8bqdv3oyji9Z gpuops[951]: {"level":"info","msg":"find registered gpu: [\"GPU-7fd52bfc-364d-1129-a142-c3f10e053ccb\"]","time":"2020-06-28T14:41:17+08:00"}
      Jun 28 14:41:17 iZ2vc1mysgx8bqdv3oyji9Z gpuops[951]: {"level":"info","msg":"find pod gpu: [\"GPU-7fd52bfc-364d-1129-a142-c3f10e053ccb\"]","time":"2020-06-28T14:41:17+08:00"}
      Jun 28 14:41:17 iZ2vc1mysgx8bqdv3oyji9Z gpuops[951]: {"level":"warning","msg":"the registered gpu uuid not equal with nvidia gpu","time":"2020-06-28T14:41:17+08:00"}
      Jun 28 14:41:17 iZ2vc1mysgx8bqdv3oyji9Z gpuops[951]: {"level":"warning","msg":"cached gpu info in checkpoint is out of date","time":"2020-06-28T14:41:17+08:00"}
      Jun 28 14:41:17 iZ2vc1mysgx8bqdv3oyji9Z gpuops[951]: {"level":"info","msg":"delete checkpoint file success","time":"2020-06-28T14:41:17+08:00"}
      Jun 28 14:41:17 iZ2vc1mysgx8bqdv3oyji9Z gpuops[951]: {"level":"info","msg":"kubelet restart success","time":"2020-06-28T14:41:17+08:00"}
      Jun 28 14:41:17 iZ2vc1mysgx8bqdv3oyji9Z gpuops[951]: {"level":"info","msg":"check gpu finished","time":"2020-06-28T14:41:17+08:00"}
      Jun 28 14:41:17 iZ2vc1mysgx8bqdv3oyji9Z systemd[1]: Started Gpuops: check kubelet checkpoint gpu status.
      • Run the following command to view the actual GPU device IDs.

        sudo nvidia-smi -L

        Expected output:

        GPU 0: Tesla T4 (UUID: GPU-0650a168-e770-3ea8-8ac3-8a1d419763e0)
      • Run the following command to view the GPU device IDs stored in the checkpoint file.

        sudo cat /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint

        Expected output:

        {"Data":{"PodDeviceEntries":null,"RegisteredDevices":{"nvidia.com/gpu":["GPU-0650a168-e770-3ea8-8ac3-8a1d419763e0"]}},"Checksum":3952659280}

        The output shows that the GPU device ID stored in the checkpoint file is consistent with the actual GPU device ID, which indicates that GPUOps fixed the device ID inconsistency.
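
      The manual comparison in the preceding two sub-steps can also be scripted. The following read-only sketch compares the two ID lists and only reports the result, without modifying anything on the node:

        real=$(sudo nvidia-smi --query-gpu=uuid --format=csv,noheader | sort)
        cached=$(sudo grep -o 'GPU-[0-9a-f-]*' /var/lib/kubelet/device-plugins/kubelet_internal_checkpoint | sort -u)
        [ "$real" = "$cached" ] && echo "device IDs consistent" || echo "device IDs differ"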