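Use DirectX-based GPU acceleration in Windows containers - Container Service for Kubernetes

For workloads on Windows nodes, GPUs offer far greater parallel computing capacity than CPUs and can speed up operations by several orders of magnitude, which increases computing throughput. Windows containers support GPU acceleration for frameworks built on DirectX. This topic describes how to install the DirectX device plug-in on Windows nodes and how to use DirectX-based GPU acceleration in Windows containers.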

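Prerequisites

An ACK managed cluster is created, and the cluster runs Kubernetes 1.20.4 or later. For more information, see Create an ACK managed cluster.
kubectl is connected to the ACK cluster. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.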

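Introduction to DirectX

DirectX is a collection of application programming interfaces (APIs). It lets Windows-based games and multimedia programs run more efficiently, enhances 3D graphics and sound effects, and gives designers a common hardware driver standard, which reduces the complexity of installing and configuring hardware. With DirectX, you can use GPUs to process parallelizable, compute-intensive tasks, which relieves CPU overload and makes better use of the GPU as a parallel processor.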

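Step 1: Create a GPU-enabled elastic Windows node pool

Regular Windows node pool

Prepare a GRID driver with an activated license. You can obtain the GRID driver in either of the following ways:
- If you are already an NVIDIA enterprise user, download the matching GRID driver from the NVIDIA enterprise licensing website and install it.
- If you are not yet an NVIDIA enterprise user, use a community image provided by Alibaba Cloud that has the driver pre-installed to load the GRID driver.
Create a Windows node pool that meets the following requirements. For usage notes and the procedure, see Create a Windows node pool.
- Instance type: a GPU-accelerated compute-optimized or vGPU-accelerated instance type. For the supported instance types, see GPU-accelerated compute-optimized instance families (gn, ebm, and scc series) or vGPU-accelerated instance families (vgn and sgn series).
- Operating system: select as needed, for example, Windows Server 2022.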

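Windows node pool that supports elasticity

Currently, ACK supports only ECS public images as node images by default. To create Windows nodes that support elasticity, you must use a custom image. The procedure is as follows.

Submit a ticket to request a shared Windows image that contains a GRID driver with an activated license. Windows Server 2019 and Windows Server 2022 are supported by default. If you need a different Windows version, state it in the ticket.
Create a Windows node pool that meets the following requirements. For more information, see Create a Windows node pool.
1. Instance type: a GPU-accelerated compute-optimized or vGPU-accelerated instance type. For the supported instance types, see GPU-accelerated compute-optimized instance families (gn, ebm, and scc series) or vGPU-accelerated instance families (vgn and sgn series).
2. Operating system: select as needed, for example, Windows Server 2022.
3. Custom image: select the image that you requested.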

Step 2: Install the DirectX device plug-in on Windows nodes

Deploy the DirectX device plug-in to Windows nodes as a DaemonSet.

Create a file named directx-device-plugin-windows.yaml with the following content.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    k8s-app: directx-device-plugin-windows
  name: directx-device-plugin-windows
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: directx-device-plugin-windows
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        k8s-app: directx-device-plugin-windows
    spec:
      tolerations:
        - operator: Exists
      # since 1.18, we can specify "hostNetwork: true" for Windows workloads, so we can deploy an application without NetworkReady.
      hostNetwork: true
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: type
                    operator: NotIn
                    values:
                      - virtual-kubelet
                  - key: beta.kubernetes.io/os
                    operator: In
                    values:
                      - windows
                  - key: windows.alibabacloud.com/deployment-topology
                    operator: In
                    values:
                      - "2.0"
                  - key: windows.alibabacloud.com/directx-supported
                    operator: In
                    values:
                      - "true"
              - matchExpressions:
                  - key: type
                    operator: NotIn
                    values:
                      - virtual-kubelet
                  - key: kubernetes.io/os
                    operator: In
                    values:
                      - windows
                  - key: windows.alibabacloud.com/deployment-topology
                    operator: In
                    values:
                      - "2.0"
                  - key: windows.alibabacloud.com/directx-supported
                    operator: In
                    values:
                      - "true"
      containers:
        - name: directx
          command:
            - pwsh.exe
            - -NoLogo
            - -NonInteractive
            - -File
            - entrypoint.ps1
          # Replace the region <cn-hangzhou> in the image address below with the region of your cluster.
          image: registry-cn-hangzhou-vpc.ack.aliyuncs.com/acs/directx-device-plugin-windows:v1.0.0
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: host-binary
              mountPath: c:/host/opt/bin
            - name: wins-pipe
              mountPath: \\.\pipe\rancher_wins
      volumes:
        - name: host-binary
          hostPath:
            path: c:/opt/bin
            type: DirectoryOrCreate
        - name: wins-pipe
          hostPath:
            path: \\.\pipe\rancher_wins
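
The comment on the image field asks you to change the region in the image address. As a minimal sketch, a small shell function can rewrite the region with sed before you deploy the file. The function name set_image_region and the example region cn-beijing are illustrative assumptions; the pattern relies only on the registry-<region>-vpc.ack.aliyuncs.com address format that appears in the manifest.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the region in ACK registry addresses read from stdin.
# Assumes the address format registry-<region>-vpc.ack.aliyuncs.com used in the manifest.
# $1 = target region ID, for example cn-beijing (illustrative value).
set_image_region() {
  region="$1"
  sed -E "s#registry-[a-z0-9-]+-vpc\.ack\.aliyuncs\.com#registry-${region}-vpc.ack.aliyuncs.com#"
}
```

For example, `set_image_region cn-beijing < directx-device-plugin-windows.yaml > directx-device-plugin-windows.cn-beijing.yaml` writes a copy of the manifest for that region and leaves the original file untouched.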

Run the following command to apply directx-device-plugin-windows.yaml and install the DirectX device plug-in.
```
kubectl create -f directx-device-plugin-windows.yaml
```
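
Before moving on, it can help to confirm that the DaemonSet pods are ready and that the Windows nodes advertise the DirectX resource. The following is a sketch, not an official procedure: the function names and the 300-second timeout are illustrative, and the second function assumes that the plug-in reports the extended resource under the same name that workloads request (windows.alibabacloud.com/directx).

```shell
#!/bin/sh
# Wait until every pod of the DaemonSet is ready.
# The namespace and DaemonSet name come from directx-device-plugin-windows.yaml.
check_plugin_rollout() {
  kubectl -n kube-system rollout status daemonset/directx-device-plugin-windows --timeout=300s
}

# List Windows nodes together with the allocatable amount of the DirectX resource.
# Assumption: the plug-in registers the resource as windows.alibabacloud.com/directx.
# The backslashes escape the dots inside the resource name for the column expression.
list_directx_capacity() {
  kubectl get nodes -l kubernetes.io/os=windows \
    -o 'custom-columns=NODE:.metadata.name,DIRECTX:.status.allocatable.windows\.alibabacloud\.com/directx'
}
```

Run `check_plugin_rollout` and then `list_directx_capacity`. A node that shows `<none>` in the DIRECTX column has not registered the resource yet.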

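Step 3: Deploy a Windows workload that uses DirectX-based GPU acceleration

The DirectX device plug-in automatically adds the class/<interface class GUID> device to Windows containers so that they can call the DirectX services of the ECS instance host. For more information, see Devices in containers on Windows.

Add the following resources configuration to each Windows workload that needs GPU acceleration, and then redeploy the workload: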

spec:
  ...
  template:
    ...
    spec:
      ...
      containers:
        - name: gpu-user
          ...
+         resources:
+           limits:
+             windows.alibabacloud.com/directx: "1"
+           requests:
+             windows.alibabacloud.com/directx: "1"

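Important

The preceding configuration does not dedicate the GPU resources of the ECS instance host to the container, and it does not prevent other applications on the host from accessing the GPU. Instead, GPU resources are scheduled dynamically between the ECS instance host and the containers. You can run multiple Windows containers on the host, and each container can use hardware-accelerated DirectX features.

For more information about GPU acceleration in Windows containers, see GPU acceleration in Windows containers.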
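
If the workload is an existing Deployment, one possible way to add the resources block without editing the manifest by hand is a strategic merge patch, which merges container entries by name. This is a sketch under assumptions: the helper name add_directx_resource is hypothetical, and the Deployment and container names are placeholders that you supply. Changing the pod template this way starts a rolling update, which covers the redeploy step.

```shell
#!/bin/sh
# Hypothetical helper: request one DirectX device for a named container of a Deployment.
# $1 = Deployment name, $2 = container name (placeholders, not values from this topic).
add_directx_resource() {
  deployment="$1"
  container="$2"
  # Strategic merge patch: the containers list is merged by the "name" key,
  # so other containers and other fields of this container stay unchanged.
  kubectl patch deployment "$deployment" --type=strategic --patch "{
    \"spec\": {\"template\": {\"spec\": {\"containers\": [{
      \"name\": \"${container}\",
      \"resources\": {
        \"limits\":   {\"windows.alibabacloud.com/directx\": \"1\"},
        \"requests\": {\"windows.alibabacloud.com/directx\": \"1\"}
      }
    }]}}}
  }"
}
```

For example, `add_directx_resource my-app gpu-user` patches the container named gpu-user in a Deployment named my-app in the current namespace.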

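Step 4: Verify that the Windows workload uses GPU acceleration

After you add the DirectX device plug-in to the Windows nodes, use the following sample application to verify that the plug-in is deployed on the Windows nodes.

Create a file named gpu-job-windows.yaml with the following content.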

apiVersion: batch/v1
kind: Job
metadata:
  labels:
    k8s-app: gpu-job-windows
  name: gpu-job-windows
  namespace: default
spec:
  parallelism: 1
  completions: 1
  backoffLimit: 3
  manualSelector: true
  selector:
    matchLabels:
      k8s-app: gpu-job-windows
  template:
    metadata:
      labels:
        k8s-app: gpu-job-windows
    spec:
      restartPolicy: Never
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: type
                    operator: NotIn
                    values:
                      - virtual-kubelet
                  - key: beta.kubernetes.io/os
                    operator: In
                    values:
                      - windows
              - matchExpressions:
                  - key: type
                    operator: NotIn
                    values:
                      - virtual-kubelet
                  - key: kubernetes.io/os
                    operator: In
                    values:
                      - windows
      tolerations:
        - key: os
          value: windows
      containers:
        - name: gpu
          # Replace the region <cn-hangzhou> in the image address below with the region of your cluster.
          image: registry-cn-hangzhou-vpc.ack.aliyuncs.com/acs/sample-gpu-windows:v1.0.0
          imagePullPolicy: IfNotPresent
          resources:
            limits:
              windows.alibabacloud.com/directx: "1"
            requests:
              windows.alibabacloud.com/directx: "1"

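Note

The image registry-{region}-vpc.ack.aliyuncs.com/acs/sample-gpu-windows is a sample Windows GPU acceleration container image provided by Alibaba Cloud Container Service. The image is built on the Microsoft Windows base image. For more information, see microsoft-windows.
The sample uses WinMLRunner to generate simulated input data. With GPU acceleration enabled for the gpu-job-windows job, it evaluates the Tiny YOLOv2 model 100 times and then outputs the performance measurements. Actual results depend on your environment.
The image is large (15.3 GB), so pulling it takes a long time when you deploy the application.

Run the following command to apply gpu-job-windows.yaml and create the sample application.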
```
kubectl create -f gpu-job-windows.yaml
```
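
Because the image pull can take a while, it may be convenient to watch the pod and wait for the Job to finish before reading the logs. The sketch below uses the namespace, Job name, and label from gpu-job-windows.yaml. The function names and the 30-minute timeout are illustrative assumptions, not documented values.

```shell
#!/bin/sh
# Show the pod created by the Job, including its phase and the node it runs on.
# The label k8s-app=gpu-job-windows comes from the Job's pod template.
show_job_pod() {
  kubectl -n default get pods -l k8s-app=gpu-job-windows -o wide
}

# Block until the Job reports the Complete condition.
# The 30-minute timeout is an assumed allowance for the large image pull.
wait_for_job() {
  kubectl -n default wait --for=condition=complete job/gpu-job-windows --timeout=30m
}
```

Run `show_job_pod` to check progress, and `wait_for_job` to block until the Job completes or the timeout expires.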

Run the following command to view the logs of the sample application gpu-job-windows.

kubectl logs -f job/gpu-job-windows

Expected output:

INFO: Executing model of "tinyyolov2-7" 100 times within GPU driver ...

Created LearningModelDevice with GPU: NVIDIA GRID T4-8Q
Loading model (path = c:\data\tinyyolov2-7\model.onnx)...
=================================================================
Name: Example Model
Author: OnnxMLTools
Version: 0
Domain: onnxconverter-common
Description: The Tiny YOLO network from the paper 'YOLO9000: Better, Faster, Stronger' (2016), arXiv:1612.08242
Path: c:\data\tinyyolov2-7\model.onnx
Support FP16: false

Input Feature Info:
Name: image
Feature Kind: Image (Height: 416, Width:  416)

Output Feature Info:
Name: grid
Feature Kind: Float

The expected output indicates that the sample application gpu-job-windows uses GPU acceleration.
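
To automate this check, a script can look for the line in which WinMLRunner reports the GPU device. The helper below is a minimal sketch: the name log_shows_gpu is hypothetical, and it assumes that the log contains the "Created LearningModelDevice with GPU:" line from the expected output.

```shell
#!/bin/sh
# Hypothetical helper: read Job logs on stdin and succeed (exit status 0) only if
# WinMLRunner reported creating a GPU device, as in the expected output.
log_shows_gpu() {
  grep -q 'Created LearningModelDevice with GPU:'
}
```

For example, `kubectl logs job/gpu-job-windows | log_shows_gpu && echo "GPU acceleration confirmed"` prints the message only when that line is present.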