Use DirectX-based GPU acceleration in Windows containers - Container Service for Kubernetes (ACK)

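For workloads on Windows nodes, GPUs provide far more parallel compute capacity than CPUs and can speed up suitable operations by several orders of magnitude, which raises computing throughput. Windows containers support GPU acceleration for frameworks built on DirectX. This topic describes how to install the DirectX device plug-in on Windows nodes and how to use DirectX-based GPU acceleration in Windows containers.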

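Prerequisites

An ACK managed cluster that runs Kubernetes 1.20.4 or later is created. For more information, see Create an ACK managed cluster.
kubectl is connected to the ACK cluster. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.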

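Introduction to DirectX

DirectX is a collection of application programming interfaces (APIs). It lets games and multimedia programs on Windows run more efficiently, enhances 3D graphics and sound, and gives developers a common hardware driver standard, which reduces the complexity of installing and configuring hardware. With DirectX, you can offload parallelizable, compute-intensive tasks to the GPU, which relieves load on the CPU and puts the GPU to work as a parallel processor.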

Step 1: Create a Windows node pool that supports GPUs

Regular Windows node pool

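Prepare a GRID driver with an activated license. You can obtain the GRID driver in either of the following ways:
- If you are already an NVIDIA enterprise customer, download the matching GRID driver from the NVIDIA enterprise licensing site and install it.
- If you are not yet an NVIDIA enterprise customer, use the community image that Alibaba Cloud provides with the GRID driver pre-installed.
Create a Windows node pool that meets the following requirements. For usage notes and the procedure, see Create a Windows node pool.
- Instance type: a GPU-accelerated compute-optimized or vGPU-accelerated instance type. For the supported instance types, see GPU-accelerated compute-optimized instance families (gn, ebm, and scc series) or vGPU-accelerated instance families (vgn and sgn series).
- Operating system: select as needed, for example Windows Server 2022.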

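Windows node pool that supports elastic scaling

Currently, ACK supports only ECS public images as node images by default. To create Windows nodes that support elastic scaling, you must use a custom image. The procedure is as follows.

Submit a ticket to request a shared Windows image that includes a GRID driver with an activated license. Windows Server 2019 and Windows Server 2022 are supported by default. If you need a different Windows version, state it in the ticket.
Create a Windows node pool that meets the following requirements. For more information, see Create a Windows node pool.
1. Instance type: a GPU-accelerated compute-optimized or vGPU-accelerated instance type. For the supported instance types, see GPU-accelerated compute-optimized instance families (gn, ebm, and scc series) or vGPU-accelerated instance families (vgn and sgn series).
2. Operating system: select as needed, for example Windows Server 2022.
3. Custom image: select the image that you requested.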

Step 2: Install the DirectX device plug-in on Windows nodes

Deploy the DirectX device plug-in to Windows nodes as a DaemonSet.

Create a file named directx-device-plugin-windows.yaml with the following content.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    k8s-app: directx-device-plugin-windows
  name: directx-device-plugin-windows
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: directx-device-plugin-windows
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        k8s-app: directx-device-plugin-windows
    spec:
      tolerations:
        - operator: Exists
      # since 1.18, we can specify "hostNetwork: true" for Windows workloads, so we can deploy an application without NetworkReady.
      hostNetwork: true
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: type
                    operator: NotIn
                    values:
                      - virtual-kubelet
                  - key: beta.kubernetes.io/os
                    operator: In
                    values:
                      - windows
                  - key: windows.alibabacloud.com/deployment-topology
                    operator: In
                    values:
                      - "2.0"
                  - key: windows.alibabacloud.com/directx-supported
                    operator: In
                    values:
                      - "true"
              - matchExpressions:
                  - key: type
                    operator: NotIn
                    values:
                      - virtual-kubelet
                  - key: kubernetes.io/os
                    operator: In
                    values:
                      - windows
                  - key: windows.alibabacloud.com/deployment-topology
                    operator: In
                    values:
                      - "2.0"
                  - key: windows.alibabacloud.com/directx-supported
                    operator: In
                    values:
                      - "true"
      containers:
        - name: directx
          command:
            - pwsh.exe
            - -NoLogo
            - -NonInteractive
            - -File
            - entrypoint.ps1
          # Replace the region ID <cn-hangzhou> in the following image address with the region of your cluster.
          image: registry-cn-hangzhou-vpc.ack.aliyuncs.com/acs/directx-device-plugin-windows:v1.0.0
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: host-binary
              mountPath: c:/host/opt/bin
            - name: wins-pipe
              mountPath: \\.\pipe\rancher_wins
      volumes:
        - name: host-binary
          hostPath:
            path: c:/opt/bin
            type: DirectoryOrCreate
        - name: wins-pipe
          hostPath:
            path: \\.\pipe\rancher_wins
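
The image address in the manifest above is region-specific. As a minimal sketch, assuming the only change needed is the region ID embedded in the registry host name, the substitution could be scripted as follows. The function name set_image_region and the example region cn-beijing are hypothetical and used only for illustration.

```shell
# Hypothetical helper: rewrite the region ID in ACK registry addresses.
# Reads a manifest on stdin and writes the result to stdout.
# $1 is the region ID of your cluster, for example cn-beijing.
set_image_region() {
  sed "s/registry-cn-hangzhou-vpc\.ack\.aliyuncs\.com/registry-$1-vpc.ack.aliyuncs.com/g"
}

# Example (assumed region and output file name):
#   set_image_region cn-beijing < directx-device-plugin-windows.yaml > plugin.cn-beijing.yaml
```

The helper leaves every other line of the manifest untouched, so the result can be compared with the original file before it is deployed.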

Run the following command to deploy directx-device-plugin-windows.yaml and install the DirectX device plug-in.
```
kubectl create -f directx-device-plugin-windows.yaml
```
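
After the DaemonSet is created, it may help to confirm that the plug-in Pods are running and that the nodes advertise the windows.alibabacloud.com/directx resource. The following sketch uses standard kubectl subcommands. The function names and the 10-minute timeout are hypothetical, and the second check assumes that the plug-in reports the resource in each node's allocatable list, as Kubernetes device plug-ins normally do.

```shell
# Hypothetical helper: wait until the plug-in DaemonSet has rolled out.
wait_for_directx_plugin() {
  kubectl -n kube-system rollout status daemonset/directx-device-plugin-windows --timeout=10m
}

# Hypothetical helper: print each Windows node with its allocatable DirectX devices.
# Dots in the resource name are escaped so that JSONPath treats it as a single key.
list_directx_capacity() {
  kubectl get nodes -l kubernetes.io/os=windows \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.windows\.alibabacloud\.com/directx}{"\n"}{end}'
}
```

Call wait_for_directx_plugin first and then list_directx_capacity. A node with an empty second column has most likely not registered the resource yet.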

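Step 3: Deploy a Windows workload that uses DirectX-based GPU acceleration

The DirectX device plug-in automatically adds the class/<interface class GUID> device to Windows containers so that they can call the DirectX services of the ECS instance host. For more information, see Devices in containers on Windows.

Add the following resources settings to the Windows workload that needs GPU acceleration, and then redeploy the workload: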

spec:
  ...
  template:
    ...
    spec:
      ...
      containers:
        - name: gpu-user
          ...
+         resources:
+           limits:
+             windows.alibabacloud.com/directx: "1"
+           requests:
+             windows.alibabacloud.com/directx: "1"

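Important

The preceding configuration does not dedicate the GPU of the ECS instance host to the container, and it does not prevent other applications on the host from accessing the GPU. Instead, GPU resources are scheduled dynamically between the host and its containers: you can run multiple Windows containers on one host, and each container can use hardware-accelerated DirectX features.

For more information about GPU acceleration in Windows containers, see GPU acceleration in Windows containers.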
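
As an alternative to editing the manifest by hand, the same requests and limits can be applied with kubectl set resources. This is a sketch only: the function name request_directx is hypothetical, and it assumes that the workload is a Deployment whose container is named gpu-user, as in the fragment above.

```shell
# Hypothetical helper: request one DirectX device for a container of a Deployment.
# $1 = Deployment name, $2 = container name (defaults to gpu-user, as in the fragment above).
request_directx() {
  kubectl set resources deployment "$1" \
    --containers="${2:-gpu-user}" \
    --limits=windows.alibabacloud.com/directx=1 \
    --requests=windows.alibabacloud.com/directx=1
}
```

Because the command changes the Pod template, the Deployment rolls out new Pods, which corresponds to the redeployment mentioned above.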

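Step 4: Verify that the Windows workload uses GPU acceleration

After you add the DirectX device plug-in to the Windows nodes, use the following sample application to verify that the plug-in is deployed correctly.

Create a file named gpu-job-windows.yaml with the following content.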

apiVersion: batch/v1
kind: Job
metadata:
  labels:
    k8s-app: gpu-job-windows
  name: gpu-job-windows
  namespace: default
spec:
  parallelism: 1
  completions: 1
  backoffLimit: 3
  manualSelector: true
  selector:
    matchLabels:
      k8s-app: gpu-job-windows
  template:
    metadata:
      labels:
        k8s-app: gpu-job-windows
    spec:
      restartPolicy: Never
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: type
                    operator: NotIn
                    values:
                      - virtual-kubelet
                  - key: beta.kubernetes.io/os
                    operator: In
                    values:
                      - windows
              - matchExpressions:
                  - key: type
                    operator: NotIn
                    values:
                      - virtual-kubelet
                  - key: kubernetes.io/os
                    operator: In
                    values:
                      - windows
      tolerations:
        - key: os
          value: windows
      containers:
        - name: gpu
          # Replace the region ID <cn-hangzhou> in the following image address with the region of your cluster.
          image: registry-cn-hangzhou-vpc.ack.aliyuncs.com/acs/sample-gpu-windows:v1.0.0
          imagePullPolicy: IfNotPresent
          resources:
            limits:
              windows.alibabacloud.com/directx: "1"
            requests:
              windows.alibabacloud.com/directx: "1"

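Note

The image registry-{region}-vpc.ack.aliyuncs.com/acs/sample-gpu-windows is a sample Windows GPU acceleration image provided by Alibaba Cloud Container Service. It is based on the Microsoft Windows image. For more information, see microsoft-windows.
The sample uses WinMLRunner to generate synthetic input data, runs 100 evaluations of the Tiny YOLOv2 model with GPU acceleration in the gpu-job-windows Job, and then prints the performance measurements. Actual results depend on your environment.
The image is large (15.3 GB), so pulling it takes a long time when you deploy the application.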

Run the following command to deploy gpu-job-windows.yaml and create the sample application.
```
kubectl create -f gpu-job-windows.yaml
```
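
Because the sample image is large, the Pod can stay in the ContainerCreating state for a while, during which no logs are available. One possible way to handle this is to block until the Job completes and then read its logs. The function name and the default 60-minute timeout below are assumptions, not values from the product documentation.

```shell
# Hypothetical helper: wait for the sample Job to finish, then print its logs.
# $1 = timeout accepted by kubectl wait (assumed default: 60m).
wait_and_show_gpu_job_logs() {
  kubectl -n default wait --for=condition=complete job/gpu-job-windows --timeout="${1:-60m}" &&
    kubectl -n default logs job/gpu-job-windows
}
```

If the Job fails instead of completing, kubectl wait returns only when the timeout expires. In that case, inspect the Pod with kubectl describe.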

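Run the following command to view the logs of the sample application gpu-job-windows. The name refers to the Job, so the job/ prefix is required.

kubectl logs -f job/gpu-job-windows

Expected output: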

INFO: Executing model of "tinyyolov2-7" 100 times within GPU driver ...

Created LearningModelDevice with GPU: NVIDIA GRID T4-8Q
Loading model (path = c:\data\tinyyolov2-7\model.onnx)...
=================================================================
Name: Example Model
Author: OnnxMLTools
Version: 0
Domain: onnxconverter-common
Description: The Tiny YOLO network from the paper 'YOLO9000: Better, Faster, Stronger' (2016), arXiv:1612.08242
Path: c:\data\tinyyolov2-7\model.onnx
Support FP16: false

Input Feature Info:
Name: image
Feature Kind: Image (Height: 416, Width:  416)

Output Feature Info:
Name: grid
Feature Kind: Float

The expected output indicates that the sample application gpu-job-windows uses GPU acceleration.
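
This check can also be scripted. The sketch below assumes that the line beginning with Created LearningModelDevice with GPU, as it appears in the expected output, is a sufficient signal that WinMLRunner selected a GPU device. The function name reported_gpu is hypothetical.

```shell
# Hypothetical helper: read Job logs on stdin and print the GPU that WinMLRunner reported.
# Carriage returns are stripped in case the Windows container writes CRLF line endings.
# Returns a non-zero exit status if no matching line is found.
reported_gpu() {
  tr -d '\r' | sed -n 's/^Created LearningModelDevice with GPU: //p' | grep .
}

# Example:
#   kubectl logs job/gpu-job-windows | reported_gpu
```

The exit status makes the helper usable in a script or pipeline that should stop when no GPU device was created.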