使用GPU拓撲感知調度（Pytorch版） - Container Service for Kubernetes

ACK基於Scheduling Framework機制，實現GPU拓撲感知調度，即在節點的GPU組合中選擇具有最優訓練速度的組合。本文介紹如何使用GPU拓撲感知調度來提升Pytorch分布式訓練的訓練速度。

前提條件

已建立ACK Pro叢集，且叢集的執行個體規格類型選擇為GPU雲端服務器。更多資訊，請參見建立Kubernetes託管版叢集。
已安裝Arena。
已安裝GPU拓撲感知調度組件。

系統組件版本滿足以下要求。

組件	版本要求
Kubernetes	1.18.8及以上版本
Nvidia	418.87.01及以上版本
訓練架構NCCL版本	2.7+
作業系統	CentOS 7.6 CentOS 7.7 Ubuntu 16.04 Ubuntu 18.04 Alibaba Cloud Linux 2 Alibaba Cloud Linux 3
顯卡	V100

注意事項

僅支援MPI作業的分布式訓練。
只有當提交作業的所有Pod對資源請求都滿足條件時，才能建立Pod並啟動作業，否則請求會處於資源等待狀態。

操作步驟

節點配置

您需執行以下命令，設定節點Label，顯式啟用節點GPU拓撲感知調度。

kubectl label node <Your Node Name> ack.node.gpu.schedule=topology

說明

當節點啟用GPU拓撲感知調度後，不再支援普通GPU資源調度。可執行以下命令更改Label，恢複普通GPU資源調度功能。

kubectl label node <Your Node Name> ack.node.gpu.schedule=default --overwrite

提交作業

您在提交MPI作業時，執行以下命令設定--gputopology為true。

arena submit --gputopology=true --gang ***

樣本一：訓練Vgg16

說明

本樣本測試叢集有2台8卡V100機器。

使用GPU拓撲感知調度訓練Vgg16

執行以下命令，向叢集提交作業。

arena submit mpi \
  --name=pytorch-topo-4-vgg16 \
  --gpus=1 \
  --workers=4 \
  --gang \
  --gputopology=true \
  --image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/pytorch-benchmark:torch1.6.0-py3.7-cuda10.1 \
  "mpirun --allow-run-as-root -np "4" -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t --mca btl ^openib python /examples/pytorch_synthetic_benchmark.py --model=vgg16 --batch-size=64"

執行以下命令，查看當前作業運行情況。

arena get pytorch-topo-4-vgg16 --type mpijob

預期輸出：

Name:      pytorch-topo-4-vgg16
Status:    RUNNING
Namespace: default
Priority:  N/A
Trainer:   MPIJOB
Duration:  11s

Instances:
  NAME                                 STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
  ----                                 ------   ---  --------  --------------  ----
  pytorch-topo-4-vgg16-launcher-mnjzr  Running  11s  true      0               cn-shanghai.192.168.16.173
  pytorch-topo-4-vgg16-worker-0        Running  11s  false     1               cn-shanghai.192.168.16.173
  pytorch-topo-4-vgg16-worker-1        Running  11s  false     1               cn-shanghai.192.168.16.173
  pytorch-topo-4-vgg16-worker-2        Running  11s  false     1               cn-shanghai.192.168.16.173
  pytorch-topo-4-vgg16-worker-3        Running  11s  false     1               cn-shanghai.192.168.16.173

執行以下命令，查看當前日誌資訊。

arena logs -f pytorch-topo-4-vgg16

預期輸出：

Model: vgg16
Batch size: 64
Number of GPUs: 4
Running warmup...
Running benchmark...
Iter #0: 205.5 img/sec per GPU
Iter #1: 205.2 img/sec per GPU
Iter #2: 205.1 img/sec per GPU
Iter #3: 205.5 img/sec per GPU
Iter #4: 205.1 img/sec per GPU
Iter #5: 205.1 img/sec per GPU
Iter #6: 205.3 img/sec per GPU
Iter #7: 204.3 img/sec per GPU
Iter #8: 205.0 img/sec per GPU
Iter #9: 204.9 img/sec per GPU
Img/sec per GPU: 205.1 +-0.6
Total img/sec on 4 GPU(s): 820.5 +-2.5

使用普通GPU調度訓練Vgg16

執行以下命令，向叢集提交作業。

arena submit mpi \
  --name=pytorch-4-vgg16 \
  --gpus=1 \
  --workers=4 \
  --image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/pytorch-benchmark:torch1.6.0-py3.7-cuda10.1 \
  "mpirun --allow-run-as-root -np "4" -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t --mca btl ^openib python /examples/pytorch_synthetic_benchmark.py --model=vgg16 --batch-size=64"

執行以下命令，查看當前作業運行情況。

arena get pytorch-4-vgg16 --type mpijob

預期輸出：

Name:      pytorch-4-vgg16
Status:    RUNNING
Namespace: default
Priority:  N/A
Trainer:   MPIJOB
Duration:  10s

Instances:
  NAME                            STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
  ----                            ------   ---  --------  --------------  ----
  pytorch-4-vgg16-launcher-qhnxl  Running  10s  true      0               cn-shanghai.192.168.16.173
  pytorch-4-vgg16-worker-0        Running  10s  false     1               cn-shanghai.192.168.16.173
  pytorch-4-vgg16-worker-1        Running  10s  false     1               cn-shanghai.192.168.16.173
  pytorch-4-vgg16-worker-2        Running  10s  false     1               cn-shanghai.192.168.16.173
  pytorch-4-vgg16-worker-3        Running  10s  false     1               cn-shanghai.192.168.16.173

執行以下命令，查看當前日誌資訊。

arena logs -f pytorch-4-vgg16

預期輸出：

Model: vgg16
Batch size: 64
Number of GPUs: 4
Running warmup...
Running benchmark...
Iter #0: 113.1 img/sec per GPU
Iter #1: 109.5 img/sec per GPU
Iter #2: 106.5 img/sec per GPU
Iter #3: 108.5 img/sec per GPU
Iter #4: 108.1 img/sec per GPU
Iter #5: 111.2 img/sec per GPU
Iter #6: 110.7 img/sec per GPU
Iter #7: 109.8 img/sec per GPU
Iter #8: 102.8 img/sec per GPU
Iter #9: 107.9 img/sec per GPU
Img/sec per GPU: 108.8 +-5.3
Total img/sec on 4 GPU(s): 435.2 +-21.1

樣本二：訓練Resnet50

使用GPU拓撲感知調度訓練Resnet50

執行以下命令，向叢集提交作業。

arena submit mpi \
  --name=pytorch-topo-4-resnet50 \
  --gpus=1 \
  --workers=4 \
  --gang \
  --gputopology=true \
  --image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/pytorch-benchmark:torch1.6.0-py3.7-cuda10.1 \
  "mpirun --allow-run-as-root -np "4" -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t --mca btl ^openib python /examples/pytorch_synthetic_benchmark.py --model=resnet50 --batch-size=64"

執行以下命令，查看當前作業運行情況。

arena get pytorch-topo-4-resnet50 --type mpijob

預期輸出：

Name:      pytorch-topo-4-resnet50
Status:    RUNNING
Namespace: default
Priority:  N/A
Trainer:   MPIJOB
Duration:  8s

Instances:
  NAME                                    STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
  ----                                    ------   ---  --------  --------------  ----
  pytorch-topo-4-resnet50-launcher-x7r2n  Running  8s   true      0               cn-shanghai.192.168.16.173
  pytorch-topo-4-resnet50-worker-0        Running  8s   false     1               cn-shanghai.192.168.16.173
  pytorch-topo-4-resnet50-worker-1        Running  8s   false     1               cn-shanghai.192.168.16.173
  pytorch-topo-4-resnet50-worker-2        Running  8s   false     1               cn-shanghai.192.168.16.173
  pytorch-topo-4-resnet50-worker-3        Running  8s   false     1               cn-shanghai.192.168.16.173

執行以下命令，查看當前日誌資訊。

arena logs -f pytorch-topo-4-resnet50

預期輸出：

Model: resnet50
Batch size: 64
Number of GPUs: 4
Running warmup...
Running benchmark...
Iter #0: 331.0 img/sec per GPU
Iter #1: 330.6 img/sec per GPU
Iter #2: 330.9 img/sec per GPU
Iter #3: 330.4 img/sec per GPU
Iter #4: 330.7 img/sec per GPU
Iter #5: 330.8 img/sec per GPU
Iter #6: 329.9 img/sec per GPU
Iter #7: 330.5 img/sec per GPU
Iter #8: 330.4 img/sec per GPU
Iter #9: 329.7 img/sec per GPU
Img/sec per GPU: 330.5 +-0.8
Total img/sec on 4 GPU(s): 1321.9 +-3.2

使用普通GPU調度訓練Resnet50

執行以下命令，向叢集提交作業。

arena submit mpi \
  --name=pytorch-4-resnet50 \
  --gpus=1 \
  --workers=4 \
  --image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/pytorch-benchmark:torch1.6.0-py3.7-cuda10.1 \
  "mpirun --allow-run-as-root -np "4" -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t --mca btl ^openib python /examples/pytorch_synthetic_benchmark.py --model=resnet50 --batch-size=64"

執行以下命令，查看當前作業運行情況。

arena get pytorch-4-resnet50 --type mpijob

預期輸出：

Name:      pytorch-4-resnet50
Status:    RUNNING
Namespace: default
Priority:  N/A
Trainer:   MPIJOB
Duration:  10s

Instances:
  NAME                               STATUS   AGE  IS_CHIEF  GPU(Requested)  NODE
  ----                               ------   ---  --------  --------------  ----
  pytorch-4-resnet50-launcher-qw5k6  Running  10s  true      0               cn-shanghai.192.168.16.173
  pytorch-4-resnet50-worker-0        Running  10s  false     1               cn-shanghai.192.168.16.173
  pytorch-4-resnet50-worker-1        Running  10s  false     1               cn-shanghai.192.168.16.173
  pytorch-4-resnet50-worker-2        Running  10s  false     1               cn-shanghai.192.168.16.173
  pytorch-4-resnet50-worker-3        Running  10s  false     1               cn-shanghai.192.168.16.173

執行以下命令，查看當前日誌資訊。

arena logs -f pytorch-4-resnet50

預期輸出：

Model: resnet50
Batch size: 64
Number of GPUs: 4
Running warmup...
Running benchmark...
Iter #0: 313.1 img/sec per GPU
Iter #1: 312.8 img/sec per GPU
Iter #2: 313.0 img/sec per GPU
Iter #3: 312.2 img/sec per GPU
Iter #4: 313.7 img/sec per GPU
Iter #5: 313.2 img/sec per GPU
Iter #6: 313.6 img/sec per GPU
Iter #7: 313.0 img/sec per GPU
Iter #8: 311.3 img/sec per GPU
Iter #9: 313.6 img/sec per GPU
Img/sec per GPU: 313.0 +-1.3
Total img/sec on 4 GPU(s): 1251.8 +-5.3

效能對比

基於如上4個測試案例效能對比結果如下： gpu32

基於上圖效能對比，可知經過GPU拓撲感知調度後，Pytorch分布式訓練的效果有了很大的提升。

重要

本文提供的效能資料僅為理論值，GPU拓撲感知調度提升結果與您使用的模型以及叢集的環境有一定關係，實際資料以您的作業環境為準。您可以參考上述使用樣本，評測自己的模型。