Container Service for Kubernetes (ACK) supports topology-aware GPU scheduling based on the scheduling framework. This feature selects a combination of GPUs from GPU-accelerated nodes to achieve optimal GPU acceleration for training jobs. This topic describes how to use topology-aware GPU scheduling to achieve optimal GPU acceleration for TensorFlow distributed jobs.
Prerequisites
A Container Service for Kubernetes (ACK) Pro cluster is created and the instance type of the cluster is set to Elastic GPU Service. For more information, see Create an ACK managed cluster.
Arena is installed.
The versions of the system components meet the following requirements.
Component
Version
Kubernetes
1.18.8 and later
Nvidia
418.87.01 and later
NVIDIA Collective Communications Library (NCCL)
2.7+
Operating system
CentOS 7.6
CentOS 7.7
Ubuntu 16.04
Ubuntu 18.04
Alibaba Cloud Linux 2
Alibaba Cloud Linux 3
GPU
V100
Usage notes
Topology-aware GPU scheduling is applicable to only Message Passing Interface (MPI) jobs that are trained by using a distributed framework.
The resources that are requested by pods must meet specific requirements before the pods can be created to submit and run jobs. Otherwise, the requests remain pending for resources.
Procedure
Configure nodes
Run the following command to set node labels and explicitly enable topology-aware GPU scheduling for nodes:
kubectl label node <Your Node Name> ack.node.gpu.schedule=topology
After topology-aware GPU scheduling is enabled on nodes, regular GPU scheduling cannot be enabled. You can run the following command to change the label and enable regular GPU scheduling:
kubectl label node <Your Node Name> ack.node.gpu.schedule=default --overwrite
Submit jobs
Submit an MPI job and set --gputopology to true
.
arena submit --gputopology=true --gang ***
Example 1: Train VGG16
The cluster used in this example consists of two nodes. Each node provides eight V100 GPUs.
Use topology-aware GPU scheduling to train VGG16
Run the following command to submit a job to the cluster:
arena submit mpi \ --name=tensorflow-topo-4-vgg16 \ --gpus=1 \ --workers=4 \ --gang \ --gputopology=true \ --image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/tensorflow-benchmark:tf2.3.0-py3.7-cuda10.1 \ "mpirun --allow-run-as-root -np "4" -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t --mca btl ^openib python /tensorflow/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=vgg16 --batch_size=64 --variable_update=horovod"
Run the following command to query the status of the job:
arena get tensorflow-topo-4-vgg16 --type mpijob
Expected output:
Name: tensorflow-topo-4-vgg16 Status: RUNNINGNamespace: default Priority: N/A Trainer: MPIJOB Duration: 2m Instances: NAME STATUS AGE IS_CHIEF GPU(Requested) NODE ---- ------ --- -------- -------------- ---- tensorflow-topo-4-vgg16-launcher-lmhjl Running 2m true 0 cn-shanghai.192.168.16.172 tensorflow-topo-4-vgg16-worker-0 Running 2m false 1 cn-shanghai.192.168.16.173 tensorflow-topo-4-vgg16-worker-1 Running 2m false 1 cn-shanghai.192.168.16.173 tensorflow-topo-4-vgg16-worker-2 Running 2m false 1 cn-shanghai.192.168.16.173 tensorflow-topo-4-vgg16-worker-3 Running 2m false 1 cn-shanghai.192.168.16.173
Run the following command to print the job log:
arena logs -f tensorflow-topo-4-vgg16
Expected output:
total images/sec: 991.92
Use regular GPU scheduling to train VGG16
Run the following command to submit a job to the cluster:
arena submit mpi \ --name=tensorflow-4-vgg16 \ --gpus=1 \ --workers=4 \ --image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/tensorflow-benchmark:tf2.3.0-py3.7-cuda10.1 \ "mpirun --allow-run-as-root -np "4" -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t --mca btl ^openib python /tensorflow/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=vgg16 --batch_size=64 --variable_update=horovod"
Run the following command to query the status of the job:
arena get tensorflow-4-vgg16 --type mpijob
Expected output:
Name: tensorflow-4-vgg16 Status: RUNNING Namespace: default Priority: N/A Trainer: MPIJOB Duration: 9s Instances: NAME STATUS AGE IS_CHIEF GPU(Requested) NODE ---- ------ --- -------- -------------- ---- tensorflow-4-vgg16-launcher-xc28k Running 9s true 0 cn-shanghai.192.168.16.172 tensorflow-4-vgg16-worker-0 Running 9s false 1 cn-shanghai.192.168.16.172 tensorflow-4-vgg16-worker-1 Running 9s false 1 cn-shanghai.192.168.16.173 tensorflow-4-vgg16-worker-2 Running 9s false 1 cn-shanghai.192.168.16.172 tensorflow-4-vgg16-worker-3 Running 9s false 1 cn-shanghai.192.168.16.173
Run the following command to print the job log:
arena logs -f tensorflow-4-vgg16
Expected output:
total images/sec: 200.47
Example 2: Train ResNet50
Use topology-aware GPU scheduling to train ResNet50
Run the following command to submit a job to the cluster:
arena submit mpi \ --name=tensorflow-topo-4-resnet50 \ --gpus=1 \ --workers=4 \ --gang \ --gputopology=true \ --image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/tensorflow-benchmark:tf2.3.0-py3.7-cuda10.1 \ "mpirun --allow-run-as-root -np "4" -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t --mca btl ^openib python /tensorflow/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=resnet50 --batch_size=64 --variable_update=horovod"
Run the following command to query the status of the job:
arena get tensorflow-topo-4-resnet50 --type mpijob
Expected output:
Name: tensorflow-topo-4-resnet50 Status: RUNNING Namespace: default Priority: N/A Trainer: MPIJOB Duration: 8s Instances: NAME STATUS AGE IS_CHIEF GPU(Requested) NODE ---- ------ --- -------- -------------- ---- tensorflow-topo-4-resnet50-launcher-7ln8j Running 8s true 0 cn-shanghai.192.168.16.172 tensorflow-topo-4-resnet50-worker-0 Running 8s false 1 cn-shanghai.192.168.16.173 tensorflow-topo-4-resnet50-worker-1 Running 8s false 1 cn-shanghai.192.168.16.173 tensorflow-topo-4-resnet50-worker-2 Running 8s false 1 cn-shanghai.192.168.16.173 tensorflow-topo-4-resnet50-worker-3 Running 8s false 1 cn-shanghai.192.168.16.173
Run the following command to print the job log:
arena logs -f tensorflow-topo-4-resnet50
Expected output:
total images/sec: 1471.55
Use regular GPU scheduling to train ResNet50
Run the following command to submit a job to the cluster:
arena submit mpi \ --name=tensorflow-4-resnet50 \ --gpus=1 \ --workers=4 \ --image=registry.cn-hangzhou.aliyuncs.com/kubernetes-image-hub/tensorflow-benchmark:tf2.3.0-py3.7-cuda10.1 \ "mpirun --allow-run-as-root -np "4" -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x PATH --mca pml ob1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 --mca orte_keep_fqdn_hostnames t --mca btl ^openib python /tensorflow/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model=resnet50 --batch_size=64 --variable_update=horovod"
Run the following command to query the status of the job:
arena get tensorflow-4-resnet50 --type mpijob
Expected output:
Name: tensorflow-4-resnet50 Status: RUNNING Namespace: default Priority: N/A Trainer: MPIJOB Duration: 9s Instances: NAME STATUS AGE IS_CHIEF GPU(Requested) NODE ---- ------ --- -------- -------------- ---- tensorflow-4-resnet50-launcher-q24hv Running 9s true 0 cn-shanghai.192.168.16.172 tensorflow-4-resnet50-worker-0 Running 9s false 1 cn-shanghai.192.168.16.172 tensorflow-4-resnet50-worker-1 Running 9s false 1 cn-shanghai.192.168.16.173 tensorflow-4-resnet50-worker-2 Running 9s false 1 cn-shanghai.192.168.16.172 tensorflow-4-resnet50-worker-3 Running 9s false 1 cn-shanghai.192.168.16.173
Run the following command to print the job log:
arena logs -f tensorflow-4-resnet50
Expected output:
total images/sec: 745.38
Performance comparison
The following figure shows the performance difference between topology-aware GPU scheduling and regular GPU scheduling based on the preceding examples.
The figure shows that after topology-aware GPU scheduling is enabled, the TensorFlow distributed jobs are accelerated.
The performance values in this topic are theoretical values. The performance of topology-aware GPU scheduling varies based on your model and cluster environment. The actual performance statistics shall prevail. You can repeat the preceding steps to evaluate your models.