Container Service for Kubernetes (ACK) provides elastic model training that integrates Elastic Horovod. This allows Horovod to dynamically adjust the number of workers for distributed training tasks that run in ACK clusters. You can enable elastic training on a cluster of preemptible instances to make full use of idle computing resources and reduce training costs. This topic describes how to deploy an elastic model training task and how to scale the number of workers for a running task.
Prerequisites
The cloud-native AI suite is deployed in your ACK cluster, and Elastic Training and Arena are selected when you deploy the suite. For more information, see Deploy the cloud-native AI suite.
Horovod is used as the distributed training framework.
The Arena client is installed. For more information, see Configure the Arena client.
Background information
Model training is a key step in deep learning. Training a complex model can take a long time and requires large amounts of computing power. In a traditional distributed deep learning task, the number of workers is fixed after the task is submitted. Elastic model training removes this limitation and allows you to dynamically adjust the number of workers while a training task runs.
Deploy an elastic model training task
Submit a training task
Run the following command to submit a training task:
arena submit etjob \
--name=elastic-training \
--gpus=1 \
--workers=3 \
--max-workers=9 \
--min-workers=1 \
--image=registry.cn-hangzhou.aliyuncs.com/ai-samples/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1 \
--working-dir=/examples \
"horovodrun \
-np \$((\${workers}*\${gpus})) \
--min-np \$((\${minWorkers}*\${gpus})) \
--max-np \$((\${maxWorkers}*\${gpus})) \
--host-discovery-script /etc/edl/discover_hosts.sh \
python /examples/elastic/tensorflow2_mnist_elastic.py
"
In this example, the horovodrun wrapper launches the elastic training task. horovodrun requires the -np, --min-np, and --max-np parameters. Arena writes the values of the --workers, --gpus, --min-workers, and --max-workers parameters to the workers, gpus, minWorkers, and maxWorkers environment variables, which you can reference in the command that you submit. For example, with --workers=3 and --gpus=1, the task starts with -np 3.
The following table describes the parameters.
| Parameter | Description |
| --- | --- |
| --name | The name of the training task. The name must be globally unique. |
| --gpus | The number of GPUs that are used by each worker. |
| --workers | The initial number of workers that run the training task. |
| --max-workers | The maximum number of workers that can run the training task. |
| --min-workers | The minimum number of workers that must run the training task. |
| --image | The container image that is used to run the training task. |
| --working-dir | The directory in which the command is executed. |
| -np | The initial number of training processes. In this example, the value equals the number of workers multiplied by the number of GPUs per worker. |
| --min-np | The minimum number of training processes. |
| --max-np | The maximum number of training processes. |
| --host-discovery-script | The path of the script that horovodrun calls to discover available hosts. The et-operator component generates the script at /etc/edl/discover_hosts.sh. The script prints the available workers in the hostname:slots format. |
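The training script itself must use the Horovod elastic API so that workers can join or leave at run time. The following is a minimal sketch of what an elastic TensorFlow 2 script such as tensorflow2_mnist_elastic.py typically looks like. The model, data, and hyperparameters are illustrative placeholders, not the contents of the actual sample script.

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each training process to a single GPU.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Placeholder model and data; the real sample script trains on MNIST.
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = tf.optimizers.Adam(0.001 * hvd.size())
loss_fn = tf.losses.SparseCategoricalCrossentropy(from_logits=True)
images = tf.random.uniform([32, 10])
labels = tf.random.uniform([32], maxval=10, dtype=tf.int64)

@tf.function
def train_step(images, labels):
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(images, training=True))
    # Average gradients across all current workers.
    tape = hvd.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# hvd.elastic.run restarts this function from the last committed state
# whenever the set of workers changes.
@hvd.elastic.run
def train(state):
    for state.batch in range(state.batch, 10000):
        loss = train_step(images, labels)
        if state.batch % 100 == 0:
            # Commit a snapshot that newly added workers synchronize from.
            state.commit()

state = hvd.elastic.TensorFlowKerasState(model, optimizer, batch=0)
train(state)

When the set of workers changes, horovodrun calls the host discovery script, rebuilds the communicator, restores the last committed state on all workers, and resumes training.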
Expected output:
configmap/elastic-training-etjob created
configmap/elastic-training-etjob labeled
trainingjob.kai.alibabacloud.com/elastic-training created
INFO[0000] The Job elastic-training has been submitted successfully
INFO[0000] You can run `arena get elastic-training --type etjob` to check the job status
Query the training task
Run the following command to query the training task:
arena get elastic-training
Expected output:
Name: elastic-training
Status: RUNNING
Namespace: default
Priority: N/A
Trainer: ETJOB
Duration: 13s
Instances:
NAME STATUS AGE IS_CHIEF GPU(Requested) NODE
---- ------ --- -------- -------------- ----
elastic-training-launcher Running 13s true 0 cn-huhehaote.192.168.0.173
elastic-training-worker-0 Running 13s false 1 cn-huhehaote.192.168.0.174
elastic-training-worker-1 Running 13s false 1 cn-huhehaote.192.168.0.174
Query training logs
Run the following command to query training logs:
arena logs elastic-training --tail 10
Expected output:
[0]<stdout>:Step #340 Loss: 0.047924
[1]<stdout>:Step #340 Loss: 0.116303
[0]<stdout>:Step #350 Loss: 0.068762
[1]<stdout>:Step #350 Loss: 0.040847
[0]<stdout>:Step #360 Loss: 0.057501
[1]<stdout>:Step #360 Loss: 0.111952
[0]<stdout>:Step #370 Loss: 0.085895
[1]<stdout>:Step #370 Loss: 0.075529
[0]<stdout>:Step #380 Loss: 0.063450
[1]<stdout>:Step #380 Loss: 0.054253
Add workers for the training task
Submit a scale-out task
Run the following command to submit a scale-out task:
arena scaleout etjob --name="elastic-training" --count=1 --timeout=10m
--name: the name of the training task for which you want to add workers.
--count: the number of workers that you want to add for the training task.
--timeout: the timeout period of the scale-out operation. If the new workers are not created before the timeout period ends, the scale-out operation is rolled back.
Expected output:
configmap/elastic-training-1609914643-scaleout created
configmap/elastic-training-1609914643-scaleout labeled
scaleout.kai.alibabacloud.com/elastic-training-1609914643 created
INFO[0003] The scaleout job elastic-training-1609914643 has been submitted successfully
Query the training task
Run the following command to query the training task:
arena get elastic-training
Expected output:
Name: elastic-training
Status: RUNNING
Namespace: default
Priority: N/A
Trainer: ETJOB
Duration: 3m
Instances:
NAME STATUS AGE IS_CHIEF GPU(Requested) NODE
---- ------ --- -------- -------------- ----
elastic-training-launcher Running 3m true 0 cn-huhehaote.192.168.0.173
elastic-training-worker-0 Running 3m false 1 cn-huhehaote.192.168.0.174
elastic-training-worker-1 Running 3m false 1 cn-huhehaote.192.168.0.174
elastic-training-worker-2 Running 1m false 1 cn-huhehaote.192.168.0.173
The preceding output shows that a new worker named elastic-training-worker-2 is added and runs the training task.
Query training logs
Run the following command to query training logs:
arena logs elastic-training --tail 10
Expected output:
[1]<stdout>:Step #1670 Loss: 0.131210
[2]<stdout>:Step #1680 Loss: 0.020876
[0]<stdout>:Step #1680 Loss: 0.030605
[1]<stdout>:Step #1680 Loss: 0.074515
[2]<stdout>:Step #1690 Loss: 0.029105
[0]<stdout>:Step #1690 Loss: 0.015216
[1]<stdout>:Step #1690 Loss: 0.022670
[0]<stdout>:Step #1700 Loss: 0.105407
[1]<stdout>:Step #1700 Loss: 0.037623
[2]<stdout>:Step #1700 Loss: 0.032874
The preceding output shows that three workers are running the task.
Remove workers from the training task
Submit a scale-in task
Run the following command to submit a scale-in task:
arena scalein etjob --name="elastic-training" --count=1 --timeout=10m
--name: the name of the training task from which you want to remove workers.
--count: the number of workers that you want to remove from the training task.
--timeout: the timeout period of the scale-in operation.
Expected output:
configmap/elastic-training-1609914720-scalein created
configmap/elastic-training-1609914720-scalein labeled
scalein.kai.alibabacloud.com/elastic-training-1609914720 created
INFO[0002] The scalein job elastic-training-1609914720 has been submitted successfully
Query the training task
Run the following command to query the training task:
arena get elastic-training
Expected output:
Name: elastic-training
Status: RUNNING
Namespace: default
Priority: N/A
Trainer: ETJOB
Duration: 3m
Instances:
NAME STATUS AGE IS_CHIEF GPU(Requested) NODE
---- ------ --- -------- -------------- ----
elastic-training-launcher Running 3m true 0 cn-huhehaote.192.168.0.173
elastic-training-worker-0 Running 3m false 1 cn-huhehaote.192.168.0.174
elastic-training-worker-1 Running 3m false 1 cn-huhehaote.192.168.0.174
The preceding output shows that the worker named elastic-training-worker-2 is removed.
Query training logs
Run the following command to query training logs:
arena logs elastic-training --tail 10
Expected output:
[1]<stdout>:Step #2180 Loss: 0.001739
[0]<stdout>:Step #2180 Loss: 0.004853
[0]<stdout>:Step #2190 Loss: 0.000846
[1]<stdout>:Step #2190 Loss: 0.007900
[0]<stdout>:Step #2200 Loss: 0.039376
[1]<stdout>:Step #2200 Loss: 0.024672
[0]<stdout>:Step #2210 Loss: 0.012985
[1]<stdout>:Step #2210 Loss: 0.010956
[0]<stdout>:Step #2220 Loss: 0.009604
[1]<stdout>:Step #2220 Loss: 0.002531
The preceding output shows that only two workers are running the task.