AIACC 2.0-AIACC Communication Speeding (AIACC-ACSpeed), also known as ACSpeed, is an AI training accelerator developed by Alibaba Cloud to help you improve training efficiency and reduce usage costs. You can use ACSpeed to optimize the performance of distributed communication without service interruptions. The ACSpeed software package contains a sample of the adapted code for PyTorch DistributedDataParallel (DDP). This article describes how to quickly run distributed model training by using ACSpeed and shows the resulting performance improvement.
This example uses the native PyTorch DDP launcher to run the Pytorch_ddp_benchmark.py script and perform Automatic Mixed Precision (AMP) training of the ResNet50 model. After adapting the code to ACSpeed V1.0.2, you can compare the training process and performance in a single-machine, eight-GPU environment and in a multi-machine, multi-GPU environment.
Note:
Before you start the training, make sure that Alibaba Cloud GPU-accelerated instances with a basic Python environment have been created.
1. Run the following command to install a specific version of PyTorch:
In this example, torch 1.9.1 is used.
pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
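To confirm that the GPU build of PyTorch was installed correctly, you can run a quick check. This verification step is an added convenience and is not part of the original procedure:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
On a GPU-accelerated instance, this should print 1.9.1+cu111 and True.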
2. Install and enable ACSpeed.
For more information, see Install and use AIACC-ACSpeed.
3. Run the following command to go to the sample code directory of ACSpeed:
cd `python -c "import acspeed; print(acspeed.__path__[0]+'/examples/')"`
Note:
Compared with the native PyTorch DDP code, the only change required to use ACSpeed is adding import acspeed to the training script, as the sketch after this step illustrates.
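The following is a minimal sketch of an ordinary PyTorch DDP script with AMP training, with the single ACSpeed line added at the top. It is an illustration only, not the shipped Pytorch_ddp_benchmark.py, and the exact import placement follows the Install and use AIACC-ACSpeed topic:

import acspeed  # the only ACSpeed-specific line in the script

import os
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torchvision import models

def main():
    # torch.distributed.run sets MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE,
    # and LOCAL_RANK for every worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(models.resnet50().cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scaler = torch.cuda.amp.GradScaler()  # AMP, matching --precision amp

    for _ in range(10):  # dummy steps with random data, for illustration
        images = torch.randn(32, 3, 224, 224, device=local_rank)
        labels = torch.randint(0, 1000, (32,), device=local_rank)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = F.cross_entropy(model(images), labels)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()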
4. Start training.
The following examples compare model training in a single-machine, eight-GPU environment and in a multi-machine, multi-GPU environment. In the single-machine, eight-GPU environment, the performance improvement is significant on instances of the ecs.ebmgn7t.32xlarge and ecs.ebmgn6t.24xlarge instance types.
Single-machine and eight-GPU environment: run the following commands to start training:
NP=8                 # number of worker processes (one per GPU)
ADDR=localhost       # master address; localhost for single-machine training
PORT=6006            # master port used for rendezvous
model=resnet50       # model to benchmark
python -m torch.distributed.run --nnodes 1 --node_rank 0 --nproc_per_node ${NP} --master_addr ${ADDR} --master_port ${PORT} Pytorch_ddp_benchmark.py --model ${model} --precision amp
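The torch.distributed.run launcher starts one worker process per GPU and passes the rendezvous settings to every worker through environment variables. If you want to see this in action, the following throwaway script (a hypothetical env_check.py, not shipped with ACSpeed) prints the variables for each worker:

# env_check.py: print the per-worker variables exported by torch.distributed.run
import os
print("rank:", os.environ["RANK"],
      "local_rank:", os.environ["LOCAL_RANK"],
      "world_size:", os.environ["WORLD_SIZE"])

Launch it with, for example, python -m torch.distributed.run --nproc_per_node 2 env_check.py.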
The following output shows the training result with the native PyTorch DDP code:
8 GPUs -- 1M/8G: p50: 0.073s 440/s p75: 0.073s 438/s p90: 0.073s 437/s p95: 0.073s 436/s
The following output shows the training result with ACSpeed enabled:
8 GPUs -- 1M/8G: p50: 0.054s 597/s p75: 0.054s 592/s p90: 0.056s 569/s p95: 0.056s 568/s
Compared with PyTorch DDP, ACSpeed improves the model training performance by approximately 36% in this test. The improvement rate is calculated from the p50 throughput: (597 - 440)/440 ≈ 35.7%.
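The same improvement rate can be reproduced in a few lines of Python from the p50 throughput values of the two runs above. This snippet is illustrative only and is not part of the benchmark script:

baseline, accelerated = 440, 597  # p50 throughput in images/s: PyTorch DDP vs. ACSpeed
print(f"improvement: {(accelerated - baseline) / baseline:.1%}")  # prints improvement: 35.7%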
Multi-machine and multi-GPU environment: run one of the following commands on each node. On both nodes, set --master_addr to the IP address of the same master node and --master_port to an open port on that node:
# node0
python -m torch.distributed.run --nnodes 2 --node_rank 0 --nproc_per_node 8 --master_addr <node1_ip> --master_port <port> Pytorch_ddp_benchmark.py --model resnet50 --precision amp
# node1
python -m torch.distributed.run --nnodes 2 --node_rank 1 --nproc_per_node 8 --master_addr <node1_ip> --master_port <port> Pytorch_ddp_benchmark.py --model resnet50 --precision amp
The following output shows the training result with the native PyTorch DDP code:
16 GPUs -- 2M/16G: p50: 0.091s 351/s p75: 0.091s 349/s p90: 0.092s 348/s p95: 0.092s 347/s
The following output shows the training result with ACSpeed enabled:
16 GPUs -- 2M/16G: p50: 0.071s 449/s p75: 0.072s 442/s p90: 0.073s 436/s p95: 0.074s 432/s
Compared with PyTorch DDP, ACSpeed improves the model training performance by approximately 28% in this test. The improvement rate is calculated from the p50 throughput: (449 - 351)/351 ≈ 27.9%.
ACSpeed can significantly improve the training performance of many models. For the performance results of other models, see AIACC-ACSpeed performance data.
Related references: Learning about AIACC-ACSpeed | AIACC-ACSpeed Performance Data