AIACC 2.0-AIACC Communication Speeding (AIACC-ACSpeed), also known as ACSpeed, is an AI training accelerator developed by Alibaba Cloud to help you improve training efficiency and reduce usage costs. You can use ACSpeed to optimize the performance of distributed communication without service interruptions. The ACSpeed software package contains a sample of the adapted code for PyTorch DistributedDataParallel (DDP). This article describes how to quickly run distributed model training by using ACSpeed and shows the resulting performance improvement.
This example uses the native PyTorch DDP launcher to run the Pytorch_ddp_benchmark.py script and perform Automatic Mixed Precision (AMP) training of the ResNet50 model. After adapting the code to ACSpeed V1.0.2, you can compare the training process and performance in a single-machine, eight-GPU environment and in a multi-machine, multi-GPU environment.
Note:
Before you start the training, make sure that Alibaba Cloud GPU-accelerated instances with a basic Python environment have been created.
1. Run the following command to install a specific version of PyTorch:
In this example, torch 1.9.1 is used.
pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
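To confirm that the GPU build of PyTorch was installed correctly, you can run a quick check. This verification step is an added convenience and is not part of the original procedure:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
On a GPU-accelerated instance, this should print 1.9.1+cu111 and True.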
2. Install and enable ACSpeed.
For more information, see Install and use AIACC-ACSpeed.
3. Run the following command to go to the sample code directory of ACSpeed:
cd `python -c "import acspeed; print(acspeed.__path__[0]+'/examples/')"`
Note:
Compared with the native PyTorch DDP code, the only change required to use ACSpeed is adding import acspeed to the training script, as the sketch after this step illustrates.
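The following is a minimal sketch of an ordinary PyTorch DDP script with AMP training, with the single ACSpeed line added at the top. It is an illustration only, not the shipped Pytorch_ddp_benchmark.py, and the exact import placement follows the Install and use AIACC-ACSpeed topic:

import acspeed  # the only ACSpeed-specific line in the script

import os
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torchvision import models

def main():
    # torch.distributed.run sets MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE,
    # and LOCAL_RANK for every worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(models.resnet50().cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scaler = torch.cuda.amp.GradScaler()  # AMP, matching --precision amp

    for _ in range(10):  # dummy steps with random data, for illustration
        images = torch.randn(32, 3, 224, 224, device=local_rank)
        labels = torch.randint(0, 1000, (32,), device=local_rank)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = F.cross_entropy(model(images), labels)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()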
4. Start training.
The following examples compare model training in a single-machine, eight-GPU environment and in a multi-machine, multi-GPU environment. In the single-machine, eight-GPU environment, the performance improvement is significant on instances of the ecs.ebmgn7t.32xlarge and ecs.ebmgn6t.24xlarge instance types.
Single-machine and eight-GPU environment: run the following commands to start training:
NP=8                 # number of worker processes (one per GPU)
ADDR=localhost       # master address; localhost for single-machine training
PORT=6006            # master port used for rendezvous
model=resnet50       # model to benchmark
python -m torch.distributed.run --nnodes 1 --node_rank 0 --nproc_per_node ${NP} --master_addr ${ADDR} --master_port ${PORT} Pytorch_ddp_benchmark.py --model ${model} --precision amp
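The torch.distributed.run launcher starts one worker process per GPU and passes the rendezvous settings to every worker through environment variables. If you want to see this in action, the following throwaway script (a hypothetical env_check.py, not shipped with ACSpeed) prints the variables for each worker:

# env_check.py: print the per-worker variables exported by torch.distributed.run
import os
print("rank:", os.environ["RANK"],
      "local_rank:", os.environ["LOCAL_RANK"],
      "world_size:", os.environ["WORLD_SIZE"])

Launch it with, for example, python -m torch.distributed.run --nproc_per_node 2 env_check.py.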
The following output shows the training result with the native PyTorch DDP code:
8 GPUs -- 1M/8G: p50: 0.073s 440/s p75: 0.073s 438/s p90: 0.073s 437/s p95: 0.073s 436/s
The following output shows the training result with ACSpeed enabled:
8 GPUs -- 1M/8G: p50: 0.054s 597/s p75: 0.054s 592/s p90: 0.056s 569/s p95: 0.056s 568/s
Compared with PyTorch DDP, ACSpeed improves the model training performance by approximately 36% in this test. The improvement rate is calculated from the p50 throughput: (597 - 440)/440 ≈ 35.7%.
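The same improvement rate can be reproduced in a few lines of Python from the p50 throughput values of the two runs above. This snippet is illustrative only and is not part of the benchmark script:

baseline, accelerated = 440, 597  # p50 throughput in images/s: PyTorch DDP vs. ACSpeed
print(f"improvement: {(accelerated - baseline) / baseline:.1%}")  # prints improvement: 35.7%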
Multi-machine and multi-GPU environment: run one of the following commands on each node. On both nodes, set --master_addr to the IP address of the same master node and --master_port to an open port on that node:
# node0
python -m torch.distributed.run --nnodes 2 --node_rank 0 --nproc_per_node 8 --master_addr <node1_ip> --master_port <port> Pytorch_ddp_benchmark.py --model resnet50 --precision amp
# node1
python -m torch.distributed.run --nnodes 2 --node_rank 1 --nproc_per_node 8 --master_addr <node1_ip> --master_port <port> Pytorch_ddp_benchmark.py --model resnet50 --precision amp
The following output shows the training result with the native PyTorch DDP code:
16 GPUs -- 2M/16G: p50: 0.091s 351/s p75: 0.091s 349/s p90: 0.092s 348/s p95: 0.092s 347/s
The following output shows the training result with ACSpeed enabled:
16 GPUs -- 2M/16G: p50: 0.071s 449/s p75: 0.072s 442/s p90: 0.073s 436/s p95: 0.074s 432/s
Compared with PyTorch DDP, ACSpeed improves the model training performance by approximately 28% in this test. The improvement rate is calculated from the p50 throughput: (449 - 351)/351 ≈ 27.9%.
ACSpeed can significantly improve the training performance of many models. For the performance results of other models, see AIACC-ACSpeed performance data.
Related references: Learning about AIACC-ACSpeed | AIACC-ACSpeed Performance Data