Models trained with AIACC-ACSpeed (ACSpeed) deliver significantly better performance than models trained with native PyTorch DistributedDataParallel (DDP). This topic presents the performance data of ACSpeed in model training.
Test configurations
ACSpeed: ACSpeed V1.0.2
CUDA: CUDA V11.1
Torch: Torch 1.8.1+cu111
Instance specifications: GPU-accelerated instance equipped with eight GPUs
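As a quick sanity check, the following sketch shows how you can confirm that an instance matches this test configuration by using standard PyTorch APIs. The expected values in the comments reflect the configuration listed above.

```python
import torch

# Verify that the environment matches the test configuration listed above.
print(torch.__version__)          # expected: 1.8.1+cu111
print(torch.version.cuda)         # expected: 11.1
print(torch.cuda.device_count())  # expected: 8 GPUs per instance
```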
Test results
The following results show the performance of ACSpeed V1.0.2 on Alibaba Cloud GPU-accelerated instances, each equipped with eight GPUs, across different model training scenarios.
The test results show that ACSpeed improves performance for multiple models, with overall improvements ranging from 5% to 200%. The gains are most significant in scenarios where native PyTorch DDP scales poorly, and the improvements remain consistent. The following figure shows the test results.
The following table describes the elements in the figure.
Element | Description |
ddp_acc (x-axis) | The scalability of native PyTorch DDP in multi-server, multi-GPU training, expressed as multi-server linearity. A lower linearity value indicates poorer scalability. Multi-server linearity = Multi-server performance/(Single-server performance × Number of servers). For a worked example, see the sketch after this table. |
acc_ratio (y-axis) | The ratio by which ACSpeed improves on PyTorch DDP for the measured performance metric. For example, a value of 1.25 indicates that ACSpeed delivers 1.25 times the performance of PyTorch DDP, that is, a 25% improvement. |
DDP Scalability VS AcSpeed Acc (dot) | Each dot represents the performance of a model with PyTorch DDP and the improved performance of the same model with ACSpeed. Different colors indicate different numbers of servers. |
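The following sketch shows how the two axes in the figure can be computed from measured training throughput. The throughput values and server count are hypothetical placeholders that only illustrate the arithmetic.

```python
# Hypothetical throughput numbers (samples/s), not measured data.
single_server_throughput = 1000.0        # 1 server (8 GPUs), PyTorch DDP
multi_server_throughput_ddp = 2800.0     # 4 servers (32 GPUs), PyTorch DDP
multi_server_throughput_acspeed = 3500.0 # 4 servers (32 GPUs), ACSpeed
num_servers = 4

# ddp_acc (x-axis): multi-server linearity of native PyTorch DDP
ddp_acc = multi_server_throughput_ddp / (single_server_throughput * num_servers)

# acc_ratio (y-axis): improvement ratio of ACSpeed over PyTorch DDP
acc_ratio = multi_server_throughput_acspeed / multi_server_throughput_ddp

print(f"ddp_acc = {ddp_acc:.2f}")      # 0.70 -> DDP reaches 70% of ideal scaling
print(f"acc_ratio = {acc_ratio:.2f}")  # 1.25 -> ACSpeed improves performance by 25%
```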
Model performance data
This section shows the performance data only for common models that were tested. The performance improvements vary based on the communication-to-computation ratio of each model in different scenarios.
If you want to obtain the performance test data of other servers, join the DingTalk group 33617640 for technical support. You can download DingTalk from the download page.
Scenario 1: Train an alexnet model
Model: alexnet
Domain: COMPUTER_VISION
Subdomain: CLASSIFICATION
Batch_size: 128
Precision: automatic mixed precision (AMP)
The following figure shows the performance data of the alexnet model in the training scenario.
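For reference, the sketch below outlines the kind of native PyTorch DDP baseline that this scenario is measured against: torchvision's alexnet trained with DistributedDataParallel and automatic mixed precision at batch size 128. The script name in the launch command, the synthetic data, and the hyperparameters are hypothetical, and the sketch does not show the ACSpeed integration itself.

```python
import torch
import torch.distributed as dist
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.cuda.amp import autocast, GradScaler

# Minimal DDP + AMP baseline for the alexnet scenario (batch_size=128, AMP).
# Example launch (hypothetical script name):
#   python -m torch.distributed.launch --nproc_per_node=8 train_alexnet.py

def main():
    dist.init_process_group(backend="nccl")
    # torch.distributed.launch assigns ranks consecutively per node,
    # so the local rank can be derived from the global rank.
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = torchvision.models.alexnet(num_classes=1000).cuda()
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()
    scaler = GradScaler()  # automatic mixed precision (AMP)

    for step in range(10):  # a few steps with synthetic data
        images = torch.randn(128, 3, 224, 224, device="cuda")
        labels = torch.randint(0, 1000, (128,), device="cuda")

        optimizer.zero_grad()
        with autocast():
            loss = criterion(model(images), labels)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```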
Scenario 2: Train a resnet18 model
Model: resnet18
Domain: COMPUTER_VISION
Subdomain: CLASSIFICATION
Batch_size: 16
Precision: AMP
The following figure shows the performance data of the resnet18 model in the training scenario.
Scenario 3: Train a resnet50 model
Model: resnet50
Domain: COMPUTER_VISION
Subdomain: CLASSIFICATION
Batch_size: 32
Precision: AMP
The following figure shows the performance data of the resnet50 model in the training scenario.
Scenario 4: Train a vgg16 model
Model: vgg16
Domain: COMPUTER_VISION
Subdomain: CLASSIFICATION
Batch_size: 64
Precision: AMP
The following figure shows the performance data of the vgg16 model in the training scenario.
Scenario 5: Train a timm_vovnet model
Model: timm_vovnet
Domain: COMPUTER_VISION
Subdomain: CLASSIFICATION
Batch_size: 32
Precision: AMP
The following figure shows the performance data of the timm_vovnet model in the training scenario.
Scenario 6: Train a timm_vision_transformer model
Model: timm_vision_transformer
Domain: COMPUTER_VISION
Subdomain: CLASSIFICATION
Batch_size: 8
Precision: AMP
The following figure shows the performance data of the timm_vision_transformer model in the training scenario.
Scenario 7: Train a pytorch_unet model
Model: pytorch_unet
Domain: COMPUTER_VISION
Subdomain: CLASSIFICATION
Batch_size: 1
Precision: AMP
The following figure shows the performance data of the pytorch_unet model in the training scenario.
Scenario 8: Train an hf_Bart model
Model: hf_Bart
Domain: NLP
Subdomain: LANGUAGE_MODELING
Batch_size: 4
Precision: AMP
The following figure shows the performance data of the hf_Bart model in the training scenario.
Scenario 9: Train an hf_Bert model
Model: hf_Bert
Domain: NLP
Subdomain: LANGUAGE_MODELING
Batch_size: 4
Precision: AMP
The following figure shows the performance data of the hf_Bert model in the training scenario.
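Similarly, the sketch below outlines a hypothetical DDP + AMP baseline for a BERT-style language model at batch size 4, built with the Hugging Face transformers library. The model configuration, optimizer settings, and random token IDs are assumptions for illustration only and do not reflect the exact hf_Bert benchmark implementation or the ACSpeed integration.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.cuda.amp import autocast, GradScaler
from transformers import BertConfig, BertForMaskedLM

# Minimal DDP + AMP baseline for the hf_Bert scenario (batch_size=4, AMP).
# Random token IDs stand in for a tokenized corpus.

def main():
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = BertForMaskedLM(BertConfig()).cuda()
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scaler = GradScaler()  # automatic mixed precision (AMP)

    for step in range(10):  # a few steps with synthetic data
        input_ids = torch.randint(0, 30522, (4, 128), device="cuda")

        optimizer.zero_grad()
        with autocast():
            loss = model(input_ids=input_ids, labels=input_ids).loss
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```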
Scenario 10: Train a speech_transformer model
Model: speech_transformer
Domain: SPEECH
Subdomain: RECOGNITION
Batch_size: 32
Precision: AMP
The following figure shows the performance data of the speech_transformer model in the training scenario.
Scenario 11: Train a tts_angular model
Model: tts_angular
Domain: SPEECH
Subdomain: SYNTHESIS
Batch_size: 64
Precision: AMP
The following figure shows the performance data of the tts_angular model in the training scenario.