Models trained with AIACC-ACSpeed (ACSpeed) deliver significantly better performance than models trained with native PyTorch DistributedDataParallel (DDP). This topic presents the performance data of ACSpeed in model training.
Test configurations
ACSpeed: ACSpeed V1.0.2
CUDA: CUDA V11.1
Torch: Torch 1.8.1+cu111
Instance specifications: GPU-accelerated instance equipped with eight GPUs
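As a quick sanity check, the following sketch shows how you can confirm that an instance matches this test configuration by using standard PyTorch APIs. The expected values in the comments reflect the configuration listed above.

```python
import torch

# Verify that the environment matches the test configuration listed above.
print(torch.__version__)          # expected: 1.8.1+cu111
print(torch.version.cuda)         # expected: 11.1
print(torch.cuda.device_count())  # expected: 8 GPUs per instance
```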
Test results
The following results show the performance of ACSpeed V1.0.2 on Alibaba Cloud GPU-accelerated instances, each equipped with eight GPUs, across different model training scenarios.
The test results show that ACSpeed improves performance for multiple models, with overall improvements ranging from 5% to 200%. The gains are most significant in scenarios where native PyTorch DDP scales poorly, and the improvements remain consistent. The following figure shows the test results.
The following table describes the elements in the figure.
Element | Description |
ddp_acc (x-axis) | The scalability of native PyTorch DDP in multi-server, multi-GPU training, expressed as multi-server linearity. A lower linearity value indicates poorer scalability. Multi-server linearity = Multi-server performance/(Single-server performance × Number of servers). For a worked example, see the sketch after this table. |
acc_ratio (y-axis) | The ratio by which ACSpeed improves on PyTorch DDP for the measured performance metric. For example, a value of 1.25 indicates that ACSpeed delivers 1.25 times the performance of PyTorch DDP, that is, a 25% improvement. |
DDP Scalability VS AcSpeed Acc (dot) | Each dot represents the performance of a model with PyTorch DDP and the improved performance of the same model with ACSpeed. Different colors indicate different numbers of servers. |
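The following sketch shows how the two axes in the figure can be computed from measured training throughput. The throughput values and server count are hypothetical placeholders that only illustrate the arithmetic.

```python
# Hypothetical throughput numbers (samples/s), not measured data.
single_server_throughput = 1000.0        # 1 server (8 GPUs), PyTorch DDP
multi_server_throughput_ddp = 2800.0     # 4 servers (32 GPUs), PyTorch DDP
multi_server_throughput_acspeed = 3500.0 # 4 servers (32 GPUs), ACSpeed
num_servers = 4

# ddp_acc (x-axis): multi-server linearity of native PyTorch DDP
ddp_acc = multi_server_throughput_ddp / (single_server_throughput * num_servers)

# acc_ratio (y-axis): improvement ratio of ACSpeed over PyTorch DDP
acc_ratio = multi_server_throughput_acspeed / multi_server_throughput_ddp

print(f"ddp_acc = {ddp_acc:.2f}")      # 0.70 -> DDP reaches 70% of ideal scaling
print(f"acc_ratio = {acc_ratio:.2f}")  # 1.25 -> ACSpeed improves performance by 25%
```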
Model performance data
This section shows the performance data only for common models that were tested. The performance improvements vary based on the communication-to-computation ratio of each model in different scenarios.
If you want to obtain the performance test data of other servers, join the DingTalk group 33617640 for technical support. You can download DingTalk from the download page.
Scenario 1: Train an alexnet model
Model: alexnet
Domain: COMPUTER_VISION
Subdomain: CLASSIFICATION
Batch_size: 128
Precision: automatic mixed precision (AMP)
The following figure shows the performance data of the alexnet model in the training scenario.
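For reference, the sketch below outlines the kind of native PyTorch DDP baseline that this scenario is measured against: torchvision's alexnet trained with DistributedDataParallel and automatic mixed precision at batch size 128. The script name in the launch command, the synthetic data, and the hyperparameters are hypothetical, and the sketch does not show the ACSpeed integration itself.

```python
import torch
import torch.distributed as dist
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.cuda.amp import autocast, GradScaler

# Minimal DDP + AMP baseline for the alexnet scenario (batch_size=128, AMP).
# Example launch (hypothetical script name):
#   python -m torch.distributed.launch --nproc_per_node=8 train_alexnet.py

def main():
    dist.init_process_group(backend="nccl")
    # torch.distributed.launch assigns ranks consecutively per node,
    # so the local rank can be derived from the global rank.
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = torchvision.models.alexnet(num_classes=1000).cuda()
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()
    scaler = GradScaler()  # automatic mixed precision (AMP)

    for step in range(10):  # a few steps with synthetic data
        images = torch.randn(128, 3, 224, 224, device="cuda")
        labels = torch.randint(0, 1000, (128,), device="cuda")

        optimizer.zero_grad()
        with autocast():
            loss = criterion(model(images), labels)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```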
Scenario 2: Train a resnet18 model
Model: resnet18
Domain: COMPUTER_VISION
Subdomain: CLASSIFICATION
Batch_size: 16
Precision: AMP
The following figure shows the performance data of the resnet18 model in the training scenario.
Scenario 3: Train a resnet50 model
Model: resnet50
Domain: COMPUTER_VISION
Subdomain: CLASSIFICATION
Batch_size: 32
Precision: AMP
The following figure shows the performance data of the resnet50 model in the training scenario.
Scenario 4: Train a vgg16 model
Model: vgg16
Domain: COMPUTER_VISION
Subdomain: CLASSIFICATION
Batch_size: 64
Precision: AMP
The following figure shows the performance data of the vgg16 model in the training scenario.
Scenario 5: Train a timm_vovnet model
Model: timm_vovnet
Domain: COMPUTER_VISION
Subdomain: CLASSIFICATION
Batch_size: 32
Precision: AMP
The following figure shows the performance data of the timm_vovnet model in the training scenario.
Scenario 6: Train a timm_vision_transformer model
Model: timm_vision_transformer
Domain: COMPUTER_VISION
Subdomain: CLASSIFICATION
Batch_size: 8
Precision: AMP
The following figure shows the performance data of the timm_vision_transformer model in the training scenario.
Scenario 7: Train a pytorch_unet model
Model: pytorch_unet
Domain: COMPUTER_VISION
Subdomain: CLASSIFICATION
Batch_size: 1
Precision: AMP
The following figure shows the performance data of the pytorch_unet model in the training scenario.
Scenario 8: Train an hf_Bart model
Model: hf_Bart
Domain: NLP
Subdomain: LANGUAGE_MODELING
Batch_size: 4
Precision: AMP
The following figure shows the performance data of the hf_Bart model in the training scenario.
Scenario 9: Train an hf_Bert model
Model: hf_Bert
Domain: NLP
Subdomain: LANGUAGE_MODELING
Batch_size: 4
Precision: AMP
The following figure shows the performance data of the hf_Bert model in the training scenario.
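Similarly, the sketch below outlines a hypothetical DDP + AMP baseline for a BERT-style language model at batch size 4, built with the Hugging Face transformers library. The model configuration, optimizer settings, and random token IDs are assumptions for illustration only and do not reflect the exact hf_Bert benchmark implementation or the ACSpeed integration.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.cuda.amp import autocast, GradScaler
from transformers import BertConfig, BertForMaskedLM

# Minimal DDP + AMP baseline for the hf_Bert scenario (batch_size=4, AMP).
# Random token IDs stand in for a tokenized corpus.

def main():
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = BertForMaskedLM(BertConfig()).cuda()
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scaler = GradScaler()  # automatic mixed precision (AMP)

    for step in range(10):  # a few steps with synthetic data
        input_ids = torch.randint(0, 30522, (4, 128), device="cuda")

        optimizer.zero_grad()
        with autocast():
            loss = model(input_ids=input_ids, labels=input_ids).loss
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```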
Scenario 10: Train a speech_transformer model
Model: speech_transformer
Domain: SPEECH
Subdomain: RECOGNITION
Batch_size: 32
Precision: AMP
The following figure shows the performance data of the speech_transformer model in the training scenario.
Scenario 11: Train a tts_angular model
Model: tts_angular
Domain: SPEECH
Subdomain: SYNTHESIS
Batch_size: 64
Precision: AMP
The following figure shows the performance data of the tts_angular model in the training scenario.