Elastic GPU Service: AIACC-ACSpeed performance data

Last Updated: May 10, 2024

Models trained by using AIACC-ACSpeed (ACSpeed) deliver significantly better performance than the same models trained by using native PyTorch DistributedDataParallel (DDP). This topic presents the performance data of ACSpeed in model training.

Test configurations

  • ACSpeed: ACSpeed V1.0.2

  • CUDA: CUDA V11.1

  • PyTorch: PyTorch 1.8.1+cu111

  • Instance specifications: GPU-accelerated instance equipped with eight GPUs
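For context, the following minimal sketch shows the kind of native PyTorch DDP training step, with automatic mixed precision (AMP), that serves as the baseline in these tests: one process per GPU and the NCCL backend. The model, batch size, synthetic data, and launch method are illustrative assumptions, and the ACSpeed integration itself is not shown here.

import os
import torch
import torch.distributed as dist
import torchvision

def main():
    # One process per GPU. Assumes a launcher that sets LOCAL_RANK, MASTER_ADDR,
    # and related variables (for PyTorch 1.8.x: python -m torch.distributed.launch --use_env).
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; the scenarios below cover many models (alexnet, resnet50, and so on).
    model = torchvision.models.resnet50().cuda()
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scaler = torch.cuda.amp.GradScaler()  # AMP, as used in every scenario below
    loss_fn = torch.nn.CrossEntropyLoss()

    # Synthetic data stands in for a real DataLoader with a DistributedSampler.
    images = torch.randn(32, 3, 224, 224, device="cuda")
    labels = torch.randint(0, 1000, (32,), device="cuda")

    for _ in range(10):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = loss_fn(model(images), labels)
        scaler.scale(loss).backward()  # gradients are synchronized across GPUs here
        scaler.step(optimizer)
        scaler.update()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()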

Test results

The following results show the performance of ACSpeed when it is used on Alibaba Cloud GPU-accelerated instances to train models in different scenarios. All tests use ACSpeed V1.0.2 on instances that are each equipped with eight GPUs.

The test results show that ACSpeed improves performance for a wide range of models, with overall gains of 5% to 200%. The improvement is most significant in scenarios where native PyTorch DDP scales poorly, and the performance of ACSpeed remains consistent. The following figure shows the test results.

The following list describes the elements in the figure.

  • ddp_acc (x-axis): the scalability of native PyTorch DDP in multi-server, multi-GPU training, expressed as multi-server linearity. A lower linearity value indicates poorer scalability. Multi-server linearity is calculated based on the following formula: Multi-server linearity = Multi-server performance/(Single-server performance × Number of servers).

  • acc_ratio (y-axis): the performance of ACSpeed expressed as a multiple of the performance of PyTorch DDP. For example, a value of 1.25 indicates that ACSpeed delivers 1.25 times the performance of PyTorch DDP, which means that ACSpeed improves performance by 25%. For a worked example of both axes, see the sketch after this list.

  • DDP Scalability VS AcSpeed Acc (dot): each dot plots the scalability of PyTorch DDP for a model against the performance improvement that ACSpeed provides for that model. The colors indicate the number of servers:

      • Blue: one server

      • Orange: two servers

      • Red: four servers

      • Green: eight servers
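As a worked example of the two axes, the following sketch computes ddp_acc and acc_ratio from throughput numbers. The throughput values are hypothetical, chosen only to illustrate the formulas; they are not measured results.

# Hypothetical throughput values (samples per second); illustrative only.
single_server_perf = 1000.0         # 1 server (8 GPUs), native PyTorch DDP
num_servers = 4
ddp_multi_server_perf = 2400.0      # 4 servers (32 GPUs), native PyTorch DDP
acspeed_multi_server_perf = 3000.0  # 4 servers (32 GPUs), ACSpeed

# ddp_acc (x-axis): multi-server linearity of native DDP.
ddp_acc = ddp_multi_server_perf / (single_server_perf * num_servers)

# acc_ratio (y-axis): ACSpeed performance as a multiple of DDP performance.
acc_ratio = acspeed_multi_server_perf / ddp_multi_server_perf

print(f"ddp_acc = {ddp_acc:.2f}")      # 0.60: DDP reaches 60% of linear scaling
print(f"acc_ratio = {acc_ratio:.2f}")  # 1.25: ACSpeed improves performance by 25%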

Model performance data

This section shows the performance data only for common models that were tested. The performance improvement varies with the ratio of communication to computation in each model and scenario.

Note

If you want to obtain performance test data for other server configurations, join DingTalk group 33617640 for technical support. You can download DingTalk from the DingTalk download page.

Scenario 1: Train an alexnet model

  • Model: alexnet

  • Domain: COMPUTER_VISION

  • Subdomain: CLASSIFICATION

  • Batch_size: 128

  • Precision: automatic mixed precision (AMP)

The following figure shows the performance data of the alexnet model in the training scenario.
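As a simplified, single-GPU illustration of this scenario's settings (alexnet, batch size 128, AMP), the following sketch measures training throughput with the alexnet implementation from torchvision. It is a stand-in for the actual multi-server benchmark, and the step count and learning rate are arbitrary.

import time
import torch
import torchvision

model = torchvision.models.alexnet(num_classes=1000).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()
loss_fn = torch.nn.CrossEntropyLoss()

# Batch size 128 with AMP, matching this scenario's settings; synthetic data.
images = torch.randn(128, 3, 224, 224, device="cuda")
labels = torch.randint(0, 1000, (128,), device="cuda")

for step in range(20):
    if step == 10:  # skip warm-up iterations before timing
        torch.cuda.synchronize()
        start = time.time()
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(images), labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
torch.cuda.synchronize()
print(f"throughput: {10 * 128 / (time.time() - start):.0f} samples/s")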

Scenario 2: Train a resnet18 model

  • Model: resnet18

  • Domain: COMPUTER_VISION

  • Subdomain: CLASSIFICATION

  • Batch_size: 16

  • Precision: AMP

The following figure shows the performance data of the resnet18 model in the training scenario.

Scenario 3: Train a resnet50 model

  • Model: resnet50

  • Domain: COMPUTER_VISION

  • Subdomain: CLASSIFICATION

  • Batch_size: 32

  • Precision: AMP

The following figure shows the performance data of the resnet50 model in the training scenario.

Scenario 4: Train a vgg16 model

  • Model: vgg16

  • Domain: COMPUTER_VISION

  • Subdomain: CLASSIFICATION

  • Batch_size: 64

  • Precision: AMP

The following figure shows the performance data of the vgg16 model in the training scenario.

Scenario 5: Train a timm_vovnet model

  • Model: timm_vovnet

  • Domain: COMPUTER_VISION

  • Subdomain: CLASSIFICATION

  • Batch_size: 32

  • Precision: AMP

The following figure shows the performance data of the timm_vovnet model in the training scenario.

Scenario 6: Train a timm_vision_transformer model

  • Model: timm_vision_transformer

  • Domain: COMPUTER_VISION

  • Subdomain: CLASSIFICATION

  • Batch_size: 8

  • Precision: AMP

The following figure shows the performance data of the timm_vision_transformer model in the training scenario.

Scenario 7: Train a pytorch_unet model

  • Model: pytorch_unet

  • Domain: COMPUTER_VISION

  • Subdomain: CLASSIFICATION

  • Batch_size: 1

  • Precision: AMP

The following figure shows the performance data of the pytorch_unet model in the training scenario.

Scenario 8: Train an hf_Bart model

  • Model: hf_Bart

  • Domain: NLP

  • Subdomain: LANGUAGE_MODELING

  • Batch_size: 4

  • Precision: AMP

The following figure shows the performance data of the hf_Bart model in the training scenario.

Scenario 9: Train an hf_Bert model

  • Model: hf_Bert

  • Domain: NLP

  • Subdomain: LANGUAGE_MODELING

  • Batch_size: 4

  • Precision: AMP

The following figure shows the performance data of the hf_Bert model in the training scenario.

Scenario 10: Train a speech_transformer model

  • Model: speech_transformer

  • Domain: SPEECH

  • Subdomain: RECOGNITION

  • Batch_size: 32

  • Precision: AMP

The following figure shows the performance data of the speech_transformer model in the training scenario.

Scenario 11: Train a tts_angular model

  • Model: tts_angular

  • Domain: SPEECH

  • Subdomain: SYNTHESIS

  • Batch_size: 64

  • Precision: AMP

The following figure shows the performance data of the tts_angular model in the training scenario.