Benchmarks of training performance

Last Updated: Dec 22, 2023

This topic describes the training performance benchmarks that are obtained when you use Pai-Megatron-Patch to optimize the training of Transformer models in PyTorch.

Mixed-precision training

Training environment: pre-training of the Hugging Face BERT model on an English-language corpus, with the following configuration:

  • num-layers 12

  • hidden-size 768

  • num-attention-heads 12

  • num-params 110106428

  • local-rank 4

  • seq-length 512

  • micro-batch-size 16

  • global-batch-size 64

Solution | Throughput (samples/s) | Peak memory (MB)
Single-precision training | 103.07 +/- 1.03 | 17,025
Mixed-precision training | 178.15 +/- 2.10 | 12,698
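
For reference, mixed-precision training of this kind can be enabled in plain PyTorch through automatic mixed precision (AMP). The following is a minimal sketch that assumes a generic Hugging Face BERT masked-language-modeling setup matching the configuration above; it is not the exact Pai-Megatron-Patch integration that produced these numbers.

```python
# Minimal sketch of FP16 mixed-precision pre-training with torch.cuda.amp.
# Assumes a generic Hugging Face BERT MLM setup; not the exact
# Pai-Megatron-Patch integration used for the benchmarks above.
import torch
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(num_hidden_layers=12, hidden_size=768, num_attention_heads=12)
model = BertForMaskedLM(config).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid FP16 underflow

def train_step(batch):
    # batch: dict of CUDA tensors (input_ids, attention_mask, labels),
    # e.g. micro-batch-size 16 with seq-length 512 as in the setup above.
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():   # run the forward pass in FP16 where safe
        loss = model(**batch).loss
    scaler.scale(loss).backward()     # backward on the scaled loss
    scaler.step(optimizer)            # unscale gradients, then update weights
    scaler.update()                   # adjust the loss scale for the next step
    return loss.detach()
```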

Distributed memory optimization: Model state partitioning

Training environment: pre-training of the Megatron GPT model on an English-language corpus, with the following configuration:

  • num-layers 24

  • hidden-size 2048

  • num-attention-heads 32

  • num-params 1313722368 (1.3 billion)

  • local-rank 4

  • seq-length 1024

  • micro-batch-size 1

  • global-batch-size 4

PyTorch native distributed data parallelism (DDP) without any acceleration technology causes an out-of-memory (OOM) exception, because the 1.3-billion-parameter model cannot fit into 32 GB of GPU memory. The states of the Adam optimizer alone can consume as much as 16 GB of memory, because Adam keeps FP32 master weights, momentum, and variance for every parameter (roughly 12 bytes per parameter, or about 16 GB for 1.3 billion parameters).
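
The OSS, SDP, FSDP, and ZeRO solutions in the following table partition these model states across the data-parallel GPUs so that the optimizer state is no longer replicated on every device. As an illustration of the general technique, the following is a minimal sketch that uses FairScale's OSS and ShardedDataParallel; it assumes that torch.distributed is already initialized (for example, by torchrun) and that the model already resides on the local GPU. It is not the exact Pai-Megatron-Patch integration.

```python
# Minimal sketch of model state partitioning with FairScale OSS + SDP.
# Assumes torch.distributed has been initialized and `model` is an
# nn.Module already placed on the local GPU; not PAI's exact setup.
import torch
from fairscale.optim.oss import OSS
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP

def shard_model_states(model: torch.nn.Module):
    # OSS shards the Adam optimizer states across the data-parallel ranks,
    # so each GPU keeps only 1/world_size of the momentum/variance tensors.
    optimizer = OSS(params=model.parameters(), optim=torch.optim.Adam, lr=1e-4)
    # ShardedDDP reduces each gradient shard only to the rank that owns the
    # corresponding optimizer state, instead of all-reducing everything.
    model = ShardedDDP(model, optimizer)
    return model, optimizer
```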

Solution | Throughput (samples/s) | Peak memory (MB)
No acceleration technologies | OOM | OOM
Mixed-precision training | 9.57 +/- 0.26 | 25,061
Mixed-precision training, and model state partitioning by using Optimizer State Sharding (OSS) | 6.02 +/- 0.06 | 22,077
Mixed-precision training, and model state partitioning by using OSS and Sharded Data Parallel (SDP) | 7.01 +/- 0.07 | 17,113
Mixed-precision training, and model state partitioning by using Fully Sharded Data Parallel (FSDP) | N/A | N/A
Mixed-precision training, and optimizer state partitioning by using Zero Redundancy Optimizer (ZeRO) | 12.88 +/- 0.10 | 15,709
Mixed-precision training, and partitioning of optimizer states and gradients by using ZeRO | 10.27 +/- 0.08 | 15,693
Mixed-precision training, and partitioning of optimizer states, gradients, and parameters by using ZeRO | N/A | N/A
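
The three ZeRO rows above correspond to ZeRO stages 1, 2, and 3, which partition the optimizer states, then the gradients, and then the parameters. The following is a minimal sketch of selecting a ZeRO stage through a DeepSpeed configuration. The batch sizes mirror the settings above for 4 GPUs, but the configuration is illustrative and is not the exact one used for these benchmarks.

```python
# Minimal sketch of enabling ZeRO through DeepSpeed; illustrative only.
# Assumes `model` is an nn.Module and the script is launched with the
# deepspeed launcher; not the exact Pai-Megatron-Patch integration.
import deepspeed
import torch

ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # micro-batch-size in the setup above
    "train_batch_size": 4,                 # global-batch-size in the setup above (4 GPUs)
    "fp16": {"enabled": True},             # mixed-precision training
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 1,  # 1: optimizer states; 2: + gradients; 3: + parameters
    },
}

def build_engine(model: torch.nn.Module):
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )
    return engine, optimizer
```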

3D parallelism

Training environment: pre-training of the Megatron GPT model on an English-language corpus, with the following configuration:

  • num-layers 24

  • hidden-size 2048

  • num-attention-heads 32

  • num-params 1313722368 (1.3 billion)

  • local-rank 4

  • seq-length 1024

  • micro-batch-size 1

  • global-batch-size 4

The following table describes the benchmarks when you enable 3D parallelism and mixed-precision training at the same time.

Operator splitting | Pipeline parallelism | Throughput (samples/s) | Peak memory (MB)
1 | 1 | 9.63 +/- 0.29 | 25,061
2 | 1 | 7.59 +/- 0.14 | 11,300
4 | 1 | 6.16 +/- 0.06 | 5,673
1 | 2 | 8.46 +/- 0.17 | 12,375
1 | 4 | 8.03 +/- 0.12 | 8,141
2 | 2 | 7.37 +/- 0.11 | 6,211
4 | 4 | 6.24 +/- 0.08 | 5,673
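
The Operator splitting column corresponds to the tensor (operator-level) model parallel degree, the Pipeline parallelism column corresponds to the pipeline parallel degree, and the data parallel degree is the total number of GPUs divided by their product. The following is a minimal sketch of how these two degrees map onto Megatron-LM-style launch arguments; the flag names follow upstream Megatron-LM, and the actual entry point and flags used by Pai-Megatron-Patch may differ.

```python
# Minimal sketch of mapping the two parallel degrees in the table onto
# Megatron-LM-style launch arguments. Flag names follow upstream Megatron-LM;
# the entry point shown here is hypothetical, and the exact Pai-Megatron-Patch
# flags may differ.
TENSOR_PARALLEL = 2    # "Operator splitting" column
PIPELINE_PARALLEL = 2  # "Pipeline parallelism" column
# Data parallel degree = total GPUs / (TENSOR_PARALLEL * PIPELINE_PARALLEL).

args = [
    "pretrain_gpt.py",                    # hypothetical Megatron-style entry point
    "--num-layers", "24",
    "--hidden-size", "2048",
    "--num-attention-heads", "32",
    "--seq-length", "1024",
    "--micro-batch-size", "1",
    "--global-batch-size", "4",
    "--fp16",                             # mixed-precision training
    "--tensor-model-parallel-size", str(TENSOR_PARALLEL),
    "--pipeline-model-parallel-size", str(PIPELINE_PARALLEL),
]
print(" ".join(args))
```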

Graph optimization of ONNX Runtime

Training environment: fine-tuning of the Hugging Face BERT model on an English-language corpus, with the following configuration:

  • num-layers 12

  • hidden-size 768

  • num-attention-heads 12

  • num-params 110106428

  • local-rank 4

  • seq-length 512

  • micro-batch-size 16

  • global-batch-size 64

The following table describes the benchmarks. Graph optimization of ONNX Runtime alone improves throughput by about 15.6% compared with single-precision training, and combining it with mixed-precision training yields the highest throughput.

Solution | Throughput (samples/s) | Peak memory (MB)
Single-precision training | 479.15 +/- 1.67 | 2,112
Mixed-precision training | 589.66 +/- 4.79 | 2,127
Graph optimization of ONNX Runtime | 554.24 +/- 1.98 | 2,430
Graph optimization of ONNX Runtime and mixed-precision training | 614.70 +/- 8.69 | 2,289
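
For reference, graph optimization of this kind is typically applied by wrapping the PyTorch model with ONNX Runtime Training's ORTModule, which exports the forward and backward computation to ONNX and applies graph-level rewrites. The following is a minimal sketch that combines ORTModule with AMP, matching the last row of the table; it assumes the torch-ort (onnxruntime-training) package and a generic Hugging Face BERT fine-tuning setup, not the exact PAI integration.

```python
# Minimal sketch of combining ONNX Runtime graph optimization (ORTModule)
# with mixed-precision fine-tuning. Assumes the torch-ort /
# onnxruntime-training packages are installed; not the exact PAI integration.
import torch
from transformers import BertForSequenceClassification
from torch_ort import ORTModule

model = BertForSequenceClassification.from_pretrained("bert-base-uncased").cuda()
model = ORTModule(model)             # export to ONNX and optimize the training graph
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scaler = torch.cuda.amp.GradScaler()

def train_step(batch):
    # batch: dict of CUDA tensors (input_ids, attention_mask, labels)
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():  # mixed-precision forward pass
        loss = model(**batch).loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.detach()
```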