This topic describes training performance benchmarks obtained when you use Pai-Megatron-Patch to accelerate the training of PyTorch Transformer models.
Mixed-precision training
Training environment: pre-training of the English-language BERT model from Hugging Face with the following configuration:
num-layers 12
hidden-size 768
num-attention-heads 12
num-params 110106428
local-rank 4
seq-length 512
micro-batch-size 16
global-batch-size 64
| Solution | Throughput (samples/s) | Peak memory (MB) |
| --- | --- | --- |
| Single-precision training | 103.07 +/- 1.03 | 17,025 |
| Mixed-precision training | 178.15 +/- 2.10 | 12,698 |
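For illustration only, the following is a minimal sketch of mixed-precision training with PyTorch's native automatic mixed precision (torch.cuda.amp). The linear model, optimizer settings, and random data are placeholders, not the exact BERT setup benchmarked above.

```python
import torch

# Minimal AMP sketch: placeholder model, optimizer, and data, not the exact
# BERT benchmark configuration above.
model = torch.nn.Linear(768, 768).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()              # dynamic loss scaling for FP16

for step in range(10):                            # stand-in for a data loader
    inputs = torch.randn(16, 768, device="cuda")  # micro-batch-size 16
    targets = torch.randn(16, 768, device="cuda")

    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():               # forward pass runs in mixed FP16/FP32
        loss = torch.nn.functional.mse_loss(model(inputs), targets)

    scaler.scale(loss).backward()                 # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)                        # unscale gradients, then apply the optimizer step
    scaler.update()                               # adjust the loss scale for the next iteration
```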
Distributed memory optimization: Model state partitioning
Training environment: pre-training of the English-language GPT model from Megatron with the following configuration:
num-layers 24
hidden-size 2048
num-attention-heads 32
num-params 1313722368 (1.3 billion)
local-rank 4
seq-length 1024
micro-batch-size 1
global-batch-size 4
With PyTorch native distributed data parallelism alone, training fails with an out-of-memory (OOM) error because the model state does not fit into 32 GB of GPU memory. The Adam optimizer states alone can consume as much as 16 GB of memory.
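To see where the 16 GB figure comes from, assume a standard mixed-precision setup with 32-bit Adam states: the optimizer keeps an FP32 master copy of the parameters plus FP32 first- and second-moment estimates, roughly 12 bytes per parameter. For the 1.3-billion-parameter model, this amounts to about 1,313,722,368 × 12 bytes ≈ 15.8 GB of optimizer state alone, before the FP16 parameters, gradients, and activations are counted.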
| Solution | Throughput (samples/s) | Peak memory (MB) |
| --- | --- | --- |
| No acceleration technologies | OOM | OOM |
| Mixed-precision training | 9.57 +/- 0.26 | 25,061 |
| Mixed-precision training and model state partitioning by using Optimizer State Sharding (OSS) | 6.02 +/- 0.06 | 22,077 |
| Mixed-precision training and model state partitioning by using OSS and Sharded Data Parallel (SDP) | 7.01 +/- 0.07 | 17,113 |
| Mixed-precision training and model state partitioning by using Fully Sharded Data Parallel (FSDP) | N/A | N/A |
| Mixed-precision training and optimizer state partitioning by using Zero Redundancy Optimizer (ZeRO) | 12.88 +/- 0.10 | 15,709 |
| Mixed-precision training and partitioning of optimizer states and gradients by using ZeRO | 10.27 +/- 0.08 | 15,693 |
| Mixed-precision training and partitioning of optimizer states, gradients, and parameters by using ZeRO | N/A | N/A |
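For illustration only, the following sketch shows optimizer state partitioning in the style of ZeRO stage 1 by using PyTorch's built-in ZeroRedundancyOptimizer. The model, hyperparameters, and random data are placeholders; the rows above were not necessarily produced with this wrapper.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer
from torch.nn.parallel import DistributedDataParallel as DDP

# Minimal sketch of optimizer state partitioning (ZeRO stage 1 style).
# Placeholder model and hyperparameters; launch with torchrun so that
# RANK, LOCAL_RANK, and WORLD_SIZE are set.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(2048, 2048).cuda()        # stand-in for the GPT model
model = DDP(model, device_ids=[local_rank])

# Each rank keeps only a shard of the Adam states instead of a full replica.
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.Adam,
    lr=1e-4,
)

for step in range(10):                            # stand-in for a data loader
    inputs = torch.randn(1, 2048, device="cuda")  # micro-batch-size 1
    loss = model(inputs).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                              # gradients reduced, local shard updated, params synchronized
```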
3D parallelism
Training environment: pre-training of the English-language GPT model from Megatron with the following configuration:
num-layers 24
hidden-size 2048
num-attention-heads 32
num-params 1313722368 (1.3 billion)
local-rank 4
seq-length 1024
micro-batch-size 1
global-batch-size 4
The following table describes the benchmarks when you enable 3D parallelism and mixed-precision training at the same time.
| Operator splitting degree | Pipeline parallelism degree | Throughput (samples/s) | Peak memory (MB) |
| --- | --- | --- | --- |
| 1 | 1 | 9.63 +/- 0.29 | 25,061 |
| 2 | 1 | 7.59 +/- 0.14 | 11,300 |
| 4 | 1 | 6.16 +/- 0.06 | 5,673 |
| 1 | 2 | 8.46 +/- 0.17 | 12,375 |
| 1 | 4 | 8.03 +/- 0.12 | 8,141 |
| 2 | 2 | 7.37 +/- 0.11 | 6,211 |
| 4 | 4 | 6.24 +/- 0.08 | 5,673 |
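For illustration only, the following sketch shows how the operator splitting and pipeline parallelism degrees in the table map to Megatron's model-parallel initialization. It assumes that Megatron-LM's megatron.core package is available; the degrees shown correspond to the 2 and 2 row and are placeholders, not the benchmark launch script.

```python
import os
import torch
import torch.distributed as dist
# Megatron-LM model-parallel state utilities (assumes megatron.core is installed).
from megatron.core import parallel_state

# Launch with torchrun; the degrees below are placeholders for the "2 | 2" row.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=2,    # operator splitting degree
    pipeline_model_parallel_size=2,  # pipeline parallelism degree
)

# The remaining GPUs form the data-parallel dimension:
# world size = operator splitting degree x pipeline degree x data-parallel degree.
print(
    parallel_state.get_tensor_model_parallel_world_size(),
    parallel_state.get_pipeline_model_parallel_world_size(),
    parallel_state.get_data_parallel_world_size(),
)
```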
Graph optimization of ONNX Runtime
Training environment: fine-tuning of the English-language BERT model from Hugging Face with the following configuration:
num-layers 12
hidden-size 768
num-attention-heads 12
num-params 110106428
local-rank 4
seq-length 512
micro-batch-size 16
global-batch-size 64
The following table describes the benchmarks. Graph optimization of ONNX Runtime improves throughput by approximately 15.6% compared with single-precision training.
| Solution | Throughput (samples/s) | Peak memory (MB) |
| --- | --- | --- |
| Single-precision training | 479.15 +/- 1.67 | 2,112 |
| Mixed-precision training | 589.66 +/- 4.79 | 2,127 |
| Graph optimization of ONNX Runtime | 554.24 +/- 1.98 | 2,430 |
| Graph optimization of ONNX Runtime and mixed-precision training | 614.70 +/- 8.69 | 2,289 |
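For illustration only, the following sketch wraps a Hugging Face BERT model with ORTModule from ONNX Runtime Training (the torch-ort package), which exports the forward and backward computation to ONNX Runtime and applies its graph optimizations. The checkpoint name, optimizer settings, and random data are placeholders, not the fine-tuning setup benchmarked above.

```python
import torch
from transformers import BertForSequenceClassification
from torch_ort import ORTModule  # ONNX Runtime Training (pip install torch-ort)

# Minimal sketch: run forward/backward through ONNX Runtime's optimized graph.
# Placeholder checkpoint and random data, not the benchmark configuration above.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased").cuda()
model = ORTModule(model)  # graph capture + ONNX Runtime graph optimizations

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

input_ids = torch.randint(0, 30522, (16, 512), device="cuda")  # micro-batch-size 16, seq-length 512
labels = torch.randint(0, 2, (16,), device="cuda")

outputs = model(input_ids=input_ids, labels=labels)
outputs.loss.backward()
optimizer.step()
```

Combining this wrapper with torch.cuda.amp, as in the last row of the table, would follow the same pattern as the mixed-precision sketch shown earlier.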