This topic describes the scenarios in which you can use PAI-Rapidformer and the parameters that you can set to accelerate the training of transformers. We recommend that you read this topic before you use PAI-Rapidformer.
Scenarios
Accelerate the fine-tuning of transformers in a black-box manner
Accelerate the pre-training of transformers in a black-box manner
Accelerate the fine-tuning of transformers in a white-box manner by using the Finetuner code template
Accelerate the pre-training of transformers in a white-box manner by using the Pretrainer code template
Regular training settings: Parameters on data
| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --micro-batch-size | Integer | Yes | None | N/A | The batch size of each GPU. |
| --global-batch-size | Integer | Yes | None | N/A | The batch size across all GPUs in distributed training. |
| --tokenizer-type | String | No | None | BertWordPieceLowerCase, BertWordPieceCase, GPT2BPETokenizer | The type of the tokenizer. |
| --split | String | No | 969,30,1 | N/A | The ratio in which the pre-training dataset is split into training, validation, and test sets. |
| --data-impl | String | No | mmap | lazy, cached, mmap, infer | The implementation used to build the indexed pre-training dataset. |
| --data-path | String | Yes | None | N/A | The path of the file in which the pre-training dataset is stored. |
| --data-dir | String | No | None | N/A | The path of the directory in which the dataset for fine-tuning is stored. |
| --data-name | String | Yes | None | N/A | The name of the file in which the dataset for fine-tuning is stored. |
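For concreteness, a minimal sketch of how the data parameters might be combined on one command line. The `rapidformer` launcher name and all paths and values here are illustrative placeholders, not part of the documented product interface:

```bash
# Hypothetical launcher name and paths; substitute your actual entry point.
rapidformer \
  --micro-batch-size 8 \
  --global-batch-size 64 \
  --tokenizer-type GPT2BPETokenizer \
  --split 969,30,1 \
  --data-impl mmap \
  --data-path /mnt/data/pretrain_text_document
```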
Regular training settings: Parameters on the transformer
| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --pretrained-model-name-or-path | String | Yes | None | N/A | The name or the path of the pretrained model to load. |
| --num-layers | Integer | Yes | None | N/A | The number of transformer layers. |
| --hidden-size | Integer | Yes | None | N/A | The dimension of the hidden layers. |
| --num-attention-heads | Integer | Yes | None | N/A | The number of heads in each self-attention layer. |
| --max-position-embeddings | Integer | Yes | None | N/A | The maximum sequence length for which position embeddings are created. |
| --seq-length | Integer | Yes | None | N/A | The sequence length used in training. |
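A sketch of the model-shape flags, sized like a BERT-base model purely for illustration (same placeholder `rapidformer` launcher as above):

```bash
# BERT-base-like shape; these values are examples, not recommendations.
rapidformer \
  --pretrained-model-name-or-path bert-base-uncased \
  --num-layers 12 \
  --hidden-size 768 \
  --num-attention-heads 12 \
  --max-position-embeddings 512 \
  --seq-length 512
```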
Regular training settings: Parameters on training
| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --task | String | Yes | None |  | The name of the training task. |
| --save | String | Yes | None | N/A | The path in which the trained model is saved. |
| --lr | Float | Yes | None | N/A | The learning rate. |
| --lr-decay-style | String | Yes | linear | constant, linear, cosine | The learning rate decay scheme. |
| --weight-decay | Float | Yes | 0.01 | N/A | The weight decay value. |
| --clip-grad | Float | No | 1 | N/A | The gradient clipping value. |
| --lr-warmup-fraction | Float | No | None | N/A | The fraction of training used for learning rate warmup. |
| --train-iters | Integer | Yes | None | N/A | The number of training iterations. |
| --epochs | Integer | No | None | N/A | The number of training epochs. |
| --log-interval | Integer | No | 100 | N/A | The interval, in iterations, at which logs are displayed. |
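A sketch of a training schedule; every value, including the task name, is a placeholder:

```bash
# Illustrative values only; the launcher name and task name are placeholders.
rapidformer \
  --task my_task \
  --save /mnt/output/checkpoints \
  --lr 1e-4 \
  --lr-decay-style linear \
  --weight-decay 0.01 \
  --clip-grad 1.0 \
  --lr-warmup-fraction 0.01 \
  --train-iters 500000 \
  --log-interval 100
```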
Acceleration switch settings: Parameter on sparsely-gated MoEs
| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --num-experts | Integer | No | None | N/A | The number of experts in each mixture-of-experts (MoE) layer. |
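For example, enabling MoE might look as follows; the expert count is illustrative:

```bash
# Added on top of the regular data/model/training flags; 8 experts is an example.
rapidformer --num-experts 8
```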
Acceleration switch settings: Parameter on mixed-precision training
| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --mixed-precision | Boolean | No | None | N/A | Specifies whether to enable FP16 mixed-precision training. |
Note: You can enable mixed-precision training only if you use the Trainer, Pretrainer, or Finetuner code template provided by PAI-Rapidformer.
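As a boolean switch, it is enabled by passing the flag alone (placeholder launcher):

```bash
rapidformer --mixed-precision
```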
Acceleration switch settings: Parameters on model state partitioning by using ZeRO, OSS, SDP, or FSDP
| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --oss-memory-optimization | Boolean | No | N/A | N/A | Specifies whether to use Optimizer State Sharding (OSS) to partition optimizer states. |
| --oss-sdp-memory-optimization | Boolean | No | N/A | N/A | Specifies whether to use Sharded Data Parallel (SDP) to partition optimizer states and gradients. |
| --fsdp-memory-optimization | Boolean | No | N/A | N/A | Specifies whether to use Fully Sharded Data Parallel (FSDP) to partition optimizer states, gradients, and parameters. |
| --zero-1-memory-optimization | Boolean | No | N/A | N/A | Specifies whether to use Zero Redundancy Optimizer (ZeRO) to partition optimizer states. |
| --zero-2-memory-optimization | Boolean | No | N/A | N/A | Specifies whether to use ZeRO to partition optimizer states and gradients. |
| --zero-3-memory-optimization | Boolean | No | N/A | N/A | Specifies whether to use ZeRO to partition optimizer states, gradients, and parameters. |
Note: You can use ZeRO only if you use the Trainer code template. You can use OSS, SDP, or FSDP if you do not use the Trainer code template.
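The partitioning switches are alternatives; pick one per run. A sketch of the two template situations described in the note (placeholder launcher):

```bash
# With the Trainer code template: ZeRO stage 2 (optimizer states and gradients).
rapidformer --zero-2-memory-optimization

# Without the Trainer code template: FSDP (states, gradients, and parameters).
rapidformer --fsdp-memory-optimization
```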
Acceleration switch settings: Parameters on 3D parallelism (data, tensor, and pipeline parallelism)
| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --tensor-model-parallel-size | Integer | No | 1 | N/A | The size of tensor parallelism. |
| --pipeline-model-parallel-size | Integer | No | 1 | N/A | The size of pipeline parallelism. |
Note:
To enable 3D parallelism and model state partitioning at the same time, you can use only ZeRO, and only to partition optimizer states (--zero-1-memory-optimization) or optimizer states and gradients (--zero-2-memory-optimization).
You can enable 3D parallelism only if you use the Trainer code template.
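In Megatron-style 3D parallelism, the data-parallel size follows from the total GPU count: data_parallel_size = world_size / (tensor_parallel_size × pipeline_parallel_size). A sketch for 16 GPUs (placeholder launcher):

```bash
# 16 GPUs: tensor parallel 2 x pipeline parallel 2 leaves data parallel 4.
# Only ZeRO stage 1 or 2 may be combined with 3D parallelism (see note above).
rapidformer \
  --tensor-model-parallel-size 2 \
  --pipeline-model-parallel-size 2 \
  --zero-1-memory-optimization
```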
Acceleration switch settings: Parameter on graph optimization
| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --onnx-runtime-training | Boolean | No | None | N/A | Specifies whether to enable graph optimization provided by ONNX Runtime. |
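Enabled by the flag alone (placeholder launcher):

```bash
rapidformer --onnx-runtime-training
```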
Acceleration switch settings: Parameter on CPU offloading
| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --cpu-offload | Boolean | No | None | N/A | Specifies whether to offload model states to CPU memory during training. |
Note: To enable CPU offloading and model state partitioning at the same time, you can use only ZeRO for model state partitioning.
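A sketch combining the two switches as the note allows; whether a specific ZeRO stage is required here is an assumption:

```bash
# CPU offloading paired with ZeRO partitioning, per the note above.
rapidformer --cpu-offload --zero-2-memory-optimization
```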
Acceleration switch settings: Parameter on activation checkpointing
| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --checkpoint-activations | Boolean | No | None | N/A | Specifies whether to enable activation checkpointing, which discards intermediate activations in the forward pass and recomputes them in the backward pass to reduce memory usage. |
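Like the other boolean switches, it takes no value (placeholder launcher):

```bash
rapidformer --checkpoint-activations
```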
Acceleration switch settings: Parameters on gradient accumulation
| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --micro-batch-size | Integer | Yes | 1 | N/A | The batch size of each GPU. |
| --global-batch-size | Integer | Yes | 1 | N/A | The batch size across all GPUs in distributed training. |
Note:
You can enable gradient accumulation only if you use the Pretrainer or Finetuner code template to perform iteration-based pre-training. You cannot enable gradient accumulation if you perform epoch-based fine-tuning.
The number of gradient accumulation steps is calculated automatically from the parameters in the preceding table and the number of data-parallel ranks.
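As a worked example, assuming the usual Megatron-style formula accumulation_steps = global_batch_size / (micro_batch_size × data_parallel_size):

```bash
# 8 data-parallel GPUs: 256 / (4 * 8) = 8 accumulation steps per optimizer update.
GLOBAL=256; MICRO=4; DP=8
echo $(( GLOBAL / (MICRO * DP) ))   # prints 8
rapidformer --micro-batch-size "$MICRO" --global-batch-size "$GLOBAL"
```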
Acceleration switch settings: Parameter on the dynamic-shape data iterator
| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --data-iterator | String | No | None | dynamic-shape | The type of the data iterator. Set this parameter to dynamic-shape to enable the dynamic-shape data iterator. |
Note: You can enable the dynamic-shape data iterator only if you use the Pretrainer code template for pre-training.
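For example (placeholder launcher):

```bash
rapidformer --data-iterator dynamic-shape
```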
Acceleration switch settings: Parameter on the operation-based fusion optimizer
| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --optimizers | String | Yes | apex_adam | apex_adam, apex_lamb | The operation-based fusion optimizer: Apex fused Adam or Apex fused LAMB. |
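For example, keeping the default fused Adam; the apex_lamb alternative is inferred from the Adam/LAMB description above (placeholder launcher):

```bash
rapidformer --optimizers apex_adam
```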