This topic describes the scenarios in which you can use PAI-Rapidformer and the parameters that you can set to accelerate the training of transformers. We recommend that you read this topic before you use PAI-Rapidformer.
Scenarios
Accelerate the fine-tuning of transformers in a black-box manner
Accelerate the pre-training of transformers in a black-box manner
Accelerate the fine-tuning of transformers in a white-box manner by using the Finetuner code template
Accelerate the pre-training of transformers in a white-box manner by using the Pretrainer code template
Regular training settings: Parameters on data
| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --micro-batch-size | Integer | Yes | None | N/A | The batch size of each GPU. |
| --global-batch-size | Integer | Yes | None | N/A | The batch size across all GPUs in distributed training. |
| --tokenizer-type | String | No | None | BertWordPieceLowerCase, BertWordPieceCase, GPT2BPETokenizer | The type of the tokenizer. |
| --split | String | No | 969,30,1 | N/A | The ratio in which the pre-training dataset is split into training, validation, and test sets. |
| --data-impl | String | No | mmap | lazy, cached, mmap, infer | The implementation used to build the indexed pre-training dataset. |
| --data-path | String | Yes | None | N/A | The path of the file in which the pre-training dataset is stored. |
| --data-dir | String | No | None | N/A | The path of the directory in which the dataset for fine-tuning is stored. |
| --data-name | String | Yes | None | N/A | The name of the file in which the dataset for fine-tuning is stored. |
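For concreteness, a minimal sketch of how the data parameters might be combined on one command line. The `rapidformer` launcher name and all paths and values here are illustrative placeholders, not part of the documented product interface:

```bash
# Hypothetical launcher name and paths; substitute your actual entry point.
rapidformer \
  --micro-batch-size 8 \
  --global-batch-size 64 \
  --tokenizer-type GPT2BPETokenizer \
  --split 969,30,1 \
  --data-impl mmap \
  --data-path /mnt/data/pretrain_text_document
```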
Regular training settings: Parameters on the transformer
| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --pretrained-model-name-or-path | String | Yes | None | N/A | The name or the path of the pretrained model to load. |
| --num-layers | Integer | Yes | None | N/A | The number of transformer layers. |
| --hidden-size | Integer | Yes | None | N/A | The dimension of the hidden layers. |
| --num-attention-heads | Integer | Yes | None | N/A | The number of heads in each self-attention layer. |
| --max-position-embeddings | Integer | Yes | None | N/A | The maximum sequence length for which position embeddings are created. |
| --seq-length | Integer | Yes | None | N/A | The sequence length used in training. |
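A sketch of the model-shape flags, sized like a BERT-base model purely for illustration (same placeholder `rapidformer` launcher as above):

```bash
# BERT-base-like shape; these values are examples, not recommendations.
rapidformer \
  --pretrained-model-name-or-path bert-base-uncased \
  --num-layers 12 \
  --hidden-size 768 \
  --num-attention-heads 12 \
  --max-position-embeddings 512 \
  --seq-length 512
```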
Regular training settings: Parameters on training
| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --task | String | Yes | None |  | The name of the training task. |
| --save | String | Yes | None | N/A | The path in which the trained model is saved. |
| --lr | Float | Yes | None | N/A | The learning rate. |
| --lr-decay-style | String | Yes | linear | constant, linear, cosine | The learning rate decay scheme. |
| --weight-decay | Float | Yes | 0.01 | N/A | The weight decay value. |
| --clip-grad | Float | No | 1 | N/A | The gradient clipping value. |
| --lr-warmup-fraction | Float | No | None | N/A | The fraction of training used for learning rate warmup. |
| --train-iters | Integer | Yes | None | N/A | The number of training iterations. |
| --epochs | Integer | No | None | N/A | The number of training epochs. |
| --log-interval | Integer | No | 100 | N/A | The interval, in iterations, at which logs are displayed. |
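A sketch of a training schedule; every value, including the task name, is a placeholder:

```bash
# Illustrative values only; the launcher name and task name are placeholders.
rapidformer \
  --task my_task \
  --save /mnt/output/checkpoints \
  --lr 1e-4 \
  --lr-decay-style linear \
  --weight-decay 0.01 \
  --clip-grad 1.0 \
  --lr-warmup-fraction 0.01 \
  --train-iters 500000 \
  --log-interval 100
```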
Acceleration switch settings: Parameter on sparsely-gated MoEs
| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --num-experts | Integer | No | None | N/A | The number of experts in each mixture-of-experts (MoE) layer. |
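For example, enabling MoE might look as follows; the expert count is illustrative:

```bash
# Added on top of the regular data/model/training flags; 8 experts is an example.
rapidformer --num-experts 8
```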
Acceleration switch settings: Parameter on mixed-precision training
| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --mixed-precision | Boolean | No | None | N/A | Specifies whether to enable FP16 mixed-precision training. |
Note: You can enable mixed-precision training only if you use the Trainer, Pretrainer, or Finetuner code template provided by PAI-Rapidformer.
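As a boolean switch, it is enabled by passing the flag alone (placeholder launcher):

```bash
rapidformer --mixed-precision
```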
Acceleration switch settings: Parameters on model state partitioning by using ZeRO, OSS, SDP, or FSDP
| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --oss-memory-optimization | Boolean | No | N/A | N/A | Specifies whether to use Optimizer State Sharding (OSS) to partition optimizer states. |
| --oss-sdp-memory-optimization | Boolean | No | N/A | N/A | Specifies whether to use Sharded Data Parallel (SDP) to partition optimizer states and gradients. |
| --fsdp-memory-optimization | Boolean | No | N/A | N/A | Specifies whether to use Fully Sharded Data Parallel (FSDP) to partition optimizer states, gradients, and parameters. |
| --zero-1-memory-optimization | Boolean | No | N/A | N/A | Specifies whether to use Zero Redundancy Optimizer (ZeRO) to partition optimizer states. |
| --zero-2-memory-optimization | Boolean | No | N/A | N/A | Specifies whether to use ZeRO to partition optimizer states and gradients. |
| --zero-3-memory-optimization | Boolean | No | N/A | N/A | Specifies whether to use ZeRO to partition optimizer states, gradients, and parameters. |
Note: You can use ZeRO only if you use the Trainer code template. You can use OSS, SDP, or FSDP if you do not use the Trainer code template.
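The partitioning switches are alternatives; pick one per run. A sketch of the two template situations described in the note (placeholder launcher):

```bash
# With the Trainer code template: ZeRO stage 2 (optimizer states and gradients).
rapidformer --zero-2-memory-optimization

# Without the Trainer code template: FSDP (states, gradients, and parameters).
rapidformer --fsdp-memory-optimization
```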
Acceleration switch settings: Parameters on 3D parallelism (data, tensor, and pipeline parallelism)
| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --tensor-model-parallel-size | Integer | No | 1 | N/A | The size of tensor parallelism. |
| --pipeline-model-parallel-size | Integer | No | 1 | N/A | The size of pipeline parallelism. |
Note:
To enable 3D parallelism and model state partitioning at the same time, you can use only ZeRO, and only to partition optimizer states (--zero-1-memory-optimization) or optimizer states and gradients (--zero-2-memory-optimization).
You can enable 3D parallelism only if you use the Trainer code template.
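In Megatron-style 3D parallelism, the data-parallel size follows from the total GPU count: data_parallel_size = world_size / (tensor_parallel_size × pipeline_parallel_size). A sketch for 16 GPUs (placeholder launcher):

```bash
# 16 GPUs: tensor parallel 2 x pipeline parallel 2 leaves data parallel 4.
# Only ZeRO stage 1 or 2 may be combined with 3D parallelism (see note above).
rapidformer \
  --tensor-model-parallel-size 2 \
  --pipeline-model-parallel-size 2 \
  --zero-1-memory-optimization
```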
Acceleration switch settings: Parameter on graph optimization
| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --onnx-runtime-training | Boolean | No | None | N/A | Specifies whether to enable graph optimization provided by ONNX Runtime. |
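Enabled by the flag alone (placeholder launcher):

```bash
rapidformer --onnx-runtime-training
```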
Acceleration switch settings: Parameter on CPU offloading
| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --cpu-offload | Boolean | No | None | N/A | Specifies whether to offload model states to CPU memory during training. |
Note: To enable CPU offloading and model state partitioning at the same time, you can use only ZeRO for model state partitioning.
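A sketch combining the two switches as the note allows; whether a specific ZeRO stage is required here is an assumption:

```bash
# CPU offloading paired with ZeRO partitioning, per the note above.
rapidformer --cpu-offload --zero-2-memory-optimization
```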
Acceleration switch settings: Parameter on activation checkpointing
| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --checkpoint-activations | Boolean | No | None | N/A | Specifies whether to enable activation checkpointing, which discards intermediate activations in the forward pass and recomputes them in the backward pass to reduce memory usage. |
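Like the other boolean switches, it takes no value (placeholder launcher):

```bash
rapidformer --checkpoint-activations
```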
Acceleration switch settings: Parameters on gradient accumulation
| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --micro-batch-size | Integer | Yes | 1 | N/A | The batch size of each GPU. |
| --global-batch-size | Integer | Yes | 1 | N/A | The batch size across all GPUs in distributed training. |
Note:
You can enable gradient accumulation only if you use the Pretrainer or Finetuner code template to perform iteration-based pre-training. You cannot enable gradient accumulation if you perform epoch-based fine-tuning.
The number of gradient accumulation steps is calculated automatically from the parameters in the preceding table and the number of data-parallel ranks.
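As a worked example, assuming the usual Megatron-style formula accumulation_steps = global_batch_size / (micro_batch_size × data_parallel_size):

```bash
# 8 data-parallel GPUs: 256 / (4 * 8) = 8 accumulation steps per optimizer update.
GLOBAL=256; MICRO=4; DP=8
echo $(( GLOBAL / (MICRO * DP) ))   # prints 8
rapidformer --micro-batch-size "$MICRO" --global-batch-size "$GLOBAL"
```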
Acceleration switch settings: Parameter on the dynamic-shape data iterator
| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --data-iterator | String | No | None | dynamic-shape | The type of the data iterator. Set this parameter to dynamic-shape to enable the dynamic-shape data iterator. |
Note: You can enable the dynamic-shape data iterator only if you use the Pretrainer code template for pre-training.
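For example (placeholder launcher):

```bash
rapidformer --data-iterator dynamic-shape
```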
Acceleration switch settings: Parameter on the operation-based fusion optimizer
| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --optimizers | String | Yes | apex_adam | apex_adam, apex_lamb | The operation-based fusion optimizer: Apex fused Adam or Apex fused LAMB. |
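For example, keeping the default fused Adam; the apex_lamb alternative is inferred from the Adam/LAMB description above (placeholder launcher):

```bash
rapidformer --optimizers apex_adam
```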