
Platform for AI: Parameter settings

Last Updated: Dec 22, 2023

This topic describes the scenarios in which you can use PAI-Rapidformer and the parameters that you can set when you use PAI-Rapidformer to accelerate the training of transformers. We recommend that you read this topic before you use PAI-Rapidformer.

Scenarios

  • Accelerate the fine-tuning of transformers in a black-box manner

  • Accelerate the pre-training of transformers in a black-box manner

  • Accelerate the fine-tuning of transformers in a white-box manner by using the Finetuner code template

  • Accelerate the pre-training of transformers in a white-box manner by using the Pretrainer code template

Regular training settings: Parameters on data

| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --micro-batch-size | Integer | Yes | None | N/A | The batch size of each GPU. |
| --global-batch-size | Integer | Yes | None | N/A | The total batch size across all GPUs in distributed training. |
| --tokenizer-type | String | No | None | BertWordPieceLowerCase, BertWordPieceCase, GPT2BPETokenizer | The type of the tokenizer. |
| --split | String | No | 969, 30, 1 | N/A | The proportions in which the pre-training data is split into training, validation, and test sets. |
| --data-impl | String | No | mmap | lazy, cached, mmap, infer | The implementation that is used to load the indexed pre-training dataset. |
| --data-path | String | Yes | None | N/A | The path of the file in which the pre-training dataset is stored. |
| --data-dir | String | No | None | N/A | The path of the directory in which the dataset for fine-tuning is stored. |
| --data-name | String | Yes | None | N/A | The name of the file in which the dataset for fine-tuning is stored. |
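The following sketch shows how the data parameters fit together in a launch command. It is illustrative only: the `rapidformer` entry point and all file paths are assumptions, not values taken from this reference.

```python
# A minimal sketch of passing the data parameters to a training launch.
# The "rapidformer" entry point and the paths are hypothetical; this
# only illustrates how the flags in the preceding table combine.
import subprocess

data_flags = [
    "--micro-batch-size", "8",              # batch size per GPU
    "--global-batch-size", "64",            # batch size across all GPUs
    "--tokenizer-type", "GPT2BPETokenizer",
    "--split", "969, 30, 1",                # train/validation/test proportions
    "--data-impl", "mmap",                  # memory-mapped indexed dataset
    "--data-path", "/mnt/data/gpt2_text_document",  # hypothetical path
]
subprocess.run(["rapidformer"] + data_flags, check=True)
```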

Regular training settings: Parameters on the transformer

| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --pretrained-model-name-or-path | String | Yes | None | N/A | The name or the path of the pretrained transformer. |
| --num-layers | Integer | Yes | None | N/A | The number of transformer layers. |
| --hidden-size | Integer | Yes | None | N/A | The dimension of the hidden layers. |
| --num-attention-heads | Integer | Yes | None | N/A | The number of heads in each self-attention layer. |
| --max-position-embeddings | Integer | Yes | None | N/A | The maximum sequence length for which position embeddings are created. |
| --seq-length | Integer | Yes | None | N/A | The length of the input sequence. |
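As a sanity check on these values, the parameter count of a GPT-style transformer can be estimated from --num-layers, --hidden-size, and --max-position-embeddings. The sketch below uses the common 12·L·H² rule of thumb and an assumed vocabulary size; it is a back-of-envelope estimate, not a Rapidformer utility.

```python
# Back-of-envelope parameter count for a GPT-like model, derived only
# from the flags in the preceding table. The 12*L*H^2 term covers the
# attention (4*H^2) and MLP (8*H^2) weights per layer; the vocabulary
# size is an assumption, not a Rapidformer flag.
num_layers = 24              # --num-layers
hidden_size = 1024           # --hidden-size
max_positions = 1024         # --max-position-embeddings
vocab_size = 50257           # assumed GPT-2 BPE vocabulary size

transformer_params = 12 * num_layers * hidden_size ** 2
embedding_params = (vocab_size + max_positions) * hidden_size
total = transformer_params + embedding_params
print(f"~{total / 1e6:.0f}M parameters")   # ~355M for these values
```

For example, 24 layers with a hidden size of 1,024 yields roughly 355 million parameters, which matches the familiar GPT-2 medium configuration.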

Regular training settings: Parameters on training

| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --task | String | Yes | None | sequence_classification, token_classification, question_answering, masked_lm, causal_lm, seq2seq_lm, pretraining | The name of the training task. |
| --save | String | Yes | None | N/A | The path to which model checkpoints are saved. |
| --lr | Float | Yes | None | N/A | The initial learning rate. |
| --lr-decay-style | String | Yes | linear | constant, linear, cosine | The learning rate decay schedule. |
| --weight-decay | Float | Yes | 0.01 | N/A | The weight decay coefficient. |
| --clip-grad | Float | No | 1.0 | N/A | The gradient clipping threshold. |
| --lr-warmup-fraction | Float | No | None | N/A | The fraction of training iterations that is used for learning rate warmup. |
| --train-iters | Integer | Yes | None | N/A | The total number of training iterations. |
| --epochs | Integer | No | None | N/A | The number of training epochs. |
| --log-interval | Integer | No | 100 | N/A | The interval, in iterations, at which logs are displayed. |
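To make the interaction of --lr, --lr-warmup-fraction, --lr-decay-style, and --train-iters concrete, the following plain-Python sketch reimplements a warmup-then-decay schedule. It illustrates the semantics of the flags; it is not PAI-Rapidformer's actual scheduler.

```python
# Illustration of how --lr, --lr-warmup-fraction, --lr-decay-style,
# and --train-iters interact; a plain-Python sketch only.
import math

def learning_rate(step, lr=1e-4, train_iters=10000,
                  warmup_fraction=0.01, decay_style="linear"):
    warmup_steps = int(warmup_fraction * train_iters)
    if step < warmup_steps:
        return lr * step / max(1, warmup_steps)      # linear warmup to --lr
    progress = (step - warmup_steps) / max(1, train_iters - warmup_steps)
    if decay_style == "constant":
        return lr
    if decay_style == "linear":
        return lr * (1.0 - progress)                 # decays to 0 at train_iters
    if decay_style == "cosine":
        return 0.5 * lr * (1.0 + math.cos(math.pi * progress))
    raise ValueError(decay_style)

print(learning_rate(50), learning_rate(5000), learning_rate(10000))
```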

Acceleration switch settings: Parameter on sparsely-gated MoEs

| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --num-experts | Integer | No | None | N/A | The number of experts in each mixture-of-experts (MoE) layer. |
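For intuition about what --num-experts controls, the toy layer below routes each token to one of several expert feed-forward networks by top-1 gating. This is the general sparsely-gated MoE idea in miniature, not Rapidformer's implementation, and all sizes are made up.

```python
# A toy top-1 gated MoE layer: each token is routed to one of
# num_experts feed-forward networks. Illustrative only.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, hidden, num_experts):
        super().__init__()
        self.gate = nn.Linear(hidden, num_experts)   # router
        self.experts = nn.ModuleList(
            nn.Linear(hidden, hidden) for _ in range(num_experts))

    def forward(self, x):                     # x: [tokens, hidden]
        expert_idx = self.gate(x).argmax(-1)  # top-1 routing per token
        out = torch.empty_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            out[mask] = expert(x[mask])       # each expert sees its tokens only
        return out

moe = TinyMoE(hidden=16, num_experts=4)       # analogous to --num-experts 4
print(moe(torch.randn(8, 16)).shape)          # torch.Size([8, 16])
```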

Acceleration switch settings: Parameter on mixed-precision training

| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --mixed-precision | Boolean | No | None | N/A | Specifies whether to enable FP16 mixed-precision training. |

Note: You can enable mixed-precision training only if you use the Trainer, Pretrainer, or Finetuner code template provided by PAI-Rapidformer.
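For intuition about what this switch enables, the snippet below shows the equivalent pattern written out in plain PyTorch automatic mixed precision. When you set --mixed-precision, the code templates handle this wiring for you; the snippet is a generic illustration, not Rapidformer code.

```python
# What FP16 mixed-precision training does, shown with plain PyTorch
# AMP for intuition (requires a CUDA GPU).
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()   # scales the loss to avoid FP16 underflow

x = torch.randn(8, 1024, device="cuda")
with torch.cuda.amp.autocast():        # forward pass runs in FP16 where safe
    loss = model(x).float().pow(2).mean()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```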

Acceleration switch settings: Parameters on model state partitioning by using ZeRO, OSS, SDP, or FSDP

| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --oss-memory-optimization | Boolean | No | N/A | N/A | Specifies whether to use Optimizer State Sharding (OSS) to partition optimizer states. |
| --oss-sdp-memory-optimization | Boolean | No | N/A | N/A | Specifies whether to use Sharded Data Parallel (SDP) to partition optimizer states and gradients. |
| --fsdp-memory-optimization | Boolean | No | N/A | N/A | Specifies whether to use Fully Sharded Data Parallel (FSDP) to partition optimizer states, gradients, and parameters. |
| --zero-1-memory-optimization | Boolean | No | N/A | N/A | Specifies whether to use Zero Redundancy Optimizer (ZeRO) to partition optimizer states. |
| --zero-2-memory-optimization | Boolean | No | N/A | N/A | Specifies whether to use ZeRO to partition optimizer states and gradients. |
| --zero-3-memory-optimization | Boolean | No | N/A | N/A | Specifies whether to use ZeRO to partition optimizer states, gradients, and parameters. |

Note: You can use ZeRO only if you use the Trainer code template. You can use OSS, SDP, or FSDP if you do not use the Trainer code template.
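The practical difference between the partitioning levels is how much of the model state each GPU must keep. The back-of-envelope sketch below follows the memory accounting from the ZeRO paper for mixed-precision Adam; the model size and GPU count are assumptions.

```python
# Per-GPU memory for the model states under each partitioning level,
# using the ZeRO paper's mixed-precision Adam accounting: 2 bytes/param
# (FP16 weights) + 2 (FP16 gradients) + 12 (FP32 master weights,
# momentum, and variance). Illustrative arithmetic only.
params = 1.5e9          # model size in parameters (assumption)
gpus = 8                # data parallel degree (assumption)

weights, grads, optim = 2 * params, 2 * params, 12 * params
stages = {
    "no partitioning":                   weights + grads + optim,
    "ZeRO-1 / OSS (optimizer states)":   weights + grads + optim / gpus,
    "ZeRO-2 / SDP (+ gradients)":        weights + (grads + optim) / gpus,
    "ZeRO-3 / FSDP (+ parameters)":      (weights + grads + optim) / gpus,
}
for name, per_gpu in stages.items():
    print(f"{name}: {per_gpu / 2**30:.1f} GiB per GPU")
```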

Acceleration switch settings: Parameters on 3D parallelism (data, tensor, and pipeline parallelism)

| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --tensor-model-parallel-size | Integer | No | 1 | N/A | The size of tensor parallelism. |
| --pipeline-model-parallel-size | Integer | No | 1 | N/A | The size of pipeline parallelism. |

Note:

  • To enable 3D parallelism and model state partitioning at the same time, you can use only ZeRO to partition optimizer states and gradients, or partition only optimizer states.

  • You can enable 3D parallelism only if you use the Trainer code template.
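The three parallelism degrees multiply together: the GPUs that are not consumed by tensor and pipeline parallelism are used for data parallelism. A quick arithmetic sketch, with an assumed cluster size:

```python
# How the 3D parallel sizes combine: the data parallel degree is what
# remains after tensor and pipeline parallelism divide up the GPUs.
world_size = 64                      # total GPUs (assumption)
tensor_parallel = 4                  # --tensor-model-parallel-size
pipeline_parallel = 2                # --pipeline-model-parallel-size

assert world_size % (tensor_parallel * pipeline_parallel) == 0
data_parallel = world_size // (tensor_parallel * pipeline_parallel)
print(f"data parallel degree: {data_parallel}")   # 8
```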

Acceleration switch settings: Parameter on graph optimization

| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --onnx-runtime-training | Boolean | No | None | N/A | Specifies whether to enable the graph optimization provided by ONNX Runtime. |
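As background, the general pattern behind ONNX Runtime training acceleration is to wrap the model so that its forward and backward passes run through ONNX Runtime's optimized graph. The snippet below shows this with Microsoft's torch-ort package for intuition only; in PAI-Rapidformer you simply set --onnx-runtime-training.

```python
# The general ONNX Runtime training pattern: wrap the model so
# forward/backward execute through ONNX Runtime's optimized graph.
# Shown with the torch-ort package; not Rapidformer internals.
import torch
from torch_ort import ORTModule   # pip install torch-ort

model = ORTModule(torch.nn.Linear(128, 128))
out = model(torch.randn(4, 128))  # executes through ONNX Runtime
```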

Acceleration switch settings: Parameter on CPU offloading

| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --cpu-offload | Boolean | No | None | N/A | Specifies whether to offload model states to CPU memory. |

Note: To enable CPU offloading and model state partitioning at the same time, you can use only ZeRO for model state partitioning.

Acceleration switch settings: Parameter on activation checkpointing

| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --checkpoint-activations | Boolean | No | None | N/A | Specifies whether to enable activation checkpointing, which recomputes activations during the backward pass instead of storing them. |
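For intuition about the trade-off this switch makes, the snippet below shows activation checkpointing in plain PyTorch: activations inside the wrapped block are not stored during the forward pass and are recomputed during the backward pass, reducing activation memory at the cost of extra compute. This is a generic illustration, not Rapidformer code.

```python
# Activation checkpointing in plain PyTorch: the block's intermediate
# activations are recomputed in the backward pass rather than stored.
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024))

x = torch.randn(8, 1024, requires_grad=True)
y = checkpoint(block, x)          # forward without saving intermediates
y.sum().backward()                # recomputes the block's forward here
```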

Acceleration switch settings: Parameters on gradient accumulation

| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --micro-batch-size | Integer | Yes | 1 | N/A | The size of each micro-batch on a single GPU. |
| --global-batch-size | Integer | Yes | 1 | N/A | The size of the global batch per optimizer update. |

Note:

  • You can enable gradient accumulation only if you use the Pretrainer or Finetuner code template to perform iteration-based pre-training. You cannot enable gradient accumulation if you perform epoch-based fine-tuning.

  • The number of gradient accumulation steps is automatically calculated from the values of the parameters in the preceding table and the data parallel size, as in the sketch after this note.
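A minimal arithmetic sketch of that derivation, assuming a hypothetical data parallel size of 8:

```python
# How the number of gradient accumulation steps falls out of the two
# batch-size flags and the data parallel degree; illustrative
# arithmetic that mirrors what the trainer derives automatically.
micro_batch_size = 4        # --micro-batch-size (per GPU, per step)
global_batch_size = 256     # --global-batch-size (per optimizer update)
data_parallel_size = 8      # number of data parallel replicas (assumption)

samples_per_step = micro_batch_size * data_parallel_size
assert global_batch_size % samples_per_step == 0
accumulation_steps = global_batch_size // samples_per_step
print(accumulation_steps)   # 8 forward/backward passes per update
```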

Acceleration switch settings: Parameter on the dynamic shape data iterator

| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --data-iterator | String | No | None | dynamic-shape, fixed-shape | The type of the data iterator. Set the value to dynamic-shape to enable the dynamic shape data iterator. |

Note: You can enable the dynamic shape data iterator only if you use the Pretrainer code template for pre-training.
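For intuition, a fixed-shape iterator typically pads every batch to --seq-length, whereas a dynamic shape iterator pads only to the longest sequence in each batch. The token counting below is an illustrative assumption about that mechanism, with made-up sequence lengths:

```python
# Why a dynamic shape iterator can help: fewer padding tokens per batch.
seq_length = 512                       # --seq-length
batch_lengths = [37, 180, 211, 64]     # sample lengths in one batch (assumption)

fixed_tokens = seq_length * len(batch_lengths)         # pad all to --seq-length
dynamic_tokens = max(batch_lengths) * len(batch_lengths)  # pad to batch max
print(f"fixed-shape: {fixed_tokens} tokens, "
      f"dynamic-shape: {dynamic_tokens} tokens "
      f"({100 * (1 - dynamic_tokens / fixed_tokens):.0f}% fewer)")
```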

Acceleration switch settings: Parameter on the operation-based fusion optimizer

| Parameter | Type | Required | Default value | Enumeration value | Description |
| --- | --- | --- | --- | --- | --- |
| --optimizers | String | Yes | apex_adam | apex_adam, apex_lamb | The operation-based fusion optimizer: the Apex fused Adam or fused LAMB optimizer. |
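For background, the apex_adam option corresponds conceptually to NVIDIA Apex's fused Adam kernel, which fuses the element-wise optimizer math into a single GPU kernel. The snippet below shows it directly with Apex for intuition; Rapidformer selects it for you via --optimizers.

```python
# Using NVIDIA Apex's fused Adam directly, for intuition about what
# the apex_adam choice does (requires Apex and a CUDA GPU).
import torch
from apex.optimizers import FusedAdam

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = FusedAdam(model.parameters(), lr=1e-4, weight_decay=0.01)

loss = model(torch.randn(8, 1024, device="cuda")).pow(2).mean()
loss.backward()
optimizer.step()   # one fused kernel updates all parameters
```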