
Platform For AI: Accelerate the training of transformers

Last Updated: Jan 03, 2024

The PAI-Rapidformer tool developed by the Machine Learning Platform for AI (PAI) team provides a variety of methods for you to accelerate the training of transformers. You only need to install a PAI-Rapidformer image to accelerate training in a black-box or white-box manner. This topic describes how to use PAI-Rapidformer to accelerate the training of PyTorch transformers.

Prerequisites

  • A PAI-Rapidformer image is installed. For more information, see Install a PAI-Rapidformer image.

  • You are familiar with the parameter settings of PAI-Rapidformer. For more information, see Parameter settings.

  • You are familiar with the methods of PAI-Rapidformer. For more information, see API.

Background information

By using PAI-Rapidformer, you can accelerate the training of transformers in a black-box or white-box manner.

Accelerate the fine-tuning of a Hugging Face transformer in a black-box manner

  1. Register your dataset with Hugging Face or search for a desired dataset that has been registered. This way, you can use the --data-path and --data-name parameters to pass the dataset to PAI-Rapidformer in Step 3.

    For more information, see Register datasets to Hugging Face and visit the Datasets page.

  2. Register your transformer with Hugging Face or search for a desired transformer that has been registered. This way, you can use the --pretrained-model-name-or-path parameter to pass a transformer to PAI-Rapidformer in Step 3.

    For more information, see How to add a model to Transformers and visit the Models page. A quick way to verify that the dataset and model names resolve is sketched at the end of this section.

  3. Configure a startup script in the CLI of PAI-Rapidformer. Sample script:

    #!/bin/bash
    export CUDA_VISIBLE_DEVICES=4,5,6,7
    export MASTER_ADDR=localhost
    export MASTER_PORT=6010
    export NNODES=1
    export NODE_RANK=0
    
    rapidformer --task sequence_classification \ # The task name.
                --pretrained-model-name-or-path 'bert-base-cased' \  # The name of the registered transformer.
                --data-path glue \                      # The path of the registered dataset.
                --data-name mrpc \                      # The file name of the registered dataset.
                --epochs 3 \                               # The number of epochs.
                --micro-batch-size 16 \                    # The size of the mini-batch on a single GPU.
                --global-batch-size 64 \                   # The size of the global batch on all GPUs during distributed training.
                --lr 2e-5 \                                # The learning rate.
                --lr-decay-style linear \                  # The scheme of learning rate decay.
                --lr-warmup-iters 100 \                    # The learning rate warmup.
                --weight-decay 1e-2 \                      # The weight decay value.
                --clip-grad 1.0 \                          # The gradient clipping value.
                --seed 42 \                                # The random seed.
                --mixed-precision \                        # Enables mixed-precision (FP16) training.
                --onnx-runtime-training \                  # Enables graph optimization provided by ONNX Runtime.
                --zero-1-memory-optimization               # Uses Zero Redundancy Optimizer (ZeRO) to partition optimizer states.

    For more information about the parameters in the preceding script, see Parameter settings.
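
Before you launch the script, you can optionally verify that the dataset and transformer names used in steps 1 and 2 resolve on Hugging Face. The following sketch uses only the standard datasets and transformers APIs; it is not part of PAI-Rapidformer, and the names shown are the ones used in the sample script.

    # Optional check: confirm that the names passed to --data-path/--data-name and
    # --pretrained-model-name-or-path can be loaded from Hugging Face.
    from datasets import load_dataset
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    dataset = load_dataset("glue", "mrpc")        # matches --data-path and --data-name
    print(dataset)                                # should show the train/validation/test splits

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # matches --pretrained-model-name-or-path
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
    print(type(model).__name__)                   # BertForSequenceClassification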

Accelerate the pre-training of a Hugging Face transformer in a black-box manner

  1. Create a memory-mapped (MMAP) dataset as the pre-training dataset. A quick check of the generated files is sketched at the end of this section.

    For more information, see Megatron-LM. Sample code:

    python preprocess_data.py \
      --input book_wiki_owtv2_small.json  \
      --output-prefix gpt_small \
      --vocab gpt2-vocab.json \
      --dataset-impl mmap \
      --tokenizer-type GPT2BPETokenizer \
      --merge-file gpt2-merges.txt \
      --append-eod
  2. Register your transformer with Hugging Face or search for a desired transformer that has been registered. This way, you can use the --pretrained-model-name-or-path parameter to pass a transformer to PAI-Rapidformer in Step 3.

    For more information, see How to add a model to Transformers and visit the Models page.

  3. Configure a startup script in the CLI of PAI-Rapidformer. Sample script:

    #!/bin/bash
    export CUDA_VISIBLE_DEVICES=4,5,6,7
    export MASTER_ADDR=localhost
    export MASTER_PORT=6010
    export NNODES=1
    export NODE_RANK=0
    
    rapidformer --task pretraining \
           --pretrained-model-name-or-path 'bert-base-uncased' \
           --num-layers 12 \
           --hidden-size 768 \
           --num-attention-heads 12 \
           --micro-batch-size 16 \
           --global-batch-size 128 \               # The size of the global batch.
           --seq-length 512 \
           --tokenizer-type BertWordPieceLowerCase \
           --max-position-embeddings 512 \
           --train-iters 100 \
           --data-path book_wiki_owtv2_small_text_sentence \
           --vocab-file bert-en-uncased-vocab.txt  \
           --data-impl mmap \
           --split 980,20 \
           --lr 1e-3 \
           --lr-decay-style linear \
           --min-lr 0.0 \
           --lr-decay-iters 2000 \
           --weight-decay 1e-2 \
           --clip-grad 1.0 \
           --lr-warmup-fraction .01 \
           --mixed-precision \                    # Enables mixed-precision (FP16) training.
           --onnx-runtime-training \              # Enables graph optimization provided by ONNX Runtime.
           --fsdp-memory-optimization             # Uses Fully Sharded Data Parallel (FSDP) to partition optimizer states, gradients, and parameters.

    For more information about the parameters in the preceding script, see Parameter settings.
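
As a quick sanity check before pre-training, you can confirm that the preprocessing step produced the indexed dataset files that --data-path points to. The sketch below assumes the default Megatron-LM layout, in which each dataset prefix corresponds to a .bin/.idx pair; adjust the prefix to match the files that preprocess_data.py actually generated in your environment.

    # Optional check: a Megatron-style mmap dataset is stored as a .bin/.idx pair per prefix.
    # The prefix below matches the --data-path value in the sample startup script.
    import os

    prefix = "book_wiki_owtv2_small_text_sentence"
    for suffix in (".bin", ".idx"):
        path = prefix + suffix
        status = "{} bytes".format(os.path.getsize(path)) if os.path.exists(path) else "MISSING"
        print(path, status)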

Accelerate the fine-tuning of a Hugging Face transformer in a white-box manner by using the Finetuner code template

This section describes how to use the Finetuner code template provided by PAI-Rapidformer to accelerate the fine-tuning of a Hugging Face transformer. Pay attention to the following four methods in the code template:

  • train_valid_test_datasets_provider: Implement this method to provide the training, validation, and test data.

  • model_optimizer_lr_scheduler_provider: Implement this method to construct the transformer, optimizer, and learning rate scheduler.

  • run_forward_step: Implement this method to define the logic of forward propagation.

  • run_compute_metrics: Implement this method to compute evaluation metrics on the validation dataset while the transformer is trained.

For more information about these methods, see API. The following information describes the inputs and outputs of these methods:

class MyFintuner(Finetuner):

    def __init__(self, engine):
        super().__init__(engine=engine)

    # Obtains the training, validation, and test datasets.
    # Input: none.
    # Output: three dataset objects and a collate function.
    def train_valid_test_datasets_provider(self):

        return train_dataset, valid_dataset, test_dataset, collate_fn

    # Creates a transformer, an optimizer, and a learning rate scheduler.
    # Input: none.
    # Output: three objects.
    def model_optimizer_lr_scheduler_provider(self):

        return model, optimizer, lr_scheduler

    # Defines the logic of forward propagation.
    # Input: a batch or an iterator, and a transformer.
    # Output: loss.
    def run_forward_step(self, batch_or_iterator, model):
        return loss

    # Defines the logic of using the validation dataset. This logic is tailored to the fine-tuning of transformers.
    # Input: a transformer and a data loader for the validation dataset.
    # Output: a metric object.
    def run_compute_metrics(self, model, eval_dataloader):
        return metric
                

After you familiarize yourself with the code template, follow the instructions in the Accelerate the fine-tuning of a Hugging Face transformer in a black-box manner section to prepare a dataset and a transformer. Then, perform the following steps:

  1. Import the methods of PAI-Rapidformer and the Hugging Face transformer.

    import torch
    from transformers import AutoConfig, AutoTokenizer, BertForSequenceClassification
    from datasets import load_dataset, load_metric
    from rapidformer import RapidformerEngine
    from rapidformer import get_args
    from rapidformer import get_logger
    from rapidformer import get_timers
    from rapidformer import Finetuner
    from rapidformer import Pretrainer
    from rapidformer import build_train_valid_test_datasets_for_huggingface
  2. Complete the configurations of the four methods in the code template. Sample code:

    class MyFintuner(Finetuner):
        def __init__(self,engine):
            super().__init__(engine=engine)
    
        def train_valid_test_datasets_provider(self):
            args = get_args()
            tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    
            def tokenize_function(examples):
                # max_length=None => use the model max length (it's actually the default)
                outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
                return outputs
    
            datasets = load_dataset(args.dataset_path, args.dataset_name)
            # Apply the method we just defined to all the examples in all the splits of the dataset
            tokenized_datasets = datasets.map(
                tokenize_function,
                batched=True,
                remove_columns=["idx", "sentence1", "sentence2"],
            )
            tokenized_datasets.rename_column_("label", "labels")
    
            train_dataset = tokenized_datasets["train"]
            valid_dataset = tokenized_datasets['validation']
            test_dataset = tokenized_datasets['test']
    
            def collate_fn(examples):
                return tokenizer.pad(examples, padding="longest", return_tensors="pt")
    
            return train_dataset, valid_dataset, test_dataset, collate_fn
    
        def model_optimizer_lr_scheduler_provider(self):
            args = get_args()
            model = BertForSequenceClassification.from_pretrained(args.load)
            return model, None, None
    
        def run_forward_step(self, batch, model):
            output_tensor = model(**batch)
            return output_tensor.loss
    
        # after each epoch run metric on eval dataset
        def run_compute_metrics(self, model, eval_dataloader):
            args = get_args()
            model = model[0]
            metric = load_metric(args.dataset_path, args.dataset_name)
            for step, batch in enumerate(eval_dataloader):
                with torch.no_grad():
                    outputs = model(**batch)
                predictions = outputs.logits.argmax(dim=-1)
    
                metric.add_batch(
                    predictions=self.gather(predictions),
                    references=self.gather(batch["labels"]),
                )
    
            eval_metric = metric.compute()
            return eval_metric
                            
  3. Initialize PAI-Rapidformer and create a trainer object. Call the train() method of the trainer and save the code to a file named rapidformer_finetune_huggingface_bert_trainer.py.

    engine = RapidformerEngine()
    trainer = MyFintuner(engine=engine)
    trainer.train()
  4. Configure a startup script in the CLI of PAI-Rapidformer. In the startup script, set the --user-script parameter to rapidformer_finetune_huggingface_bert_trainer.py and configure acceleration switches. Sample script:

    #!/bin/bash
    export CUDA_VISIBLE_DEVICES=4,5,6,7
    export MASTER_ADDR=localhost
    export MASTER_PORT=6010
    export NNODES=1
    export NODE_RANK=0
    
    rapidformer --user-script rapidformer_finetune_huggingface_bert_trainer.py \
                --task sequence_classification \
                --pretrained-model-name-or-path 'bert-base-cased' \
                --data-path glue \
                --data-name mrpc \
                --epochs 3 \
                --micro-batch-size 16 \
                --global-batch-size 16 \
                --lr 2e-5 \
                --lr-decay-style linear \
                --lr-warmup-iters 100 \
                --weight-decay 1e-2 \
                --clip-grad 1.0 \
                --mixed-precision \                               # Enables mixed-precision (FP16) training.
                --zero-3-memory-optimization \                    # Uses ZeRO to partition optimizer states, gradients, and parameters.
                --onnx-runtime-training                           # Enables graph optimization provided by ONNX Runtime.

Accelerate the pre-training of a Hugging Face transformer in a white-box manner by using the Pretrainer code template

When you accelerate the pre-training of a Hugging Face transformer in a white-box manner by using the Pretrainer code template provided by PAI-Rapidformer, pay attention to the following methods in the code template:

  • train_valid_test_datasets_provider: Implement this method to provide the training, validation, and test data.

  • model_optimizer_lr_scheduler_provider: Implement this method to construct the transformer, optimizer, and learning rate scheduler.

  • run_forward_step: Implement this method to define the logic of forward propagation.

For more information about these methods, see API. For more information about the inputs and outputs of these methods, see the Accelerate the fine-tuning of a Hugging Face transformer in a white-box manner by using the Finetuner code template section in this topic.

After you familiarize yourself with the code template, follow the instructions in the Accelerate the fine-tuning of a Hugging Face transformer in a black-box manner section to prepare a dataset and a transformer. Then, perform the following steps:

  1. Import the methods of PAI-Rapidformer and the Hugging Face transformer.

    Note

    Because model pre-training relies on iterators to read data, you need to use the Megatron mpu (model parallel unit) module to process data in parallel.

    import torch
    from megatron import mpu
    from transformers import AutoModelForPreTraining, BertConfig, BertForPreTraining
    from rapidformer import RapidformerEngine, get_args, PreTrainer
    from rapidformer import build_train_valid_test_datasets_for_huggingface
  2. Complete the configurations of methods for pre-training acceleration by using a class derived from PreTrainer. Sample code:

    class MyBertPreTrainer(PreTrainer):
    
        def __init__(self,engine):
            super().__init__(engine=engine)
    
        def train_valid_test_datasets_provider(self, train_val_test_num_samples):
            args = get_args()
    
            train_ds, valid_ds, test_ds = build_train_valid_test_datasets_for_huggingface(
                data_prefix=args.data_path,
                data_impl=args.data_impl,
                splits_string=args.split,
                train_valid_test_num_samples=train_val_test_num_samples,
                max_seq_length=args.seq_length,
                masked_lm_prob=args.mask_prob,
                short_seq_prob=args.short_seq_prob,
                seed=args.seed,
                skip_warmup=(not args.mmap_warmup),
                binary_head=True)
    
            return train_ds, valid_ds, test_ds
    
        def model_optimizer_lr_scheduler_provider(self):
            args = get_args()
            model = AutoModelForPreTraining.from_pretrained(args.pretrained_model_name_or_path)
            return model, None, None
    
        def run_forward_step(self, data_iterator, model):
            # Items and their type.
            keys = ['input_ids', 'attention_mask', 'token_type_ids', 'labels', 'next_sentence_label']
            datatype = torch.int64
    
            # Broadcast data.
            if data_iterator is not None:
                data = next(data_iterator)
            else:
                data = None
            data_b = mpu.broadcast_data(keys, data, datatype)
            input_ids = data_b['input_ids'].long()
            attention_mask = data_b['attention_mask'].long()
            token_type_ids = data_b['token_type_ids'].long()
            labels = data_b['labels'].long()
            next_sentence_label = data_b['next_sentence_label'].long()
            output_tensor = model(input_ids=input_ids, attention_mask=attention_mask,
                                  token_type_ids=token_type_ids, labels=labels, next_sentence_label=next_sentence_label)
    
            return output_tensor['loss']
  3. Initialize PAI-Rapidformer and create a trainer object. Call the train() method of the trainer and save the code to a file named rapidformer_pretrain_huggingface_bert_trainer.py.

    engine = RapidformerEngine()
    trainer = MyBertPreTrainer(engine=engine)
    trainer.train()
  4. Configure a startup script in the CLI of PAI-Rapidformer. In the startup script, configure acceleration switches. Sample script:

    #!/bin/bash
    export CUDA_VISIBLE_DEVICES=4,5,6,7
    export MASTER_ADDR=localhost
    export MASTER_PORT=6010
    export NNODES=1
    export NODE_RANK=0
    
    DATA_PATH=book_wiki_owtv2_small_text_sentence
    
    rapidformer --user-script rapidformer_pretrain_huggingface_bert_trainer.py \
           --pretrained-model-name-or-path 'bert-base-uncased' \
           --num-layers 12 \
           --hidden-size 768 \
           --num-attention-heads 12 \
           --micro-batch-size 16 \
           --global-batch-size 64 \
           --seq-length 512 \
           --tokenizer-type BertWordPieceLowerCase \
           --max-position-embeddings 512 \
           --train-iters 100 \
           --data-path $DATA_PATH \
           --vocab-file bert-en-uncased-vocab.txt  \
           --data-impl mmap \                               # Uses the memory-mapped (mmap) dataset implementation.
           --split 980,20 \
           --lr 1e-3 \
           --lr-decay-style linear \
           --weight-decay 1e-2 \
           --clip-grad 1.0 \
           --lr-warmup-fraction .01 \
           --zero-3-memory-optimization \                    # Uses ZeRO to partition optimizer states, gradients, and parameters.
           --onnx-runtime-training \                         # Enables graph optimization provided by ONNX Runtime.
           --mixed-precision                                 # Enables mixed-precision (FP16) training.
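
For reference, the run_forward_step method in step 2 reads five BERT pre-training fields from each batch and casts them to int64. The following hypothetical dummy batch only illustrates the expected keys, shapes, and dtypes; real batches come from the datasets built in train_valid_test_datasets_provider and are broadcast with mpu.broadcast_data.

    # Illustration only: the keys, shapes, and dtypes that run_forward_step expects per batch.
    import torch

    batch_size, seq_length = 2, 512
    dummy_batch = {
        "input_ids": torch.zeros(batch_size, seq_length, dtype=torch.int64),
        "attention_mask": torch.ones(batch_size, seq_length, dtype=torch.int64),
        "token_type_ids": torch.zeros(batch_size, seq_length, dtype=torch.int64),
        "labels": torch.full((batch_size, seq_length), -100, dtype=torch.int64),  # -100 is ignored by the MLM loss
        "next_sentence_label": torch.zeros(batch_size, dtype=torch.int64),
    }
    print({key: tuple(value.shape) for key, value in dummy_batch.items()})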

Accelerate the fine-tuning of a Hugging Face transformer in a white-box manner by using a custom trainer

For acceleration based on custom trainers, PAI-Rapidformer provides only a few acceleration features, such as Apex FusedAdam, model state partitioning, and graph optimization. Because mixed-precision training requires complex modifications to the training process, we recommend that you use the code templates provided by PAI-Rapidformer to accelerate training. This section provides an example of how to modify the code that is used to fine-tune a Hugging Face transformer to apply intrusive acceleration.

Sample code for the fine-tuning of a Hugging Face transformer:

import torch
from datasets import load_dataset, load_metric
from torch.utils.data import DataLoader
from transformers import (
    AdamW,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    get_linear_schedule_with_warmup,
    BertForSequenceClassification,

)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
datasets = load_dataset("glue", "mrpc")
metric = load_metric("glue", "mrpc")

def tokenize_function(examples):
    # max_length=None => use the model max length (it's actually the default)
    outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=None)
    return outputs

tokenized_datasets = datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=["idx", "sentence1", "sentence2"],
)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", return_dict=True)

optimizer = AdamW(params=model.parameters(), lr=args.lr, correct_bias=True)

lr_scheduler = get_linear_schedule_with_warmup(
    optimizer=optimizer,
    num_warmup_steps=args.lr_warmup_iters,
    num_training_steps=args.train_iters
)

device = torch.device("cuda", args.local_rank)

for epoch in range(args.epochs):
    model.train()
    for step, batch in enumerate(train_dataloader):
        batch.to(device)
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

    model.eval()
    for step, batch in enumerate(eval_dataloader):
        batch.to(device)
        with torch.no_grad():
            outputs = model(**batch)
            predictions = outputs.logits.argmax(dim=-1)
            metric.add_batch(
                    predictions=engine.gather(predictions),
                    references=engine.gather(batch["labels"]))

    eval_metric = metric.compute()
    print("epoch {}: {}".format(epoch, eval_metric))

The preceding code does not support data-parallel training, high-speed optimizer execution, or mixed-precision training, so it needs to be optimized. You can use the methods of PAI-Rapidformer to optimize the code, as described in the following steps. A consolidated sketch of the accelerated loop is provided at the end of this section.

  1. Enable parallel data processing.

    Create a finetuner object. Then, call the finetuner.build_data_loader method to build a data loader. The data loader supports parallel data processing and automatically sends data to GPUs. In this case, you can delete batch.to(device) from the preceding sample code.

    + from rapidformer import RapidformerEngine
    + engine = RapidformerEngine()
    + finetuner = Finetuner(engine=engine)
    
    - train_dataloader = DataLoader(tokenized_datasets["train"])
    - eval_dataloader = DataLoader(tokenized_datasets["train"])
    
    + train_dataloader = finetuner.build_data_loader(tokenized_datasets["train"])
    + eval_dataloader = finetuner.build_data_loader(tokenized_datasets["validation"])
  2. Enable Apex FusedAdam.

    Replace the original optimizer with the Apex FusedAdam optimizer that is provided by PAI-Rapidformer. To do so, call the engine.compose method to encapsulate the transformer, optimizer, and learning rate scheduler.

    + from rapidformer import RapidformerEngine
    + engine = RapidformerEngine()
    + finetuner = Finetuner(engine=engine)
    
    - optimizer = AdamW(params=model.parameters(), lr=args.lr, correct_bias=True)
    - lr_scheduler = get_linear_schedule_with_warmup(optimizer=optimizer,
        num_warmup_steps=args.lr_warmup_iters,
        num_training_steps=args.train_iters
    )
    
    
    + from functools import partial
    + lr_scheduler = partial(
            get_linear_schedule_with_warmup,
            num_warmup_steps=args.lr_warmup_iters,
            num_training_steps=args.train_iters
        )
    
    + model, optimizer, lr_scheduler = engine.compose(model_obj=model,
          lr_scheduler_fn=lr_scheduler)
    Note

    If you enable parallel data processing, Apex FusedAdam, and mixed precision, mixed-precision training requires you to perform specific operations, such as modifying the training process, casting data to FP16, and enabling loss scaling. This involves complex frontend modifications. To avoid such modifications, you can implement acceleration by using a trainer. The Finetuner code template provided by PAI-Rapidformer supports various acceleration features, such as parallel data processing, Apex FusedAdam, PyTorch mixed-precision training, mixed-precision training provided by a Megatron optimizer, and memory optimization provided by DeepSpeed and FairScale.
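
Putting the two changes together, the accelerated custom loop might look like the following sketch. It assumes that model, tokenized_datasets, metric, and the args values are prepared as in the original sample, and that the objects returned by engine.compose are drop-in replacements for their original counterparts; mixed precision is intentionally not enabled, as explained in the preceding note.

    # A sketch of the fine-tuning loop after the two changes above are applied. Assumptions:
    # model, tokenized_datasets, metric, and the args values are prepared as in the original
    # sample, and the objects returned by engine.compose are drop-in replacements.
    from functools import partial

    import torch
    from transformers import get_linear_schedule_with_warmup
    from rapidformer import RapidformerEngine, Finetuner

    engine = RapidformerEngine()
    finetuner = Finetuner(engine=engine)

    # Data loaders that handle data parallelism and device placement, so batch.to(device) is not needed.
    train_dataloader = finetuner.build_data_loader(tokenized_datasets["train"])
    eval_dataloader = finetuner.build_data_loader(tokenized_datasets["validation"])

    # Encapsulate the transformer and the learning rate scheduler; the optimizer is supplied by the engine.
    lr_scheduler_fn = partial(
        get_linear_schedule_with_warmup,
        num_warmup_steps=args.lr_warmup_iters,
        num_training_steps=args.train_iters,
    )
    model, optimizer, lr_scheduler = engine.compose(model_obj=model, lr_scheduler_fn=lr_scheduler_fn)

    for epoch in range(args.epochs):
        model.train()
        for step, batch in enumerate(train_dataloader):
            outputs = model(**batch)
            loss = outputs.loss
            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

        model.eval()
        for step, batch in enumerate(eval_dataloader):
            with torch.no_grad():
                outputs = model(**batch)
            predictions = outputs.logits.argmax(dim=-1)
            metric.add_batch(
                predictions=engine.gather(predictions),
                references=engine.gather(batch["labels"]),
            )

        eval_metric = metric.compute()
        print("epoch {}: {}".format(epoch, eval_metric))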

Accelerate the pre-training of a Megatron transformer in a white-box manner by using the Pretrainer code template

Compared with the acceleration method described in the Accelerate the fine-tuning of a Hugging Face transformer in a white-box manner by using a custom trainer section, this acceleration method is more flexible. You do not need to use Data Hub or Model Hub. Instead, you can call the train_valid_test_datasets_provider, model_optimizer_lr_scheduler_provider, and run_forward_step methods to define the logic of custom data production, custom transformer construction, and forward propagation.

  1. Create a memory-mapped (MMAP) dataset as the pre-training dataset.

    For more information, see Megatron-LM. Sample code:

    python preprocess_data.py \
      --input /apsarapangu/disk2/jerry.lp/pretrain_datasets/en/book_wiki_owtv2_small.json  \
      --output-prefix /apsarapangu/disk2/jerry.lp/pretrain_datasets/en/gpt_small \
      --vocab gpt2-vocab.json \
      --dataset-impl mmap \
      --tokenizer-type GPT2BPETokenizer \
      --merge-file gpt2-merges.txt \
      --append-eod
  2. Use a class derived from PreTrainer to configure the train_valid_test_datasets_provider method that is used to produce custom data.

    You can define the logic of custom data production to create the training, validation, and test datasets, without the need to use third-party libraries. The datasets must inherit the torch.utils.data.Dataset class (a minimal example of such a dataset is sketched at the end of this topic). Sample code:

    from rapidformer import RapidformerEngine, get_args, PreTrainer
    
    class MegatronGPTPreTrainer(PreTrainer):
        def __init__(self,
                     engine,
                     ):
            super().__init__(engine=engine)
    
        def train_valid_test_datasets_provider(self, train_val_test_num_samples):
            args = get_args()
    
            train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
                data_prefix=args.data_path,
                data_impl=args.data_impl,
                splits_string=args.split,
                train_valid_test_num_samples=train_val_test_num_samples,
                seq_length=args.seq_length,
                seed=args.seed,
                skip_warmup=(not args.mmap_warmup))
    
            return train_ds, valid_ds, test_ds
  3. Use a class derived from PreTrainer to configure the model_optimizer_lr_scheduler_provider method that is used to construct a custom transformer.

    You can define the logic of custom transformer construction to construct a custom transformer, without the need to use third-party libraries. The model must inherit the torch.nn.Module class. Sample code:

    from rapidformer import RapidformerEngine, get_args, PreTrainer
    from yourmodel import GPTModel
    
    class MegatronGPTPreTrainer(PreTrainer):
        def __init__(self,
                     engine,
                     ):
            super().__init__(engine=engine)
    
        def model_optimizer_lr_scheduler_provider(self):
            model = GPTModel()
            return model, None, None
  4. Use a class derived from PreTrainer to configure the run_forward_step method that is used to realize forward propagation.

    import torch
    from megatron import mpu, get_tokenizer
    from megatron.utils import get_ltor_masks_and_position_ids
    from rapidformer import RapidformerEngine, get_args, PreTrainer
    
    class MyGPTPreTrainer(PreTrainer):
        def __init__(self,
                     engine,
                     ):
            super().__init__(engine=engine)
    
    
        def run_forward_step(self, data_iterator, model):
            """Forward step."""
            args = get_args()
    
            tokenizer = get_tokenizer()
    
            # Items and their type.
            keys = ['text']
            datatype = torch.int64
    
            # Broadcast data.
            if data_iterator is not None:
                data = next(data_iterator)
            else:
                data = None
            data_b = mpu.broadcast_data(keys, data, datatype)
    
            # Unpack.
            tokens_ = data_b['text'].long()
            labels = tokens_[:, 1:].contiguous()
            tokens = tokens_[:, :-1].contiguous()
    
            # Get the masks and position ids.
            attention_mask, loss_mask, position_ids = get_ltor_masks_and_position_ids(
                tokens,
                tokenizer.eod,
                args.reset_position_ids,
                args.reset_attention_mask,
                args.eod_mask_loss)
    
            output_tensor = model(tokens, position_ids, attention_mask,
                                  labels=labels)
    
            losses = output_tensor.float()
            loss_mask = loss_mask.view(-1).float()
            loss = torch.sum(losses.view(-1) * loss_mask) / loss_mask.sum()
    
            return loss
    
    
                            
  5. Initialize PAI-Rapidformer and create a trainer object. Then, call the train() method of the trainer and save the code to a file named rapidformer_pretrain_megatron_gpt_trainer.py.

    engine = RapidformerEngine()
    trainer = MyGPTPreTrainer(engine=engine)
    trainer.train()
  6. Configure a startup script in the CLI of PAI-Rapidformer. In the startup script, configure acceleration switches. Sample script:

    #!/bin/bash
    export CUDA_VISIBLE_DEVICES=4,5,6,7
    export MASTER_ADDR=localhost
    export MASTER_PORT=6010
    export NNODES=1
    export NODE_RANK=0
    
    DATA_PATH=book_wiki_owtv2_small_text_sentence
    PRETRAINED_CHECKPOINT=
    
    rapidformer --user-script rapidformer_pretrain_megatron_gpt_trainer.py \
           --tensor-model-parallel-size 2 \          # The size of tensor parallelism.
           --pipeline-model-parallel-size 2 \        # The size of pipeline parallelism.
           --num-layers 12 \
           --hidden-size 768 \
           --num-attention-heads 12 \
           --micro-batch-size 16 \
           --global-batch-size 128 \                  # The size of the global batch.
           --seq-length 512 \
           --tokenizer-type GPT2BPETokenizer \
           --max-position-embeddings 512 \
           --train-iters 100 \
           --data-path $DATA_PATH \
           --vocab-file gpt2-vocab.json \
           --merge-file gpt2-merges.txt \
           --data-impl mmap \                         # Uses the memory-mapped (mmap) dataset implementation.
           --split 980,20 \
           --lr 1e-3 \
           --lr-decay-style linear \
           --weight-decay 1e-2 \
           --clip-grad 1.0 \
           --lr-warmup-fraction .01 \
           --log-interval 1 \
           --zero-3-memory-optimization \              # Uses ZeRO to partition optimizer states, gradients, and parameters.
           --checkpoint-activations \                  # Enables activation checkpointing.
           --mixed-precision                           # Enables mixed-precision (FP16) training.
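
As noted in step 2, the custom datasets only need to inherit torch.utils.data.Dataset. The following is a hypothetical minimal sketch of such a dataset: it serves fixed-length windows of pre-tokenized token IDs under the 'text' key that run_forward_step reads in step 4. The class name, data source, and chunking logic are illustrative assumptions, not part of PAI-Rapidformer.

    # Hypothetical minimal dataset for the custom data path in step 2. It only needs to subclass
    # torch.utils.data.Dataset; the 'text' key matches what run_forward_step reads in step 4.
    import numpy as np
    import torch
    from torch.utils.data import Dataset


    class ToyTokenDataset(Dataset):
        """Serves fixed-length windows of pre-tokenized token IDs."""

        def __init__(self, token_ids, seq_length):
            self.token_ids = np.asarray(token_ids, dtype=np.int64)
            self.seq_length = seq_length

        def __len__(self):
            # Each sample is seq_length + 1 tokens so that inputs and shifted labels can be built.
            return max(0, (len(self.token_ids) - 1) // self.seq_length)

        def __getitem__(self, idx):
            start = idx * self.seq_length
            chunk = self.token_ids[start:start + self.seq_length + 1]
            return {"text": torch.from_numpy(chunk)}


    # Example: wrap random token IDs just to exercise the interface.
    dataset = ToyTokenDataset(np.random.randint(0, 50257, size=10_000), seq_length=512)
    print(len(dataset), dataset[0]["text"].shape)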