If PyTorch foundation model training fails, you can resume from the most recent checkpoint saved by EasyCkpt without repeating computations, which saves time and costs. EasyCkpt is a high-performance checkpoint framework provided by Platform for AI (PAI) for PyTorch foundation model training. EasyCkpt saves training progress at almost no cost and can save and resume model training without compromising the overall training process. EasyCkpt supports the Megatron and DeepSpeed foundation model training frameworks. This topic describes the technical principles of EasyCkpt and how to use it.
Background information
A typical challenge in foundation model training is keeping the process uninterrupted. During training, hardware failures, system problems, connection errors, and other unknown issues may occur. Frequent interruptions severely slow down the training of foundation models, which already require a significant amount of time and resources. Checkpoint operations can be used to save and resume training progress, but the time required for a checkpoint operation depends on the size of the model. In most cases, a foundation model with tens to hundreds of billions of parameters needs several minutes to dozens of minutes to complete a checkpoint operation, and the training task is suspended during that period. This prevents users from checkpointing frequently. If training is interrupted, the iterations since the last checkpoint must be recalculated after the system recovers, which may take several hours. For a training job that runs on a thousand GPUs, this recalculation results in significant losses because of the large amount of compute involved.
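As an illustrative calculation, if checkpoints are saved every two hours, a failure shortly before the next checkpoint can waste up to two hours of work on all 1,000 GPUs, which is roughly 2,000 GPU hours of computation, in addition to the time spent on the checkpoint operations themselves. These numbers are for illustration only.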
Therefore, a low-cost method is required that saves the latest checkpoint when errors occur and avoids repeated calculation during training recovery. This saves time and costs.
Principles
Based on previous failures, GPU and deep learning failures have the following characteristics:
Characteristic 1: Failure impacts specific workers
In most cases, the root cause of a failure can be traced back to one or two machines, which affects only several workers. Not all workers in a large-scale distributed training job fail.
Characteristic 2: Failure impacts specific components of a server
In most cases, the following scenarios occur in the clusters:
When errors occur on GPUs, the CPU and memory of the server work as expected.
The amount of free memory on the node is large, usually much larger than the size of the model state.
Errors occur only on specific network interfaces of a node. The node can still communicate over the remaining interfaces even if it is not working as expected.
Characteristic 3: Failure impacts specific parts of the model
In most cases, foundation model training uses optimization methods such as 3D parallelism or Zero Redundancy Optimizer (ZeRO). Most jobs have more than one data-parallel replica, which ensures that the model training parameters have copies on multiple replicas. When a GPU on a machine fails, training can be recovered by using the copies that are retained on the GPUs of other machines.
Based on these characteristics of foundation model training scenarios, PAI provides the EasyCkpt framework for high-performance checkpointing. EasyCkpt saves model state at almost no cost and can save and resume training without compromising the overall training process by adopting strategies such as asynchronous hierarchical checkpointing, overlapping checkpointing with computation, and network-aware asynchronous checkpointing. EasyCkpt supports the Megatron and DeepSpeed foundation model training frameworks and requires only minimal code modifications.
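To illustrate the general idea behind asynchronous hierarchical checkpointing, the following minimal sketch copies the model state from GPU memory to host memory at a high frequency and persists the in-memory copy to storage in a background thread at a lower frequency. The function names and intervals are hypothetical and are not part of the EasyCkpt API; they only demonstrate the technique.

import threading

import torch


def snapshot_to_host(model):
    # Copy the model state from GPU memory to host memory.
    # This copy is fast compared with writing to persistent storage.
    return {name: param.detach().cpu().clone() for name, param in model.state_dict().items()}


def persist_async(host_state, path):
    # Write the in-memory copy to storage in a background thread so that
    # the training computation is not blocked by slow I/O.
    thread = threading.Thread(target=torch.save, args=(host_state, path), daemon=True)
    thread.start()
    return thread


def train_loop(model, train_iters):
    for iteration in range(1, train_iters + 1):
        # ... forward pass, backward pass, and optimizer step ...
        host_state = snapshot_to_host(model)   # copy to host memory every iteration
        if iteration % 5 == 0:                  # persist to storage every 5 iterations
            persist_async(host_state, f"./ckpt_iter_{iteration}.pt")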
Procedure
Install the SDK for AIMaster
You must install the SDK for AIMaster before you can use EasyCkpt. Run one of the following commands based on your Python version:
# py36
pip install -U http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster/pai_aimaster-1.2.1-cp36-cp36m-linux_x86_64.whl
# py38
pip install -U http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster/pai_aimaster-1.2.1-cp38-cp38-linux_x86_64.whl
# py310
pip install -U http://odps-release.cn-hangzhou.oss.aliyun-inc.com/aimaster/pai_aimaster-1.2.1-cp310-cp310-linux_x86_64.whl
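To verify the installation, you can optionally try importing the EasyCkpt modules. The module paths below are taken from the sample code in this topic and assume that the corresponding training framework (Megatron or Transformers) is also installed in your environment:

# Optional check: import the EasyCkpt modules that are used in this topic.
import aimaster.python.torch.easyckpt.megatron
import aimaster.python.torch.easyckpt.transformers
print("EasyCkpt modules imported successfully")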
Megatron
Sample code
Modify the code in the training.py file of the Megatron framework. You also need to add one import line to the actual training script. In this example, the pretrain_gpt.py file is used.
The following code provides an example of the modified training.py file:
from megatron.core.utils import get_model_config
from megatron import print_rank_0
from megatron import print_rank_last
# from megatron.checkpointing import load_checkpoint
from megatron.checkpointing import save_checkpoint
from megatron.model import Float16Module
from megatron.model import GPTModel
from megatron.utils import report_memory
from megatron.model.vision.knn_monitor import compute_feature_bank
# Import the EasyCkpt interfaces. load_checkpoint replaces the native
# Megatron load_checkpoint import that is commented out above.
from aimaster.python.torch.easyckpt.megatron import (load_checkpoint,
                                                     initialize_easyckpt,
                                                     save_checkpoint_if_needed)


def print_datetime(string):
    """Note that this call will sync across all ranks."""
    ...

# ... (inside the train() function of training.py)
    timers('interval-time', log_level=0).start(barrier=True)
    print_datetime('before the start of training step')
    report_memory_flag = True
    # Initialize EasyCkpt before entering the training loop.
    initialize_easyckpt(save_mem_interval=1, save_storage_interval=5, max_ckpt_num=5, log_file_path='./test.log')

    while iteration < args.train_iters:
        # ... existing training-step code ...
        # Trigger an EasyCkpt in-memory checkpoint at the end of each iteration.
        save_checkpoint_if_needed(iteration, model, optimizer, opt_param_scheduler)

        # Logging.
        loss_scale = optimizer.get_loss_scale().item()
        params_norm = None
        # ...
The following code provides an example of the modified training script. In this example, the pretrain_gpt.py file is used:
from megatron.utils import average_losses_across_data_parallel_group
from megatron.arguments import core_transformer_config_from_args
# Import the EasyCkpt hook for Megatron.
import aimaster.python.torch.easyckpt.megatron.hook


def model_provider(pre_process=True, post_process=True):
    """Build the model."""
    ...
Description
The EasyCkpt framework provides the following interfaces for Megatron:
load_checkpoint(model, optimizer, opt_param_scheduler, load_arg='load', strict=True, concat=False): This interface adds the concat parameter to the signature of the native load_checkpoint() function of the Megatron framework. If you use Megatron 2304, you only need to replace the load_checkpoint function of Megatron. If you use Megatron 2305 or 2306, take note of the note below.
initialize_easyckpt(save_mem_interval, save_storage_interval, max_ckpt_num, log_file_path=None): This interface initializes the EasyCkpt framework. Specify the frequency of in-memory copies by using save_mem_interval, the frequency of asynchronous persistence to storage by using save_storage_interval, the maximum number of checkpoints that can be retained in the storage device by using max_ckpt_num, and the log path by using log_file_path if you need to save detailed log information.
save_checkpoint_if_needed(iteration, model, optimizer, opt_param_scheduler): This interface calls the EasyCkpt framework to perform an in-memory checkpoint. The parameters are existing variables in the Megatron code. You do not need to define them yourself.
Note: If you use Megatron 2305 or 2306 and enable the distributed optimizer, you must set the concat parameter of the load_checkpoint() function in the training.py file to True in either of the following situations: you want to change the number of instances when loading a checkpoint, or you want to merge parameters for the distributed optimizer.
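For example, if you use Megatron 2305 or 2306 with the distributed optimizer enabled and you want to change the number of instances when resuming, the call site in the training.py file might look as follows. This is a sketch based on the interface signature above; the surrounding variable names are taken from the Megatron sample code.

# Merge the distributed optimizer parameters when loading the checkpoint.
iteration = load_checkpoint(model, optimizer, opt_param_scheduler, concat=True)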
DeepSpeed
In most cases, DeepSpeed tasks are started by using the Trainer of the Transformers library. EasyCkpt supports this method and requires only minimal modifications.
Sample code
Startup parameters: The EasyCkpt framework for DeepSpeed reuses the checkpoint-related parameters of the Transformers library. The parameters have the same meanings as defined in Transformers. For more information, see the Description section below. In the sample code, a checkpoint is saved every two training steps, and at most two recent checkpoint copies are retained in persistent storage at the same time.
The following code shows the startup parameters after modification:
--max_steps=10 \
--block_size=2048 \
--num_train_examples=100000 \
--gradient_checkpointing=false \
--save_strategy="steps" \
--save_steps="2" \
--save_total_limit="2"
Code modifications: Wrap the Trainer of Transformers with the TrainerWrapper class that is provided by EasyCkpt, and enable the resume_from_checkpoint parameter.
The following code shows the training script after modification:
import logging

import datasets
import transformers
# Import the TrainerWrapper class that is provided by EasyCkpt.
from aimaster.python.torch.easyckpt.transformers import TrainerWrapper

logger = logging.getLogger(__name__)

# ... (inside the main() function)
    trainer = transformers.Trainer(
        # ... existing Trainer arguments ...
        tokenizer=tokenizer,
        data_collator=transformers.default_data_collator,
    )
    # Wrap the Trainer with TrainerWrapper and enable resume_from_checkpoint.
    trainer = TrainerWrapper(trainer)
    trainer.train(resume_from_checkpoint=True)


if __name__ == "__main__":
    main()
Description
The EasyCkpt framework provides the following interfaces for DeepSpeed:
save_strategy: the saving mode of checkpoints during training. Valid values:
no: does not save the checkpoints during training.
epoch: saves the checkpoints at the end of each epoch.
steps: saves the checkpoints based on the specified value of save_steps.
save_steps: the number of steps between checkpoint saves during training. This parameter takes effect only when save_strategy is set to steps.
save_total_limit: the maximum number of checkpoints that can be retained.
Note: When the limit specified by save_total_limit is reached, outdated checkpoint folders are deleted. Make sure that the data you need is saved elsewhere before the folders are deleted. For more information, see the official Transformers documentation.
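If you configure training in Python instead of on the command line, the same checkpoint settings can be expressed through TrainingArguments of the Transformers library. The following is a minimal sketch; the output directory and the other arguments are placeholders:

from transformers import TrainingArguments

# Equivalent programmatic configuration of the checkpoint parameters:
# save a checkpoint every 2 steps and keep at most 2 recent checkpoints.
training_args = TrainingArguments(
    output_dir="./output",  # placeholder output directory
    max_steps=10,
    save_strategy="steps",
    save_steps=2,
    save_total_limit=2,
)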
Data security notes
EasyCkpt needs to read and write data in the storage that you specify and may need to delete data to limit the number of retained checkpoints. To ensure data security, this section lists all read and write operations that EasyCkpt performs, the guarantees that EasyCkpt provides, and the recommended usage.
EasyCkpt performs the following read and write operations. By using EasyCkpt, you grant it permission to perform these operations:
Read checkpoint data from the load directory and integrate the data into new checkpoint data.
Save checkpoint data to the save directory, and delete outdated checkpoint folders in the Megatron or Transformers format from the save directory based on the configuration.
EasyCkpt ensures that the following requirements are met:
No operations are performed on data other than the save and load directories.
All save and delete operations that are performed by EasyCkpt are logged.
We recommend that you do not store other data in the save or load directories of the model. Otherwise, EasyCkpt may not work as expected, and you are responsible for the data risks and data losses that may occur.