Before you use Pai-Megatron-Patch to accelerate model training, you must install a Pai-Megatron-Patch image. This topic describes the limits on and the procedure for installing a Pai-Megatron-Patch image.
Limits
You can install a Pai-Megatron-Patch image only on GPU-accelerated instances.
The GPU driver version must be 460.32 or later.
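The driver requirement above can be checked before you create a job or an instance. The following is a minimal sketch, assuming that the nvidia-smi command-line tool is available on the GPU-accelerated instance. The function names and the MIN_DRIVER constant are illustrative and are not part of Pai-Megatron-Patch.

```python
import subprocess

MIN_DRIVER = (460, 32)  # minimum driver version stated in the limits above


def parse_version(text):
    """Turn a version string such as '470.82.01' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_supported(version_text, minimum=MIN_DRIVER):
    """Return True if the driver version is at least the minimum."""
    return parse_version(version_text) >= minimum


def installed_driver_versions():
    """Ask nvidia-smi for the driver version of each GPU.

    Returns an empty list if nvidia-smi is missing or reports an error.
    """
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    versions = installed_driver_versions()
    if not versions:
        print("nvidia-smi was not found or no GPU was detected")
    for version in versions:
        try:
            status = "OK" if driver_is_supported(version) else "older than 460.32"
        except ValueError:
            status = "could not be parsed"
        print(version, status)
```

Versions are compared as integer tuples rather than as strings, so that a value such as 460.100 is not ranked below 460.32.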
Procedure
Install a Pai-Megatron-Patch image in DLC
Deep Learning Containers (DLC) of Platform for AI (PAI) is a cloud-native all-in-one platform on which you can train deep learning models. DLC provides a flexible, stable, easy-to-use, and high-performance training environment. DLC supports various algorithms, including large-scale distributed deep learning algorithms and custom algorithm frameworks. This helps developers and enterprises reduce costs and improve efficiency.
DLC allows you to install custom images, including Pai-Megatron-Patch images. You only need to pass the URL of a Pai-Megatron-Patch image to DLC, and the system automatically installs the image. After the image is installed, you can run ultra-large-scale distributed training across multiple multi-GPU servers based on Pai-Megatron-Patch in DLC.
Perform the following steps to install a Pai-Megatron-Patch image:
Log on to the PAI console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
In the left-side navigation pane of the workspace page, choose Model Development and Training > Deep Learning Containers (DLC). Click Create Job.
On the page that appears, configure the parameters. The following descriptions provide the configurations of key parameters. You can configure other parameters based on your business requirements. For more information about the parameters, see Submit training jobs.
Environment Information: Set Node Image to Image Address and enter the address of the Pai-Megatron-Patch image in the field that appears. The image address is pai-image-manage-registry.cn-wulanchabu.cr.aliyuncs.com/pai/pytorch-training:2.0-ubuntu20.04-py3.10-cuda11.8-megatron-patch-llm.
Resource Information:
Framework: Select PyTorch.
Job Resource: In the Instance Type column, select a GPU node, and then select node specifications based on your business requirements.
Click OK.
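Because the image address is entered by hand, a copy-and-paste mistake such as stray whitespace or a trailing period can cause the image pull to fail. The following sketch shows one way to check an address before you paste it into the console. The helper names are hypothetical, and the checks cover only the common registry/repository:tag form.

```python
IMAGE = (
    "pai-image-manage-registry.cn-wulanchabu.cr.aliyuncs.com/pai/"
    "pytorch-training:2.0-ubuntu20.04-py3.10-cuda11.8-megatron-patch-llm"
)


def split_image_address(address):
    """Split 'registry/repository:tag' into its three parts."""
    path, sep, tag = address.rpartition(":")
    if not sep or "/" in tag:
        # No colon at all, or the last colon belongs to a registry port.
        path, tag = address, ""
    registry, _, repository = path.partition("/")
    return {"registry": registry, "repository": repository, "tag": tag}


def check_image_address(address):
    """Return a list of problems found in a pasted image address."""
    problems = []
    if any(ch.isspace() for ch in address):
        problems.append("contains whitespace")
    if address.rstrip().endswith("."):
        problems.append("ends with a period")
    if not split_image_address(address.strip())["tag"]:
        problems.append("missing tag")
    return problems


if __name__ == "__main__":
    print(split_image_address(IMAGE))
    print(check_image_address(IMAGE) or "no problems found")
```

An empty list from check_image_address means that none of these specific mistakes were detected. It does not confirm that the image exists in the registry.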
Install a Pai-Megatron-Patch image in DSW
Data Science Workshop (DSW) is a cloud-based development environment for deep learning algorithm development. DSW integrates JupyterLab and provides plug-ins for custom development. You can launch a notebook to write, debug, and run Python code without performing O&M configurations. DSW supports open source deep learning frameworks and provides an optimized TensorFlow framework that is developed by Alibaba Cloud. You can optimize compilation to improve the training performance.
DSW also allows you to install custom images. You only need to pass the URL of a Pai-Megatron-Patch image to DSW, and the system automatically installs the image. After the image is installed, you can accelerate training based on Pai-Megatron-Patch in DSW.
Perform the following steps to install a Pai-Megatron-Patch image:
Log on to the PAI console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
In the left-side navigation pane of the workspace page, choose Model Development and Training > Interactive Modeling (DSW). Click Create Instance.
On the page that appears, configure the parameters. The following descriptions provide the configurations of key parameters. You can configure other parameters based on your business requirements. For more information about the parameters, see Create a DSW instance.
Resource Quota: Select Public Resource Group (Pay-as-you-go).
Instance Type: Select a GPU instance type based on your business requirements.
Image: Enter the address of the Pai-Megatron-Patch image in the field. The address is pai-image-manage-registry.cn-wulanchabu.cr.aliyuncs.com/pai/pytorch-training:2.0-ubuntu20.04-py3.10-cuda11.8-megatron-patch-llm.
Click Yes. A DSW instance is created.
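After the DSW instance starts, you may want to confirm that the environment matches what the image tag suggests. The following sketch can be run in a notebook cell. It reports the Python version and, if PyTorch is importable, the PyTorch version, the CUDA version that PyTorch was built with, and the number of visible GPUs. The environment_report function is an illustrative helper and is not provided by Pai-Megatron-Patch.

```python
import importlib.util
import platform


def environment_report():
    """Collect basic facts about the Python and PyTorch environment."""
    report = {"python": platform.python_version(), "torch_installed": False}
    if importlib.util.find_spec("torch") is None:
        return report

    # Imported lazily so that the check also runs where PyTorch is absent.
    import torch

    report["torch_installed"] = True
    report["torch"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    report["gpu_count"] = torch.cuda.device_count()
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

If cuda_available is False on a GPU instance, the driver limit described in the Limits section is a reasonable first thing to verify.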
Use Pai-Megatron-Patch
After you install a Pai-Megatron-Patch image, you can view and use the sample code in the examples folder of Pai-Megatron-Patch.
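To see which samples are available, you can list the subdirectories of the examples folder. The following sketch assumes that you know where Pai-Megatron-Patch is located in the image. That location is not specified in this topic, so the script takes it as a command-line argument and defaults to the current directory. The list_examples helper is hypothetical.

```python
import sys
from pathlib import Path


def list_examples(root):
    """Return the sorted names of the subdirectories of <root>/examples."""
    examples = Path(root) / "examples"
    if not examples.is_dir():
        return []
    return sorted(p.name for p in examples.iterdir() if p.is_dir())


if __name__ == "__main__":
    # The Pai-Megatron-Patch root is an assumption supplied by the caller.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for name in list_examples(root):
        print(name)
```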