This topic helps foundation model developers get started with PAI-Lingjun AI Computing Service and develop Qwen-7B, Qwen-14B, and Qwen-72B foundation models. The development process includes distributed training, fine-tuning, offline inference, and online deployment. In this example, a Qwen-7B model is used to describe the best practice for developing a Qwen model in PAI-Lingjun AI Computing Service.
Prerequisites
In this example, Qwen-7B V1.1.4 is used. Before you start, make sure that the following prerequisites are met:
Platform for AI (PAI) is activated, including Data Science Workshop (DSW), Deep Learning Containers (DLC), and Elastic Algorithm Service (EAS). The default workspace is created. For more information, see Activate PAI and create a default workspace.
Lingjun resources are purchased, and a resource quota is created for the purchased Lingjun resources. The following table describes the resource specifications that are supported by different numbers of model parameters. Select appropriate resource specifications based on your actual number of model parameters. For more information about the node specifications of Lingjun resources, see the Pricing of nodes section of the "Billing of Lingjun resources (Serverless Edition)" topic. For more information, see Create a resource group and purchase Lingjun resources and Lingjun resource quotas.
Number of model parameters | Full-parameter training resources | Minimum inference resources | Model parallelism for Megatron-based training |
7 billion | Eight gu7xf GPUs or eight gu7ef GPUs | One NVIDIA V100 GPU (32 GB of memory) or one NVIDIA A10 GPU (24 GB of memory) | TP1 and PP1 |
14 billion | Eight gu7xf GPUs or eight gu7ef GPUs | Two NVIDIA V100 GPUs (32 GB of memory) or two NVIDIA A10 GPUs (24 GB of memory) | TP2 and PP1 |
72 billion | Four servers, each with eight gu7xf GPUs or eight gu7ef GPUs | Six NVIDIA V100 GPUs (32 GB of memory) or two gu7xf GPUs | TP8 and PP2 |
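As a rough sanity check on these specifications (an assumption based on the standard Megatron layout in which the number of GPUs equals TP × PP × data-parallel size): the 72 billion parameter configuration uses 4 × 8 = 32 GPUs, and each model replica spans TP8 × PP2 = 16 GPUs, which leaves a data-parallel degree of 32 / 16 = 2. For the 7 billion and 14 billion configurations, a replica spans 1 or 2 GPUs respectively, so the remaining GPUs of the single 8-GPU server are used for data parallelism.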
A dataset is created based on a General-purpose NAS file system of File Storage NAS to store the files and result files required for training. The default mount directory is /mnt/data/nas. For more information, see Create and manage datasets.
A DSW instance is created based on the following key parameters. For more information, see Create a DSW instance.
Resource Quota: Select the resource quota that is created for the purchased Lingjun resources.
Instance Type: Configure the following resource specifications:
vCPUs: 90
Memory (GiB): 1024
Shared Memory (GiB): 1024
GPUs: at least 8
Mount Settings: Click Add, select the created dataset, and then specify the default mount directory.
Image: Click Image Address and enter the following image URL: pai-image-manage-registry.cn-wulanchabu.cr.aliyuncs.com/pai/pytorch-training:1.12-ubuntu20.04-py3.10-cuda11.3-megatron-patch-llm.
A Resource Access Management (RAM) user is granted the required permissions on DSW, DLC, and EAS if you perform the operations in this best practice as the RAM user. For more information, see Grant the permissions that are required to use DSW, Grant the permissions that are required to use DLC, and Grant the permissions that are required to use EAS.
Limits
This best practice is supported only in the China (Ulanqab) region.
Step 1: Prepare a Qwen model
You can download a model by using one of the methods described in this best practice. Perform the following steps:
Go to the development environment of DSW.
Log on to the PAI console.
In the upper-left corner of the page, select the China (Ulanqab) region.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
In the left-side navigation pane, choose Model Development and Training > Interactive Modeling (DSW).
Find the DSW instance that you want to manage and click Open in the Actions column.
In the top navigation bar, click Terminal. On this tab, click Create Terminal or the plus (+) icon in the upper-right corner.
Download a Qwen model.
Download a model from the ModelScope community
Run the following command on the Terminal tab to install ModelScope:
pip install modelscope
Run the following command to go to the Python environment:
python
Run the following sample code to download the package of a Qwen-7B model:
# ### Loading Model and Tokenizer
from modelscope.hub.snapshot_download import snapshot_download
model_dir = snapshot_download('qwen/Qwen-7B', 'v1.1.4')
# model_dir = snapshot_download('qwen/Qwen-14B', 'v1.0.4')
# model_dir = snapshot_download('qwen/Qwen-72B')
# Display the directory of the downloaded model.
print(model_dir)
# /root/.cache/modelscope/hub/qwen/Qwen-7B
Press Ctrl+D to exit the Python environment.
Run the following commands to move the downloaded model to the corresponding folder:
# mkdir -p /mnt/workspace/qwen-ckpts/${The ckpt folder with the hf suffix}
mkdir -p /mnt/workspace/qwen-ckpts/qwen-7b-hf
# cp -r ${The directory of the downloaded model}/* /mnt/workspace/qwen-ckpts/${The ckpt folder with the hf suffix}
cp -r /root/.cache/modelscope/hub/qwen/Qwen-7B/* /mnt/workspace/qwen-ckpts/qwen-7b-hf
Download a model from the Hugging Face community
Run the following commands on the Terminal tab of DSW to download the package of a model. In this example, the package of a Qwen-7B model is downloaded. If you want to download the package of a Qwen-14B or Qwen-72B model, modify the following sample code based on your business requirements:
mkdir /mnt/workspace/qwen-ckpts
cd /mnt/workspace/qwen-ckpts
git clone https://huggingface.co/Qwen/Qwen-7B
# git clone https://huggingface.co/Qwen/Qwen-7B-Chat
# git clone https://huggingface.co/Qwen/Qwen-14B
# git clone https://huggingface.co/Qwen/Qwen-14B-Chat
# git clone https://huggingface.co/Qwen/Qwen-72B
# git clone https://huggingface.co/Qwen/Qwen-72B-Chat
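The weight files in these repositories are several gigabytes in size and are stored with Git LFS. The following commands are a hedged addition that is not part of the original procedure: they assume that git-lfs can be installed in the DSW image by using apt-get. Run them before git clone if the cloned folders contain only small pointer files instead of the actual weights.
# Install and enable Git LFS so that large weight files are fetched during git clone.
apt-get update
apt-get install -y git-lfs
git lfs install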
Step 2: Prepare data for pre-training
We recommend that you prepare the data used for pre-training in the DSW instance. In this example, the WuDaoCorpora 2.0 dataset is used to describe how to preprocess data for Megatron-based training. This dataset is used only for research. You can directly download the small-scale sample data processed by PAI. You can also prepare the data used for pre-training on your own.
Use the small-scale sample data processed by PAI
To help you use this best practice, PAI provides the processed small-scale sample data. You can run the following commands on the Terminal tab of DSW to download the sample data:
mkdir /mnt/workspace/qwen-datasets/
cd /mnt/workspace/qwen-datasets
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/qwen-datasets/alpaca_zh-qwen-train.json
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/qwen-datasets/alpaca_zh-qwen-valid.json
mkdir -p /mnt/workspace/qwen-datasets/wudao
cd /mnt/workspace/qwen-datasets/wudao
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/qwen-datasets/wudao_qwenbpe_content_document.bin
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/qwen-datasets/wudao_qwenbpe_content_document.idx
Process data on your own
Download the open source WuDaoCorpora 2.0 dataset to the /mnt/workspace/qwen-datasets working directory. In this example, the extracted folder is named wudao_200g. The small-scale sample data processed by PAI is also sourced from this dataset. You can run the following commands on the Terminal tab of DSW to download and decompress the dataset:
mkdir /mnt/workspace/qwen-datasets
cd /mnt/workspace/qwen-datasets
wget https://atp-modelzoo.oss-cn-hangzhou.aliyuncs.com/release/datasets/WuDaoCorpus2.0_base_sample.tgz
tar zxvf WuDaoCorpus2.0_base_sample.tgz
mv WuDaoCorpus2.0_base_sample wudao_200g
Run the following commands on the Terminal tab to perform data cleansing on the WuDaoCorpora 2.0 dataset, convert the file format, and then generate the merged_wudao_cleaned.json file:
#! /bin/bash
set -ex
# Specify the directory of the WuDaoCorpora 2.0 dataset.
data_dir=/mnt/workspace/qwen-datasets/wudao_200g

# Start the data cleansing process.
dataset_dir=$(dirname $data_dir)
mkdir -p ${dataset_dir}/cleaned_wudao_dataset
cd ${dataset_dir}/cleaned_wudao_dataset
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/llama2-codes/preprocess_wudao2.py
# Set the -k option to text.
python preprocess_wudao2.py -i ${data_dir} -o ${dataset_dir}/cleaned_wudao_dataset -k text -p 32

# Merge the cleansed data.
mkdir ${dataset_dir}/wudao
cd ${dataset_dir}/wudao
find ${dataset_dir}/cleaned_wudao_dataset -name "*.json" -exec cat {} + > ${dataset_dir}/wudao/merged_wudao_cleaned.json
rm -rf ${dataset_dir}/cleaned_wudao_dataset
The following sample code shows the structure of the qwen-datasets directory after the preceding commands are run. The wudao folder is created.
qwen-datasets
├── wudao_200g
└── wudao
    └── merged_wudao_cleaned.json
Run the following commands on the Terminal tab to split the generated merged_wudao_cleaned.json file into several groups and compress each group. This facilitates multithreaded processing in subsequent operations.
apt-get update
apt-get install zstd

# Split data into 10 groups. If data processing is slow, you can split data into more groups.
NUM_PIECE=10

# Process the merged_wudao_cleaned.json file.
mkdir -p ${dataset_dir}/cleaned_zst/
# Query the total length of data and split the data.
NUM=$(sed -n '$=' ${dataset_dir}/wudao/merged_wudao_cleaned.json)
echo "total line of dataset is $NUM, data will be split into $NUM_PIECE pieces for processing"
NUM=`expr $NUM / $NUM_PIECE`
echo "each group is processing $NUM sample"
split_dir=${dataset_dir}/split
mkdir $split_dir
split -l $NUM --numeric-suffixes --additional-suffix=.jsonl ${dataset_dir}/wudao/merged_wudao_cleaned.json $split_dir/

# Compress the data of each group.
o_path=${dataset_dir}/cleaned_zst/
mkdir -p $o_path
files=$(ls $split_dir/*.jsonl)
for filename in $files
do
  f=$(basename $filename)
  zstd -z $filename -o $o_path/$f.zst &
done
rm -rf $split_dir
rm ${dataset_dir}/wudao/merged_wudao_cleaned.json
The following sample code shows the structure of the qwen-datasets directory after the preceding commands are run. The cleaned_zst folder is created and contains 10 compressed files.
qwen-datasets
├── wudao_200g
├── wudao
└── cleaned_zst
    ├── 00.jsonl.zst
    │   ...
    └── 09.jsonl.zst
Generate the dataset used for pre-training in the MMAP format.
MMAP is a file format in which data is tokenized in advance. It reduces the amount of time required to read data from the dataset during training and fine-tuning, especially when you process large amounts of data. Perform the following steps:
Run the following commands on the Terminal tab of DSW to obtain the Pai-Megatron-Patch package, which contains the source code of the Megatron-based training tool, and place it in the /mnt/workspace/ working directory of DSW:
cd /mnt/workspace/
# Method 1: Obtain the source code of the training tool from GitHub.
git clone --recurse-submodules https://github.com/alibaba/Pai-Megatron-Patch.git
# Method 2: Obtain the source code of the training tool by running the wget command. Then, run the tar zxvf Pai-Megatron-Patch.tgz command to decompress the downloaded file.
wget https://atp-modelzoo.oss-cn-hangzhou.aliyuncs.com/release/models/Pai-Megatron-Patch.tgz
Run the following commands on the Terminal tab to convert the dataset to the MMAP format:
# Install the tokenizer library on which Qwen depends.
pip install tiktoken
# Specify the directory of the dataset and the working directory.
export dataset_dir=/mnt/workspace/qwen-datasets
export WORK_DIR=/mnt/workspace

# Generate the training set and validation set used for pre-training in the MMAP format.
cd ${WORK_DIR}/Pai-Megatron-Patch/toolkits/pretrain_data_preprocessing
bash run_make_pretraining_dataset.sh \
../../Megatron-LM-23.04 \
${WORK_DIR}/Pai-Megatron-Patch/ \
${dataset_dir}/cleaned_zst/ \
qwenbpe \
${dataset_dir}/wudao/ \
${WORK_DIR}/qwen-ckpts/qwen-7b-hf
rm -rf ${dataset_dir}/cleaned_zst
After the commands are run, the .bin and .idx files are generated in the /mnt/workspace/qwen-datasets/wudao directory.
The following table describes the six parameters that you must specify to run the run_make_pretraining_dataset.sh script.
Parameter | Description |
MEGATRON_PATH=$1 | The directory of the source code of the Megatron-based training tool. |
MEGATRON_PATCH_PATH=$2 | The directory of the Pai-Megatron-Patch folder. |
input_data_dir=$3 | The directory of the processed and packaged WuDaoCorpora 2.0 dataset. |
tokenizer=$4 | The type of the tokenizer. In this example, the value is set to qwenbpe. |
output_data_dir=$5 | The directory of the generated .bin and .idx files. |
load_dir=$6 | The directory of the tokenizer_config.json file. |
The following sample code shows the structure of the qwen-datasets directory after the script is run:
qwen-datasets
├── wudao_200g
└── wudao
    ├── wudao_qwenbpe_content_document.bin
    └── wudao_qwenbpe_content_document.idx
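Optionally, you can verify that the MMAP files exist and are non-empty before you move on to training. This check is not part of the original procedure. Note that the DATASET_PATH parameter used in Step 3 references the two files by their common prefix, wudao_qwenbpe_content_document, without a file name extension.
# List the generated MMAP files and their sizes.
ls -lh /mnt/workspace/qwen-datasets/wudao/wudao_qwenbpe_content_document.*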
Step 3: Perform Megatron-based training
Perform the following operations to complete Megatron-based training:
Convert the model format
You must convert the model format from Hugging Face to Megatron.
Download the converted Megatron model
To help you use this best practice, PAI provides the model whose format has been converted. You can run the following commands on the Terminal tab to download the model:
cd /mnt/workspace/
mkdir qwen-ckpts
cd qwen-ckpts
wget https://atp-modelzoo-wlcb-pai.oss-cn-wulanchabu.aliyuncs.com/release/models/pai-megatron-patch/qwen-ckpts/qwen-7b-hf-to-mg-tp1-pp1.tgz
tar -zxf qwen-7b-hf-to-mg-tp1-pp1.tgz
mv qwen-7b-hf-to-mg-tp1-pp1 qwen-7b-hf-to-megatron-tp1-pp1
Convert the model format from Hugging Face to Megatron
Run the following commands on the Terminal tab to use the model conversion tool provided by PAI to convert the model format from Hugging Face to Megatron:
# Convert the model format.
cd /mnt/workspace/Pai-Megatron-Patch/toolkits/model_checkpoints_convertor/qwen
sh model_convertor.sh \
../../../Megatron-LM-main \
/mnt/workspace/qwen-ckpts/qwen-7b-hf \
/mnt/workspace/qwen-ckpts/qwen-7b-hf-to-megatron-tp1-pp1 \
1 \
1 \
qwen-7b \
0 \
false
The following table describes the parameters that you must specify to run the model_convertor.sh script.
Parameter | Description |
MEGATRON_PATH=$1 | The directory of the source code of the Megatron-based training tool. |
SOURCE_CKPT_PATH=$2 | The directory of the Hugging Face model. |
TARGET_CKPT_PATH=$3 | The directory of the converted Megatron model. |
TP=$4 | The size of tensor parallelism, which must be the same as that for training. The size varies based on the number of model parameters: Qwen-7B: 1, Qwen-14B: 2, Qwen-72B: 8. You must modify the size when you convert the model format. |
PP=$5 | The size of pipeline parallelism, which must be the same as that for training. The size varies based on the number of model parameters: Qwen-7B: 1, Qwen-14B: 1, Qwen-72B: 2. You must modify the size when you convert the model format. |
MN=$6 | The name of the model, such as qwen-7b, qwen-14b, or qwen-72b. |
EXTRA_VOCAB_SIZE=$7 | The size of the extra vocabulary. |
mg2hf=$8 | Specifies whether to convert the model format from Megatron to Hugging Face. |
Pre-train the model
You can submit a standalone job to train the model in DSW, or submit a distributed job to train the model on multiple multi-GPU servers in DLC. The training process lasts about 2 hours. After the job is run, a model file is exported to the /mnt/workspace/output_megatron_qwen/ directory.
Run a standalone job to pre-train the model in DSW
Run the following commands on the Terminal tab to submit a standalone job that trains a Qwen-7B model:
export WORK_DIR=/mnt/workspace
cd ${WORK_DIR}/Pai-Megatron-Patch/examples/qwen
sh run_pretrain_megatron_qwen.sh \
dsw \
${WORK_DIR}/Pai-Megatron-Patch \
7B \
1 \
8 \
1e-5 \
1e-6 \
2048 \
2048 \
85 \
fp16 \
1 \
1 \
sel \
true \
false \
false \
false \
100000 \
${WORK_DIR}/qwen-datasets/wudao/wudao_qwenbpe_content_document \
${WORK_DIR}/qwen-ckpts/qwen-7b-hf-to-megatron-tp1-pp1 \
100000000 \
10000 \
${WORK_DIR}/output_megatron_qwen/
The following table describes the parameters that you must specify to run the run_pretrain_megatron_qwen.sh script.
Parameter | Description |
ENV=$1 | The runtime environment. Valid values: dsw and dlc. |
MEGATRON_PATH=$2 | The directory of the source code of the Megatron-based training tool. |
MODEL_SIZE=$3 | The number of model parameters. Valid values: 7B, 14B, and 72B. |
BATCH_SIZE=$4 | The number of samples on each GPU for each training iteration. Valid values: 4 and 8. |
GLOBAL_BATCH_SIZE=$5 | The total number of samples for training iterations. |
LR=$6 | The learning rate. Valid values: 1e-5 and 5e-5. |
MIN_LR=$7 | The minimum learning rate. Valid values: 1e-6 and 5e-6. |
SEQ_LEN=$8 | The length of the sequence. |
PAD_LEN=${9} | The length of the padding sequence. |
EXTRA_VOCAB_SIZE=${10} | The size of the extra vocabulary. The size varies based on the number of model parameters. Qwen-7B: 85. Qwen-14B: 213. Qwen-72B: 213. |
PR=${11} | The training precision. Valid values: fp16 and bf16. |
TP=${12} | The size of tensor parallelism. |
PP=${13} | The size of pipeline parallelism. |
AC=${14} | The activation checkpointing mode. Valid values: full and sel. |
DO=${15} | Specifies whether to use the ZeRO-1 optimizer for Megatron. Valid values: true and false. |
FL=${16} | Specifies whether to enable Flash Attention. Valid values: true and false. |
SP=${17} | Specifies whether to use sequence parallelism. Valid values: true and false. |
TE=${18} | Specifies whether to enable the acceleration technology of Transformer Engine. If you want to enable this technology, gu8xf GPUs are required. |
SAVE_INTERVAL=${19} | The interval at which the checkpoint file is saved. |
DATASET_PATH=${20} | The directory of the training set. |
PRETRAIN_CHECKPOINT_PATH=${21} | The directory of the pre-trained model. |
TRAIN_TOKENS=${22} | The number of tokens for training. |
WARMUP_TOKENS=${23} | The number of tokens for warm-up. |
OUTPUT_BASEPATH=${24} | The directory of the output model file generated after training. |
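As a rough, hedged estimate based on the sample values above (assuming the usual convention that the number of iterations is approximately TRAIN_TOKENS divided by GLOBAL_BATCH_SIZE × SEQ_LEN): 100,000,000 training tokens with a global batch size of 8 and a sequence length of 2048 correspond to about 100,000,000 / (8 × 2048) ≈ 6,100 iterations. Treat this only as a sanity check when you size TRAIN_TOKENS and SAVE_INTERVAL; the script computes the exact schedule.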
Run a distributed job to pre-train the model in DLC
After you train the model in DSW, you can configure a distributed job to train the model on multiple multi-GPU servers in DLC. Perform the following steps:
Go to the Create Job page.
Log on to the PAI console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
In the left-side navigation pane, choose Model Development and Training > Deep Learning Containers (DLC). On the Deep Learning Containers (DLC) page, click Create Job. The Create Job page appears.
On the Create Job page, configure the parameters that are described in the following table. You can use the default values for other parameters. For more information, see Submit training jobs.
Parameter
Description
Basic Information
Job Name
The name of the training job. In this example, the value is set to test_qwen_dlc.
Environment Information
Node Image
Click Image Address and enter the following image URL in the field: pai-image-manage-registry.cn-wulanchabu.cr.aliyuncs.com/pai/pytorch-training:1.12-ubuntu20.04-py3.10-cuda11.3-megatron-patch-llm.
Mount Settings
Click Add. Select Custom Dataset as Mount Type and configure the following parameters:
Datasets: Select the dataset created based on the General-purpose NAS file system of File Storage NAS.
Mount Path: Enter /mnt/workspace/.
Startup Command
Enter the following commands. The parameters that you must specify to run the run_pretrain_megatron_qwen.sh script are the same as those that you specify when you submit a standalone job to train the model in DSW.
export WORK_DIR=/mnt/workspace
cd ${WORK_DIR}/Pai-Megatron-Patch/examples/qwen
sh run_pretrain_megatron_qwen.sh \
dlc \
${WORK_DIR}/Pai-Megatron-Patch \
7B \
1 \
8 \
1e-5 \
1e-6 \
2048 \
2048 \
85 \
fp16 \
1 \
1 \
sel \
true \
false \
false \
false \
100000 \
${WORK_DIR}/qwen-datasets/wudao/wudao_qwenbpe_content_document \
${WORK_DIR}/qwen-ckpts/qwen-7b-hf-to-megatron-tp1-pp1 \
100000000 \
10000 \
${WORK_DIR}/output_megatron_qwen/
Resource Information
Resource Type
Select Lingjun Resources.
Source
Select Resource Quota.
Resource Quota
Select the resource quota that is created for the purchased Lingjun resources.
Framework
Select PyTorch.
Job Resource
Configure the following parameters for worker nodes:
Number of Nodes: Enter 2. If you want to train the model on more servers, you can increase the value of the Number of Nodes parameter.
GPUs: Enter 8.
vCPUs: Enter 90.
Note: The number of CPU cores cannot be greater than 96.
Memory (GiB): Enter 1024.
Shared Memory (GiB): Enter 1024.
Click OK. You are navigated to the Deep Learning Containers (DLC) page. If the state of the job changes to Succeeded, the training job is complete.
Perform supervised fine-tuning
You can submit a standalone job to fine-tune the model in DSW, or submit a distributed job to fine-tune the model on multiple multi-GPU servers in DLC. The fine-tuning process lasts about 2 hours. After the job is run, a model file is exported to the /mnt/workspace/output_megatron_qwen/ directory.
Before you fine-tune the model, go to Step 2: Prepare data for pre-training. Use the sample code on the Use the small-scale sample data processed by PAI tab to download the JSON files.
Fine-tune the model.
Run a standalone job to fine-tune the model in DSW
Run the following commands on the Terminal tab to submit a standalone job that fine-tunes a Qwen-7B model:
export WORK_DIR=/mnt/workspace
cd ${WORK_DIR}/Pai-Megatron-Patch/examples/qwen
sh run_finetune_megatron_qwen_withGA.sh \
dsw \
${WORK_DIR}/Pai-Megatron-Patch \
7B \
1 \
96 \
1e-5 \
1e-6 \
2048 \
2048 \
85 \
bf16 \
1 \
1 \
sel \
true \
false \
false \
false \
1000 \
${WORK_DIR}/qwen-datasets/alpaca_zh-qwen-train.json \
${WORK_DIR}/qwen-datasets/alpaca_zh-qwen-valid.json \
${WORK_DIR}/qwen-ckpts/qwen-7b-hf-to-megatron-tp1-pp1 \
2000 \
10 \
${WORK_DIR}/output_megatron_qwen/
The following table describes the parameters that you must specify to run the run_finetune_megatron_qwen_withGA.sh script.
Parameter | Description |
ENV=$1 | The runtime environment. Valid values: dsw and dlc. |
MEGATRON_PATH=$2 | The directory of the source code of the Megatron-based training tool. |
MODEL_SIZE=$3 | The number of model parameters. Valid values: 7B, 14B, and 72B. |
BATCH_SIZE=$4 | The number of samples on each GPU for each fine-tuning iteration. Valid values: 1, 2, 4, and 8. |
GLOBAL_BATCH_SIZE=$5 | The total number of samples for fine-tuning iterations. Valid values: 64, 96, and 128. |
LR=$6 | The learning rate. Valid values: 1e-5 and 5e-5. |
MIN_LR=$7 | The minimum learning rate. Valid values: 1e-6 and 5e-6. |
SEQ_LEN=$8 | The length of the sequence. |
PAD_LEN=$9 | The length of the padding sequence. |
EXTRA_VOCAB_SIZE=${10} | The size of the extra vocabulary. The size varies based on the number of model parameters. Qwen-7B: 85. Qwen-14B: 213. Qwen-72B: 213. |
PR=${11} | The training precision. Valid values: fp16 and bf16. |
TP=${12} | The size of tensor parallelism. |
PP=${13} | The size of pipeline parallelism. |
AC=${14} | The activation checkpointing mode. Valid values: full and sel. |
DO=${15} | Specifies whether to use the ZeRO-1 optimizer for Megatron. Valid values: true and false. |
FL=${16} | Specifies whether to enable Flash Attention. Valid values: true and false. |
SP=${17} | Specifies whether to use sequence parallelism. Valid values: true and false. |
TE=${18} | Specifies whether to enable the acceleration technology of Transformer Engine. If you want to enable this technology, gu8xf GPUs are required. |
SAVE_INTERVAL=${19} | The interval at which the model is saved. |
DATASET_PATH=${20} | The directory of the training set. |
VALID_DATASET_PATH=${21} | The directory of the validation set. |
PRETRAIN_CHECKPOINT_PATH=${22} | The directory of the pre-trained model. |
TRAIN_ITERS=${23} | The number of training iterations. |
LR_WARMUP_ITERS=${24} | The number of warm-up iterations for the learning rate. |
OUTPUT_BASEPATH=${25} | The directory of the output model file generated after training. |
Run a distributed job to fine-tune the model in DLC
After you fine-tune the model in DSW, you can configure a distributed job to fine-tune the model on multiple multi-GPU servers in DLC. When you submit a training job in DLC, enter the following commands for the Startup Command parameter. For more information about other parameters, see the Pre-train the model section of this topic.
export WORK_DIR=/mnt/workspace
cd ${WORK_DIR}/Pai-Megatron-Patch/examples/qwen
sh run_finetune_megatron_qwen_withGA.sh \
dlc \
${WORK_DIR}/Pai-Megatron-Patch \
7B \
1 \
96 \
1e-5 \
1e-6 \
2048 \
2048 \
85 \
bf16 \
1 \
1 \
sel \
true \
false \
false \
false \
1000 \
${WORK_DIR}/qwen-datasets/alpaca_zh-qwen-train.json \
${WORK_DIR}/qwen-datasets/alpaca_zh-qwen-valid.json \
${WORK_DIR}/qwen-ckpts/qwen-7b-hf-to-megatron-tp1-pp1 \
2000 \
10 \
${WORK_DIR}/output_megatron_qwen/
The parameters that you must specify to run the run_finetune_megatron_qwen_withGA.sh script are the same as those that you specify when you submit a standalone job to fine-tune the model in DSW.
Step 4: Use the model for offline inference
After the model is trained, you can perform offline inference by using the model based on Megatron to evaluate the effects of the model. Perform the following steps:
Download the pred_input.jsonl file that contains test samples and upload the file to the /mnt/workspace directory of DSW. For more information, see Upload or download data files.
Note: The data used for inference must be organized in the same way as the data used for fine-tuning.
Copy all the JSON files and the tokenizer.model file from the model directory used before training to the directory of the output model file generated after training, so that the files are placed in the {OUTPUT_BASEPATH}/checkpoint directory, in the same folder as the latest_checkpointed_iteration.txt file.
Note: Replace the directories in the commands with your actual directories.
cd /mnt/workspace/qwen-ckpts/qwen-7b-hf-to-megatron-tp1-pp1
cp *.json /mnt/workspace/output_megatron_qwen/checkpoint/dswXXX/
cp tokenizer.model /mnt/workspace/output_megatron_qwen/checkpoint/dswXXX/
Run the following commands on the Terminal tab to perform offline inference by using the model. The inference results are generated in the /mnt/workspace/qwen_pred.txt file. You can evaluate the effects of the model based on the inference results.
Note: Before you run the commands, you must set the CUDA_VISIBLE_DEVICES parameter to 0 and the GPUS_PER_NODE parameter to 1 in the run_text_generation_megatron_qwen.sh script.
export WORK_DIR=/mnt/workspace
cd ${WORK_DIR}/Pai-Megatron-Patch/examples/qwen
bash run_text_generation_megatron_qwen.sh \
dsw \
${WORK_DIR}/Pai-Megatron-Patch \
/mnt/workspace/output_megatron_qwen/checkpoint/dswXXX \
7B \
1 \
1 \
1024 \
1024 \
85 \
fp16 \
10 \
512 \
512 \
${WORK_DIR}/pred_input.jsonl \
${WORK_DIR}/qwen_pred.txt \
0 \
1.0 \
1.2
The following table describes the parameters that you must specify to run the run_text_generation_megatron_qwen.sh script.
Parameter | Description |
ENV=$1 | The runtime environment. Valid values: dsw and dlc. |
MEGATRON_PATCH_PATH=$2 | The directory of the Pai-Megatron-Patch folder. |
CHECKPOINT_PATH=$3 | The directory of the model saved during training. Important: Replace this directory with your actual model directory. |
MODEL_SIZE=$4 | The number of model parameters. Valid values: 7B, 14B, and 72B. |
TP=$5 | The size of tensor parallelism. Important: If you set this parameter to 1, you can use a single GPU for inference. If you set this parameter to a value greater than 1, you must use the corresponding number of GPUs for inference. |
BS=$6 | The number of samples on each GPU for each inference iteration. Valid values: 1, 4, and 8. |
SEQ_LEN=$7 | The length of the sequence. Valid values: 256, 512, and 1024. |
PAD_LEN=$8 | The length of the padding sequence, which is the length of the concatenated text. |
EXTRA_VOCAB_SIZE=${9} | The number of tokens added during model conversion. The number varies based on the number of model parameters. Qwen-7B: 85. Qwen-14B: 213. Qwen-72B: 213. |
PR=${10} | The inference precision. Valid values: fp16 and bf16. |
TOP_K=${11} | The number of top candidate words to be selected. Valid values: 0 to n. Examples: 0, 5, 10, and 20. |
INPUT_SEQ_LEN=${12} | The length of the input sequence. Set the value to 512. |
OUTPUT_SEQ_LEN=${13} | The length of the output sequence. Set the value to 256. |
INPUT_FILE=${14} | The file that contains the text to be used for inference. In this example, the pred_input.jsonl file is used, in which each line contains a sample. |
OUTPUT_FILE=${15} | The output file generated after inference. In this example, the qwen_pred.txt file is used. |
TOP_P=${16} | The percentage of top candidate words to be selected. Valid values: 0 to 1. Examples: 0, 0.85, and 0.95. Note: You must set one of the TOP_K and TOP_P parameters to 0. |
TEMPERATURE=${17} | The randomness of the sampling process. Valid values: 1 to n. |
REPETITION_PENALTY=${18} | The repetition penalty for the content generated by the model. Valid values: 1 to 2. Default value: 1.2. |
Step 5: Convert the model format
If the effects of the model meet your expectations after offline inference is performed by using the model, you can convert the model format from Megatron to Hugging Face. Then, you can deploy the converted Hugging Face model as a model service.
Run the following commands on the Terminal tab to convert the model format from Megatron to Hugging Face:
export WORK_DIR=/mnt/workspace
cd /mnt/workspace/Pai-Megatron-Patch/toolkits/model_checkpoints_convertor/qwen
sh model_convertor.sh \
../../../Megatron-LM-main \
${WORK_DIR}/output_megatron_qwen/checkpoint/${Directory}/iter_******* \
/mnt/workspace/qwen-ckpts/qwen-7b-mg-to-hf-tp1-pp1/ \
1 \
1 \
qwen-7b \
0 \
true
The following table describes the parameters that you must specify to run the model_convertor.sh script.
Parameter | Description |
MEGATRON_PATH=$1 | The directory of the source code of the Megatron-based training tool. |
SOURCE_CKPT_PATH=$2 | The directory of the trained model in the Megatron format, including the iter_* folder. Example: ${WORK_DIR}/output_megatron_qwen/checkpoint/dsw-pretrain-megatron-qwen-7B-lr-1e-5-bs-1-seqlen-2048-pr-bf16-tp-1-pp-1-ac-sel-do-true-sp-false-tt--wt-/iter_*******. Important: Replace this directory with your actual model directory. If you need to convert the format of a pre-trained model, you must delete all the distrib_optim.pt files in the model directory. |
TARGET_CKPT_PATH=$3 | The directory of the converted Hugging Face model. |
TP=$4 | The size of tensor parallelism, which must be the same as that for training. |
PP=$5 | The size of pipeline parallelism, which must be the same as that for training. |
MN=$6 | The name of the model, such as qwen-7b, qwen-14b, or qwen-72b. |
EXTRA_VOCAB_SIZE=$7 | The size of the extra vocabulary. |
mg2hf=$8 | Specifies whether to convert the model format from Megatron to Hugging Face. |
Copy the .json, .py, and .tiktoken files of the open source Hugging Face model from the /mnt/workspace/qwen-ckpts/qwen-7b-hf directory to the /mnt/workspace/qwen-ckpts/qwen-7b-mg-to-hf-tp1-pp1 directory to ensure that the model can be properly used.
Important: Note that you do not need to copy the pytorch_model.bin.index.json file.
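The following commands are a minimal sketch of this copy step based on the directories used in this example; they are not part of the original instructions, so adjust the paths to your actual environment:
cd /mnt/workspace/qwen-ckpts/qwen-7b-hf
# Copy the configuration, code, and tokenizer files of the open source model.
cp *.json *.py *.tiktoken /mnt/workspace/qwen-ckpts/qwen-7b-mg-to-hf-tp1-pp1/
# The pytorch_model.bin.index.json file does not need to be copied, so remove it from the target directory.
rm -f /mnt/workspace/qwen-ckpts/qwen-7b-mg-to-hf-tp1-pp1/pytorch_model.bin.index.json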
Step 6: Deploy the model as a model service and call the model service
After you perform offline inference and evaluate the effects of the model, you can deploy the converted Hugging Face model as an online model service and call the model service in the actual production environment to perform inference. Perform the following steps:
Deploy the model as a model service
Go to the EAS-Online Model Services page.
Log on to the PAI console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace to which you want to deploy the model.
In the left-side navigation pane, choose Model Deployment > Elastic Algorithm Service (EAS) to go to the Elastic Algorithm Service (EAS) page.
On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the dialog box that appears, select Custom Deployment and click OK.
On the Create Service page, configure the parameters that are described in the following table. You can use the default values for other parameters.
Parameter
Description
Model Service Information
Service Name
The custom name of the model service. The name must be unique in a region. In this example, the value is set to test_qwen.
Deployment Method
In this example, Deploy Web App by Using Image is selected.
Select Image
Select Image Address, enter pai-image-manage-registry.cn-wulanchabu.cr.aliyuncs.com/pai/llm-inference:vllm-0.2.1-v4 in the field, and then read and agree to the Machine Learning Platform for AI Terms of Service by selecting the check box.
Model Settings
Click Specify Model Settings and configure the model directory. Select Mount NAS File System and configure the following parameters:
NAS Mount Target: the General-purpose NAS file system and mount target based on which the dataset is created.
NAS Source Path: the directory of the converted Hugging Face model that is stored in the NAS file system. In this example, the /qwen-ckpts/qwen-7b-mg-to-hf-tp1-pp1 directory is used.
Mount Path: the mount directory of the model. In this example, the value is set to /qwen-7b.
Command to Run
In this example, the following command is run (a Qwen-72B variant is sketched at the end of this deployment procedure):
nohup python -m fastchat.serve.controller > tmp1.log 2>&1 &
python -m fastchat.serve.gradio_web_server_pai --model-list-mode reload > tmp2.log 2>&1 &
python -m fastchat.serve.vllm_worker --model-path /qwen-7b --tensor-parallel-size 1 --trust-remote-code
Where:
--model-path: the mount directory of the model, which must be the same as the mount path in the model settings.
--tensor-parallel-size: the size of tensor parallelism, which must be adjusted based on the number of GPUs. For example, set this parameter to 1 for a Qwen-7B model or 2 for a Qwen-72B model.
Port number: In this example, port 7860 is used.
Resource Deployment Information
Resource Group Type
In this example, Intelligent Computing Lingjun Resources is selected.
Select Quota
Select the resource quota that is created for the purchased Lingjun resources.
Instance Count
Configure the parameters based on the model and the selected resources. For a Qwen-7B model, set the Instance Count parameter to 1 and select the instance type based on the following resource specifications:
vCPUs: 16
Memory: 64,000 MB
GPUs: 1
VPC Settings
VPC
After you configure the NAS Mount Target parameter, the system automatically matches the virtual private cloud (VPC), vSwitch, and security group of the specified NAS file system.
vSwitch
Security Group Name
Click Deploy.
If the state of the service changes to Running, the service is deployed.
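For reference, the following command is a hedged sketch of how the Command to Run parameter might look for a Qwen-72B deployment. It assumes a hypothetical /qwen-72b mount path and the tensor parallelism size of 2 mentioned above; adjust both to your actual model settings and GPU count.
# Start the controller, the web server, and the vLLM worker for a Qwen-72B model on 2 GPUs.
nohup python -m fastchat.serve.controller > tmp1.log 2>&1 &
python -m fastchat.serve.gradio_web_server_pai --model-list-mode reload > tmp2.log 2>&1 &
python -m fastchat.serve.vllm_worker --model-path /qwen-72b --tensor-parallel-size 2 --trust-remote-code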
Call the model service
After the model service is deployed, you can call the service to perform inference. Perform the following steps:
On the Inference Service tab, find the service that you want to call and click View Web App in the Service Type column.
On the WebUI page, perform inference.