E2E Development and Usage of LLM: Data Processing + Model Training + Model Inference - Platform For AI

This topic describes how to use the Large Language Model (LLM) data processing, model training, and model inference components provided by PAI to develop and use an LLM end to end.

Prerequisites

A workspace is created. For more information, see Create a workspace.
MaxCompute resources and common computing resources are associated with the workspace. For more information, see Manage workspaces.

Dataset

Each row of the input training data must contain one question-answer pair, stored in the following fields:

instruction: the question field.
output: the answer field.

If your data field names do not meet the requirements, you can use a custom SQL script to preprocess the data. If your data is obtained from the Internet, redundant data or dirty data may exist. You can use LLM data preprocessing components for preliminary data cleaning and sorting. For more information, see LLM Data Processing.
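The field requirements above can also be illustrated outside of SQL. The following Python snippet is a minimal sketch, not the behavior of any PAI component: it assumes that the raw rows are already loaded as dictionaries and that the source field names are question and answer (both hypothetical). It renames the fields to instruction and output, and drops rows that are empty or exact duplicates.

```python
REQUIRED_FIELDS = ("instruction", "output")


def normalize_rows(rows, field_map):
    """Rename fields, then drop incomplete rows and exact duplicates.

    rows: iterable of dicts (assumed to be already loaded in memory).
    field_map: maps hypothetical source field names to the required names.
    """
    seen = set()
    cleaned = []
    for row in rows:
        # Rename source fields; fields not in field_map keep their names.
        renamed = {field_map.get(key, key): value for key, value in row.items()}
        values = tuple(str(renamed.get(f) or "").strip() for f in REQUIRED_FIELDS)
        if not all(values):
            continue  # A question or an answer is missing or blank.
        if values in seen:
            continue  # Exact duplicate of an earlier pair.
        seen.add(values)
        cleaned.append(dict(zip(REQUIRED_FIELDS, values)))
    return cleaned


# Illustrative rows only; "question" and "answer" are assumed field names.
raw_rows = [
    {"question": "What is PAI?", "answer": "Platform for AI."},
    {"question": "What is PAI?", "answer": "Platform for AI."},
    {"question": "  ", "answer": "An answer without a question."},
]
cleaned_rows = normalize_rows(raw_rows, {"question": "instruction", "answer": "output"})
print(cleaned_rows)
```

Only the first row survives in this example: the second row is a duplicate and the third row has a blank question. For data that is stored in MaxCompute tables, a custom SQL script or the LLM data preprocessing components remain the intended tools.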

Procedure

Go to the Machine Learning Designer page.
1. Log on to the PAI console.
2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
3. In the left-side navigation pane of the workspace page, choose Model Development and Training > Visualized Modeling (Designer) to go to the Machine Learning Designer page.

Create a pipeline.

1. On the Visualized Modeling (Designer) page, click the Preset Templates tab.
2. On the Preset Templates tab, click the Large Language Model (LLM) tab. In the E2E Development and Usage of LLM: Data Processing + Model Training + Model Inference card, click Create.
3. In the Create Pipeline dialog box, configure the parameters and click OK. You can use the default values. The Pipeline Data Path parameter specifies the Object Storage Service (OSS) path that stores the temporary data and models generated while the pipeline runs.
4. In the pipeline list, double-click the pipeline that you created to open it.

View the components of the pipeline on the canvas, as shown in the following figure. The system automatically creates the pipeline based on the preset template.

No.  Description
1    Simple data preprocessing is performed only for end-to-end demonstration. For more information about data preprocessing, see LLM Data Processing.
2    Model training and offline inference are performed. The following components are used:

LLM Model Training
This component encapsulates the LLMs provided in QuickStart and runs the underlying computation as Deep Learning Containers (DLC) tasks. To select a model, click the LLM Model Training-1 node on the canvas and specify model_name on the Fields Setting tab in the right pane. The component supports multiple mainstream LLMs. In this pipeline, the qwen-7b-chat model is selected for training.
LLM Model Inference
This component is used for offline inference. In this pipeline, the qwen-7b-chat model is selected for offline batch inference.

Click the run button at the top of the canvas to run the pipeline.
After the pipeline runs successfully, view the inference result: right-click the LLM Model Inference-1 node on the canvas and choose View Data > output directory to save infer results (OSS).

What to do next

You can also use the same preprocessed data to train and run inference for multiple models in parallel. For example, you can create the following pipeline to fine-tune the qwen-7b-chat and llama2-7b-chat models at the same time, and then use the same batch of test data to compare their inference results.
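The comparison step can be sketched in Python as follows. This is a rough illustration under stated assumptions, not a description of what the LLM Model Inference component produces: it assumes that the results of each model have already been downloaded from OSS and loaded as lists of dictionaries with instruction and output fields. The actual output layout is not described in this topic, and the sample answers below are placeholders.

```python
def compare_outputs(results_a, results_b, name_a="qwen-7b-chat", name_b="llama2-7b-chat"):
    """Pair the answers of two models by instruction text.

    results_a, results_b: lists of dicts with "instruction" and "output" keys
    (an assumed layout, used here for illustration only).
    """
    answers_b = {row["instruction"]: row["output"] for row in results_b}
    comparison = []
    for row in results_a:
        question = row["instruction"]
        if question not in answers_b:
            continue  # Skip questions that only one model answered.
        comparison.append({
            "instruction": question,
            name_a: row["output"],
            name_b: answers_b[question],
            "identical": row["output"] == answers_b[question],
        })
    return comparison


# Placeholder results; real answers come from the two inference nodes.
qwen_results = [
    {"instruction": "Q1", "output": "answer A1"},
    {"instruction": "Q2", "output": "answer A2"},
]
llama_results = [
    {"instruction": "Q1", "output": "answer A1"},
    {"instruction": "Q2", "output": "a different answer"},
]
side_by_side = compare_outputs(qwen_results, llama_results)
for item in side_by_side:
    print(item)
```

Pairing by instruction text keeps the comparison valid even if the two result sets are ordered differently, as long as both models received the same batch of test data.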