Model Gallery provides various pre-trained models. You can use pre-trained models to learn how to train and deploy models with Platform for AI (PAI). This topic describes how to find a model that meets your business requirements and how to deploy, debug, and fine-tune the model in Model Gallery.
Prerequisites
An Object Storage Service (OSS) bucket is created to store data for model fine-tuning or incremental training. For more information, see Create buckets.
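If you prefer to upload fine-tuning data to the bucket from code instead of the console, you can use the Alibaba Cloud OSS Python SDK (oss2). The following is a minimal sketch, not a definitive implementation: the bucket name, endpoint, and object key are placeholders, and you must supply your own AccessKey pair.

```python
def upload_dataset(local_path, bucket_name, endpoint, key,
                   access_key_id, access_key_secret):
    """Upload a local training file to OSS. Requires `pip install oss2`."""
    import oss2  # Alibaba Cloud OSS Python SDK
    auth = oss2.Auth(access_key_id, access_key_secret)
    bucket = oss2.Bucket(auth, endpoint, bucket_name)
    bucket.put_object_from_file(key, local_path)
    return oss_uri(bucket_name, key)

def oss_uri(bucket_name, key):
    """Build the oss:// URI that you later reference in the console."""
    return f"oss://{bucket_name}/{key}"
```

You can then select the uploaded object when you configure the training dataset.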
Billing
Model Gallery itself is free of charge. However, you are charged for model deployment in Elastic Algorithm Service (EAS) and model training in Deep Learning Containers (DLC). For more information, see Billing of EAS and Billing of DLC.
Find a model that is suitable for your business
Model Gallery provides various models to meet your business requirements. When you search for a suitable model, take note of the following items:
Search for a model based on your business areas and tasks.
Most models provide information about the dataset used for pre-training. The more relevant the pre-training dataset is to your business scenario, the better your model performs after deployment and fine-tuning. For information about the pre-training dataset of a model, go to the details page of the model.
Models with more parameters are generally more capable, but such models require a larger amount of data for fine-tuning and incur higher fees when they are deployed as model services.
To find a model that is suitable for your business, perform the following steps:
Go to the Model Gallery page.
Log on to the PAI console.
In the left-side navigation pane, click Workspaces. On the Workspaces page, find the workspace that you want to manage and click the name of the workspace. The Workspace Details page appears.
In the left-side navigation pane, choose Model Gallery to go to the Model Gallery page.
Search for a suitable model.
You can deploy the selected model, debug it online, and evaluate its performance. For more information, see the next section.
Deploy and debug a model
On the Model Gallery page, click the model that you want to use to go to the model details page.
Directly deploy a model as a service
On the model details page, click Deploy in the upper-right corner.
Optional. Configure the model service and resource deployment information.
Model Gallery presets Model Service Information and Resource Deployment Information based on the characteristics of each model. You can use the default values or modify the parameters based on your business requirements. The following table describes the parameters.
Parameter
Description
Service Name
The name of the model service. You can use the default name or change the name based on the naming requirements. The name must be unique in a region.
Resource Group Type
The type of resource group that is used to deploy the model service. You can use a public resource group or a dedicated resource group.
Resource Configuration
The instance type that is used to deploy the service. You can use the default instance type or select another one based on your business requirements. We recommend that you do not select an instance type whose computing capacity is lower than that of the default instance type. Otherwise, the model may fail to be deployed due to insufficient computing resources.
Click Deploy in the Deploy panel and then click OK in the Billing Notification dialog box.
You are directed to the details page of the model service. On the Service details tab, you can view the basic information and resource information of the service. When the service enters the In operation state, the model is deployed.
Debug a model service online
Click the Service details tab and then click Online Prediction in the right-side navigation pane. Enter the request data in the field and click Send Request. You can evaluate the model performance based on the output.
You can construct request data in the required format described on the model details page. Some models, such as Stable Diffusion V1.5, support convenient model inference by using a web application. You can click Web Application in the right-side navigation pane and click View Web App to access the web application.
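For models that accept JSON input, you can also send an online prediction request programmatically over HTTP. The following sketch assumes a hypothetical endpoint URL and token, which you can copy from the service details page; the payload schema (the `prompt` field below is only an example) varies by model, so check the model details page.

```python
import json
import urllib.request

# Placeholders -- copy the real values from the service details page.
SERVICE_URL = "https://<your-service-endpoint>/"  # hypothetical
SERVICE_TOKEN = "<your-service-token>"            # hypothetical

def build_request(prompt):
    """Build the HTTP headers and JSON body for an online prediction call.
    The exact payload schema varies by model."""
    headers = {
        "Authorization": SERVICE_TOKEN,
        "Content-Type": "application/json",
    }
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return headers, body

def send_request(prompt):
    """Send the request to the deployed service and return the raw response."""
    headers, body = build_request(prompt)
    req = urllib.request.Request(SERVICE_URL, data=body, headers=headers)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

Evaluating the responses returned by `send_request` is equivalent to debugging the service on the Online Prediction tab.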
If the dataset that is used to pre-train the model does not match your business scenario well, the inference output may not meet your expectations. In this case, you can fine-tune the model based on your business requirements. For more information, see the next section.
Train a model
To fine-tune a pre-trained model by using your own dataset, perform the following steps:
On the details page of the model, click Train in the upper-right corner.
In the Fine-tune panel, configure the parameters that are described in the following table.
Note: The parameters that you can configure may vary based on the model.
Section
Parameter
Description
Training Mode
The training mode. Valid values:
SFT (Supervised Fine-Tuning): fine-tunes the parameters of a large model by specifying the input and expected output of the model.
DPO (Direct Preference Optimization): directly optimizes the language model to fit human preferences. It pursues the same optimization goal as the RLHF algorithm.
Both training modes support Full-Parameter Fine-Tuning, LoRA, and QLoRA.
Job Configuration
Task Name
The name of the job. You can use the default job name or change the name by following the on-screen instructions.
Maximum running time
The maximum period of time for which the job can run. When a job reaches the time limit, the system terminates the job and returns the result.
The default value 0 specifies that no time limit is imposed on your training jobs.
Dataset Configuration
Training dataset
The dataset that is used for training. A default dataset is provided. You can also prepare your own training dataset in the required format and upload the dataset by using one of the following methods:
Select OSS Object or Directory from the drop-down list.
Click the file selection icon, select the OSS bucket in which the dataset is stored, and then select the dataset. If the dataset that you prepared is not uploaded to the OSS bucket, select a storage path and perform the following steps to upload the dataset to the bucket:
Click the Upload File tab.
Click View local files or Drag file here to upload the dataset.
Select Dataset Selection from the drop-down list.
Select a dataset that is stored in a cloud storage service, such as NAS or OSS, from the drop-down list. If no datasets are available, click New dataset to create one. For information about how to configure a dataset, see Create and manage datasets.
Validation Dataset
Click Add validation dataset. The configuration method of the validation dataset is the same as the configuration method of the training dataset.
Output Configuration
The path in which the trained model and TensorBoard logs are stored in the cloud storage service.
Note: If a default OSS path is configured for the workspace, this parameter automatically uses the default path. For information about how to configure a default storage path for a workspace, see Manage workspaces.
If you want to store your model in a file system, such as File Storage NAS (NAS) or Cloud Parallel File Storage (CPFS), you must create a dataset and select the dataset for output. For information about how to create and manage datasets, see Create and manage datasets.
Computing resources
Number of Nodes
The number of nodes.
Instance Type
The type of the instance. For information about the instance types supported by DLC and related billing, see Billing of DLC.
Lingjun resources: Supported only in the China (Ulanqab) and Singapore regions. LLMs with a large number of parameters require GPUs with more memory to load and run. In this case, you can use Lingjun resources, such as GU100 or GU108 instances.
Due to limited inventory, enterprise users must contact their sales manager to be added to the whitelist.
Individual users can use Lingjun resources through preemptible instances at discounts of up to 90%. For more information, see Create a resource group and purchase Lingjun resources.
Hyper-parameters
The hyperparameters that you can configure vary based on the model. You can use the default values or modify the hyperparameters based on your business requirements.
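Before you configure the training dataset described above, prepare your data in the format that the model requires. The required field names vary by model and are listed on the model details page; the sketch below writes hypothetical examples of two commonly used schemas, instruction/output pairs for SFT and prompt/chosen/rejected triples for DPO, in JSON Lines format.

```python
import json

# Hypothetical samples in commonly used schemas; check the model details
# page for the exact field names that your model expects.
sft_samples = [
    {"instruction": "Summarize: PAI Model Gallery offers pre-trained models.",
     "output": "Model Gallery provides ready-to-use pre-trained models."},
]
dpo_samples = [
    {"prompt": "What is Model Gallery?",
     "chosen": "A PAI feature that provides pre-trained models.",
     "rejected": "I don't know."},
]

# Write one JSON object per line (JSON Lines), a format many models accept.
with open("sft_train.jsonl", "w", encoding="utf-8") as f:
    for sample in sft_samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```

You can then upload the resulting file to your OSS bucket and select it as the training dataset.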
Click Train. In the Billing Notification dialog box, click OK.
You are directed to the details page of the training job on which you can view the basic information, real-time status, logs, and evaluation results of the job. The evaluation methods may vary based on the model.
Note: The trained model is automatically registered as a model asset. You can view or deploy the model. For more information, see Register and manage models.