Best practice for LoRA training in DLC - Platform For AI

This topic describes how to find the optimal hyperparameter combination for a Low-Rank Adaptation (LoRA) training job by submitting an AutoML experiment that runs on Deep Learning Containers (DLC) resources in Platform for AI (PAI).

Prerequisites

The permissions that are required to use AutoML are granted to your account. This prerequisite must be met if you use AutoML for the first time. For more information, see Grant permissions that are required to use AutoML.
The permissions that are required to use DLC are granted to your account. For more information, see Grant the permissions that are required to use DLC.
A workspace is created and associated with a public resource group for general computing resources. For more information, see Create a workspace.
Object Storage Service (OSS) is activated and an OSS bucket is created. For more information, see Get started by using the OSS console.

Step 1: Create a dataset

Create a dataset of the OSS type and mount it to a DLC path so that the data files generated by the hyperparameter tuning experiment are stored in the OSS directory. Take note of the following parameters and use the default settings for the other parameters. For more information, see the "Create a dataset based on data that is stored in an Alibaba Cloud storage service" section in the Create and manage datasets topic.

Name: Enter the name of the dataset.
Select data store: Select the OSS path in which the script file is stored.
Property: Select a folder.

Step 2: Create an experiment

Go to the Create Experiment page, and perform the following operations to configure key parameters. For information about the settings of other parameters, see Create an experiment. After you configure the parameters, click Submit.

Configure parameters in the Execution Configurations section.

Parameter	Description
Job Type	Select DLC.
Resource Group	Select Public Resource Group.
Framework	Select Tensorflow.
Datasets	Select the dataset that you created in Step 1.
Node Image	Select Image Address and enter the image address in the input field. Example: `registry.cn-shanghai.aliyuncs.com/mybigpai/nni:diffusers`. The following data is preset in the image: Pre-trained basic model: The Stable-Diffusion-v1-5 is preset in the `/workspace/diffusers_model_data/model` path. LoRA training data: The pokemon data is preset in the `/workspace/diffusers_model_data/data` path. Training code: The diffusers are preset in the `/workspace/diffusers` path.
Instance Type	Click GPU and select the 12vCPU+92GB Mem+1 NVIDIA V100 (ecs.gn6e-c12g1.3xlarge) instance type.
Nodes	Set the value to 1.
Node Startup Command	`cd /workspace/diffusers/examples/text_to_image && accelerate launch --mixed_precision="fp16" train_text_to_image_lora_eval.py --pretrained_model_name_or_path="/workspace/diffusers_model_data/model" --dataset_name="/workspace/diffusers_model_data/data" --caption_column="text" --resolution=512 --random_flip --train_batch_size=8 --val_batch_size=8 --num_train_epochs=100 --checkpointing_steps=100 --learning_rate=${lr} --lr_scheduler=${lr_scheduler} --lr_warmup_steps=0 --rank=${rank} --adam_beta1=${adam_beta1} --adam_beta2=${adam_beta2} --adam_weight_decay=${adam_weight_decay} --max_grad_norm=${max_grad_norm} --seed=42 --output_dir="/mnt/data/diffusers/pokemon/sd-pokemon_${exp_id}_${trial_id}" --validation_prompts "a cartoon pikachu pokemon with big eyes and big ears" --validation_metrics ImageRewardPatched --save_by_metric val_loss`
Hyperparameter	The constraint types and search spaces of the hyperparameters are as follows. lr: Constraint Type: choice. Search Space: Click the icon to add the following enumeration values: 1e-4, 1e-5, and 2e-5. lr_scheduler: Constraint Type: choice. Search Space: Click the icon to add the following enumeration values: constant, cosine, and polynomial. rank: Constraint Type: choice. Search Space: Click the icon to add the following enumeration values: 4, 32, and 64. adam_beta1: Constraint Type: choice. Search Space: Click the icon to add the following enumeration values: 0.9 and 0.95. adam_beta2: Constraint Type: choice. Search Space: Click the icon to add the following enumeration values: 0.99 and 0.999. adam_weight_decay: Constraint Type: choice. Search Space: Click the icon to add the following enumeration values: 1e-2 and 1e-3. max_grad_norm: Constraint Type: choice. Search Space: Click the icon to add the following enumeration values: 1, 5, and 10. The preceding configuration defines 3 × 3 × 3 × 2 × 2 × 2 × 3 = 648 hyperparameter combinations. Each trial that the system creates runs with one of these combinations.
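
The size of this search space, and the way one combination fills the `${...}` placeholders of the startup command, can be sketched in Python. This is only an illustration: the shortened command fragment and the use of `string.Template` are assumptions made for the example, not a description of how AutoML substitutes values internally.

```python
from itertools import product
from string import Template

# Search space copied from the Hyperparameter settings in this topic.
search_space = {
    "lr": ["1e-4", "1e-5", "2e-5"],
    "lr_scheduler": ["constant", "cosine", "polynomial"],
    "rank": ["4", "32", "64"],
    "adam_beta1": ["0.9", "0.95"],
    "adam_beta2": ["0.99", "0.999"],
    "adam_weight_decay": ["1e-2", "1e-3"],
    "max_grad_norm": ["1", "5", "10"],
}

# Cartesian product of all enumeration values: one dict per combination.
names = list(search_space)
combinations = [
    dict(zip(names, values)) for values in product(*search_space.values())
]
print(len(combinations))  # 648

# Abbreviated fragment of the startup command (assumption: shortened here
# for readability; the full command uses the same ${name} placeholders).
command_fragment = Template(
    "--learning_rate=${lr} --lr_scheduler=${lr_scheduler} --rank=${rank}"
)
rendered = command_fragment.substitute(combinations[0])
print(rendered)
```

With Maximum Trials set to 5 in the Search Configurations section, only a small part of these 648 combinations is actually run.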

Configure parameters in the Trial Configuration section.

Parameter	Description
Metric Type	Select stdout.
Method	Select best.
Metric Weight	Key: `val_loss=([0-9\\.]+)`. Value: 1.
Metric Source	Set the value to cmd1.
Optimization	Select Maximize.
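
To see what the Metric Weight key captures, the following Python sketch applies the same regular expression to a few stdout lines. The log lines are hypothetical, and reading the best method together with Maximize as "take the largest captured value" is an assumption made for illustration, not a documented description of how AutoML aggregates metrics.

```python
import re

# Same pattern as the Metric Weight key in the Trial Configuration section.
METRIC_PATTERN = re.compile(r"val_loss=([0-9\\.]+)")

# Hypothetical stdout lines; the real training script may log differently.
sample_stdout = [
    "step 100: val_loss=0.052310",
    "step 200: val_loss=0.087655",
    "step 300: val_loss=0.071204",
]


def extract_metrics(lines):
    """Return every value captured by the metric pattern, in log order."""
    values = []
    for line in lines:
        match = METRIC_PATTERN.search(line)
        if match:
            values.append(float(match.group(1)))
    return values


metrics = extract_metrics(sample_stdout)
# Assumption: "best" combined with "Maximize" keeps the largest value.
best = max(metrics)
print(best)
```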

Configure parameters in the Search Configurations section.

Parameter	Description
Search Algorithm	Select TPE.
Maximum Trials	Set the value to 5.
Maximum Concurrent Trials	Set the value to 2.
Enable EarlyStop	Specifies whether to enable the early stopping feature.
start step	5

Step 3: View the experiment details and results

On the AutoML page, click the name of the experiment to go to the Experiment Details page.
On the Experiment Details page, you can view the execution progress and status statistics of trials. The experiment automatically creates five trials based on the specified search algorithm and the maximum number of trials.
Click Trials to go to the Trials tab. You can view the trials that are automatically generated for the experiment, and the execution status, final metrics, and hyperparameter combination of each trial.
The trials run for approximately 5 hours. Because the Optimization parameter is set to Maximize in this example, the trial with the highest final metric, 0.087655, identifies the optimal hyperparameter combination.

Step 4: Deploy the model service and perform model inference

Download the LoRA model and convert the model file format.
1. After you run the experiment, a model file is generated in the output_dir directory that you specified in the startup command. You can go to the checkpoint-best directory of the OSS path that is mounted to the experiment to view and download the model file. For more information, see Get started by using the OSS console.
2. Run the following command to convert pytorch_model.bin into pytorch_model_converted.safetensors:
```
wget http://automl-nni.oss-cn-beijing.aliyuncs.com/aigc/convert.py
python convert.py --file pytorch_model.bin
```

Deploy a Stable Diffusion web application.

Go to the Elastic Algorithm Service (EAS) page. For more information, see the "Step 1: Go to the EAS-Online Model Services page" section of the Model service deployment by using the PAI console topic.
On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the dialog box that appears, select Custom Deployment and click OK.

On the Create Service page, configure the parameters and click Deploy. The following table describes the parameters.

Parameter	Description
Service Name	The name of the service. In this topic, sdwebui_demo is used.
Deployment Method	Select Deploy Web App by Using Image.
Select Image	Click PAI Image and select stable-diffusion-webui from the Image drop-down list and 4.2-standard from the Image Version drop-down list. Note: You can select the latest version of the image when you deploy the model service.
Model Settings	Click Specify Model Settings to configure the model. In the Model Settings section, select Mount OSS Path and specify the OSS bucket path that you created in Step 1. Example: `oss://bucket-test/data-oss/`. Mount Path: Mount the OSS file directory to the `/code/stable-diffusion-webui` path of the image. Example: `/code/stable-diffusion-webui/data-oss`. Enable Read-only Mode: Turn off the read-only mode.
Command to Run	After you configure the image, the system automatically specifies the command. You must append `--data-dir <mount directory>` to the command. The mount directory must be the same as the last-level directory of the Mount Path that you specified in the Model Settings. In this example, `--data-dir data-oss` is appended to the command.
Resource Configuration Mode	Select General.
Resource Configuration	Select an Instance Type on the GPU tab. To ensure cost-effectiveness, we recommend that you use the ml.gu7i.c16m60.1-gu30 instance type.
System Disks	Set the additional system disk capacity to 100 GB.
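
The relationship between the Mount Path and the `--data-dir` argument can be checked with a short Python sketch. The `base_command` value is a placeholder for the command that the system fills in automatically; it is an assumption for illustration, not the real command.

```python
import posixpath

# Mount Path from the Model Settings example in this topic.
mount_path = "/code/stable-diffusion-webui/data-oss"

# Placeholder (assumption) for the command that the system specifies
# automatically after you configure the image.
base_command = "<auto-generated command>"

# The --data-dir value is the last-level directory of the Mount Path.
data_dir = posixpath.basename(mount_path.rstrip("/"))
command = f"{base_command} --data-dir {data_dir}"
print(command)
```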

Click Deploy.
The following figure shows the directory that PAI automatically creates in the empty OSS directory you specified. PAI also copies the required data to the directory.

Upload the model files to the following paths, and then select Restart Service in the Actions column of the service. The configuration takes effect after the service is restarted.
- Upload the pytorch_model_converted.safetensors model file generated in the preceding step to the models/lora/ directory of OSS.
- Upload the revAnimated_v122 basic model to the models/Stable-diffusion/ directory of OSS.
Find the service that you want to manage and click View Web App in the Service Type column. On the Web UI page, perform model inference and verification.
