This topic describes how to find the optimal hyperparameter combination for Low-Rank Adaptation (LoRA) training by submitting an experiment that uses Deep Learning Container (DLC) resources in AutoML of Platform for AI (PAI).
Prerequisites
The permissions that are required to use AutoML are granted to your account. This prerequisite must be met if you use AutoML for the first time. For more information, see Grant permissions that are required to use AutoML.
The permissions that are required to use DLC are granted to your account. For more information, see Grant the permissions that are required to use DLC.
A workspace is created and associated with a public resource group for general computing resources. For more information, see Create a workspace.
Object Storage Service (OSS) is activated and an OSS bucket is created. For more information, see Get started by using the OSS console.
Step 1: Create a dataset
Create a dataset of the OSS type. Mount the dataset to the DLC path to store the data files generated by the hyperparameter tuning experiment in the OSS directory. Take note of the following parameters. Use the default settings for other parameters. For more information, see the "Create a dataset based on data that is stored in an Alibaba Cloud storage service" section in the Create and manage datasets topic.
Name: Enter the name of the dataset.
Select data store: Select the OSS path in which the script file is stored.
Property: Select a folder.
Step 2: Create an experiment
Go to the Create Experiment page, and perform the following operations to configure key parameters. For information about the settings of other parameters, see Create an experiment. After you configure the parameters, click Submit.
Configure parameters in the Execution Configurations section.
Parameter
Description
Job Type
Select DLC.
Resource Group
Select Public Resource Group.
Framework
Select Tensorflow.
Datasets
Select the dataset that you created in Step 1.
Node Image
Select Image Address and enter the image address in the input field. Example:
registry.cn-shanghai.aliyuncs.com/mybigpai/nni:diffusers
The following data is preset in the image:
Pre-trained basic model: Stable-Diffusion-v1-5 is preset in the /workspace/diffusers_model_data/model path.
LoRA training data: The pokemon data is preset in the /workspace/diffusers_model_data/data path.
Training code: The diffusers code is preset in the /workspace/diffusers path.
Instance Type
Click
instance type.
Nodes
Set the value to 1.
Node Startup Command
cd /workspace/diffusers/examples/text_to_image && accelerate launch --mixed_precision="fp16" train_text_to_image_lora_eval.py \
  --pretrained_model_name_or_path="/workspace/diffusers_model_data/model" \
  --dataset_name="/workspace/diffusers_model_data/data" \
  --caption_column="text" \
  --resolution=512 --random_flip \
  --train_batch_size=8 \
  --val_batch_size=8 \
  --num_train_epochs=100 --checkpointing_steps=100 \
  --learning_rate=${lr} --lr_scheduler=${lr_scheduler} --lr_warmup_steps=0 \
  --rank=${rank} --adam_beta1=${adam_beta1} --adam_beta2=${adam_beta2} --adam_weight_decay=${adam_weight_decay} \
  --max_grad_norm=${max_grad_norm} \
  --seed=42 \
  --output_dir="/mnt/data/diffusers/pokemon/sd-pokemon_${exp_id}_${trial_id}" \
  --validation_prompts "a cartoon pikachu pokemon with big eyes and big ears" \
  --validation_metrics ImageRewardPatched \
  --save_by_metric val_loss
Hyperparameter
The following section describes the constraint types and search spaces of the hyperparameters:
lr:
Constraint Type: choice.
Search Space: Click the icon to add the following enumeration values: 1e-4, 1e-5, and 2e-5.
lr_scheduler:
Constraint Type: choice.
Search Space: Click the icon to add the following enumeration values: constant, cosine, and polynomial.
rank:
Constraint Type: choice.
Search Space: Click the icon to add the following enumeration values: 4, 32, and 64.
adam_beta1:
Constraint Type: choice.
Search Space: Click the icon to add the following enumeration values: 0.9 and 0.95.
adam_beta2:
Constraint Type: choice.
Search Space: Click the icon to add the following enumeration values: 0.99 and 0.999.
adam_weight_decay:
Constraint Type: choice.
Search Space: Click the icon to add the following enumeration values: 1e-2 and 1e-3.
max_grad_norm:
Constraint Type: choice.
Search Space: Click the icon to add the following enumeration values: 1, 5, and 10.
The preceding search space contains 648 hyperparameter combinations (3 × 3 × 3 × 2 × 2 × 2 × 3). The system creates a trial for each combination that the search algorithm selects, and each trial runs with exactly one combination.
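The count of 648 can be checked by enumerating the search space. The following Python sketch is illustrative only: it rebuilds the seven choice lists from this topic, counts their Cartesian product, and fills the `${...}` placeholders of the startup command for one combination. The use of `string.Template` is an assumption made for illustration; how AutoML substitutes the placeholders internally is not described in this topic.

```python
from itertools import product
from string import Template

# Search space copied from the Hyperparameter settings in this topic.
search_space = {
    "lr": ["1e-4", "1e-5", "2e-5"],
    "lr_scheduler": ["constant", "cosine", "polynomial"],
    "rank": ["4", "32", "64"],
    "adam_beta1": ["0.9", "0.95"],
    "adam_beta2": ["0.99", "0.999"],
    "adam_weight_decay": ["1e-2", "1e-3"],
    "max_grad_norm": ["1", "5", "10"],
}

# Every combination is one value per hyperparameter (Cartesian product).
names = list(search_space)
combinations = [
    dict(zip(names, values)) for values in product(*search_space.values())
]
print(len(combinations))  # 648

# Fragment of the startup command that holds the placeholders.
# Template substitution is a stand-in for whatever AutoML does internally.
command_fragment = Template(
    "--learning_rate=${lr} --lr_scheduler=${lr_scheduler} --rank=${rank} "
    "--adam_beta1=${adam_beta1} --adam_beta2=${adam_beta2} "
    "--adam_weight_decay=${adam_weight_decay} --max_grad_norm=${max_grad_norm}"
)
first_trial_args = command_fragment.substitute(combinations[0])
print(first_trial_args)
```

Because Maximum Trials is set to 5 later in this topic, only a small subset of these combinations is actually run.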
Configure parameters in the Trial Configuration section.
Parameter
Description
Metric Type
Select stdout.
Method
Select best.
Metric Weight
Key: val_loss=([0-9\\.]+).
Value: 1.
Metric Source
Set the value to cmd1.
Optimization
Select Maximize.
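To see what the Key expression captures, the following Python sketch applies the same regular expression to some stdout text and keeps the best value. It is a rough illustration under stated assumptions: the log line format and the sample values are hypothetical, and the way AutoML combines Method and Optimization internally is assumed rather than documented here.

```python
import re

# Same expression as the Key in Metric Weight; the group captures the number.
METRIC_PATTERN = re.compile(r"val_loss=([0-9\\.]+)")


def best_metric(stdout_text, maximize=True):
    """Return the best captured value, or None if nothing parses.

    Assumption: "best" means the maximum when Optimization is Maximize
    and the minimum otherwise.
    """
    values = []
    for captured in METRIC_PATTERN.findall(stdout_text):
        try:
            values.append(float(captured))
        except ValueError:
            # The character class also accepts strings such as "1.2.3".
            continue
    if not values:
        return None
    return max(values) if maximize else min(values)


# Hypothetical stdout lines; real training logs may look different.
sample_stdout = "step 100 val_loss=0.05\nstep 200 val_loss=0.08\n"
print(best_metric(sample_stdout, maximize=True))
```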
Configure parameters in the Search Configurations section.
Parameter
Description
Search Algorithm
Select TPE.
Maximum Trials
Set the value to 5.
Maximum Concurrent Trials
Set the value to 2.
Enable EarlyStop
Specifies whether to enable the early stopping feature.
start step
5
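The relationship between Maximum Trials and Maximum Concurrent Trials can be pictured with a small worker pool. The following Python sketch is not how PAI schedules trials; it only shows five stand-in trials running with at most two in flight at any time. The `run_trial` function and its short sleep are hypothetical placeholders for a real training job.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_TRIALS = 5       # Maximum Trials in this topic
MAX_CONCURRENT = 2   # Maximum Concurrent Trials in this topic

_lock = threading.Lock()
_running = 0
peak_running = 0     # highest number of trials observed running at once


def run_trial(trial_id):
    """Hypothetical stand-in for one training trial."""
    global _running, peak_running
    with _lock:
        _running += 1
        peak_running = max(peak_running, _running)
    try:
        time.sleep(0.01)  # placeholder for the actual training time
        return trial_id
    finally:
        with _lock:
            _running -= 1


# The pool size caps concurrency; the range caps the total trial count.
with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
    finished = list(pool.map(run_trial, range(MAX_TRIALS)))

print(finished, peak_running)
```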
Step 3: View the experiment details and results
On the AutoML page, click the name of the experiment to go to the Experiment Details page.
On the Experiment Details page, you can view the execution progress and status statistics of trials. The experiment automatically creates five trials based on the specified search algorithm and the maximum number of trials.
Click Trials to go to the Trials tab. You can view the trials that are automatically generated for the experiment, and the execution status, final metrics, and hyperparameter combination of each trial.
The trials run for approximately 5 hours. In this example, the Optimization parameter is set to Maximize. The hyperparameter combination indicated by the metric 0.087655 is the optimal combination.
Step 4: Deploy the model service and perform model inference
Download the LoRA model and convert the model file format.
After you run the experiment, a model file is generated in the output_dir directory that you specified in the startup command. You can go to the checkpoint-best directory of the OSS path that is mounted to the experiment to view and download the model file. For more information, see Get started by using the OSS console.
Run the following commands to convert pytorch_model.bin into pytorch_model_converted.safetensors:
wget http://automl-nni.oss-cn-beijing.aliyuncs.com/aigc/convert.py
python convert.py --file pytorch_model.bin
Deploy a Stable Diffusion web application.
Go to the Elastic Algorithm Service (EAS) page. For more information, see the "Step 1: Go to the EAS-Online Model Services page" section of the Model service deployment by using the PAI console topic.
On the Elastic Algorithm Service (EAS) page, click Deploy Service. In the dialog box that appears, select Custom Deployment and click OK.
On the Create Service page, configure the parameters and click Deploy. The following table describes the parameters.
Parameter
Description
Service Name
The name of the service. In this topic, sdwebui_demo is used.
Deployment Method
Select Deploy Web App by Using Image.
Select Image
Click PAI Image and select stable-diffusion-webui from the Image drop-down list and 4.2-standard from the Image Version drop-down list.
Note: You can select the latest version of the image when you deploy the model service.
Model Settings
Click Specify Model Settings to configure the model.
In the Model Settings section, select Mount OSS Path and specify the OSS bucket path that you created in Step 1. Example: oss://bucket-test/data-oss/
Mount Path: Mount the OSS file directory to the /code/stable-diffusion-webui path of the image. Example: /code/stable-diffusion-webui/data-oss
Enable Read-only Mode: Turn off the read-only mode.
Command to Run
After you configure the image, the system automatically specifies the command. You must append --data-dir <mount directory> to the command. The mount directory must be the same as the last-level directory of the Mount Path that you specified in the Model Settings. In this example, --data-dir data-oss is appended to the command.
Resource Configuration Mode
Select General.
Resource Configuration
Select an Instance Type on the GPU tab. To ensure cost-effectiveness, we recommend that you use the ml.gu7i.c16m60.1-gu30 instance type.
System Disks
Set the additional system disk capacity to 100 GB.
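The --data-dir rule in Command to Run can be expressed as taking the last-level directory of the Mount Path. The following Python sketch shows that derivation for the example path used in this topic. The helper name `data_dir_argument` is hypothetical and is not part of PAI or Stable Diffusion WebUI.

```python
import posixpath


def data_dir_argument(mount_path):
    """Build the --data-dir argument from the last-level directory of mount_path."""
    # Drop any trailing slash so that basename returns the final directory name.
    last_level = posixpath.basename(mount_path.rstrip("/"))
    if not last_level:
        raise ValueError("mount_path must contain at least one directory name")
    return f"--data-dir {last_level}"


print(data_dir_argument("/code/stable-diffusion-webui/data-oss"))
# prints: --data-dir data-oss
```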
Click Deploy.
The following figure shows the directory that PAI automatically creates in the empty OSS directory you specified. PAI also copies the required data to the directory.
Upload the model files to the specified path and select > Restart Service in the Actions column of the service. The configuration takes effect after the service is restarted.
Upload the pytorch_model_converted.safetensors model file generated in the preceding step to the models/lora/ directory of OSS.
Upload the revAnimated_v122 basic model to the models/Stable-diffusion/ directory of OSS.
Find the service that you want to manage and click View Web App in the Service Type column. On the Web UI page, perform model inference and verification.