Submit a hyperparameter fine-tuning experiment on DLC computing resources - Platform For AI

This topic describes how to submit an AutoML experiment on Deep Learning Containers (DLC) computing resources to perform hyperparameter fine-tuning. This solution uses the PyTorch framework. It automatically downloads and loads an MNIST dataset of handwritten digits by using the torchvision.datasets.MNIST module and uses the dataset to train models. This way, you can obtain the optimal hyperparameter combination. You can use the standalone, distributed, or nested parameter mode to train models based on your training requirements.

Prerequisites

The permissions that are required to use AutoML are granted to your account. This prerequisite must be met if you use AutoML for the first time. For more information, see Grant permissions that are required to use AutoML.
The permissions that are required to use DLC are granted to your account. For more information, see Grant the permissions that are required to use DLC.
A workspace is created and associated with a public resource group for general computing resources. For more information, see Create a workspace.
Object Storage Service (OSS) is activated and an OSS bucket is created. For more information, see Get started by using the OSS console.

Step 1: Create a dataset

Upload the script file mnist.py to the created OSS bucket. For more information, see Get started by using the OSS console.
Create an OSS dataset to store data files that are generated in hyperparameter fine-tuning experiments. For more information, see the "Create a dataset based on data that is stored in an Alibaba Cloud storage service" section in Create and manage datasets.
Configure the following key parameters based on actual situations and retain default values of other parameters:
- Name: Enter the name of the dataset.
- Select data store: Select the OSS path where the script file is stored.
- Property: Select a folder.

Step 2: Create an experiment

Go to the Create Experiment page, and perform the following steps to configure key parameters. For more information about the settings of other parameters, see Create an experiment. After you configure the parameters, click Submit.

Configure parameters in the Execution Configurations section.

This solution provides the standalone, distributed, and nested parameter training modes. You can select one mode to train models.

Parameter settings used for the standalone training mode

Parameter	Description
Job Type	Select DLC.
Resource Group	Select Public Resource Group.
Framework	Select PyTorch.
Datasets	Select the dataset that you created in Step 2.
Node Image	Select PAI Image. Then, select pytorch-training:1.12PAI-gpu-py38-cu113-ubuntu20.04 from the drop-down list.
Instance Type	Select CPU. Then, select 16vCPU+64GB Mem ecs.g6.4xlarge from the drop-down list.
Nodes	Set this parameter to 1.
Node Startup Command	Enter `python3 /mnt/data/mnist.py --save_model=/mnt/data/examples/search/model/model_${exp_id}_${trial_id} --batch_size=${batch_size} --lr=${lr}`.
Hyperparameter	batch_size Constraint Type: Select choice. Search Space: Click to add three enumerated values: 16, 32, and 64. lr Constraint Type: Select choice. Search Space: Click to add three enumerated values: 0.0001, 0.001, and 0.01. The experiment can generate nine hyperparameter combinations based on the preceding configurations and create a trial for each of the hyperparameter combinations. In each trial, the hyperparameter combination is used to run the script.

Parameter settings used for the distributed training mode

Parameter	Description
Job Type	Select DLC.
Resource Group	Select Public Resource Group.
Framework	Select PyTorch.
Datasets	Select the dataset that you created in Step 2.
Node Image	Select PAI Image. Then, select pytorch-training:1.12PAI-gpu-py38-cu113-ubuntu20.04 from the drop-down list.
Instance Type	Select CPU. Then, select 16vCPU+64GB Mem ecs.g6.4xlarge from the drop-down list.
Nodes	Set this parameter to 3.
Node Startup Command	Enter `python -m torch.distributed.launch --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT --nproc_per_node=1 --nnodes=$WORLD_SIZE --node_rank=$RANK /mnt/data/mnist.py --data_dir=/mnt/data/examples/search/data --save_model=/mnt/data/examples/search/pai/model/model_${exp_id}_${trial_id} --batch_size=${batch_size} --lr=${lr}`.
Hyperparameter	batch_size Constraint Type: Select choice. Search Space: Click to add three enumerated values: 16, 32, and 64. lr Constraint Type: Select choice. Search Space: Click to add three enumerated values: 0.0001, 0.001, and 0.01. The experiment can generate nine hyperparameter combinations based on the preceding configurations and create a trial for each of the hyperparameter combinations. In each trial, the hyperparameter combination is used to run the script.

Parameter settings used for the nested parameter training mode

Parameter	Description
Job Type	Select DLC.
Resource Group	Select Public Resource Group.
Framework	Select PyTorch.
Datasets	Select the dataset that you created in Step 2.
Node Image	Select PAI Image. Then, select pytorch-training:1.12PAI-gpu-py38-cu113-ubuntu20.04 from the drop-down list.
Instance Type	Select CPU. Then, select 16vCPU+64GB Mem ecs.g6.4xlarge from the drop-down list.
Nodes	Set this parameter to 1.
Node Startup Command	Enter `python3 /mnt/data/mnist.py --save_model=/mnt/data/examples/search/pai/model/model_${exp_id}_${trial_id} --batch_size=${nested_params}.{batch_size} --lr=${nested_params}.{lr} --gamma=${gamma}`.
Hyperparameter	nested_params Constraint Type: Select choice. Search Space: Click to add two enumerated values: `{"_name":"large","{lr}":{"_type":"choice","_value":[0.02,0.2]},"{batch_size}":{"_type":"choice","_value":[256,128]}}` and `{"_name":"small","{lr}":{"_type":"choice","_value":[0.01,0.1]},"{batch_size}":{"_type":"choice","_value":[64,32]}}`. gamma Constraint Type: Select choice. Search Space: Click to add three enumerated values: 0.8, 0.7, and 0.9. The experiment can generate nine hyperparameter combinations based on the preceding configurations and create a trial for each of the hyperparameter combinations. In each trial, the hyperparameter combination is used to run the script.

Configure parameters in the Trial Configuration section.

Parameter		Description
Metric Optimization	Metric Type	Select stdout. This setting indicates that the final metric value is extracted from stdout in the running process.
	Method	Select best.
	Metric Weight	Use the following settings: key: validation: accuracy=([0-9\\.]+) Value: 1
	Metric Source	Configure cmd1 as the command keyword.
	Optimization	Select Maximize.
Model Storage Path		Enter the OSS path where the model is saved. In this example, the path is `oss://examplebucket/examples/model/model_${exp_id}_${trial_id}`.

Configure parameters in the Search Configurations section.

Parameter	Description
Search Algorithm	Select TPE. For more information about search algorithms, see the "Supported search algorithms" section in Limits and usage notes of AutoML.
Maximum Trials	Set this parameter to 3. This value indicates that up to three trials can run in the experiment.
Maximum Concurrent Trials	Set this parameter to 2. This value indicates that up to two trials can run in parallel in the experiment.
Enable EarlyStop	Specifies whether to enable the early stopping feature. This feature enables the system to stop the evaluation process of a trial early if the related hyperparameter combination is obviously underperforming.
Start step	Set this parameter to 5. This value indicates that the system can decide whether to early stop a trial after five evaluations on the trial are completed.

Step 3: View the experiment details and execution results

In the experiment list, click the name of the desired experiment to go to the Experiment Details page.
On the Experiment Details page, you can view the execution progress and status statistics of trials. The experiment automatically creates three trials based on the settings of the Search Algorithm and Maximum Trials parameters.
Click the Trials tab to view all the trials that are generated by the experiment, and the execution status, final metric value, and hyperparameter combination of each trial.
In this example, Optimization is set to Maximize. In the preceding figure, the hyperparameter combination (batch_size: 16 and lr: 0.01) that corresponds to the final metric value 96.52 is the optimal hyperparameter combination.

References

You can also submit hyperparameter fine-tuning experiments on MaxCompute computing resources. For more information, see Best practice for running the K-means Clustering component.
For more information about how AutoML works, see AutoML.