All Products
Search
Document Center

Platform For AI:Best practice for performing DLC MNIST training

Last Updated:Mar 28, 2024

This topic describes how to submit an AutoML experiment on Deep Learning Containers (DLC) computing resources to perform hyperparameter fine-tuning. This solution uses the PyTorch framework. It automatically downloads and loads an MNIST dataset of handwritten digits by using the torchvision.datasets.MNIST module and uses the dataset to train models. This way, you can obtain the optimal hyperparameter combination. You can use the standalone, distributed, or nested parameter mode to train models based on your training requirements.

Prerequisites

Step 1: Create a dataset

  1. Upload the script file mnist.py to the created OSS bucket. For more information, see Get started by using the OSS console.

  2. Create an OSS dataset to store data files that are generated in hyperparameter fine-tuning experiments. For more information, see the "Create a dataset based on data that is stored in an Alibaba Cloud storage service" section in Create and manage datasets.

    Configure the following key parameters based on actual situations and retain default values of other parameters:

    • Name: Enter the name of the dataset.

    • Select data store: Select the OSS path where the script file is stored.

    • Property: Select a folder.

Step 2: Create an experiment

Go to the Create Experiment page, and perform the following steps to configure key parameters. For more information about the settings of other parameters, see Create an experiment. After you configure the parameters, click Submit.

  1. Configure parameters in the Execution Configurations section.

    This solution provides the standalone, distributed, and nested parameter training modes. You can select one mode to train models.

    Parameter settings used for the standalone training modeimage

    Parameter

    Description

    Job Type

    Select DLC.

    Resource Group

    Select Public Resource Group.

    Framework

    Select PyTorch.

    Datasets

    Select the dataset that you created in Step 2.

    Node Image

    Select PAI Image. Then, select pytorch-training:1.12PAI-gpu-py38-cu113-ubuntu20.04 from the drop-down list.

    Instance Type

    Select CPU. Then, select 16vCPU+64GB Mem ecs.g6.4xlarge from the drop-down list.

    Nodes

    Set this parameter to 1.

    Node Startup Command

    Enter python3 /mnt/data/mnist.py --save_model=/mnt/data/examples/search/model/model_${exp_id}_${trial_id} --batch_size=${batch_size} --lr=${lr}.

    Hyperparameter

    • batch_size

      • Constraint Type: Select choice.

      • Search Space: Click image.png to add three enumerated values: 16, 32, and 64.

    • lr

      • Constraint Type: Select choice.

      • Search Space: Click image.png to add three enumerated values: 0.0001, 0.001, and 0.01.

    The experiment can generate nine hyperparameter combinations based on the preceding configurations and create a trial for each of the hyperparameter combinations. In each trial, the hyperparameter combination is used to run the script.

    Parameter settings used for the distributed training modeimage

    Parameter

    Description

    Job Type

    Select DLC.

    Resource Group

    Select Public Resource Group.

    Framework

    Select PyTorch.

    Datasets

    Select the dataset that you created in Step 2.

    Node Image

    Select PAI Image. Then, select pytorch-training:1.12PAI-gpu-py38-cu113-ubuntu20.04 from the drop-down list.

    Instance Type

    Select CPU. Then, select 16vCPU+64GB Mem ecs.g6.4xlarge from the drop-down list.

    Nodes

    Set this parameter to 3.

    Node Startup Command

    Enter python -m torch.distributed.launch --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT --nproc_per_node=1 --nnodes=$WORLD_SIZE --node_rank=$RANK /mnt/data/mnist.py --data_dir=/mnt/data/examples/search/data --save_model=/mnt/data/examples/search/pai/model/model_${exp_id}_${trial_id} --batch_size=${batch_size} --lr=${lr}.

    Hyperparameter

    • batch_size

      • Constraint Type: Select choice.

      • Search Space: Click image.png to add three enumerated values: 16, 32, and 64.

    • lr

      • Constraint Type: Select choice.

      • Search Space: Click image.png to add three enumerated values: 0.0001, 0.001, and 0.01.

    The experiment can generate nine hyperparameter combinations based on the preceding configurations and create a trial for each of the hyperparameter combinations. In each trial, the hyperparameter combination is used to run the script.

    Parameter settings used for the nested parameter training modeimage

    Parameter

    Description

    Job Type

    Select DLC.

    Resource Group

    Select Public Resource Group.

    Framework

    Select PyTorch.

    Datasets

    Select the dataset that you created in Step 2.

    Node Image

    Select PAI Image. Then, select pytorch-training:1.12PAI-gpu-py38-cu113-ubuntu20.04 from the drop-down list.

    Instance Type

    Select CPU. Then, select 16vCPU+64GB Mem ecs.g6.4xlarge from the drop-down list.

    Nodes

    Set this parameter to 1.

    Node Startup Command

    Enter python3 /mnt/data/mnist.py --save_model=/mnt/data/examples/search/pai/model/model_${exp_id}_${trial_id} --batch_size=${nested_params}.{batch_size} --lr=${nested_params}.{lr} --gamma=${gamma}.

    Hyperparameter

    • nested_params

      • Constraint Type: Select choice.

      • Search Space: Click image.png to add two enumerated values: {"_name":"large","{lr}":{"_type":"choice","_value":[0.02,0.2]},"{batch_size}":{"_type":"choice","_value":[256,128]}} and {"_name":"small","{lr}":{"_type":"choice","_value":[0.01,0.1]},"{batch_size}":{"_type":"choice","_value":[64,32]}}.

    • gamma

      • Constraint Type: Select choice.

      • Search Space: Click image.png to add three enumerated values: 0.8, 0.7, and 0.9.

    The experiment can generate nine hyperparameter combinations based on the preceding configurations and create a trial for each of the hyperparameter combinations. In each trial, the hyperparameter combination is used to run the script.

  2. Configure parameters in the Trial Configuration section.

    Parameter

    Description

    Metric Optimization

    Metric Type

    Select stdout. This setting indicates that the final metric value is extracted from stdout in the running process.

    Method

    Select best.

    Metric Weight

    Use the following settings:

    • key: validation: accuracy=([0-9\\.]+)

    • Value: 1

    Metric Source

    Configure cmd1 as the command keyword.

    Optimization

    Select Maximize.

    Model Storage Path

    Enter the OSS path where the model is saved. In this example, the path is oss://examplebucket/examples/model/model_${exp_id}_${trial_id}.

  3. Configure parameters in the Search Configurations section.

    Parameter

    Description

    Search Algorithm

    Select TPE. For more information about search algorithms, see the "Supported search algorithms" section in Limits and usage notes of AutoML.

    Maximum Trials

    Set this parameter to 3. This value indicates that up to three trials can run in the experiment.

    Maximum Concurrent Trials

    Set this parameter to 2. This value indicates that up to two trials can run in parallel in the experiment.

    Enable EarlyStop

    Specifies whether to enable the early stopping feature. This feature enables the system to stop the evaluation process of a trial early if the related hyperparameter combination is obviously underperforming.

    Start step

    Set this parameter to 5. This value indicates that the system can decide whether to early stop a trial after five evaluations on the trial are completed.

Step 3: View the experiment details and execution results

  1. In the experiment list, click the name of the desired experiment to go to the Experiment Details page.

    On the Experiment Details page, you can view the execution progress and status statistics of trials. The experiment automatically creates three trials based on the settings of the Search Algorithm and Maximum Trials parameters.

  2. Click the Trials tab to view all the trials that are generated by the experiment, and the execution status, final metric value, and hyperparameter combination of each trial.

    In this example, Optimization is set to Maximize. In the preceding figure, the hyperparameter combination (batch_size: 16 and lr: 0.01) that corresponds to the final metric value 96.52 is the optimal hyperparameter combination.

References