Submit PyTorch training jobs that use DSW and NAS in DLC

Prerequisites

A general-purpose NAS file system is created in a region. For more information, see Create a General-purpose NAS file system in the NAS console.

Limits

The operations described in this topic apply only to clusters that use general computing resources and are deployed in the public resource group.

Step 1: Create datasets

Go to the Datasets page.
1. Log on to the PAI console.
2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
3. In the left-side navigation pane, choose AI Asset Management > Datasets.
On the Datasets page, click Create Dataset. On the Create Dataset page, set the Storage Type parameter to General-purpose NAS. For more information about how to create a basic dataset, see Create a basic dataset.

Step 2: Create a DSW instance

To create a DSW instance, configure the parameters described in the following table. For information about other parameters, see Create a DSW instance.

Section	Parameter	Description
Environment Information	Dataset Mounting	Click Custom Dataset, select the NAS dataset that you created in Step 1, and then set the Mount Path parameter to `/mnt/data/`.
Environment Information	Working Directory	Select `Dataset name-/mnt/data/`.
Network Information	VPC	You do not need to configure this parameter.
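After the DSW instance starts, it can be useful to confirm that the NAS dataset is actually mounted and writable before any data is downloaded. The following is a minimal sketch, not part of the original procedure. It assumes the mount path `/mnt/data/` from the table above, and the helper name `check_mount_path` is hypothetical.

```python
import os
import tempfile


def check_mount_path(path="/mnt/data/"):
    """Report whether `path` exists as a directory and accepts writes.

    `path` defaults to the Mount Path value used in this topic.
    The function name is illustrative and not part of any PAI API.
    """
    status = {"exists": os.path.isdir(path), "writable": False}
    if status["exists"]:
        try:
            # Create and immediately discard a temporary file as a write probe.
            with tempfile.TemporaryFile(dir=path):
                status["writable"] = True
        except OSError:
            # Read-only or otherwise unusable mount: leave "writable" False.
            pass
    return status
```

You can call `check_mount_path()` from a notebook cell or a Python session on the Terminal tab. A result in which either value is `False` suggests that the dataset mounting configuration should be reviewed.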

Step 3: Prepare data

The data used in this topic is publicly available. Perform the following steps to download and decompress the data in the DSW instance.

Go to the development environment of a DSW instance.
1. Log on to the PAI console.
2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.
3. In the upper-left corner of the page, select the region where you want to use PAI.
4. In the left-side navigation pane, choose Model Training > Data Science Workshop (DSW).
5. Optional: On the Data Science Workshop (DSW) page, enter the name of a DSW instance or a keyword in the search box to search for the DSW instance.
6. Click Open in the Actions column of the instance.
In the DSW development environment, click the Notebook tab in the top navigation bar.
Download data.
1. Click the new folder icon in the upper-left toolbar to create a folder. In this example, pytorch_transfer_learning is used as the folder name.
2. In the DSW development environment, click the Terminal tab in the top navigation bar.
3. On the Terminal tab, run the following commands. Use the cd command to go to the folder that you created and the wget command to download the dataset.
```
cd /mnt/workspace/pytorch_transfer_learning/
wget https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/data.tar.gz
```
  https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/data.tar.gz is the URL for downloading the dataset file.
4. Run the tar -xf ./data.tar.gz command to decompress the dataset.
5. Click the Notebook tab. Go to the pytorch_transfer_learning directory, right-click the extracted hymenoptera_data folder, and then click Rename to rename the folder to input.
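The decompress and rename steps above can also be scripted, which helps when you repeat the preparation. The following is a minimal sketch under stated assumptions: the archive unpacks into a top-level folder named hymenoptera_data, as described in the rename step, and the helper name `extract_and_rename` is hypothetical.

```python
import tarfile
from pathlib import Path


def extract_and_rename(archive, extracted_name="hymenoptera_data", target_name="input"):
    """Unpack `archive` into its own folder, then rename the extracted folder.

    Mirrors `tar -xf ./data.tar.gz` followed by the manual rename to `input`.
    """
    archive = Path(archive)
    workdir = archive.parent
    source = workdir / extracted_name
    target = workdir / target_name

    # Refuse to overwrite an existing target folder.
    if target.exists():
        raise FileExistsError(f"{target} already exists; remove or rename it first")

    with tarfile.open(archive) as tar:
        tar.extractall(workdir)

    if not source.is_dir():
        raise FileNotFoundError(f"expected folder {source} after extraction")

    source.rename(target)
    return target
```

For example, `extract_and_rename("/mnt/workspace/pytorch_transfer_learning/data.tar.gz")` uses the path from the commands above and returns the path of the renamed input folder.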

Step 4: Prepare the training code and the model storage folder

On the Terminal tab of the DSW instance, run the wget command to download the training code to the pytorch_transfer_learning folder.
```
cd /mnt/workspace/pytorch_transfer_learning/
wget https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/main.py
```
https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/main.py is the URL for downloading the training code.
In the pytorch_transfer_learning folder, create a folder named output to store the trained model.
```
mkdir output
```
View the contents of the pytorch_transfer_learning folder.
The folder contains the following items:
- input: the folder that stores the training data.
- main.py: the training code file.
- output: the folder that stores the trained model.
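Before you submit the job, you may want to verify this layout programmatically, because a missing folder only surfaces later as a failed job. The following is a minimal sketch. The helper name `missing_entries` is hypothetical, and the expected entries are taken from the list above.

```python
from pathlib import Path

# Entries listed in this topic: two folders and the training script.
EXPECTED_ENTRIES = {"input": "dir", "main.py": "file", "output": "dir"}


def missing_entries(project_dir):
    """Return the names of expected entries that are absent or of the wrong type."""
    project = Path(project_dir)
    problems = []
    for name, kind in EXPECTED_ENTRIES.items():
        path = project / name
        present = path.is_dir() if kind == "dir" else path.is_file()
        if not present:
            problems.append(name)
    return problems
```

An empty list means the folder matches the layout described above. Any returned names indicate which preparation step to repeat.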

Step 5: Create a job

Go to the Create Job page.
1. Log on to the PAI console. In the top navigation bar, select a desired region. On the page on the right, select a desired workspace and click Enter Deep Learning Containers (DLC).
2. On the Deep Learning Containers (DLC) page, click Create Job.

On the Create Job page, configure the parameters. The following table describes the parameters.

Section	Parameter	Description
Basic Information	Job Name	Specify the name of the job.
Environment Information	Node Image	Select Alibaba Cloud Image and then select a PyTorch image from the drop-down list. In this example, the `pytorch-training:1.12-gpu-py39-cu113-ubuntu20.04` image is used.
Environment Information	Datasets	Click Custom Dataset and select the NAS dataset that you created in Step 1.
Environment Information	Startup Command	Set this parameter to `python /mnt/data/pytorch_transfer_learning/main.py -i /mnt/data/pytorch_transfer_learning/input -o /mnt/data/pytorch_transfer_learning/output`.
Environment Information	Third-party Libraries	Click Select from List and then enter the following content in the text box: `numpy==1.16.4 absl-py==0.11.0`
Environment Information	Code Builds	You do not need to configure this parameter.
Resource Information	Source	Select Public Resources.
Resource Information	Framework	Select PyTorch.
Resource Information	Job Resource	Select an instance type and specify the number of instances. Example: select ecs.g6.xlarge on the CPU tab of the Resource Type page. Set the Nodes parameter to 1.
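The Startup Command value is built from the dataset mount path and the folder prepared in the earlier steps, and it passes the data and model folders to the script through the `-i` and `-o` options. The following sketch shows how such a command can be assembled and how a script might read those two options. The contents of main.py are not shown in this topic, so `parse_training_args` is only a hypothetical stand-in for its command-line handling, and both function names are illustrative.

```python
import argparse
import shlex
from pathlib import PurePosixPath


def build_startup_command(mount_path="/mnt/data/", project="pytorch_transfer_learning"):
    """Assemble the startup command from the mount path and the project folder."""
    root = PurePosixPath(mount_path) / project
    parts = [
        "python", str(root / "main.py"),
        "-i", str(root / "input"),    # folder that stores the training data
        "-o", str(root / "output"),   # folder that stores the trained model
    ]
    return shlex.join(parts)


def parse_training_args(argv):
    """Hypothetical stand-in for how a training script could read -i and -o."""
    parser = argparse.ArgumentParser(description="Illustrative CLI, not the real main.py")
    parser.add_argument("-i", dest="input_dir", required=True, help="training data folder")
    parser.add_argument("-o", dest="output_dir", required=True, help="model output folder")
    return parser.parse_args(argv)
```

With the default arguments, `build_startup_command()` produces the same string as the Startup Command value in the table. If you mount the dataset at a different path in the job, pass that path as `mount_path` so that the command and the mount stay consistent.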

Click OK.

Step 6: View the job details and logs

On the Deep Learning Containers (DLC) page, click the name of the job that you want to view.
On the Overview tab of the job details page, view the Basic Information and Resource Information of the job.
In the Instance section on the Overview tab of the job details page, find the instance that you want to manage and click Log in the Actions column to view the logs.
The following figure shows an example of the logs.