Platform For AI: Submit a standalone training job that uses PyTorch

Last Updated: Nov 01, 2024

This topic describes how to use Deep Learning Containers (DLC) to run an offline transfer learning training job based on PyTorch.

Step 1: Prepare data

In this topic, the training data is pre-stored in public storage and is downloaded by the startup command that you enter in Step 3. You do not need to prepare additional data.
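For orientation, the startup command in Step 3 extracts the archive and renames the extracted hymenoptera_data directory to ./input. The following minimal sketch counts the files that were extracted. It assumes a split/class/file layout (for example train/ants or val/bees), as in the PyTorch transfer learning tutorial dataset of the same name. This topic does not confirm that layout, and the function name summarize_image_folder is hypothetical.

```python
from pathlib import Path


def summarize_image_folder(root):
    """Count files per class for each split directory under root.

    Assumes a <root>/<split>/<class>/<file> layout. This is an assumption,
    not something this topic states about data.tar.gz.
    """
    summary = {}
    for split_dir in sorted(Path(root).iterdir()):
        if not split_dir.is_dir():
            continue
        summary[split_dir.name] = {
            class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
            for class_dir in sorted(split_dir.iterdir())
            if class_dir.is_dir()
        }
    return summary


if __name__ == "__main__":
    # ./input is the directory name used by the startup command in Step 3.
    if Path("./input").is_dir():
        for split, classes in summarize_image_folder("./input").items():
            print(split, classes)
```

If the assumption holds, running this in the job's working directory after extraction prints one line per split with the per-class file counts.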

Step 2: Prepare the training code and model storage file

In this topic, the training code is pre-stored in public storage and is downloaded by the startup command that you enter in Step 3. You do not need to develop additional code.
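The contents of main.py are not shown in this topic. As a rough illustration of what a transfer learning script that accepts the -i and -o options from the Step 3 startup command might contain, here is a minimal sketch. Everything in it is an assumption rather than the actual script: the ResNet-18 backbone, the frozen-backbone approach, the long option names --input and --output, the output file name model.pth, and all function names. It also assumes that torch and torchvision 0.13 or later are installed, as they would typically be in a PyTorch image.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse -i and -o, the two options passed to main.py in the startup command."""
    parser = argparse.ArgumentParser(description="Transfer learning sketch")
    parser.add_argument("-i", "--input", required=True, help="dataset directory")
    parser.add_argument("-o", "--output", required=True, help="model output directory")
    return parser.parse_args(argv)


def build_model(num_classes):
    """Reuse a pretrained ResNet-18 and train only a new final layer."""
    # Imported lazily so that argument parsing works without PyTorch installed.
    import torch.nn as nn
    from torchvision import models

    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    for param in model.parameters():
        param.requires_grad = False  # freeze the pretrained backbone
    # The replacement layer is created after freezing, so it stays trainable.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model


def save_model(model, output_dir):
    """Write the model weights to <output_dir>/model.pth (hypothetical file name)."""
    import torch

    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, "model.pth")
    torch.save(model.state_dict(), path)
    return path


def main(argv=None):
    args = parse_args(argv)
    model = build_model(num_classes=2)  # assumed: two image classes
    # A real script would run a training loop over the data in args.input here.
    return save_model(model, args.output)
```

Calling main(["-i", "./input", "-o", "./output"]) mirrors the invocation in the startup command. Freezing the backbone and replacing only the final layer is one common form of transfer learning; fine-tuning all layers is another, and this topic does not say which one main.py uses.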

Step 3: Create a job

  1. Go to the Create Job page.

    1. Log on to the PAI console.

    2. In the left-side navigation pane, click Workspaces. Find the workspace that you want to manage and click the workspace ID.

    3. In the left-side navigation pane of the Workspace page, choose Model Development and Training > Deep Learning Containers (DLC). On the Distributed Training Jobs page, click Create Job. The Create Job page appears.

  2. On the Create Job page, configure the parameters listed in the following table, and use default values for the remaining parameters.


    | Section | Parameter | Description |
    | --- | --- | --- |
    | Basic Information | Job Name | Enter a name for the job. Example: torch-sample. |
    | Environment Information | Node Image | Click Alibaba Cloud Image and select a PyTorch image. |
    | Environment Information | Startup Command | Enter the following command, which downloads the data, downloads the training script, runs the training job, and lists the output directory so that you can check the generated model: wget https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/data.tar.gz && tar -xf ./data.tar.gz && mv ./hymenoptera_data/ ./input && mkdir output && wget https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv/main.py && python main.py -i ./input -o ./output && ls ./output |
    | Resource Information | Source | Select Public Resources. |
    | Resource Information | Framework | Select PyTorch. |
    | Resource Information | Job Resource > Number of Nodes | Set the value to 1. |
    | Resource Information | Job Resource > Instance Type | Select an instance type. For example, ecs.gn6e-c12g1.3xlarge. If the instance type is not available in your current region, you can create the job in another region. For information about regions that support the pay-as-you-go billing method, see Deep Learning Containers (DLC). |

  3. Click OK.

    The Deep Learning Containers (DLC) page appears.
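The startup command chains its steps with &&, so each step runs only if the previous one succeeded. To make the individual steps easier to read, the following sketch rebuilds the same command string from a commented list. The names BASE_URL, STEPS, and build_startup_command are hypothetical; the commands and URLs are copied from the Startup Command parameter.

```python
BASE_URL = "https://pai-public-data.oss-cn-beijing.aliyuncs.com/hol-pytorch-transfer-cv"

# One entry per step of the startup command, in execution order.
STEPS = [
    f"wget {BASE_URL}/data.tar.gz",            # download the dataset archive
    "tar -xf ./data.tar.gz",                   # extract the archive
    "mv ./hymenoptera_data/ ./input",          # rename the extracted directory
    "mkdir output",                            # create the output directory
    f"wget {BASE_URL}/main.py",                # download the training script
    "python main.py -i ./input -o ./output",   # run the training job
    "ls ./output",                             # list what the job wrote
]


def build_startup_command(steps=STEPS):
    """Join the steps with && so that a failing step stops the chain."""
    return " && ".join(steps)


if __name__ == "__main__":
    print(build_startup_command())
```

Because of the && chaining, a failed download or extraction prevents the training step from starting, and the failing step is then the last one that appears in the job logs.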

Step 4: View the details and logs of the training job

  1. On the Deep Learning Containers (DLC) page, click the job name.

  2. On the job details page, view Basic Information and Resource Information of the job.

  3. In the Instance section at the bottom of the job details page, find the desired node and click Log in the Actions column to view the logs of the node.
