This topic describes what you need to prepare before you create a training job, including computing resources, an image, a dataset, and a code build. Platform for AI (PAI) allows you to specify datasets stored in File Storage NAS (NAS) file systems, Cloud Parallel File Storage (CPFS) file systems, or Object Storage Service (OSS) buckets, and code builds stored in Git repositories.
Prerequisites
If you use OSS for storage, make sure that Deep Learning Container (DLC) is granted the required permissions to access OSS. Otherwise, I/O errors may occur when the system accesses the data stored in your OSS bucket. For information about how to grant the permissions, see Grant the permissions that are required to use DLC.
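Permissions of this kind are typically granted through a RAM policy. The following is an illustrative sketch only, not the exact policy used by DLC; the bucket name `examplebucket` is a placeholder, and the precise actions your jobs need depend on whether they read, write, or list data:

```json
{
  "Version": "1",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "oss:GetObject",
        "oss:PutObject",
        "oss:ListObjects"
      ],
      "Resource": [
        "acs:oss:*:*:examplebucket",
        "acs:oss:*:*:examplebucket/*"
      ]
    }
  ]
}
```

See Grant the permissions that are required to use DLC for the authoritative policy.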
Limits
OSS is a distributed object storage service. When you use OSS to store data, some file system features are not supported. For example, you cannot append data to or overwrite the existing objects in an OSS bucket after you mount the bucket.
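Because appends and in-place overwrites are not supported on a mounted OSS bucket, a common workaround is to accumulate output on local disk and copy the complete file to the mount in a single operation. The sketch below illustrates this pattern; the OSS mount path is an assumption of your job configuration, so the demo substitutes temporary directories and runs anywhere:

```python
import os
import shutil
import tempfile

def flush_to_mount(lines, mount_dir, name):
    """Write `lines` to a local temp file, then copy the whole file
    to the mounted directory in one operation (no append on the mount)."""
    local_dir = tempfile.mkdtemp()
    local_path = os.path.join(local_dir, name)
    with open(local_path, "w") as f:      # append freely on local disk
        f.write("\n".join(lines) + "\n")
    dest = os.path.join(mount_dir, name)
    shutil.copy(local_path, dest)         # one whole-file write to the mount
    return dest

# Demo: a temp dir stands in for the OSS mount point.
mount = tempfile.mkdtemp()
path = flush_to_mount(["epoch 1: loss=0.9", "epoch 2: loss=0.5"],
                      mount, "train.log")
print(open(path).read())
```

Each flush replaces the object as a whole, which is compatible with OSS semantics.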
Step 1: Prepare resources
Prepare computing resources for AI training before you submit a training job. Select one of the following resources:
Public resources
After you grant the permissions, the system automatically creates a public resource group for general computing resources. You can select public resources when you create a training job on the Create Job page in your workspace.
General computing resources
You can create a dedicated resource group, purchase the required general computing resources, and allocate the computing resources in the dedicated resource group. To allocate computing resources, you need to create resource quotas and associate them with the workspace in which you want to run training jobs. For more information, see General computing resource quotas.
Lingjun resources
To achieve high-performance AI training powered by Lingjun resources, you must prepare the required Lingjun resources, create resource quotas, and associate the quotas with your workspace. For more information, see Lingjun resource quotas.
Step 2: Prepare an image
Prepare an image for the training environment before you submit a training job. Select one of the following image types:
Official PAI images: PAI provides official images for different frameworks that are optimized for Alibaba Cloud services. These images are suitable for training jobs that use Alibaba Cloud services and provide improved compatibility and performance. To view the images that support DLC training jobs, choose AI Asset Management > Images in the left-side navigation pane. On the Image page, click the Official PAI Images tab and then select DLC from the Modules drop-down list.
Custom images: If your training jobs require special environments or dependencies, you can use a custom image that you added to PAI. To add a custom image, choose AI Asset Management > Images in the left-side navigation pane. On the Image page, click the Custom Image tab, and then click Add Image. After you add custom images to PAI, you can directly select them when you run training jobs.
Important: For information about how to use Lingjun resources and custom images for your training jobs, see RDMA: high-performance networks for distributed training.
Image Address: When you submit a training job, you can specify the image address of a custom image or an official image. To view the image addresses, choose AI Asset Management > Images.
Step 3: Prepare a dataset
Upload the data required by the training job to an OSS bucket, a NAS file system, or a CPFS file system, and then create a custom dataset. You can also mount OSS data or a public dataset. The following section describes how to prepare a custom dataset.
Supported dataset types
The following dataset types are supported: OSS, General-purpose NAS, Extreme NAS, CPFS, and CPFS for Lingjun. You can enable dataset acceleration for all types of datasets except CPFS for Lingjun. This feature accelerates data reads for DLC training jobs.
Create a dataset
For information about how to create a dataset, see Create and manage datasets. When you create a dataset, take note of the following items:
Select From Alibaba Cloud and set Property to Folder.
Different from NAS, OSS is a distributed object storage service and does not support some file system features. For example, you cannot append data to or overwrite existing objects in an OSS bucket after you mount the bucket.
If you create a CPFS dataset, you must configure a virtual private cloud (VPC). The VPC must be the same as the VPC that you configured for the CPFS file system. Otherwise, the submitted DLC training jobs may stay in the Preparing Environment state.
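Before you create an OSS dataset, the training data must already be in the bucket, for example uploaded with the ossutil command-line tool or the OSS SDK. The helper below is a hypothetical sketch that maps a local data directory to the OSS object keys an upload would create; note that OSS object keys always use forward slashes, regardless of the local operating system. The actual upload step is left to your tooling:

```python
import os
import posixpath

def oss_keys(local_dir, prefix):
    """Map every file under `local_dir` to an OSS object key under
    `prefix`. OSS keys use '/' separators on every platform."""
    mapping = {}
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            local_path = os.path.join(root, name)
            rel = os.path.relpath(local_path, local_dir)
            # Rebuild the relative path with POSIX separators.
            key = posixpath.join(prefix, *rel.split(os.sep))
            mapping[local_path] = key
    return mapping

# Example mapping: datasets/train/a.txt -> training-data/train/a.txt
```

Keeping a stable prefix such as `training-data/` makes it easy to point the dataset's OSS path at exactly the files the job needs.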
Enable dataset acceleration
You can enable dataset acceleration for a dataset to accelerate data reads in training jobs. For more information, see Use Dataset Accelerator.
Step 4: Prepare a code build
Create a code build and add the code required by the training job to the code build. To create a code build, go to the Code Configuration page in the left-side navigation pane and click Create Code Build. After you create a code build, you can directly select it when you run training jobs.
References
After you prepare all the resources, you are ready to create a training job. For more information, see Submit a training job.