Before creating a Deep Learning Containers (DLC) training job, prepare the following components: computing resources, a container image, a dataset, and a code build. PAI supports datasets stored in File Storage NAS (NAS), Cloud Parallel File Storage (CPFS), or Object Storage Service (OSS), and code builds stored in Git repositories.
Prerequisites
If you use OSS for storage, grant DLC the required permissions to access OSS. Otherwise, I/O errors may occur when accessing OSS data. For details, see Cloud product dependencies and authorization: DLC.
Limitations
OSS is a distributed object storage service that does not support certain file system operations. For example, you cannot append data to or overwrite existing objects in an OSS bucket after mounting.
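The following minimal sketch illustrates what this limitation means for code running against a mounted bucket. The mount path /mnt/oss is a hypothetical assumption, and the exact exception raised depends on the mounting component.

```python
import os

MOUNT_DIR = "/mnt/oss"  # hypothetical path where the OSS bucket is mounted

# Appending to an existing object is not supported on an OSS mount;
# opening a file in append mode typically fails with an OSError.
try:
    with open(os.path.join(MOUNT_DIR, "train.log"), "a") as f:
        f.write("epoch 1 done\n")
except OSError as e:
    print(f"Append rejected by the OSS mount: {e}")

# Supported pattern: write each output as a complete new object instead.
with open(os.path.join(MOUNT_DIR, "train-epoch-1.log"), "w") as f:
    f.write("epoch 1 done\n")
```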
Step 1: Prepare computing resources
Select computing resources for AI training based on your requirements:
- Public resources: After you grant the required permissions, the system automatically creates a public resource group. Select public resources on the Create Job page in your workspace.
- General computing resources: Create a dedicated resource group, purchase general computing resources, and allocate them through resource quotas associated with your workspace. For details, see General computing resource quotas.
- Lingjun resources: For high-performance AI training, prepare Lingjun resources, create resource quotas, and associate them with your workspace. For details, see Create resource quotas.
Step 2: Prepare a container image
Select a container image for the training environment:
- Official PAI images: PAI provides official images based on different frameworks, optimized for Alibaba Cloud services. To view available images, choose AI Asset Management > Images in the left-side navigation pane. On the Images page, click Alibaba Cloud Images and select DLC from the Modules drop-down list.

- Custom images: For training jobs that require custom environments or dependencies, add a custom image to PAI. Choose AI Asset Management > Images, click the Custom Image tab, and then click Register Image. For details, see Custom images.

  Important: To use Lingjun resources with custom images, see RDMA: High-performance networks for distributed training.
- Image address: Specify the image address of a custom or official image when you submit a training job, as shown in the sketch after this list. You can view image addresses on the AI Asset Management > Images page.
Step 3: Prepare a dataset
Upload your training data to OSS, NAS, or CPFS, and then create a dataset. Alternatively, you can mount data directly from OSS or public datasets.
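For example, you can stage local training data in OSS with the oss2 Python SDK before creating the dataset. The bucket name, endpoint, and object paths below are placeholder assumptions.

```python
import oss2

# Placeholder credentials, endpoint, and bucket name.
auth = oss2.Auth("<your-access-key-id>", "<your-access-key-secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "my-training-data")

# Upload a local file to the bucket path the dataset will point to.
bucket.put_object_from_file("datasets/mnist/train.csv", "./train.csv")
```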
Supported dataset types
The supported dataset types are OSS, General-purpose NAS, Extreme NAS, CPFS, and AI Computing CPFS. You can enable the dataset acceleration feature for all types except AI Computing CPFS to improve data read efficiency for distributed training jobs.
Create a dataset
For details, see Create and manage datasets. Note the following points:
- OSS limitations: OSS does not support certain file system operations. For example, you cannot append data to or overwrite existing files after mounting. For details, see Limitations.
- CPFS VPC requirement: If you create a CPFS dataset, configure the training job to use the same VPC as the CPFS file system. Otherwise, the job cannot run and may remain in the Preparing environment state.
Enable dataset acceleration
Enable dataset acceleration to improve data read efficiency for training jobs. For details, see Use Dataset Accelerator.
Step 4: Prepare a code build
Create a code build to store your training code. In the left-side navigation pane, choose AI Asset Management > Code Builds, and then click Create Code Build. For details, see Code configuration.
What to do next
After you complete the preparations, create a training job. For details, see Submit a training job.