Create and manage datasets in PAI

You need to prepare datasets before data processing or model training. Platform for AI (PAI) Asset Management provides a dataset management feature that lets you create datasets and manage multiple versions of them. Version management makes it possible to reproduce experiments precisely, track data versions, record data lineage, and roll back to an earlier version if a problem occurs, which helps keep business operations uninterrupted.

Overview

The dataset management feature supports both basic and labeled datasets. Basic datasets typically contain a large volume of raw data and are mainly used to pre-train models so that they learn broad features and patterns. Labeled datasets carry explicit labels added through manual annotation and are mainly used to fine-tune and evaluate models for specific tasks.

Item	Basic dataset	Labeled dataset
Definition	Unlabeled raw data	Data manually annotated with labels
Data processing	Data cleansing, duplicate removal, and others	Data labeling, validation, and others
Application scenarios	Unsupervised learning; pre-training models for broad feature extraction	Supervised learning and model evaluation; model fine-tuning for task-specific performance

Go to the Datasets page

Log on to the PAI console.
In the upper-left corner, select a region.
In the left-side navigation bar, choose Workspaces, and click the name of the desired workspace.
In the left-side navigation bar, choose AI Asset Management > Datasets.

Create a basic dataset

On the Custom Datasets > Basic Datasets tab, click Create Dataset.

Storage Type supports Object Storage Service (OSS), file storage (General-purpose NAS, Extreme NAS, CPFS, and CPFS for Lingjun), and MaxCompute.

Configure the following key parameters:

Storage type is OSS

Parameter	Description
Type	The data type. Supported types include image, text, audio, video, table, and general. After you select a specific type, the system filters datasets for subsequent labeling scenarios.
Owner	The dataset owner. Only workspace administrators can set this parameter.
Import Format/OSS Path	When Import Format is set to File, the OSS path must point to a file, and the dataset is associated with that file. This setting is usually used to create iTAG datasets. When Import Format is set to Folder, the OSS path must point to a folder, which can be mounted in a container. This setting is usually used for datasets in services such as DSW, DLC, and EAS.
Default Mount Path	The default mount path for the data, which is usually used in DSW and DLC: In DSW, you can mount a created file system to this path when creating an instance. In DLC, the system searches for files in this path when executing code, such as `python /root/data/file.py`.
Enable Version Acceleration	Available when Import Format is set to Folder. Configure the following key parameters: Maximum Capacity: the slot capacity, which should be at least equal to the dataset size; adjust it based on the dataset that requires acceleration. Accelerated Mount Target: by default, an internal mount target is used; you can also use an existing mount target or create a new one. Note: When you use Lingjun resources and Accelerated Mount Target is set to Create Mount Target, the Mount Target Type must be VPC, and the selected VPC and vSwitch must match the Lingjun resources. Accelerated Version Default Mount Path: the default mount path for the dataset version.
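
The Import Format rule in the table above (File requires a file path, Folder requires a folder path) can be checked before you fill in the form. The following is a minimal sketch, not part of PAI: the function name `matches_import_format` is hypothetical, and it assumes that paths are written in the `oss://bucket/key` form and that a folder path ends with `/`, which is the usual OSS convention. The console may display or accept paths differently.

```python
from urllib.parse import urlparse


def matches_import_format(import_format: str, oss_path: str) -> bool:
    """Return True if oss_path looks right for the given Import Format.

    Assumption: a folder is any path that ends with "/" (or the bucket root);
    anything else is treated as a file.
    """
    parsed = urlparse(oss_path)
    if parsed.scheme != "oss" or not parsed.netloc:
        raise ValueError(f"not an OSS path: {oss_path!r}")

    is_folder = parsed.path in ("", "/") or parsed.path.endswith("/")

    if import_format == "File":
        return not is_folder
    if import_format == "Folder":
        return is_folder
    raise ValueError(f"unknown Import Format: {import_format!r}")
```

For example, `matches_import_format("Folder", "oss://my-bucket/train/")` passes the check, while the same path with Import Format set to File does not. The bucket and key names here are placeholders.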

Storage type is file storage

Parameter	Description
Type	The data type. Supported types include image, text, audio, video, table, and general. After you select a specific type, the system filters datasets for subsequent labeling scenarios.
Owner	The dataset owner. Only workspace administrators can set this parameter.
Select File System	Choose the file system that corresponds to the Storage Type.
Mount Target	Select a mount target under the file system.
File System Path	Select an existing path within the file system, such as `/`.
Default Mount Path	The default mount path for the data, which is usually used in DSW and DLC: In DSW, you can mount a created file system to this path when creating an instance. In DLC, the system searches for files in this path when executing code, such as `python /root/data/file.py`.
Enable Version Acceleration	Available when Storage Type is set to General-purpose NAS, Extreme NAS, or Cloud Parallel File Storage (CPFS). Configure the following key parameters: Maximum Capacity: the slot capacity, which should be at least equal to the dataset size; adjust it based on the dataset that requires acceleration. Accelerated Version Default Mount Path: the default mount path for the dataset version.
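
Because Maximum Capacity should be at least equal to the dataset size, it can help to measure the data under the mount path before you enable acceleration. The following is a minimal sketch that runs inside an instance or job where the dataset is already mounted. The path `/mnt/data`, the environment variable `DATASET_MOUNT_PATH`, and the assumption that capacity is expressed in GiB are all illustrative and do not come from PAI.

```python
import os


def dataset_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def capacity_is_sufficient(root: str, max_capacity_gib: float) -> bool:
    """Check that the planned capacity (assumed to be in GiB) covers the data."""
    return dataset_size_bytes(root) <= max_capacity_gib * 1024 ** 3


if __name__ == "__main__":
    # Hypothetical mount path; replace it with your Default Mount Path.
    mount_path = os.environ.get("DATASET_MOUNT_PATH", "/mnt/data")
    size_gib = dataset_size_bytes(mount_path) / 1024 ** 3
    print(f"{mount_path}: {size_gib:.2f} GiB")
```

If the measured size is close to the planned capacity, leave some headroom for growth in later dataset versions.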

Storage type is MaxCompute

Parameter	Description
Type	Supports only Table.
Owner	The dataset owner. Only workspace administrators can set this parameter.
Default Mount Path	The default mount path for the data, which is usually used in DSW and DLC: In DSW, you can mount a created file system to this path when creating an instance. In DLC, the system searches for files in this path when executing code, such as `python /root/data/file.py`.
Enable Version Acceleration	Enables dataset version acceleration. Configure the following key parameters: Initial configurations: configure the initialization code and click Test. Accelerated Mount Target: by default, an internal mount target is used; you can also use an existing mount target or create a new one. Note: When you use Lingjun resources and Accelerated Mount Target is set to Create Mount Target, the Mount Target Type must be VPC, and the selected VPC and vSwitch must match the Lingjun resources. Accelerated Version Default Mount Path: the default mount path for the dataset version.

Create a basic dataset version

On the Custom Dataset > Basic Dataset tab, click Create Version in the Actions column for the desired dataset.

Note the following key parameters:

Name, Storage Type, and Type are the same as in version V1 and cannot be changed.
The dataset version number is generated automatically by the system and cannot be changed.
For other key parameters, see Create a basic dataset.
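
The versioning rules above can be summarized in a small data model. This is an illustrative sketch only, not the PAI SDK or API: the class names, the `source_path` field, and the sequential `v1`, `v2` labels are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class DatasetVersion:
    """One immutable version; name, storage type, and type come from the dataset."""
    name: str
    storage_type: str
    data_type: str
    version: str
    source_path: str


@dataclass
class Dataset:
    name: str
    storage_type: str
    data_type: str
    versions: List[DatasetVersion] = field(default_factory=list)

    def create_version(self, source_path: str) -> DatasetVersion:
        # The caller supplies only the data location. The version label is
        # generated here (assumed sequential) and the inherited fields are
        # copied from the dataset, so they cannot differ from version 1.
        label = f"v{len(self.versions) + 1}"
        version = DatasetVersion(
            self.name, self.storage_type, self.data_type, label, source_path
        )
        self.versions.append(version)
        return version
```

Making `DatasetVersion` frozen mirrors the rule that these fields cannot be changed after a version exists: assigning to any field raises `dataclasses.FrozenInstanceError`.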

View public datasets

The system provides a variety of public datasets, such as MMLU, CMMLU, and GSM8K. On the Public Dataset tab, click a dataset name to view the basic information about that dataset.

Manage datasets

For basic datasets, you can view versions, create a version, make the dataset public, and delete the dataset. For labeled datasets, you can view data, make the dataset public, and delete the dataset.

Take note:

For datasets with Visibility set to Visible Only to the Dataset Owner, click Set Dataset to Public to share the dataset within the workspace, allowing all workspace members to view it. Note that once a dataset is made public, it cannot be reverted to its previous state. Proceed with caution.
If a RAM user encounters permission errors when viewing dataset data, grant the RAM user the required permissions.
Deleting a dataset may disrupt existing tasks. After you delete a dataset, it cannot be restored. Proceed with caution.