Platform for AI: Create and manage datasets

Last Updated: Nov 20, 2024

You need to prepare datasets before data processing or model training. Platform for AI (PAI) Asset Management provides a dataset management feature that allows you to create and manage datasets in multiple versions. With dataset version management, you can reproduce experiments precisely, track data versions, record data lineage, and roll back to a previous version if a problem occurs, which helps keep business operations running.

Overview

The dataset management feature supports basic datasets and labeled datasets. Basic datasets typically consist of a large volume of raw data and are mainly used to pre-train models so that they learn broad features and patterns. Labeled datasets carry explicit labels added through manual annotation and are mainly used for model fine-tuning and evaluation to improve performance on specific tasks.

| Item | Basic dataset | Labeled dataset |
| --- | --- | --- |
| Definition | Unlabeled raw data | Data manually annotated with labels |
| Data processing | Data cleansing, duplicate removal, and others | Data labeling, validation, and others |
| Application scenarios | Unsupervised learning; pre-trained models for broad feature extraction | Supervised learning and model evaluation; model fine-tuning for task-specific performance |

Go to the Datasets page

  1. Log on to the PAI console.

  2. In the upper-left corner, select a region.

  3. In the left-side navigation bar, choose Workspaces, and click the name of the desired workspace.

  4. In the left-side navigation bar, choose AI Asset Management > Datasets.

Create a basic dataset

On the Custom Datasets > Basic Datasets tab, click Create Dataset.

The Storage Type parameter supports Object Storage Service (OSS), file storage (General-purpose NAS, Extreme NAS, CPFS, and CPFS for Lingjun), and MaxCompute.

Configure the following key parameters:

Storage type is OSS

Parameter

Description

Type

The data type. Supported types include image, text, audio, video, table, and general. After you select a specific type, the system filters datasets for subsequent labeling scenarios.

Owner

The dataset owner. Only workspace administrators can set this parameter.

Import Format/OSS Path

  • When Import Format is set to File, the OSS path must point to a file. The dataset is associated with that file. This format is usually used to create iTAG datasets.

  • When Import Format is set to Folder, the OSS path must point to a folder. The folder can be mounted in a container. This format is usually used for datasets consumed by services such as DSW, DLC, and EAS.

Default Mount Path

The default mount path for the data, which is usually used in DSW and DLC:

  • In DSW, you can mount a created file system to this path when creating an instance.

  • In DLC, the system searches for files in this path when executing code, such as python /root/data/file.py.

Enable Version Acceleration

When Import Format is set to Folder, the option to enable dataset version acceleration becomes available. Configure the following key parameters:

  • Maximum Capacity: The slot capacity, which should be at least equal to the dataset size. Adjust based on the dataset requiring acceleration.

  • Accelerated Mount Target: By default, an internal mount target is used. You may use an existing mount target or create a new one.

    Note

    When using Lingjun resources, if Accelerated Mount Target is set to Create Mount Target, the Mount Target Type must be VPC, and the selected VPC and vSwitch must match the Lingjun resources.

  • Accelerated Version Default Mount Path: The default mount path for the dataset version.
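Once a folder dataset is mounted, code running in DSW or DLC reads it as ordinary files under the Default Mount Path. The following is a minimal sketch of that idea, not a PAI API. It assumes the mount path /root/data (borrowed from the example command in the parameter description), and list_dataset_files is a hypothetical helper name.

```python
from pathlib import Path
from typing import List

# Assumption: the dataset is mounted at /root/data. Replace this with the
# Default Mount Path that you configured for the dataset.
DEFAULT_MOUNT_PATH = Path("/root/data")


def list_dataset_files(mount_path: Path = DEFAULT_MOUNT_PATH) -> List[Path]:
    """Return every regular file under the mount path, sorted for stable output."""
    if not mount_path.is_dir():
        # A missing directory usually means the dataset was not mounted.
        raise FileNotFoundError(f"dataset is not mounted at {mount_path}")
    return sorted(p for p in mount_path.rglob("*") if p.is_file())
```

In a job script, you might call list_dataset_files() before training starts so that a missing mount fails early with a clear message.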

Storage type is file storage

Parameter

Description

Type

The data type. Supported types include image, text, audio, video, table, and general. After you select a specific type, the system filters datasets for subsequent labeling scenarios.

Owner

The dataset owner. Only workspace administrators can set this parameter.

Select File System

Choose the file system that corresponds to the Storage Type.

Mount Target

Select a mount target under the file system.

File System Path

Select an existing path within the file system, such as /.

Default Mount Path

The default mount path for the data, which is usually used in DSW and DLC:

  • In DSW, you can mount a created file system to this path when creating an instance.

  • In DLC, the system searches for files in this path when executing code, such as python /root/data/file.py.

Enable Version Acceleration

When Storage Type is set to General-purpose NAS, Extreme NAS, or Cloud Parallel File Storage (CPFS), the option to enable dataset version acceleration becomes available. Configure the following key parameters:

  • Maximum Capacity: The slot capacity, which should be at least equal to the dataset size. Adjust based on the dataset requiring acceleration.

  • Accelerated Version Default Mount Path: The default mount path for the dataset version.
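Maximum Capacity should be at least the dataset size, so it helps to measure the dataset before you fill in the value. The following sketch sums file sizes under a locally accessible copy of the data. The function names are hypothetical, and the result is in bytes. Convert it to whatever unit the console expects.

```python
import os


def dataset_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root (symbolic links are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def capacity_is_sufficient(max_capacity_bytes: int, root: str) -> bool:
    """Return True when the planned capacity is at least the dataset size."""
    return max_capacity_bytes >= dataset_size_bytes(root)
```

Leaving some headroom above the measured size is a reasonable choice if the dataset is expected to grow between versions.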

Storage type is MaxCompute

Parameter

Description

Type

Supports only Table.

Owner

The dataset owner. Only workspace administrators can set this parameter.

Default Mount Path

The default mount path for the data, which is usually used in DSW and DLC:

  • In DSW, you can mount a created file system to this path when creating an instance.

  • In DLC, the system searches for files in this path when executing code, such as python /root/data/file.py.

Enable Version Acceleration

Enables dataset version acceleration. Configure the following key parameters:

  • Initial configurations: Configure initialization code and click Test.

  • Accelerated Mount Target: By default, an internal mount target is used. You may use an existing mount target or create a new one.

    Note

    When using Lingjun resources, if Accelerated Mount Target is set to Create Mount Target, the Mount Target Type must be VPC, and the selected VPC and vSwitch must match the Lingjun resources.

  • Accelerated Version Default Mount Path: The default mount path for the dataset version.
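The Lingjun note in this section can be written as a small pre-check that you run before you create a mount target. This is an illustrative sketch only and does not call any PAI API. NetworkConfig, check_new_mount_target, and the resource IDs in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NetworkConfig:
    """Hypothetical container for the network settings named in the note."""
    vpc_id: str
    vswitch_id: str


def check_new_mount_target(mount_target_type: str,
                           mount_target: NetworkConfig,
                           lingjun: NetworkConfig) -> List[str]:
    """Return the violated rules; an empty list means the settings are consistent."""
    problems = []
    if mount_target_type != "VPC":
        problems.append("Mount Target Type must be VPC")
    if mount_target.vpc_id != lingjun.vpc_id:
        problems.append("VPC does not match the Lingjun resources")
    if mount_target.vswitch_id != lingjun.vswitch_id:
        problems.append("vSwitch does not match the Lingjun resources")
    return problems
```

For example, check_new_mount_target("VPC", NetworkConfig("vpc-a", "vsw-1"), NetworkConfig("vpc-a", "vsw-2")) reports only the vSwitch mismatch.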

Create a basic dataset version

On the Custom Dataset > Basic Dataset tab, click Create Version in the Actions column for the desired dataset.

Note the following key parameters:

  • Name, Storage Type, and Type are the same as in V1 and cannot be changed.

  • The system automatically generates the dataset version number, which cannot be changed.

  • For other key parameters, see Create a basic dataset.
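The rules above can be modeled in a few lines: a new version inherits the fixed fields and receives a system-assigned version number. This sketch is a conceptual model, not the PAI implementation. The class, the field names, and the assumption that version numbers increase by one (V1, V2, and so on) are illustrative.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DatasetVersion:
    """Hypothetical record of one dataset version."""
    name: str          # fixed across versions
    storage_type: str  # fixed across versions
    data_type: str     # fixed across versions
    version: int       # assigned by the system, not by the caller
    source_path: str   # may differ from version to version


def create_version(latest: DatasetVersion, source_path: str) -> DatasetVersion:
    """Copy the fixed fields from the latest version and assign the next number."""
    return replace(latest, version=latest.version + 1, source_path=source_path)
```

Because the dataclass is frozen, an attempt to change name, storage_type, or data_type on an existing version raises an error, which mirrors the restriction in the console.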

View public datasets

The system provides a variety of public datasets, such as MMLU, CMMLU, and GSM8K. On the Public Dataset tab, click a dataset name to view basic information about that dataset.

Manage datasets

For basic datasets, you can view versions, create a version, make the dataset public, and delete the dataset. For labeled datasets, you can view data, make the dataset public, and delete the dataset.

Note the following:

  • For datasets with Visibility set to Visible Only to the Dataset Owner, click Set Dataset to Public to share the dataset within the workspace so that all workspace members can view it. After a dataset is made public, the change cannot be reverted. Proceed with caution.

  • If a RAM user encounters permission errors when viewing dataset data, grant the RAM user the required permissions.

  • Deleting a dataset may disrupt existing tasks. After you delete a dataset, it cannot be restored. Proceed with caution.
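Because making a dataset public is a one-way change, automation that wraps these actions may want to guard against the reverse transition. The following sketch models visibility as a small state machine. The enum and function names are hypothetical, and the label for the public state is an assumption.

```python
from enum import Enum


class Visibility(Enum):
    OWNER_ONLY = "Visible Only to the Dataset Owner"
    PUBLIC = "Public"  # assumed label for the shared state


def change_visibility(current: Visibility, requested: Visibility) -> Visibility:
    """Allow owner-only to public, and reject public to owner-only."""
    if current is Visibility.PUBLIC and requested is not Visibility.PUBLIC:
        raise ValueError("a public dataset cannot be reverted to owner-only")
    return requested
```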