Before you process data or train models, you need to prepare datasets. Platform for AI (PAI) Asset Management provides a dataset management feature that lets you create and manage datasets in multiple versions. With dataset versioning, you can reproduce experiments precisely, track data versions, record data lineage, and roll back to a previous version if issues occur, which helps keep business operations uninterrupted.
Overview
The dataset management feature supports both basic and labeled datasets. Basic datasets typically consist of a large volume of raw data and are mainly used to pre-train models so that they learn broad features and patterns. Labeled datasets carry explicit labels added through manual annotation and are mainly used for model fine-tuning and evaluation to improve performance on specific tasks.
Item | Basic dataset | Labeled dataset |
Definition | Unlabeled raw data | Data manually annotated with labels |
Data processing | Data cleansing, duplicate removal, and others | Data labeling, validation, and others |
Application scenarios |
|
|
Go to the Datasets page
Log on to the PAI console.
In the upper-left corner, select a region.
In the left-side navigation bar, choose Workspaces, and click the name of the desired workspace.
In the left-side navigation bar, choose AI Asset Management > Datasets.
Create a basic dataset
On the Custom Datasets > Basic Datasets tab, click Create Dataset.
Storage Type supports Object Storage Service (OSS), file storage (General-purpose NAS, Extreme NAS, CPFS, and CPFS for Lingjun), and MaxCompute.
Configure the following key parameters:
Storage type is OSS
Parameter | Description |
Type | The data type. Supported types include image, text, audio, video, table, and general. After you select a specific type, the system filters datasets for subsequent labeling scenarios. |
Owner | The dataset owner. Only workspace administrators can set this parameter. |
Import Format/OSS Path |
|
Default Mount Path | The default mount path for the data, which is usually used in DSW and DLC:
|
Enable Version Acceleration | When Import Format is set to Folder, the option to enable dataset version acceleration becomes available. Configure the following key parameters:
|
Storage type is file storage
Parameter | Description |
Type | The data type. Supported types include image, text, audio, video, table, and general. After you select a specific type, the system filters datasets for subsequent labeling scenarios. |
Owner | The dataset owner. Only workspace administrators can set this parameter. |
Select File System | Choose the file system that corresponds to the Storage Type. |
Mount Target | Select a mount target under the file system. |
File System Path | Select an existing path within the file system, such as |
Default Mount Path | The default mount path for the data, which is usually used in DSW and DLC:
|
Enable Version Acceleration | When Storage Type is set to General-purpose NAS, Extreme NAS, or Cloud Parallel File Storage (CPFS), the option to enable dataset version acceleration becomes available. Configure the following key parameters:
|
Storage type is MaxCompute
Parameter | Description |
Type | Supports only Table. |
Owner | The dataset owner. Only workspace administrators can set this parameter. |
Default Mount Path | The default mount path for the data, which is usually used in DSW and DLC:
|
Enable Version Acceleration | Enables dataset version acceleration. Configure the following key parameters:
|
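The rules in the three parameter tables above can be collected into a small pre-flight check. The following is a minimal sketch, not a PAI SDK call: the names `allowed_types` and `acceleration_available` are hypothetical, and the string values only mirror the console labels used in this topic. The MaxCompute branch assumes that the option has no precondition, because the table lists none.

```python
from typing import Optional, Set

# Console labels taken from this topic; the helper names are hypothetical.
FILE_STORAGE = {"General-purpose NAS", "Extreme NAS", "CPFS", "CPFS for Lingjun"}
STORAGE_TYPES = {"OSS", "MaxCompute"} | FILE_STORAGE
ALL_DATA_TYPES = {"image", "text", "audio", "video", "table", "general"}


def allowed_types(storage_type: str) -> Set[str]:
    """Return the data types that can be selected for a storage type."""
    if storage_type not in STORAGE_TYPES:
        raise ValueError(f"unknown storage type: {storage_type!r}")
    if storage_type == "MaxCompute":
        return {"table"}  # MaxCompute supports only Table
    return set(ALL_DATA_TYPES)


def acceleration_available(storage_type: str,
                           import_format: Optional[str] = None) -> bool:
    """Return True if Enable Version Acceleration can be selected."""
    if storage_type not in STORAGE_TYPES:
        raise ValueError(f"unknown storage type: {storage_type!r}")
    if storage_type == "OSS":
        # For OSS, the option appears only when Import Format is Folder.
        return import_format == "Folder"
    if storage_type in FILE_STORAGE:
        # CPFS for Lingjun is not among the file systems listed for this option.
        return storage_type != "CPFS for Lingjun"
    # MaxCompute: assumed to be always available (no precondition is listed).
    return True
```

For example, `acceleration_available("OSS", "Folder")` evaluates to `True`, while `acceleration_available("CPFS for Lingjun")` evaluates to `False`.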
Create a basic dataset version
On the Custom Dataset > Basic Dataset tab, click Create Version in the Actions column for the desired dataset.
Note the following key parameters:
Name, Storage Type, and Type are the same as in version V1 and cannot be changed.
The dataset version is automatically generated by the system and cannot be changed.
For other key parameters, see Create a basic dataset.
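The versioning rules above can be modeled locally to make them concrete. This is an illustrative sketch only and does not reflect how PAI stores datasets: the `Dataset` and `DatasetVersion` classes, the `uri` field, the `V<n>` label format, and the example bucket path are all assumptions. The point it shows is that name, storage type, and type belong to the dataset and are shared by every version, while the version label is generated and never supplied by the caller.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class DatasetVersion:
    version: str  # generated label, for example "V2" (format is an assumption)
    uri: str      # location of this version's data (hypothetical field)


@dataclass
class Dataset:
    # Fixed when the first version is created and shared by all versions.
    name: str
    storage_type: str
    data_type: str
    versions: List[DatasetVersion] = field(default_factory=list)

    def create_version(self, uri: str) -> DatasetVersion:
        """Append a version; the label is derived, not chosen by the caller."""
        new = DatasetVersion(version=f"V{len(self.versions) + 1}", uri=uri)
        self.versions.append(new)
        return new

    def get(self, version: str) -> DatasetVersion:
        """Look up an earlier version, for example to roll back to it."""
        for item in self.versions:
            if item.version == version:
                return item
        raise KeyError(version)


# Usage: two versions of the same dataset, then a lookup of the first one.
ds = Dataset(name="demo", storage_type="OSS", data_type="image")
ds.create_version("oss://example-bucket/demo/v1/")
ds.create_version("oss://example-bucket/demo/v2/")
previous = ds.get("V1")
```

Because `create_version` takes only a data location, there is no way for a later version to diverge in name, storage type, or type.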
View public datasets
The system provides a variety of public datasets, such as MMLU, CMMLU, and GSM8K. On the Public Dataset tab, click a dataset name to view its basic information.
Manage datasets
For basic datasets, you can view versions, create versions, make the datasets public, and delete them. For labeled datasets, you can view data, make the datasets public, and delete them.
Take note:
For datasets with Visibility set to Visible Only to the Dataset Owner, click Set Dataset to Public to share the dataset within the workspace so that all workspace members can view it. After a dataset is made public, its visibility cannot be reverted. Proceed with caution.
If you encounter permission errors when viewing dataset data as a RAM user, grant the required permissions to the RAM user.
Deleting a dataset may disrupt existing tasks. After you delete a dataset, it cannot be restored. Proceed with caution.
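Both cautions above describe one-way operations. If you script around dataset management, a guard such as the following can make that explicit. This is a hedged sketch of the idea, not PAI behavior or a PAI API: the `DatasetRecord` class and its `confirm` flag are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    name: str
    public: bool = False   # False means visible only to the dataset owner
    deleted: bool = False

    def set_public(self, confirm: bool = False) -> None:
        """Make the dataset public. There is deliberately no inverse method."""
        if self.deleted:
            raise RuntimeError("dataset has been deleted")
        if not confirm:
            raise ValueError("making a dataset public cannot be reverted; pass confirm=True")
        self.public = True

    def delete(self, confirm: bool = False) -> None:
        """Delete the dataset. A deleted dataset cannot be restored."""
        if not confirm:
            raise ValueError("deletion cannot be undone; pass confirm=True")
        self.deleted = True
```

Requiring an explicit `confirm=True` mirrors the "Proceed with caution" prompts: the irreversible step cannot happen by accident.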