High-precision models require high-quality datasets, and creating such datasets is the goal of data preparation. Platform for AI (PAI) provides a dataset management module that allows you to create datasets based on various types of data, including data stored in Alibaba Cloud storage services. The module also allows you to scan Object Storage Service (OSS) folders to generate datasets and provides common third-party public datasets that you can use for intelligent labeling and model training. This topic describes how to create and manage datasets.
Overview
The dataset module allows you to create custom datasets or use public datasets.
Create a custom dataset.
Create a dataset based on data that is stored in an Alibaba Cloud storage service: Create a dataset based on data that is stored in OSS or File Storage NAS (NAS). You can use the dataset in subsequent data processing or modeling.
Create a dataset by scanning a folder: Scan a folder that is stored in OSS to generate an index file with the .manifest extension and use the index file as a dataset. You can use the dataset in iTAG labeling scenarios.
Create a dataset by registering a public dataset.
The public datasets available in the dataset management module are open source datasets, such as MMLU, CMMLU, and GSM8K. Alibaba Cloud assumes no responsibility for the availability, compliance, or security of these third-party datasets. Before you use the datasets, read the third-party agreements to ensure legal and compliant use.
Prerequisites
An AI workspace is created. The datasets that you want to register are added to the AI workspace.
Limits
In the China (Ulanqab) region, you can create datasets only by using data from an Alibaba Cloud storage service or by scanning a folder.
You can create CPFS for Lingjun datasets only in the China (Ulanqab) region. Cloud Parallel File Storage (CPFS) datasets are not supported in the China (Ulanqab) region.
Account and permission requirements
Alibaba Cloud account: You can use an Alibaba Cloud account to complete all operations without additional authorization.
RAM user: Grant the following permissions to the RAM user:
Dataset-related permissions
Add the RAM user to the workspace as a member and assign the member a role that has the required dataset permissions. For information about the permissions of each role, visit the Roles and Permissions page. For information about how to add a RAM user as a workspace member, see Manage workspace members.
Permissions to view and use OSS buckets when you use an OSS dataset
Use the following policy document to create a custom policy and attach the policy to the RAM user. Replace mybucket with the name of your bucket. For information about how to create a policy, see Create custom policies. For information about how to grant permissions to a RAM user, see Grant permissions to a RAM user.
{
  "Version": "1",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "oss:ListBuckets",
        "oss:GetBucketStat",
        "oss:GetBucketInfo",
        "oss:GetBucketTagging",
        "oss:GetBucketLifecycle",
        "oss:GetBucketWorm",
        "oss:GetBucketVersioning",
        "oss:GetBucketAcl",
        "oss:PutObject",
        "oss:GetBucketCors",
        "oss:PutBucketCors"
      ],
      "Resource": "acs:oss:*:*:*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "oss:ListObjects",
        "oss:GetBucketAcl"
      ],
      "Resource": "acs:oss:*:*:mybucket"
    },
    {
      "Effect": "Allow",
      "Action": [
        "oss:GetObject",
        "oss:GetObjectAcl"
      ],
      "Resource": "acs:oss:*:*:mybucket/*"
    }
  ]
}
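The policy uses mybucket as a placeholder bucket name. As a minimal sketch, the following Python snippet generates the same policy document for a bucket that you specify. The function name build_oss_dataset_policy is hypothetical and is not part of any Alibaba Cloud SDK.

```python
import json

# Actions that the policy grants on all OSS resources (copied from the policy above).
GLOBAL_ACTIONS = [
    "oss:ListBuckets",
    "oss:GetBucketStat",
    "oss:GetBucketInfo",
    "oss:GetBucketTagging",
    "oss:GetBucketLifecycle",
    "oss:GetBucketWorm",
    "oss:GetBucketVersioning",
    "oss:GetBucketAcl",
    "oss:PutObject",
    "oss:GetBucketCors",
    "oss:PutBucketCors",
]


def build_oss_dataset_policy(bucket: str) -> dict:
    """Return the policy document with `bucket` in place of mybucket (hypothetical helper)."""
    if not bucket or "/" in bucket or "*" in bucket:
        raise ValueError(f"not a plain bucket name: {bucket!r}")
    return {
        "Version": "1",
        "Statement": [
            {"Effect": "Allow", "Action": GLOBAL_ACTIONS, "Resource": "acs:oss:*:*:*"},
            {
                "Effect": "Allow",
                "Action": ["oss:ListObjects", "oss:GetBucketAcl"],
                "Resource": f"acs:oss:*:*:{bucket}",
            },
            {
                "Effect": "Allow",
                "Action": ["oss:GetObject", "oss:GetObjectAcl"],
                "Resource": f"acs:oss:*:*:{bucket}/*",
            },
        ],
    }


if __name__ == "__main__":
    # Print the document so that it can be pasted into the policy editor.
    print(json.dumps(build_oss_dataset_policy("mybucket"), indent=2))
```

Generating the document from one function keeps the bucket-level and object-level Resource values in sync when you switch buckets.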
Permissions to view and use the NAS file systems including the permissions to query file systems and protocol service information (only for CPFS) when you use a NAS or CPFS dataset
Use the following policy document to create a custom policy and attach the policy to the RAM user. For information about how to create a policy, see Create custom policies. For information about how to grant permissions to a RAM user, see Grant permissions to a RAM user.
{
  "Version": "1",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "nas:DescribeFileSystems",
        "nas:DescribeProtocolMountTarget",
        "nas:DescribeProtocolService"
      ],
      "Resource": "acs:nas:*:*:filesystem/*"
    }
  ]
}
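Policy documents are often edited by hand, so a quick local check can catch malformed entries before you create the policy. The following Python sketch is an illustration only. find_policy_problems is a hypothetical helper, and it does not replace the validation that the RAM console performs.

```python
import json


def find_policy_problems(policy_text: str) -> list[str]:
    """Return human-readable problems found in a policy document (hypothetical helper)."""
    problems = []
    policy = json.loads(policy_text)  # raises ValueError if the text is not valid JSON
    for i, stmt in enumerate(policy.get("Statement", [])):
        if stmt.get("Effect") not in ("Allow", "Deny"):
            problems.append(f"statement {i}: Effect must be Allow or Deny")
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        for action in actions:
            if action != action.strip():
                problems.append(f"statement {i}: action {action!r} has surrounding whitespace")
            elif ":" not in action:
                problems.append(f"statement {i}: action {action!r} lacks a service prefix")
        if "Resource" not in stmt:
            problems.append(f"statement {i}: Resource is missing")
    return problems


if __name__ == "__main__":
    sample = """{"Version": "1", "Statement": [{"Effect": "Allow",
        "Action": ["nas:DescribeFileSystems"], "Resource": "acs:nas:*:*:filesystem/*"}]}"""
    print(find_policy_problems(sample))
```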
Create a custom dataset
Go to the Datasets page.
Log on to the PAI console.
In the upper-left corner, select a region based on your business requirements.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to use.
In the left-side navigation pane, choose AI Computing Asset Management > Datasets.
On the Custom Dataset tab, click Create Dataset.
Create a dataset based on data that is stored in an Alibaba Cloud storage service
If you set the Create Dataset parameter to From Alibaba Cloud, you can set the Select Data Storage parameter to one of the following values: OSS, General-purpose NAS, Extreme NAS, Cloud Paralleled File System (CPFS), or CPFS for LINGJUN. The following section describes the parameters that you need to configure for each storage service.
Note: You can mount only datasets of the general-purpose NAS type in Elastic Algorithm Service (EAS).
You can create CPFS for LINGJUN datasets only in the China (Ulanqab) region.
You can mount NAS file systems for which encrypted transmission is configured for Deep Learning Containers (DLC) and Data Science Workshop (DSW) jobs.
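The region and mounting constraints described in this topic can be sketched as a small pre-check. The following Python snippet is an illustration under stated assumptions: the function name, the short storage labels, and the region strings are hypothetical, and the EAS rule is applied only to file system storage because OSS folders are described separately as usable in EAS.

```python
# Short labels for the values of the Select Data Storage parameter (hypothetical keys).
STORAGE_TYPES = {"OSS", "General-purpose NAS", "Extreme NAS", "CPFS", "CPFS for LINGJUN"}
FILE_SYSTEM_TYPES = STORAGE_TYPES - {"OSS"}
ULANQAB = "China (Ulanqab)"


def check_storage_choice(storage: str, region: str, mount_in_eas: bool = False) -> list[str]:
    """Return the documented constraints that a storage choice violates."""
    if storage not in STORAGE_TYPES:
        return [f"unknown storage type: {storage}"]
    issues = []
    if storage == "CPFS for LINGJUN" and region != ULANQAB:
        issues.append("CPFS for LINGJUN datasets can be created only in China (Ulanqab)")
    if storage == "CPFS" and region == ULANQAB:
        issues.append("CPFS datasets are not supported in China (Ulanqab)")
    # Assumption: the EAS note restricts file system datasets, not OSS datasets.
    if mount_in_eas and storage in FILE_SYSTEM_TYPES and storage != "General-purpose NAS":
        issues.append("only general-purpose NAS datasets can be mounted in EAS")
    return issues


if __name__ == "__main__":
    for choice in [("CPFS for LINGJUN", "China (Hangzhou)"), ("Extreme NAS", ULANQAB, True)]:
        print(choice, check_storage_choice(*choice))
```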
OSS
Parameter
Description
Property
File: Select a file. If the dataset that you want to create is used in iTAG, we recommend that you select a file. The path of the generated dataset is the same as the path of the selected file.
Folder: Select a folder path. The folder can be mounted to a container. If the dataset that you want to create is used for jobs related to DSW, DLC, or EAS, we recommend that you select a folder.
Dataset Owner
The owner of the dataset. Only workspace administrators can configure this parameter.
Dataset Type
The type of the dataset. Valid values: Image, Text, Audio, Video, and General. If you select a specific dataset type, the system displays datasets of the specified type in the subsequent labeling scenarios.
Default Mount Path
You can use the default mount path in DLC and DSW.
When you create an instance in DSW, you can mount the file system that you create to the default mount path.
When you run code in DLC, the system searches for files in the default mount path. Example: python /root/data/file.py.
Enable Dataset Acceleration
This parameter is available only if you set the Property parameter to Folder. For more information, see Overview of dataset accelerator. Take note of the following parameters:
Maximum Capacity: Specify the capacity of the slot. The slot capacity must be greater than or equal to the dataset capacity.
Accelerated Mount Target: By default, an internal mount target is used. You can use an existing mount target or create a mount target.
Note: If you use Lingjun resources and you set the Accelerated Mount Target parameter to Create Mount Target, set the Mount Target Type parameter to VPC. In addition, the VPC and the vSwitch must be the same as those of the Lingjun resources that you use.
Accelerated Dataset Default Mount Path: The default mount path of the data.
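After a folder dataset is mounted, code in a DSW instance or DLC job reads it like a local directory. The following Python sketch lists the files under a mount path. The path /root/data is taken from the example above and may differ in your configuration, and list_dataset_files is a hypothetical helper.

```python
from pathlib import Path


def list_dataset_files(mount_path: str, pattern: str = "*") -> list[Path]:
    """Return all regular files under the mount path that match the pattern, sorted."""
    root = Path(mount_path)
    if not root.is_dir():
        # A missing directory usually means the dataset is not mounted at this path.
        raise FileNotFoundError(f"{mount_path} is not a mounted directory")
    return sorted(p for p in root.rglob(pattern) if p.is_file())


if __name__ == "__main__":
    # Assumed mount path; replace it with the Default Mount Path of your dataset.
    try:
        for path in list_dataset_files("/root/data", "*.py"):
            print(path)
    except FileNotFoundError as err:
        print(err)
```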
NAS/CPFS
Parameter
Description
Dataset Owner
The owner of the dataset. Only workspace administrators can configure this parameter.
Dataset Type
The type of the dataset. Valid values: Image, Text, Audio, Video, and General. If you select a specific dataset type, the system displays datasets of the specified type in the subsequent labeling scenarios.
Select File System
Select a file system. The type of the file system must be the same as the value that you specified for the Select Data Storage parameter.
Mount Target
The mount target that is used to access the NAS file system.
File System Path
An existing path in the NAS file system. Example: /.
Default Mount Path
You can use the default mount path in DLC and DSW.
When you create an instance in DSW, you can mount the file system that you create to the default mount path.
When you run code in DLC, the system searches for files in the default mount path. Example: python /root/data/file.py.
Enable Dataset Acceleration
This parameter is available only if you set the Select Data Storage parameter to General-purpose NAS, Extreme NAS, or CPFS. For more information, see Overview of dataset accelerator. Take note of the following parameters:
Maximum Capacity: Specify the capacity of the slot. The slot capacity must be greater than or equal to the dataset capacity.
Accelerated Mount Target: By default, an internal mount target is used. You can use an existing mount target or create a mount target.
Note: If you use Lingjun resources and you set the Accelerated Mount Target parameter to Create Mount Target, set the Mount Target Type parameter to VPC. In addition, the VPC and the vSwitch must be the same as those of the Lingjun resources that you use.
Accelerated Dataset Default Mount Path: The default mount path of the data.
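Because the slot capacity must be greater than or equal to the dataset capacity, it can help to measure the dataset before you set Maximum Capacity. The following Python sketch sums the file sizes under a directory and rounds the result up. The helper names are hypothetical, and the use of GiB as the unit is an assumption; check the unit that the console displays.

```python
import math
from pathlib import Path

GIB = 1024 ** 3  # assumed unit for the capacity value


def dataset_size_bytes(root: str) -> int:
    """Return the total size in bytes of all regular files under root."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def min_slot_capacity_gib(size_bytes: int) -> int:
    """Return the smallest whole number of GiB that is >= size_bytes (at least 1)."""
    return max(1, math.ceil(size_bytes / GIB))


if __name__ == "__main__":
    size = dataset_size_bytes(".")
    print(f"{size} bytes -> at least {min_slot_capacity_gib(size)} GiB")
```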
Create a dataset by scanning a folder
Parameter
Description
Dataset Owner
The owner of the dataset. Only workspace administrators can configure this parameter.
Dataset Type
The type of the dataset. Valid values: Image, Text, Audio, Video, and General. If you select a specific dataset type, the system displays datasets of the specified type in the subsequent labeling scenarios.
Path Wildcard
The path wildcard is used to scan and filter files in the specified format. You can scan up to 100,000 files.
Preview
Click Scan. The system indexes the files based on the specified OSS path and wildcard characters and previews the files in the JSONL format.
Save Result To
After the scan is completed, the system generates a file named dataset_****.manifest. You can change the file name and select the OSS path in which the file is saved.
Click Submit.
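The scan can be pictured as filtering object keys with a wildcard and writing one JSON record per matched file. The following Python sketch illustrates that idea locally. The record layout {"data": {"source": ...}}, the function name, and the use of Python fnmatch semantics (in which * also matches /) are assumptions, not the documented behavior of the console.

```python
import fnmatch
import json

MAX_FILES = 100_000  # scan limit stated for the Path Wildcard parameter


def build_manifest_lines(bucket: str, keys: list[str], wildcard: str) -> list[str]:
    """Return JSONL lines for the object keys that match the wildcard (hypothetical helper)."""
    # fnmatchcase is case-sensitive on every platform, like OSS object keys.
    matched = [key for key in keys if fnmatch.fnmatchcase(key, wildcard)]
    if len(matched) > MAX_FILES:
        raise ValueError(f"{len(matched)} files matched; the limit is {MAX_FILES}")
    # Assumed record layout: one JSON object per line that points at the OSS object.
    return [json.dumps({"data": {"source": f"oss://{bucket}/{key}"}}) for key in matched]


if __name__ == "__main__":
    keys = ["images/cat.jpg", "images/dog.png", "docs/readme.txt"]
    for line in build_manifest_lines("mybucket", keys, "images/*.jpg"):
        print(line)
```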
Public datasets
Go to the Datasets page.
Log on to the PAI console.
In the upper-left corner, select a region based on your business requirements.
In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to use.
In the left-side navigation pane, choose AI Computing Asset Management > Datasets.
On the Public Dataset tab, view the available public datasets.
The dataset management module provides various public datasets, such as MMLU, CMMLU, and GSM8K. You can click the name of a dataset to view the dataset details.
Manage datasets
On the Datasets page of the PAI console, you can view all datasets that you have permissions to manage. You can also perform operations on the datasets. For example, you can view the details of a dataset or delete a dataset.
You can find the dataset that you want to manage and click View datasets to go to the OSS path of the dataset and view the dataset details. You can also click Delete to delete the dataset.
Note: If the RAM user that you want to use to view the public dataset does not have the required permissions, you must use the Alibaba Cloud account to grant the AliyunOSSFullAccess permission to the RAM user. For more information, see the "Step 2: Grant permissions to the RAM user" section in the Use the credentials of a RAM user to log on to the OSS console topic.
If a message appears indicating that you do not have the related permissions when you view public datasets by using a RAM user that is granted the AliyunOSSFullAccess permission, ignore the message and close the window.
For datasets whose visibility scope is Visible Only to the Dataset Owner, you can click Set Dataset to Public to allow all users in the workspace to view the datasets.
Important: After you make a dataset visible to all users in the workspace, you can no longer set the visibility scope of the dataset to Visible Only to the Dataset Owner. Proceed with caution.
You can add labels to datasets and filter datasets by label keys or label values.
You can click the column filter icon in the upper-right corner of the Datasets page to specify the columns that you want to display in the dataset list.