Dataset Accelerator (DatasetAcc) is a Platform as a Service (PaaS) offering from Alibaba Cloud Platform for AI (PAI) that accelerates datasets used for AI training in the cloud. It provides a centralized dataset acceleration solution for various cloud-native training engines by pre-analyzing and pre-processing the datasets that you use for training, which improves overall training efficiency.
Architecture
The following figure shows the architecture of Dataset Accelerator of PAI.
Limits
Before you use Dataset Accelerator, make sure that you understand the following limits:
Only datasets that are stored on Alibaba Cloud, such as datasets in Object Storage Service (OSS) or Cloud Parallel File System (CPFS), can be accelerated.
The datasets cannot be encrypted.
Data in dataset accelerators is read-only. Dynamic data writes are not supported.
A dataset accelerator can accelerate up to 100 TB of data.
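The limits above can be checked locally before you request acceleration. The following Python snippet is a minimal sketch, not part of any PAI SDK: the DatasetInfo class, the check_limits function, and the constant names are hypothetical. The storage set contains only the two examples named above, so other Alibaba Cloud storage services may also qualify. The 100 TB value is the per-accelerator limit, so a single dataset larger than that cannot fit in one accelerator.

```python
from dataclasses import dataclass
from typing import List

# Assumptions drawn from the limits listed above; all names are hypothetical.
SUPPORTED_STORAGE = {"OSS", "CPFS"}   # only the two examples named in the limits
MAX_ACCELERATED_TB = 100              # per-accelerator data limit


@dataclass(frozen=True)
class DatasetInfo:
    """Local description of a dataset. This is not a PAI API object."""
    name: str
    storage_type: str   # for example "OSS" or "CPFS"
    encrypted: bool
    size_tb: float


def check_limits(ds: DatasetInfo) -> List[str]:
    """Return human-readable limit violations; an empty list means none found."""
    problems = []
    if ds.storage_type.upper() not in SUPPORTED_STORAGE:
        problems.append(f"unsupported storage type: {ds.storage_type}")
    if ds.encrypted:
        problems.append("encrypted datasets cannot be accelerated")
    if ds.size_tb > MAX_ACCELERATED_TB:
        problems.append(
            f"size {ds.size_tb} TB exceeds the {MAX_ACCELERATED_TB} TB limit"
        )
    return problems


if __name__ == "__main__":
    candidate = DatasetInfo("training-images", "OSS", encrypted=False, size_tb=12.5)
    print(check_limits(candidate) or "no limit violations found")
```

The read-only limit is not checked here because it concerns how training jobs use the accelerated data, not a property of the source dataset.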
Billing rules
Dataset Accelerator is billed based on capacity and usage duration. For more information, see Billing of Dataset Accelerator.
Features
Training optimization for large numbers of small files
Dataset Accelerator pre-packages and processes data such as images, text, and videos based on the model type and network structure used in deep learning training. This improves the performance of training jobs that read a large number of small files.
Fully managed and ready-to-use service
Dataset Accelerator provides fully managed and ready-to-use cloud services.
Scalable service
Dataset Accelerator leverages the Infrastructure-as-a-Service (IaaS) capability to support quick resource scaling.
Data sharing
Datasets in dataset accelerators can be used by multiple training clusters.
Data security
Dataset Accelerator supports multi-tenant isolation to ensure data security among users.
Concepts
Before you use Dataset Accelerator, make sure that you understand the following concepts:
Accelerator
The billing and management unit of Dataset Accelerator. If you create a subscription accelerator, the system reserves related resources and billing starts when the accelerator is created. If you create a pay-as-you-go accelerator, you are charged for the accelerator based on slot usage.
Slot
You can create multiple slots for an accelerator. One slot accelerates one dataset. This allows you to accelerate deep learning tasks that use different datasets at the same time.
Relationship between an accelerator and a slot
You can create multiple accelerators and apply for multiple slots with different capacities for each accelerator. One slot is associated with one dataset.
Procedure
To use Dataset Accelerator, perform the following steps:
Create and manage accelerators
You can create accelerators based on your business requirements, team size, training frequency, and dataset sizes, and use multiple slots to accelerate multiple datasets for different training tasks.
Accelerators consume cloud resources. If you want to ensure resources are available to accelerate important training tasks, we recommend that you use the subscription billing method to create accelerators.
You can create slots in an accelerator based on the size of datasets used for training. An accelerator can contain multiple slots. The total capacity of all slots cannot exceed the capacity of the accelerator to which the slots belong.
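The capacity rule for slots can be illustrated with a small planning model. The following Python snippet is a sketch under stated assumptions, not the service API: the Accelerator and Slot classes, their fields, and the use of GB as the unit are hypothetical and exist only to show that the sum of slot capacities must stay within the accelerator capacity.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Slot:
    """Hypothetical slot: one slot is associated with one dataset."""
    name: str
    dataset: str
    capacity_gb: int   # unit is an assumption for this sketch


@dataclass
class Accelerator:
    """Hypothetical accelerator that holds slots up to its own capacity."""
    name: str
    capacity_gb: int
    slots: Dict[str, Slot] = field(default_factory=dict)

    def used_gb(self) -> int:
        return sum(s.capacity_gb for s in self.slots.values())

    def remaining_gb(self) -> int:
        return self.capacity_gb - self.used_gb()

    def add_slot(self, slot: Slot) -> None:
        """Add a slot, rejecting it if total slot capacity would exceed the accelerator."""
        if slot.capacity_gb <= 0:
            raise ValueError("slot capacity must be positive")
        if slot.name in self.slots:
            raise ValueError(f"slot {slot.name!r} already exists")
        if slot.capacity_gb > self.remaining_gb():
            raise ValueError(
                f"slot {slot.name!r} needs {slot.capacity_gb} GB, "
                f"only {self.remaining_gb()} GB left in {self.name!r}"
            )
        self.slots[slot.name] = slot


if __name__ == "__main__":
    acc = Accelerator("team-a", capacity_gb=1000)
    acc.add_slot(Slot("slot-1", "images-v1", 600))
    acc.add_slot(Slot("slot-2", "text-v1", 300))
    print(acc.remaining_gb())  # capacity still available for further slots
```

Planning slot sizes this way before you create them helps you choose an accelerator capacity that covers all datasets you intend to accelerate at the same time.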
After you create slots, the system preprocesses the associated datasets based on factors such as the data type, data size, framework, and model used for training. After the initialization is completed, the accelerator provides relevant interfaces for the training tasks.
When you create a dataset in PAI, you can enable dataset acceleration. You can use an accelerated dataset when you create a Data Science Workshop (DSW) instance or submit a Deep Learning Containers (DLC) job to improve data reading efficiency.