Object Storage Service: Use data in OSS to build an iterable dataset suitable for sequential streaming reading

Last Updated: Sep 24, 2024

Iterable datasets are suitable for scenarios in which memory is limited or large amounts of data are stored. They are especially suitable for scenarios in which data is processed sequentially and you do not have high requirements on random access or parallel processing. This topic describes how to build an iterable dataset by using OssIterableDataset.

Prerequisites

OSS Connector for AI/ML is installed and configured. For more information, see Install OSS Connector for AI/ML and Configure OSS Connector for AI/ML.

Build a dataset

Methods

You can use one of the following methods to build an iterable dataset by using OssIterableDataset:

  • OSS_URI prefix: suitable for scenarios in which the storage paths of OSS data have uniform rules.

  • OSS_URI list: suitable for scenarios in which the storage paths of OSS data are clear but scattered.

  • manifest file: suitable for scenarios in which the dataset contains a large number of files (such as tens of millions), the dataset is frequently loaded, and data indexing is enabled for the bucket. This method reduces the fees that are generated when you call API operations to list OSS objects.

Build a dataset by using the OSS_URI prefix

The following sample code provides an example on how to use the from_prefix method of OssIterableDataset to build a dataset by specifying the OSS_URI prefix in OSS:

from osstorchconnector import OssIterableDataset

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"

# Use the from_prefix method of OssIterableDataset to build a dataset.
iterable_dataset = OssIterableDataset.from_prefix(OSS_URI, endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH)

# Traverse the objects in the dataset.
for item in iterable_dataset:
    print(item.key)
    print(item.size)
    content = item.read()
    print(len(content))

Build a dataset by using the OSS_URI list

The following sample code provides an example on how to use the from_objects method of OssIterableDataset to build a dataset by specifying an OSS_URI list. In the example, uris is a list of strings that contains multiple OSS_URIs.

from osstorchconnector import OssIterableDataset

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"

uris = [
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00001.png",
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00002.png",
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00003.png"
]

# Use the from_objects method of OssIterableDataset to build a dataset.
iterable_dataset = OssIterableDataset.from_objects(uris, endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH)

# Traverse the objects in the dataset.
for item in iterable_dataset:
    print(item.key)
    print(item.size)
    content = item.read()
    print(len(content))

Build a dataset by using a manifest file

You need to create a manifest file and use the manifest file to build a dataset.

  1. Create a manifest file:

    Run the touch manifest_file command to create a manifest file, and then add content to the manifest file by referring to the following examples:

    Example of a manifest file that contains the names of OSS objects:

    Img/BadImag/Bmp/Sample001/img001-00001.png
    Img/BadImag/Bmp/Sample001/img001-00002.png
    Img/BadImag/Bmp/Sample001/img001-00003.png

    Example of a manifest file that contains the names and labels of OSS objects. If you use the built-in imagenet_manifest_parser method, separate the object name and the label with a tab character:

    Img/BadImag/Bmp/Sample001/img001-00001.png label1
    Img/BadImag/Bmp/Sample001/img001-00002.png label2
    Img/BadImag/Bmp/Sample001/img001-00003.png label3
  2. Build a dataset by using a manifest file.

    The following sample code provides an example on how to use the from_manifest_file method of OssIterableDataset to build a dataset by specifying the manifest file:

    import io
    from typing import Iterable,Tuple,Union
    from osstorchconnector import OssIterableDataset
    from osstorchconnector import imagenet_manifest_parser
    
    ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
    CONFIG_PATH = "/etc/oss-connector/config.json"
    CRED_PATH = "/root/.alibabacloud/credentials"
    OSS_BASE_URI = "oss://ai-testset/EnglistImg/"
    MANIFEST_FILE_URI = "oss://ai-testset/EnglistImg/manifest_file"
    
    # Use the from_manifest_file method of OssIterableDataset to build a dataset from a local manifest file.
    # The manifest_file_path parameter specifies the local path of the manifest file.
    # The manifest_parser parameter specifies the method that is used to parse the manifest file. In this example, the built-in parsing method imagenet_manifest_parser is used.
    # The oss_base_uri parameter specifies the base OSS URI, which is prepended to the URIs parsed from the manifest file to form complete OSS_URIs.
    MANIFEST_FILE_LOCAL = "/path/to/manifest_file.txt"
    iterable_dataset = OssIterableDataset.from_manifest_file(manifest_file_path=MANIFEST_FILE_LOCAL, manifest_parser=imagenet_manifest_parser, oss_base_uri=OSS_BASE_URI, endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH)
    for item in iterable_dataset:
        print(item.key)
        print(item.size)
        print(item.label)
        content = item.read()
        print(len(content))
    
    # Use the from_manifest_file method of OssIterableDataset to build a dataset by specifying the manifest file in the OSS bucket.
    iterable_dataset = OssIterableDataset.from_manifest_file(manifest_file_path=MANIFEST_FILE_URI, manifest_parser=imagenet_manifest_parser, oss_base_uri=OSS_BASE_URI, endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH)
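
    # Traverse the objects in the dataset that is built from the manifest file stored in OSS.
    # This loop mirrors the one above and assumes that a manifest file exists at MANIFEST_FILE_URI.
    for item in iterable_dataset:
        print(item.key)
        print(item.size)
        print(item.label)
        content = item.read()
        print(len(content))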

Data type in OSS Connector for AI/ML

Objects in a dataset are returned as the data type that is defined by OSS Connector for AI/ML, which provides I/O methods such as read. For more information, see Data type in OSS Connector for AI/ML.
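
As a quick reference, the following minimal sketch consolidates the element attributes and I/O methods that appear in the examples of this topic (key, size, label, and read). It assumes that iterable_dataset has already been built by one of the methods above; see Data type in OSS Connector for AI/ML for the complete interface.

# A minimal sketch that uses only the element attributes and methods shown in this topic.
for item in iterable_dataset:
    print(item.key)        # Name of the OSS object.
    print(item.size)       # Size of the OSS object in bytes.
    print(item.label)      # Label parsed from a manifest file, if available.
    content = item.read()  # Read the full object content as bytes.
    print(len(content))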

Parameter description

If you use OssIterableDataset or OssMapDataset to build a dataset, you must configure the following parameters:

  • endpoint (string, required): The endpoint that is used to access OSS. This is a common parameter. For more information, see Regions and endpoints.

  • transform (object, optional): The transform function that is used to customize the response of DataObject. This is a common parameter. You can use a custom method based on your business requirements. For more information, see transform.

    Important: If the DataObject object is directly returned in the transform function, the iterator may fail to work. Do not directly return the DataObject object. If you want to return the DataObject object, return a copy that is created by the copy method.

  • cred_path (string, required): The path of the authentication file. Default value: /root/.alibabacloud/credentials. This is a common parameter. For more information, see Configure access credentials.

  • config_path (string, required): The path of the OSS Connector for AI/ML configuration file. Default value: /etc/oss-connector/config.json. This is a common parameter. For more information, see Configure OSS Connector for AI/ML.

  • oss_uri (string, required): The OSS resource path that is used to build a dataset by prefix. This is a from_prefix parameter. Only OSS_URIs that start with oss:// are supported.

  • object_uris (string, required): The list of OSS resource paths that are used to build a dataset. This is a from_objects parameter. Only OSS_URIs that start with oss:// are supported.

  • manifest_file_path (string, required): The path of the manifest file. A local file path or an OSS_URI that starts with oss:// is supported. This is a from_manifest_file parameter.

  • manifest_parser (callable object, required): The method that parses the manifest file. The method receives an open manifest file as input and returns an iterator in which each element is an (oss_uri, label) tuple. This is a from_manifest_file parameter. For more information, see manifest_parser. You can also customize the manifest_parser method based on the manifest file format of your dataset.

  • oss_base_uri (string, required): The OSS base URI, which is concatenated with the possibly incomplete OSS_URIs in the manifest file to form complete OSS_URIs. This is a from_manifest_file parameter. If no base URI is required, pass an empty string ("").
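
For example, based on the sample values used in this topic, the base URI is presumably prepended to each manifest entry to form the complete OSS_URI, as shown in the following minimal illustration (not part of the connector API):

# A minimal illustration of how oss_base_uri and a manifest entry are assumed to form a complete OSS_URI.
oss_base_uri = "oss://ai-testset/EnglistImg/"
manifest_entry = "Img/BadImag/Bmp/Sample001/img001-00001.png"
full_oss_uri = oss_base_uri + manifest_entry
print(full_oss_uri)  # oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00001.png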

Built-in methods

transform

When you iterate over a dataset, each element is passed to the transform(DataObject) function, and the value that the function returns is what the iterator yields. DataObject is the data type of OSS Connector for AI/ML.

You can specify a custom transform function when you build a dataset. If you do not specify one, the default function is used.

Default method

The following sample code shows the default transform function. You do not need to specify it when you build a dataset.

# The default transform function.
def identity(obj: DataObject) -> DataObject:
    if obj is not None:
        return obj.copy()
    else:
        return None

Custom method

The following sample code provides an example of how to use a custom method:

import io
import torchvision.transforms as transforms
from PIL import Image
from osstorchconnector import OssIterableDataset

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"

# Specify the transform operations on image objects.
trans = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Use a transform function to process the input object.
def transform(obj):
    # Read the object content, decode it as an RGB image, and apply the transform operations.
    img = Image.open(io.BytesIO(obj.read())).convert('RGB')
    val = trans(img)
    return val, obj.label

# Specify the transform=transform parameter when you build the dataset.
iterable_dataset = OssIterableDataset.from_prefix(OSS_URI, endpoint=ENDPOINT, transform=transform, cred_path=CRED_PATH, config_path=CONFIG_PATH)
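
The elements that the dataset yields are the return values of the transform function. The following minimal sketch, which assumes that each element is a (tensor, label) tuple as returned by the custom transform above, shows how to inspect the transformed output:

# Inspect the first transformed element. Each element is a (tensor, label) tuple
# that is returned by the custom transform function defined above.
for tensor, label in iterable_dataset:
    print(tensor.shape)  # Expected to be torch.Size([3, 224, 224]) after Resize, CenterCrop, and ToTensor.
    print(label)
    break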
 

manifest_parser

OSS Connector for AI/ML provides a built-in manifest_parser method named imagenet_manifest_parser. The following sample code shows how to import the method when you build a dataset:

from osstorchconnector import imagenet_manifest_parser

The following sample code shows the implementation of imagenet_manifest_parser:

import io
import logging
from typing import Iterable, Tuple

def imagenet_manifest_parser(reader: io.IOBase) -> Iterable[Tuple[str, str]]:
    lines = reader.read().decode("utf-8").strip().split("\n")
    for i, line in enumerate(lines):
        try:
            # Each line contains an object name and an optional label that are separated by a tab character.
            items = line.strip().split('\t')
            if len(items) >= 2:
                key = items[0]
                label = items[1]
                yield (key, label)
            elif len(items) == 1:
                key = items[0]
                yield (key, '')
            else:
                raise ValueError("format error")
        except ValueError as e:
            logging.error(f"Error: {e} for line {i}: {line}")
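
Because the manifest_parser parameter accepts any callable with this signature, you can also supply your own parser. The following is a minimal sketch of a hypothetical parser for a comma-separated manifest file. The csv_manifest_parser name and the comma-separated format are assumptions for illustration only; the signature mirrors imagenet_manifest_parser shown above.

import io
import logging
from typing import Iterable, Tuple

# A hypothetical parser for manifest lines in the "object_name,label" format.
def csv_manifest_parser(reader: io.IOBase) -> Iterable[Tuple[str, str]]:
    lines = reader.read().decode("utf-8").strip().split("\n")
    for i, line in enumerate(lines):
        items = line.strip().split(',')
        if len(items) >= 2:
            yield (items[0], items[1])
        elif len(items) == 1 and items[0]:
            yield (items[0], '')
        else:
            logging.error(f"Format error at line {i}: {line}")

Pass the custom parser by using the manifest_parser parameter of from_manifest_file, in the same way as imagenet_manifest_parser.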

Create a PyTorch data loader by using a dataset

The following sample code provides an example on how to create a PyTorch data loader by using an iterable dataset as a data source. The iterable dataset is built by using OssIterableDataset.

import torch
from osstorchconnector import OssIterableDataset

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"

def transform(obj):
    # Read the object content. In an actual training job, you can decode and preprocess the data here.
    data = obj.read()
    return obj.key, obj.label

# Use the from_prefix method of OssIterableDataset to build a dataset.
iterable_dataset = OssIterableDataset.from_prefix(OSS_URI, endpoint=ENDPOINT, transform=transform, cred_path=CRED_PATH, config_path=CONFIG_PATH)

# Create a PyTorch data loader based on the dataset.
loader = torch.utils.data.DataLoader(iterable_dataset, batch_size=256, num_workers=32, prefetch_factor=2)

# Use the data in the training loop.
for batch in loader:
    # Perform the training operations.
    ...

References

OSS Connector for AI/ML can also be used in containerized environments. If you run data training jobs in containers, see Build a Docker image that contains an OSS Connector for AI/ML environment.