Iterable datasets are suitable for scenarios in which memory is limited or large amounts of data are stored. They are especially suitable for sequential processing scenarios that do not require random access or parallel processing of data. This topic describes how to build an iterable dataset by using OssIterableDataset.
Prerequisites
OSS Connector for AI/ML is installed and configured. For more information, see Install OSS Connector for AI/ML and Configure OSS Connector for AI/ML.
Build a dataset
Methods
You can use one of the following methods to build an iterable dataset by using OssIterableDataset:
OSS_URI prefix: suitable for scenarios in which the storage paths of OSS data have uniform rules.
OSS_URI list: suitable for scenarios in which the storage paths of OSS data are clear but scattered.
manifest file: suitable for scenarios in which the dataset contains a large number of files (for example, tens of millions), the dataset is frequently loaded, and data indexing is enabled for the bucket. This method reduces the fees generated by calling API operations to list OSS objects.
Build a dataset by using the OSS_URI prefix
The following sample code provides an example on how to use the from_prefix method of OssIterableDataset to build a dataset by specifying the OSS_URI prefix in OSS:
from osstorchconnector import OssIterableDataset
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"
# Use the from_prefix method of OssIterableDataset to build a dataset.
iterable_dataset = OssIterableDataset.from_prefix(OSS_URI, endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH)
# Traverse the objects in the dataset.
for item in iterable_dataset:
    print(item.key)
    print(item.size)
    content = item.read()
    print(len(content))
Build a dataset by using the OSS_URI list
The following sample code provides an example on how to use the from_objects method of OssIterableDataset to build a dataset by specifying the OSS_URI list in OSS. In the example, uris is an iterable of strings, each of which is an OSS_URI.
from osstorchconnector import OssIterableDataset
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
uris = [
"oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00001.png",
"oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00002.png",
"oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00003.png"
]
# Use the from_objects method of OssIterableDataset to build a dataset.
iterable_dataset = OssIterableDataset.from_objects(uris, endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH)
# Traverse the objects in the dataset.
for item in iterable_dataset:
    print(item.key)
    print(item.size)
    content = item.read()
    print(len(content))
Build a dataset by using a manifest file
You need to create a manifest file and use the manifest file to build a dataset.
Create a manifest file:
Run the touch manifest_file command to create a manifest file, and then populate the file by referring to the following examples.
Example of a manifest file that contains the names of OSS objects:
Img/BadImag/Bmp/Sample001/img001-00001.png
Img/BadImag/Bmp/Sample001/img001-00002.png
Img/BadImag/Bmp/Sample001/img001-00003.png
Example of a manifest file that contains the names and labels of OSS objects:
Img/BadImag/Bmp/Sample001/img001-00001.png label1
Img/BadImag/Bmp/Sample001/img001-00002.png label2
Img/BadImag/Bmp/Sample001/img001-00003.png label3
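A manifest file can also be generated programmatically. The following sketch writes one entry per line; the object keys and labels are illustrative placeholders. Because the built-in imagenet_manifest_parser splits each line on a tab character, the key and label are written as tab-separated fields:

```python
# Sketch: generate a manifest file with one entry per line.
# The object keys and labels below are illustrative placeholders.
# Keys and labels are tab-separated, matching the field separator
# that the built-in imagenet_manifest_parser expects.
entries = [
    ("Img/BadImag/Bmp/Sample001/img001-00001.png", "label1"),
    ("Img/BadImag/Bmp/Sample001/img001-00002.png", "label2"),
    ("Img/BadImag/Bmp/Sample001/img001-00003.png", "label3"),
]

with open("manifest_file", "w", encoding="utf-8") as f:
    for key, label in entries:
        f.write(f"{key}\t{label}\n")
```

For a manifest file that contains only object names, omit the label field and write only the key on each line.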
Build a dataset by using a manifest file.
The following sample code provides an example on how to use the from_manifest_file method of OssIterableDataset to build a dataset by specifying the manifest file:
from osstorchconnector import OssIterableDataset
from osstorchconnector import imagenet_manifest_parser

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_BASE_URI = "oss://ai-testset/EnglistImg/"
MANIFEST_FILE_URI = "oss://ai-testset/EnglistImg/manifest_file"
MANIFEST_FILE_LOCAL = "/path/to/manifest_file.txt"

# Use the from_manifest_file method of OssIterableDataset to build a dataset from a local file.
# The manifest_file_path parameter specifies the local path of the manifest file.
# The manifest_parser parameter specifies the method used for parsing the manifest file. In the example, the built-in parsing method imagenet_manifest_parser is used.
# The oss_base_uri parameter specifies OSS_BASE_URI, which is combined with the URI parsed from the manifest file to form the full OSS_URI.
iterable_dataset = OssIterableDataset.from_manifest_file(manifest_file_path=MANIFEST_FILE_LOCAL, manifest_parser=imagenet_manifest_parser, oss_base_uri=OSS_BASE_URI, endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH)
for item in iterable_dataset:
    print(item.key)
    print(item.size)
    print(item.label)
    content = item.read()
    print(len(content))

# Use the from_manifest_file method of OssIterableDataset to build a dataset by specifying the manifest file in the OSS bucket.
iterable_dataset = OssIterableDataset.from_manifest_file(manifest_file_path=MANIFEST_FILE_URI, manifest_parser=imagenet_manifest_parser, oss_base_uri=OSS_BASE_URI, endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH)
Data type in OSS Connector for AI/ML
Objects in a dataset are returned as a data type that supports I/O methods. For more information, see Data type in OSS Connector for AI/ML.
Parameter description
If you use OssMapDataset or OssIterableDataset to build a dataset, you must configure the parameters. The following table describes the parameters.
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| endpoint | string | Yes | The endpoint that is used to access OSS. This is a common parameter. For more information, see Regions and endpoints. |
| transform | object | No | The transform function that is used to customize the response of DataObject. This is a common parameter. You can use a custom method based on your business requirements. For more information, see transform. |
| cred_path | string | Yes | The path of the authentication file. Default value: /root/.alibabacloud/credentials. |
| config_path | string | Yes | The path of the OSS Connector for AI/ML configuration file. Default value: /etc/oss-connector/config.json. |
| oss_uri | string | Yes | The OSS resource path that is used to build a dataset by using the OSS_URI prefix. This is a from_prefix parameter. Only OSS_URIs that start with oss:// are supported. |
| object_uris | string | Yes | The list of OSS resource paths that are used to build a dataset. This is a from_objects parameter. Only OSS_URIs that start with oss:// are supported. |
| manifest_file_path | string | Yes | The path of the manifest file. A local file path or an OSS_URI that starts with oss:// is supported. This is a from_manifest_file parameter. |
| manifest_parser | Callable Object | Yes | The method that parses the manifest file. The method receives an open manifest file as input and returns an iterator in which each element is a (key, label) tuple. This is a from_manifest_file parameter. |
| oss_base_uri | string | Yes | The OSS base URI, which is concatenated with the possibly incomplete OSS_URI in the manifest file to form a complete OSS_URI. This is a from_manifest_file parameter. |
Built-in methods
transform
When a dataset is built, the dataset returns an iterator whose elements are passed through the transform(DataObject) function. DataObject is the data type of OSS Connector for AI/ML.
The transform function allows you to specify a custom method. If you do not specify a method when you build a dataset, the default method is used.
Default method
The following sample code provides an example of how to use the default method. You do not need to specify this method when you build the dataset.
# The default transform function.
def identity(obj: DataObject) -> DataObject:
    if obj is not None:
        return obj.copy()
    else:
        return None
Custom method
The following sample code provides an example of how to use a custom method:
import sys
import io
import torchvision.transforms as transforms
from PIL import Image
from osstorchconnector import OssIterableDataset
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"
# Specify the transform operations on image objects.
trans = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
# Use a transform method to process the input object.
def transform(object):
    try:
        img = Image.open(io.BytesIO(object.read())).convert('RGB')
        val = trans(img)
    except Exception as e:
        raise e
    return val, object.label
# Specify the transform=transform parameter when you build the dataset.
iterable_dataset = OssIterableDataset.from_prefix(OSS_URI, endpoint=ENDPOINT, transform=transform, cred_path=CRED_PATH, config_path=CONFIG_PATH)
manifest_parser
To use the built-in manifest_parser method, import it when you build a dataset, as shown in the following sample code:
from osstorchconnector import imagenet_manifest_parser
Example:
import io
import logging
from typing import Iterable, Tuple

def imagenet_manifest_parser(reader: io.IOBase) -> Iterable[Tuple[str, str]]:
    lines = reader.read().decode("utf-8").strip().split("\n")
    for i, line in enumerate(lines):
        try:
            items = line.strip().split('\t')
            if len(items) >= 2:
                key = items[0]
                label = items[1]
                yield (key, label)
            elif len(items) == 1:
                key = items[0]
                yield (key, '')
            else:
                raise ValueError("format error")
        except ValueError as e:
            logging.error(f"Error: {e} for line {i}: {line}")
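The manifest_parser parameter also accepts a custom callable that follows the same contract: it receives an open manifest file as input and yields (key, label) tuples. As a sketch, a parser for a hypothetical comma-separated manifest format could look like this (csv_manifest_parser and the sample file names are illustrative, not part of the connector):

```python
import io
from typing import Iterable, Tuple

# Sketch: a custom parser for a hypothetical comma-separated manifest.
# Like the built-in parser, it takes an open file object and yields
# (key, label) tuples; lines without a label yield an empty label.
def csv_manifest_parser(reader: io.IOBase) -> Iterable[Tuple[str, str]]:
    for line in reader.read().decode("utf-8").strip().split("\n"):
        parts = line.strip().split(",")
        if len(parts) >= 2:
            yield (parts[0], parts[1])
        elif len(parts) == 1 and parts[0]:
            yield (parts[0], "")

# Example: parse an in-memory manifest.
sample = io.BytesIO(b"img001.png,cat\nimg002.png,dog\nimg003.png\n")
print(list(csv_manifest_parser(sample)))
# → [('img001.png', 'cat'), ('img002.png', 'dog'), ('img003.png', '')]
```

Pass such a callable as the manifest_parser argument of from_manifest_file in the same way as the built-in parser.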
Create a PyTorch data loader by using a dataset
The following sample code provides an example on how to create a PyTorch data loader by using an iterable dataset as a data source. The iterable dataset is built by using OssIterableDataset.
import torch
from osstorchconnector import OssIterableDataset
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"
def transform(object):
    # Read the object data.
    data = object.read()
    return object.key, object.label
# Use the from_prefix method of OssIterableDataset to build a dataset.
iterable_dataset = OssIterableDataset.from_prefix(OSS_URI, endpoint=ENDPOINT, transform=transform, cred_path=CRED_PATH, config_path=CONFIG_PATH)
# Create a PyTorch data loader based on iterable_dataset.
loader = torch.utils.data.DataLoader(iterable_dataset, batch_size=256, num_workers=32, prefetch_factor=2)
# Use data in the training loop.
# for batch in loader:
#     Perform the training operations.
#     ...
References
If you perform data training jobs in a containerized environment, OSS Connector for AI/ML is also suitable for the containerized environment. For more information, see Build a Docker image that contains an OSS Connector for AI/ML environment.