Use OSS Connector for AI/ML to efficiently complete data training tasks

This topic describes how to use OSS Connector for AI/ML to efficiently load training data from OSS and to save and load model checkpoints during training.

Deployment environment

Operating system: 64-bit x86 Linux
glibc: 2.17 or later
Python: 3.8 to 3.12
PyTorch: 2.0 or later
To use the OSS checkpoint feature, the Linux kernel must support userfaultfd.
Note
This example uses Ubuntu. To check whether the Linux kernel supports userfaultfd, run the sudo grep CONFIG_USERFAULTFD /boot/config-$(uname -r) command. If CONFIG_USERFAULTFD=y is returned, the kernel supports userfaultfd. If CONFIG_USERFAULTFD=n or "# CONFIG_USERFAULTFD is not set" is returned, the kernel does not support userfaultfd and you cannot use the OSS checkpoint feature.
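The same check can be run from Python, for example at the start of a training script. The following is a minimal sketch and not part of OSS Connector for AI/ML: the helper names config_enables_userfaultfd and kernel_supports_userfaultfd are hypothetical, and the sketch assumes that the kernel build configuration is readable at /boot/config-<release>, as it is on Ubuntu.

```python
import platform
from pathlib import Path


def config_enables_userfaultfd(config_text: str) -> bool:
    """Return True if a kernel build config contains CONFIG_USERFAULTFD=y."""
    for line in config_text.splitlines():
        if line.strip() == "CONFIG_USERFAULTFD=y":
            return True
    return False


def kernel_supports_userfaultfd() -> bool:
    """Check the running kernel's config file (path layout assumed, as on Ubuntu)."""
    # platform.release() returns the same string as `uname -r`.
    config_path = Path("/boot") / f"config-{platform.release()}"
    try:
        return config_enables_userfaultfd(config_path.read_text())
    except OSError:
        # File missing or unreadable: support cannot be confirmed from this file.
        return False


if __name__ == "__main__":
    print("userfaultfd supported:", kernel_supports_userfaultfd())
```

A result of False only means that support could not be confirmed from the config file. On distributions that store the kernel configuration elsewhere, adjust the path.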

Installation

The following example describes how to install OSS Connector for AI/ML for Python 3.12.

Run the pip3.12 install osstorchconnector command on a Linux host or in a container that is created from a Linux-based image.
```
pip3.12 install osstorchconnector
```
Run the pip3.12 show osstorchconnector command to check whether OSS Connector for AI/ML is installed.
```
pip3.12 show osstorchconnector
```
If the version information of osstorchconnector is returned, OSS Connector for AI/ML is installed.
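The installation can also be verified from inside the Python interpreter that runs the training job, which helps when several Python versions are installed side by side. The following sketch uses only the standard library. The helper name installed_version is hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("osstorchconnector")
    if version is None:
        print("osstorchconnector is not installed for this interpreter")
    else:
        print(f"osstorchconnector {version} is installed")
```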

Configuration

Create a configuration file for access credentials.

mkdir -p /root/.alibabacloud && touch /root/.alibabacloud/credentials

Add the access credentials to the configuration file and save the configuration file.
Replace the AccessKeyId and AccessKeySecret values in the example with the AccessKey ID and AccessKey secret of a RAM user. For more information about how to create an AccessKey ID and AccessKey secret, see Create an AccessKey pair. For more information about the configuration items and about how to use temporary access credentials, see Configure access credentials.
```
{
  "AccessKeyId": "LTAI************************",
  "AccessKeySecret": "At32************************"
}
```
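If you prefer to generate the credentials file from a script, the following sketch writes the same two fields shown above and restricts the file to its owner. It is an illustration only: the helper name write_credentials is hypothetical, and the owner-only permission is a general precaution, not a documented requirement of OSS Connector for AI/ML.

```python
import json
import os
from pathlib import Path


def write_credentials(path: str, access_key_id: str, access_key_secret: str) -> None:
    """Write an AccessKey pair as JSON, readable and writable by the owner only."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    payload = {"AccessKeyId": access_key_id, "AccessKeySecret": access_key_secret}
    # Create (or truncate) the file with mode 0600.
    fd = os.open(target, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f, indent=2)
    # os.open does not change the mode of a file that already exists, so set it explicitly.
    os.chmod(target, 0o600)
```

For example, write_credentials("/root/.alibabacloud/credentials", key_id, key_secret) produces the file created in the previous step, where key_id and key_secret are values that you read from a secure source instead of hard-coding them.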

Create a configuration file for OSS Connector.

mkdir -p /etc/oss-connector/ && touch /etc/oss-connector/config.json

Add the OSS Connector configurations to the configuration file and save the file. For more information about the configuration items, see Configure OSS Connector.

In most cases, you can use the following default configurations.

{
    "logLevel": 1,
    "logPath": "/var/log/oss-connector/connector.log",
    "auditPath": "/var/log/oss-connector/audit.log",
    "datasetConfig": {
        "prefetchConcurrency": 24,
        "prefetchWorker": 2
    },
    "checkpointConfig": {
        "prefetchConcurrency": 24,
        "prefetchWorker": 4,
        "uploadConcurrency": 64
    }
}
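Before starting a long training job, it can be useful to confirm that the configuration file parses as JSON and has the same shape as the default configuration above. The following sketch checks only that shape (string paths and positive integer values in the two sections). These rules are assumptions derived from the default configuration, not the validation that OSS Connector itself performs, and the helper name check_connector_config is hypothetical.

```python
import json

# The default configuration shown above, used here as sample input.
DEFAULT_CONFIG = """
{
    "logLevel": 1,
    "logPath": "/var/log/oss-connector/connector.log",
    "auditPath": "/var/log/oss-connector/audit.log",
    "datasetConfig": {"prefetchConcurrency": 24, "prefetchWorker": 2},
    "checkpointConfig": {"prefetchConcurrency": 24, "prefetchWorker": 4, "uploadConcurrency": 64}
}
"""


def check_connector_config(config: dict) -> list:
    """Return a list of problems found; an empty list means the shape looks sane."""
    problems = []
    for key in ("logPath", "auditPath"):
        value = config.get(key)
        if not isinstance(value, str) or not value:
            problems.append(f"{key} should be a non-empty string")
    for section in ("datasetConfig", "checkpointConfig"):
        values = config.get(section)
        if not isinstance(values, dict):
            problems.append(f"{section} should be an object")
            continue
        for name, value in values.items():
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
                problems.append(f"{section}.{name} should be a positive integer")
    return problems


if __name__ == "__main__":
    print(check_connector_config(json.loads(DEFAULT_CONFIG)))
```

To check the real file, pass the result of json.load on /etc/oss-connector/config.json instead of the sample string.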

Example

The following example shows how to train a handwritten digit recognition model by using PyTorch. The MNIST dataset is loaded from OSS by using OssMapDataset, and checkpoints are written to and read from OSS by using OssCheckpoint.

import io
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as transforms
from PIL import Image
from torch.utils.data import DataLoader
from osstorchconnector import OssMapDataset
from osstorchconnector import OssCheckpoint

# Specify the hyperparameters and paths.
EPOCHS = 1
BATCH_SIZE = 64
LEARNING_RATE = 0.001
CHECKPOINT_READ_URI = "oss://your_bucketname/epoch.ckpt"  # The OSS URI from which the checkpoint is read.
CHECKPOINT_WRITE_URI = "oss://your_bucketname/epoch.ckpt" # The OSS URI to which the checkpoint is written.
ENDPOINT = "oss-cn-hangzhou-internal.aliyuncs.com"        # The internal endpoint of the region in which the bucket is located. To use this endpoint, the Elastic Compute Service (ECS) instance and the OSS bucket must be in the same region.
CONFIG_PATH = "/etc/oss-connector/config.json"            # The path of the OSS Connector configuration file.
CRED_PATH = "/root/.alibabacloud/credentials"             # The path of the access credential configuration file.
OSS_URI = "oss://your_bucketname/mnist/"                  # The OSS prefix under which the training images are stored.

# Create an OssCheckpoint object for saving and reading checkpoints during training.
checkpoint = OssCheckpoint(endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH)

# Define a simple convolutional neural network (CNN).
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
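        # Compute the input size of fc1. Each 3x3 convolution (stride 1, padding 1)
        # keeps the spatial size, and each 2x2 max pooling layer halves it.
        input_size = 224
        after_conv1 = (input_size - 3 + 2*1) // 1 + 1
        after_pool1 = after_conv1 // 2
        after_conv2 = (after_pool1 - 3 + 2*1) // 1 + 1
        after_pool2 = after_conv2 // 2
        flattened_size = 64 * after_pool2 * after_pool2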
        self.fc1 = nn.Linear(flattened_size, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = nn.ReLU()(self.conv1(x)) 
        x = nn.MaxPool2d(2)(x)
        x = nn.ReLU()(self.conv2(x))
        x = nn.MaxPool2d(2)(x)
        x = x.view(x.size(0), -1)
        x = nn.ReLU()(self.fc1(x))
        x = self.fc2(x)
        return x

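# Preprocess the data: resize, crop to 224 x 224, convert to a tensor, and normalize.
trans = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5], std=[0.5])
])
def transform(object):
    # Decode the object content as a grayscale image and apply the preprocessing.
    img = Image.open(io.BytesIO(object.read())).convert('L')
    val = trans(img)
    # Placeholder label: every sample is labeled 0 in this example. Replace this with the real label of each sample.
    label = 0
    return val, torch.tensor(label)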

# Build an OssMapDataset from the objects under the OSS_URI prefix and wrap it in a DataLoader.
train_dataset = OssMapDataset.from_prefix(OSS_URI, endpoint=ENDPOINT, transform=transform, cred_path=CRED_PATH, config_path=CONFIG_PATH)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, num_workers=32, prefetch_factor=2, shuffle=True)

# Initialize the model, loss function, and optimizer.
model = SimpleCNN()
criterion = nn.CrossEntropyLoss()  
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)  

# Train the model.
for epoch in range(EPOCHS):
    for i, (images, labels) in enumerate(train_loader):
        optimizer.zero_grad()  
        outputs = model(images)  
        loss = criterion(outputs, labels)  
        loss.backward()  
        optimizer.step()  
        if (i + 1) % 100 == 0:
            print(f'Epoch [{epoch + 1}/{EPOCHS}], Step [{i + 1}/{len(train_loader)}], Loss: {loss.item():.4f}')
    # Save a checkpoint to OSS at the end of each epoch through the OssCheckpoint writer.
    with checkpoint.writer(CHECKPOINT_WRITE_URI) as writer:
        torch.save(model.state_dict(), writer)
        print("-------------------------")
        # Print the names of the saved parameters.
        print(model.state_dict().keys())

# Read the checkpoint from OSS through the OssCheckpoint reader.
with checkpoint.reader(CHECKPOINT_READ_URI) as reader:
    state_dict = torch.load(reader)

# Load the checkpoint into a new model instance and switch to evaluation mode.
model = SimpleCNN()
model.load_state_dict(state_dict)
model.eval()
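The input size of the first fully connected layer in SimpleCNN follows from the standard output-size formula for convolution and pooling layers: output = (input - kernel + 2 * padding) // stride + 1. The following sketch recomputes the size in plain Python so that you can check it without building the model. The helper names conv_out, pool_out, and flattened_size are hypothetical and are not part of PyTorch or OSS Connector for AI/ML.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size - kernel + 2 * padding) // stride + 1


def pool_out(size: int, kernel: int = 2) -> int:
    """Spatial output size of max pooling whose stride equals its kernel size."""
    return size // kernel


def flattened_size(input_size: int = 224, channels: int = 64) -> int:
    """Number of features that reach fc1 for a square single-channel input."""
    size = conv_out(input_size, kernel=3, stride=1, padding=1)  # conv1 keeps the size
    size = pool_out(size)                                       # first pool halves it
    size = conv_out(size, kernel=3, stride=1, padding=1)        # conv2 keeps the size
    size = pool_out(size)                                       # second pool halves it
    return channels * size * size


if __name__ == "__main__":
    print(flattened_size())  # 64 * 56 * 56 = 200704 for 224 x 224 inputs
```

If you change the crop size in the preprocessing step, recompute this value, because fc1 must match the flattened feature size exactly.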