This topic describes how to use OSS Connector for AI/ML to efficiently create and train data models.
Deployment environment
Operating system: 64-bit x86 Linux
glibc: 2.17 or later
Python: 3.8 to 3.12
PyTorch: 2.0 or later
To use the OSS checkpoint feature, the Linux kernel must support userfaultfd.
Note: In this example, Ubuntu is used. Run the following command to check whether the Linux kernel supports userfaultfd:
sudo grep CONFIG_USERFAULTFD /boot/config-$(uname -r)
If CONFIG_USERFAULTFD=y is returned, the Linux kernel supports userfaultfd. If CONFIG_USERFAULTFD=n is returned, the Linux kernel does not support userfaultfd, and you cannot use the OSS checkpoint feature.
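The same check can be scripted before a training job starts. The following is a minimal sketch and not part of OSS Connector for AI/ML: the helper names `userfaultfd_enabled` and `check_running_kernel` are hypothetical, and the `/boot/config-<release>` location is the Ubuntu layout used in this example, so other distributions may differ.

```python
import platform
from pathlib import Path


def userfaultfd_enabled(config_text: str) -> bool:
    """Return True if the kernel config text contains CONFIG_USERFAULTFD=y."""
    return any(
        line.strip() == "CONFIG_USERFAULTFD=y"
        for line in config_text.splitlines()
    )


def check_running_kernel():
    """Check the config of the running kernel; return None if it cannot be read."""
    # platform.release() returns the same string as `uname -r`.
    config_file = Path(f"/boot/config-{platform.release()}")
    try:
        return userfaultfd_enabled(config_file.read_text())
    except OSError:
        # File missing or not readable without elevated permissions.
        return None


print(check_running_kernel())
```

A result of `None` means that the script could not read the kernel configuration file. In that case, run the command with sudo as described in the note.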
Installation
The following example describes how to install OSS Connector for AI/ML for Python 3.12.
Run the following command to install OSS Connector for AI/ML on a Linux host or in a container that is created from a Linux-based image:
pip3.12 install osstorchconnector
Run the following command to check whether OSS Connector for AI/ML is installed:
pip3.12 show osstorchconnector
If the version information of osstorchconnector is returned, OSS Connector for AI/ML is installed.
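The installation can also be verified from Python, which is convenient inside a training script. The following is a minimal sketch that uses only the standard library. The helper name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version("osstorchconnector"))
```

Run the script with the same interpreter that you used for the installation, for example python3.12. Otherwise, the package may be reported as missing.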
Configuration
Create a configuration file for access credentials.
mkdir -p /root/.alibabacloud && touch /root/.alibabacloud/credentials
Add the access credentials to the configuration file and save the configuration file.
Replace <Access-key-id> and <Access-key-secret> in the example with the AccessKey ID and AccessKey secret of a RAM user. For more information about how to create an AccessKey ID and AccessKey secret, see Create an AccessKey pair. For more information about the configuration items and configuration by using temporary access credentials, see Configure access credentials.
{
  "AccessKeyId": "LTAI************************",
  "AccessKeySecret": "At32************************"
}
Create a configuration file for OSS Connector.
mkdir -p /etc/oss-connector/ && touch /etc/oss-connector/config.json
Add the configurations of the OSS connector to the configuration file and save the configuration file. For more information about the configuration items, see Configure OSS Connector.
In most cases, you can use the following default configurations.
{
  "logLevel": 1,
  "logPath": "/var/log/oss-connector/connector.log",
  "auditPath": "/var/log/oss-connector/audit.log",
  "datasetConfig": {
    "prefetchConcurrency": 24,
    "prefetchWorker": 2
  },
  "checkpointConfig": {
    "prefetchConcurrency": 24,
    "prefetchWorker": 4,
    "uploadConcurrency": 64
  }
}
Example
The following example shows how to create a handwritten digit recognition model by using PyTorch. The MNIST dataset used for the model is created by using OssMapDataset. Checkpoints are stored and accessed by using OssCheckpoint.
import io
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as transforms
from PIL import Image
from torch.utils.data import DataLoader
from osstorchconnector import OssMapDataset
from osstorchconnector import OssCheckpoint
# Specify the following hyperparameters:
EPOCHS = 1
BATCH_SIZE = 64
LEARNING_RATE = 0.001
CHECKPOINT_READ_URI = "oss://you_bucketname/epoch.ckpt" # Specify the OSS URI from which the checkpoint is read.
CHECKPOINT_WRITE_URI = "oss://you_bucketname/epoch.ckpt" # Specify the OSS URI to which the checkpoint is written.
ENDPOINT = "oss-cn-hangzhou-internal.aliyuncs.com" # Specify the endpoint of the region that is used to access OSS. To use this endpoint, the Elastic Compute Service (ECS) instance and the OSS bucket must be in the same region.
CONFIG_PATH = "/etc/oss-connector/config.json" # Specify the path of the OSS connector configuration file.
CRED_PATH = "/root/.alibabacloud/credentials" # Specify the path of the configuration file that is used to configure access credentials.
OSS_URI = "oss://you_bucketname/mninst/" # Specify the path of resources in the OSS bucket.
# Create an object by using OssCheckpoint for saving and reading checkpoints during the training process.
checkpoint = OssCheckpoint(endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH)
# Specify a simple convolutional neural network (CNN).
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
        # Compute the flattened feature size for a 224 x 224 input.
        # Convolution output: (size - kernel + 2 * padding) // stride + 1
        input_size = 224
        after_conv1 = (input_size - 3 + 2*1) // 1 + 1
        after_pool1 = after_conv1 // 2
        after_conv2 = (after_pool1 - 3 + 2*1) // 1 + 1
        after_pool2 = after_conv2 // 2
        flattened_size = 64 * after_pool2 * after_pool2
        self.fc1 = nn.Linear(flattened_size, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = nn.ReLU()(self.conv1(x))
        x = nn.MaxPool2d(2)(x)
        x = nn.ReLU()(self.conv2(x))
        x = nn.MaxPool2d(2)(x)
        # Flatten all dimensions except the batch dimension.
        x = x.view(x.size(0), -1)
        x = nn.ReLU()(self.fc1(x))
        x = self.fc2(x)
        return x
# Preprocess the data.
trans = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5], std=[0.5])
])
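def transform(object):
    try:
        # Decode the object bytes into a grayscale image and apply the preprocessing.
        img = Image.open(io.BytesIO(object.read())).convert('L')
        val = trans(img)
    except Exception as e:
        raise e
    # Placeholder: this example assigns the label 0 to every sample.
    label = 0
    return val, torch.tensor(label)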
# Load the OssMapDataset dataset.
train_dataset = OssMapDataset.from_prefix(OSS_URI, endpoint=ENDPOINT, transform=transform, cred_path=CRED_PATH, config_path=CONFIG_PATH)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, num_workers=32, prefetch_factor=2, shuffle=True)
# Initialize the model, loss function, and optimizer.
model = SimpleCNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
# Train the model.
for epoch in range(EPOCHS):
    for i, (images, labels) in enumerate(train_loader):
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        if (i + 1) % 100 == 0:
            print(f'Epoch [{epoch + 1}/{EPOCHS}], Step [{i + 1}/{len(train_loader)}], Loss: {loss.item():.4f}')
    # Store the checkpoint using the object created by using OssCheckpoint.
    with checkpoint.writer(CHECKPOINT_WRITE_URI) as writer:
        torch.save(model.state_dict(), writer)
    print("-------------------------")
    print(model.state_dict)
# Read the checkpoint using the object created by using OssCheckpoint.
with checkpoint.reader(CHECKPOINT_READ_URI) as reader:
    state_dict = torch.load(reader)
# Load the model.
model = SimpleCNN()
model.load_state_dict(state_dict)
model.eval()
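The input size of the first fully connected layer depends on how each convolution and pooling layer changes the image size. The following standalone sketch works through that arithmetic for the 224 x 224 input used in this example, without requiring PyTorch. The helper names `conv_out` and `pool_out` are hypothetical. The formulas are the standard output-size formulas for a convolution and for a max pooling layer whose stride equals its kernel size and that uses no padding.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2):
    """Output size of max pooling with stride equal to the kernel size."""
    return size // kernel


size = 224
# conv1 (kernel 3, stride 1, padding 1) keeps the size; pooling halves it.
size = pool_out(conv_out(size, kernel=3, stride=1, padding=1))  # 224 -> 224 -> 112
# conv2 uses the same settings, followed by the second pooling layer.
size = pool_out(conv_out(size, kernel=3, stride=1, padding=1))  # 112 -> 112 -> 56

# conv2 produces 64 channels, so the flattened feature vector has 64 * 56 * 56 values.
flattened_size = 64 * size * size
print(flattened_size)  # 200704
```

If you change the crop size in the preprocessing step or the layer settings of the network, recompute this value. Otherwise, the first fully connected layer raises a shape mismatch error at the first forward pass.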