AIACC-AGSpeed (AGSpeed) is designed to optimize the computing performance of PyTorch models on Alibaba Cloud GPU-accelerated compute-optimized instances. Compared with AIACC, AGSpeed focuses on computing optimization that is transparent to your training code. This article describes how to install and use AGSpeed.
An Alibaba Cloud GPU-accelerated instance that meets the following requirements is created:
The operating system is Alibaba Cloud Linux, CentOS 7.x or later, or Ubuntu 16.04 or later.
An NVIDIA driver and CUDA 10.0 or later are installed.
AGSpeed supports only specific combinations of Python, PyTorch, and CUDA versions. The following table describes the supported combinations and the corresponding wheel packages.
| Python | PyTorch | CUDA | Download link |
| ------ | ------- | ---- | ------------- |
| 3.7 | 1.12.0 | 11.3 | wheel package (torch1.12.0_cu113-cp37) |
| 3.7 | 1.12.0 | 11.6 | wheel package (torch1.12.0_cu116-cp37) |
| 3.7 | 1.12.1 | 11.3 | wheel package (torch1.12.1_cu113-cp37) |
| 3.7 | 1.12.1 | 11.6 | wheel package (torch1.12.1_cu116-cp37) |
| 3.8 | 1.12.0 | 11.3 | wheel package (torch1.12.0_cu113-cp38) |
| 3.8 | 1.12.0 | 11.6 | wheel package (torch1.12.0_cu116-cp38) |
| 3.8 | 1.12.1 | 11.3 | wheel package (torch1.12.1_cu113-cp38) |
| 3.8 | 1.12.1 | 11.6 | wheel package (torch1.12.1_cu116-cp38) |
| 3.9 | 1.12.0 | 11.3 | wheel package (torch1.12.0_cu113-cp39) |
| 3.9 | 1.12.0 | 11.6 | wheel package (torch1.12.0_cu116-cp39) |
| 3.9 | 1.12.1 | 11.3 | wheel package (torch1.12.1_cu113-cp39) |
| 3.9 | 1.12.1 | 11.6 | wheel package (torch1.12.1_cu116-cp39) |
1. Download the wheel package.
Select the wheel package that matches the versions of Python, PyTorch, and CUDA that are installed on your machine. For more information, see the preceding table.
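To identify the correct wheel package, you can print the versions that are installed in your environment. The following is a minimal sketch; map the printed values to a wheel name by using the table above (for example, Python 3.8 with PyTorch 1.12.1 and CUDA 11.6 maps to torch1.12.1_cu116-cp38).
import sys
import torch

# Python version, for example (3, 8) -> choose a cp38 wheel.
print(sys.version_info.major, sys.version_info.minor)
# PyTorch version, for example 1.12.1.
print(torch.__version__)
# CUDA version that this PyTorch build uses, for example 11.6.
print(torch.version.cuda)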
2. Run the following command to install AGSpeed.
Run the pip install command to install AGSpeed in your environment.
pip install ${WHEEL_NAME} # Replace ${WHEEL_NAME} with the name of the wheel package that you downloaded.
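To confirm that AGSpeed is installed in the active environment, you can import it and print its installation path. This is the same check that is used in the LD_PRELOAD step later in this article.
import agspeed

# Prints the installation path. An ImportError here usually means that the
# wheel was installed into a different Python environment.
print(agspeed.__path__[0])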
We recommend that you wrap the model with agspeed.optimize() after you complete the preparations and are ready to execute the training loop, for example, after you place the model on the device and complete the DistributedDataParallel (DDP) wrapping.
1. Add the following code to import the AGSpeed module and optimize your model.
import agspeed # Import AGSpeed to register the IR optimization pass and the optimized NvFuser in the PyTorch backend.
model = agspeed.optimize(model) # Optimize the model. This API automatically captures the computation graph and optimizes it with the AGSpeed Backend Autotuner.
2. If your model uses PyTorch automatic mixed precision (AMP), add the cache_enabled=False parameter to the autocast() context. The following section provides sample code.
Note
This step applies only to models that use AMP. For models that use FP32, skip this step.
After TorchDynamo captures the computation graph, AGSpeed uses torch.jit.trace to convert the graph to TorchScript IR for backend optimization. Calling torch.jit.trace directly inside an autocast() context causes a conflict. Therefore, you must disable the cache by adding cache_enabled=False to the autocast() context. For more information, see the related PyTorch commit.
from torch.cuda.amp import autocast
# ...
# Add cache_enabled=False to the autocast context.
with autocast(cache_enabled=False):
    loss = model(inputs)
# Run the backward pass and the optimizer step outside the autocast context.
scaler.scale(loss).backward()
scaler.step(optimizer)
# ...
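The preceding snippet omits the scaler setup. The following is a more complete sketch of an AMP train loop with AGSpeed; it assumes that model, optimizer, and dataloader are already defined, and that model(data) returns the loss directly, as in the consolidated example later in this article.
import agspeed
from torch.cuda.amp import GradScaler, autocast

model = agspeed.optimize(model)
scaler = GradScaler()

for data, target in dataloader:
    # cache_enabled=False avoids the conflict between the autocast cache
    # and the torch.jit.trace call that AGSpeed performs.
    with autocast(cache_enabled=False):
        loss = model(data)
    # Run the backward pass and the optimizer step outside the autocast context.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    optimizer.zero_grad()
    scaler.update()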
3. If you use PyTorch 1.12.x and the model to be trained contains the SiLU activation function, use the LD_PRELOAD environment variable to import the symbolic differential of the SiLU function.
Note
This step applies only when you use PyTorch 1.12.x and the model that you want to train contains the SiLU function. Skip this step in other scenarios.
In PyTorch 1.12.x, the TorchScript backend does not contain the symbolic differential of aten::silu. As a result, the aten::silu operation is not included in the differentiable computation graph and cannot be fused by the NvFuser backend. PyTorch does not allow you to dynamically add symbolic differentials, so AGSpeed ships the symbolic differential of aten::silu in a separate dynamic link library and registers it with the TorchScript backend when the library is preloaded. Before you start the training task, use the LD_PRELOAD environment variable to load this library.
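Before you run the following sub-steps, you can check whether this step applies to your model by searching it for SiLU modules. The following is a minimal sketch; model is assumed to be your torch.nn.Module, and the check does not detect direct calls to torch.nn.functional.silu inside forward().
import torch.nn as nn

# True if any submodule of the model is an nn.SiLU activation.
has_silu = any(isinstance(m, nn.SiLU) for m in model.modules())
print(has_silu)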
a) Run the following command to view the installation path of AGSpeed.
python -c "import agspeed; print(agspeed.__path__[0])"
The output shows the installation path of AGSpeed.
b) Run the following command to check whether the libsymbolic_expand.so file is included in the preceding path.
ls -l ${your_agspeed_install_path} # Replace ${your_agspeed_install_path} with the AGSpeed installation path on your server.
If the output lists the libsymbolic_expand.so file, the file is included in the path.
c) Run the following command to set the LD_PRELOAD environment variable.
# Replace ${your_agspeed_install_path} with the AGSpeed installation path on your server.
export LD_PRELOAD=${your_agspeed_install_path}/libsymbolic_expand.so
# Start Training...
During training, a log entry indicates that the symbolic differential of aten::silu has been registered with the TorchScript backend.
The following section provides an example of how to add AGSpeed to your training code. In the example, the plus sign (+) indicates an added line.
+ import agspeed
# Define dataloader
dataloader = ...
# Define model object
model = ResNet()
# Set the model device
model.to(device)
# Define the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# Set DDP
if distributed:
    model = DDP(model)
+ model = agspeed.optimize(model)
############################## The following sections provide sample train loops for models that use FP32 and AMP precision ##############################
############### FP32 ###############
# If the model to be trained uses FP32 precision, you do not need to modify the train loop.
for data, target in dataloader:
    loss = model(data)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
############### FP32 ###############
############### AMP ###############
# If the model to be trained uses AMP precision, add cache_enabled=False in the autocast context.
for data, target in dataloader:
+   with autocast(cache_enabled=False):
        loss = model(data)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    optimizer.zero_grad()
    scaler.update()
############### AMP ###############
############################## Add the symbolic differential of the SiLU function by using LD_PRELOAD ##############################
# The displayed path is the AGSpeed installation path on your server
python -c "import agspeed; print(agspeed.__path__[0])"
# Replace ${your_agspeed_install_path} with the installation path of AGSpeed on your server
+ export LD_PRELOAD=${your_agspeed_install_path}/libsymbolic_expand.so
# Run the training command
python train.py
The following log examples help you check whether AGSpeed is enabled.
When you import AGSpeed, the IR optimization pass and the optimized NvFuser are automatically registered. A log entry that confirms this registration indicates that AGSpeed is successfully imported, and you can proceed with the next step.
AGSpeed performs autotuning in the first few steps of the training process to automatically select the optimal optimization policy for your training task. A log entry that shows the autotuning progress indicates that AGSpeed is enabled.