Machine Learning Platform for AI (PAI)-Blade provides an SDK for C++ that you can
use to deploy optimized models for inference. This topic describes how to use PAI-Blade
SDK to deploy a PyTorch model.
Prerequisites
- A PyTorch model is optimized by using PAI-Blade. For more information, see Optimize a PyTorch model.
- An SDK is installed, and an authentication token is obtained. In this example, the SDK for the Pre-CXX11 application binary interface (ABI) and the .deb package of V3.7.0 are used.
Note A model that is optimized by using PAI-Blade can be properly run only if the corresponding
SDK is installed.
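You can quickly check whether the SDK libraries are in place by verifying that the PAI-Blade shared libraries exist in the /usr/local/lib directory, which is the default installation path used later in this topic:
# Check that the PAI-Blade shared libraries are installed.
ls /usr/local/lib/libtorch_blade.so /usr/local/lib/libral_base_context.so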
Prepare the environment
In this example, the model is deployed on an Ubuntu 18.04 64-bit server. Perform the following operations to prepare the environment:
- Prepare the server.
Prepare an Elastic Compute Service (ECS) instance that is configured with the following
specifications:
- Instance type: ecs.gn6i-c4g1.xlarge (NVIDIA Tesla T4 GPU)
- Operating system: Ubuntu 18.04 64-bit
- CUDA version: 10.0
- GPU driver version: 440.64.00
- cuDNN version: 7.6.5
- Prepare a Python 3 virtual environment.
# Update pip.
python3 -m pip install --upgrade pip
# Install virtualenv and create a virtual environment in which you can install PyTorch.
pip3 install virtualenv==16.0
python3 -m virtualenv venv
# Activate virtualenv.
source venv/bin/activate
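Before you deploy the model, make sure that the GPU environment matches the preceding specifications and that PyTorch is installed in the virtual environment. The following commands are a minimal sketch: nvidia-smi and nvcc are standard NVIDIA tools, and the wheel path in the last command is a placeholder for the PyTorch package for CUDA 10.0 that is provided with the PAI-Blade SDK.
# Check the GPU driver version. The expected version is 440.64.00.
nvidia-smi
# Check the CUDA version. The expected version is 10.0.
nvcc --version
# Install PyTorch in the virtual environment. The wheel path is a placeholder.
# Use the PyTorch package for CUDA 10.0 that is provided with the PAI-Blade SDK.
pip3 install <path-to-pai-blade-pytorch-wheel>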
Deploy a model for inference
To use PAI-Blade SDK to load and deploy an optimized model for inference, you can
link the libraries in the SDK when you compile the inference code, without the need
to modify the original code logic.
- Prepare the model and test data.
In this example, an optimized sample model is used. Run the following commands to download the sample model and the test data. You can also use your own optimized model. For more information about how to optimize a model by using PAI-Blade, see Optimize a PyTorch model.
# Download the optimized sample model.
wget http://pai-blade.oss-cn-zhangjiakou.aliyuncs.com/demo/sdk/pytorch/optimized_resnet50.pt
# Download the test data.
wget http://pai-blade.oss-cn-zhangjiakou.aliyuncs.com/demo/sdk/pytorch/inputs.pth
- Download and view the inference code.
You can run a PyTorch model that is optimized by using PAI-Blade in the same way as
a regular PyTorch model. You do not need to write extra code or set extra parameters.
In this example, the following inference code is used:
#include <torch/script.h>
#include <torch/serialize.h>

#include <chrono>
#include <ctime>
#include <fstream>
#include <iostream>
#include <memory>
#include <vector>

int benchmark(torch::jit::script::Module &module,
              std::vector<torch::jit::IValue> &inputs) {
  // Warm up for 10 iterations.
  for (int k = 0; k < 10; ++k) {
    module.forward(inputs);
  }
  auto start = std::chrono::system_clock::now();
  // Run 20 iterations and measure the elapsed time.
  for (int k = 0; k < 20; ++k) {
    module.forward(inputs);
  }
  auto end = std::chrono::system_clock::now();
  std::chrono::duration<double> elapsed_seconds = end - start;
  std::time_t end_time = std::chrono::system_clock::to_time_t(end);
  std::cout << "finished computation at " << std::ctime(&end_time)
            << "\nelapsed time: " << elapsed_seconds.count() << "s"
            << "\navg latency: " << 1000.0 * elapsed_seconds.count() / 20 << "ms\n";
  return 0;
}

// Load the serialized test data and deserialize it into a tensor.
torch::Tensor load_data(const char* data_file) {
  std::ifstream file(data_file, std::ios::binary);
  std::vector<char> data((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());
  torch::IValue ivalue = torch::pickle_load(data);
  CHECK(ivalue.isTensor());
  return ivalue.toTensor();
}

int main(int argc, const char* argv[]) {
  if (argc != 3) {
    std::cerr << "usage: example-app <path-to-exported-script-module> <path-to-saved-test-data>\n";
    return -1;
  }
  torch::jit::script::Module module;
  try {
    // Deserialize the ScriptModule from a file by using torch::jit::load().
    module = torch::jit::load(argv[1]);
    auto image_tensor = load_data(argv[2]);
    std::vector<torch::IValue> inputs{image_tensor};
    // Benchmark the optimized model and then run one forward pass.
    benchmark(module, inputs);
    auto outputs = module.forward(inputs);
  } catch (const c10::Error& e) {
    std::cerr << "error loading the model" << std::endl << e.what();
    return -1;
  }
  std::cout << "ok\n";
  return 0;
}
Save the preceding sample code to a local file named torch_app.cc.
- Compile the inference code.
When you compile the code, link the LibTorch libraries and the libtorch_blade.so and libral_base_context.so files in the /usr/local/lib directory. Run the following command to compile the code:
TORCH_DIR=$(python3 -c "import torch; import os; print(os.path.dirname(torch.__file__))")
g++ torch_app.cc -std=c++14 \
-D_GLIBCXX_USE_CXX11_ABI=0 \
-I ${TORCH_DIR}/include \
-I ${TORCH_DIR}/include/torch/csrc/api/include \
-Wl,--no-as-needed \
-L /usr/local/lib \
-L ${TORCH_DIR}/lib \
-l torch -l torch_cuda -l torch_cpu -l c10 -l c10_cuda \
-l torch_blade -l ral_base_context \
-o torch_app
You can modify the following parameters based on your business requirements:
- torch_app.cc: the name of the file that contains the inference code.
- /usr/local/lib: the installation path of the SDK. In most cases, you do not need to modify this
parameter.
- torch_app: the name of the executable program that is generated after compilation.
On some operating system and compiler versions, the link succeeds only if the -Wl,--no-as-needed option is specified.
Notice
- Set the value of the _GLIBCXX_USE_CXX11_ABI macro based on the ABI of the LibTorch library. In this example, the SDK for the Pre-CXX11 ABI is used, so the macro is set to 0 in the preceding command.
- PyTorch for CUDA 10.0 provided by PAI-Blade is compiled by using GNU Compiler Collection
(GCC) 7.5. If you use the CXX11 ABI, make sure that the GCC version is 7.5. If you
use the Pre-CXX11 ABI, no limits are placed on the GCC version.
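Before you run the program, you can optionally verify the result of the compilation. The following checks are a sketch: the ldd command lists the shared libraries that the executable depends on, and torch.compiled_with_cxx11_abi() reports whether the installed PyTorch uses the CXX11 ABI.
# Confirm that the PAI-Blade libraries are linked into the executable.
ldd torch_app | grep -E 'torch_blade|ral_base_context'
# Check the ABI of the installed PyTorch. False indicates the Pre-CXX11 ABI,
# which corresponds to -D_GLIBCXX_USE_CXX11_ABI=0.
python3 -c "import torch; print(torch.compiled_with_cxx11_abi())"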
- Run the model for inference on a local device.
Use the executable program to load and run the optimized model. The following sample
code provides an example. In this example, the executable program torch_app and the
optimized sample model optimized_resnet50.pt are used.
export BLADE_REGION=<region> # The region, for example, cn-beijing or cn-shanghai.
export BLADE_TOKEN=<token>
export LD_LIBRARY_PATH=/usr/local/lib:${TORCH_DIR}/lib:${LD_LIBRARY_PATH}
./torch_app optimized_resnet50.pt inputs.pth
Modify the following parameters based on your business requirements:
- <region>: the region in which you use PAI-Blade. You can join the DingTalk group of PAI-Blade
users to obtain the regions in which PAI-Blade can be used.
- <token>: the authentication token that is required to use PAI-Blade. You can join the DingTalk
group of PAI-Blade users to obtain the authentication token.
- torch_app: the executable program that is generated after compilation.
- optimized_resnet50.pt: the PyTorch model that is optimized by using PAI-Blade. In this example, the optimized
sample model that is downloaded in Step 1 is used.
- inputs.pth: the test data. In this example, the test data that is downloaded in Step 1 is used.
If the system displays information similar to the following output, the model runs as expected.
finished computation at Wed Jan 27 20:03:38 2021
elapsed time: 0.513882s
avg latency: 25.6941ms
ok