Most PyTorch users use TensorRT plug-ins to build a post-processing network for a detection model so that they can export the model to TensorRT. Machine Learning Platform for AI (PAI)-Blade features good scalability. If you have developed your own TensorRT plug-ins, you can use PAI-Blade and TensorRT plug-ins for collaborative model optimization. This topic describes how to use PAI-Blade to optimize a detection model whose post-processing network is built by using TensorRT plug-ins.
Background information
TensorRT is a powerful tool for inference optimization on NVIDIA GPUs. PAI-Blade deeply integrates the optimization methods of TensorRT at the underlying layer. In addition, PAI-Blade integrates multiple optimization technologies, including graph optimization, optimization libraries such as TensorRT and oneDNN, AI compilation optimization, an optimization operator library, mixed precision, and EasyCompression.
RetinaNet is a detection network of the One-Stage Region-based Convolutional Neural Network (R-CNN) type. The basic structure of RetinaNet consists of a backbone, multiple subnetworks, and Non-Maximum Suppression (NMS). NMS is a post-processing algorithm. RetinaNet is implemented in many training frameworks. Detectron2 is a typical training framework that uses RetinaNet. You can call the scripting_with_instances method of Detectron2 to export a RetinaNet model and use PAI-Blade to optimize the model. For more information, see Use PAI-Blade to optimize a RetinaNet model that is in the Detectron2 framework.
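The following minimal sketch shows what this Detectron2 export path can look like. The config name and the typed fields passed to scripting_with_instances are assumptions based on common Detectron2 usage and are not the exact sample code of that topic.

```python
import torch
from detectron2 import model_zoo
from detectron2.export import scripting_with_instances
from detectron2.structures import Boxes

# Assumed config name; any RetinaNet config from the Detectron2 model zoo works the same way.
model = model_zoo.get("COCO-Detection/retinanet_R_50_FPN_3x.yaml", trained=True).eval()

# scripting_with_instances needs the typed fields carried by the model's Instances outputs.
fields = {
    "pred_boxes": Boxes,
    "scores": torch.Tensor,
    "pred_classes": torch.Tensor,
}
script_model = scripting_with_instances(model, fields)
```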
Most PyTorch users export models in the Open Neural Network Exchange (ONNX) format and then deploy the models by using TensorRT. However, both the ONNX exporter and TensorRT support only a limited set of ONNX opsets. As a result, the process of exporting an ONNX model and optimizing the model by using TensorRT lacks robustness in many cases. In particular, the post-processing network of a detection model usually cannot be directly exported to an ONNX model and optimized by using TensorRT. In addition, in real-world scenarios, the post-processing code of a detection model is often implemented in an inefficient way. Therefore, many users use TensorRT plug-ins to build a post-processing network for a detection model so that they can export the model to TensorRT.
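For reference, the conventional path contrasted above looks roughly like the following sketch. The torchvision RetinaNet is only a stand-in model for illustration and is not part of this topic's sample code.

```python
import torch
import torchvision

# Stand-in detection model for illustration; pretrained weights are not needed for export.
model = torchvision.models.detection.retinanet_resnet50_fpn(pretrained=False).eval()

# Export the model to ONNX. Torchvision detection models take a list of 3xHxW tensors.
dummy_input = [torch.randn(3, 800, 800)]
torch.onnx.export(model, (dummy_input,), "retinanet.onnx", opset_version=11)

# The ONNX file would then be parsed by TensorRT (for example with trtexec).
# The NMS-based post-processing is typically where this path breaks down,
# which is why custom TensorRT plug-ins are used for it instead.
```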
You can also use PAI-Blade and TorchScript custom C++ operators to optimize a model. This method is easier to use than the method of building a post-processing network by using TensorRT plug-ins. PAI-Blade features good scalability. If you have developed your own TensorRT plug-ins, you can use PAI-Blade and TensorRT plug-ins for collaborative model optimization.
Limits
- System environment: Python 3.6 or later, GCC 5.4 or later, NVIDIA Tesla T4, CUDA 10.2, cuDNN 8.0.5.39, and TensorRT 7.2.2.3 in Linux
- Framework: PyTorch 1.8.1 or later, and Detectron2 0.4.1 or later
- Inference optimization tool: PAI-Blade V3.16.0 or later, which supports TensorRT
Procedure
- Step 1: Create a PyTorch model by using TensorRT plug-ins
Use TensorRT plug-ins to build a post-processing network for the RetinaNet model.
- Step 2: Use PAI-Blade to optimize the model
Call the blade.optimize method to optimize the model, and save the optimized model. A brief sketch of this call follows the list.
- Step 3: Load and run the optimized model
If the optimized model passes the performance testing and meets your expectations, load the optimized model for inference.
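The following is a minimal sketch of what Step 2 can look like, assuming a TorchScript model is already prepared. The 'o1' level and the keyword arguments follow the PAI-Blade Python API, but the placeholder model and input shape are assumptions for illustration only.

```python
import torch
import blade  # PAI-Blade Python package

# Placeholder TorchScript model; in this topic it would be the RetinaNet model
# with the TensorRT plug-in post-processing built in Step 1.
script_model = torch.jit.script(torch.nn.Conv2d(3, 8, 3).eval().cuda())
example_inputs = (torch.randn(1, 3, 800, 800, device='cuda'),)

# Optimize the model. The returned report describes the applied optimizations.
optimized_model, opt_spec, report = blade.optimize(
    script_model,               # model to optimize
    'o1',                       # optimization level
    device_type='gpu',          # optimize for GPU inference
    test_data=[example_inputs], # sample inputs for measurement and tuning
)
print(report)

# Save the optimized model for later deployment.
torch.jit.save(optimized_model, 'optimized_model.pt')
```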
Step 1: Create a PyTorch model by using TensorRT plug-ins
PAI-Blade can collaborate with TensorRT plug-ins for model optimization. This step describes how to use TensorRT plug-ins to build a post-processing network for the RetinaNet model. For more information about how to develop and compile TensorRT plug-ins, see NVIDIA Deep Learning TensorRT Documentation. In this topic, the program logic for the post-processing network of the RetinaNet model comes from the open source community of NVIDIA. For more information, see retinanet-examples. This example uses the core code from that project to show how to develop and implement custom operators.
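Before the plug-ins can be used, the compiled plug-in library must be loaded and registered with TensorRT. The following is a minimal sketch, assuming the plug-ins from retinanet-examples have been compiled into a shared library; the library name is a placeholder.

```python
import ctypes
import tensorrt as trt

# Load the compiled plug-in library so its plug-in creators register themselves.
# "libretinanet_plugins.so" is a placeholder name for the library built from
# the retinanet-examples plug-in sources.
ctypes.CDLL("libretinanet_plugins.so")

# Register all built-in and loaded plug-ins with the TensorRT plug-in registry.
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt.init_libnvinfer_plugins(TRT_LOGGER, "")

# List the registered plug-in creators to confirm the custom ones are visible.
registry = trt.get_plugin_registry()
print([creator.name for creator in registry.plugin_creator_list])
```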