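RetinaNet optimization case 3: Optimize a model by combining Blade and TensorRT plugins

Many PyTorch users implement the post-processing part of a detection model as a TensorRT plugin so that the whole model can be exported to TensorRT. Blade is highly extensible: if you have already implemented your own TensorRT plugin, you can use it together with Blade. This topic describes how to use Blade to optimize a detection model whose post-processing is already implemented with the TensorRT plugin mechanism.

Background information

TensorRT is a powerful tool for inference optimization on NVIDIA GPUs, and Blade builds heavily on TensorRT's optimization techniques in its lower layers. Beyond that, Blade integrates several optimization technologies: computational graph optimization, vendor libraries such as TensorRT and oneDNN, AI compiler optimization, Blade's hand-tuned operator library, Blade mixed precision, and Blade EasyCompression.

RetinaNet is a one-stage R-CNN-style detection network that consists of a backbone, several subnetworks, and NMS post-processing. Many training frameworks implement RetinaNet, and Detectron2 is a typical one. An earlier topic describes how to export a RetinaNet (Detectron2) model by using scripting_with_instances and quickly optimize it with Blade. For more information, see RetinaNet optimization case 1: Use Blade to optimize a RetinaNet (Detectron2) model.

However, for many PyTorch users, exporting a model to ONNX and then deploying it with TensorRT is the common and familiar workflow. Because both the ONNX exporter and TensorRT support only a limited part of the ONNX opset, this route is often not robust. The post-processing part of a detection network is especially hard to export to ONNX and optimize with TensorRT directly. In addition, the post-processing code of detection models in real-world scenarios is often inefficient. For these reasons, many users implement post-processing with the plugin mechanism that TensorRT provides, so that the whole model can be exported to TensorRT.

By comparison, optimizing with Blade and TorchScript custom C++ operators is simpler than implementing post-processing with the TensorRT plugin mechanism. For more information, see RetinaNet optimization case 2: Optimize a model by combining Blade and a custom C++ operator. Blade is also highly extensible: if you have already implemented your own TensorRT plugin, you can use it together with Blade.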

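Limits

The environment used in this topic must meet the following version requirements:

System environment: Linux with Python 3.6 or later, GCC 5.4 or later, NVIDIA Tesla T4, CUDA 10.2, cuDNN 8.0.5.39, and TensorRT 7.2.2.3.
Frameworks: PyTorch 1.8.1 or later and Detectron2 0.4.1 or later.
Inference optimization tool: Blade 3.16.0 or later (the build that dynamically links TensorRT).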
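
The minimum versions listed above can be checked programmatically before you start. The following is a minimal sketch, not part of Blade: the helper names `version_tuple` and `meets_minimum` are hypothetical, and the sketch assumes that version strings begin with dotted integers, optionally followed by a local suffix such as `+cu102`.

```python
import re
import sys


def version_tuple(version):
    """Turn '1.8.1+cu102' into (1, 8, 1), ignoring any trailing suffix."""
    numeric = re.match(r"\d+(?:\.\d+)*", version)
    if numeric is None:
        raise ValueError("no leading version number in %r" % version)
    return tuple(int(part) for part in numeric.group(0).split("."))


def meets_minimum(installed, minimum):
    """Compare two dotted versions, padding the shorter one with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a = a + (0,) * (width - len(a))
    b = b + (0,) * (width - len(b))
    return a >= b


# Minimum versions copied from the requirements list in this topic.
MINIMUMS = {"torch": "1.8.1", "detectron2": "0.4.1", "blade": "3.16.0"}

if __name__ == "__main__":
    print("Python >= 3.6:", sys.version_info >= (3, 6))
    print("PyTorch OK:", meets_minimum("1.8.1+cu102", MINIMUMS["torch"]))
```

In practice you would pass the installed version string of each package, for example `torch.__version__`, as the first argument.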

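Procedure

To optimize a model by combining Blade and TensorRT plugins, perform the following steps:

Step 1: Create a PyTorch model that uses TensorRT plugins
Use TensorRT plugins to implement the post-processing part of RetinaNet.
Step 2: Call Blade to optimize the model
Call the blade.optimize interface to optimize the model, and save the optimized model.
Step 3: Load and run the optimized model
After you run performance tests on the model before and after optimization, load the optimized model for inference if you are satisfied with the results.

Step 1: Create a PyTorch model that uses TensorRT plugins

Blade can work together with the TensorRT extension mechanism. This step describes how to use TensorRT extensions to implement the post-processing part of RetinaNet. For a tutorial on developing and compiling TensorRT plugins, see NVIDIA Deep Learning TensorRT Documentation. The post-processing logic for RetinaNet used in this topic comes from the NVIDIA open source community. For more information, see Retinanet-Examples. This topic extracts the core code to illustrate the process of developing and implementing a custom operator.

Download the sample code and decompress it.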

wget -nv https://pai-blade.oss-cn-zhangjiakou.aliyuncs.com/tutorials/retinanet_example/retinanet-examples.tar.gz -O retinanet-examples.tar.gz
tar xvfz retinanet-examples.tar.gz 1>/dev/null

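Compile the TensorRT plugins.

The sample code contains the implementation and registration of the TensorRT plugins for the decode and nms stages of RetinaNet post-processing. The official PyTorch documentation (see EXTENDING TORCHSCRIPT WITH CUSTOM C++ OPERATORS) describes three ways to compile custom operators: Building with CMake, Building with JIT Compilation, and Building with Setuptools. Each suits different scenarios, so choose the one that fits your needs. For simplicity, this topic uses Building with JIT Compilation. The sample code is as follows.

Note: Before you compile, set up the dependency libraries such as TensorRT, CUDA, and cuDNN.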

import torch.utils.cpp_extension
import os

codebase="retinanet-examples"
sources=['csrc/plugins/plugin.cpp',
         'csrc/cuda/decode.cu',
         'csrc/cuda/nms.cu',]
sources = [os.path.join(codebase,src) for src in sources]
torch.utils.cpp_extension.load(
    name="plugin",
    sources=sources,
    build_directory=codebase,
    extra_include_paths=['/usr/local/TensorRT/include/', '/usr/local/cuda/include/', '/usr/local/cuda/include/thrust/system/cuda/detail'],
    extra_cflags=['-std=c++14', '-O2', '-Wall'],
    extra_ldflags=['-L/usr/local/TensorRT/lib/', '-lnvinfer'],
    extra_cuda_cflags=[
        '-std=c++14', '--expt-extended-lambda',
        '--use_fast_math', '-Xcompiler', '-Wall,-fno-gnu-unique',
        '-gencode=arch=compute_75,code=sm_75',],
    is_python_module=False,
    with_cuda=True,
    verbose=False,
)
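
Before you run the JIT build above, it can help to confirm that the dependency directories it references actually exist, because a missing include or library path otherwise surfaces only as a long compiler error. The following is a minimal sketch: the helper name `missing_dirs` is hypothetical, and the paths are the ones passed to the build call above and may differ on your system.

```python
import os

# Paths copied from the build call above; adjust them to your installation.
REQUIRED_DIRS = [
    "/usr/local/TensorRT/include/",
    "/usr/local/TensorRT/lib/",
    "/usr/local/cuda/include/",
]


def missing_dirs(paths):
    """Return the subset of paths that are not existing directories."""
    return [path for path in paths if not os.path.isdir(path)]


if __name__ == "__main__":
    missing = missing_dirs(REQUIRED_DIRS)
    if missing:
        print("Missing dependency directories: " + ", ".join(missing))
    else:
        print("All dependency directories found.")
```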

Wrap the convolutional part of the RetinaNet model.

Wrap the model part of RetinaNet separately as a RetinaNetBackboneAndHeads module.

import torch
from typing import List
from torch import Tensor
from torch.testing import assert_allclose
from detectron2 import model_zoo

# This class wraps the backbone and head subnetworks of RetinaNet.
class RetinaNetBackboneAndHeads(torch.nn.Module):

    def __init__(self, model):
        super().__init__()
        self.model = model

    def preprocess(self, img):
        batched_inputs = [{"image": img}]
        images = self.model.preprocess_image(batched_inputs)
        return images.tensor

    def forward(self, images):
        features = self.model.backbone(images)
        features = [features[f] for f in self.model.head_in_features]
        cls_heads, box_heads = self.model.head(features)
        cls_heads = [cls.sigmoid() for cls in cls_heads]
        box_heads = [b.contiguous() for b in box_heads]
        return cls_heads, box_heads

retinanet_model = model_zoo.get("COCO-Detection/retinanet_R_50_FPN_3x.yaml", trained=True).eval()
retinanet_bacbone_heads = RetinaNetBackboneAndHeads(retinanet_model)

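Build the RetinaNet post-processing network by using the TensorRT plugins. If you have already created a TensorRT engine, skip this step.

Create a TensorRT engine.

For the TensorRT plugins to take effect, the following must be done:

Dynamically load the compiled plugin.so by using ctypes.cdll.LoadLibrary.
Use build_retinanet_decode to construct the post-processing network through the tensorrt Python API and build it into an engine.

The sample code is as follows.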

import os
import numpy as np
import tensorrt as trt

import ctypes
# Load the dynamic link library that contains the TensorRT plugins.
codebase="retinanet-examples"
ctypes.cdll.LoadLibrary(os.path.join(codebase, 'plugin.so'))

TRT_LOGGER = trt.Logger()
trt.init_libnvinfer_plugins(TRT_LOGGER, "")
PLUGIN_CREATORS = trt.get_plugin_registry().plugin_creator_list

# Look up a registered TensorRT plugin by name and create an instance of it.
def get_trt_plugin(plugin_name, field_collection):
    plugin = None
    for plugin_creator in PLUGIN_CREATORS:
        if plugin_creator.name != plugin_name:
            continue
        if plugin_name == "RetinaNetDecode":
            plugin = plugin_creator.create_plugin(
                name=plugin_name, field_collection=field_collection
            )
        if plugin_name == "RetinaNetNMS":
            plugin = plugin_creator.create_plugin(
                name=plugin_name, field_collection=field_collection
            )
    assert plugin is not None, "plugin not found"
    return plugin

# Build the TensorRT network for post-processing and compile it into an engine.
def build_retinanet_decode(example_outputs,
        input_image_shape,
        anchors_list,
        test_score_thresh = 0.05,
        test_nms_thresh = 0.5,
        test_topk_candidates = 1000,
        max_detections_per_image = 100,
    ):
    builder = trt.Builder(TRT_LOGGER)
    EXPLICIT_BATCH = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(EXPLICIT_BATCH)
    config = builder.create_builder_config()
    config.max_workspace_size = 3 ** 20  # 3 ** 20 bytes, about 3.25 GiB of builder workspace.

    cls_heads, box_heads = example_outputs
    profile = builder.create_optimization_profile()
    decode_scores = []
    decode_boxes = []
    decode_class = []

    input_blob_names = []
    input_blob_types = []
    def _add_input(head_tensor, head_name):
        input_blob_names.append(head_name)
        input_blob_types.append("Float")
        head_shape = list(head_tensor.shape)[-3:]
        profile.set_shape(
             head_name, [1] + head_shape, [20] + head_shape, [1000] + head_shape)
        return network.add_input(
            name=head_name, dtype=trt.float32, shape=[-1] + head_shape
        )

    # Build network inputs.
    cls_head_inputs = []
    cls_head_strides = [input_image_shape[-1] // cls_head.shape[-1] for cls_head in cls_heads]
    for idx, cls_head in enumerate(cls_heads):
        cls_head_name = "cls_head" + str(idx)
        cls_head_inputs.append(_add_input(cls_head, cls_head_name))

    box_head_inputs = []
    for idx, box_head in enumerate(box_heads):
        box_head_name = "box_head" + str(idx)
        box_head_inputs.append(_add_input(box_head, box_head_name))

    output_blob_names = []
    output_blob_types = []
    # Build decode network.
    for idx, anchors in enumerate(anchors_list):
        field_coll = trt.PluginFieldCollection([
            trt.PluginField("topk_candidates", np.array([test_topk_candidates], dtype=np.int32), trt.PluginFieldType.INT32),
            trt.PluginField("score_thresh", np.array([test_score_thresh], dtype=np.float32), trt.PluginFieldType.FLOAT32),
            trt.PluginField("stride", np.array([cls_head_strides[idx]], dtype=np.int32), trt.PluginFieldType.INT32),
            trt.PluginField("num_anchors", np.array([anchors.numel()], dtype=np.int32), trt.PluginFieldType.INT32),
            trt.PluginField("anchors", anchors.contiguous().cpu().numpy().astype(np.float32), trt.PluginFieldType.FLOAT32),]
        )
        decode_layer = network.add_plugin_v2(
            inputs=[cls_head_inputs[idx], box_head_inputs[idx]],
            plugin=get_trt_plugin("RetinaNetDecode", field_coll),
        )
        decode_scores.append(decode_layer.get_output(0))
        decode_boxes.append(decode_layer.get_output(1))
        decode_class.append(decode_layer.get_output(2))

    # Build NMS network.
    scores_layer = network.add_concatenation(decode_scores)
    boxes_layer = network.add_concatenation(decode_boxes)
    class_layer = network.add_concatenation(decode_class)
    field_coll = trt.PluginFieldCollection([
            trt.PluginField("nms_thresh", np.array([test_nms_thresh], dtype=np.float32), trt.PluginFieldType.FLOAT32),
            trt.PluginField("max_detections_per_image", np.array([max_detections_per_image], dtype=np.int32), trt.PluginFieldType.INT32),]
        )
    nms_layer = network.add_plugin_v2(
       inputs=[scores_layer.get_output(0), boxes_layer.get_output(0), class_layer.get_output(0)],
       plugin=get_trt_plugin("RetinaNetNMS", field_coll),
    )
    nms_layer.get_output(0).name = "scores"
    nms_layer.get_output(1).name = "boxes"
    nms_layer.get_output(2).name = "classes"
    nms_outputs = [network.mark_output(nms_layer.get_output(k)) for k in range(3)]
    config.add_optimization_profile(profile)
    cuda_engine = builder.build_engine(network, config)
    assert cuda_engine is not None
    return cuda_engine
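
The `_add_input` helper above declares each head tensor with a dynamic first dimension and registers minimum, optimum, and maximum shapes of 1, 20, and 1000 for it in the optimization profile. The following sketch reproduces only that shape arithmetic in plain Python. The function name `profile_shapes` and the example head shape are hypothetical and used purely for illustration.

```python
def profile_shapes(head_shape, min_batch=1, opt_batch=20, max_batch=1000):
    """Return (network_shape, min, opt, max) built the way _add_input builds them."""
    chw = list(head_shape)[-3:]  # keep the last three dimensions (C, H, W)
    return (
        [-1] + chw,          # -1 marks the first dimension as dynamic
        [min_batch] + chw,
        [opt_batch] + chw,
        [max_batch] + chw,
    )


# Hypothetical head output shape (N, C, H, W), for illustration only.
net, lo, opt, hi = profile_shapes((1, 720, 60, 80))
print(net)  # [-1, 720, 60, 80]
print(lo, opt, hi)
```

Only the first dimension varies across the three profile shapes, so the engine accepts any first-dimension size in that range while the channel and spatial dimensions stay fixed to those of the example outputs.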

Create cuda_engine based on the actual number, data types, and shapes of the outputs of RetinaNetBackboneAndHeads.

import numpy as np
from detectron2.data.detection_utils import read_image

!wget http://images.cocodataset.org/val2017/000000439715.jpg -q -O input.jpg
img = read_image('./input.jpg')
img = torch.from_numpy(np.ascontiguousarray(img.transpose(2, 0, 1)))

example_inputs = retinanet_bacbone_heads.preprocess(img)
example_outputs = retinanet_bacbone_heads(example_inputs)

cell_anchors = [c.contiguous() for c in retinanet_model.anchor_generator.cell_anchors]
cuda_engine = build_retinanet_decode(
            example_outputs, example_inputs.shape, cell_anchors)
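
Inside build_retinanet_decode, the stride of each feature level is derived as the integer ratio between the input image width and the width of the corresponding classification head output. The following worked example shows that arithmetic in isolation. The function name `head_strides` and the list of feature-map widths are assumptions chosen for illustration. The input width of 640 matches the test data shape shown in the optimization report later in this topic.

```python
def head_strides(input_width, head_widths):
    """Mirror cls_head_strides: image width divided by each feature-map width."""
    return [input_width // width for width in head_widths]


# Assumed feature-map widths for a 640-pixel-wide input; illustrative values.
strides = head_strides(640, [80, 40, 20, 10, 5])
print(strides)  # [8, 16, 32, 64, 128]
```

Each stride is then passed to the RetinaNetDecode plugin through the "stride" plugin field.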

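Use the Blade extension to support a model that mixes PyTorch and a TensorRT engine.

The following code uses RetinaNetWrapper, RetinaNetBackboneAndHeads, and RetinaNetPostProcess to reassemble the backbone, the heads, and the TensorRT plugin post-processing part.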

import blade.torch

# The post-processing part, supported by the Blade TensorRT extension.
class RetinaNetPostProcess(torch.nn.Module):
    def __init__(self, cuda_engine):
        super().__init__()
        blob_names = [cuda_engine.get_binding_name(idx) for idx in range(cuda_engine.num_bindings)]
        input_blob_names = blob_names[:-3]
        input_blob_types = ["Float"] * len(input_blob_names)
        output_blob_names = blob_names[-3:]
        output_blob_types = ["Float"] * len(output_blob_names)

        self.trt_ext_plugin = torch.classes.torch_addons.TRTEngineExtension(
            bytes(cuda_engine.serialize()),
            (input_blob_names, output_blob_names, input_blob_types, output_blob_types),
        )

    def forward(self, inputs: List[Tensor]):
        return self.trt_ext_plugin.forward(inputs)

# Mix PyTorch and the TensorRT engine in one module.
class RetinaNetWrapper(torch.nn.Module):

    def __init__(self, model, trt_postproc):
        super().__init__()
        self.backbone_and_heads = model
        self.trt_postproc = torch.jit.script(trt_postproc)

    def forward(self, images):
        cls_heads, box_heads = self.backbone_and_heads(images)
        return self.trt_postproc(cls_heads + box_heads)

trt_postproc = RetinaNetPostProcess(cuda_engine)
retinanet_mix_trt = RetinaNetWrapper(retinanet_bacbone_heads, trt_postproc)

# The module can be exported and saved as TorchScript.
retinanet_script = torch.jit.trace(retinanet_mix_trt, (example_inputs, ), check_trace=False)
torch.jit.save(retinanet_script, 'retinanet_script.pt')
torch.save(example_inputs, 'example_inputs.pth')
outputs = retinanet_script(example_inputs)

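The newly assembled torch.nn.Module has the following characteristics:

It uses the torch.classes.torch_addons.TRTEngineExtension interface provided by Blade's TensorRT extension support.
It supports TorchScript export. The preceding code uses torch.jit.trace for the export.
It can be saved in TorchScript format.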
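
RetinaNetPostProcess splits the engine's binding names by position: everything except the last three names is treated as an input, and the last three are treated as outputs. The following sketch shows that split in plain Python. It assumes that the engine lists its input bindings first, in the order they were added, followed by the three marked outputs, which is what the slicing in the code above relies on. The helper name `split_bindings` and the count of five feature levels are hypothetical.

```python
def split_bindings(blob_names, num_outputs=3):
    """Split binding names into inputs and outputs by position."""
    return blob_names[:-num_outputs], blob_names[-num_outputs:]


# Names in the order build_retinanet_decode would create them,
# assuming five feature levels (an illustrative count).
names = (
    ["cls_head%d" % i for i in range(5)]
    + ["box_head%d" % i for i in range(5)]
    + ["scores", "boxes", "classes"]
)
inputs, outputs = split_bindings(names)
print(outputs)      # ['scores', 'boxes', 'classes']
print(len(inputs))  # 10
```

This ordering is also why RetinaNetWrapper.forward passes `cls_heads + box_heads`: the list of tensors must line up with the input binding names, classification heads first and box heads second.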

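Step 2: Call Blade to optimize the model

Call the Blade optimization interface.

Call the blade.optimize interface to optimize the model, as shown in the following sample code. For more information about the blade.optimize interface, see Optimize a PyTorch model.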

import blade
import blade.torch
import ctypes
import torch
import os

codebase="retinanet-examples"
ctypes.cdll.LoadLibrary(os.path.join(codebase, 'plugin.so'))

blade_config = blade.Config()
blade_config.gpu_config.disable_fp16_accuracy_check = True

script_model = torch.jit.load('retinanet_script.pt')
example_inputs = torch.load('example_inputs.pth')
test_data = [(example_inputs,)]  # PyTorch input data is a list of tuples.
with blade_config:
    optimized_model, opt_spec, report = blade.optimize(
        script_model,  # The TorchScript model exported in the previous step.
        'o1',  # Enable Blade O1-level optimization.
        device_type='gpu',  # The target device is a GPU.
        test_data=test_data,  # A set of test data used to assist optimization and testing.
    )

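Print the optimization report and save the model.

The model optimized by Blade is still a TorchScript model. After the optimization is complete, you can use the following code to print the optimization report and save the optimized model.

# Print the optimization report.
print("Report: {}".format(report))
# Save the optimized model.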
torch.jit.save(optimized_model, 'optimized.pt')

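The printed optimization report looks like the following. For more information about the fields in the report, see Optimization report.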

Report: {
  "software_context": [
    {
      "software": "pytorch",
      "version": "1.8.1+cu102"
    },
    {
      "software": "cuda",
      "version": "10.2.0"
    }
  ],
  "hardware_context": {
    "device_type": "gpu",
    "microarchitecture": "T4"
  },
  "user_config": "",
  "diagnosis": {
    "model": "unnamed.pt",
    "test_data_source": "user provided",
    "shape_variation": "undefined",
    "message": "Unable to deduce model inputs information (data type, shape, value range, etc.)",
    "test_data_info": "0 shape: (1, 3, 480, 640) data type: float32"
  },
  "optimizations": [
    {
      "name": "PtTrtPassFp16",
      "status": "effective",
      "speedup": "4.37",
      "pre_run": "40.59 ms",
      "post_run": "9.28 ms"
    }
  ],
  "overall": {
    "baseline": "40.02 ms",
    "optimized": "9.27 ms",
    "speedup": "4.32"
  },
  "model_info": {
    "input_format": "torch_script"
  },
  "compatibility_list": [
    {
      "device_type": "gpu",
      "microarchitecture": "T4"
    }
  ],
  "model_sdk": {}
}
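
The speedup values in the report are the ratio of the latency before optimization to the latency after optimization. The following sketch checks that relationship against the numbers shown above. It assumes that the report is available as a JSON string, and the helper name `speedup_matches` is hypothetical. Only the fields needed for the check are reproduced.

```python
import json


def speedup_matches(before, after, reported, tol=0.005):
    """Check that reported is before/after, where latencies look like '40.02 ms'."""
    before_ms = float(before.split()[0])
    after_ms = float(after.split()[0])
    return abs(before_ms / after_ms - float(reported)) < tol


# Excerpt of the report above, limited to the fields used here.
report = json.loads("""
{
  "optimizations": [
    {"name": "PtTrtPassFp16", "speedup": "4.37",
     "pre_run": "40.59 ms", "post_run": "9.28 ms"}
  ],
  "overall": {"baseline": "40.02 ms", "optimized": "9.27 ms", "speedup": "4.32"}
}
""")

overall = report["overall"]
print(speedup_matches(overall["baseline"], overall["optimized"], overall["speedup"]))  # True
for item in report["optimizations"]:
    print(item["name"], speedup_matches(item["pre_run"], item["post_run"], item["speedup"]))
```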

Run performance tests on the model before and after optimization.

The sample code for the performance test is as follows.

import time

@torch.no_grad()
def benchmark(model, inp):
    for i in range(100):
        model(inp)
    torch.cuda.synchronize()
    start = time.time()
    for i in range(200):
        model(inp)
    torch.cuda.synchronize()
    elapsed_ms = (time.time() - start) * 1000
    print("Latency: {:.2f}".format(elapsed_ms / 200))

# Measure the latency of the model before optimization.
benchmark(script_model, example_inputs)
# Measure the latency of the model after optimization.
benchmark(optimized_model, example_inputs)

The reference results of this test are as follows.

Latency: 40.71
Latency: 9.35

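These results show that, over the same 200 iterations, the average latency of the model is 40.71 ms before optimization and 9.35 ms after optimization.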
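
If you want to reuse the measurement in scripts, a variant that returns the mean latency instead of printing it can be convenient. The following is a minimal sketch rather than the benchmark used for the numbers above: the helper name `mean_latency_ms` is hypothetical, it uses time.perf_counter in place of time.time, and the optional `sync` callback stands in for a device synchronization call such as torch.cuda.synchronize.

```python
import time


def mean_latency_ms(fn, warmup=100, iters=200, sync=None):
    """Run fn() warmup + iters times and return the mean latency in ms."""
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # wait for pending device work before starting the timer
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()  # wait for the timed work to finish
    return (time.perf_counter() - start) * 1000 / iters


if __name__ == "__main__":
    # Stand-in workload; replace it with a call such as
    # lambda: optimized_model(example_inputs) in a real test.
    latency = mean_latency_ms(lambda: sum(range(1000)), warmup=10, iters=50)
    print("Latency: {:.4f} ms".format(latency))
```

With a GPU model, pass `sync=torch.cuda.synchronize` and call the function inside a `torch.no_grad()` context, as the benchmark above does.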

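Step 3: Load and run the optimized model

Optional: During the trial period, you can set the following environment variable to prevent the program from exiting because of an authentication failure.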
```
export BLADE_AUTH_USE_COUNTING=1
```
Obtain authentication credentials.
```
export BLADE_REGION=<region>
export BLADE_TOKEN=<token>
```
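Replace the following parameters with your actual values:
- <region>: a region that Blade supports. To obtain this value, join the Blade user group. For the QR code of the user group, see Obtain a token.
- <token>: the authentication token. To obtain this value, join the Blade user group. For the QR code of the user group, see Obtain a token.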
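
Because a missing credential otherwise shows up only as an authentication failure at run time, you may want to verify that both variables are set before you load the model. The following is a minimal sketch: the helper name `missing_blade_vars` is hypothetical, and only the variable names BLADE_REGION and BLADE_TOKEN come from this topic.

```python
import os

# Variable names taken from the export commands in this topic.
REQUIRED_VARS = ("BLADE_REGION", "BLADE_TOKEN")


def missing_blade_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_blade_vars()
    if missing:
        print("Set these variables before loading the model: " + ", ".join(missing))
    else:
        print("Blade credentials are set.")
```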

Load and run the optimized model.

The model optimized by Blade is still a TorchScript model, so you can load it without switching environments.

import blade.runtime.torch
import torch

from torch.testing import assert_allclose
import ctypes
import os

codebase="retinanet-examples"
ctypes.cdll.LoadLibrary(os.path.join(codebase, 'plugin.so'))

optimized_model = torch.jit.load('optimized.pt')
example_inputs = torch.load('example_inputs.pth')

with torch.no_grad():
    pred = optimized_model(example_inputs)