Optimize a model by combining Blade and Custom C++ Operators - Platform For AI

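To make the post-processing stage of a detection model more efficient, you can use TorchScript Custom C++ Operators to replace logic implemented in Python with an efficient C++ implementation, and then export a TorchScript model for Blade to optimize. This topic describes how to use Blade to optimize a detection model whose post-processing logic is implemented with TorchScript Custom C++ Operators.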

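Background information

RetinaNet is a one-stage detection network that consists of a backbone, multiple subnets, and non-maximum suppression (NMS) post-processing. Many training frameworks implement RetinaNet, and Detectron2 is a typical one. The previous topic describes how to export a RetinaNet (Detectron2) model with scripting_with_instances and quickly optimize it with Blade. For details, see RetinaNet optimization case 1: Use Blade to optimize a RetinaNet (Detectron2) model.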

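However, the post-processing code of a detection model usually has to compute and filter boxes, run NMS, and perform similar logic, and implementing this logic in Python is often inefficient. In this case, you can use TorchScript Custom C++ Operators to replace the Python logic with an efficient C++ implementation, and then export a TorchScript model and optimize it with Blade.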
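To make concrete what kind of logic is being moved out of Python, the following is a minimal pure-Python sketch of greedy NMS. It is not the code from the example repository; the helper names `iou` and `nms` and the `(x1, y1, x2, y2)` box layout are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float], iou_thresh: float) -> List[int]:
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap any already-kept box by more than iou_thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


# Illustrative boxes: the second overlaps the first heavily, the third is far away.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores, iou_thresh=0.5)
```

Every comparison in this sketch runs in the Python interpreter and the work grows quadratically with the number of candidate boxes, which is the kind of cost a C++ or CUDA operator avoids.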

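Limits

The environment used in this topic must meet the following version requirements:

System environment: Linux with Python 3.6 or later, GCC 5.4 or later, NVIDIA Tesla T4, CUDA 10.2, and cuDNN 8.0.5.39.
Frameworks: PyTorch 1.8.1 or later and Detectron2 0.4.1 or later.
Inference optimization tool: Blade 3.16.0 or later.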

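Procedure

Optimizing a model by combining Blade and Custom C++ Operators involves the following steps:

Step 1: Create a PyTorch model with Custom C++ Operators
Implement the post-processing part of RetinaNet with TorchScript extensions.
Step 2: Export a TorchScript model
Export the model with either TracingAdapter or scripting_with_instances, both of which are provided by Detectron2.
Step 3: Call Blade to optimize the model
Call the blade.optimize interface to optimize the model, and save the optimized model.
Step 4: Load and run the optimized model
After you benchmark the model before and after optimization and are satisfied with the result, load the optimized model for inference.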

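Step 1: Create a PyTorch model with Custom C++ Operators

Blade integrates seamlessly with the PyTorch TorchScript extension mechanism. The following describes how to implement the post-processing part of RetinaNet with TorchScript extensions. For an introduction to TorchScript custom operators, see EXTENDING TORCHSCRIPT WITH CUSTOM C++ OPERATORS. The RetinaNet post-processing logic used in this topic comes from the NVIDIA open source community. For details, see Retinanet-Examples. This topic extracts the core code to illustrate how to develop a custom operator.

Download and decompress the sample code.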

wget -nv https://pai-blade.oss-cn-zhangjiakou.aliyuncs.com/tutorials/retinanet_example/retinanet-examples.tar.gz -O retinanet-examples.tar.gz
tar xvfz retinanet-examples.tar.gz 1>/dev/null

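Compile the Custom C++ Operators.

The official PyTorch documentation (see EXTENDING TORCHSCRIPT WITH CUSTOM C++ OPERATORS) describes three ways to compile custom operators: Building with CMake, Building with JIT Compilation, and Building with Setuptools. Each suits different scenarios, so choose the one that fits your needs. For simplicity, this topic uses Building with JIT Compilation, as shown in the following sample code.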

import torch.utils.cpp_extension
import os
codebase="retinanet-examples"
sources=['csrc/extensions.cpp',
         'csrc/cuda/decode.cu',
         'csrc/cuda/nms.cu',]
sources = [os.path.join(codebase,src) for src in sources]
torch.utils.cpp_extension.load(
    name="custom",
    sources=sources,
    build_directory=codebase,
    extra_include_paths=['/usr/local/TensorRT/include/', '/usr/local/cuda/include/', '/usr/local/cuda/include/thrust/system/cuda/detail'],
    extra_cflags=['-std=c++14', '-O2', '-Wall'],
    extra_cuda_cflags=[
        '-std=c++14', '--expt-extended-lambda',
        '--use_fast_math', '-Xcompiler', '-Wall,-fno-gnu-unique',
        '-gencode=arch=compute_75,code=sm_75',],
    is_python_module=False,
    with_cuda=True,
    verbose=False,
)

After the preceding program finishes, the compiled custom.so is saved in the retinanet-examples directory.
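The build above hard-codes `-gencode=arch=compute_75,code=sm_75`, which corresponds to compute capability 7.5 of the Tesla T4. On a different GPU the flag would need to change accordingly. A minimal sketch of deriving it follows; `gencode_flag` is a hypothetical helper and not part of PyTorch.

```python
from typing import Tuple


def gencode_flag(capability: Tuple[int, int]) -> str:
    """Build the nvcc -gencode flag for a (major, minor) compute capability."""
    major, minor = capability
    arch = f"{major}{minor}"
    return f"-gencode=arch=compute_{arch},code=sm_{arch}"


# On a machine with a GPU, the tuple could come from
# torch.cuda.get_device_capability(); (7, 5) is the Tesla T4 used in this topic.
t4_flag = gencode_flag((7, 5))
print(t4_flag)
```

The resulting string can be placed in `extra_cuda_cflags` in place of the hard-coded value.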

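Replace the post-processing part of RetinaNet with the Custom C++ Operators.

For brevity, adapter_forward directly replaces RetinaNet.forward here. adapter_forward implements the post-processing part of RetinaNet with two Custom C++ Operators, decode_cuda and nms_cuda, as shown in the following sample code.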

import os
import torch
from typing import Tuple, Dict, List, Optional
codebase="retinanet-examples"
torch.ops.load_library(os.path.join(codebase, 'custom.so'))

decode_cuda = torch.ops.retinanet.decode
nms_cuda = torch.ops.retinanet.nms

# Most of this function is the same as RetinaNet.forward, but the post-processing part is implemented with decode_cuda and nms_cuda.
def adapter_forward(self, batched_inputs: Tuple[Dict[str, torch.Tensor]]):
    images = self.preprocess_image(batched_inputs)
    features = self.backbone(images.tensor)
    features = [features[f] for f in self.head_in_features]
    cls_heads, box_heads = self.head(features)
    cls_heads = [cls.sigmoid() for cls in cls_heads]
    box_heads = [b.contiguous() for b in box_heads]

    # Post-processing.
    strides = [images.tensor.shape[-1] // cls_head.shape[-1] for cls_head in cls_heads]
    decoded = [
        decode_cuda(
            cls_head,
            box_head,
            anchor.view(-1),
            stride,
            self.test_score_thresh,
            self.test_topk_candidates,
        )
        for stride, cls_head, box_head, anchor in zip(
            strides, cls_heads, box_heads, self.cell_anchors
        )
    ]

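    # Non-maximum suppression: regroup the per-level outputs by field and concatenate them.
    # As written, only the first three entries of decoded are combined.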
    decoded = [torch.cat(tensors, 1) for tensors in zip(decoded[0], decoded[1], decoded[2])]
    return nms_cuda(decoded[0], decoded[1], decoded[2], self.test_nms_thresh, self.max_detections_per_image)

from detectron2.modeling.meta_arch import retinanet

# Replace RetinaNet.forward with adapter_forward.
retinanet.RetinaNet.forward = adapter_forward
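In `adapter_forward`, each stride is derived as the input width divided by the width of the corresponding feature map. The arithmetic can be sketched in plain Python. The 640-pixel input width and the five feature-map widths below are assumed values for illustration, not output from the model.

```python
# Assumed input width and per-level feature-map widths (illustrative only).
image_width = 640
feature_widths = [80, 40, 20, 10, 5]

# Same expression as in adapter_forward: integer division of the input width
# by each feature-map width gives the stride of that level.
strides = [image_width // w for w in feature_widths]
print(strides)
```

Deriving the strides from tensor shapes in this way means they do not have to be stored separately from the feature maps.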

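Step 2: Export a TorchScript model

Detectron2 is a flexible, extensible, and configurable training framework for object detection and image segmentation, open-sourced by FAIR. Because of this flexibility, exporting a model with conventional methods may fail or produce an incorrect result. To support TorchScript deployment, Detectron2 provides two export methods, TracingAdapter and scripting_with_instances. For details, see Detectron2 Usage.

Blade accepts TorchScript models in any form. The following example uses scripting_with_instances to show the export process.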

import torch
import numpy as np

from torch import Tensor
from torch.testing import assert_allclose

from detectron2 import model_zoo
from detectron2.export import scripting_with_instances
from detectron2.structures import Boxes
from detectron2.data.detection_utils import read_image

# Export the RetinaNet model with scripting_with_instances.
def load_retinanet(config_path):
    model = model_zoo.get(config_path, trained=True).eval()
    # Attach a new cell_anchors attribute to the PyTorch model.
    model.cell_anchors = [c.contiguous() for c in model.anchor_generator.cell_anchors]
    fields = {
        "pred_boxes": Boxes,
        "scores": Tensor,
        "pred_classes": Tensor,
    }
    script_model = scripting_with_instances(model, fields)
    return model, script_model

# Download a sample image. The leading "!" runs a shell command in a notebook such as Jupyter.
!wget http://images.cocodataset.org/val2017/000000439715.jpg -q -O input.jpg
img = read_image('./input.jpg')
img = torch.from_numpy(np.ascontiguousarray(img.transpose(2, 0, 1)))

# Run both models and compare the results before and after export.
pytorch_model, script_model = load_retinanet("COCO-Detection/retinanet_R_50_FPN_3x.yaml")
with torch.no_grad():
    batched_inputs = [{"image": img.float()}]
    pred1 = pytorch_model(batched_inputs)
    pred2 = script_model(batched_inputs)

assert_allclose(pred1[0], pred2[0])

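Step 3: Call Blade to optimize the model

Call the Blade optimization interface.

Call the blade.optimize interface to optimize the model, as shown in the following sample code. For details about the blade.optimize interface, see Optimize a PyTorch model.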

import os
import blade
import torch

# Load the dynamic link library of the Custom C++ Operators.
codebase="retinanet-examples"
torch.ops.load_library(os.path.join(codebase, 'custom.so'))

blade_config = blade.Config()
blade_config.gpu_config.disable_fp16_accuracy_check = True

test_data = [(batched_inputs,)]  # PyTorch input data is a list of tuples.

with blade_config:
    optimized_model, opt_spec, report = blade.optimize(
        script_model,  # The TorchScript model exported in the previous step.
        'o1',  # Enable Blade O1-level optimization.
        device_type='gpu',  # The target device is a GPU.
        test_data=test_data,  # A set of test data used to assist optimization and testing.
    )
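`test_data` is a list in which each element is one tuple of model inputs. A minimal sketch of that structure with a stand-in model follows. `fake_model` is hypothetical, and the idea that each tuple is unpacked as the positional arguments of one model call is an assumption, not something stated in this topic.

```python
from typing import Any, Dict, List, Tuple


def fake_model(batched_inputs: List[Dict[str, Any]]) -> int:
    """Hypothetical stand-in for script_model: returns the batch size."""
    return len(batched_inputs)


batched_inputs = [{"image": "placeholder for a CHW float tensor"}]

# One element per test sample; each element is a tuple of model inputs.
# The trailing comma makes (batched_inputs,) a one-element tuple.
test_data: List[Tuple[Any, ...]] = [(batched_inputs,)]

# Assumption: each tuple is unpacked into one model call.
outputs = [fake_model(*args) for args in test_data]
```

Without the trailing comma, `(batched_inputs)` would simply be the list itself rather than a tuple that wraps it.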

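Print the optimization report and save the model.

The model optimized by Blade is still a TorchScript model. After optimization completes, you can print the optimization report and save the optimized model with the following code.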

# Print the optimization report.
print("Report: {}".format(report))
# Save the optimized model.
torch.jit.save(script_model, 'script_model.pt')
torch.jit.save(optimized_model, 'optimized.pt')

The printed optimization report is similar to the following. For details about the fields in the report, see Optimization report.

Report: {
  "software_context": [
    {
      "software": "pytorch",
      "version": "1.8.1+cu102"
    },
    {
      "software": "cuda",
      "version": "10.2.0"
    }
  ],
  "hardware_context": {
    "device_type": "gpu",
    "microarchitecture": "T4"
  },
  "user_config": "",
  "diagnosis": {
    "model": "unnamed.pt",
    "test_data_source": "user provided",
    "shape_variation": "undefined",
    "message": "Unable to deduce model inputs information (data type, shape, value range, etc.)",
    "test_data_info": "0 shape: (3, 480, 640) data type: float32"
  },
  "optimizations": [
    {
      "name": "PtTrtPassFp16",
      "status": "effective",
      "speedup": "3.92",
      "pre_run": "40.72 ms",
      "post_run": "10.39 ms"
    }
  ],
  "overall": {
    "baseline": "40.64 ms",
    "optimized": "10.41 ms",
    "speedup": "3.90"
  },
  "model_info": {
    "input_format": "torch_script"
  },
  "compatibility_list": [
    {
      "device_type": "gpu",
      "microarchitecture": "T4"
    }
  ],
  "model_sdk": {}
}
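The `speedup` values in the report are consistent with the ratio of the latency before optimization to the latency after it. The following sketch recomputes them from an excerpt of the report above. `latency_ms` is a hypothetical helper, and the excerpt is copied by hand rather than produced by Blade.

```python
import json

# Excerpt of the report shown above, copied by hand.
report_json = """
{
  "optimizations": [
    {"name": "PtTrtPassFp16", "pre_run": "40.72 ms", "post_run": "10.39 ms"}
  ],
  "overall": {"baseline": "40.64 ms", "optimized": "10.41 ms"}
}
"""


def latency_ms(text: str) -> float:
    """Parse a latency string such as '40.64 ms' into a float."""
    return float(text.split()[0])


report = json.loads(report_json)
overall = report["overall"]
overall_speedup = latency_ms(overall["baseline"]) / latency_ms(overall["optimized"])
pass_info = report["optimizations"][0]
pass_speedup = latency_ms(pass_info["pre_run"]) / latency_ms(pass_info["post_run"])
print(f"overall: {overall_speedup:.2f}, {pass_info['name']}: {pass_speedup:.2f}")
```

Rounded to two decimals, the ratios are 3.90 and 3.92, matching the `speedup` fields in the report.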

Benchmark the model before and after optimization.

The sample code for the benchmark is as follows.

import time

@torch.no_grad()
def benchmark(model, inp):
    for i in range(100):
        model(inp)
    torch.cuda.synchronize()
    start = time.time()
    for i in range(200):
        model(inp)
    torch.cuda.synchronize()
    elapsed_ms = (time.time() - start) * 1000
    print("Latency: {:.2f}".format(elapsed_ms / 200))

# Measure the latency of the model before optimization.
benchmark(script_model, batched_inputs)
# Measure the latency of the model after optimization.
benchmark(optimized_model, batched_inputs)

Reference results from this test are as follows.

Latency: 40.65
Latency: 10.46

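These results show that, over the same 200 iterations, the average latency of the model is 40.65 ms before optimization and 10.46 ms after optimization.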
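The benchmark above follows a common pattern: warm up, synchronize, time a fixed number of iterations, synchronize again, then average. A device-agnostic sketch of the same pattern, which returns the value instead of printing it, is shown below. `measure_latency_ms` and its `sync` parameter are hypothetical names. On a GPU, `torch.cuda.synchronize` would be passed as `sync`.

```python
import time
from typing import Callable


def measure_latency_ms(
    fn: Callable[[], object],
    warmup: int = 100,
    iters: int = 200,
    sync: Callable[[], None] = lambda: None,
) -> float:
    """Return the mean latency of fn() in milliseconds."""
    for _ in range(warmup):  # warm-up runs are not timed
        fn()
    sync()  # make sure pending asynchronous work is finished
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()  # wait for the timed work before stopping the clock
    return (time.perf_counter() - start) * 1000 / iters


# Usage with a trivial callable that records how often it was invoked.
calls = []
latency = measure_latency_ms(lambda: calls.append(1), warmup=3, iters=5)
```

Returning the number makes it straightforward to compute a speedup from two measurements rather than reading it off printed output.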

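Step 4: Load and run the optimized model

Optional: During the trial period, you can set the following environment variable to prevent the program from exiting because of an authentication failure.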
```
export BLADE_AUTH_USE_COUNTING=1
```
Obtain the authentication information.
```
export BLADE_REGION=<region>
export BLADE_TOKEN=<token>
```
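Replace the following parameters based on your actual situation:
- <region>: a region that Blade supports. Join the Blade user group to obtain this information. For the QR code of the user group, see Obtain a token.
- <token>: the authentication token. Join the Blade user group to obtain this information. For the QR code of the user group, see Obtain a token.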
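Before loading the optimized model from Python, it can help to confirm that both variables are visible to the process. The following is a minimal sketch of such a check. `missing_blade_env` is a hypothetical helper and not part of the Blade SDK.

```python
import os
from typing import List, Mapping


def missing_blade_env(env: Mapping[str, str]) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    required = ["BLADE_REGION", "BLADE_TOKEN"]
    return [name for name in required if not env.get(name)]


# Check the environment of the current process.
missing = missing_blade_env(os.environ)
if missing:
    print("Not set: " + ", ".join(missing))
```

Passing the environment in as a mapping keeps the helper easy to exercise with a plain dictionary.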

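Load and run the optimized model.

The model optimized by Blade is still a TorchScript model, so you can load the optimized result without switching environments.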

import blade.runtime.torch
import detectron2
import torch
import numpy as np
import os
from detectron2.data.detection_utils import read_image
from torch.testing import assert_allclose

# Load the dynamic link library of the Custom C++ Operators.
codebase="retinanet-examples"
torch.ops.load_library(os.path.join(codebase, 'custom.so'))

script_model = torch.jit.load('script_model.pt')
optimized_model = torch.jit.load('optimized.pt')

img = read_image('./input.jpg')
img = torch.from_numpy(np.ascontiguousarray(img.transpose(2, 0, 1)))

# Run both models and compare the results before and after optimization.
with torch.no_grad():
    batched_inputs = [{"image": img.float()}]
    pred1 = script_model(batched_inputs)
    pred2 = optimized_model(batched_inputs)

assert_allclose(pred1[0], pred2[0], rtol=1e-3, atol=1e-2)
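The final comparison passes explicit `rtol` and `atol` values, presumably because the optimized model runs a reduced-precision pass (PtTrtPassFp16 in the report) and cannot be expected to match exactly. The closeness rule used by this kind of check is |actual - expected| <= atol + rtol * |expected|. A scalar sketch of that rule follows; `within_tolerance` is a hypothetical helper and the sample values are illustrative, not model outputs.

```python
def within_tolerance(actual: float, expected: float, rtol: float, atol: float) -> bool:
    """Closeness rule: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# With rtol=1e-3 and atol=1e-2, the allowed error around 100.0 is 0.01 + 0.1 = 0.11.
ok = within_tolerance(100.05, 100.0, rtol=1e-3, atol=1e-2)  # error of about 0.05
bad = within_tolerance(100.2, 100.0, rtol=1e-3, atol=1e-2)  # error of about 0.2
```

Because the relative term scales with the expected value, large box coordinates are allowed a larger absolute deviation than values near zero.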