Use PAI-Blade to optimize a RetinaNet model in the Detectron2 framework - Platform For AI

RetinaNet is a one-stage detection network of the region-based convolutional neural network (R-CNN) type. Its basic structure consists of a backbone, multiple subnetworks, and non-maximum suppression (NMS), which is a post-processing algorithm. RetinaNet is implemented in many training frameworks, and Detectron2 is a typical one. This topic describes how to use PAI-Blade, provided by Machine Learning Platform for AI (PAI), to optimize a RetinaNet model in the Detectron2 framework.

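Limits

The environment used for the procedure in this topic must meet the following version requirements:

System environment: Python 3.6 or later on Linux, and Compute Unified Device Architecture (CUDA) 10.2
Framework: PyTorch 1.8.1 or later, and Detectron2 0.4.1 or later
Inference optimization tool: PAI-Blade V3.16.0 or later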
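
A quick pre-flight check can catch version mismatches before you start. The following is a minimal sketch and not part of PAI-Blade: `parse_version` and `meets_minimum` are hypothetical helpers that compare dotted version strings numerically and ignore local build suffixes such as `+cu102`.

```python
import re

# Minimum versions taken from the requirements listed above.
MINIMUMS = {"python": "3.6", "torch": "1.8.1", "detectron2": "0.4.1", "blade": "3.16.0"}

def parse_version(text):
    """Turn '1.8.1+cu102' into (1, 8, 1); non-numeric suffixes are ignored."""
    core = re.split(r"[+\-]", text.strip().lstrip("vV"), maxsplit=1)[0]
    parts = []
    for piece in core.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)

def meets_minimum(installed, required):
    """Return True if the installed version is at least the required one."""
    a, b = parse_version(installed), parse_version(required)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b
```

For example, `meets_minimum(torch.__version__, MINIMUMS["torch"])` checks the installed PyTorch build. Comparing integer tuples instead of raw strings avoids ranking `1.10.0` below `1.8.1`.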

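Procedure

To use PAI-Blade to optimize a RetinaNet model in the Detectron2 framework, perform the following steps:

Step 1: Export the RetinaNet model that you want to optimize
Call TracingAdapter or the scripting_with_instances API provided by Detectron2 to export the RetinaNet model that you want to optimize.
Step 2: Use PAI-Blade to optimize the model
Call the blade.optimize method to optimize the model, and then save the optimized model.
Step 3: Load and run the optimized model
If the optimized model passes the performance test and meets your expectations, load it for inference.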

Step 1: Export the RetinaNet model that you want to optimize

Detectron2 is an open source training framework built by Facebook AI Research (FAIR). It implements object detection and segmentation algorithms and is flexible, extensible, and configurable. Because of this flexibility, exporting a model by using regular methods may fail or return incorrect results. To let models be deployed in the TorchScript format, Detectron2 provides TracingAdapter and the scripting_with_instances API for exporting models. For more information, see "Usage".

PAI-Blade accepts models of any type in the TorchScript format. The following sample code shows how to export a model in the TorchScript format. In this example, the scripting_with_instances API is used.

```
import torch
import numpy as np

from torch import Tensor
from torch.testing import assert_allclose

from detectron2 import model_zoo
from detectron2.export import scripting_with_instances
from detectron2.structures import Boxes
from detectron2.data.detection_utils import read_image

# Call the scripting_with_instances API to export the RetinaNet model.
def load_retinanet(config_path):
    model = model_zoo.get(config_path, trained=True).eval()
    # Fields of the Instances objects that the scripted model returns.
    fields = {
        "pred_boxes": Boxes,
        "scores": Tensor,
        "pred_classes": Tensor,
    }
    script_model = scripting_with_instances(model, fields)
    return model, script_model

# Download a sample image.
# wget http://images.cocodataset.org/val2017/000000439715.jpg -q -O input.jpg
img = read_image('./input.jpg')
# Convert the image from HWC to CHW layout.
img = torch.from_numpy(np.ascontiguousarray(img.transpose(2, 0, 1)))

# Run the original and exported models and check that their scores match.
pytorch_model, script_model = load_retinanet("COCO-Detection/retinanet_R_50_FPN_3x.yaml")
with torch.no_grad():
    batched_inputs = [{"image": img.float()}]
    pred1 = pytorch_model(batched_inputs)
    pred2 = script_model(batched_inputs)

assert_allclose(pred1[0]['instances'].scores, pred2[0].scores)
```

Step 2: Use PAI-Blade to optimize the model

Call the blade.optimize method of PAI-Blade.

The following sample code shows how to call the blade.optimize method to optimize the model. For more information about the blade.optimize method, see "Optimize a PyTorch model".

```
import blade

test_data = [(batched_inputs,)]  # For a PyTorch model, the test data is a list of tuples of model inputs.
optimized_model, opt_spec, report = blade.optimize(
    script_model,  # The model in the TorchScript format exported in the previous step.
    'o1',  # The optimization level of PAI-Blade. In this example, the optimization level is o1.
    device_type='gpu',  # The type of the device on which the model is run. In this example, the device type is GPU.
    test_data=test_data,  # The given set of test data, which facilitates optimization and testing.
)
```
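
The shape of `test_data` is easy to get wrong. The sketch below uses a stand-in callable instead of a real model and assumes, as the `(batched_inputs,)` form suggests, that each tuple in the list holds the positional arguments for one call to the model. `fake_model`, `run_test_data`, and `placeholder_inputs` are hypothetical names, not PAI-Blade APIs.

```python
def fake_model(batched_inputs):
    """Stand-in for a detection model: returns one result per input dict."""
    return [{"num_keys": len(item)} for item in batched_inputs]

def run_test_data(model, test_data):
    """Call the model once per tuple, unpacking the tuple as positional arguments."""
    if not isinstance(test_data, list):
        raise TypeError("test_data must be a list")
    outputs = []
    for sample in test_data:
        if not isinstance(sample, tuple):
            raise TypeError("each test_data entry must be a tuple of model inputs")
        outputs.append(model(*sample))
    return outputs

placeholder_inputs = [{"image": "placeholder for a CHW float tensor"}]
test_data = [(placeholder_inputs,)]  # note the trailing comma: a 1-tuple
results = run_test_data(fake_model, test_data)
```

Writing `[placeholder_inputs]` without the tuple would pass a bare list as the sample, which this sketch rejects with a `TypeError`.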

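Display the optimization report and save the optimized model.

The optimized model is still in the TorchScript format. After the optimization is complete, run the following code to display the optimization report and save the optimized model.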

```
# Display the optimization report.
print("Report: {}".format(report))
# Save the optimized model.
torch.jit.save(optimized_model, 'optimized.pt')
```

The following is a sample optimization report. For more information about the fields in the report, see "Optimization report".

```
Report: {
  "software_context": [
    {
      "software": "pytorch",
      "version": "1.8.1+cu102"
    },
    {
      "software": "cuda",
      "version": "10.2.0"
    }
  ],
  "hardware_context": {
    "device_type": "gpu",
    "microarchitecture": "T4"
  },
  "user_config": "",
  "diagnosis": {
    "model": "unnamed.pt",
    "test_data_source": "user provided",
    "shape_variation": "undefined",
    "message": "Unable to deduce model inputs information (data type, shape, value range, etc.)",
    "test_data_info": "0 shape: (3, 480, 640) data type: float32"
  },
  "optimizations": [
    {
      "name": "PtTrtPassFp16",
      "status": "effective",
      "speedup": "3.77",
      "pre_run": "40.64 ms",
      "post_run": "10.78 ms"
    }
  ],
  "overall": {
    "baseline": "40.73 ms",
    "optimized": "10.76 ms",
    "speedup": "3.79"
  },
  "model_info": {
    "input_format": "torch_script"
  },
  "compatibility_list": [
    {
      "device_type": "gpu",
      "microarchitecture": "T4"
    }
  ],
  "model_sdk": {}
}
```
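
To compare runs programmatically, you can pull the key numbers out of the report. This is a sketch that assumes the report text is valid JSON with the fields shown in the sample above. `summarize_report` and `to_ms` are hypothetical helpers, and `SAMPLE_REPORT` is a trimmed copy of that sample.

```python
import json

SAMPLE_REPORT = """
{
  "optimizations": [
    {"name": "PtTrtPassFp16", "status": "effective", "speedup": "3.77",
     "pre_run": "40.64 ms", "post_run": "10.78 ms"}
  ],
  "overall": {"baseline": "40.73 ms", "optimized": "10.76 ms", "speedup": "3.79"}
}
"""

def to_ms(text):
    """Convert a string such as '40.73 ms' to a float number of milliseconds."""
    value, unit = text.split()
    if unit != "ms":
        raise ValueError("unexpected unit: " + unit)
    return float(value)

def summarize_report(report_text):
    """Return the effective passes and the overall latency figures."""
    report = json.loads(report_text)
    overall = report["overall"]
    baseline = to_ms(overall["baseline"])
    optimized = to_ms(overall["optimized"])
    return {
        "effective_passes": [
            item["name"] for item in report["optimizations"]
            if item["status"] == "effective"
        ],
        "baseline_ms": baseline,
        "optimized_ms": optimized,
        # Recomputed from the two latencies as a sanity check on the reported value.
        "speedup": round(baseline / optimized, 2),
    }

summary = summarize_report(SAMPLE_REPORT)
```

For the sample, the recomputed speedup agrees with the `speedup` value in the `overall` section of the report.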

Test the performance of the original and optimized models.
The following sample code shows how to test model performance.
```
import time

@torch.no_grad()
def benchmark(model, inp):
    for i in range(100):
        model(inp)
    torch.cuda.synchronize()
    start = time.time()
    for i in range(200):
        model(inp)
    torch.cuda.synchronize()
    elapsed_ms = (time.time() - start) * 1000
    print("Latency: {:.2f}".format(elapsed_ms / 200))

# Test the latency of the original model. 
benchmark(pytorch_model, batched_inputs)
# Test the latency of the optimized model. 
benchmark(optimized_model, batched_inputs)
```
The following results of this performance test are for reference only:
```
Latency: 42.38
Latency: 10.77
```
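The preceding results show that, after 100 warm-up runs and 200 timed runs of each model, the average latency of the original model is 42.38 ms and that of the optimized model is 10.77 ms, a speedup of roughly 3.9x.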
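
A single mean can hide jitter, so it can help to record per-call timings as well. The sketch below is a framework-agnostic variant of `benchmark` that uses only the standard library. `measure_latency` and `speedup` are hypothetical helpers. The sketch does not synchronize the GPU, so for CUDA models the callable that you pass in should call `torch.cuda.synchronize()` itself before it returns.

```python
import statistics
import time

def measure_latency(fn, warmup=100, runs=200):
    """Time fn() per call and return mean, median, and p95 in milliseconds."""
    for _ in range(warmup):  # warm-up calls are not timed
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
    }

def speedup(baseline_ms, optimized_ms):
    """Ratio of baseline latency to optimized latency."""
    return baseline_ms / optimized_ms
```

With the figures above, `speedup(42.38, 10.77)` is about 3.9. Synchronizing on every call adds some overhead, so per-call numbers may come out slightly higher than the batch average that `benchmark` prints.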

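Step 3: Load and run the optimized model

Optional: During the trial period, add the following environment variable setting to prevent the program from stopping unexpectedly because of an authentication failure.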
```
export BLADE_AUTH_USE_COUNTING=1
```
Get authenticated to use PAI-Blade.
```
export BLADE_REGION=<region>
export BLADE_TOKEN=<token>
```
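Configure the following parameters based on your business requirements:
- <region>: The region in which you use PAI-Blade. To find out the regions in which PAI-Blade is available, join the DingTalk group of PAI-Blade users. For the QR code of the DingTalk group, see "Obtain an access token".
- <token>: The authentication token that is required to use PAI-Blade. To obtain the token, join the DingTalk group of PAI-Blade users. For the QR code of the DingTalk group, see "Obtain an access token".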
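
If the variables are missing, it is more convenient to fail early than to hit an authentication error in the middle of inference. The following sketch checks only that `BLADE_REGION` and `BLADE_TOKEN` are set, not that their values are valid. `missing_blade_env` and `require_blade_env` are hypothetical helpers, not part of PAI-Blade.

```python
import os

REQUIRED_VARS = ("BLADE_REGION", "BLADE_TOKEN")

def missing_blade_env(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]

def require_blade_env(env=None):
    """Raise a clear error before any model is loaded if settings are missing."""
    missing = missing_blade_env(env)
    if missing:
        raise RuntimeError(
            "Set these environment variables first: " + ", ".join(missing)
        )
```

Calling `require_blade_env()` at the top of an inference script reads the current process environment.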

Deploy the model.

The optimized model is still in the TorchScript format, so you can load it without changing the environment.

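```
import blade.runtime.torch  # Import the PAI-Blade runtime before loading the optimized model.
import detectron2
import numpy as np
import torch

from torch.testing import assert_allclose
from detectron2 import model_zoo
from detectron2.data.detection_utils import read_image

# Load the original model for reference and the optimized model saved in Step 2.
pytorch_model = model_zoo.get("COCO-Detection/retinanet_R_50_FPN_3x.yaml", trained=True).eval()
optimized_model = torch.jit.load('optimized.pt')

# Read the sample image and convert it from HWC to CHW layout.
img = read_image('./input.jpg')
img = torch.from_numpy(np.ascontiguousarray(img.transpose(2, 0, 1)))

with torch.no_grad():
    batched_inputs = [{"image": img.float()}]
    pred1 = pytorch_model(batched_inputs)
    pred2 = optimized_model(batched_inputs)

# Compare scores with a looser tolerance than in Step 1.
assert_allclose(pred1[0]['instances'].scores, pred2[0].scores, rtol=1e-3, atol=1e-2)
```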
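
The looser `rtol` and `atol` in the final comparison allow for small numeric differences. The sample report in Step 2 lists an FP16 pass (`PtTrtPassFp16`), and reduced precision typically shifts scores slightly. As a rough illustration in plain Python, the element-wise rule that `torch.testing.assert_allclose` applies is |actual - expected| <= atol + rtol * |expected|. `within_tolerance` is a hypothetical helper, not a PyTorch API, and the scores below are made up.

```python
def within_tolerance(actual, expected, rtol=1e-3, atol=1e-2):
    """Return True if every pair satisfies |a - e| <= atol + rtol * |e|."""
    if len(actual) != len(expected):
        return False
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))

reference_scores = [0.912, 0.554, 0.301]  # made-up scores for illustration
optimized_scores = [0.915, 0.550, 0.309]  # small drift, as reduced precision might cause

close_enough = within_tolerance(optimized_scores, reference_scores)
too_strict = within_tolerance(optimized_scores, reference_scores, rtol=0.0, atol=1e-3)
```

Here `close_enough` is `True` and `too_strict` is `False`: the same drift passes with the tolerances used in Step 3 and fails when the absolute tolerance is tightened to 1e-3.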