Deploy a model service by using Triton Inference Server - Platform for AI

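Triton Inference Server is an open source inference serving engine that streamlines AI inference. It lets you deploy AI models from multiple deep learning and machine learning frameworks, such as TensorRT, TensorFlow, PyTorch, and ONNX, as online inference services. Triton Inference Server also supports multi-model management and provides a backend API for adding custom backends. This topic describes how to deploy a model service in Platform for AI (PAI) by using a Triton Inference Server image.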

Deploy a single-model service

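Create a model directory in an Object Storage Service (OSS) bucket, and arrange the model files and the model configuration file according to the directory format requirements. For more information, see "Manage directories".

Each model directory must contain at least one version subdirectory and one model configuration file.

Version subdirectory: stores the model file. The name of a version subdirectory must be a number that indicates the model version. A larger number indicates a newer model version.
Model configuration file: stores basic information about the model. In most cases, this file is named config.pbtxt.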

For example, the model is stored in the oss://examplebucket/models/triton/ directory, which has the following structure:

triton
└── resnet50_pt
    ├── 1
    │   └── model.pt
    ├── 2
    │   └── model.pt
    ├── 3
    │   └── model.pt
    └── config.pbtxt
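
The layout rules above can be checked before you upload a model directory. The following is a minimal sketch, not part of PAI or Triton: the function name `check_model_dir` is hypothetical, and it assumes the configuration file uses the common name config.pbtxt. It builds the example layout in a temporary directory and verifies it.

```python
import tempfile
from pathlib import Path
from typing import List


def check_model_dir(model_dir: Path) -> List[int]:
    """Return the sorted version numbers found in model_dir.

    Raises ValueError if the directory has no config.pbtxt file or
    no version subdirectory with a numeric name.
    """
    if not (model_dir / "config.pbtxt").is_file():
        raise ValueError(f"{model_dir.name}: config.pbtxt is missing")
    # Only subdirectories whose names are numbers count as versions.
    versions = sorted(
        int(p.name) for p in model_dir.iterdir() if p.is_dir() and p.name.isdecimal()
    )
    if not versions:
        raise ValueError(f"{model_dir.name}: no numeric version subdirectory")
    return versions


# Recreate the example layout locally and check it.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "triton" / "resnet50_pt"
    for version in ("1", "2", "3"):
        (model_dir / version).mkdir(parents=True)
        (model_dir / version / "model.pt").touch()
    (model_dir / "config.pbtxt").touch()
    found_versions = check_model_dir(model_dir)
    print(found_versions)
```

The largest number in the returned list corresponds to the newest model version.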

The config.pbtxt file specifies the model configuration. Example:

name: "resnet50_pt"
platform: "pytorch_libtorch"
max_batch_size: 128
input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 3, -1, -1 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

# Use GPU resources for inference.
# instance_group [
#   { 
#     kind: KIND_GPU
#   }
# ]

# Specify the version policy of the model.
# version_policy: { all { }}
# version_policy: { latest: { num_versions: 2}}
# version_policy: { specific: { versions: [1,3]}}

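The following table describes the key parameters in the config.pbtxt file.

Parameter	Required	Description
name	No	The name of the model. Default value: the name of the model directory. If you specify this parameter, the value must match the name of the model directory.
platform/backend	Yes	Specify at least one of the following parameters. platform: the framework that is used to train the model. Common values: tensorrt_plan, onnxruntime_onnx, pytorch_libtorch, tensorflow_savedmodel, and tensorflow_graphdef. backend: the backend that is used to run the model. You can set this parameter in one of two ways. (1) Specify the framework that is used to train the model. Common values: tensorrt, onnxruntime, pytorch, and tensorflow. Note that for the same framework, the valid values of platform and backend differ. (2) Specify the name of a backend that implements custom inference logic in Python code. For more information, see "Use a custom backend".
max_batch_size	Yes	The maximum batch size that the model supports, that is, the maximum number of requests that the model can process at a time. A value of 0 disables batching.
input	Yes	Contains the following properties. name: the name of the input data. data_type: the data type. dims: the dimensions of the data.
output	Yes	Contains the following properties. name: the name of the output data. data_type: the data type. dims: the dimensions of the data.
instance_group	No	The computing resources that are used to run the model. If GPU resources are available, the model automatically uses them for inference. Otherwise, the model uses CPU resources. To specify the computing resources explicitly, use the following format: `instance_group [ { kind: KIND_GPU } ]` Valid values of the kind property: KIND_GPU and KIND_CPU.
version_policy	No	The version policy of the model. Sample settings for the resnet50_pt model: `version_policy: { all { }} version_policy: { latest: { num_versions: 2}} version_policy: { specific: { versions: [1,3]}}` If you leave this parameter empty, only the latest version of the model is loaded. For example, version 3 of the resnet50_pt model is loaded. all{}: loads all versions of the model. In the preceding example, versions 1, 2, and 3 of the resnet50_pt model are loaded. latest{num_versions:}: loads the latest n versions, where n is the value of num_versions. In the preceding example, `num_versions: 2` loads the latest two versions (versions 2 and 3) of the resnet50_pt model. specific{versions:[]}: loads the specified versions. In the preceding example, versions 1 and 3 of the resnet50_pt model are loaded.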
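
The version selection rules in the table can be restated as a small function. This is only an illustration of the described behavior, not Triton's implementation: the function name `select_versions` and the representation of the policy as a Python dictionary are assumptions made for this sketch.

```python
from typing import Dict, List, Optional


def select_versions(available: List[int], policy: Optional[Dict] = None) -> List[int]:
    """Return the versions that would be loaded under the given policy."""
    ordered = sorted(available)
    if not policy:
        # Empty policy: only the latest version is loaded.
        return ordered[-1:]
    if "all" in policy:
        return ordered
    if "latest" in policy:
        n = policy["latest"].get("num_versions", 1)
        return ordered[-n:] if n > 0 else []
    if "specific" in policy:
        wanted = set(policy["specific"]["versions"])
        return [v for v in ordered if v in wanted]
    raise ValueError(f"unknown version policy: {policy!r}")


# The resnet50_pt example has version subdirectories 1, 2, and 3.
versions = [1, 2, 3]
print(select_versions(versions))                                    # default
print(select_versions(versions, {"all": {}}))                       # all versions
print(select_versions(versions, {"latest": {"num_versions": 2}}))   # latest two
print(select_versions(versions, {"specific": {"versions": [1, 3]}}))
```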

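Deploy the Triton Inference Server service.

Triton Inference Server supports the following two ports. By default, scenario-based model deployment uses port 8000. To use port 8001, also perform the optional step in the following procedure that converts the deployment to a custom deployment. Otherwise, skip that step.

8000: An HTTP server is started on port 8000 to receive HTTP requests.
8001: A gRPC server is started on port 8001 to receive gRPC requests.

Perform the following steps: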

Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click [Enter Elastic Algorithm Service (EAS)].
On the [Elastic Algorithm Service (EAS)] page, click [Deploy Service]. In the [Scenario-based Model Deployment] section, click [Triton Deployment].

On the [Triton Deployment] page, configure the following parameters. For information about other parameters, see "Deploy a model service in the PAI console".

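Parameter	Description
Service Name	The name of the service.
Model Settings	In this example, select [OSS] for [Type] and set the OSS path to the directory that stores the model files prepared in Step 1. Example: `oss://examplebucket/models/triton/`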

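(Optional) Click [Convert to Custom Deployment] in the upper-right corner of the page. In the [Environment Information] section, change [Port Number] to 8001. In the [Service Configuration] section, add the following settings.
Note
By default, the service starts an HTTP server on port 8000 to receive HTTP requests. To receive gRPC requests, you must change the port number to 8001. The system then starts a gRPC server on port 8001.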
```
"metadata": {
    "enable_http2": true
},
"networking": {
    "path": "/"
}
```
After you configure the parameters, click [Deploy].
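
If you keep the service configuration as a JSON document instead of editing it in the console, the gRPC changes described above can be applied programmatically. The following is a rough sketch under stated assumptions: the helper name `with_grpc_settings` is hypothetical, and it assumes the port is stored in `containers[].port`, as in the JSON sample in the custom backend section of this topic.

```python
import copy
import json


def with_grpc_settings(service_config: dict) -> dict:
    """Return a copy of service_config adjusted to serve gRPC on port 8001."""
    config = copy.deepcopy(service_config)
    # Settings from the optional step: enable HTTP/2 and set the path to "/".
    config.setdefault("metadata", {})["enable_http2"] = True
    config.setdefault("networking", {})["path"] = "/"
    # Assumption: the listening port is defined per container.
    for container in config.get("containers", []):
        container["port"] = 8001
    return config


http_config = {
    "metadata": {"name": "triton_server_test", "instance": 1},
    "containers": [{"command": "tritonserver --model-repository=/models", "port": 8000}],
}
grpc_config = with_grpc_settings(http_config)
print(json.dumps(grpc_config, indent=2))
```

The original dictionary is left unchanged, so the HTTP and gRPC variants of the configuration can be kept side by side.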

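Deploy a multi-model service

Deploying a multi-model service in Elastic Algorithm Service (EAS) works the same way as deploying a single-model service. The only difference is that you create a directory for each model, as shown in the following example. The service loads all models and deploys them in the same service. For more information, see "Deploy a single-model service".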

triton
├── resnet50_pt
│   ├── 1
│   │   └── model.pt
│   └── config.pbtxt
├── densenet_onnx
│   ├── 1
│   │   └── model.onnx
│   └── config.pbtxt
└── mnist_savedmodel
    ├── 1
    │   └── model.savedmodel
    │       ├── saved_model.pb
    │       └── variables
    │           ├── variables.data-00000-of-00001
    │           └── variables.index
    └── config.pbtxt

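Use a custom backend

A Triton backend implements the inference process of a model. A backend can wrap an existing framework, such as TensorRT, ONNX Runtime, PyTorch, or TensorFlow, or it can implement custom inference logic, such as preprocessing and postprocessing operations.

You can implement a backend in C++ or Python. Python is more flexible and convenient than C++. This section describes how to implement a Python backend.

Modify the structure of the model directory.

The following example shows the directory structure required for a PyTorch model: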

resnet50_pt
├── 1
│   ├── model.pt
│   └── model.py
└── config.pbtxt

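To use a custom backend, add a model.py file to the subdirectory that represents the model version, and modify the config.pbtxt file.

Add the model.py file.

The model.py file contains the custom inference logic. You must define a class named TritonPythonModel and implement the initialize, execute, and finalize functions based on your business requirements. Sample code: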

import json
import os
import torch
from torch.utils.dlpack import from_dlpack, to_dlpack

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """The class name must be TritonPythonModel."""

    def initialize(self, args):
        """
        The initialize() function is optional. It is called only during the model loading process to initialize information about the model, such as model properties and configurations. 
        Parameters
        ----------
        args: a dictionary that stores data as key-value pairs. The keys and values are of the string type. Valid keys:
          * model_config: the model configurations in the JSON format. 
          * model_instance_kind: the type of the device that is used to run the model. 
          * model_instance_device_id: the ID of the device that is used to run the model. 
          * model_repository: the path of the model repository. 
          * model_version: the version of the model. 
          * model_name: the name of the model. 
        """

        # Convert the JSON string that specifies the model configurations into a Python dictionary. 
        self.model_config = model_config = json.loads(args["model_config"])

        # Extract the properties from the model configuration file. 
        output_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT__0")

        # Convert Triton types into NumPy types. 
        self.output_dtype = pb_utils.triton_string_to_numpy(output_config["data_type"])

        # Obtain the directory that contains this model.py file (the version subdirectory). 
        self.model_directory = os.path.dirname(os.path.realpath(__file__))

        # Select the device that runs the model: a GPU if one is available, otherwise the CPU. 
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        print("device: ", self.device)

        model_path = os.path.join(self.model_directory, "model.pt")
        if not os.path.exists(model_path):
            raise pb_utils.TritonModelException("Cannot find the pytorch model")
        # Use .to(self.device) to load the PyTorch model to a GPU device. 
        self.model = torch.jit.load(model_path).to(self.device)

        print("Initialized...")

    def execute(self, requests):
        """
        The execute() function is required. It is called each time the model receives an inference request. If you want the model to support batch processing, you must add the batch processing logic in the execute() function.
        Parameters
        ----------
        requests: a list of requests. Each request is of the pb_utils.InferenceRequest type. 

        Returns
        -------
        A list of responses. Each response is of the pb_utils.InferenceResponse type. The length of the response list must be the same as the length of the request list. 
        """

        output_dtype = self.output_dtype

        responses = []

        # Traverse the request list and create a response for each request. 
        for request in requests:
            # Obtain the input tensor. 
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT__0")
            # Convert a Triton tensor into a PyTorch tensor. 
            pytorch_tensor = from_dlpack(input_tensor.to_dlpack())

            if pytorch_tensor.shape[2] > 1000 or pytorch_tensor.shape[3] > 1000:
                responses.append(
                    pb_utils.InferenceResponse(
                        output_tensors=[],
                        error=pb_utils.TritonError(
                            "Image shape should not be larger than 1000"
                        ),
                    )
                )
                continue

            # Perform inference on the GPU device. 
            prediction = self.model(pytorch_tensor.to(self.device))

            # Convert the PyTorch output tensor into a Triton tensor. 
            out_tensor = pb_utils.Tensor.from_dlpack("OUTPUT__0", to_dlpack(prediction))

            inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor])
            responses.append(inference_response)

        return responses

    def finalize(self):
        """
        The finalize() function is optional. It is called when the model is unloaded to release resources. 
        """
        print("Cleaning up...")

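Important

To run inference on a GPU device, do not use the instance_group.kind property in the config.pbtxt file. Instead, call model.to(torch.device("cuda")) to load the model to the GPU device. When the model receives a request, call pytorch_tensor.to(torch.device("cuda")) to move the input tensor to the GPU device. This way, inference runs on the GPU device after you configure GPU resources for the model deployment.
If you want the model to support batch processing, do not use the max_batch_size parameter in the config.pbtxt file. Instead, implement the batch processing logic in the execute() function.
Each request must correspond to exactly one response.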
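
One common way to implement batch processing logic inside execute() is to concatenate the inputs of all requests, run the model once, and split the result so that each request still receives exactly one response. The following is a framework-agnostic sketch of that pattern in NumPy, not code from the sample above: `run_batched` and `fake_model` are hypothetical names, and the sketch assumes that all inputs have the same shape apart from the first (batch) dimension.

```python
import numpy as np


def run_batched(request_inputs, model_fn):
    """Run model_fn once over all request inputs and return one output per request.

    request_inputs: list of arrays shaped (n_i, ...), one array per request.
    model_fn: callable that maps a batch (N, ...) to outputs (N, ...).
    """
    sizes = [arr.shape[0] for arr in request_inputs]
    batch = np.concatenate(request_inputs, axis=0)
    batch_output = model_fn(batch)
    # Split the batched output back at the request boundaries.
    split_points = np.cumsum(sizes)[:-1]
    return np.split(batch_output, split_points, axis=0)


def fake_model(batch):
    """Stand-in for a real model: sums every value of each sample."""
    return batch.reshape(batch.shape[0], -1).sum(axis=1, keepdims=True)


# Three requests that carry 2, 1, and 3 samples of shape (3, 4, 4).
inputs = [np.ones((n, 3, 4, 4), dtype=np.float32) for n in (2, 1, 3)]
outputs = run_batched(inputs, fake_model)
print([out.shape for out in outputs])
```

In a real execute() function, each element of the returned list would be wrapped in its own InferenceResponse, which keeps the response list the same length as the request list.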

Modify the config.pbtxt file.
Sample configuration:
```
name: "resnet50_pt"
backend: "python"
max_batch_size: 128
input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 3, -1, -1 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

parameters: {
    key: "FORCE_CPU_ONLY_INPUT_TENSORS"
    value: {string_value: "no"}
}
```
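Modify the following parameters and keep the settings of the other parameters.
- backend: Set the value to python.
- parameters: This parameter is optional. If the model runs inference on a GPU device, set FORCE_CPU_ONLY_INPUT_TENSORS to no.

Deploy the model.
If you use a Python backend, you must configure shared memory. You can use the following configuration to create a model service that contains custom inference logic. For information about how to deploy a model service by using a client, see "Deploy a model service by using EASCMD or DSW".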
```
{
  "metadata": {
    "name": "triton_server_test",
    "instance": 1
  },
  "cloud": {
    "computing": {
      "instance_type": "ml.gu7i.c8m30.1-gu30",
      "instances": null
    }
  },
  "containers": [
    {
      "command": "tritonserver --model-repository=/models",
      "image": "eas-registry-vpc.<region>.cr.aliyuncs.com/pai-eas/tritonserver:23.02-py3",
      "port": 8000,
      "prepare": {
        "pythonRequirements": [
          "torch==2.0.1"
        ]
      }
    }
  ],
  "storage": [
    {
      "mount_path": "/models",
      "oss": {
        "path": "oss://oss-test/models/triton_backend/"
      }
    },
    {
      "empty_dir": {
        "medium": "memory",
        // Set the shared memory to 1 GB. 
        "size_limit": 1
      },
      "mount_path": "/dev/shm"
    }
  ]
}
```
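Take note of the following parameters.
- name: the name of the model service that contains the custom logic.
- storage.oss.path: the OSS path of the model directory.
- containers.image: the image that is used for deployment. Replace <region> with the ID of the current region. For example, specify cn-shanghai for the China (Shanghai) region.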

Call the service

You can configure a client to send inference requests to the deployed service.

Send HTTP requests

If the port number is set to 8000, you can send HTTP requests to the service. Sample Python code:

import numpy as np
import tritonclient.http as httpclient

# url specifies the endpoint that is used to access the service that you deployed in EAS. 
url = '1859257******.cn-hangzhou.pai-eas.aliyuncs.com/api/predict/triton_server_test'

triton_client = httpclient.InferenceServerClient(url=url)

image = np.ones((1,3,224,224))
image = image.astype(np.float32)

inputs = []
inputs.append(httpclient.InferInput('INPUT__0', image.shape, "FP32"))
inputs[0].set_data_from_numpy(image, binary_data=False)
outputs = []
outputs.append(httpclient.InferRequestedOutput('OUTPUT__0', binary_data=False)) # Obtain a 1000-dimensional vector.

# Specify the name, token, input, and output of the model. 
results = triton_client.infer(
    model_name="<model_name>",
    model_version="<version_num>",
    inputs=inputs,
    outputs=outputs,
    headers={"Authorization": "<test-token>"},
)
output_data0 = results.as_numpy('OUTPUT__0')
print(output_data0.shape)
print(output_data0)

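The following table describes the key parameters.

Parameter	Description
url	The endpoint of the service, without the `http://` prefix. To obtain the endpoint, perform the following steps: Go to the [Elastic Algorithm Service (EAS)] page, find the service, and click the service name. On the [Service Details] tab of the page that appears, click [View Endpoint Information]. On the [Public Endpoint] tab of the [Invocation Method] dialog box, view the public endpoint.
model_name	The name of the model. Example: resnet50_pt
model_version	The model version to use. A request can be sent to only one version of a model at a time.
headers	The token of the service, passed in the Authorization header. Replace <test-token> with the token of your service. You can view the token on the [Public Endpoint] tab.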
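
With binary_data=False, the client sends the tensor as JSON. The following sketch builds a request body of that shape by hand, following the inference protocol that Triton implements (inputs with name, shape, datatype, and data). It is an illustration only: the helper name `build_infer_payload` is hypothetical, only a few data types are mapped, and the assumption that the body would be posted to `<url>/v2/models/<model_name>/versions/<version_num>/infer` behind an EAS endpoint is not confirmed by this topic.

```python
import json

import numpy as np

# Partial mapping from NumPy dtype names to Triton data type strings.
_NP_TO_TRITON = {"float16": "FP16", "float32": "FP32", "int32": "INT32", "int64": "INT64"}


def build_infer_payload(input_name: str, array: np.ndarray, output_name: str) -> dict:
    """Build a JSON-serializable inference request body for one input and one output."""
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": list(array.shape),
                "datatype": _NP_TO_TRITON[array.dtype.name],
                # The tensor data is sent as a flat list in row-major order.
                "data": array.flatten().tolist(),
            }
        ],
        "outputs": [{"name": output_name}],
    }


# A tiny tensor keeps the printed body short; a real request would use (1, 3, 224, 224).
image = np.ones((1, 3, 2, 2), dtype=np.float32)
payload = build_infer_payload("INPUT__0", image, "OUTPUT__0")
print(json.dumps(payload))
```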

Send gRPC requests

If the port number is set to 8001 and the required settings are added, you can send gRPC requests to the service. Sample Python code:

#!/usr/bin/env python
import grpc
from tritonclient.grpc import service_pb2, service_pb2_grpc
import numpy as np

if __name__ == "__main__":
    # Define the endpoint of the service. 
    host = (
        "service_name.115770327099****.cn-beijing.pai-eas.aliyuncs.com:80"
    )
    # Replace test-token with the token of your service. 
    token = "test-token"
    # Specify the model name and version. 
    model_name = "resnet50_pt"
    model_version = "1"
    
    # Create gRPC metadata for token verification. 
    metadata = (("authorization", token),)

    # Create a gRPC channel and a gRPC stub to communicate with the server. 
    channel = grpc.insecure_channel(host)
    grpc_stub = service_pb2_grpc.GRPCInferenceServiceStub(channel)
    
    # Create an inference request. 
    request = service_pb2.ModelInferRequest()
    request.model_name = model_name
    request.model_version = model_version
    
    # Construct the input tensor based on the input parameter that you specify in the model configuration file. 
    infer_input = service_pb2.ModelInferRequest().InferInputTensor()
    infer_input.name = "INPUT__0"
    infer_input.datatype = "FP32"
    infer_input.shape.extend([1, 3, 224, 224])
    # Construct the output tensor based on the output parameter that you specify in the model configuration file. 
    infer_output = service_pb2.ModelInferRequest().InferRequestedOutputTensor()
    infer_output.name = "OUTPUT__0"
    
    # Add the input and output tensors to the request. 
    request.inputs.extend([infer_input])
    request.outputs.extend([infer_output])
    # Create the input data by constructing a random array and serializing the array into a sequence of bytes. 
    request.raw_input_contents.append(np.random.rand(1, 3, 224, 224).astype(np.float32).tobytes())  # An array of floating-point numbers
        
    # Send an inference request and receive the response. 
    response, _ = grpc_stub.ModelInfer.with_call(request, metadata=metadata)
    
    # Extract the output tensor from the response. 
    output_contents = response.raw_output_contents[0]  # For example, only one output tensor is returned. 
    output_shape = [1, 1000]  # For example, the shape of the output tensor is [1, 1000]. 
    
    # Convert the output bytes into a NumPy array. 
    output_array = np.frombuffer(output_contents, dtype=np.float32)
    output_array = output_array.reshape(output_shape)
    
    # Print the output of the model. 
    print("Model output:\n", output_array)

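The following table describes the key parameters.

Parameter	Description
host	The endpoint of the service, without the `http://` prefix and with the `:80` suffix. To obtain the endpoint, perform the following steps: Go to the [Elastic Algorithm Service (EAS)] page, find the service, and click the service name. On the [Service Details] tab of the page that appears, click [View Endpoint Information]. On the [Public Endpoint] tab of the [Invocation Method] dialog box, view the public endpoint.
token	The token of the service. Replace test-token with the token of your service. You can view the token on the [Public Endpoint] tab.
model_name	The name of the model. Example: resnet50_pt
model_version	The model version to use. A request can be sent to only one version of a model at a time.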
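
The gRPC sample hard-codes the output shape and data type. A slightly more defensive way to decode the raw bytes is to take the data type and shape as arguments and check the byte count first. The following is a minimal sketch: the helper name `decode_raw_output` is hypothetical and only a few data types are mapped. In the sample above, the arguments could presumably be taken from response.outputs[0].datatype and response.outputs[0].shape.

```python
import numpy as np

# Partial mapping from Triton data type strings to NumPy dtypes.
_TRITON_TO_NP = {
    "FP16": np.float16,
    "FP32": np.float32,
    "FP64": np.float64,
    "INT32": np.int32,
    "INT64": np.int64,
}


def decode_raw_output(raw: bytes, datatype: str, shape) -> np.ndarray:
    """Convert raw output bytes into a NumPy array of the given type and shape."""
    dtype = np.dtype(_TRITON_TO_NP[datatype])
    expected_bytes = int(np.prod(shape)) * dtype.itemsize
    if len(raw) != expected_bytes:
        raise ValueError(f"expected {expected_bytes} bytes, got {len(raw)}")
    return np.frombuffer(raw, dtype=dtype).reshape(shape)


# Simulate a response buffer instead of calling a real service.
original = np.arange(6, dtype=np.float32).reshape(2, 3)
decoded = decode_raw_output(original.tobytes(), "FP32", [2, 3])
print(decoded)
```

The length check turns a silent reshape error or a misread buffer into an explicit message when the model output does not match the expected shape.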
