Platform for AI: Deploy a Triton Inference Server service

Last Updated: Jan 14, 2026

Triton Inference Server, developed by NVIDIA, is an inference serving engine for deep learning and machine learning models. It supports deploying models from various AI frameworks, such as TensorRT, TensorFlow, PyTorch, and ONNX, as online inference services. It also supports features such as multi-model management and custom backends. This topic describes how to deploy a Triton-based inference service on PAI-EAS.

Prerequisites

  • An Object Storage Service (OSS) bucket in the same region as PAI.

  • A trained model file, such as .pt, .onnx, .plan, or .savedmodel.

Getting started: Deploy a single-model service

Step 1: Prepare the model repository

Triton requires a specific directory structure in an Object Storage Service (OSS) bucket. Create the directories in the following format. For more information, see Manage directories and Upload files.

oss://your-bucket/models/triton/
└── your_model_name/
    ├── 1/                    # Version directory (must be a number)
    │   └── model.pt          # Model file
    └── config.pbtxt          # Model configuration file

Key requirements:

  • Version directories must be named with numbers, such as 1, 2, or 3.

  • A higher number indicates a newer version.

  • Each model requires a config.pbtxt configuration file.
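
If you prefer to script the upload, the same directory structure can be created with the OSS Python SDK (oss2). The following is a minimal sketch; the credentials, endpoint, bucket name, and local file paths are placeholders that you must replace.

# Minimal sketch: upload a Triton model repository to OSS with the oss2 SDK.
# All credentials, endpoints, and paths below are placeholders.
import oss2

auth = oss2.Auth("<your-access-key-id>", "<your-access-key-secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "your-bucket")

# config.pbtxt sits next to the numbered version directory.
bucket.put_object_from_file(
    "models/triton/your_model_name/config.pbtxt", "local/config.pbtxt"
)

# The model file goes inside a numbered version directory, for example, 1/.
bucket.put_object_from_file(
    "models/triton/your_model_name/1/model.pt", "local/model.pt"
)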

Step 2: Create the model configuration file

Create a config.pbtxt file to configure the basic information for the model. The following is an example:

name: "your_model_name"
platform: "pytorch_libtorch"
max_batch_size: 128

input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 3, -1, -1 ]
  }
]

output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]


# Use a GPU for inference
# instance_group [
#   { 
#     kind: KIND_GPU
#   }
# ]

# Model version configuration
# Load only the latest version (default behavior)
# version_policy: { latest: { num_versions: 1 }}

# Load all versions
# version_policy: { all { }}

# Load the two latest versions
# version_policy: { latest: { num_versions: 2 }}

# Load specified versions
# version_policy: { specific: { versions: [1, 3] }}

Parameter descriptions

| Parameter | Required | Description |
| --- | --- | --- |
| name | No | The model name. If specified, it must match the name of the model directory. |
| platform | One of platform or backend | The model framework. Valid values: pytorch_libtorch, tensorflow_savedmodel, tensorflow_graphdef, tensorrt_plan, onnxruntime_onnx. |
| backend | One of platform or backend | An alternative to platform. Set it to python to customize the inference logic with the Python Backend. |
| max_batch_size | Yes | The maximum batch size. Set to 0 to disable batching. |
| input | Yes | The input tensor configuration: name, data_type, and dims. |
| output | Yes | The output tensor configuration: name, data_type, and dims. |
| instance_group | No | The inference device: KIND_GPU or KIND_CPU. For a configuration example, see the config.pbtxt file above. |
| version_policy | No | Controls which model versions are loaded. For a configuration example, see the config.pbtxt file above. |

Important

You must configure at least one of platform or backend.

Step 3: Deploy the service

  1. Log on to the PAI console. In the top navigation bar, select the destination region.

  2. In the left navigation pane, click Elastic Algorithm Service (EAS). Select the target workspace and click Deploy Service.

  3. In the Scenario-based Model Deployment section, click Triton Deployment.

  4. Configure the deployment parameters:

    • Service Name: Enter a custom service name.

    • Model Configuration: Set the Configuration Type to OSS and enter the model repository path, such as oss://your-bucket/models/triton/.

    • Select values for Instance Count and Resource Type as needed. To estimate the required VRAM for model deployment, see Estimate the VRAM required for a large model.

  5. Click Deploy and wait for the service to start.

Step 4: Enable gRPC (optional)

By default, Triton provides an HTTP service on port 8000. To use gRPC, perform the following steps:

  1. In the upper-right corner of the service configuration page, click Convert to Custom Deployment.

  2. In the Environment Information section, change the Port Number to 8001.

  3. Under Features > Advanced Networking, enable gRPC.

  4. Click Deploy.

After the model is deployed, you can call the service.

Deploy a multi-model service

To deploy multiple models in a single Triton instance, place the models in the same repository directory:

oss://your-bucket/models/triton/
├── resnet50_pytorch/
│   ├── 1/
│   │   └── model.pt
│   └── config.pbtxt
├── densenet_onnx/
│   ├── 1/
│   │   └── model.onnx
│   └── config.pbtxt
└── classifier_tensorflow/
    ├── 1/
    │   └── model.savedmodel/
    │       ├── saved_model.pb
    │       └── variables/
    └── config.pbtxt

The deployment procedure is the same as for a single model. Triton automatically loads all models in the repository.
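
After deployment, you can check which models were loaded. The following is a minimal sketch that uses the tritonclient HTTP API; the endpoint, token, and model name are placeholders.

# Minimal sketch: list the models loaded by a multi-model Triton service.
# The endpoint, token, and model name below are placeholders.
import tritonclient.http as httpclient

url = "1859257******.cn-hangzhou.pai-eas.aliyuncs.com/api/predict/triton_server_test"
headers = {"Authorization": "<your-service-token>"}

client = httpclient.InferenceServerClient(url=url)

# One entry per model in the repository, including its load state.
for model in client.get_model_repository_index(headers=headers):
    print(model)

# Check a specific model before sending requests to it.
print(client.is_model_ready("resnet50_pytorch", headers=headers))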

Use the Python Backend to customize inference logic

You can use Triton's Python Backend to customize pre-processing, post-processing, or inference logic.

Directory structure

your_model_name/
├── 1/
│   ├── model.pt          # Model file
│   └── model.py          # Custom inference logic
└── config.pbtxt

Implement the Python Backend

Create a model.py file and define the TritonPythonModel class:

import json
import os
import torch
from torch.utils.dlpack import from_dlpack, to_dlpack

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """The class name must be "TritonPythonModel"."""

    def initialize(self, args):
        """
        The initializer function. This is optional. It is called once when the model is loaded and can be used to initialize information related to model properties and configurations.
        Parameters
        ----------
        args : A dictionary where both keys and values are strings. It includes:
          * model_config: Model configuration information in JSON format.
          * model_instance_kind: The device model.
          * model_instance_device_id: The device ID.
          * model_repository: The model repository path.
          * model_version: The model version.
          * model_name: The model name.
        """

        # Convert the model configuration content from a JSON string to a Python dictionary.
        self.model_config = model_config = json.loads(args["model_config"])

        # Get the properties from the model configuration file.
        output_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT__0")

        # Convert Triton types to numpy types.
        self.output_dtype = pb_utils.triton_string_to_numpy(output_config["data_type"])

        # Get the path of the model repository.
        self.model_directory = os.path.dirname(os.path.realpath(__file__))

        # Get the device used for model inference. This example uses a GPU.
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        print("device: ", self.device)

        model_path = os.path.join(self.model_directory, "model.pt")
        if not os.path.exists(model_path):
            raise pb_utils.TritonModelException("Cannot find the pytorch model")
        # Load the PyTorch model to the GPU using .to(self.device).
        self.model = torch.jit.load(model_path).to(self.device)

        print("Initialized...")

    def execute(self, requests):
        """
        The model execution function. This must be implemented. This function is called for every inference request. If the batch parameter is set, you must also implement the batch processing feature yourself.
        Parameters
        ----------
        requests : A list of requests of the pb_utils.InferenceRequest type.

        Returns
        -------
        A list of responses of the pb_utils.InferenceResponse type. The length of the list must be the same as the length of the request list.
        """

        output_dtype = self.output_dtype

        responses = []

        # Traverse the request list and create a corresponding response for each request.
        for request in requests:
            # Get the input tensor.
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT__0")
            # Convert the Triton tensor to a Torch tensor.
            pytorch_tensor = from_dlpack(input_tensor.to_dlpack())

            if pytorch_tensor.shape[2] > 1000 or pytorch_tensor.shape[3] > 1000:
                responses.append(
                    pb_utils.InferenceResponse(
                        output_tensors=[],
                        error=pb_utils.TritonError(
                            "Image shape should not be larger than 1000"
                        ),
                    )
                )
                continue

            # Perform inference computation on the GPU.
            prediction = self.model(pytorch_tensor.to(self.device))

            # Convert the Torch output tensor to a Triton tensor.
            out_tensor = pb_utils.Tensor.from_dlpack("OUTPUT__0", to_dlpack(prediction))

            inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor])
            responses.append(inference_response)

        return responses

    def finalize(self):
        """
        Called when the model is uninstalled. This is optional and can be used for model cleanup tasks.
        """
        print("Cleaning up...")

Important

Note that when you use the Python Backend, some of Triton's behaviors change:

  • max_batch_size has no effect: The max_batch_size parameter in config.pbtxt does not enable dynamic batching in the Python Backend. You must iterate through the requests list in the execute method and build the batch for inference yourself, as shown in the sketch after this list.

  • instance_group has no effect: The instance_group in config.pbtxt does not control whether the Python Backend uses a CPU or a GPU. You must explicitly move the model and data to the target device in the initialize and execute methods using code, such as pytorch_tensor.to(torch.device("cuda")).
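
The following sketch shows one way to implement the manual batching described above: it concatenates the inputs from all pending requests, runs a single forward pass, and splits the result back into one response per request. It is a sketch only; it relies on the imports and the initialize method of the model.py example above, assumes every request has the same input shape apart from the batch dimension, and omits the per-request error handling shown earlier.

    # Sketch of an alternative execute method with manual batching.
    # Assumes all requests share the same non-batch input shape.
    def execute(self, requests):
        batched_inputs = []
        batch_sizes = []
        for request in requests:
            triton_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT__0")
            torch_tensor = from_dlpack(triton_tensor.to_dlpack()).to(self.device)
            batched_inputs.append(torch_tensor)
            batch_sizes.append(torch_tensor.shape[0])

        # Run one forward pass over the concatenated batch.
        batch = torch.cat(batched_inputs, dim=0)
        with torch.no_grad():
            predictions = self.model(batch)

        # Split the batched output back into one response per request.
        responses = []
        offset = 0
        for size in batch_sizes:
            out = predictions[offset:offset + size].contiguous()
            offset += size
            out_tensor = pb_utils.Tensor.from_dlpack("OUTPUT__0", to_dlpack(out))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses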

Update the configuration file

name: "resnet50_pt"
backend: "python"
max_batch_size: 128
input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 3, -1, -1 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

parameters: {
    key: "FORCE_CPU_ONLY_INPUT_TENSORS"
    value: {string_value: "no"}
}

The key parameters are described as follows:

  • backend: Must be set to `python`.

  • parameters: Optional. If you use a GPU for inference, set the FORCE_CPU_ONLY_INPUT_TENSORS parameter to `no` to avoid the overhead of copying input tensors between the CPU and GPU.

Deploy the service

To use the Python backend, you must configure shared memory. In the Custom Model Deployment > JSON On Premises Deployment section, enter the following JSON configuration and deploy the service to implement custom inference logic.

{
  "metadata": {
    "name": "triton_server_test",
    "instance": 1
  },
  "cloud": {
        "computing": {
            "instance_type": "ml.gu7i.c8m30.1-gu30",
            "instances": null
        }
    },
  "containers": [
    {
      "command": "tritonserver --model-repository=/models",
      "image": "eas-registry-vpc.<region>.cr.aliyuncs.com/pai-eas/tritonserver:23.02-py3",
      "port": 8000,
      "prepare": {
        "pythonRequirements": [
          "torch==2.0.1"
        ]
      }
    }
  ],
  "storage": [
    {
      "mount_path": "/models",
      "oss": {
        "path": "oss://oss-test/models/triton_backend/"
      }
    },
    {
      "empty_dir": {
        "medium": "memory",
        "size_limit": 1
      },
      "mount_path": "/dev/shm"
    }
  ]
}

Key JSON configuration descriptions:

  • containers[0].image: The Triton image. Replace <region> with the ID of the region where your service is deployed, for example, cn-hangzhou.

  • containers[0].prepare.pythonRequirements: List your Python dependencies here. EAS automatically installs them before the service starts.

  • storage: Contains two mount items.

    • The first mounts your OSS model repository path to the /models directory in the container.

    • The second is the required shared memory configuration. The Triton Server and the Python Backend process use shared memory at /dev/shm for zero-copy tensor data transfer to maximize performance. The unit for size_limit is GB. Estimate the required size based on your model and concurrency.

Call the service

Get the service endpoint and token

  1. On the Elastic Algorithm Service (EAS) page, click the service name.

  2. On the Service Details tab, click View Endpoint Information. Copy the Internet Endpoint and Token.
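
Before sending inference requests, you can optionally check that the server is ready. Triton exposes the standard v2 health endpoint, which EAS forwards under the service path. The following is a minimal sketch that uses the requests library; the endpoint and token are placeholders.

# Minimal sketch: probe the Triton readiness endpoint through EAS.
# The endpoint and token below are placeholders.
import requests

endpoint = "http://1859257******.cn-hangzhou.pai-eas.aliyuncs.com/api/predict/triton_server_test"
token = "<your-service-token>"

resp = requests.get(endpoint + "/v2/health/ready", headers={"Authorization": token})
# A 200 status code indicates that the server is ready to accept requests.
print(resp.status_code)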

Send an HTTP request

If the service port is set to 8000, the service accepts HTTP requests.

import numpy as np
# To install the tritonclient package, run the command: pip install tritonclient
import tritonclient.http as httpclient

# The endpoint generated after the service is deployed. Do not include http://
url = '1859257******.cn-hangzhou.pai-eas.aliyuncs.com/api/predict/triton_server_test'

triton_client = httpclient.InferenceServerClient(url=url)

image = np.ones((1,3,224,224))
image = image.astype(np.float32)

inputs = []
inputs.append(httpclient.InferInput('INPUT__0', image.shape, "FP32"))
inputs[0].set_data_from_numpy(image, binary_data=False)
outputs = []
outputs.append(httpclient.InferRequestedOutput('OUTPUT__0', binary_data=False))  # Get a 1000-dimensional vector

# Specify the model name, request token, inputs, and outputs.
results = triton_client.infer(
    model_name="<your-model-name>",
    model_version="<version-num>",
    inputs=inputs,
    outputs=outputs,
    headers={"Authorization": "<your-service-token>"},
)
output_data0 = results.as_numpy('OUTPUT__0')
print(output_data0.shape)
print(output_data0)

Send a gRPC request

If the service port is set to 8001 and gRPC is enabled, the service accepts gRPC requests.

Important

The gRPC endpoint is different from the HTTP endpoint. Obtain it again from the service details page.

#!/usr/bin/env python
import grpc
# To install the tritonclient package, run the command: pip install tritonclient
from tritonclient.grpc import service_pb2, service_pb2_grpc
import numpy as np

if __name__ == "__main__":
    # The endpoint generated after the service is deployed. Do not include http://. Append ":80" to the end.
    host = (
        "service_name.115770327099****.cn-beijing.pai-eas.aliyuncs.com:80"
    )
    # The service token. Use a real token in actual applications.
    token = "<your-service-token>"
    # The model name and version.
    model_name = "<your-model-name>"
    model_version = "<version-num>"
    
    # Create gRPC metadata for token authentication.
    metadata = (("authorization", token),)

    # Create a gRPC channel and stub to communicate with the server.
    channel = grpc.insecure_channel(host)
    grpc_stub = service_pb2_grpc.GRPCInferenceServiceStub(channel)
    # Build the inference request.
    request = service_pb2.ModelInferRequest()
    request.model_name = model_name
    request.model_version = model_version

    # Construct the input tensor, which corresponds to the input parameter defined in the model configuration file.
    input = service_pb2.ModelInferRequest().InferInputTensor()
    input.name = "INPUT__0"
    input.datatype = "FP32"
    input.shape.extend([1, 3, 224, 224])
    # Construct the output tensor, which corresponds to the output parameter defined in the model configuration file.
    output = service_pb2.ModelInferRequest().InferRequestedOutputTensor()
    output.name = "OUTPUT__0"

    # Create the input request.
    request.inputs.extend([input])
    request.outputs.extend([output])
    # Construct a random number array and serialize it into a byte sequence as input data.
    request.raw_input_contents.append(np.random.rand(1, 3, 224, 224).astype(np.float32).tobytes()) # Numeric type
            
    # Initiate the inference request and receive the response.
    response, _ = grpc_stub.ModelInfer.with_call(request, metadata=metadata)
    
    # Extract the output tensor from the response.
    output_contents = response.raw_output_contents[0]  # Assume there is only one output tensor.
    output_shape = [1, 1000]  # Assume the shape of the output tensor is [1, 1000].
    
    # Convert the output bytes to a numpy array.
    output_array = np.frombuffer(output_contents, dtype=np.float32)
    output_array = output_array.reshape(output_shape)
    
    # Print the model's output result.
    print("Model output:\n", output_array)

Debugging tips

Enable verbose logging

Set verbose=True to print the JSON data of requests and responses:

client = httpclient.InferenceServerClient(url=url, verbose=True)

Example output:

POST /api/predict/triton_test/v2/models/resnet50_pt/versions/1/infer, headers {'Authorization': '************1ZDY3OTEzNA=='}
b'{"inputs":[{"name":"INPUT__0","shape":[1,3,32,32],"datatype":"FP32","data":[1.0,1.0,1.0,.....,1.0]}],"outputs":[{"name":"OUTPUT__0","parameters":{"binary_data":false}}]}'

Online debugging

You can test the service directly using the online debugging feature in the console. Set the request URL to /api/predict/triton_test/v2/models/resnet50_pt/versions/1/infer and use the JSON request data from the verbose log as the request body.
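
You can also reproduce the same call outside the console with any HTTP client. The following is a minimal sketch that posts the request body from the verbose log to the full v2 inference URL; the endpoint, token, service name, and model name are placeholders.

# Minimal sketch: send the same v2 inference request with the requests library.
# The endpoint, token, service name, and model name below are placeholders.
import requests

url = (
    "http://1859257******.cn-hangzhou.pai-eas.aliyuncs.com"
    "/api/predict/triton_test/v2/models/resnet50_pt/versions/1/infer"
)
headers = {"Authorization": "<your-service-token>"}
body = {
    "inputs": [
        {
            "name": "INPUT__0",
            "shape": [1, 3, 32, 32],
            "datatype": "FP32",
            "data": [1.0] * (3 * 32 * 32),
        }
    ],
    "outputs": [{"name": "OUTPUT__0", "parameters": {"binary_data": False}}],
}

resp = requests.post(url, headers=headers, json=body)
print(resp.json())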


Stress test the service

The following steps describe how to perform a stress test using a single piece of data as an example. For more information about stress testing, see Stress testing for services in common scenarios.

  1. On the Stress Testing Task tab, click Add Stress Testing Task, select the deployed Triton service, and then enter the stress testing endpoint.

  2. Set Data Source to Single Data and run the following code to convert the JSON-formatted request body into a Base64-encoded string.

    import base64
    
    # Existing JSON request body string
    json_str = '{"inputs":[{"name":"INPUT__0","shape":[1,3,32,32],"datatype":"FP32","data":[1.0,1.0,.....,1.0]}]}'
    # Direct encoding
    base64_str = base64.b64encode(json_str.encode('utf-8')).decode('ascii')
    print(base64_str)


FAQ

Q: I receive the error "CUDA error: no kernel image is available for execution on the device". What should I do?

This error occurs because of a compatibility issue between the image version and the GPU. Try switching to a different GPU instance type, such as A10 or T4.

Q: When I make an HTTP call, I receive the error "tritonclient.utils.InferenceServerException: url should not include the scheme". How can I fix this?

This error occurs because the service URL is incorrect. The format of the service endpoint is http://17519301*******.cn-hangzhou.pai-eas.aliyuncs.com/api/predict/wen***** (note that this is different from the gRPC endpoint). Remove the http:// prefix.

Q: When I make a gRPC call, I receive the error "DNS resolution failed for wenyu****.175193***43.cn-hangzhou.pai-eas.aliyuncs.com/:80". How can I fix this?

This error occurs because the service host is incorrect. The format of the service endpoint is http://we*****.1751930*****.cn-hangzhou.pai-eas.aliyuncs.com/ (note that this is different from the HTTP endpoint). Remove the http:// prefix and the trailing /. Then, append :80 to the end. The final format is we*****.1751930*****.cn-hangzhou.pai-eas.aliyuncs.com:80.
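
The endpoint transformations described in the two answers above can also be scripted. The following helpers are illustrative only and are not part of any SDK; they simply strip the scheme and, for gRPC, append the port.

# Illustrative helpers (not part of any SDK) that apply the endpoint
# transformations described in the FAQ answers above. Requires Python 3.9+.

def http_client_url(endpoint: str) -> str:
    # tritonclient.http expects the endpoint without the http:// scheme.
    return endpoint.removeprefix("http://").removeprefix("https://")


def grpc_host(endpoint: str) -> str:
    # gRPC expects host:port, with no scheme and no trailing slash.
    host = endpoint.removeprefix("http://").removeprefix("https://").rstrip("/")
    return host + ":80"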

References