Platform for AI: Use a Triton Inference Server image to deploy a model service

Last Updated: Aug 06, 2024

Triton Inference Server is an open source inference serving engine that streamlines AI inference. It allows you to deploy AI models from multiple deep learning and machine learning frameworks, such as TensorRT, TensorFlow, PyTorch, and ONNX, as online inference services. Triton Inference Server also supports multi-model management and provides a backend API that allows you to add custom backends. This topic describes how to use a Triton Inference Server image to deploy a model service in Platform for AI (PAI).

Deploy a single-model service

  1. Create a model directory in an Object Storage Service (OSS) bucket, and organize the model files and the model configuration file based on the format requirements of the model directory. For more information, see Manage directories.

    Each model directory must have at least one version sub-directory and one model configuration file.

    • Version sub-directory: stores model files. The name of a version sub-directory must be a number that indicates the model version. A larger number indicates a later model version.

    • Model configuration file: stores the basic information about the model. In most cases, this file is named config.pbtxt.

    For example, a model is stored in the oss://examplebucket/models/triton/ directory, and the directory is organized in the following structure:

    triton
    └──resnet50_pt
        ├── 1
        │   └── model.pt
        ├── 2
        │   └── model.pt
        ├── 3
        │   └── model.pt
        └── config.pbtxt

    The config.pbtxt file specifies the configurations of the model. Example:

    name: "resnet50_pt"
    platform: "pytorch_libtorch"
    max_batch_size: 128
    input [
      {
        name: "INPUT__0"
        data_type: TYPE_FP32
        dims: [ 3, -1, -1 ]
      }
    ]
    output [
      {
        name: "OUTPUT__0"
        data_type: TYPE_FP32
        dims: [ 1000 ]
      }
    ]
    
    # Use GPU resources for inference.
    # instance_group [
    #   { 
    #     kind: KIND_GPU
    #   }
    # ]
    
    # Specify the version policy of the model.
    # version_policy: { all { }}
    # version_policy: { latest: { num_versions: 2}}
    # version_policy: { specific: { versions: [1,3]}}

    The following list describes the key parameters in the config.pbtxt file.

    • name (optional): The name of the model. The default value is the name of the model directory. If you specify this parameter, the value must match the name of the model directory.

    • platform/backend (required): You must specify at least one of the following parameters:

      • platform: the framework that is used to train the model. Common values: tensorrt_plan, onnxruntime_onnx, pytorch_libtorch, tensorflow_savedmodel, and tensorflow_graphdef.

      • backend: the method that is used to run the model. You can specify this parameter in one of the following ways:

        • Specify the framework that is used to train the model. Common values: tensorrt, onnxruntime, pytorch, and tensorflow. Note that the valid values of the platform and backend parameters differ for the same framework.

        • Specify the name of a backend that uses Python code to implement custom inference logic. For more information, see Use a custom backend.

    • max_batch_size (required): The maximum batch size that the model supports, which determines the maximum number of requests that can be processed at the same time. If you set this parameter to 0, batch processing is disabled.

    • input (required): Contains the following properties:

      • name: the name of the input.

      • data_type: the data type of the input.

      • dims: the dimensions of the input.

    • output (required): Contains the following properties:

      • name: the name of the output.

      • data_type: the data type of the output.

      • dims: the dimensions of the output.

    • instance_group (optional): The computing resources that are used to run the model. If GPU resources are available, the model automatically uses GPU resources for inference. Otherwise, the model uses CPU resources for inference. You can explicitly specify the computing resources in the following format:

      instance_group [
        { 
          kind: KIND_GPU
        }
      ]

      Valid values for the kind property: KIND_GPU and KIND_CPU.

    • version_policy (optional): The version policy of the model. Sample configurations for the resnet50_pt model:

      version_policy: { all { }}
      version_policy: { latest: { num_versions: 2}}
      version_policy: { specific: { versions: [1,3]}}

      • If you leave this parameter empty, the latest version of the model is loaded. For example, if you leave this parameter empty for the resnet50_pt model, version 3 of the model is loaded.

      • all{}: loads all versions of the model. In the preceding example, versions 1, 2, and 3 of the resnet50_pt model are loaded.

      • latest{num_versions:}: loads the latest n versions, where n is the value of num_versions. In the preceding example, num_versions: 2 indicates that the latest two versions (versions 2 and 3) of the resnet50_pt model are loaded.

      • specific{versions:[]}: loads the specified versions. In the preceding example, versions 1 and 3 of the resnet50_pt model are loaded.

  2. Deploy the Triton Inference Server service.

    Triton Inference Server supports the following two ports. By default, the scenario-based model deployment uses port 8000. If you want to use port 8001, perform Step 5. Otherwise, skip Step 5.

    • 8000: launches an HTTP server on port 8000 to receive HTTP requests.

    • 8001: launches a Google Remote Procedure Call (gRPC) server on port 8001 to receive gRPC requests.

    Perform the following steps:

    1. Go to the Elastic Algorithm Service (EAS) page.

      1. Log on to the PAI console.

      2. In the left-side navigation pane, click Workspaces. On the Workspaces page, click the name of the workspace that you want to manage.

      3. In the left-side navigation pane, choose Model Deployment > Elastic Algorithm Service (EAS). The Elastic Algorithm Service (EAS) page appears.

    2. On the Elastic Algorithm Service (EAS) page, click Deploy Service.

    3. In the Scenario-based Model Deployment section of the page that appears, click Triton Deployment.

    4. On the Triton Deployment page, configure the following parameters. For information about other parameters, see Deploy a model service in the PAI console.

      • Service Name: The name of the service.

      • Model Settings: In this example, select Mount OSS for Type and set OSS to the OSS path that stores the model files prepared in Step 1. Example: oss://example/models/triton/.

    5. (Optional) In the upper-right corner of the page, click Convert to Custom Deployment. In the Model Service Information section, change the port number in the Command to Run parameter to 8001. In the Configuration Editor section, add the following configuration. A combined example of the resulting configuration is provided after this procedure.

      Note

      By default, the service starts an HTTP server on port 8000 to receive HTTP requests. To receive gRPC requests, you must change the port number to 8001 so that the system starts a gRPC server on that port.

      "metadata": {
          "enable_http2": true
      },
      "networking": {
          "path": "/"
      }
    6. After you configure the parameters, click Deploy.
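
    The following snippet is a minimal sketch of what the converted configuration might look like after you switch to port 8001. It assumes the default startup command and the sample service name that are used later in this topic; Triton serves gRPC on port 8001 by default, so only the port, the enable_http2 setting, and the networking path change. Your actual configuration also contains the resource and storage settings that are generated for your deployment.

    {
      "metadata": {
        "name": "triton_server_test",
        "instance": 1,
        "enable_http2": true
      },
      "networking": {
        "path": "/"
      },
      "containers": [
        {
          "command": "tritonserver --model-repository=/models",
          "image": "eas-registry-vpc.<region>.cr.aliyuncs.com/pai-eas/tritonserver:23.02-py3",
          "port": 8001
        }
      ]
    }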

Deploy a multi-model service

The method for deploying a multi-model service is similar to the method for deploying a single-model service in Elastic Algorithm Service (EAS). To deploy a multi-model service, create a model directory that contains multiple models, as shown in the following example. The service loads all of the models and deploys them in the same service. For more information, see Deploy a single-model service. A client then selects a model by specifying the model name in each request, as shown in the sketch after the directory structure.

triton
├── resnet50_pt
|   ├── 1
|   │   └── model.pt
|   └── config.pbtxt
├── densenet_onnx
|   ├── 1
|   │   └── model.onnx
|   └── config.pbtxt
└── mnist_savedmodel
    ├── 1
    │   └── model.savedmodel
    │       ├── saved_model.pb
    |       └── variables
    |           ├── variables.data-00000-of-00001
    |           └── variables.index
    └── config.pbtxt
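
Because all models run in the same service, a client selects a model by specifying the model name in each request. The following Python snippet is a minimal sketch that queries two of the models in the preceding directory through the Triton HTTP client. The endpoint, token, and DenseNet input name are placeholders that you must replace with the values of your own service and model configuration files.

import numpy as np
import tritonclient.http as httpclient

# Placeholder endpoint and token of the multi-model service.
url = "<service_endpoint>"   # Example: 1859257******.cn-hangzhou.pai-eas.aliyuncs.com/api/predict/<service_name>
headers = {"Authorization": "<test-token>"}

client = httpclient.InferenceServerClient(url=url)

# Query the PyTorch ResNet-50 model.
image = np.ones((1, 3, 224, 224), dtype=np.float32)
resnet_inputs = [httpclient.InferInput("INPUT__0", image.shape, "FP32")]
resnet_inputs[0].set_data_from_numpy(image, binary_data=False)
resnet_result = client.infer(model_name="resnet50_pt", inputs=resnet_inputs, headers=headers)
print(resnet_result.as_numpy("OUTPUT__0").shape)

# Query the ONNX DenseNet model in the same service by changing model_name.
# The input name, shape, and data type must match the config.pbtxt file of that model.
densenet_inputs = [httpclient.InferInput("<densenet_input_name>", image.shape, "FP32")]
densenet_inputs[0].set_data_from_numpy(image, binary_data=False)
densenet_result = client.infer(model_name="densenet_onnx", inputs=densenet_inputs, headers=headers)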

Use a custom backend

A Triton backend implements the inference process for a model. A backend can use an existing framework, such as TensorRT, ONNX Runtime, PyTorch, and TensorFlow, or implement custom inference logic, such as pre-processing and post-processing operations.

You can implement a backend by using C++ or Python. Python is more flexible and convenient than C++. This section describes how to implement a Python backend.

  1. Modify the structure of the model directory.

    The following example shows the required directory structure for a PyTorch model:

    resnet50_pt
    ├── 1
    │   ├── model.pt
    │   └── model.py
    └── config.pbtxt

    To use a custom backend, you must add a model.py file in the sub-directory that represents the model version and modify the config.pbtxt file.

    • Add a model.py file.

      The model.py file contains your custom inference logic. You must define a class named TritonPythonModel and implement the initialize, execute, and finalize functions based on your business requirements. Sample code:

      import json
      import os
      import torch
      from torch.utils.dlpack import from_dlpack, to_dlpack
      
      import triton_python_backend_utils as pb_utils
      
      
      class TritonPythonModel:
          """The class name must be TritonPythonModel."""
      
          def initialize(self, args):
              """
              The initialize() function is optional. It is called only during the model loading process to initialize information about the model, such as model properties and configurations. 
              Parameters
              ----------
              args: a dictionary that stores data as key-value pairs. The keys and values are of the string type. Valid keys:
                * model_config: the model configurations in the JSON format. 
                * model_instance_kind: the type of the device that is used to run the model. 
                * model_instance_device_id: the ID of the device that is used to run the model. 
                * model_repository: the path of the model repository. 
                * model_version: the version of the model. 
                * model_name: the name of the model. 
              """
      
              # Convert the JSON string that specifies the model configurations into a Python dictionary. 
              self.model_config = model_config = json.loads(args["model_config"])
      
              # Extract the properties from the model configuration file. 
              output_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT__0")
      
              # Convert Triton types into NumPy types. 
              self.output_dtype = pb_utils.triton_string_to_numpy(output_config["data_type"])
      
              # Obtain the path of the model repository. 
              self.model_directory = os.path.dirname(os.path.realpath(__file__))
      
              # Obtain the device that is used to run the model. In this example, the model runs on a GPU device. 
              self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
              print("device: ", self.device)
      
              model_path = os.path.join(self.model_directory, "model.pt")
              if not os.path.exists(model_path):
                  raise pb_utils.TritonModelException("Cannot find the pytorch model")
              # Use .to(self.device) to load the PyTorch model to a GPU device. 
              self.model = torch.jit.load(model_path).to(self.device)
      
              print("Initialized...")
      
          def execute(self, requests):
              """
              The execute() function is required. It is called each time the model receives an inference request. If you want the model to support batch processing, you must add the batch processing logic in the execute() function.
              Parameters
              ----------
              requests: a list of requests. Each request is of the pb_utils.InferenceRequest type. 
      
              Returns
              -------
              A list of responses. Each response is of the pb_utils.InferenceResponse type. The length of the response list must be the same as the length of the request list. 
              """
      
              output_dtype = self.output_dtype
      
              responses = []
      
              # Traverse the request list and create a response for each request. 
              for request in requests:
                  # Obtain the input tensor. 
                  input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT__0")
                  # Convert a Triton tensor into a PyTorch tensor. 
                  pytorch_tensor = from_dlpack(input_tensor.to_dlpack())
      
                  if pytorch_tensor.shape[2] > 1000 or pytorch_tensor.shape[3] > 1000:
                      responses.append(
                          pb_utils.InferenceResponse(
                              output_tensors=[],
                              error=pb_utils.TritonError(
                                  "Image shape should not be larger than 1000"
                              ),
                          )
                      )
                      continue
      
                  # Perform inference on the GPU device. 
                  prediction = self.model(pytorch_tensor.to(self.device))
      
                  # Convert the PyTorch output tensor into a Triton tensor. 
                  out_tensor = pb_utils.Tensor.from_dlpack("OUTPUT__0", to_dlpack(prediction))
      
                  inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor])
                  responses.append(inference_response)
      
              return responses
      
          def finalize(self):
              """
              The finalize() function is optional. It is called when the model is unloaded to release resources. 
              """
              print("Cleaning up...")
      
      Important
      • If you want to run inference on a GPU device, do not use the instance_group.kind property in the config.pbtxt file. Instead, call the model.to(torch.device("cuda")) function to load the model on a GPU device. When the model receives a request, call the pytorch_tensor.to(torch.device("cuda")) function to send the model input tensor to the GPU device. This way, you can run inference on a GPU device after you configure GPU resources for model deployment.

      • If you want the model to support batch processing, do not use the max_batch_size parameter in the config.pbtxt file. Instead, implement the batch processing logic in the execute() function, as shown in the sketch at the end of this step.

      • Each request must correspond to one response.

    • Modify the config.pbtxt file.

      Sample configuration:

      name: "resnet50_pt"
      backend: "python"
      max_batch_size: 128
      input [
        {
          name: "INPUT__0"
          data_type: TYPE_FP32
          dims: [ 3, -1, -1 ]
        }
      ]
      output [
        {
          name: "OUTPUT__0"
          data_type: TYPE_FP32
          dims: [ 1000 ]
        }
      ]
      
      parameters: {
          key: "FORCE_CPU_ONLY_INPUT_TENSORS"
          value: {string_value: "no"}
      }

      Modify the following parameters and retain the configurations of other parameters.

      • backend: Set the value to python.

      • parameters: This parameter is optional. If the model runs inference on a GPU device, set FORCE_CPU_ONLY_INPUT_TENSORS to no, which prevents unnecessary overhead caused by copying the input tensor between the CPU and GPU during inference.
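
    As noted in the Important section, a Python backend that needs batch processing must combine the requests itself. The following sketch shows one possible way to extend the execute() function of the sample model.py file: the inputs of all requests are concatenated into a single batch, the model runs once, and the output is split back so that each request still receives exactly one response. The sketch assumes that all requests in a batch share the same input dimensions and reuses the tensor names and helper calls from the sample above; treat it as an illustration rather than a drop-in implementation.

    def execute(self, requests):
        # Collect the input tensor of each request and record its batch size.
        batched_inputs = []
        batch_sizes = []
        for request in requests:
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT__0")
            pytorch_tensor = from_dlpack(input_tensor.to_dlpack())
            batched_inputs.append(pytorch_tensor)
            batch_sizes.append(pytorch_tensor.shape[0])

        # Run the model once on the concatenated batch.
        batch = torch.cat(batched_inputs, dim=0).to(self.device)
        predictions = self.model(batch)

        # Split the batched output and return exactly one response per request.
        responses = []
        offset = 0
        for batch_size in batch_sizes:
            prediction = predictions[offset:offset + batch_size]
            offset += batch_size
            out_tensor = pb_utils.Tensor.from_dlpack("OUTPUT__0", to_dlpack(prediction))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses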

  2. Deploy the model.

    When you use a Python backend, you must configure shared memory for the service. You can use the following configuration to create a model service that contains custom inference logic. For information about how to deploy a model service by using a client, see Deploy model services by using EASCMD or DSW.

    {
      "metadata": {
        "name": "triton_server_test",
        "instance": 1
      },
      "cloud": {
        "computing": {
          "instance_type": "ml.gu7i.c8m30.1-gu30",
          "instances": null
        }
      },
      "containers": [
        {
          "command": "tritonserver --model-repository=/models",
          "image": "eas-registry-vpc.<region>.cr.aliyuncs.com/pai-eas/tritonserver:23.02-py3",
          "port": 8000,
          "prepare": {
            "pythonRequirements": [
              "torch==2.0.1"
            ]
          }
        }
      ],
      "storage": [
        {
          "mount_path": "/models",
          "oss": {
            "path": "oss://oss-test/models/triton_backend/"
          }
        },
        {
          "empty_dir": {
            "medium": "memory",
            // Set the shared memory to 1 GB.
            "size_limit": 1
          },
          "mount_path": "/dev/shm"
        }
      ]
    }

    Take note of the following parameters:

    • name: the name of the service that contains the custom backend logic.

    • storage.oss.path: the OSS path of the model directory.

    • containers.image: the image that is used for deployment. Replace <region> with the ID of your current region. For example, you can specify cn-shanghai for the China (Shanghai) region.

Call the service

You can configure a client to send inference requests to the deployed service.

  • Send HTTP requests

    If you set the port number to 8000, you can send HTTP requests to the service. Sample Python code:

    import numpy as np
    import tritonclient.http as httpclient
    
    # url specifies the endpoint that is used to access the service that you deployed in EAS. 
    url = '1859257******.cn-hangzhou.pai-eas.aliyuncs.com/api/predict/triton_server_test'
    
    triton_client = httpclient.InferenceServerClient(url=url)
    
    image = np.ones((1,3,224,224))
    image = image.astype(np.float32)
    
    inputs = []
    inputs.append(httpclient.InferInput('INPUT__0', image.shape, "FP32"))
    inputs[0].set_data_from_numpy(image, binary_data=False)
    outputs = []
    outputs.append(httpclient.InferRequestedOutput('OUTPUT__0', binary_data=False)) # Obtain a 1000-dimensional vector.
    
    # Specify the name, token, input, and output of the model. 
    results = triton_client.infer(
        model_name="<model_name>",
        model_version="<version_num>",
        inputs=inputs,
        outputs=outputs,
        headers={"Authorization": "<test-token>"},
    )
    output_data0 = results.as_numpy('OUTPUT__0')
    print(output_data0.shape)
    print(output_data0)

    The following list describes the key parameters.

    • url: The endpoint of the service without the http:// prefix. To obtain the endpoint, go to the Elastic Algorithm Service (EAS) page, find the service, and click its name. On the Service Details tab of the page that appears, click View Endpoint Information. On the Public Endpoint tab of the Invocation Method dialog box, view the public endpoint.

    • model_name: The name of the model. Example: resnet50_pt.

    • model_version: The model version that you want to use. Requests can be sent only to one version of the model at a time.

    • headers: The token of the service. Replace <test-token> with the token of your service. You can view the token on the Public Endpoint tab.

  • Send gRPC requests

    If you set the port number to 8001 and add the required configuration, you can send gRPC requests to the service. Sample Python code:

    #!/usr/bin/env python
    import grpc
    from tritonclient.grpc import service_pb2, service_pb2_grpc
    import numpy as np
    
    if __name__ == "__main__":
        # Define the endpoint of the service. 
        host = (
            "service_name.115770327099****.cn-beijing.pai-eas.aliyuncs.com:80"
        )
        # Replace test-token with the token of your service. 
        token = "test-token"
        # Specify the model name and version. 
        model_name = "resnet50_pt"
        model_version = "1"
        
        # Create gRPC metadata for token verification. 
        metadata = (("authorization", token),)
    
        # Create a gRPC channel and a gRPC stub to communicate with the server. 
        channel = grpc.insecure_channel(host)
        grpc_stub = service_pb2_grpc.GRPCInferenceServiceStub(channel)
        
        # Create an inference request. 
        request = service_pb2.ModelInferRequest()
        request.model_name = model_name
        request.model_version = model_version
        
        # Construct the input tensor based on the input parameter that you specify in the model configuration file. 
        input = service_pb2.ModelInferRequest().InferInputTensor()
        input.name = "INPUT__0"
        input.datatype = "FP32"
        input.shape.extend([1, 3, 224, 224])
        # Construct the output tensor based on the output parameter that you specify in the model configuration file.
        output = service_pb2.ModelInferRequest().InferRequestedOutputTensor()
        output.name = "OUTPUT__0"
        
        # Add the input data to the request. 
        request.inputs.extend([input])
        request.outputs.extend([output])
        # Create the input data by constructing a random array and serializing the array into a sequence of bytes. 
        request.raw_input_contents.append(np.random.rand(1, 3, 224, 224).astype(np.float32).tobytes())  # An array of floating-point numbers
            
        # Send an inference request and receive the response. 
        response, _ = grpc_stub.ModelInfer.with_call(request, metadata=metadata)
        
        # Extract the output tensor from the response. 
        output_contents = response.raw_output_contents[0]  # In this example, only one output tensor is returned.
        output_shape = [1, 1000]  # In this example, the shape of the output tensor is [1, 1000].
        
        # Convert the output bytes into a NumPy array. 
        output_array = np.frombuffer(output_contents, dtype=np.float32)
        output_array = output_array.reshape(output_shape)
        
        # Print the output of the model. 
        print("Model output:\n", output_array)

    The following list describes the key parameters.

    • host: The endpoint of the service without the http:// prefix and with the :80 suffix. To obtain the endpoint, go to the Elastic Algorithm Service (EAS) page, find the service, and click its name. On the Service Details tab of the page that appears, click View Endpoint Information. On the Public Endpoint tab of the Invocation Method dialog box, view the public endpoint.

    • token: The token of the service. Replace test-token with the token of your service. You can view the token on the Public Endpoint tab.

    • model_name: The name of the model. Example: resnet50_pt.

    • model_version: The model version that you want to use. Requests can be sent only to one version of the model at a time.
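
Before you send inference requests over either protocol, you can check whether the service and the model are ready. The following Python snippet is a minimal sketch that uses the Triton HTTP client for this check. It assumes that the service exposes Triton's standard readiness endpoints through the same EAS endpoint and token that are used in the preceding examples; the endpoint and token are placeholders that you must replace with your own values.

import tritonclient.http as httpclient

# Placeholder endpoint and token of the service.
url = "<service_endpoint>"
headers = {"Authorization": "<test-token>"}

client = httpclient.InferenceServerClient(url=url)

# Check whether the Triton server and a specific model are ready to serve requests.
print("Server ready:", client.is_server_ready(headers=headers))
print("Model ready:", client.is_model_ready(model_name="resnet50_pt", model_version="1", headers=headers))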

References