Platform for AI: Deploy a Triton Inference Server service

Last Updated: Jan 14, 2026

Triton Inference Server, developed by NVIDIA, is an inference serving engine for deep learning and machine learning models. It supports deploying models from various AI frameworks, such as TensorRT, TensorFlow, PyTorch, and ONNX, as online inference services. It also supports features such as multi-model management and custom backends. This topic describes how to deploy a Triton-based inference service on PAI-EAS.

Prerequisites

  • An Object Storage Service (OSS) bucket in the same region as PAI.

  • A trained model file, such as .pt, .onnx, .plan, or .savedmodel.

Getting started: Deploy a single-model service

Step 1: Prepare the model repository

Triton requires a specific directory structure in an Object Storage Service (OSS) bucket. Create the directories in the following format. For more information, see Manage directories and Upload files.

oss://your-bucket/models/triton/
└── your_model_name/
    ├── 1/                    # Version directory (must be a number)
    │   └── model.pt          # Model file
    └── config.pbtxt          # Model configuration file

Key requirements:

  • Version directories must be named with numbers, such as 1, 2, or 3.

  • A higher number indicates a newer version.

  • Each model requires a config.pbtxt configuration file.
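
If you prefer to script the upload, the same directory structure can be created with the OSS Python SDK (oss2). The following is a minimal sketch; the credentials, endpoint, bucket name, and local file paths are placeholders that you must replace.

# Minimal sketch: upload a Triton model repository to OSS with the oss2 SDK.
# All credentials, endpoints, and paths below are placeholders.
import oss2

auth = oss2.Auth("<your-access-key-id>", "<your-access-key-secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "your-bucket")

# config.pbtxt sits next to the numbered version directory.
bucket.put_object_from_file(
    "models/triton/your_model_name/config.pbtxt", "local/config.pbtxt"
)

# The model file goes inside a numbered version directory, for example, 1/.
bucket.put_object_from_file(
    "models/triton/your_model_name/1/model.pt", "local/model.pt"
)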

Step 2: Create the model configuration file

Create a config.pbtxt file to configure the basic information for the model. The following is an example:

name: "your_model_name"
platform: "pytorch_libtorch"
max_batch_size: 128

input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 3, -1, -1 ]
  }
]

output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]


# Use a GPU for inference
# instance_group [
#   { 
#     kind: KIND_GPU
#   }
# ]

# Model version configuration
# Load only the latest version (default behavior)
# version_policy: { latest: { num_versions: 1 }}

# Load all versions
# version_policy: { all { }}

# Load the two latest versions
# version_policy: { latest: { num_versions: 2 }}

# Load specified versions
# version_policy: { specific: { versions: [1, 3] }}

Parameter descriptions

| Parameter | Required | Description |
| --- | --- | --- |
| name | No | The model name. If specified, it must match the name of the model directory. |
| platform | One of platform or backend | The model framework. Valid values: pytorch_libtorch, tensorflow_savedmodel, tensorflow_graphdef, tensorrt_plan, onnxruntime_onnx. |
| backend | One of platform or backend | An alternative to platform. Set it to python to customize the inference logic with the Python Backend. |
| max_batch_size | Yes | The maximum batch size. Set to 0 to disable batching. |
| input | Yes | The input tensor configuration: name, data_type, and dims. |
| output | Yes | The output tensor configuration: name, data_type, and dims. |
| instance_group | No | The inference device: KIND_GPU or KIND_CPU. For a configuration example, see the config.pbtxt file above. |
| version_policy | No | Controls which model versions are loaded. For a configuration example, see the config.pbtxt file above. |

Important

You must configure at least one of platform or backend.

Step 3: Deploy the service

  1. Log on to the PAI console. In the top navigation bar, select the destination region.

  2. In the left navigation pane, click Elastic Algorithm Service (EAS). Select the target workspace and click Deploy Service.

  3. In the Scenario-based Model Deployment section, click Triton Deployment.

  4. Configure the deployment parameters:

    • Service Name: Enter a custom service name.

    • Model Configuration: Set the Configuration Type to OSS and enter the model repository path, such as oss://your-bucket/models/triton/.

    • Select values for Instance Count and Resource Type as needed. To estimate the required VRAM for model deployment, see Estimate the VRAM required for a large model.

  5. Click Deploy and wait for the service to start.

Step 4: Enable gRPC (optional)

By default, Triton provides an HTTP service on port 8000. To use gRPC, perform the following steps:

  1. In the upper-right corner of the service configuration page, click Convert to Custom Deployment.

  2. In the Environment Information section, change the Port Number to 8001.

  3. Under Features > Advanced Networking, enable gRPC.

  4. Click Deploy.

After the model is deployed, you can call the service.

Deploy a multi-model service

To deploy multiple models in a single Triton instance, place the models in the same repository directory:

oss://your-bucket/models/triton/
├── resnet50_pytorch/
│   ├── 1/
│   │   └── model.pt
│   └── config.pbtxt
├── densenet_onnx/
│   ├── 1/
│   │   └── model.onnx
│   └── config.pbtxt
└── classifier_tensorflow/
    ├── 1/
    │   └── model.savedmodel/
    │       ├── saved_model.pb
    │       └── variables/
    └── config.pbtxt

The deployment procedure is the same as for a single model. Triton automatically loads all models in the repository.
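
After deployment, you can check which models were loaded. The following is a minimal sketch that uses the tritonclient HTTP API; the endpoint, token, and model name are placeholders.

# Minimal sketch: list the models loaded by a multi-model Triton service.
# The endpoint, token, and model name below are placeholders.
import tritonclient.http as httpclient

url = "1859257******.cn-hangzhou.pai-eas.aliyuncs.com/api/predict/triton_server_test"
headers = {"Authorization": "<your-service-token>"}

client = httpclient.InferenceServerClient(url=url)

# One entry per model in the repository, including its load state.
for model in client.get_model_repository_index(headers=headers):
    print(model)

# Check a specific model before sending requests to it.
print(client.is_model_ready("resnet50_pytorch", headers=headers))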

Use the Python Backend to customize inference logic

You can use Triton's Python Backend to customize pre-processing, post-processing, or inference logic.

Directory structure

your_model_name/
├── 1/
│   ├── model.pt          # Model file
│   └── model.py          # Custom inference logic
└── config.pbtxt

Implement the Python Backend

Create a model.py file and define the TritonPythonModel class:

import json
import os
import torch
from torch.utils.dlpack import from_dlpack, to_dlpack

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """The class name must be "TritonPythonModel"."""

    def initialize(self, args):
        """
        The initializer function. This is optional. It is called once when the model is loaded and can be used to initialize information related to model properties and configurations.
        Parameters
        ----------
        args : A dictionary where both keys and values are strings. It includes:
          * model_config: Model configuration information in JSON format.
          * model_instance_kind: The device model.
          * model_instance_device_id: The device ID.
          * model_repository: The model repository path.
          * model_version: The model version.
          * model_name: The model name.
        """

        # Convert the model configuration content from a JSON string to a Python dictionary.
        self.model_config = model_config = json.loads(args["model_config"])

        # Get the properties from the model configuration file.
        output_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT__0")

        # Convert Triton types to numpy types.
        self.output_dtype = pb_utils.triton_string_to_numpy(output_config["data_type"])

        # Get the path of the model repository.
        self.model_directory = os.path.dirname(os.path.realpath(__file__))

        # Get the device used for model inference. This example uses a GPU.
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        print("device: ", self.device)

        model_path = os.path.join(self.model_directory, "model.pt")
        if not os.path.exists(model_path):
            raise pb_utils.TritonModelException("Cannot find the pytorch model")
        # Load the PyTorch model to the GPU using .to(self.device).
        self.model = torch.jit.load(model_path).to(self.device)

        print("Initialized...")

    def execute(self, requests):
        """
        The model execution function. This must be implemented. This function is called for every inference request. If the batch parameter is set, you must also implement the batch processing feature yourself.
        Parameters
        ----------
        requests : A list of requests of the pb_utils.InferenceRequest type.

        Returns
        -------
        A list of responses of the pb_utils.InferenceResponse type. The length of the list must be the same as the length of the request list.
        """

        output_dtype = self.output_dtype

        responses = []

        # Traverse the request list and create a corresponding response for each request.
        for request in requests:
            # Get the input tensor.
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT__0")
            # Convert the Triton tensor to a Torch tensor.
            pytorch_tensor = from_dlpack(input_tensor.to_dlpack())

            if pytorch_tensor.shape[2] > 1000 or pytorch_tensor.shape[3] > 1000:
                responses.append(
                    pb_utils.InferenceResponse(
                        output_tensors=[],
                        error=pb_utils.TritonError(
                            "Image shape should not be larger than 1000"
                        ),
                    )
                )
                continue

            # Perform inference computation on the GPU.
            prediction = self.model(pytorch_tensor.to(self.device))

            # Convert the Torch output tensor to a Triton tensor.
            out_tensor = pb_utils.Tensor.from_dlpack("OUTPUT__0", to_dlpack(prediction))

            inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor])
            responses.append(inference_response)

        return responses

    def finalize(self):
        """
        Called when the model is uninstalled. This is optional and can be used for model cleanup tasks.
        """
        print("Cleaning up...")

Important

Note that when you use the Python Backend, some of Triton's behaviors change:

  • max_batch_size has no effect: The max_batch_size parameter in config.pbtxt does not enable dynamic batching in the Python Backend. You must iterate through the requests list in the execute method and build the batch for inference yourself, as shown in the sketch after this list.

  • instance_group has no effect: The instance_group in config.pbtxt does not control whether the Python Backend uses a CPU or a GPU. You must explicitly move the model and data to the target device in the initialize and execute methods using code, such as pytorch_tensor.to(torch.device("cuda")).
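
The following sketch shows one way to implement the manual batching described above: it concatenates the inputs from all pending requests, runs a single forward pass, and splits the result back into one response per request. It is a sketch only; it relies on the imports and the initialize method of the model.py example above, assumes every request has the same input shape apart from the batch dimension, and omits the per-request error handling shown earlier.

    # Sketch of an alternative execute method with manual batching.
    # Assumes all requests share the same non-batch input shape.
    def execute(self, requests):
        batched_inputs = []
        batch_sizes = []
        for request in requests:
            triton_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT__0")
            torch_tensor = from_dlpack(triton_tensor.to_dlpack()).to(self.device)
            batched_inputs.append(torch_tensor)
            batch_sizes.append(torch_tensor.shape[0])

        # Run one forward pass over the concatenated batch.
        batch = torch.cat(batched_inputs, dim=0)
        with torch.no_grad():
            predictions = self.model(batch)

        # Split the batched output back into one response per request.
        responses = []
        offset = 0
        for size in batch_sizes:
            out = predictions[offset:offset + size].contiguous()
            offset += size
            out_tensor = pb_utils.Tensor.from_dlpack("OUTPUT__0", to_dlpack(out))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses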

Update the configuration file

name: "resnet50_pt"
backend: "python"
max_batch_size: 128
input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 3, -1, -1 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

parameters: {
    key: "FORCE_CPU_ONLY_INPUT_TENSORS"
    value: {string_value: "no"}
}

The key parameters are described as follows:

  • backend: Must be set to `python`.

  • parameters: Optional. If you use a GPU for inference, set the FORCE_CPU_ONLY_INPUT_TENSORS parameter to `no` to avoid the overhead of copying input tensors between the CPU and GPU.

Deploy the service

To use the Python backend, you must configure shared memory. In the Custom Model Deployment > JSON On Premises Deployment section, enter the following JSON configuration and deploy the service to implement custom inference logic.

{
  "metadata": {
    "name": "triton_server_test",
    "instance": 1
  },
  "cloud": {
        "computing": {
            "instance_type": "ml.gu7i.c8m30.1-gu30",
            "instances": null
        }
    },
  "containers": [
    {
      "command": "tritonserver --model-repository=/models",
      "image": "eas-registry-vpc.<region>.cr.aliyuncs.com/pai-eas/tritonserver:23.02-py3",
      "port": 8000,
      "prepare": {
        "pythonRequirements": [
          "torch==2.0.1"
        ]
      }
    }
  ],
  "storage": [
    {
      "mount_path": "/models",
      "oss": {
        "path": "oss://oss-test/models/triton_backend/"
      }
    },
    {
      "empty_dir": {
        "medium": "memory",
        "size_limit": 1
      },
      "mount_path": "/dev/shm"
    }
  ]
}

Key JSON configuration descriptions:

  • containers[0].image: The Triton image. Replace <region> with the ID of the region where your service is deployed, for example, cn-hangzhou.

  • containers[0].prepare.pythonRequirements: List your Python dependencies here. EAS automatically installs them before the service starts.

  • storage: Contains two mount items.

    • The first mounts your OSS model repository path to the /models directory in the container.

    • The second is the required shared memory configuration. The Triton Server and the Python Backend process use shared memory at /dev/shm for zero-copy tensor data transfer to maximize performance. The unit for size_limit is GB. Estimate the required size based on your model and concurrency.

Call the service

Get the service endpoint and token

  1. On the Elastic Algorithm Service (EAS) page, click the service name.

  2. On the Service Details tab, click View Endpoint Information. Copy the Internet Endpoint and Token.
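
Before sending inference requests, you can optionally check that the server is ready. Triton exposes the standard v2 health endpoint, which EAS forwards under the service path. The following is a minimal sketch that uses the requests library; the endpoint and token are placeholders.

# Minimal sketch: probe the Triton readiness endpoint through EAS.
# The endpoint and token below are placeholders.
import requests

endpoint = "http://1859257******.cn-hangzhou.pai-eas.aliyuncs.com/api/predict/triton_server_test"
token = "<your-service-token>"

resp = requests.get(endpoint + "/v2/health/ready", headers={"Authorization": token})
# A 200 status code indicates that the server is ready to accept requests.
print(resp.status_code)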

Send an HTTP request

If the service port is set to 8000, the service accepts HTTP requests.

import numpy as np
# To install the tritonclient package, run the command: pip install tritonclient
import tritonclient.http as httpclient

# The endpoint generated after the service is deployed. Do not include http://
url = '1859257******.cn-hangzhou.pai-eas.aliyuncs.com/api/predict/triton_server_test'

triton_client = httpclient.InferenceServerClient(url=url)

image = np.ones((1,3,224,224))
image = image.astype(np.float32)

inputs = []
inputs.append(httpclient.InferInput('INPUT__0', image.shape, "FP32"))
inputs[0].set_data_from_numpy(image, binary_data=False)
outputs = []
outputs.append(httpclient.InferRequestedOutput('OUTPUT__0', binary_data=False))  # Get a 1000-dimensional vector

# Specify the model name, request token, inputs, and outputs.
results = triton_client.infer(
    model_name="<your-model-name>",
    model_version="<version-num>",
    inputs=inputs,
    outputs=outputs,
    headers={"Authorization": "<your-service-token>"},
)
output_data0 = results.as_numpy('OUTPUT__0')
print(output_data0.shape)
print(output_data0)

Send a gRPC request

If the service port is set to 8001 and gRPC is enabled, the service accepts gRPC requests.

Important

The gRPC endpoint is different from the HTTP endpoint. Obtain it again from the service details page.

#!/usr/bin/env python
import grpc
# To install the tritonclient package, run the command: pip install tritonclient
from tritonclient.grpc import service_pb2, service_pb2_grpc
import numpy as np

if __name__ == "__main__":
    # The endpoint generated after the service is deployed. Do not include http://. Append ":80" to the end.
    host = (
        "service_name.115770327099****.cn-beijing.pai-eas.aliyuncs.com:80"
    )
    # The service token. Use a real token in actual applications.
    token = "<your-service-token>"
    # The model name and version.
    model_name = "<your-model-name>"
    model_version = "<version-num>"
    
    # Create gRPC metadata for token authentication.
    metadata = (("authorization", token),)

    # Create a gRPC channel and stub to communicate with the server.
    channel = grpc.insecure_channel(host)
    grpc_stub = service_pb2_grpc.GRPCInferenceServiceStub(channel)
    # Build the inference request.
    request = service_pb2.ModelInferRequest()
    request.model_name = model_name
    request.model_version = model_version

    # Construct the input tensor, which corresponds to the input parameter defined in the model configuration file.
    input = service_pb2.ModelInferRequest().InferInputTensor()
    input.name = "INPUT__0"
    input.datatype = "FP32"
    input.shape.extend([1, 3, 224, 224])
    # Construct the output tensor, which corresponds to the output parameter defined in the model configuration file.
    output = service_pb2.ModelInferRequest().InferRequestedOutputTensor()
    output.name = "OUTPUT__0"

    # Create the input request.
    request.inputs.extend([input])
    request.outputs.extend([output])
    # Construct a random number array and serialize it into a byte sequence as input data.
    request.raw_input_contents.append(np.random.rand(1, 3, 224, 224).astype(np.float32).tobytes()) # Numeric type
            
    # Initiate the inference request and receive the response.
    response, _ = grpc_stub.ModelInfer.with_call(request, metadata=metadata)
    
    # Extract the output tensor from the response.
    output_contents = response.raw_output_contents[0]  # Assume there is only one output tensor.
    output_shape = [1, 1000]  # Assume the shape of the output tensor is [1, 1000].
    
    # Convert the output bytes to a numpy array.
    output_array = np.frombuffer(output_contents, dtype=np.float32)
    output_array = output_array.reshape(output_shape)
    
    # Print the model's output result.
    print("Model output:\n", output_array)

Debugging tips

Enable verbose logging

Set verbose=True to print the JSON data of requests and responses:

client = httpclient.InferenceServerClient(url=url, verbose=True)

Example output:

POST /api/predict/triton_test/v2/models/resnet50_pt/versions/1/infer, headers {'Authorization': '************1ZDY3OTEzNA=='}
b'{"inputs":[{"name":"INPUT__0","shape":[1,3,32,32],"datatype":"FP32","data":[1.0,1.0,1.0,.....,1.0]}],"outputs":[{"name":"OUTPUT__0","parameters":{"binary_data":false}}]}'

Online debugging

You can test the service directly using the online debugging feature in the console. Set the request URL to /api/predict/triton_test/v2/models/resnet50_pt/versions/1/infer and use the JSON request data from the verbose log as the request body.
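
You can also reproduce the same call outside the console with any HTTP client. The following is a minimal sketch that posts the request body from the verbose log to the full v2 inference URL; the endpoint, token, service name, and model name are placeholders.

# Minimal sketch: send the same v2 inference request with the requests library.
# The endpoint, token, service name, and model name below are placeholders.
import requests

url = (
    "http://1859257******.cn-hangzhou.pai-eas.aliyuncs.com"
    "/api/predict/triton_test/v2/models/resnet50_pt/versions/1/infer"
)
headers = {"Authorization": "<your-service-token>"}
body = {
    "inputs": [
        {
            "name": "INPUT__0",
            "shape": [1, 3, 32, 32],
            "datatype": "FP32",
            "data": [1.0] * (3 * 32 * 32),
        }
    ],
    "outputs": [{"name": "OUTPUT__0", "parameters": {"binary_data": False}}],
}

resp = requests.post(url, headers=headers, json=body)
print(resp.json())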


Stress test the service

The following steps describe how to perform a stress test using a single piece of data as an example. For more information about stress testing, see Stress testing for services in common scenarios.

  1. On the Stress Testing Task tab, click Add Stress Testing Task, select the deployed Triton service, and then enter the stress testing endpoint.

  2. Set Data Source to Single Data and run the following code to convert the JSON-formatted request body into a Base64-encoded string.

    import base64
    
    # Existing JSON request body string
    json_str = '{"inputs":[{"name":"INPUT__0","shape":[1,3,32,32],"datatype":"FP32","data":[1.0,1.0,.....,1.0]}]}'
    # Direct encoding
    base64_str = base64.b64encode(json_str.encode('utf-8')).decode('ascii')
    print(base64_str)


FAQ

Q: I receive the error "CUDA error: no kernel image is available for execution on the device". What should I do?

This error occurs because of a compatibility issue between the image version and the GPU. Try switching to a different GPU instance type, such as A10 or T4.

Q: When I make an HTTP call, I receive the error "tritonclient.utils.InferenceServerException: url should not include the scheme". How can I fix this?

This error occurs because the service URL is incorrect. The format of the service endpoint is http://17519301*******.cn-hangzhou.pai-eas.aliyuncs.com/api/predict/wen***** (note that this is different from the gRPC endpoint). Remove the http:// prefix.

Q: When I make a gRPC call, I receive the error "DNS resolution failed for wenyu****.175193***43.cn-hangzhou.pai-eas.aliyuncs.com/:80". How can I fix this?

This error occurs because the service host is incorrect. The format of the service endpoint is http://we*****.1751930*****.cn-hangzhou.pai-eas.aliyuncs.com/ (note that this is different from the HTTP endpoint). Remove the http:// prefix and the trailing /. Then, append :80 to the end. The final format is we*****.1751930*****.cn-hangzhou.pai-eas.aliyuncs.com:80.
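
The endpoint transformations described in the two answers above can also be scripted. The following helpers are illustrative only and are not part of any SDK; they simply strip the scheme and, for gRPC, append the port.

# Illustrative helpers (not part of any SDK) that apply the endpoint
# transformations described in the FAQ answers above. Requires Python 3.9+.

def http_client_url(endpoint: str) -> str:
    # tritonclient.http expects the endpoint without the http:// scheme.
    return endpoint.removeprefix("http://").removeprefix("https://")


def grpc_host(endpoint: str) -> str:
    # gRPC expects host:port, with no scheme and no trailing slash.
    host = endpoint.removeprefix("http://").removeprefix("https://").rstrip("/")
    return host + ":80"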

References