
Platform for AI: Deploy a Triton Inference Server service

Last Updated: Jan 21, 2026

NVIDIA Triton Inference Server is a high-performance inference serving platform that supports models from TensorRT, TensorFlow, PyTorch, ONNX, and other frameworks. This guide walks you through deploying Triton-based inference services on Alibaba Cloud PAI-EAS (Elastic Algorithm Service).

Prerequisites

  • An OSS bucket in the same region as your PAI workspace

  • Trained model files (e.g., .pt, .onnx, .plan, .savedmodel)

Quickstart: Deploy a single-model service

Step 1: Prepare the model repository

Triton requires a specific directory structure within your Object Storage Service (OSS) bucket. Create your directories in the following format. For more information, see Manage directories and Upload files.

oss://your-bucket/models/triton/
└── your_model_name/
    ├── 1/                    # Version directory (must be a number)
    │   └── model.pt          # Model file
    └── config.pbtxt          # Model configuration file

Key requirements:

  • The model version directory must be named with a number (1, 2, 3, etc.).

  • A higher number indicates a newer version.

  • Each model requires a config.pbtxt configuration file.
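
The pytorch_libtorch platform used in the example below loads TorchScript model files. The following is a minimal sketch of how you might produce model.pt locally before uploading the directory tree to OSS; the torchvision ResNet-50 model, the local output path, and the torchvision version (0.13 or later for the weights argument) are assumptions for illustration.

import os
import torch
import torchvision

# Local staging directory that mirrors the repository layout shown above.
os.makedirs("your_model_name/1", exist_ok=True)

# A pretrained ResNet-50 is used here purely for illustration.
model = torchvision.models.resnet50(weights="IMAGENET1K_V1").eval()

# Trace the model into TorchScript so the pytorch_libtorch platform can load it,
# then save it as model.pt under the version directory "1/".
example_input = torch.rand(1, 3, 224, 224)
torch.jit.trace(model, example_input).save("your_model_name/1/model.pt")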

Step 2: Create the model configuration file

Create a config.pbtxt file to define your model's basic information. The following is an example:

name: "your_model_name"
platform: "pytorch_libtorch"
max_batch_size: 128

input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 3, -1, -1 ]
  }
]

output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]


# Use a GPU for inference
# instance_group [
#   { 
#     kind: KIND_GPU
#   }
# ]

# Model version policy
# Load only the latest version (default behavior)
# version_policy: { latest: { num_versions: 1 }}

# Load all versions
# version_policy: { all { }}

# Load the two latest versions
# version_policy: { latest: { num_versions: 2 }}

# Load specific versions
# version_policy: { specific: { versions: [1, 3] }}

Parameter descriptions

  • name (optional): The name of the model. If specified, it must match the model directory name.

  • platform (required): The model framework. Valid values include pytorch_libtorch, tensorflow_savedmodel, tensorflow_graphdef, tensorrt_plan, and onnxruntime_onnx. Choose this if you are deploying a standard model file (e.g., .pt, .onnx, .savedmodel).

  • backend (required): An alternative to platform. Set it to python to customize inference logic. Choose this if you need to write custom Python code for pre/post-processing or core inference.

    Note: While Triton's architecture supports developing custom backends in other languages such as C++, this is not a common practice and is not covered in this guide.

  • max_batch_size (required): The maximum batch size. Set to 0 to disable batching.

  • input (required): The input tensor configuration: name, data_type, and dims (dimensions).

  • output (required): The output tensor configuration: name, data_type, and dims (dimensions).

  • instance_group (optional): Specifies the inference device: KIND_GPU or KIND_CPU. See the config.pbtxt example for configuration details.

  • version_policy (optional): Controls which model versions are loaded. See the config.pbtxt example for configuration details.

Important

You must specify either platform or backend.
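
Optionally, you can check the syntax of config.pbtxt locally before uploading it. The following sketch assumes the tritonclient Python package is installed with its gRPC extras (for example, pip install "tritonclient[all]"), which bundles the model_config_pb2 protobuf module; the file path is a placeholder.

from google.protobuf import text_format
# model_config_pb2 ships alongside the tritonclient gRPC modules (assumption: tritonclient[all] is installed).
from tritonclient.grpc import model_config_pb2

# text_format.Parse raises ParseError if the configuration file contains syntax errors.
with open("your_model_name/config.pbtxt") as f:
    config = text_format.Parse(f.read(), model_config_pb2.ModelConfig())

print(config.name, config.platform or config.backend, config.max_batch_size)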

Step 3: Deploy the service

  1. Log on to the PAI console. Select a region at the top of the page, select the desired workspace, and then click Elastic Algorithm Service (EAS).

  2. On the Inference Service tab, click Deploy Service, and in the Scenario-based Model Deployment section, click Triton Deployment.

  3. Configure the deployment parameters:

    • Service Name: Enter a custom service name.

    • Model Settings: For Type, select OSS, and enter the path to your model repository (e.g., oss://your-bucket/models/triton/).

    • Instance Count and Resource Group Type: Select the values based on your requirements. To estimate the GPU memory required for your model, see Estimate the VRAM required for a large model.

  4. Click Deploy and wait for the service to start.

Step 4: Enable gRPC (optional)

By default, Triton provides an HTTP service on port 8000. To use gRPC:

  1. In the upper-right corner of the service configuration page, click Convert to Custom Deployment.

  2. In the Environment Information section, change the Port Number to 8001.

  3. Under Features > Advanced Networking, click Enable gRPC.

  4. Click Deploy.

After the service deploys successfully, you can call the service.
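
Before sending real requests, you can optionally verify that the server is live and your model is ready. A minimal sketch using the HTTP client; the endpoint, model name, and token are placeholders.

import tritonclient.http as httpclient

# The service endpoint without the http:// scheme, and the service token.
url = "<service-endpoint-without-scheme>"
headers = {"Authorization": "<your-service-token>"}

client = httpclient.InferenceServerClient(url=url)
print("server live:", client.is_server_live(headers=headers))
print("model ready:", client.is_model_ready("<your-model-name>", headers=headers))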

Deploy a multi-model service

To deploy multiple models in a single Triton instance, place all models in the same repository directory:

oss://your-bucket/models/triton/
├── resnet50_pytorch/
│   ├── 1/
│   │   └── model.pt
│   └── config.pbtxt
├── densenet_onnx/
│   ├── 1/
│   │   └── model.onnx
│   └── config.pbtxt
└── classifier_tensorflow/
    ├── 1/
    │   └── model.savedmodel/
    │       ├── saved_model.pb
    │       └── variables/
    └── config.pbtxt

The deployment steps are the same as for a single model. Triton automatically loads all models in the repository.
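
After deployment, you can optionally confirm that every model in the repository was loaded by querying the repository index. A minimal sketch; the endpoint and token are placeholders.

import tritonclient.http as httpclient

url = "<service-endpoint-without-scheme>"
headers = {"Authorization": "<your-service-token>"}

client = httpclient.InferenceServerClient(url=url)
# Returns one entry per model, including its name, version, and load state.
for model in client.get_model_repository_index(headers=headers):
    print(model)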

Use the Python backend to customize inference logic

Use Triton's Python backend to customize pre-processing, post-processing, or the core inference logic.

Directory structure

your_model_name/
├── 1/
│   ├── model.pt          # Model file
│   └── model.py          # Custom inference logic
└── config.pbtxt

Implement the Python backend

Create a model.py file and define the TritonPythonModel class:

import json
import os
import torch
from torch.utils.dlpack import from_dlpack, to_dlpack

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    """The class name must be 'TritonPythonModel'."""

    def initialize(self, args):
        """
        Optional. Called once when the model is loaded. Use this function to initialize
        model properties and configurations.
        Parameters
        ----------
        args : A dictionary where both keys and values are strings. It includes:
          * model_config: The model configuration in JSON format.
          * model_instance_kind: The device type.
          * model_instance_device_id: The device ID.
          * model_repository: The path to the model repository.
          * model_version: The model version.
          * model_name: The model name.
        """

        # Parse the model configuration from its JSON representation.
        self.model_config = model_config = json.loads(args["model_config"])

        # Get properties from the model configuration file.
        output_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT__0")

        # Convert Triton types to NumPy types.
        self.output_dtype = pb_utils.triton_string_to_numpy(output_config["data_type"])

        # Get the path to the model repository.
        self.model_directory = os.path.dirname(os.path.realpath(__file__))

        # Get the device for model inference. This example uses a GPU.
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        print("device: ", self.device)

        model_path = os.path.join(self.model_directory, "model.pt")
        if not os.path.exists(model_path):
            raise pb_utils.TritonModelException("Cannot find the pytorch model")
        # Load the PyTorch model to the GPU by using .to(self.device).
        self.model = torch.jit.load(model_path).to(self.device)

        print("Initialized...")

    def execute(self, requests):
        """
        Required. This method is called for every inference request. If batching is enabled,
        you must implement the batch processing logic yourself.
        Parameters
        ----------
        requests : A list of pb_utils.InferenceRequest objects.

        Returns
        -------
        A list of pb_utils.InferenceResponse objects. The list must contain one
        response for each request.
        """

        output_dtype = self.output_dtype

        responses = []

        # Iterate through the request list and create a corresponding response for each request.
        for request in requests:
            # Get the input tensor.
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT__0")
            # Convert the Triton tensor to a Torch tensor.
            pytorch_tensor = from_dlpack(input_tensor.to_dlpack())

            if pytorch_tensor.shape[2] > 1000 or pytorch_tensor.shape[3] > 1000:
                responses.append(
                    pb_utils.InferenceResponse(
                        output_tensors=[],
                        error=pb_utils.TritonError(
                            "Image shape should not be larger than 1000"
                        ),
                    )
                )
                continue

            # Run inference on the target device.
            prediction = self.model(pytorch_tensor.to(self.device))

            # Convert the Torch output tensor to a Triton tensor.
            out_tensor = pb_utils.Tensor.from_dlpack("OUTPUT__0", to_dlpack(prediction))

            inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor])
            responses.append(inference_response)

        return responses

    def finalize(self):
        """
        Optional. Called when the model is unloaded. Use this for cleanup tasks, such as
        releasing resources.
        """
        print("Cleaning up...")

Important

Note that when you use the Python Backend, some of Triton's behaviors change:

  • max_batch_size has no effect: The max_batch_size parameter in config.pbtxt does not enable dynamic batching in the Python Backend. You must iterate through the requests list in the execute method and manually build the batch for inference (see the sketch after this list).

  • instance_group has no effect: The instance_group in config.pbtxt does not control whether the Python Backend uses a CPU or a GPU. You must explicitly move the model and data to the target device in the initialize and execute methods using code, such as pytorch_tensor.to(torch.device("cuda")).
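
For reference, the following is a simplified sketch of one way to batch requests manually inside execute: the per-request tensors are concatenated into a single batch, run through the model once, and the output is split back into one response per request. It assumes all requests carry tensors with identical non-batch dimensions and that initialize set self.model and self.device as in the example above.

import torch
from torch.utils.dlpack import from_dlpack, to_dlpack

# Available only inside the Triton Python backend runtime.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    # initialize() is assumed to set self.model and self.device as shown earlier.

    def execute(self, requests):
        # Collect one input tensor per request. This assumes every request uses
        # the same non-batch dimensions, so the tensors can be concatenated.
        batches = [
            from_dlpack(pb_utils.get_input_tensor_by_name(r, "INPUT__0").to_dlpack())
            for r in requests
        ]
        sizes = [b.shape[0] for b in batches]

        # Run a single forward pass over the merged batch.
        merged = torch.cat(batches, dim=0).to(self.device)
        predictions = self.model(merged)

        # Split the merged output back into one response per request.
        responses = []
        for chunk in torch.split(predictions, sizes, dim=0):
            out_tensor = pb_utils.Tensor.from_dlpack("OUTPUT__0", to_dlpack(chunk))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses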

Update the configuration file

name: "resnet50_pt"
backend: "python"
max_batch_size: 128
input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 3, -1, -1 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

parameters: {
    key: "FORCE_CPU_ONLY_INPUT_TENSORS"
    value: {string_value: "no"}
}

Key parameter descriptions are as follows:

  • backend: Must be set to python.

  • parameters: When using a GPU for inference, you can optionally set the FORCE_CPU_ONLY_INPUT_TENSORS parameter to no to avoid the overhead of copying input tensors between the CPU and GPU.

Deploy the service

The Python backend requires shared memory. In Custom Model Deployment > JSON On Premises Deployment, enter the following JSON configuration and click Deploy.

{
  "metadata": {
    "name": "triton_server_test",
    "instance": 1
  },
  "cloud": {
        "computing": {
            "instance_type": "ml.gu7i.c8m30.1-gu30",
            "instances": null
        }
    },
  "containers": [
    {
      "command": "tritonserver --model-repository=/models",
      "image": "eas-registry-vpc.<region>.cr.aliyuncs.com/pai-eas/tritonserver:25.03-py3",
      "port": 8000,
      "prepare": {
        "pythonRequirements": [
          "torch==2.0.1"
        ]
      }
    }
  ],
  "storage": [
    {
      "mount_path": "/models",
      "oss": {
        "path": "oss://oss-test/models/triton_backend/"
      }
    },
    {
      "empty_dir": {
        "medium": "memory",
        // Configure 1 GB of shared memory.
        "size_limit": 1
      },
      "mount_path": "/dev/shm"
    }
  ]
}

Key JSON configuration description:

  • containers[0].image: The official Triton image. Replace <region> with the ID of the region where your service is deployed.

  • containers[0].prepare.pythonRequirements: List your Python dependencies here. EAS automatically installs them before the service starts.

  • storage: Contains two mount points.

    • The first mounts your OSS model repository path to the /models directory in the container.

    • The second storage entry configures shared memory, which is required. The Triton server and Python backend use the /dev/shm path to pass tensor data with zero-copy, maximizing performance. The size_limit is in GB. Estimate the required size based on your model and expected concurrency.

Call the service

Get the service endpoint and token

  1. On the Elastic Algorithm Service (EAS) page, click the service name.

  2. On the Service Details tab, click View Endpoint Information. Copy the Internet Endpoint and Token.

Send an HTTP request

When the port number is set to 8000, the service supports HTTP requests.

import numpy as np
# To install the tritonclient package, run: pip install tritonclient
import tritonclient.http as httpclient

# The service endpoint URL. Do not include the `http://` scheme.
url = '1859257******.cn-hangzhou.pai-eas.aliyuncs.com/api/predict/triton_server_test'

triton_client = httpclient.InferenceServerClient(url=url)

image = np.ones((1,3,224,224))
image = image.astype(np.float32)

inputs = []
inputs.append(httpclient.InferInput('INPUT__0', image.shape, "FP32"))
inputs[0].set_data_from_numpy(image, binary_data=False)
outputs = []
outputs.append(httpclient.InferRequestedOutput('OUTPUT__0', binary_data=False))  # Get a 1000-dimension vector

# Specify the model name, request token, inputs, and outputs.
results = triton_client.infer(
    model_name="<your-model-name>",
    model_version="<version-num>",
    inputs=inputs,
    outputs=outputs,
    headers={"Authorization": "<your-service-token>"},
)
output_data0 = results.as_numpy('OUTPUT__0')
print(output_data0.shape)
print(output_data0)

Send a gRPC request

When the port number is set to 8001 and gRPC-related settings are configured, the service supports gRPC requests.

Important

The gRPC endpoint is different from the HTTP endpoint. Obtain the correct gRPC endpoint from the service details page.

#!/usr/bin/env python
import grpc
# To install the tritonclient package, run: pip install tritonclient
from tritonclient.grpc import service_pb2, service_pb2_grpc
import numpy as np

if __name__ == "__main__":
    # The access URL (service endpoint) generated after service deployment.
    # Do not include the `http://` scheme. Append the port `:80`.
    # Although Triton listens on port 8001 internally, PAI-EAS exposes gRPC via port 80 externally. Use :80 in your client.
    host = (
        "service_name.115770327099****.cn-beijing.pai-eas.aliyuncs.com:80"
    )
    # Service token. Use your actual token in a real application.
    token = "<your-service-token>"
    # Model name and version.
    model_name = "<your-model-name>"
    model_version = "<version-num>"
    
    # Create gRPC metadata for token authentication.
    metadata = (("authorization", token),)

    # Create a gRPC channel and stub to communicate with the server.
    channel = grpc.insecure_channel(host)
    grpc_stub = service_pb2_grpc.GRPCInferenceServiceStub(channel)
    
    # Build the inference request.
    request = service_pb2.ModelInferRequest()
    request.model_name = model_name
    request.model_version = model_version

    # Build the input tensor. It must match the input in the model configuration.
    input_tensor = service_pb2.ModelInferRequest().InferInputTensor()
    input_tensor.name = "INPUT__0"
    input_tensor.datatype = "FP32"
    input_tensor.shape.extend([1, 3, 224, 224])
    # Build the output tensor. It must match the output in the model configuration.
    output_tensor = service_pb2.ModelInferRequest().InferRequestedOutputTensor()
    output_tensor.name = "OUTPUT__0"

    # Add the input and output tensors to the request.
    request.inputs.extend([input_tensor])
    request.outputs.extend([output_tensor])
    
    # Create a random array and serialize it into a byte sequence as input data.
    request.raw_input_contents.append(np.random.rand(1, 3, 224, 224).astype(np.float32).tobytes())
            
    # Send the inference request and receive the response.
    response, _ = grpc_stub.ModelInfer.with_call(request, metadata=metadata)
    
    # Extract the output tensor from the response.
    output_contents = response.raw_output_contents[0]  # Assume there is only one output tensor.
    output_shape = [1, 1000]  # Assume the output tensor shape is [1, 1000].
    
    # Convert the output bytes to a NumPy array.
    output_array = np.frombuffer(output_contents, dtype=np.float32)
    output_array = output_array.reshape(output_shape)
    
    # Print the model's output.
    print("Model output:\n", output_array)

Debugging tips

Enable verbose logging

Set verbose=True to print the JSON data for requests and responses:

client = httpclient.InferenceServerClient(url=url, verbose=True)

Example output:

POST /api/predict/triton_test/v2/models/resnet50_pt/versions/1/infer, headers {'Authorization': '************1ZDY3OTEzNA=='}
b'{"inputs":[{"name":"INPUT__0","shape":[1,3,32,32],"datatype":"FP32","data":[1.0,1.0,1.0,.....,1.0]}],"outputs":[{"name":"OUTPUT__0","parameters":{"binary_data":false}}]}'

Online debugging

You can also test the service directly by using the online debugging feature in the console. Complete the request URL so that it ends with /api/predict/triton_test/v2/models/resnet50_pt/versions/1/infer, and use the JSON request data from the verbose logs as the request body.
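
Equivalently, you can send the same JSON body with any HTTP tool. The following sketch uses the requests library against the full inference URL shown above; the endpoint, service name (triton_test), model name (resnet50_pt), and token are placeholders taken from the example logs.

import requests

# Full KServe v2 inference URL: <service endpoint>/v2/models/<model>/versions/<version>/infer
url = "http://<service-endpoint>/api/predict/triton_test/v2/models/resnet50_pt/versions/1/infer"
headers = {"Authorization": "<your-service-token>"}

payload = {
    "inputs": [
        {"name": "INPUT__0", "shape": [1, 3, 32, 32], "datatype": "FP32",
         "data": [1.0] * (3 * 32 * 32)}
    ],
    "outputs": [{"name": "OUTPUT__0", "parameters": {"binary_data": False}}],
}

resp = requests.post(url, headers=headers, json=payload)
print(resp.status_code)
print(resp.json())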


Stress test the service

The following steps describe how to perform a stress test by using a single data entry as an example. For more information about stress testing, see Stress testing for services in common scenarios.

  1. On the One-Click Stress Testing tab, click Create Stress Testing Task, select your deployed Triton service, and enter the stress test URL.

  2. Set the Data Source to Single Data Entry. Use the following code to Base64-encode your JSON request body:

    import base64
    
    # Existing JSON request body string
    json_str = '{"inputs":[{"name":"INPUT__0","shape":[1,3,32,32],"datatype":"FP32","data":[1.0,1.0,.....,1.0]}]}'
    # Direct encoding
    base64_str = base64.b64encode(json_str.encode('utf-8')).decode('ascii')
    print(base64_str)


FAQ

Q: Why do I get a "CUDA error: no kernel image is available for execution on the device"?

This error indicates a compatibility mismatch between the CUDA version in the Triton image and the architecture of the selected GPU instance.

To resolve this, switch to a different GPU instance type that is compatible with your image's CUDA version. For example, try using an A10 or T4 instance.

Q: How can I fix the "InferenceServerException: url should not include the scheme" for HTTP requests?

This error occurs because the tritonclient.http.InferenceServerClient requires the URL to be provided without the protocol scheme (e.g., http:// or https://).

To fix this, remove the scheme from your URL string.

Q: How do I resolve a "DNS resolution failed" error when making gRPC calls?

This error occurs because the service host is incorrectly formatted. The service endpoint has the format http://we*****.1751930*****.cn-hangzhou.pai-eas.aliyuncs.com/ (note that the gRPC endpoint differs from the HTTP endpoint). Remove the http:// prefix and the trailing /, and then append :80. The final format is we*****.1751930*****.cn-hangzhou.pai-eas.aliyuncs.com:80.
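
A small helper like the following can normalize the endpoint for gRPC; the sample endpoint is the masked placeholder from above.

def to_grpc_host(endpoint: str) -> str:
    """Strip the scheme and trailing slash from an EAS endpoint and append port 80 for gRPC."""
    host = endpoint.replace("http://", "").replace("https://", "").rstrip("/")
    return host + ":80"

print(to_grpc_host("http://we*****.1751930*****.cn-hangzhou.pai-eas.aliyuncs.com/"))
# -> we*****.1751930*****.cn-hangzhou.pai-eas.aliyuncs.com:80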
