Triton Inference Server, developed by NVIDIA, is an inference serving engine for deep learning and machine learning models. It supports deploying models from various AI frameworks, such as TensorRT, TensorFlow, PyTorch, and ONNX, as online inference services. It also supports features such as multi-model management and custom backends. This topic describes how to deploy a Triton-based inference service on PAI-EAS.
Prerequisites
An Object Storage Service (OSS) bucket in the same region as PAI.
A trained model file, such as .pt, .onnx, .plan, or .savedmodel.
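If you start from a PyTorch model, the following is a minimal sketch of exporting it as a TorchScript .pt file that the pytorch_libtorch platform can load. The torchvision ResNet-50 is only an illustrative assumption; substitute your own trained model.
import torch
import torchvision

# Illustrative model only; replace it with your own trained model.
model = torchvision.models.resnet50(weights="IMAGENET1K_V1").eval()

# Trace the model with a representative input and save it as TorchScript.
example_input = torch.ones(1, 3, 224, 224)
traced = torch.jit.trace(model, example_input)
traced.save("model.pt")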
Getting started: Deploy a single-model service
Step 1: Prepare the model repository
Triton requires a specific directory structure in an Object Storage Service (OSS) bucket. Create the directories in the following format. For more information, see Manage directories and Upload files.
oss://your-bucket/models/triton/
└── your_model_name/
├── 1/ # Version directory (must be a number)
│ └── model.pt # Model file
└── config.pbtxt # Model configuration file
Key requirements:
Version directories must be named with numbers, such as 1, 2, or 3. A higher number indicates a newer version.
Each model requires a config.pbtxt configuration file.
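If you prefer to upload the model files programmatically rather than through the console, the following is a minimal sketch that uses the oss2 Python SDK. The credentials, endpoint, bucket name, and local paths are assumptions; replace them with your own values.
import oss2

# Assumed credentials, endpoint, and bucket; replace them with your own values.
auth = oss2.Auth("<your-access-key-id>", "<your-access-key-secret>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "your-bucket")

# Upload the model file and its configuration into the Triton repository layout.
bucket.put_object_from_file("models/triton/your_model_name/1/model.pt", "local/model.pt")
bucket.put_object_from_file("models/triton/your_model_name/config.pbtxt", "local/config.pbtxt")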
Step 2: Create the model configuration file
Create a config.pbtxt file to configure the basic information for the model. The following is an example:
name: "your_model_name"
platform: "pytorch_libtorch"
max_batch_size: 128
input [
{
name: "INPUT__0"
data_type: TYPE_FP32
dims: [ 3, -1, -1 ]
}
]
output [
{
name: "OUTPUT__0"
data_type: TYPE_FP32
dims: [ 1000 ]
}
]
# Use a GPU for inference
# instance_group [
# {
# kind: KIND_GPU
# }
# ]
# Model version configuration
# Load only the latest version (default behavior)
# version_policy: { latest: { num_versions: 1 }}
# Load all versions
# version_policy: { all { }}
# Load the two latest versions
# version_policy: { latest: { num_versions: 2 }}
# Load specified versions
# version_policy: { specific: { versions: [1, 3] }}
Parameter descriptions
| Parameter | Required | Description |
| --- | --- | --- |
| name | No | The model name. If specified, it must be the same as the model directory name. |
| platform | Yes | The model framework. Valid values include pytorch_libtorch, tensorflow_savedmodel, onnxruntime_onnx, and tensorrt_plan. |
| backend | Yes | An alternative to platform. Specify the backend name instead of the framework, for example python for the Python Backend. |
| max_batch_size | Yes | The maximum batch size. Set to 0 to disable batching. |
| input | Yes | Input tensor configuration: the tensor name, data_type, and dims. |
| output | Yes | Output tensor configuration: the tensor name, data_type, and dims. |
| instance_group | No | Specifies the inference device: KIND_GPU or KIND_CPU. If not specified, Triton uses GPUs when they are available. |
| version_policy | No | Controls which model versions are loaded. For a configuration example, see the commented version_policy examples in the preceding config.pbtxt. |
You must configure at least one of platform or backend.
Step 3: Deploy the service
Log on to the PAI console. In the top navigation bar, select the destination region.
In the left navigation pane, click Elastic Algorithm Service (EAS). Select the target workspace and click Deploy Service.
In the Scenario-based Model Deployment section, click Triton Deployment.
Configure the deployment parameters:
Service Name: Enter a custom service name.
Model Configuration: Set the Configuration Type to OSS and enter the model repository path, such as oss://your-bucket/models/triton/.
Select values for Instance Count and Resource Type as needed. To estimate the required VRAM for model deployment, see Estimate the VRAM required for a large model.
Click Deploy and wait for the service to start.
Step 4: Enable gRPC (optional)
By default, Triton provides an HTTP service on port 8000. To use gRPC, perform the following steps:
In the upper-right corner of the service configuration page, click Convert to Custom Deployment.
In the Environment Information section, change the Port Number to 8001.
Under Features > Advanced Networking, enable gRPC.
Click Deploy.
After the model is deployed, you can call the service.
Deploy a multi-model service
To deploy multiple models in a single Triton instance, place the models in the same repository directory:
oss://your-bucket/models/triton/
├── resnet50_pytorch/
│ ├── 1/
│ │ └── model.pt
│ └── config.pbtxt
├── densenet_onnx/
│ ├── 1/
│ │ └── model.onnx
│ └── config.pbtxt
└── classifier_tensorflow/
├── 1/
│ └── model.savedmodel/
│ ├── saved_model.pb
│ └── variables/
└── config.pbtxt
The deployment procedure is the same as for a single model. Triton automatically loads all models in the repository.
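To check which models a multi-model service has loaded, you can query the server from the HTTP client. The following sketch assumes the endpoint and token described later in the Call the service section.
import tritonclient.http as httpclient

# Assumed endpoint and token; use the values from your service details page.
url = "1859257******.cn-hangzhou.pai-eas.aliyuncs.com/api/predict/triton_server_test"
headers = {"Authorization": "<your-service-token>"}
client = httpclient.InferenceServerClient(url=url)

# List every model in the repository and check whether it is loaded.
for model in client.get_model_repository_index(headers=headers):
    print(model["name"], "ready:", client.is_model_ready(model["name"], headers=headers))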
Use the Python Backend to customize inference logic
You can use Triton's Python Backend to customize pre-processing, post-processing, or inference logic.
Directory structure
your_model_name/
├── 1/
│ ├── model.pt # Model file
│ └── model.py # Custom inference logic
└── config.pbtxt
Implement the Python Backend
Create a model.py file and define the TritonPythonModel class:
import json
import os
import torch
from torch.utils.dlpack import from_dlpack, to_dlpack
import triton_python_backend_utils as pb_utils
class TritonPythonModel:
"""The class name must be "TritonPythonModel"."""
def initialize(self, args):
"""
The initializer function. This is optional. It is called once when the model is loaded and can be used to initialize information related to model properties and configurations.
Parameters
----------
args : A dictionary where both keys and values are strings. It includes:
* model_config: Model configuration information in JSON format.
* model_instance_kind: The kind of device (CPU or GPU) on which the model instance runs.
* model_instance_device_id: The device ID.
* model_repository: The model repository path.
* model_version: The model version.
* model_name: The model name.
"""
# Convert the model configuration content from a JSON string to a Python dictionary.
self.model_config = model_config = json.loads(args["model_config"])
# Get the properties from the model configuration file.
output_config = pb_utils.get_output_config_by_name(model_config, "OUTPUT__0")
# Convert Triton types to numpy types.
self.output_dtype = pb_utils.triton_string_to_numpy(output_config["data_type"])
# Get the path of the model repository.
self.model_directory = os.path.dirname(os.path.realpath(__file__))
# Get the device used for model inference. This example uses a GPU.
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device: ", self.device)
model_path = os.path.join(self.model_directory, "model.pt")
if not os.path.exists(model_path):
raise pb_utils.TritonModelException("Cannot find the pytorch model")
# Load the PyTorch model to the GPU using .to(self.device).
self.model = torch.jit.load(model_path).to(self.device)
print("Initialized...")
def execute(self, requests):
"""
The model execution function. This must be implemented. This function is called whenever inference requests arrive. If max_batch_size is set, you must implement the batching logic yourself.
Parameters
----------
requests : A list of requests of the pb_utils.InferenceRequest type.
Returns
-------
A list of responses of the pb_utils.InferenceResponse type. The length of the list must be the same as the length of the request list.
"""
output_dtype = self.output_dtype
responses = []
# Traverse the request list and create a corresponding response for each request.
for request in requests:
# Get the input tensor.
input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT__0")
# Convert the Triton tensor to a Torch tensor.
pytorch_tensor = from_dlpack(input_tensor.to_dlpack())
if pytorch_tensor.shape[2] > 1000 or pytorch_tensor.shape[3] > 1000:
responses.append(
pb_utils.InferenceResponse(
output_tensors=[],
error=pb_utils.TritonError(
"Image shape should not be larger than 1000"
),
)
)
continue
# Perform inference computation on the GPU.
prediction = self.model(pytorch_tensor.to(self.device))
# Convert the Torch output tensor to a Triton tensor.
out_tensor = pb_utils.Tensor.from_dlpack("OUTPUT__0", to_dlpack(prediction))
inference_response = pb_utils.InferenceResponse(output_tensors=[out_tensor])
responses.append(inference_response)
return responses
def finalize(self):
"""
Called when the model is unloaded. This is optional and can be used for model cleanup tasks.
"""
print("Cleaning up...")
Note that when you use the Python Backend, some of Triton's behaviors change:
max_batch_size has no effect: The max_batch_size parameter in config.pbtxt does not enable dynamic batching in the Python Backend. You must iterate through the requests list in the execute method and manually build the batch for inference, as shown in the sketch after this list.
instance_group has no effect: The instance_group setting in config.pbtxt does not control whether the Python Backend uses a CPU or a GPU. You must explicitly move the model and data to the target device in the initialize and execute methods, for example with pytorch_tensor.to(torch.device("cuda")).
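The following is a minimal sketch of such manual batching, written as a drop-in variant of the execute method in the example above. It reuses the imports, self.model, and self.device from that example and assumes that every request carries an INPUT__0 tensor whose non-batch dimensions match, so the tensors can be concatenated along the batch dimension.
def execute(self, requests):
    # Collect the input tensor of every pending request and move it to the target device.
    batched_inputs = []
    for request in requests:
        input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT__0")
        batched_inputs.append(from_dlpack(input_tensor.to_dlpack()).to(self.device))
    # Concatenate along the batch dimension and run a single forward pass.
    predictions = self.model(torch.cat(batched_inputs, dim=0))
    # Split the batched output back into one response per request.
    responses = []
    offset = 0
    for item in batched_inputs:
        n = item.shape[0]
        out_tensor = pb_utils.Tensor.from_dlpack(
            "OUTPUT__0", to_dlpack(predictions[offset:offset + n])
        )
        responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        offset += n
    return responses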
Update the configuration file
name: "resnet50_pt"
backend: "python"
max_batch_size: 128
input [
{
name: "INPUT__0"
data_type: TYPE_FP32
dims: [ 3, -1, -1 ]
}
]
output [
{
name: "OUTPUT__0"
data_type: TYPE_FP32
dims: [ 1000 ]
}
]
parameters: {
key: "FORCE_CPU_ONLY_INPUT_TENSORS"
value: {string_value: "no"}
}
The key parameters are described as follows:
backend: Must be set to `python`.
parameters: Optional. If you use a GPU for inference, set the FORCE_CPU_ONLY_INPUT_TENSORS parameter to `no` to avoid the overhead of copying input tensors between the CPU and GPU.
Deploy the service
To use the Python Backend, you must configure shared memory. Enter the following JSON configuration in the JSON editor when you deploy the service, and then deploy the service to implement custom inference logic.
{
"metadata": {
"name": "triton_server_test",
"instance": 1
},
"cloud": {
"computing": {
"instance_type": "ml.gu7i.c8m30.1-gu30",
"instances": null
}
},
"containers": [
{
"command": "tritonserver --model-repository=/models",
"image": "eas-registry-vpc.<region>.cr.aliyuncs.com/pai-eas/tritonserver:23.02-py3",
"port": 8000,
"prepare": {
"pythonRequirements": [
"torch==2.0.1"
]
}
}
],
"storage": [
{
"mount_path": "/models",
"oss": {
"path": "oss://oss-test/models/triton_backend/"
}
},
{
"empty_dir": {
"medium": "memory",
// Configure shared memory as 1 GB.
"size_limit": 1
},
"mount_path": "/dev/shm"
}
]
}
Key JSON configuration descriptions:
containers[0].image: The official Triton image. Replace <region> with the ID of the region where your service is located, such as cn-hangzhou.
containers[0].prepare.pythonRequirements: List your Python dependencies here. EAS automatically installs them before the service starts.
storage: Contains two mount items. The first mounts your OSS model repository path to the /models directory in the container. The second is the required shared memory configuration: the Triton server and the Python Backend process use shared memory at /dev/shm for zero-copy tensor data transfer to maximize performance. The unit of size_limit is GB. Estimate the required size based on your model and concurrency.
Call the service
Get the service endpoint and token
On the Elastic Algorithm Service (EAS) page, click the service name.
On the Service Details tab, click View Endpoint Information. Copy the Internet Endpoint and Token.
Send an HTTP request
If the port is configured to 8000, the service supports HTTP requests.
import numpy as np
# To install the tritonclient package, run the command: pip install tritonclient
import tritonclient.http as httpclient
# The endpoint generated after the service is deployed. Do not include http://
url = '1859257******.cn-hangzhou.pai-eas.aliyuncs.com/api/predict/triton_server_test'
triton_client = httpclient.InferenceServerClient(url=url)
image = np.ones((1,3,224,224))
image = image.astype(np.float32)
inputs = []
inputs.append(httpclient.InferInput('INPUT__0', image.shape, "FP32"))
inputs[0].set_data_from_numpy(image, binary_data=False)
outputs = []
outputs.append(httpclient.InferRequestedOutput('OUTPUT__0', binary_data=False)) # Get a 1000-dimensional vector
# Specify the model name, request token, inputs, and outputs.
results = triton_client.infer(
model_name="<your-model-name>",
model_version="<version-num>",
inputs=inputs,
outputs=outputs,
headers={"Authorization": "<your-service-token>"},
)
output_data0 = results.as_numpy('OUTPUT__0')
print(output_data0.shape)
print(output_data0)
Send a gRPC request
If the port is configured to 8001 and gRPC is enabled, the service supports gRPC requests.
Note: The gRPC endpoint is different from the HTTP endpoint. Obtain it again from the service details page.
#!/usr/bin/env python
import grpc
# To install the tritonclient package, run the command: pip install tritonclient
from tritonclient.grpc import service_pb2, service_pb2_grpc
import numpy as np
if __name__ == "__main__":
# The endpoint generated after the service is deployed. Do not include http://. Append ":80" to the end.
host = (
"service_name.115770327099****.cn-beijing.pai-eas.aliyuncs.com:80"
)
# The service token. Use a real token in actual applications.
token = "<your-service-token>"
# The model name and version.
model_name = "<your-model-name>"
model_version = "<version-num>"
# Create gRPC metadata for token authentication.
metadata = (("authorization", token),)
# Create a gRPC channel and stub to communicate with the server.
channel = grpc.insecure_channel(host)
grpc_stub = service_pb2_grpc.GRPCInferenceServiceStub(channel)
# Build the inference request.
request = service_pb2.ModelInferRequest()
request.model_name = model_name
request.model_version = model_version
# Construct the input tensor, which corresponds to the input parameter defined in the model configuration file.
input = service_pb2.ModelInferRequest().InferInputTensor()
input.name = "INPUT__0"
input.datatype = "FP32"
input.shape.extend([1, 3, 224, 224])
# Construct the output tensor, which corresponds to the output parameter defined in the model configuration file.
output = service_pb2.ModelInferRequest().InferRequestedOutputTensor()
output.name = "OUTPUT__0"
# Create the input request.
request.inputs.extend([input])
request.outputs.extend([output])
# Construct a random number array and serialize it into a byte sequence as input data.
request.raw_input_contents.append(np.random.rand(1, 3, 224, 224).astype(np.float32).tobytes()) # Numeric type
# Initiate the inference request and receive the response.
response, _ = grpc_stub.ModelInfer.with_call(request, metadata=metadata)
# Extract the output tensor from the response.
output_contents = response.raw_output_contents[0] # Assume there is only one output tensor.
output_shape = [1, 1000] # Assume the shape of the output tensor is [1, 1000].
# Convert the output bytes to a numpy array.
output_array = np.frombuffer(output_contents, dtype=np.float32)
output_array = output_array.reshape(output_shape)
# Print the model's output result.
print("Model output:\n", output_array)Debugging tips
Enable verbose logging
Set verbose=True to print the JSON data of requests and responses:
client = httpclient.InferenceServerClient(url=url, verbose=True)
Example output:
POST /api/predict/triton_test/v2/models/resnet50_pt/versions/1/infer, headers {'Authorization': '************1ZDY3OTEzNA=='}
b'{"inputs":[{"name":"INPUT__0","shape":[1,3,32,32],"datatype":"FP32","data":[1.0,1.0,1.0,.....,1.0]}],"outputs":[{"name":"OUTPUT__0","parameters":{"binary_data":false}}]}'Online debugging
You can test the service directly using the online debugging feature in the console. Set the request URL to /api/predict/triton_test/v2/models/resnet50_pt/versions/1/infer and use the JSON request data from the verbose log as the request body.
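You can also send the same JSON body from Python without the tritonclient package. The following sketch uses the requests library and assumes the inference path shown in the verbose log above; replace the endpoint, service name, model name, version, and token with your own values.
import requests

# Assumed endpoint, service name, model name, version, and token; replace them with your own values.
url = ("http://1859257******.cn-hangzhou.pai-eas.aliyuncs.com"
       "/api/predict/triton_test/v2/models/resnet50_pt/versions/1/infer")
headers = {"Authorization": "<your-service-token>"}
body = {
    "inputs": [{
        "name": "INPUT__0",
        "shape": [1, 3, 32, 32],
        "datatype": "FP32",
        "data": [1.0] * (3 * 32 * 32),
    }],
    "outputs": [{"name": "OUTPUT__0", "parameters": {"binary_data": False}}],
}

response = requests.post(url, json=body, headers=headers)
print(response.status_code)
print(response.json())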

Stress test the service
The following steps describe how to perform a stress test using a single piece of data as an example. For more information about stress testing, see Stress testing for services in common scenarios.
On the Stress Testing Task tab, click Add Stress Testing Task, select the deployed Triton service, and then enter the stress testing endpoint.
Set Data Source to Single Data and run the following code to convert the JSON-formatted request body into a Base64-encoded string.
import base64

# Existing JSON request body string
json_str = '{"inputs":[{"name":"INPUT__0","shape":[1,3,32,32],"datatype":"FP32","data":[1.0,1.0,.....,1.0]}]}'

# Direct encoding
base64_str = base64.b64encode(json_str.encode('utf-8')).decode('ascii')
print(base64_str)
FAQ
Q: I receive the error "CUDA error: no kernel image is available for execution on the device". What should I do?
This error occurs because of a compatibility issue between the image version and the GPU. Try switching to a different GPU instance type, such as A10 or T4.
Q: When I make an HTTP call, I receive the error "tritonclient.utils.InferenceServerException: url should not include the scheme". How can I fix this?
This error occurs because the service URL is incorrect. The format of the service endpoint is http://17519301*******.cn-hangzhou.pai-eas.aliyuncs.com/api/predict/wen***** (note that this is different from the gRPC endpoint). Remove the http:// prefix.
Q: When I make a gRPC call, I receive the error "DNS resolution failed for wenyu****.175193***43.cn-hangzhou.pai-eas.aliyuncs.com/:80". How can I fix this?
This error occurs because the service host is incorrect. The format of the service endpoint is http://we*****.1751930*****.cn-hangzhou.pai-eas.aliyuncs.com/ (note that this is different from the HTTP endpoint). Remove the http:// prefix and the trailing /. Then, append :80 to the end. The final format is we*****.1751930*****.cn-hangzhou.pai-eas.aliyuncs.com:80.
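The following short sketch shows this transformation. The endpoint string is the masked placeholder from the error message, and str.removeprefix requires Python 3.9 or later.
# Derive the gRPC host: drop the scheme and the trailing slash, then append the port.
endpoint = "http://we*****.1751930*****.cn-hangzhou.pai-eas.aliyuncs.com/"
host = endpoint.removeprefix("http://").rstrip("/") + ":80"
print(host)  # we*****.1751930*****.cn-hangzhou.pai-eas.aliyuncs.com:80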
References
To learn how to deploy an EAS service using the TensorFlow Serving inference engine, see TensorFlow Serving image deployment.
You can also develop a custom image and use it to deploy an EAS service. For more information, see Custom images.
For more information about NVIDIA Triton, see the official Triton documentation.