Large language models (LLMs) require GPU-accelerated infrastructure and an optimized serving stack to handle inference at scale. This topic describes how to deploy the Qwen1.5-4B-Chat model as an inference service on Container Service for Kubernetes (ACK) by using NVIDIA Triton Inference Server with the vLLM backend on T4 or A10 GPUs.
Background
Qwen1.5-4B-Chat
Qwen1.5-4B-Chat is a 4-billion-parameter LLM developed by Alibaba Cloud based on the Transformer architecture. The model is trained on large-scale datasets that cover web text, domain-specific books, and code. For more information, see the Qwen GitHub repository.
Triton Inference Server
Triton Inference Server is an open-source inference serving framework developed by NVIDIA. It supports multiple machine learning framework backends, including TensorRT, TensorFlow, PyTorch, ONNX, and vLLM.
Key features:
Multiple ML and deep learning framework support
Concurrent model execution
Continuous batching
Built-in inference metrics: GPU utilization, request latency, and throughput
For more information, see the Triton Inference Server GitHub repository.
vLLM
vLLM is a high-performance LLM inference framework that supports most popular LLMs, including Qwen models. vLLM uses PagedAttention optimization, continuous batching, and model quantization to significantly improve LLM inference throughput. For more information, see the vLLM GitHub repository.
Prerequisites
Before you begin, make sure that you have:
An ACK Pro cluster with GPU-accelerated nodes (Kubernetes 1.22 or later, 16 GB or more GPU memory per node). For more information, see Create an ACK managed cluster.
GPU driver version 525 installed on GPU nodes. Add the ack.aliyun.com/nvidia-driver-version:525.105.17 label to GPU-accelerated nodes to pin the driver version. For more information, see Specify an NVIDIA driver version for nodes by adding a label.
The latest Arena client installed. For more information, see Configure the Arena client.
Step 1: Prepare model data
Download the Qwen1.5-4B-Chat model, upload it to Object Storage Service (OSS), and create persistent volumes (PVs) and persistent volume claims (PVCs) in the ACK cluster.
To deploy other vLLM-supported models, see Supported models. To use File Storage NAS instead of OSS, see Mount a statically provisioned NAS volume.
Download the model
Install Git:
# Use yum install git or apt install git
yum install git
Install the Git Large File Storage (LFS) plugin:
# Use yum install git-lfs or apt install git-lfs
yum install git-lfs
Clone the Qwen1.5-4B-Chat repository from ModelScope:
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git
Enter the directory and pull the large files:
cd Qwen1.5-4B-Chat
git lfs pull
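Because the repository is cloned with GIT_LFS_SKIP_SMUDGE=1, a forgotten `git lfs pull` leaves kilobyte-sized LFS pointer files in place of the weights. Before uploading, you can sanity-check the directory with a short script (a hypothetical helper, not part of any toolchain here):

```python
import os

def looks_complete(model_dir, min_weight_mb=100):
    """Heuristic check that weight files were materialized, not left as LFS pointers."""
    names = os.listdir(model_dir)
    has_config = "config.json" in names
    weights = [n for n in names if n.endswith((".safetensors", ".bin"))]
    big_enough = all(
        os.path.getsize(os.path.join(model_dir, n)) > min_weight_mb * 2**20
        for n in weights
    )
    return has_config and bool(weights) and big_enough
```

If this returns False, rerun `git lfs pull` inside the model directory before uploading to OSS.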
Upload the model to OSS
Log on to the OSS console and note the name of your OSS bucket. To create a bucket, see Create a bucket.
Install and configure ossutil. For more information, see Install ossutil.
Create a directory in OSS and upload the model:
ossutil mkdir oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
ossutil cp -r ./Qwen1.5-4B-Chat oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
Create PVs and PVCs
Create a PV and PVC to mount the OSS model data in the cluster. For more information, see Mount a statically provisioned OSS volume.
PV parameters
| Parameter | Value |
| --- | --- |
| PV Type | OSS |
| Volume Name | llm-model |
| Access Certificate | The AccessKey ID and AccessKey secret used to access the OSS bucket |
| Bucket ID | The name of your OSS bucket |
| OSS Path | The model path, such as /models/Qwen1.5-4B-Chat |
PVC parameters
| Parameter | Value |
| --- | --- |
| PVC Type | OSS |
| Volume Name | llm-model |
| Allocation Mode | Existing Volumes |
| Existing Volumes | Select the PV you created |
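If you prefer to create the volumes declaratively with kubectl instead of the console, the tables above roughly correspond to the following manifests. This is a sketch only: the Secret name oss-secret, the endpoint URL, and the capacity are placeholders, and the exact volumeAttributes depend on your OSS CSI plugin version.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret          # placeholder: Secret holding the AccessKey pair
      namespace: default
    volumeAttributes:
      bucket: "<Your-Bucket-Name>"
      path: "/models/Qwen1.5-4B-Chat"
      url: "oss-cn-beijing-internal.aliyuncs.com"   # example region endpoint
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
```

The selector on alicloud-pvname binds the PVC to this specific PV, matching the "Existing Volumes" allocation mode in the console.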
Step 2: Configure Triton with vLLM
Create two configuration files: config.pbtxt for the Triton backend, and model.json for vLLM engine parameters.
Create the backend configuration
Create a working directory and the config.pbtxt file:
mkdir triton-vllm
cat << EOF > triton-vllm/config.pbtxt
backend: "vllm"
# The usage of device is deferred to the vLLM engine
instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]
version_policy: { all { }}
EOF
Create the model configuration
The model.json file passes parameters to the vLLM engine. Choose the configuration that matches your GPU type.
vLLM allocates GPU memory aggressively on startup. The gpu_memory_utilization parameter controls this behavior. Setting it to 0.95 reserves 95% of GPU memory for the model. If other workloads share the same GPU, lower this value to avoid out-of-memory errors.
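As a rough illustration of that trade-off, the memory vLLM can spend on KV cache is the reserved fraction minus the model weights. All figures below are illustrative; a 4B-parameter model needs roughly 8 GiB for 16-bit weights.

```python
def kv_budget_gib(total_gib, utilization, weights_gib):
    """GPU memory left for KV cache after loading weights, under vLLM's reservation."""
    return total_gib * utilization - weights_gib

# Illustrative figures: ~8 GiB of 16-bit weights for a 4B-parameter model.
print(round(kv_budget_gib(24.0, 0.95, 8.0), 1))  # A10, 24 GiB -> 14.8
print(round(kv_budget_gib(16.0, 0.95, 8.0), 1))  # T4, 16 GiB  -> 7.2
```

The A10 leaves roughly twice the KV-cache headroom of the T4, which is why the two GPU configurations that follow differ in max_model_len.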
A10 GPU (production)
A10 GPUs deliver higher throughput and support bfloat16 precision. Use A10 for production workloads.
cat << EOF > triton-vllm/model.json
{
"model":"/model/Qwen1.5-4B-Chat",
"disable_log_requests": "true",
"gpu_memory_utilization": 0.95,
"trust_remote_code": "true",
"max_model_len": 16384
}
EOF
T4 GPU (testing)
T4 GPUs are widely available and cost-effective, but do not support bfloat16 (bf16) precision. Set dtype to half (FP16) and use a lower max_model_len to fit within the 16 GB memory limit.
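A back-of-the-envelope estimate shows why the context length must shrink: the per-sequence KV cache grows linearly with max_model_len. The attention dimensions below are illustrative placeholders, not Qwen1.5-4B's actual configuration.

```python
def kv_cache_gib(tokens, layers=40, kv_heads=20, head_dim=128, dtype_bytes=2):
    """Per-sequence KV cache: 2 (K and V) * layers * heads * head_dim * bytes per element."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens / 2**30

print(kv_cache_gib(8192))   # 3.125 GiB per full-length sequence
print(kv_cache_gib(16384))  # 6.25 GiB per full-length sequence
```

On a 16 GiB T4, a multi-GiB cache per long sequence leaves little room for batching, so the T4 configuration halves max_model_len.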
cat << EOF > triton-vllm/model.json
{
"model":"/model/Qwen1.5-4B-Chat",
"disable_log_requests": "true",
"gpu_memory_utilization": 0.95,
"trust_remote_code": "true",
"dtype": "half",
"max_model_len": 8192
}
EOF
Key parameters
| Parameter | Description |
| --- | --- |
| max_model_len | Maximum token sequence length the model can process. Higher values improve conversation quality but consume more GPU memory. |
| dtype | Floating-point precision for model loading. Set to half (FP16) on GPUs that do not support bfloat16, such as T4. |
| gpu_memory_utilization | Fraction of GPU memory allocated to the model. Default is 0.9. |
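A malformed model.json typically surfaces only as a generic model-load failure in the Triton logs, so it can save a crash loop to validate the file first. The check below is a hypothetical pre-flight script, and the accepted dtype values are assumptions based on common vLLM builds.

```python
import json

def check_model_config(text):
    """Parse model.json and catch obviously invalid values before deployment."""
    cfg = json.loads(text)
    for key in ("model", "gpu_memory_utilization", "max_model_len"):
        assert key in cfg, f"missing key: {key}"
    assert 0 < float(cfg["gpu_memory_utilization"]) <= 1, "utilization must be in (0, 1]"
    assert cfg.get("dtype", "auto") in {"auto", "half", "float16", "bfloat16", "float32"}
    return cfg

cfg = check_model_config(
    '{"model": "/model/Qwen1.5-4B-Chat", "disable_log_requests": "true",'
    ' "gpu_memory_utilization": 0.95, "trust_remote_code": "true",'
    ' "dtype": "half", "max_model_len": 8192}'
)
print(cfg["max_model_len"])  # 8192
```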
For the full list of configurable parameters, see the vLLM Engine Arguments documentation. For more configuration examples, see Deploying a vLLM model in Triton.
Step 3: Deploy the inference service
Use Arena to deploy the Qwen1.5-4B-Chat inference service with Triton and vLLM.
Export the configuration file paths as environment variables:
export triton_config_file="triton-vllm/config.pbtxt"
export model_config_file="triton-vllm/model.json"
Deploy the inference service:
Parameters
| Parameter | Description |
| --- | --- |
| --name | Name of the inference service. |
| --version | Version of the inference service. |
| --image | Container image for the Triton server. |
| --gpus | Number of GPUs per replica. |
| --cpu | Number of CPU cores per replica. |
| --memory | Memory allocation per replica. |
| --data | PVC mount in <pvc-name>:<mount-path> format. Run arena data list to list available PVCs. |
| --config-file | Local file mount in <local-path>:<container-path> format. |
| --model-repository | Triton model repository directory. Each subdirectory represents a model and must contain its configuration files. For more information, see Triton model repository. |
| --http-port | HTTP port for the Triton service. |
| --grpc-port | gRPC port for the Triton service. |
| --allow-metrics | Expose inference metrics (GPU utilization, latency, throughput). |
arena serve triton \
    --name=triton-vllm \
    --version=v1 \
    --image=ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/tritonserver:24.04-vllm-python-py3-ubuntu22.04 \
    --gpus=1 \
    --cpu=6 \
    --memory=30Gi \
    --data="llm-model:/model/Qwen1.5-4B-Chat" \
    --model-repository /triton-config \
    --config-file="$model_config_file:/triton-config/qwen-4b/1/model.json" \
    --config-file="$triton_config_file:/triton-config/qwen-4b/config.pbtxt" \
    --http-port=8000 \
    --grpc-port=9000 \
    --allow-metrics=true
Expected output:
configmap/triton-vllm-v1-4bd5884e6b5b6a3 created
configmap/triton-vllm-v1-7815124a8204002 created
service/triton-vllm-v1-tritoninferenceserver created
deployment.apps/triton-vllm-v1-tritoninferenceserver created
INFO[0007] The Job triton-vllm has been submitted successfully
INFO[0007] You can run `arena serve get triton-vllm --type triton-serving -n default` to check the job status
Verify that the service is running. Wait until Available shows 1.
arena serve get triton-vllm
Expected output:
Name:       triton-vllm
Namespace:  default
Type:       Triton
Version:    v1
Desired:    1
Available:  1
Age:        3m
Address:    172.16.XX.XX
Port:       RESTFUL:8000,GRPC:9000
GPU:        1

Instances:
  NAME                                                  STATUS   AGE  READY  RESTARTS  GPU  NODE
  ----                                                  ------   ---  -----  --------  ---  ----
  triton-vllm-v1-tritoninferenceserver-b69cb7759-gkwz6  Running  3m   1/1    0         1    cn-beijing.172.16.XX.XX
Step 4: Verify the inference service
Port forwarding (development only)
Port forwarding through kubectl port-forward is intended for development and debugging only. It is not reliable, secure, or scalable for production use. For production networking, see Ingress overview.
Set up port forwarding:
kubectl port-forward svc/triton-vllm-v1-tritoninferenceserver 8000:8000
Expected output:
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
Send a test request to the Triton generate endpoint. Replace qwen-4b with your actual model name if different.
curl -X POST localhost:8000/v2/models/qwen-4b/generate \
    -d '{"text_input": "What is AI? AI is", "parameters": {"stream": false, "temperature": 0}}'
Expected output:
{"model_name":"qwen-4b","model_version":"1","text_output":"What is AI? AI is a branch of computer science that studies how to make computers intelligent. Purpose of AI"}
(Optional) Clean up
Delete the inference service and storage resources when they are no longer needed:
# Delete the inference service
arena serve del triton-vllm
# Delete the PVC and PV
kubectl delete pvc llm-model
kubectl delete pv llm-model