This topic explains how to deploy a Qwen2 large language model (LLM) inference service on Alibaba Cloud Container Service for Kubernetes (ACK) using TensorRT-LLM and Triton Inference Server, and how to use Fluid data orchestration and caching to accelerate model loading.
Background information
Qwen2 large language model
Qwen2-1.5B-Instruct is a large language model developed by Tongyi Lab at Alibaba Group. It uses the Transformer architecture and has 1.5 billion parameters. The model is trained on massive, diverse pre-training data, including web text, professional books, and code, and delivers strong natural language understanding and generation capabilities.
The Qwen2 series has these features:
- Supports multiple inference tasks such as question answering, text generation, and code understanding
- Undergoes instruction tuning to better fit real-world use cases
- Has a moderate model size, making it suitable for GPU-based inference deployment
For more model details and technical specifications, see the Qwen2 official GitHub repository.
Triton Inference Server
Triton Inference Server is an open-source inference service framework developed by NVIDIA. It is designed for production environments. Triton supports multiple machine learning frameworks as backends—including TensorRT, TensorFlow, PyTorch, and ONNX Runtime—and unifies model deployment across frameworks.
Triton's key advantages:
- A unified inference interface that simplifies model deployment and management
- Dynamic batching support to improve GPU utilization
- Optimized concurrent processing for high-throughput inference
- Comprehensive monitoring and metrics collection
To learn more about Triton Inference Server, visit the Triton Inference Server official GitHub repository.
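Dynamic batching is the main lever here: Triton groups concurrently arriving requests into a single forward pass, so the fixed per-launch cost is paid once per batch instead of once per request. A toy back-of-the-envelope sketch (the millisecond figures are illustrative assumptions, not Triton measurements):

```python
# Toy illustration (not Triton code) of why dynamic batching raises GPU
# utilization: each forward pass pays a fixed launch overhead, so batching
# amortizes it across requests. The millisecond figures are assumptions.

LAUNCH_OVERHEAD_MS = 5.0  # hypothetical fixed cost per forward pass
PER_REQUEST_MS = 1.0      # hypothetical marginal cost per batched request

def total_time_ms(num_requests: int, batch_size: int) -> float:
    """Total GPU time to serve num_requests at the given batch size."""
    full, remainder = divmod(num_requests, batch_size)
    launches = full + (1 if remainder else 0)
    return launches * LAUNCH_OVERHEAD_MS + num_requests * PER_REQUEST_MS

print(total_time_ms(64, batch_size=1))  # 384.0 (64 separate launches)
print(total_time_ms(64, batch_size=8))  # 104.0 (8 launches)
```

With these assumptions, batching 8 requests together cuts total GPU time by more than 3x for the same workload.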
TensorRT-LLM optimization engine
TensorRT-LLM is an inference engine optimized by NVIDIA for large language models. It compiles LLMs into highly optimized TensorRT execution engines to deliver exceptional inference performance on NVIDIA GPUs.
Key features include the following:
- Model quantization and optimization to significantly boost inference speed
- Support for tensor parallelism and pipeline parallelism
- Deep integration with Triton through the TensorRT-LLM Backend
- Support for multiple precision modes, such as FP16 and INT8, to balance performance and accuracy
For more technical details, see the TensorRT-LLM official GitHub repository.
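To see why the precision mode matters for a model of this size, compare the weight memory at each precision. A quick sketch (weights only; the KV cache and activations require additional GPU memory):

```python
# Back-of-the-envelope weight memory for a 1.5B-parameter model at the
# precision modes mentioned above. Weights only; KV cache and activations
# need extra GPU memory on top of this.

PARAMS = 1.5e9
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1}

def weight_memory_gib(precision: str, params: float = PARAMS) -> float:
    """Approximate weight footprint in GiB at the given precision."""
    return params * BYTES_PER_PARAM[precision] / 1024**3

for precision in ("FP32", "FP16", "INT8"):
    print(f"{precision}: {weight_memory_gib(precision):.2f} GiB")
```

At FP16 the weights fit in roughly 2.8 GiB, which is why a single A10 GPU comfortably serves this model.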
Prerequisites
Before you begin deployment, ensure your environment meets these requirements:
- GPU environment preparation: You must have created an ACK Pro cluster that includes NVIDIA A10 GPUs. The cluster's Kubernetes version must be 1.22 or later. Recommended driver version: Use NVIDIA driver version 525. Specify this version by adding the label ack.aliyun.com/nvidia-driver-version:525.105.17 to your GPU node pool. For detailed steps, see the GPU driver upgrade documentation.
- Fluid component installation: You have installed the Fluid data orchestration and caching system in your cluster. If not, follow the Fluid installation guide.
- Arena tool configuration: You have installed and configured the Arena command-line interface (CLI) tool to deploy and manage model services. For installation steps, see the Arena installation documentation.
- OSS storage setup: You have activated Alibaba Cloud Object Storage Service (OSS) and created a bucket to store model files. For instructions, see the OSS Quick Start and Create a bucket.
- Permission configuration: Your Alibaba Cloud account has read and write permissions for OSS and management permissions for your ACK cluster.
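If you want to script the Kubernetes version prerequisite, a small helper can compare the version string reported by `kubectl version` against the 1.22 minimum. A minimal sketch; the handling of vendor suffixes such as `-aliyun.1` is an assumption about your cluster's version format:

```python
# Hedged helper: check whether a Kubernetes version string (for example,
# the serverVersion from `kubectl version`) meets the 1.22 minimum this
# guide requires. Vendor suffixes like "-aliyun.1" are ignored.

import re

def meets_minimum(version: str, minimum=(1, 22)) -> bool:
    """True if the major.minor part of `version` is >= `minimum`."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if not match:
        raise ValueError(f"unrecognized version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= tuple(minimum)

print(meets_minimum("v1.28.3-aliyun.1"))  # True
print(meets_minimum("v1.21.14"))          # False
```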
Step 1: Configure Fluid dataset and cache
In this step, create a Fluid Dataset and JindoRuntime to manage model data and provide high-performance caching. The Dataset organizes data. The JindoRuntime provides distributed caching to significantly improve model loading and inference performance.
Performance tip: With memory caching, model loading time drops from minutes to seconds.
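The size of that win is easy to estimate. A rough sketch; the bandwidth figures are illustrative assumptions for the comparison, not measurements from this setup:

```python
# Illustrative arithmetic: time to read the model artifacts from remote
# object storage versus a node-local memory cache. Both bandwidth figures
# below are assumptions, not measurements.

ENGINE_SIZE_GB = 3.0  # rough FP16 engine size for a 1.5B-parameter model

def load_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time to read `size_gb` at the given sustained bandwidth."""
    return size_gb / bandwidth_gb_per_s

remote = load_seconds(ENGINE_SIZE_GB, 0.1)   # ~100 MB/s remote object storage
cached = load_seconds(ENGINE_SIZE_GB, 10.0)  # ~10 GB/s node-local memory cache
print(f"remote OSS: {remote:.0f} s, memory cache: {cached:.1f} s")
```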
- Create OSS access credentials
Create a Kubernetes Secret to store your OSS authentication information.

kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: fluid-oss-secret
stringData:
  fs.oss.accessKeyId: <YourAccessKey ID>
  fs.oss.accessKeySecret: <YourAccessKey Secret>
EOF

Security reminder: Replace <YourAccessKey ID> and <YourAccessKey Secret> with your actual Alibaba Cloud AccessKey pair. To learn how to obtain your AccessKey pair, see the AccessKey management documentation.

On success, you see this output:

secret/fluid-oss-secret created
- Configure Dataset and JindoRuntime
Create a file named dataset.yaml to define the Dataset and JindoRuntime resources. The Dataset describes your data source. The JindoRuntime provides distributed caching.
For full configuration details, see the Fluid configuration documentation.

# Create the Dataset resource and configure the OSS data source.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: qwen2-oss
spec:
  mounts:
  - mountPoint: oss://<oss_bucket>/qwen2-1.5b  # Replace with your actual OSS bucket name.
    name: qwen2
    path: /
    options:
      fs.oss.endpoint: <oss_endpoint>  # Replace with your OSS endpoint.
    encryptOptions:
      - name: fs.oss.accessKeyId
        valueFrom:
          secretKeyRef:
            name: fluid-oss-secret
            key: fs.oss.accessKeyId
      - name: fs.oss.accessKeySecret
        valueFrom:
          secretKeyRef:
            name: fluid-oss-secret
            key: fs.oss.accessKeySecret
  accessModes:
    - ReadWriteMany
---
# Create the JindoRuntime resource and configure the cache policy.
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: qwen2-oss
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM
        volumeType: emptyDir
        path: /dev/shm
        quota: 20Gi
        high: "0.95"
        low: "0.7"
  fuse:
    properties:
      fs.oss.read.buffer.size: "8388608"                # 8 MB read buffer.
      fs.oss.download.thread.concurrency: "200"         # Number of concurrent download threads.
      fs.oss.read.readahead.max.buffer.count: "200"     # Number of read-ahead buffers.
      fs.oss.read.sequence.ambiguity.range: "2147483647" # Sequence read range.
    args:
      - -oauto_cache
      - -oattr_timeout=1
      - -oentry_timeout=1
      - -onegative_timeout=1

Configuration notes:
- mountPoint: Points to the path in OSS where your model files are stored.
- quota: 20Gi: Allocates 20 GB of memory for caching.
- replicas: 2: Deploys two cache instances to improve availability.
- Apply the resource configuration
Apply the configuration file to create the Dataset and JindoRuntime resources:

kubectl apply -f dataset.yaml

On success, you see:

dataset.data.fluid.io/qwen2-oss created
jindoruntime.data.fluid.io/qwen2-oss created

This confirms that the Dataset and JindoRuntime resources were created.
- Verify the deployment status
Check the Dataset deployment status and cache status:

kubectl get dataset qwen2-oss

Expected output:

NAME        UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
qwen2-oss   0.00B            0.00B    20.00GiB         0.0%                Bound   57s

Status notes:
- PHASE: Bound means the Dataset is successfully bound.
- CACHE CAPACITY: 20.00GiB shows the allocated 20 GB memory cache space.
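If you want to check readiness from a script rather than by eye, you can extract the PHASE column from the same command output. A minimal sketch that assumes the column layout shown above (alternatively, `kubectl get dataset qwen2-oss -o jsonpath='{.status.phase}'` should return the phase directly):

```python
# Minimal sketch: extract the PHASE column from `kubectl get dataset`
# output so a script can wait for "Bound". The column positions are
# assumed to match the sample output in this step.

SAMPLE = """\
NAME        UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
qwen2-oss   0.00B            0.00B    20.00GiB         0.0%                Bound   57s
"""

def dataset_phase(output: str, name: str) -> str:
    """Return the PHASE value for the named Dataset row."""
    for line in output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields and fields[0] == name:
            return fields[-2]  # PHASE is the second-to-last column
    raise KeyError(name)

print(dataset_phase(SAMPLE, "qwen2-oss"))  # Bound
```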
Step 2: Build the model inference environment
In this step, use Fluid Dataflow to automate key parts of model deployment: download the Qwen2 model from ModelScope, convert it to TensorRT-LLM format, build the inference engine, and preload cached data. This declarative approach ensures consistent and repeatable deployments.
Dataflow packages complex multi-step operations into automated workflows. This reduces manual work and improves deployment efficiency.
- Create the Dataflow configuration file
Create a file named dataflow.yaml to define a three-step automated workflow:
- Download the Qwen2-1.5B-Instruct base model from ModelScope
- Use the TensorRT-LLM toolchain to convert the model and build the inference engine
- Preload the optimized model data into cache using DataLoad
# Step 1: Download the Qwen2 model from ModelScope.
apiVersion: data.fluid.io/v1alpha1
kind: DataProcess
metadata:
  name: step1-download-model
spec:
  dataset:
    name: qwen2-oss
    namespace: default
    mountPath: /mnt/models/
  processor:
    script:
      image: ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/base:ubuntu22.04
      imagePullPolicy: IfNotPresent
      restartPolicy: OnFailure
      command:
      - bash
      source: |
        #!/bin/bash
        echo "Start downloading the model..."
        if [ -d "${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct" ]; then
          echo "Model directory exists. Skip download."
        else
          echo "Install Git LFS and download the model..."
          apt update && apt install -y git git-lfs
          git clone https://www.modelscope.cn/qwen/Qwen2-1.5B-Instruct.git Qwen2-1.5B-Instruct
          mv Qwen2-1.5B-Instruct ${MODEL_MOUNT_PATH}
          echo "Model download complete."
        fi
      env:
      - name: MODEL_MOUNT_PATH
        value: "/mnt/models"
---
# Step 2: Convert the model and build the TensorRT-LLM engine.
apiVersion: data.fluid.io/v1alpha1
kind: DataProcess
metadata:
  name: step2-trtllm-convert
spec:
  runAfter:
    kind: DataProcess
    name: step1-download-model
    namespace: default
  dataset:
    name: qwen2-oss
    namespace: default
    mountPath: /mnt/models/
  processor:
    script:
      image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tritonserver-build:24.07-trtllm-python-py3
      imagePullPolicy: IfNotPresent
      restartPolicy: OnFailure
      command:
      - bash
      source: |
        #!/bin/bash
        set -ex
        echo "Start model conversion..."
        cd /tensorrtllm_backend/tensorrt_llm/examples/qwen
        # Convert the checkpoint.
        if [ -d "${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-ckpt" ]; then
          echo "Checkpoint exists. Skip conversion."
        else
          echo "Convert the model checkpoint..."
          python3 convert_checkpoint.py \
            --model_dir ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct \
            --output_dir /root/Qwen2-1.5B-Instruct-ckpt \
            --dtype float16
          mv /root/Qwen2-1.5B-Instruct-ckpt ${MODEL_MOUNT_PATH}
          echo "Checkpoint conversion complete."
        fi
        sleep 2
        # Build the TensorRT engine.
        if [ -d "${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-engine" ]; then
          echo "Engine exists. Skip build."
        else
          echo "Build the TensorRT-LLM engine..."
          trtllm-build \
            --checkpoint_dir ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-ckpt \
            --gemm_plugin float16 \
            --paged_kv_cache enable \
            --output_dir /root/Qwen2-1.5B-Instruct-engine
          mv /root/Qwen2-1.5B-Instruct-engine ${MODEL_MOUNT_PATH}
          echo "Engine build complete."
        fi
        # Configure the Triton model.
        if [ -d "${MODEL_MOUNT_PATH}/tensorrtllm_backend" ]; then
          echo "Configuration exists. Skip configuration."
        else
          echo "Configure the Triton model..."
          cd /tensorrtllm_backend
          cp -r all_models/inflight_batcher_llm/ qwen2_ifb
          export QWEN2_MODEL=${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct
          export ENGINE_PATH=${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-engine
          # Generate configuration files for each component.
          python3 tools/fill_template.py -i qwen2_ifb/preprocessing/config.pbtxt \
            tokenizer_dir:${QWEN2_MODEL},triton_max_batch_size:8,preprocessing_instance_count:1
          python3 tools/fill_template.py -i qwen2_ifb/postprocessing/config.pbtxt \
            tokenizer_dir:${QWEN2_MODEL},triton_max_batch_size:8,postprocessing_instance_count:1
          python3 tools/fill_template.py -i qwen2_ifb/tensorrt_llm_bls/config.pbtxt \
            triton_max_batch_size:8,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
          python3 tools/fill_template.py -i qwen2_ifb/ensemble/config.pbtxt \
            triton_max_batch_size:8
          python3 tools/fill_template.py -i qwen2_ifb/tensorrt_llm/config.pbtxt \
            triton_backend:tensorrtllm,triton_max_batch_size:8,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:1280,max_attention_window_size:1280,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
          mkdir -p ${MODEL_MOUNT_PATH}/tensorrtllm_backend
          mv /tensorrtllm_backend/qwen2_ifb ${MODEL_MOUNT_PATH}/tensorrtllm_backend
          echo "Triton configuration complete."
        fi
      env:
      - name: MODEL_MOUNT_PATH
        value: "/mnt/models"
      resources:
        requests:
          cpu: 2
          memory: 10Gi
          nvidia.com/gpu: 1
        limits:
          cpu: 12
          memory: 30Gi
          nvidia.com/gpu: 1
---
# Step 3: Preload cached data.
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: step3-warmup-cache
spec:
  runAfter:
    kind: DataProcess
    name: step2-trtllm-convert
    namespace: default
  dataset:
    name: qwen2-oss
    namespace: default
  loadMetadata: true
  target:
    - path: /Qwen2-1.5B-Instruct-engine
    - path: /tensorrtllm_backend

This Dataflow configuration automates the end-to-end model deployment process, from raw model acquisition to production-ready inference service configuration.
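The `max_tokens_in_paged_kv_cache:1280` value in the tensorrt_llm configuration caps how many tokens the paged KV cache can hold. A rough sizing sketch follows; the layer and head counts in the example call are illustrative assumptions, not values read from the Qwen2 checkpoint:

```python
# Rough KV-cache sizing sketch. Per token, the cache stores one K and one
# V vector per layer: 2 x layers x kv_heads x head_dim x bytes_per_elem.
# The model-shape numbers in the example call are illustrative assumptions.

def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold `tokens` worth of FP16 K/V vectors."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem

# Hypothetical small-model shape, with the 1280-token budget from the config.
size = kv_cache_bytes(tokens=1280, layers=28, kv_heads=2, head_dim=128)
print(f"{size / 1024**2:.0f} MiB")  # 35 MiB
```

The `kv_cache_free_gpu_mem_fraction:0.5` setting then tells TensorRT-LLM how much of the remaining free GPU memory the cache may actually claim.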
- Deploy the Dataflow workflow
Apply the Dataflow configuration file to create the automated workflow:

kubectl create -f dataflow.yaml

On success, you see:

dataprocess.data.fluid.io/step1-download-model created
dataprocess.data.fluid.io/step2-trtllm-convert created
dataload.data.fluid.io/step3-warmup-cache created

This confirms that the custom resources for all three steps are created.
- Monitor execution progress
Track the Dataflow execution status until all steps finish:

kubectl get dataprocess

During execution, statuses change like this:

NAME                   DATASET     PHASE     AGE   DURATION
step1-download-model   qwen2-oss   Running   2m    -
step2-trtllm-convert   qwen2-oss   Pending   0s    -

When complete, you see:

NAME                   DATASET     PHASE      AGE   DURATION
step1-download-model   qwen2-oss   Complete   23m   3m2s
step2-trtllm-convert   qwen2-oss   Complete   20m   19m58s

Status notes:
- Running means the step is executing.
- Complete means the step succeeded.
- Pending means the step is waiting for its predecessor to finish.
The full model preparation process usually takes 20–30 minutes. Actual time depends on network conditions and GPU performance.
Step 3: Deploy the Triton inference service
Use Arena to deploy the Qwen2 inference service optimized with TensorRT-LLM. Triton Server exposes RESTful and gRPC interfaces.
- Deploy the service with Arena
Run this command to deploy a custom inference service:

arena serve custom \
  --name=qwen2-chat \
  --version=v1 \
  --gpus=1 \
  --replicas=1 \
  --restful-port=8000 \
  --readiness-probe-action="tcpSocket" \
  --readiness-probe-action-option="port: 8000" \
  --readiness-probe-option="initialDelaySeconds: 30" \
  --readiness-probe-option="periodSeconds: 30" \
  --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tritonserver:24.07-trtllm-python-py3 \
  --data=qwen2-oss:/mnt/models \
  "tritonserver \
  --model-repository=/mnt/models/tensorrtllm_backend/qwen2_ifb \
  --http-port=8000 \
  --grpc-port=8001 \
  --metrics-port=8002 \
  --disable-auto-complete-config \
  --backend-config=python,shm-region-prefix-name=prefix0_"

Key configuration notes:
- Service name: qwen2-chat. Version: v1
- Resource allocation: 1 GPU, 1 replica
- Port configuration: HTTP port 8000, gRPC port 8001, metrics port 8002
- Data mount: Use the --data flag to mount the Fluid PVC to /mnt/models

On successful deployment, you see:

service/qwen2-chat-v1 created
deployment.apps/qwen2-chat-v1-custom-serving created
INFO[0003] The Job qwen2-chat has been submitted successfully
INFO[0003] You can run `arena serve get qwen2-chat --type custom-serving -n default` to check the job status

This confirms that the inference service is submitted to the Kubernetes cluster.
- Verify the service status
Check the inference service details and runtime status:

arena serve get qwen2-chat

When the service runs normally, you see:

Name:       qwen2-chat
Namespace:  default
Type:       Custom
Version:    v1
Desired:    1
Available:  1
Age:        2m
Address:    192.168.10.15
Port:       RESTFUL:8000
GPU:        1

Instances:
  NAME                                           STATUS   AGE  READY  RESTARTS  GPU  NODE
  ----                                           ------   ---  -----  --------  ---  ----
  qwen2-chat-v1-custom-serving-657869c698-hl665  Running  2m   1/1    0         1    cn-hangzhou.192.168.10.15

Readiness indicators: Available: 1 and READY: 1/1 mean the service is fully ready to accept inference requests.

Service startup usually takes 1-2 minutes. During initialization, the Available field changes from 0 to 1.
Step 4: Test the inference service
Test the inference service functionality and performance using local port forwarding and API calls.
- Set up port forwarding
Create a local port-forwarding channel to test the service:

kubectl port-forward svc/qwen2-chat-v1 8000:8000

The port-forwarding command runs in your current terminal session. Press Ctrl+C to stop it.

On success, you see:

Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000

You can now access the inference service at localhost:8000.
- Send an inference request
Use curl to send an inference request to the model:

curl -X POST localhost:8000/v2/models/ensemble/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text_input": "What is machine learning?",
    "max_tokens": 50,
    "bad_words": "",
    "stop_words": "",
    "pad_id": 2,
    "end_id": 2
  }'

You can expect a response like this:

{
  "context_logits": 0.0,
  "cum_log_probs": 0.0,
  "generation_logits": 0.0,
  "model_name": "ensemble",
  "model_version": "1",
  "output_log_probs": [0.0, 0.0, 0.0, 0.0],
  "sequence_end": false,
  "sequence_id": 0,
  "sequence_start": false,
  "text_output": " Machine learning is an artificial intelligence technique that enables computer systems to learn patterns and rules from data without being explicitly programmed. By analyzing large amounts of data with algorithms, machine learning models can identify complex relationships and make predictions or decisions."
}

Success verification: The text_output field contains a relevant answer generated by the model. This confirms the inference service works correctly.
Testing tips:
- Try different questions to test the model's generalization ability
- Adjust the max_tokens parameter to control output length
- Observe response time and output quality
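If you prefer to script the test instead of using curl, a minimal Python client sketch follows. It builds the same request body as the curl example and assumes the port-forward from the previous step is active on localhost:8000:

```python
# Minimal client sketch for the Triton generate endpoint, mirroring the
# curl example. Assumes `kubectl port-forward` is active on localhost:8000.

import json
import urllib.request

def build_generate_payload(text: str, max_tokens: int = 50) -> dict:
    """Request body with the same shape as the curl example."""
    return {
        "text_input": text,
        "max_tokens": max_tokens,
        "bad_words": "",
        "stop_words": "",
        "pad_id": 2,
        "end_id": 2,
    }

def generate(text: str, max_tokens: int = 50,
             url: str = "http://localhost:8000/v2/models/ensemble/generate") -> str:
    """POST to the ensemble model and return its text_output field."""
    body = json.dumps(build_generate_payload(text, max_tokens)).encode()
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)["text_output"]

# Example (requires the running, port-forwarded service):
#   print(generate("What is machine learning?", max_tokens=50))
```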
(Optional) Step 5: Clean up the environment
When you no longer need the inference service, clean up related resources using these steps:
Important reminder: Cleanup deletes all related resources—including model data and service configurations. Back up important data before you proceed.
- Delete the inference service
Use Arena to delete the deployed service:

arena serve delete qwen2-chat

Confirm successful deletion:

INFO[0001] Deleting service: qwen2-chat
INFO[0002] Service qwen2-chat deleted successfully

- Clean up Fluid resources
Delete the Dataset and JindoRuntime resources:

kubectl delete dataset qwen2-oss
kubectl delete jindoruntime qwen2-oss

- Delete access credentials
Delete the Secret that stores your OSS AccessKey pair:

kubectl delete secret fluid-oss-secret
After cleanup, verify that all resources are removed:

kubectl get all -l app=qwen2-chat