
Container Service for Kubernetes: Deploy Qwen2 Model Inference Services Using TensorRT-LLM

Last Updated: Feb 28, 2026

This topic explains how to deploy Qwen2 large language model (LLM) inference services on Alibaba Cloud Container Service for Kubernetes (ACK) using TensorRT-LLM and Triton Inference Server. The deployment uses Fluid data orchestration and caching to accelerate model loading and serve efficient AI inference.

Background information

Qwen2 large language model

Qwen2-1.5B-Instruct is a large language model developed by Tongyi Lab at Alibaba Group. It uses the Transformer architecture and has 1.5 billion parameters. The model is trained on massive, diverse pre-training data, including web text, professional books, and code, and delivers strong natural language understanding and generation capabilities.

The Qwen2 series has these features:

  • Supports multiple inference tasks such as question answering, text generation, and code understanding

  • Undergoes instruction tuning to better fit real-world use cases

  • Has a moderate model size, making it suitable for GPU-based inference deployment

For more model details and technical specifications, see the Qwen2 official GitHub repository.

Triton Inference Server

Triton Inference Server is an open-source inference service framework developed by NVIDIA. It is designed for production environments. Triton supports multiple machine learning frameworks as backends—including TensorRT, TensorFlow, PyTorch, and ONNX Runtime—and unifies model deployment across frameworks.

Triton’s key advantages:

  • A unified inference interface that simplifies model deployment and management

  • Dynamic batching support to improve GPU utilization

  • Optimized concurrent processing for high-throughput inference

  • Comprehensive monitoring and metrics collection

To learn more about Triton Inference Server, visit the Triton Inference Server official GitHub repository.

TensorRT-LLM optimization engine

TensorRT-LLM is an inference engine optimized by NVIDIA for large language models. It compiles LLMs into highly optimized TensorRT execution engines to deliver exceptional inference performance on NVIDIA GPUs.

Key features include the following:

  • Model quantization and optimization to significantly boost inference speed

  • Support for tensor parallelism and pipeline parallelism

  • Deep integration with Triton through the TensorRT-LLM Backend

  • Support for multiple precision modes—such as FP16 and INT8—to balance performance and accuracy

For more technical details, see the TensorRT-LLM official GitHub repository.

Prerequisites

Before you begin deployment, ensure your environment meets these requirements:

  • GPU environment preparation: You must have created an ACK Pro cluster that includes NVIDIA A10 GPUs. The cluster's Kubernetes version must be 1.22 or later.

    Recommended driver version: Use NVIDIA driver version 525. Specify this version by adding the label ack.aliyun.com/nvidia-driver-version:525.105.17 to your GPU node pool. For detailed steps, see the GPU driver upgrade documentation.

  • Fluid component installation: You have installed the Fluid data orchestration and caching system in your cluster. If not, follow the Fluid installation guide.

  • Arena tool configuration: You have installed and configured the Arena command-line interface (CLI) tool to deploy and manage model services. For installation steps, see the Arena installation documentation.

  • OSS storage setup: You have activated Alibaba Cloud Object Storage Service (OSS) and created a bucket to store model files. For instructions, see the OSS Quick Start and Create a bucket.

  • Permission configuration: Your Alibaba Cloud account has read and write permissions for OSS and management permissions for your ACK cluster.
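Before starting, it can save time to confirm the required CLI tools are actually on your PATH. The snippet below is a minimal preflight sketch, not part of the official setup; it only checks tool availability and does not validate cluster credentials or OSS permissions.

```shell
# Sketch: quick preflight check that the required CLI tools are installed.
check_tool() {
  command -v "$1" >/dev/null 2>&1 && echo "$1: found" || echo "$1: missing"
}

for tool in kubectl arena; do
  check_tool "$tool"
done
```

If either tool reports missing, revisit the corresponding prerequisite above before continuing.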

Step 1: Configure Fluid dataset and cache

In this step, create a Fluid Dataset and JindoRuntime to manage model data and provide high-performance caching. The Dataset declares the data source; the JindoRuntime provides the distributed cache that significantly improves model loading and inference performance.

Performance tip: With memory caching, model loading time can drop from minutes to seconds, depending on model size and cache hit rate.

  1. Create OSS access credentials

    Create a Kubernetes Secret to store your OSS authentication information.

    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: Secret
    metadata:
      name: fluid-oss-secret
    stringData:
      fs.oss.accessKeyId: <YourAccessKey ID>
      fs.oss.accessKeySecret: <YourAccessKey Secret>
    EOF

    Security reminder: Replace <YourAccessKey ID> and <YourAccessKey Secret> with your actual Alibaba Cloud AccessKey pair. To learn how to obtain your AccessKey pair, see the AccessKey management documentation.

    On success, you see this output:

    secret/fluid-oss-secret created
  2. Configure Dataset and JindoRuntime

    Create a file named dataset.yaml to define the Dataset and JindoRuntime resources. The Dataset describes your data source. The JindoRuntime provides distributed caching.

    For full configuration details, see the Fluid configuration documentation.

    # Create the Dataset resource and configure the OSS data source.
    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: qwen2-oss
    spec:
      mounts:
      - mountPoint: oss://<oss_bucket>/qwen2-1.5b  # Replace with your actual OSS bucket name.
        name: qwen2
        path: /
        options:
          fs.oss.endpoint: <oss_endpoint>  # Replace with your OSS endpoint.
        encryptOptions:
          - name: fs.oss.accessKeyId
            valueFrom:
              secretKeyRef:
                name: fluid-oss-secret
                key: fs.oss.accessKeyId
          - name: fs.oss.accessKeySecret
            valueFrom:
              secretKeyRef:
                name: fluid-oss-secret
                key: fs.oss.accessKeySecret
      accessModes:
        - ReadWriteMany
    ---
    # Create the JindoRuntime resource and configure the cache policy.
    apiVersion: data.fluid.io/v1alpha1
    kind: JindoRuntime
    metadata:
      name: qwen2-oss
    spec:
      replicas: 2
      tieredstore:
        levels:
          - mediumtype: MEM
            volumeType: emptyDir
            path: /dev/shm
            quota: 20Gi
            high: "0.95"
            low: "0.7"
      fuse:
        properties:
          fs.oss.read.buffer.size: "8388608"  # 8 MB read buffer.
          fs.oss.download.thread.concurrency: "200"  # Number of concurrent download threads.
          fs.oss.read.readahead.max.buffer.count: "200"  # Number of read-ahead buffers.
          fs.oss.read.sequence.ambiguity.range: "2147483647"  # Sequence read range.
        args:
          - -oauto_cache
          - -oattr_timeout=1
          - -oentry_timeout=1
          - -onegative_timeout=1

    Configuration notes:

    • mountPoint: Points to the path in OSS where your model files are stored.

    • quota: 20Gi: Allocates 20 GB of memory for caching.

    • replicas: 2: Deploys two cache instances to improve availability.

  3. Apply the resource configuration

    Apply the configuration file to create the Dataset and JindoRuntime resources:

    kubectl apply -f dataset.yaml

    On success, you see:

    dataset.data.fluid.io/qwen2-oss created
    jindoruntime.data.fluid.io/qwen2-oss created

    This confirms that the Dataset and JindoRuntime resources are created and running.

  4. Verify the deployment status

    Check the Dataset deployment status and cache status:

    kubectl get dataset qwen2-oss

    Expected output:

    NAME        UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    qwen2-oss   0.00B            0.00B    20.00GiB         0.0%                Bound   57s

    Status notes: PHASE: Bound means the Dataset is successfully bound. CACHE CAPACITY: 20.00GiB shows the allocated 20 GB memory cache space.
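When scripting the setup, it helps to block until the Dataset is bound before starting the next step. The helper below is a sketch; it assumes kubectl targets your cluster and that the `.status.phase` field backs the PHASE column shown above.

```shell
# Sketch: poll the Fluid Dataset until its phase reads Bound.
# Assumes kubectl is configured for the cluster that hosts the Dataset.
wait_for_dataset() {
  local name="${1:?dataset name required}" phase=""
  while [ "$phase" != "Bound" ]; do
    phase=$(kubectl get dataset "$name" -o jsonpath='{.status.phase}' 2>/dev/null)
    [ "$phase" = "Bound" ] || sleep 5
  done
  echo "Dataset $name is Bound"
}

# Usage (on a live cluster):
#   wait_for_dataset qwen2-oss
```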

Step 2: Build the model inference environment

In this step, use Fluid Dataflow to automate key parts of model deployment: download the Qwen2 model from ModelScope, convert it to TensorRT-LLM format, build the inference engine, and preload cached data. This declarative approach ensures consistent and repeatable deployments.

Dataflow packages complex multi-step operations into automated workflows. This reduces manual work and improves deployment efficiency.

  1. Create the Dataflow configuration file

    Create a file named dataflow.yaml to define a three-step automated workflow:

    1. Download the Qwen2-1.5B-Instruct base model from ModelScope

    2. Use the TensorRT-LLM toolchain to convert the model and build the inference engine

    3. Preload the optimized model data into cache using Dataload

    # Step 1: Download the Qwen2 model from ModelScope.
    apiVersion: data.fluid.io/v1alpha1
    kind: DataProcess
    metadata:
      name: step1-download-model
    spec:
      dataset:
        name: qwen2-oss
        namespace: default
        mountPath: /mnt/models/
      processor:
        script:
          image: ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/base:ubuntu22.04
          imagePullPolicy: IfNotPresent
          restartPolicy: OnFailure
          command:
          - bash
          source: |
            #!/bin/bash
            echo "Start downloading the model..."
            if [ -d "${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct" ]; then
                echo "Model directory exists. Skip download."
            else
                echo "Install Git LFS and download the model..."
                apt update && apt install -y git git-lfs
                # Register the Git LFS filters so the large weight files are
                # downloaded in full, not left as LFS pointer files.
                git lfs install
                git clone https://www.modelscope.cn/qwen/Qwen2-1.5B-Instruct.git Qwen2-1.5B-Instruct
                mv Qwen2-1.5B-Instruct ${MODEL_MOUNT_PATH}
                echo "Model download complete."
            fi
          env:
          - name: MODEL_MOUNT_PATH
            value: "/mnt/models"
    ---
    # Step 2: Convert the model and build the TensorRT-LLM engine.
    apiVersion: data.fluid.io/v1alpha1
    kind: DataProcess
    metadata:
      name: step2-trtllm-convert
    spec:
      runAfter:
        kind: DataProcess
        name: step1-download-model
        namespace: default
      dataset:
        name: qwen2-oss
        namespace: default
        mountPath: /mnt/models/
      processor:
        script:
          image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tritonserver-build:24.07-trtllm-python-py3
          imagePullPolicy: IfNotPresent
          restartPolicy: OnFailure
          command:
          - bash
          source: |
            #!/bin/bash
            set -ex
            
            echo "Start model conversion..."
            cd /tensorrtllm_backend/tensorrt_llm/examples/qwen
            
            # Convert the checkpoint.
            if [ -d "${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-ckpt" ]; then
                echo "Checkpoint exists. Skip conversion."
            else
                echo "Convert the model checkpoint..."
                python3 convert_checkpoint.py \
                  --model_dir ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct \
                  --output_dir /root/Qwen2-1.5B-Instruct-ckpt \
                  --dtype float16
                mv /root/Qwen2-1.5B-Instruct-ckpt ${MODEL_MOUNT_PATH}
                echo "Checkpoint conversion complete."
            fi
            
            sleep 2
            
            # Build the TensorRT engine.
            if [ -d "${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-engine" ]; then
                echo "Engine exists. Skip build."
            else
                echo "Build the TensorRT-LLM engine..."
                trtllm-build \
                  --checkpoint_dir ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-ckpt \
                  --gemm_plugin float16 \
                  --paged_kv_cache enable \
                  --output_dir /root/Qwen2-1.5B-Instruct-engine
                mv /root/Qwen2-1.5B-Instruct-engine ${MODEL_MOUNT_PATH}
                echo "Engine build complete."
            fi
            
            # Configure the Triton model.
            if [ -d "${MODEL_MOUNT_PATH}/tensorrtllm_backend" ]; then
                echo "Configuration exists. Skip configuration."
            else
                echo "Configure the Triton model..."
                cd /tensorrtllm_backend
                cp -r all_models/inflight_batcher_llm/ qwen2_ifb
                
                export QWEN2_MODEL=${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct
                export ENGINE_PATH=${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-engine
                
                # Generate configuration files for each component.
                python3 tools/fill_template.py -i qwen2_ifb/preprocessing/config.pbtxt \
                  tokenizer_dir:${QWEN2_MODEL},triton_max_batch_size:8,preprocessing_instance_count:1
                python3 tools/fill_template.py -i qwen2_ifb/postprocessing/config.pbtxt \
                  tokenizer_dir:${QWEN2_MODEL},triton_max_batch_size:8,postprocessing_instance_count:1
                python3 tools/fill_template.py -i qwen2_ifb/tensorrt_llm_bls/config.pbtxt \
                  triton_max_batch_size:8,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
                python3 tools/fill_template.py -i qwen2_ifb/ensemble/config.pbtxt \
                  triton_max_batch_size:8
                # Quote the parameter list so the shell passes it as a single argument;
                # indented backslash continuations would otherwise split it into several.
                python3 tools/fill_template.py -i qwen2_ifb/tensorrt_llm/config.pbtxt \
                  "triton_backend:tensorrtllm,triton_max_batch_size:8,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:1280,max_attention_window_size:1280,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0"
                
                mkdir -p ${MODEL_MOUNT_PATH}/tensorrtllm_backend
                mv /tensorrtllm_backend/qwen2_ifb ${MODEL_MOUNT_PATH}/tensorrtllm_backend
                echo "Triton configuration complete."
            fi
          env:
          - name: MODEL_MOUNT_PATH
            value: "/mnt/models"
          resources:
            requests:
              cpu: 2
              memory: 10Gi
              nvidia.com/gpu: 1
            limits:
              cpu: 12
              memory: 30Gi
              nvidia.com/gpu: 1
    ---
    # Step 3: Preload cached data.
    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
      name: step3-warmup-cache
    spec:
      runAfter:
        kind: DataProcess
        name: step2-trtllm-convert
        namespace: default
      dataset:
        name: qwen2-oss
        namespace: default
      loadMetadata: true
      target:
      - path: /Qwen2-1.5B-Instruct-engine
      - path: /tensorrtllm_backend

    This Dataflow configuration automates the end-to-end model deployment process—from raw model acquisition to production-ready inference service configuration.

  2. Deploy the Dataflow workflow

    Apply the Dataflow configuration file to create the automated workflow:

    kubectl create -f dataflow.yaml

    On success, you see:

    dataprocess.data.fluid.io/step1-download-model created
    dataprocess.data.fluid.io/step2-trtllm-convert created
    dataload.data.fluid.io/step3-warmup-cache created

    This confirms that the custom resources for all three steps are created.

  3. Monitor execution progress

    Track the Dataflow execution status until all steps finish:

    kubectl get dataprocess

    During execution, statuses change like this:

    NAME                   DATASET     PHASE      AGE   DURATION
    step1-download-model   qwen2-oss   Running    2m    -
    step2-trtllm-convert   qwen2-oss   Pending    0s    -

    When complete, you see:

    NAME                   DATASET     PHASE      AGE   DURATION
    step1-download-model   qwen2-oss   Complete   23m   3m2s
    step2-trtllm-convert   qwen2-oss   Complete   20m   19m58s

    Status notes: Running means the step is executing. Complete means the step succeeded. Pending means the step is waiting for its predecessor to finish.

The full model preparation process usually takes 20–30 minutes. Actual time depends on network conditions and GPU performance.
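If you drive the workflow from a script, you can poll until every step reports Complete instead of re-running kubectl get by hand. The sketch below assumes the `.status.phase` field backs the PHASE column shown above; the DataLoad warm-up step can be checked the same way with `kubectl get dataload`.

```shell
# Sketch: succeed only when every DataProcess step in the namespace is Complete.
dataflow_complete() {
  local phases
  phases=$(kubectl get dataprocess \
    -o jsonpath='{range .items[*]}{.status.phase}{"\n"}{end}' 2>/dev/null)
  # At least one step must exist, and none may be in a non-Complete phase.
  [ -n "$phases" ] && ! printf '%s\n' "$phases" | grep -qv '^Complete$'
}

# Usage (on a live cluster):
#   until dataflow_complete; do sleep 30; done
```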

Step 3: Deploy the Triton inference service

Use Arena to deploy the Qwen2 inference service optimized with TensorRT-LLM. Triton Server exposes RESTful and gRPC interfaces.

  1. Deploy the service with Arena

    Run this command to deploy a custom inference service:

    Key configuration notes:

    • Service name: qwen2-chat. Version: v1

    • Resource allocation: 1 GPU, 1 replica

    • Port configuration: HTTP port 8000, gRPC port 8001, metrics port 8002

    • Data mount: Use the --data flag to mount the Fluid PVC to /mnt/models

    arena serve custom \
      --name=qwen2-chat \
      --version=v1 \
      --gpus=1 \
      --replicas=1 \
      --restful-port=8000 \
      --readiness-probe-action="tcpSocket" \
      --readiness-probe-action-option="port: 8000" \
      --readiness-probe-option="initialDelaySeconds: 30" \
      --readiness-probe-option="periodSeconds: 30" \
      --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tritonserver:24.07-trtllm-python-py3 \
      --data=qwen2-oss:/mnt/models \
      "tritonserver \
        --model-repository=/mnt/models/tensorrtllm_backend/qwen2_ifb \
        --http-port=8000 \
        --grpc-port=8001 \
        --metrics-port=8002 \
        --disable-auto-complete-config \
        --backend-config=python,shm-region-prefix-name=prefix0_"

    On successful deployment, you see:

    service/qwen2-chat-v1 created
    deployment.apps/qwen2-chat-v1-custom-serving created
    INFO[0003] The Job qwen2-chat has been submitted successfully
    INFO[0003] You can run `arena serve get qwen2-chat --type custom-serving -n default` to check the job status

    This confirms that the inference service is submitted to the Kubernetes cluster.

  2. Verify the service status

    Check the inference service details and runtime status:

    arena serve get qwen2-chat

    When the service runs normally, you see:

    Name:       qwen2-chat
    Namespace:  default
    Type:       Custom
    Version:    v1
    Desired:    1
    Available:  1
    Age:        2m
    Address:    192.168.10.15
    Port:       RESTFUL:8000
    GPU:        1
    
    Instances:
      NAME                                           STATUS   AGE  READY  RESTARTS  GPU  NODE
      ----                                           ------   ---  -----  --------  ---  ----
      qwen2-chat-v1-custom-serving-657869c698-hl665  Running  2m   1/1    0         1    cn-hangzhou.192.168.10.15

    Readiness indicators: Available: 1 and READY: 1/1 mean the service is fully ready to accept inference requests.

Service startup usually takes 1–2 minutes. During initialization, the Available field changes from 0 to 1.
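If Available stays at 0 or READY stays at 0/1 for longer than a few minutes, the Triton startup logs usually explain why. The helper below is a minimal sketch; the deployment name follows the `<name>-<version>-custom-serving` pattern visible in the arena output above, and kubectl is assumed to point at the cluster.

```shell
# Sketch: tail the Triton container logs when the pod stays unready.
triton_logs() {
  kubectl logs "deploy/qwen2-chat-v1-custom-serving" --tail="${1:-50}"
}

# Usage (on a live cluster): triton_logs 100
```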

Step 4: Test the inference service

Test the inference service functionality and performance using local port forwarding and API calls.

  1. Set up port forwarding

    Create a local port-forwarding channel to test the service:

    The port-forwarding command runs in your current terminal session. Press Ctrl+C to stop it.

    kubectl port-forward svc/qwen2-chat-v1 8000:8000

    On success, you see:

    Forwarding from 127.0.0.1:8000 -> 8000
    Forwarding from [::1]:8000 -> 8000

    You can now access the inference service at localhost:8000.

  2. Send an inference request

    Use curl to send an inference request to the model:

    curl -X POST localhost:8000/v2/models/ensemble/generate \
      -H "Content-Type: application/json" \
      -d '{
        "text_input": "What is machine learning?",
        "max_tokens": 50,
        "bad_words": "",
        "stop_words": "",
        "pad_id": 2,
        "end_id": 2
      }'

    A successful response looks similar to this:

    {
      "context_logits": 0.0,
      "cum_log_probs": 0.0,
      "generation_logits": 0.0,
      "model_name": "ensemble",
      "model_version": "1",
      "output_log_probs": [0.0, 0.0, 0.0, 0.0],
      "sequence_end": false,
      "sequence_id": 0,
      "sequence_start": false,
      "text_output": " Machine learning is an artificial intelligence technique that enables computer systems to learn patterns and rules from data without being explicitly programmed. By analyzing large amounts of data with algorithms, machine learning models can identify complex relationships and make predictions or decisions."
    }

    Success verification: The text_output field contains a relevant answer generated by the model. This confirms the inference service works correctly.

Testing tips:

  • Try different questions to test the model’s generalization ability

  • Adjust the max_tokens parameter to control output length

  • Observe response time and output quality
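Beyond one-off curl calls, these checks can be scripted. The sketch below assumes the port-forward from step 1 is still active; `/v2/health/ready` comes from the KServe v2 protocol that Triton implements, and a stub response stands in for real curl output when demonstrating the `text_output` extraction.

```shell
# Sketch: confirm Triton reports ready before sending requests.
# Assumes the port-forward to localhost:8000 is still running.
check_ready() {
  local code
  code=$(curl -s -o /dev/null -w '%{http_code}' localhost:8000/v2/health/ready)
  [ "$code" = "200" ] && echo "server ready" || echo "not ready (HTTP $code)"
}

# Sketch: pull just the text_output field out of a generate response.
extract_answer() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["text_output"])'
}

# A stub response stands in for real curl output here.
response='{"model_name":"ensemble","text_output":"Machine learning is a branch of AI."}'
printf '%s' "$response" | extract_answer
```

On a live service, run `check_ready` first, then pipe the real curl output through `extract_answer` to read only the generated text.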

(Optional) Step 5: Clean up the environment

When you no longer need the inference service, clean up related resources using these steps:

Important reminder: Cleanup deletes all related resources—including model data and service configurations. Back up important data before you proceed.

  1. Delete the inference service

    Use Arena to delete the deployed service:

    arena serve delete qwen2-chat

    Confirm successful deletion:

    INFO[0001] Deleting service: qwen2-chat
    INFO[0002] Service qwen2-chat deleted successfully
  2. Clean up Fluid resources

    Delete the Dataset and JindoRuntime resources:

    kubectl delete dataset qwen2-oss
    kubectl delete jindoruntime qwen2-oss
  3. Delete access credentials

    Clean up the OSS access key:

    kubectl delete secret fluid-oss-secret

After cleanup, verify that all resources are removed with this command: kubectl get all -l app=qwen2-chat