
Container Service for Kubernetes:Use Triton to deploy Qwen inference services in ACK

Last Updated: Mar 03, 2026

Large language models (LLMs) require GPU-accelerated infrastructure and an optimized serving stack to handle inference at scale. This topic describes how to deploy the Qwen1.5-4B-Chat model as an inference service on Container Service for Kubernetes (ACK) by using NVIDIA Triton Inference Server with the vLLM backend on T4 or A10 GPUs.

Background

Qwen1.5-4B-Chat

Qwen1.5-4B-Chat is a 4-billion-parameter LLM developed by Alibaba Cloud based on the Transformer architecture. The model is trained on large-scale datasets that cover web text, domain-specific books, and code. For more information, see the Qwen GitHub repository.

Triton Inference Server

Triton Inference Server is an open-source inference serving framework developed by NVIDIA. It supports multiple machine learning framework backends, including TensorRT, TensorFlow, PyTorch, ONNX, and vLLM.

Key features:

  • Multiple ML and deep learning framework support

  • Concurrent model execution

  • Continuous batching

  • Built-in inference metrics: GPU utilization, request latency, and throughput

For more information, see the Triton Inference Server GitHub repository.

vLLM

vLLM is a high-performance LLM inference framework that supports most popular LLMs, including Qwen models. vLLM uses PagedAttention optimization, continuous batching, and model quantization to significantly improve LLM inference throughput. For more information, see the vLLM GitHub repository.

Prerequisites

Before you begin, make sure that you have:

  • An ACK cluster that contains GPU-accelerated nodes (NVIDIA T4 or A10).

  • The Arena client installed.

  • A kubectl client that is configured to connect to the cluster.

Step 1: Prepare model data

Download the Qwen1.5-4B-Chat model, upload it to Object Storage Service (OSS), and create persistent volumes (PVs) and persistent volume claims (PVCs) in the ACK cluster.

To deploy other vLLM-supported models, see Supported models. To use File Storage NAS instead of OSS, see Mount a statically provisioned NAS volume.

Download the model

  1. Install Git:

       # Use yum install git or apt install git
       yum install git
  2. Install the Git Large File Storage (LFS) plugin:

       # Use yum install git-lfs or apt install git-lfs
       yum install git-lfs
  3. Clone the Qwen1.5-4B-Chat repository from ModelScope:

       GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git
  4. Enter the directory and pull the large files:

       cd Qwen1.5-4B-Chat
       git lfs pull

Upload the model to OSS

  1. Log on to the OSS console and note the name of your OSS bucket. To create a bucket, see Create a bucket.

  2. Install and configure ossutil. For more information, see Install ossutil.

  3. Create a directory in OSS and upload the model:

       ossutil mkdir oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
       ossutil cp -r ./Qwen1.5-4B-Chat oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat

Create PVs and PVCs

Create a PV and PVC to mount the OSS model data in the cluster. For more information, see Mount a statically provisioned OSS volume.

PV parameters

  • PV Type: OSS

  • Volume Name: llm-model

  • Access Certificate: The AccessKey ID and AccessKey secret used to access the OSS bucket

  • Bucket ID: The name of your OSS bucket

  • OSS Path: The path of the model in the bucket, such as /Qwen1.5-4B-Chat

PVC parameters

  • PVC Type: OSS

  • Volume Name: llm-model

  • Allocation Mode: Existing Volumes

  • Existing Volumes: Select the PV that you created
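
If you prefer kubectl to the console, the same PV and PVC can be declared as manifests. The following is a minimal sketch based on the parameters above: the bucket name, the region endpoint, and the oss-secret Secret that holds the AccessKey pair are placeholders that you must replace with your own values.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:            # Secret that stores the AccessKey pair
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: "<Your-Bucket-Name>"
      url: "<Your-OSS-Endpoint>"     # for example, an internal endpoint in your region
      path: "/Qwen1.5-4B-Chat"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model    # binds the PVC to the PV above
```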

Step 2: Configure Triton with vLLM

Create two configuration files: config.pbtxt for the Triton backend, and model.json for vLLM engine parameters.

Create the backend configuration

Create a working directory and the config.pbtxt file:

mkdir triton-vllm

cat << EOF > triton-vllm/config.pbtxt
backend: "vllm"

# The usage of device is deferred to the vLLM engine
instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]

version_policy: { all { }}
EOF

Create the model configuration

The model.json file passes parameters to the vLLM engine. Choose the configuration that matches your GPU type.

Important

vLLM allocates GPU memory aggressively on startup. The gpu_memory_utilization parameter controls this behavior. Setting it to 0.95 reserves 95% of GPU memory for the model. If other workloads share the same GPU, lower this value to avoid out-of-memory errors.
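
As a rough back-of-the-envelope check of this trade-off (the helper below is our own sketch; it ignores real overheads such as activations, the CUDA context, and fragmentation): a 4-billion-parameter model in FP16 takes about 8 GB for weights, so on a 16 GB T4 the gpu_memory_utilization value determines how much of the reserved pool is left for the KV cache.

```python
def vllm_memory_budget(gpu_mem_gb: float, utilization: float,
                       n_params_billion: float, bytes_per_param: int = 2) -> dict:
    """Rough estimate of how vLLM's reserved memory pool splits between
    model weights and KV cache. Overheads are ignored, so treat the
    numbers as approximations, not measurements."""
    pool_gb = gpu_mem_gb * utilization            # memory vLLM reserves on startup
    weights_gb = n_params_billion * bytes_per_param  # FP16/BF16: 2 bytes per parameter
    kv_cache_gb = pool_gb - weights_gb            # what remains for the KV cache
    return {"pool_gb": pool_gb, "weights_gb": weights_gb, "kv_cache_gb": kv_cache_gb}

# 16 GB T4 at 95% utilization with a 4B-parameter FP16 model:
budget = vllm_memory_budget(16, 0.95, 4)
# → roughly a 15.2 GB pool, 8 GB of weights, and about 7.2 GB for the KV cache
```

Lowering utilization to 0.8 on the same GPU leaves only about 4.8 GB for the KV cache, which is why a smaller max_model_len is also needed when memory is tight.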

A10 GPU (production)

A10 GPUs deliver higher throughput and support bfloat16 precision. Use A10 for production workloads.

cat << EOF > triton-vllm/model.json
{
    "model":"/model/Qwen1.5-4B-Chat",
    "disable_log_requests": "true",
    "gpu_memory_utilization": 0.95,
    "trust_remote_code": "true",
    "max_model_len": 16384
}
EOF

T4 GPU (testing)

T4 GPUs are widely available and cost-effective, but do not support bfloat16 (bf16) precision. Set dtype to half (FP16) and use a lower max_model_len to fit within the 16 GB memory limit.

cat << EOF > triton-vllm/model.json
{
    "model":"/model/Qwen1.5-4B-Chat",
    "disable_log_requests": "true",
    "gpu_memory_utilization": 0.95,
    "trust_remote_code": "true",
    "dtype": "half",
    "max_model_len": 8192
}
EOF

Key parameters

  • max_model_len: The maximum token sequence length that the model can process. Higher values improve conversation quality but consume more GPU memory.

  • dtype: The floating-point precision for model loading. Set to half (FP16) for GPUs that do not support bf16, such as T4.

  • gpu_memory_utilization: The fraction of GPU memory allocated to the model. Default is 0.9.

For the full list of configurable parameters, see the vLLM Engine Arguments documentation. For more configuration examples, see Deploying a vLLM model in Triton.

Step 3: Deploy the inference service

Use Arena to deploy the Qwen1.5-4B-Chat inference service with Triton and vLLM.

  1. Export the configuration file paths as environment variables:

       export triton_config_file="triton-vllm/config.pbtxt"
       export model_config_file="triton-vllm/model.json"
  2. Deploy the inference service:

    Parameters:

      • --name: The name of the inference service.

      • --version: The version of the inference service.

      • --image: The container image for the Triton server.

      • --gpus: The number of GPUs per replica.

      • --cpu: The number of CPU cores per replica.

      • --memory: The memory allocation per replica.

      • --data: The PVC mount in <pvc-name>:<mount-path> format. Run arena data list to list available PVCs.

      • --config-file: A local file mount in <local-path>:<container-path> format.

      • --model-repository: The Triton model repository directory. Each subdirectory represents a model and must contain its configuration files. For more information, see Triton model repository.

      • --http-port: The HTTP port of the Triton service.

      • --grpc-port: The gRPC port of the Triton service.

      • --allow-metrics: Specifies whether to expose inference metrics (GPU utilization, latency, and throughput).

       arena serve triton \
           --name=triton-vllm \
           --version=v1 \
           --image=ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/tritonserver:24.04-vllm-python-py3-ubuntu22.04 \
           --gpus=1 \
           --cpu=6 \
           --memory=30Gi \
           --data="llm-model:/model/Qwen1.5-4B-Chat" \
           --model-repository /triton-config \
           --config-file="$model_config_file:/triton-config/qwen-4b/1/model.json" \
           --config-file="$triton_config_file:/triton-config/qwen-4b/config.pbtxt" \
           --http-port=8000 \
           --grpc-port=9000 \
           --allow-metrics=true

    Expected output:

       configmap/triton-vllm-v1-4bd5884e6b5b6a3 created
       configmap/triton-vllm-v1-7815124a8204002 created
       service/triton-vllm-v1-tritoninferenceserver created
       deployment.apps/triton-vllm-v1-tritoninferenceserver created
       INFO[0007] The Job triton-vllm has been submitted successfully
       INFO[0007] You can run `arena serve get triton-vllm --type triton-serving -n default` to check the job status
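
    The two --config-file mounts in the command above assemble the directory layout that Triton expects for a model repository. The model name qwen-4b, which appears later in request URLs, comes from the subdirectory name in the mount paths:

       /triton-config                  # --model-repository
       └── qwen-4b                     # model name, used in request URLs
           ├── config.pbtxt            # Triton backend configuration
           └── 1                       # model version
               └── model.json          # vLLM engine parameters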
  3. Verify that the service is running. Wait until Available shows 1.

       arena serve get triton-vllm

    Expected output:

       Name:       triton-vllm
       Namespace:  default
       Type:       Triton
       Version:    v1
       Desired:    1
       Available:  1
       Age:        3m
       Address:    172.16.XX.XX
       Port:       RESTFUL:8000,GRPC:9000
       GPU:        1
    
       Instances:
         NAME                                                  STATUS   AGE  READY  RESTARTS  GPU  NODE
         ----                                                  ------   ---  -----  --------  ---  ----
         triton-vllm-v1-tritoninferenceserver-b69cb7759-gkwz6  Running  3m   1/1    0         1    cn-beijing.172.16.XX.XX

Step 4: Verify the inference service

Port forwarding (development only)

Important

Port forwarding through kubectl port-forward is intended for development and debugging only. It is not reliable, secure, or scalable for production use. For production networking, see Ingress overview.

  1. Set up port forwarding:

       kubectl port-forward svc/triton-vllm-v1-tritoninferenceserver 8000:8000

    Expected output:

       Forwarding from 127.0.0.1:8000 -> 8000
       Forwarding from [::1]:8000 -> 8000
  2. Send a test request to the Triton generate endpoint. Replace qwen-4b with your actual model name if different.

       curl -X POST localhost:8000/v2/models/qwen-4b/generate \
         -d '{"text_input": "What is AI? AI is", "parameters": {"stream": false, "temperature": 0}}'

    Expected output:

       {"model_name":"qwen-4b","model_version":"1","text_output":"What is AI? AI is a branch of computer science that studies how to make computers intelligent. Purpose of AI"}
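
The same request can also be sent from Python. The following is a minimal client sketch that uses only the standard library; the helper name build_generate_request is our own, and the commented call assumes that the port forwarding from step 1 is still active.

```python
import json
from urllib import request

def build_generate_request(base_url: str, model: str, prompt: str) -> request.Request:
    """Build a POST request for Triton's generate endpoint."""
    body = json.dumps({
        "text_input": prompt,
        "parameters": {"stream": False, "temperature": 0},
    }).encode("utf-8")
    return request.Request(
        f"{base_url}/v2/models/{model}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With the port forwarding active, send the request and print the reply:
# req = build_generate_request("http://localhost:8000", "qwen-4b", "What is AI? AI is")
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read())["text_output"])
```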

(Optional) Clean up

Delete the inference service and storage resources when they are no longer needed:

# Delete the inference service
arena serve del triton-vllm

# Delete the PVC and PV
kubectl delete pvc llm-model
kubectl delete pv llm-model