
Container Service for Kubernetes: Use vLLM to deploy a Qwen model as an inference service

Last Updated: Feb 28, 2026

Deploy the Qwen1.5-4B-Chat large language model (LLM) as an inference service in Container Service for Kubernetes (ACK) by using vLLM. This tutorial uses Arena to deploy the model on a single GPU node with either an NVIDIA A10 or T4 GPU.

Prerequisites

Before you begin, make sure that you have:

  • An ACK Pro cluster with GPU-accelerated nodes running Kubernetes 1.22 or later, where each GPU-accelerated node provides at least 16 GB of GPU memory. For more information, see Create an ACK managed cluster.

  • The latest version of the Arena client installed. For more information, see Configure the Arena client.

Note

Use GPU driver version 525. Add the ack.aliyun.com/nvidia-driver-version:525.105.17 label to your GPU-accelerated nodes to specify the driver version. For more information, see Specify an NVIDIA driver version for nodes by adding a label.
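The following command is a sketch of what setting the label looks like. The node name is a placeholder, and on ACK the label generally must be in place before a node is initialized so that the expected driver is installed, so set it on the node pool rather than on running nodes where possible (see the linked topic):

# Hypothetical example; <Your-Node-Name> is a placeholder.
# On ACK, set this label before node initialization (for example, on the
# node pool) so that driver 525.105.17 is installed when the node is added.
kubectl label node <Your-Node-Name> ack.aliyun.com/nvidia-driver-version=525.105.17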

Background

Qwen1.5-4B-Chat

Qwen1.5-4B-Chat is a 4-billion-parameter LLM developed by Alibaba Cloud based on the Transformer architecture. It is trained on large-scale data that covers web text, domain-specific books, and code. For more information, see the Qwen GitHub repository.

vLLM

vLLM is an open-source LLM inference framework that delivers high throughput and low latency. It supports most widely used LLMs, including Qwen models. vLLM uses PagedAttention optimization, continuous batching, and model quantization to improve inference efficiency. For more information, see the vLLM GitHub repository.

Step 1: Prepare the model data

Download the Qwen1.5-4B-Chat model, upload it to Object Storage Service (OSS), and create a persistent volume (PV) and persistent volume claim (PVC) in your ACK cluster.

Note

You can also store the model in File Storage NAS instead of OSS. For more information, see Mount a statically provisioned NAS volume.

Download the model

  1. Install Git:

       # Use yum on CentOS or Alibaba Cloud Linux; use apt install git on Debian or Ubuntu.
       yum install git
  2. Install the Git Large File Storage (Git LFS) plug-in:

       # Use yum on CentOS or Alibaba Cloud Linux; use apt install git-lfs on Debian or Ubuntu.
       yum install git-lfs
  3. Clone the Qwen1.5-4B-Chat repository from ModelScope:

       GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git
  4. Enter the Qwen1.5-4B-Chat directory and pull the large files managed by Git LFS:

       cd Qwen1.5-4B-Chat
       git lfs pull
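
You can optionally confirm that the weight files were fully downloaded rather than left as Git LFS pointer files. The exact file names depend on the repository contents; a quick check looks like this:

# Optional sanity check from inside the Qwen1.5-4B-Chat directory: the
# directory should be several GB in size and contain the weight files.
du -sh .
ls -lh *.safetensors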

Upload the model to OSS

  1. Log on to the OSS console and note the name of your OSS bucket. To create a bucket, see Create a bucket.

  2. Install and configure ossutil. For more information, see Install ossutil.

  3. Create a directory named Qwen1.5-4B-Chat in your OSS bucket:

       ossutil mkdir oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
  4. Upload the model files to OSS:

       ossutil cp -r ./Qwen1.5-4B-Chat oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
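
Optionally list the objects in the directory to confirm that the upload succeeded:

# Optional: verify that the model files are present in the bucket.
ossutil ls oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat/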

Create a PV and PVC

Create a PV and PVC in your ACK cluster to mount the model stored in OSS. For detailed instructions, see Mount a statically provisioned OSS volume.

Configure the PV with the following parameters:

Parameter           Description
---------           -----------
PV Type             OSS
Volume Name         llm-model
Access Certificate  The AccessKey ID and AccessKey secret used to access the OSS bucket
Bucket ID           The name of the OSS bucket that you created
OSS Path            The path to the model in the bucket, such as /Qwen1.5-4B-Chat in this example

Configure the PVC with the following parameters:

Parameter         Description
---------         -----------
PVC Type          OSS
Volume Name       llm-model
Allocation Mode   Select Existing Volumes
Existing Volumes  Click the Existing Volumes hyperlink and select the PV that you created
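
If you prefer kubectl over the console, the following is a minimal sketch of the equivalent Secret, PV, and PVC. The Secret name (oss-secret), endpoint URL, and storage size are example values, not requirements; see Mount a statically provisioned OSS volume for the authoritative configuration.

# Minimal sketch: create the Secret, PV, and PVC with kubectl.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
  namespace: default
stringData:
  akId: <Your-AccessKey-ID>
  akSecret: <Your-AccessKey-Secret>
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <Your-Bucket-Name>
      url: oss-cn-hangzhou-internal.aliyuncs.com   # example value; use your bucket's endpoint
      path: /Qwen1.5-4B-Chat
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
EOF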

Step 2: Deploy the inference service

The model parameter files are treated as a dataset. Use the --data parameter provided by Arena to mount the model to a path inside the inference service container. In this example, the model is mounted to /model/Qwen1.5-4B-Chat.
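
Before you deploy, you can confirm that the PVC from Step 1 is visible to Arena:

# The llm-model PVC created in Step 1 should appear in the output.
arena data list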

The --max-model-len parameter sets the maximum context length, in tokens, that the model can handle in a single request, covering both the prompt and the generated output. A larger value supports longer conversations but requires more GPU memory for the KV cache. For more information about vLLM parameters, see the vLLM code repository on GitHub.

Choose a GPU type

GPU         Use case                       max-model-len  Additional parameters
---         --------                       -------------  ---------------------
NVIDIA A10  Production (high performance)  16384          None
NVIDIA T4   Testing (lower cost)           8192           --dtype half
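
As a rough sanity check on these settings: a 4-billion-parameter model in 16-bit precision needs about 4B x 2 bytes, or roughly 8 GB, for the weights alone. The 24 GB A10 therefore has headroom for the KV cache of a 16384-token context, while the 16 GB T4 needs the shorter 8192-token context. The --dtype half flag is required on the T4 because that GPU does not support bfloat16, the default data type of Qwen1.5 models, so the weights must be loaded as float16.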

Deploy the service

Choose one of the following commands based on your GPU type.

NVIDIA A10 (recommended for production)

arena serve custom \
    --name=vllm-qwen-4b-chat \
    --version=v1 \
    --gpus=1 \
    --replicas=1 \
    --restful-port=8000 \
    --readiness-probe-action="tcpSocket" \
    --readiness-probe-action-option="port: 8000" \
    --readiness-probe-option="initialDelaySeconds: 30" \
    --readiness-probe-option="periodSeconds: 30" \
    --image=ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/vllm:0.4.1-ubuntu22.04 \
    --data=llm-model:/model/Qwen1.5-4B-Chat \
    "python3 -m vllm.entrypoints.openai.api_server --trust-remote-code --model /model/Qwen1.5-4B-Chat/ --gpu-memory-utilization 0.95 --max-model-len 16384"

NVIDIA T4 (for testing)

arena serve custom \
    --name=vllm-qwen-4b-chat \
    --version=v1 \
    --gpus=1 \
    --replicas=1 \
    --restful-port=8000 \
    --readiness-probe-action="tcpSocket" \
    --readiness-probe-action-option="port: 8000" \
    --readiness-probe-option="initialDelaySeconds: 30" \
    --readiness-probe-option="periodSeconds: 30" \
    --image=ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/vllm:0.4.1-ubuntu22.04 \
    --data=llm-model:/model/Qwen1.5-4B-Chat \
    "python3 -m vllm.entrypoints.openai.api_server --trust-remote-code --model /model/Qwen1.5-4B-Chat/ --gpu-memory-utilization 0.95 --max-model-len 8192 --dtype half"
Note

The T4 configuration uses --max-model-len 8192 (compared to 16384 for A10) and adds --dtype half to fit within the T4's memory constraints.

The following table describes the Arena parameters:

Parameter                        Description
---------                        -----------
--name                           The name of the inference service
--version                        The version of the inference service
--gpus                           The number of GPUs for each inference service replica
--replicas                       The number of inference service replicas
--restful-port                   The port to expose for the inference service
--readiness-probe-action         The connection type of readiness probes. Valid values: HttpGet, Exec, gRPC, and TCPSocket
--readiness-probe-action-option  The connection method of readiness probes
--readiness-probe-option         The readiness probe configuration
--data                           Mount a shared PVC to the runtime environment in the format PVCName:MountPath. Run the arena data list command to query available PVCs in the cluster
--image                          The container image for the inference service

Expected output:

service/vllm-qwen-4b-chat-v1 created
deployment.apps/vllm-qwen-4b-chat-v1-custom-serving created
INFO[0008] The Job vllm-qwen-4b-chat has been submitted successfully
INFO[0008] You can run `arena serve get vllm-qwen-4b-chat --type custom-serving -n default` to check the job status

Verify the deployment

Run the following command to check the status of the inference service:

arena serve get vllm-qwen-4b-chat

Expected output:

Name:       vllm-qwen-4b-chat
Namespace:  default
Type:       Custom
Version:    v1
Desired:    1
Available:  1
Age:        36m
Address:    172.16.XX.XX
Port:       RESTFUL:8000
GPU:        1

Instances:
  NAME                                                  STATUS   AGE  READY  RESTARTS  GPU  NODE
  ----                                                  ------   ---  -----  --------  ---  ----
  vllm-qwen-4b-chat-v1-custom-serving-6d7c786b9f-z6nfk  Running  36m  1/1    0         1    cn-beijing.192.168.XX.XX

When the Available count matches the Desired count and the instance status is Running, the inference service is ready.
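
If the Available count stays at 0 or an instance is not Running, you can inspect the container logs, for example to catch a GPU out-of-memory error during model loading:

arena serve logs vllm-qwen-4b-chat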

Step 3: Test the inference service

Set up port forwarding

Important

Port forwarding through kubectl port-forward is intended for development and debugging only. Do not use it in production environments. For production networking, see Ingress overview.

Forward port 8000 from the inference service to your local machine:

kubectl port-forward svc/vllm-qwen-4b-chat-v1 8000:8000

Expected output:

Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
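
Because the service exposes an OpenAI-compatible API, you can first confirm that the server is reachable by listing the served models. The returned model name is the mount path that was passed to --model:

# Optional: check that the server is up and see the served model name.
curl http://localhost:8000/v1/models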

Send a test request

Open a new terminal and send an inference request to the vLLM inference service:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/model/Qwen1.5-4B-Chat/", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'

Expected output:

{"id":"cmpl-503270b21fa44db2b6b3c3e0abaa3c02","object":"chat.completion","created":1717141209,"model":"/model/Qwen1.5-4B-Chat/","options":[{"index":0,"message":{"role":"assistant","content":"OK. What do you want to test?"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":21,"total_tokens":30,"completion_tokens":9}}

The vLLM inference service exposes an OpenAI-compatible API at the /v1/chat/completions endpoint.
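
The endpoint also supports the standard OpenAI streaming protocol. For example, setting "stream": true returns the completion incrementally as server-sent events:

# Stream the response token by token instead of waiting for the full reply.
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/model/Qwen1.5-4B-Chat/", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 32, "stream": true}'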

(Optional) Clean up resources

If you no longer need the inference service, delete the resources to avoid unnecessary costs.

Delete the inference service:

arena serve delete vllm-qwen-4b-chat

Delete the PVC and PV:

kubectl delete pvc llm-model
kubectl delete pv llm-model
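
If you also created the oss-secret Secret from the kubectl sketch in Step 1, delete it as well:

kubectl delete secret oss-secret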

Next steps

  • To expose the inference service in production, configure an Ingress. For more information, see Ingress overview.

  • To deploy other Qwen models, adjust the --max-model-len and --gpu-memory-utilization parameters based on the model size and available GPU memory.

  • For more information about vLLM configuration options, see the vLLM GitHub repository.