Deploy the Qwen1.5-4B-Chat large language model (LLM) as an inference service in Container Service for Kubernetes (ACK) by using vLLM. This tutorial uses Arena to deploy the model on a single GPU node with either an NVIDIA A10 or T4 GPU.
Prerequisites
Before you begin, make sure that you have:
An ACK Pro cluster with GPU-accelerated nodes running Kubernetes 1.22 or later, where each GPU-accelerated node provides at least 16 GB of GPU memory. For more information, see Create an ACK managed cluster.
The latest version of the Arena client installed. For more information, see Configure the Arena client.
A GPU driver of version 525 installed on your GPU-accelerated nodes. Add the ack.aliyun.com/nvidia-driver-version:525.105.17 label to the nodes to specify the driver version. For more information, see Specify an NVIDIA driver version for nodes by adding a label.
Background
Qwen1.5-4B-Chat
Qwen1.5-4B-Chat is a 4-billion-parameter LLM developed by Alibaba Cloud based on the Transformer architecture. It is trained on large-scale data that covers web text, domain-specific books, and code. For more information, see the Qwen GitHub repository.
vLLM
vLLM is an open-source LLM inference framework that delivers high throughput and low latency. It supports most widely used LLMs, including Qwen models. vLLM uses PagedAttention optimization, continuous batching, and model quantization to improve inference efficiency. For more information, see the vLLM GitHub repository.
Step 1: Prepare the model data
Download the Qwen1.5-4B-Chat model, upload it to Object Storage Service (OSS), and create a persistent volume (PV) and persistent volume claim (PVC) in your ACK cluster.
You can also store the model in File Storage NAS instead of OSS. For more information, see Mount a statically provisioned NAS volume.
Download the model
Install Git:

```shell
# Use yum or apt, depending on your operating system.
yum install git
```

Install the Git Large File Storage (Git LFS) plug-in:

```shell
# Use yum or apt, depending on your operating system.
yum install git-lfs
```

Clone the Qwen1.5-4B-Chat repository from ModelScope:

```shell
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git
```

Enter the Qwen1.5-4B-Chat directory and pull the large files managed by Git LFS:

```shell
cd Qwen1.5-4B-Chat
git lfs pull
```
Upload the model to OSS
Log on to the OSS console and note the name of your OSS bucket. To create a bucket, see Create a bucket.
Install and configure ossutil. For more information, see Install ossutil.
Create a directory named Qwen1.5-4B-Chat in your OSS bucket:

```shell
ossutil mkdir oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
```

Upload the model files to OSS:

```shell
ossutil cp -r ./Qwen1.5-4B-Chat oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
```
Create a PV and PVC
Create a PV and PVC in your ACK cluster to mount the model stored in OSS. For detailed instructions, see Mount a statically provisioned OSS volume.
Configure the PV with the following parameters:
| Parameter | Description |
|---|---|
| PV Type | OSS |
| Volume Name | llm-model |
| Access Certificate | The AccessKey ID and AccessKey secret used to access the OSS bucket |
| Bucket ID | The name of the OSS bucket that you created |
| OSS Path | The path to the model in the bucket, such as /Qwen1.5-4B-Chat |
Configure the PVC with the following parameters:
| Parameter | Description |
|---|---|
| PVC Type | OSS |
| Volume Name | llm-model |
| Allocation Mode | Select Existing Volumes |
| Existing Volumes | Click the Existing Volumes hyperlink and select the PV that you created |
Step 2: Deploy the inference service
The model parameter files are treated as a dataset. Use the --data parameter provided by Arena to mount the model to a path inside the inference service container. In this example, the model is mounted to /model/Qwen1.5-4B-Chat.
The --max-model-len parameter sets the maximum number of tokens that the model can process in a single context. Increasing this value allows longer prompts and conversations but consumes more GPU memory. For more information about vLLM parameters, see the vLLM code repository on GitHub.
Choose a GPU type
| GPU | Use case | max-model-len | Additional parameters |
|---|---|---|---|
| NVIDIA A10 | Production (high performance) | 16384 | None |
| NVIDIA T4 | Testing (lower cost) | 8192 | --dtype half |
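The memory pressure behind the table above comes largely from the KV cache, which grows linearly with --max-model-len: every token stores a key and a value vector per transformer layer. The sketch below estimates that cost. The layer and head counts are illustrative assumptions, not confirmed Qwen1.5-4B figures; read the real values from the model's config.json before relying on the numbers.

```python
# Rough KV-cache sizing sketch. The architecture constants below are
# ASSUMPTIONS for illustration -- take the real values from the
# model's config.json (num_hidden_layers, num_key_value_heads, etc.).
NUM_LAYERS = 40      # assumed transformer layer count
NUM_KV_HEADS = 20    # assumed key/value head count
HEAD_DIM = 128       # assumed per-head dimension
DTYPE_BYTES = 2      # fp16/bf16 cache entries

def kv_cache_gib(max_model_len: int) -> float:
    """GiB of KV cache needed for one sequence of max_model_len tokens."""
    # 2x for the key and the value stored at every layer.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * DTYPE_BYTES
    return max_model_len * per_token / 1024**3

for n in (8192, 16384):
    print(f"max-model-len={n}: ~{kv_cache_gib(n):.2f} GiB KV cache per sequence")
```

Under these assumed dimensions, halving max-model-len from 16384 to 8192 halves the per-sequence KV-cache budget, which is why the T4 configuration uses the smaller value.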
Deploy the service
Choose one of the following commands based on your GPU type.
NVIDIA A10 (recommended for production)
```shell
arena serve custom \
  --name=vllm-qwen-4b-chat \
  --version=v1 \
  --gpus=1 \
  --replicas=1 \
  --restful-port=8000 \
  --readiness-probe-action="tcpSocket" \
  --readiness-probe-action-option="port: 8000" \
  --readiness-probe-option="initialDelaySeconds: 30" \
  --readiness-probe-option="periodSeconds: 30" \
  --image=ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/vllm:0.4.1-ubuntu22.04 \
  --data=llm-model:/model/Qwen1.5-4B-Chat \
  "python3 -m vllm.entrypoints.openai.api_server --trust-remote-code --model /model/Qwen1.5-4B-Chat/ --gpu-memory-utilization 0.95 --max-model-len 16384"
```

NVIDIA T4 (for testing)
```shell
arena serve custom \
  --name=vllm-qwen-4b-chat \
  --version=v1 \
  --gpus=1 \
  --replicas=1 \
  --restful-port=8000 \
  --readiness-probe-action="tcpSocket" \
  --readiness-probe-action-option="port: 8000" \
  --readiness-probe-option="initialDelaySeconds: 30" \
  --readiness-probe-option="periodSeconds: 30" \
  --image=ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/vllm:0.4.1-ubuntu22.04 \
  --data=llm-model:/model/Qwen1.5-4B-Chat \
  "python3 -m vllm.entrypoints.openai.api_server --trust-remote-code --model /model/Qwen1.5-4B-Chat/ --gpu-memory-utilization 0.95 --max-model-len 8192 --dtype half"
```

The T4 configuration uses --max-model-len 8192 (compared with 16384 for the A10) and adds --dtype half to fit within the T4's memory constraints.
The following table describes the Arena parameters:
| Parameter | Description |
|---|---|
| --name | The name of the inference service |
| --version | The version of the inference service |
| --gpus | The number of GPUs used by each inference service replica |
| --replicas | The number of inference service replicas |
| --restful-port | The port on which the inference service is exposed |
| --readiness-probe-action | The connection type of readiness probes. Valid values: HttpGet, Exec, gRPC, and TCPSocket |
| --readiness-probe-action-option | The connection method of readiness probes |
| --readiness-probe-option | The readiness probe configuration |
| --data | Mounts a shared PVC to the runtime environment. The format is PVCName:MountPath. Run the arena data list command to query available PVCs in the cluster |
| --image | The container image of the inference service |
Expected output:

```
service/vllm-qwen-4b-chat-v1 created
deployment.apps/vllm-qwen-4b-chat-v1-custom-serving created
INFO[0008] The Job vllm-qwen-4b-chat has been submitted successfully
INFO[0008] You can run `arena serve get vllm-qwen-4b-chat --type custom-serving -n default` to check the job status
```

Verify the deployment
Run the following command to check the status of the inference service:

```shell
arena serve get vllm-qwen-4b-chat
```

Expected output:

```
Name:       vllm-qwen-4b-chat
Namespace:  default
Type:       Custom
Version:    v1
Desired:    1
Available:  1
Age:        36m
Address:    172.16.XX.XX
Port:       RESTFUL:8000
GPU:        1

Instances:
  NAME                                                  STATUS   AGE  READY  RESTARTS  GPU  NODE
  ----                                                  ------   ---  -----  --------  ---  ----
  vllm-qwen-4b-chat-v1-custom-serving-6d7c786b9f-z6nfk  Running  36m  1/1    0         1    cn-beijing.192.168.XX.XX
```

When the Available count matches the Desired count and the instance status is Running, the inference service is ready.
Step 3: Test the inference service
Set up port forwarding
Port forwarding through kubectl port-forward is intended for development and debugging only. Do not use it in production environments. For production networking, see Ingress overview.
Forward port 8000 of the inference service to your local machine:

```shell
kubectl port-forward svc/vllm-qwen-4b-chat-v1 8000:8000
```

Expected output:

```
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
```

Send a test request
Open a new terminal and send an inference request to the vLLM inference service:

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/model/Qwen1.5-4B-Chat/", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'
```

Expected output:

```
{"id":"cmpl-503270b21fa44db2b6b3c3e0abaa3c02","object":"chat.completion","created":1717141209,"model":"/model/Qwen1.5-4B-Chat/","choices":[{"index":0,"message":{"role":"assistant","content":"OK. What do you want to test?"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":21,"total_tokens":30,"completion_tokens":9}}
```

The vLLM inference service exposes an OpenAI-compatible API at the /v1/chat/completions endpoint.
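Because the API is OpenAI-compatible, programmatic clients can parse the response as ordinary JSON. The snippet below is a minimal sketch using only the standard library; the embedded payload is an abridged example of the response shape shown above, not captured output from a live service.

```python
import json

# Abridged example of an OpenAI-compatible chat completion response
# (illustrative, not captured from the running service).
raw = '''{
  "model": "/model/Qwen1.5-4B-Chat/",
  "choices": [{"index": 0,
               "message": {"role": "assistant",
                           "content": "OK. What do you want to test?"},
               "finish_reason": "stop"}],
  "usage": {"prompt_tokens": 21, "total_tokens": 30, "completion_tokens": 9}
}'''

resp = json.loads(raw)

# The assistant's reply lives in the first element of the choices array.
answer = resp["choices"][0]["message"]["content"]
print(answer)

# The usage block reports token consumption, useful for capacity planning.
print(resp["usage"]["completion_tokens"], "completion tokens")
```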
(Optional) Clean up resources
If you no longer need the inference service, delete the resources to avoid unnecessary costs.
Delete the inference service:

```shell
arena serve delete vllm-qwen-4b-chat
```

Delete the PVC and PV:

```shell
kubectl delete pvc llm-model
kubectl delete pv llm-model
```

Next steps
To expose the inference service in production, configure an Ingress. For more information, see Ingress overview.
To deploy other Qwen models, adjust the --max-model-len and --gpu-memory-utilization parameters based on the model size and available GPU memory.
For more information about vLLM configuration options, see the vLLM GitHub repository.