This topic uses the Qwen1.5-4B-Chat model and an A10 GPU to demonstrate how to deploy an inference service for the Qwen model in Container Service for Kubernetes (ACK) by using the LMDeploy framework.
Background information
Qwen1.5-4B-Chat
Qwen1.5-4B-Chat is a 4-billion-parameter large language model (LLM) developed by Alibaba Cloud based on the Transformer architecture. The model is trained on an ultra-large corpus that covers a wide variety of web text, books from specialized domains, and code. For more information, see the Qwen GitHub repository.
LMDeploy
LMDeploy is a toolkit for compressing, deploying, and serving LLMs. It provides the following capabilities:
Model compression and optimization: LMDeploy can perform weight quantization and key-value (KV) cache quantization on LLMs to reduce model size and memory usage. It can also improve inference efficiency and throughput through optimizations such as tensor parallelism and KV caching.
Deployment convenience: LMDeploy supports deploying optimized models to different environments, including single-machine, multi-machine, and multi-GPU environments. It also supports distributed deployment to ensure service scalability and high availability.
Service management: LMDeploy can reduce redundant computation and improve response speed through caching techniques.
For more information about the LMDeploy framework, see the LMDeploy GitHub repository.
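As an illustration of these capabilities, the following sketch shows how weight quantization and tensor-parallel serving are typically invoked through the LMDeploy command line. The commands and flags are assumptions based on common LMDeploy usage and may differ across versions; check lmdeploy --help in your image before relying on them.
# Quantize the model weights with AWQ to reduce model size and GPU memory usage.
lmdeploy lite auto_awq /model/Qwen1.5-4B-Chat --work-dir /model/Qwen1.5-4B-Chat-4bit
# Serve the model across two GPUs with tensor parallelism.
lmdeploy serve api_server /model/Qwen1.5-4B-Chat --server-port 8000 --tp 2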
Prerequisites
An ACK Pro cluster that contains GPU-accelerated nodes is created. The Kubernetes version of the cluster is 1.22 or later. Each GPU-accelerated node provides at least 16 GB of GPU memory. For more information, see Create an ACK managed cluster.
We recommend that you install a GPU driver of version 525. You can add the
ack.aliyun.com/nvidia-driver-version:525.105.17
label to GPU-accelerated nodes to specify GPU driver version 525.105.17. For more information, see Specify an NVIDIA driver version for nodes by adding a label. A sketch for checking the driver version on a node is shown after the prerequisites.
The latest version of the Arena client is installed. For more information, see Configure the Arena client.
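To confirm which driver version a GPU-accelerated node runs, you can check the node label and the driver reported on the node. The commands below are a minimal sketch; nvidia-smi must be run on the node itself or in a GPU pod scheduled to it.
# List GPU-accelerated nodes and the driver-version label, if it is set.
kubectl get nodes -L ack.aliyun.com/nvidia-driver-version
# On the node (or in a GPU pod on the node), print the installed driver version.
nvidia-smi --query-gpu=driver_version --format=csv,noheader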
Step 1: Prepare model data
This topic uses the Qwen1.5-4B-Chat model to demonstrate how to download the model, upload the model to Object Storage Service (OSS), and create the corresponding persistent volume (PV) and persistent volume claim (PVC) in the ACK cluster.
For more information about how to upload the model to NAS, see Use NAS static persistent volume.
Download the model file.
Run the following command to install Git:
# Run yum install git or apt install git based on your operating system.
yum install git
Run the following command to install the Git Large File Support (LFS) plug-in:
# Run yum install git-lfs or apt install git-lfs based on your operating system.
yum install git-lfs
Run the following command to clone the Qwen1.5-4B-Chat repository on ModelScope to the local environment:
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git
Run the following command to enter the Qwen1.5-4B-Chat directory and pull large files managed by LFS:
cd Qwen1.5-4B-Chat
git lfs pull
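Optionally, verify that the LFS-managed weight files were fully downloaded. LFS pointer files are only a few hundred bytes, whereas the actual weight files total several gigabytes, so the directory size is a quick sanity check.
# List the files in the model directory and show the total size.
ls -lh
du -sh .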
Upload the Qwen1.5-4B-Chat model file to OSS.
Log on to the OSS console, and view and record the name of the OSS bucket that you created.
For more information about how to create an OSS bucket, see Create a bucket.
Install and configure ossutil to manage OSS resources. For more information, see Install ossutil.
Run the following command to create a directory named Qwen1.5-4B-Chat in OSS:
ossutil mkdir oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
Run the following command to upload the model file to OSS:
ossutil cp -r ./Qwen1.5-4B-Chat oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
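Optionally, run the following command to confirm that the model files are listed in the OSS directory:
ossutil ls oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat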
Configure a PV and a PVC in the destination cluster. For more information, see Mount a statically provisioned OSS volume. A YAML sketch of an equivalent configuration is provided after the parameter tables.
Configure the following parameters for the PV:
PV Type: OSS
Volume Name: llm-model
Access Certificate: Specify the AccessKey ID and AccessKey secret used to access the OSS bucket.
Bucket ID: Specify the name of the OSS bucket that you created.
OSS Path: Select the path of the model, such as /Qwen1.5-4B-Chat, which is the directory that you created in the previous step.
Configure the following parameters for the PVC:
PVC Type: OSS
Volume Name: llm-model
Allocation Mode: Select Existing Volumes.
Existing Volumes: Click the Existing Volumes hyperlink and select the PV that you created.
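If you prefer to create the PV and PVC from the command line instead of the console, the following is a minimal sketch of an equivalent statically provisioned OSS volume. The Secret layout, CSI parameter names, OSS endpoint, and capacity are assumptions; align them with the Mount a statically provisioned OSS volume documentation before you apply the manifest.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret        # assumed name; referenced by the PV below
  namespace: default
stringData:
  akId: <Your-AccessKey-ID>
  akSecret: <Your-AccessKey-Secret>
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi          # assumed capacity; any value large enough for the model works
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: <Your-Bucket-Name>
      url: oss-cn-beijing-internal.aliyuncs.com   # use the endpoint of your bucket's region
      path: /Qwen1.5-4B-Chat
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
EOF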
Step 2: Deploy the inference service
Run the following command to deploy the Qwen1.5-4B-Chat model as a custom inference service named lmdeploy-qwen by using the LMDeploy tool:
arena serve custom \
    --name=lmdeploy-qwen \
    --version=v1 \
    --gpus=1 \
    --replicas=1 \
    --restful-port=8000 \
    --readiness-probe-action="tcpSocket" \
    --readiness-probe-action-option="port: 8000" \
    --readiness-probe-option="initialDelaySeconds: 30" \
    --readiness-probe-option="periodSeconds: 30" \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/lmdeploy:v0.4.2 \
    --data=llm-model:/model/Qwen1.5-4B-Chat \
    "lmdeploy serve api_server /model/Qwen1.5-4B-Chat --server-port 8000"
The parameters are described as follows:
--name: The name of the inference service.
--version: The version of the inference service.
--gpus: The number of GPUs used by each inference service replica.
--replicas: The number of inference service replicas.
--restful-port: The port on which the inference service is exposed.
--readiness-probe-action: The connection type of readiness probes. Valid values: HttpGet, Exec, gRPC, and TCPSocket.
--readiness-probe-action-option: The connection method of readiness probes.
--readiness-probe-option: The readiness probe configuration.
--data: Mount a shared PVC to the runtime environment. The value consists of two parts separated by a colon (:). Specify the name of the PVC on the left side of the colon. You can run the arena data list command to query the existing PVCs in the cluster. Specify the mount path in the runtime environment on the right side of the colon. This way, your script can access the data or model stored in the specified PV.
--image: The address of the inference service image.
Expected output:
service/lmdeploy-qwen-v1 created
deployment.apps/lmdeploy-qwen-v1-custom-serving created
INFO[0002] The Job lmdeploy-qwen has been submitted successfully
INFO[0002] You can run `arena serve get lmdeploy-qwen --type custom-serving -n default` to check the job status
The output indicates that the inference service was deployed successfully.
Run the following command to query the detailed information of the service and wait until the service is ready:
arena serve get lmdeploy-qwen
Expected output:
Name:       lmdeploy-qwen
Namespace:  default
Type:       Custom
Version:    v1
Desired:    1
Available:  1
Age:        1m
Address:    192.168.XX.XX
Port:       RESTFUL:8000
GPU:        1

Instances:
  NAME                                              STATUS   AGE  READY  RESTARTS  GPU  NODE
  ----                                              ------   ---  -----  --------  ---  ----
  lmdeploy-qwen-v1-custom-serving-8476b9dd8c-8b4d2  Running  1m   1/1    0         1    cn-beijing.172.16.XX.XX
The output indicates that a pod (lmdeploy-qwen-v1-custom-serving-8476b9dd8c-8b4d2) is deployed for the inference service and ready to provide services.
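If the pod does not reach the Running state or the service is not ready, you can inspect the workload that Arena created. The deployment and pod names below are taken from the preceding output; replace them with the names in your cluster.
# Print the logs of the inference server container.
kubectl logs deploy/lmdeploy-qwen-v1-custom-serving
# Show the events of the pod if it is stuck in a non-Running state.
kubectl describe pod lmdeploy-qwen-v1-custom-serving-8476b9dd8c-8b4d2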
Step 3: Verify the inference service
Run the following command to set up port forwarding between the inference service and local environment:
Important: Port forwarding set up by using kubectl port-forward is not reliable, secure, or scalable in production environments. It is intended only for development and debugging. Do not use this command to set up port forwarding in production environments. For more information about networking solutions used for production in ACK clusters, see Ingress overview.
kubectl port-forward svc/lmdeploy-qwen-v1 8000:8000
Expected output:
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
Run the following command to send a request to the model inference service:
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen", "messages": [{"role": "user", "content": "Test it out."}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'
Expected output:
{"id":"1","object":"chat.completion","created":1719833349,"model":"qwen","choices":[{"index":0,"message":{"role":"assistant","content":"Sure, do you have any testing requirements or issues?"},"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":21,"total_tokens":32,"completion_tokens":11}}
The output indicates that the model can generate an appropriate response based on the given prompt. In this example, the prompt is a test request.
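The LMDeploy api_server provides an OpenAI-compatible API, so as an additional check you can list the models that the server serves. The exact fields in the response may vary with the LMDeploy version.
curl http://localhost:8000/v1/models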
(Optional) Step 4: Clear data
If you no longer need the resources, delete them promptly.
Run the following command to delete the inference service:
arena serve del lmdeploy-qwen
Run the following command to delete the PV and PVC:
kubectl delete pvc llm-model
kubectl delete pv llm-model
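Optionally, confirm that the resources have been removed:
# The inference service should no longer appear in the list.
arena serve list
# The llm-model PV and PVC should no longer be listed.
kubectl get pv,pvc
# If you created the oss-secret Secret from the earlier sketch, delete it as well.
kubectl delete secret oss-secret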