Large language models (LLMs) require GPU-accelerated infrastructure and an optimized serving stack to handle inference at scale. This topic describes how to deploy the Qwen1.5-4B-Chat model as an inference service on Container Service for Kubernetes (ACK) by using NVIDIA Triton Inference Server with the vLLM backend on T4 or A10 GPUs.
Background
Qwen1.5-4B-Chat
Qwen1.5-4B-Chat is a 4-billion-parameter LLM developed by Alibaba Cloud based on the Transformer architecture. The model is trained on large-scale datasets that cover web text, domain-specific books, and code. For more information, see the Qwen GitHub repository.
Triton Inference Server
Triton Inference Server is an open-source inference serving framework developed by NVIDIA. It supports multiple machine learning framework backends, including TensorRT, TensorFlow, PyTorch, ONNX, and vLLM.
Key features:
Multiple ML and deep learning framework support
Concurrent model execution
Continuous batching
Built-in inference metrics: GPU utilization, request latency, and throughput
For more information, see the Triton Inference Server GitHub repository.
vLLM
vLLM is a high-performance LLM inference framework that supports most popular LLMs, including Qwen models. vLLM uses PagedAttention optimization, continuous batching, and model quantization to significantly improve LLM inference throughput. For more information, see the vLLM GitHub repository.
Prerequisites
Before you begin, make sure that you have:
An ACK Pro cluster with GPU-accelerated nodes (Kubernetes 1.22 or later, 16 GB or more GPU memory per node). For more information, see Create an ACK managed cluster.
GPU driver version 525 installed on GPU nodes. Add the ack.aliyun.com/nvidia-driver-version:525.105.17 label to GPU-accelerated nodes to pin the driver version. For more information, see Specify an NVIDIA driver version for nodes by adding a label.
The latest Arena client installed. For more information, see Configure the Arena client.
Step 1: Prepare model data
Download the Qwen1.5-4B-Chat model, upload it to Object Storage Service (OSS), and create persistent volumes (PVs) and persistent volume claims (PVCs) in the ACK cluster.
To deploy other vLLM-supported models, see Supported models. To use File Storage NAS instead of OSS, see Mount a statically provisioned NAS volume.
Download the model
Install Git:
# Use yum install git or apt install git
yum install git
Install the Git Large File Storage (LFS) plugin:
# Use yum install git-lfs or apt install git-lfs
yum install git-lfs
Clone the Qwen1.5-4B-Chat repository from ModelScope:
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git
Enter the directory and pull the large files:
cd Qwen1.5-4B-Chat
git lfs pull
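Because the repository is cloned with GIT_LFS_SKIP_SMUDGE=1, a forgotten `git lfs pull` leaves kilobyte-sized LFS pointer files in place of the weights. Before uploading, you can sanity-check the directory with a short script (a hypothetical helper, not part of any toolchain here):

```python
import os

def looks_complete(model_dir, min_weight_mb=100):
    """Heuristic check that weight files were materialized, not left as LFS pointers."""
    names = os.listdir(model_dir)
    has_config = "config.json" in names
    weights = [n for n in names if n.endswith((".safetensors", ".bin"))]
    big_enough = all(
        os.path.getsize(os.path.join(model_dir, n)) > min_weight_mb * 2**20
        for n in weights
    )
    return has_config and bool(weights) and big_enough
```

If this returns False, rerun `git lfs pull` inside the model directory before uploading to OSS.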
Upload the model to OSS
Log on to the OSS console and note the name of your OSS bucket. To create a bucket, see Create a bucket.
Install and configure ossutil. For more information, see Install ossutil.
Create a directory in OSS and upload the model:
ossutil mkdir oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
ossutil cp -r ./Qwen1.5-4B-Chat oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
Create PVs and PVCs
Create a PV and PVC to mount the OSS model data in the cluster. For more information, see Mount a statically provisioned OSS volume.
PV parameters
| Parameter | Value |
| --- | --- |
| PV Type | OSS |
| Volume Name | llm-model |
| Access Certificate | The AccessKey ID and AccessKey secret used to access the OSS bucket |
| Bucket ID | The name of your OSS bucket |
| OSS Path | The model path, such as /models/Qwen1.5-4B-Chat |
PVC parameters
| Parameter | Value |
| --- | --- |
| PVC Type | OSS |
| Volume Name | llm-model |
| Allocation Mode | Existing Volumes |
| Existing Volumes | Select the PV you created |
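If you prefer to create the volumes declaratively with kubectl instead of the console, the tables above roughly correspond to the following manifests. This is a sketch only: the Secret name oss-secret, the endpoint URL, and the capacity are placeholders, and the exact volumeAttributes depend on your OSS CSI plugin version.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: llm-model
    nodePublishSecretRef:
      name: oss-secret          # placeholder: Secret holding the AccessKey pair
      namespace: default
    volumeAttributes:
      bucket: "<Your-Bucket-Name>"
      path: "/models/Qwen1.5-4B-Chat"
      url: "oss-cn-beijing-internal.aliyuncs.com"   # example region endpoint
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
```

The selector on alicloud-pvname binds the PVC to this specific PV, matching the "Existing Volumes" allocation mode in the console.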
Step 2: Configure Triton with vLLM
Create two configuration files: config.pbtxt for the Triton backend, and model.json for vLLM engine parameters.
Create the backend configuration
Create a working directory and the config.pbtxt file:
mkdir triton-vllm
cat << EOF > triton-vllm/config.pbtxt
backend: "vllm"
# The usage of device is deferred to the vLLM engine
instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]
version_policy: { all { }}
EOF
Create the model configuration
The model.json file passes parameters to the vLLM engine. Choose the configuration that matches your GPU type.
vLLM allocates GPU memory aggressively on startup. The gpu_memory_utilization parameter controls this behavior. Setting it to 0.95 reserves 95% of GPU memory for the model. If other workloads share the same GPU, lower this value to avoid out-of-memory errors.
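As a rough illustration of that trade-off, the memory vLLM can spend on KV cache is the reserved fraction minus the model weights. All figures below are illustrative; a 4B-parameter model needs roughly 8 GiB for 16-bit weights.

```python
def kv_budget_gib(total_gib, utilization, weights_gib):
    """GPU memory left for KV cache after loading weights, under vLLM's reservation."""
    return total_gib * utilization - weights_gib

# Illustrative figures: ~8 GiB of 16-bit weights for a 4B-parameter model.
print(round(kv_budget_gib(24.0, 0.95, 8.0), 1))  # A10, 24 GiB -> 14.8
print(round(kv_budget_gib(16.0, 0.95, 8.0), 1))  # T4, 16 GiB  -> 7.2
```

The A10 leaves roughly twice the KV-cache headroom of the T4, which is why the two GPU configurations that follow differ in max_model_len.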
A10 GPU (production)
A10 GPUs deliver higher throughput and support bfloat16 precision. Use A10 for production workloads.
cat << EOF > triton-vllm/model.json
{
"model":"/model/Qwen1.5-4B-Chat",
"disable_log_requests": "true",
"gpu_memory_utilization": 0.95,
"trust_remote_code": "true",
"max_model_len": 16384
}
EOF
T4 GPU (testing)
T4 GPUs are widely available and cost-effective, but do not support bfloat16 (bf16) precision. Set dtype to half (FP16) and use a lower max_model_len to fit within the 16 GB memory limit.
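A back-of-the-envelope estimate shows why the context length must shrink: the per-sequence KV cache grows linearly with max_model_len. The attention dimensions below are illustrative placeholders, not Qwen1.5-4B's actual configuration.

```python
def kv_cache_gib(tokens, layers=40, kv_heads=20, head_dim=128, dtype_bytes=2):
    """Per-sequence KV cache: 2 (K and V) * layers * heads * head_dim * bytes per element."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens / 2**30

print(kv_cache_gib(8192))   # 3.125 GiB per full-length sequence
print(kv_cache_gib(16384))  # 6.25 GiB per full-length sequence
```

On a 16 GiB T4, a multi-GiB cache per long sequence leaves little room for batching, so the T4 configuration halves max_model_len.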
cat << EOF > triton-vllm/model.json
{
"model":"/model/Qwen1.5-4B-Chat",
"disable_log_requests": "true",
"gpu_memory_utilization": 0.95,
"trust_remote_code": "true",
"dtype": "half",
"max_model_len": 8192
}
EOF
Key parameters
| Parameter | Description |
| --- | --- |
| max_model_len | Maximum token sequence length the model can process. Higher values improve conversation quality but consume more GPU memory. |
| dtype | Floating-point precision for model loading. Set to half (FP16) on GPUs that do not support bfloat16, such as T4. |
| gpu_memory_utilization | Fraction of GPU memory allocated to the model. Default is 0.9. |
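A malformed model.json typically surfaces only as a generic model-load failure in the Triton logs, so it can save a crash loop to validate the file first. The check below is a hypothetical pre-flight script, and the accepted dtype values are assumptions based on common vLLM builds.

```python
import json

def check_model_config(text):
    """Parse model.json and catch obviously invalid values before deployment."""
    cfg = json.loads(text)
    for key in ("model", "gpu_memory_utilization", "max_model_len"):
        assert key in cfg, f"missing key: {key}"
    assert 0 < float(cfg["gpu_memory_utilization"]) <= 1, "utilization must be in (0, 1]"
    assert cfg.get("dtype", "auto") in {"auto", "half", "float16", "bfloat16", "float32"}
    return cfg

cfg = check_model_config(
    '{"model": "/model/Qwen1.5-4B-Chat", "disable_log_requests": "true",'
    ' "gpu_memory_utilization": 0.95, "trust_remote_code": "true",'
    ' "dtype": "half", "max_model_len": 8192}'
)
print(cfg["max_model_len"])  # 8192
```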
For the full list of configurable parameters, see the vLLM Engine Arguments documentation. For more configuration examples, see Deploying a vLLM model in Triton.
Step 3: Deploy the inference service
Use Arena to deploy the Qwen1.5-4B-Chat inference service with Triton and vLLM.
Export the configuration file paths as environment variables:
export triton_config_file="triton-vllm/config.pbtxt"
export model_config_file="triton-vllm/model.json"
Deploy the inference service:
Parameters
| Parameter | Description |
| --- | --- |
| --name | Name of the inference service. |
| --version | Version of the inference service. |
| --image | Container image for the Triton server. |
| --gpus | Number of GPUs per replica. |
| --cpu | Number of CPU cores per replica. |
| --memory | Memory allocation per replica. |
| --data | PVC mount in <pvc-name>:<mount-path> format. Run arena data list to list available PVCs. |
| --config-file | Local file mount in <local-path>:<container-path> format. |
| --model-repository | Triton model repository directory. Each subdirectory represents a model and must contain its configuration files. For more information, see Triton model repository. |
| --http-port | HTTP port for the Triton service. |
| --grpc-port | gRPC port for the Triton service. |
| --allow-metrics | Expose inference metrics (GPU utilization, latency, throughput). |
arena serve triton \
    --name=triton-vllm \
    --version=v1 \
    --image=ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/tritonserver:24.04-vllm-python-py3-ubuntu22.04 \
    --gpus=1 \
    --cpu=6 \
    --memory=30Gi \
    --data="llm-model:/model/Qwen1.5-4B-Chat" \
    --model-repository /triton-config \
    --config-file="$model_config_file:/triton-config/qwen-4b/1/model.json" \
    --config-file="$triton_config_file:/triton-config/qwen-4b/config.pbtxt" \
    --http-port=8000 \
    --grpc-port=9000 \
    --allow-metrics=true
Expected output:
configmap/triton-vllm-v1-4bd5884e6b5b6a3 created
configmap/triton-vllm-v1-7815124a8204002 created
service/triton-vllm-v1-tritoninferenceserver created
deployment.apps/triton-vllm-v1-tritoninferenceserver created
INFO[0007] The Job triton-vllm has been submitted successfully
INFO[0007] You can run `arena serve get triton-vllm --type triton-serving -n default` to check the job status
Verify that the service is running. Wait until Available shows 1.
arena serve get triton-vllm
Expected output:
Name:       triton-vllm
Namespace:  default
Type:       Triton
Version:    v1
Desired:    1
Available:  1
Age:        3m
Address:    172.16.XX.XX
Port:       RESTFUL:8000,GRPC:9000
GPU:        1

Instances:
  NAME                                                  STATUS   AGE  READY  RESTARTS  GPU  NODE
  ----                                                  ------   ---  -----  --------  ---  ----
  triton-vllm-v1-tritoninferenceserver-b69cb7759-gkwz6  Running  3m   1/1    0         1    cn-beijing.172.16.XX.XX
Step 4: Verify the inference service
Port forwarding (development only)
Port forwarding through kubectl port-forward is intended for development and debugging only. It is not reliable, secure, or scalable for production use. For production networking, see Ingress overview.
Set up port forwarding:
kubectl port-forward svc/triton-vllm-v1-tritoninferenceserver 8000:8000
Expected output:
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
Send a test request to the Triton generate endpoint. Replace qwen-4b with your actual model name if different.
curl -X POST localhost:8000/v2/models/qwen-4b/generate \
    -d '{"text_input": "What is AI? AI is", "parameters": {"stream": false, "temperature": 0}}'
Expected output:
{"model_name":"qwen-4b","model_version":"1","text_output":"What is AI? AI is a branch of computer science that studies how to make computers intelligent. Purpose of AI"}
(Optional) Clean up
Delete the inference service and storage resources when they are no longer needed:
# Delete the inference service
arena serve del triton-vllm
# Delete the PVC and PV
kubectl delete pvc llm-model
kubectl delete pv llm-model