This topic uses the Qwen1.5-4B-Chat model and the T4 and A10 GPUs as an example to demonstrate how to use the Triton (Triton Inference Server) and vLLM inference frameworks to deploy a Qwen model as an inference service in Container Service for Kubernetes (ACK).
Background information
Qwen1.5-4B-Chat
Qwen1.5-4B-Chat is a 4-billion-parameter large language model (LLM) developed by Alibaba Cloud based on the Transformer architecture. The model is trained on an ultra-large amount of data that covers a wide variety of web text, books from specialized domains, and code. For more information, see the Qwen GitHub repository.
Triton (Triton Inference Server)
Triton (Triton Inference Server) is an open source inference service framework provided by NVIDIA to help you quickly develop AI inference applications. Triton supports various machine learning frameworks as backends, including TensorRT, TensorFlow, PyTorch, ONNX, and vLLM. Triton is optimized for real-time inference, batch inference, and audio/video streaming inference to provide improved performance. Key features of Triton:
Supports various machine learning and deep learning frameworks.
Supports concurrent model execution.
Supports continuous batching.
Exposes key inference service metrics, such as GPU utilization, request latency, and request throughput.
For more information about the Triton inference service framework, see the Triton Inference Server GitHub repository.
vLLM
vLLM is a high-performance and easy-to-use LLM inference service framework. vLLM supports most commonly used LLMs, including Qwen models. vLLM is powered by technologies such as PagedAttention optimization, continuous batching, and model quantization to greatly improve the inference efficiency of LLMs. For more information about the vLLM framework, see the vLLM GitHub repository.
Prerequisites
An ACK Pro cluster that contains GPU-accelerated nodes is created. The Kubernetes version of the cluster is 1.22 or later. Each GPU-accelerated node provides at least 16 GB of GPU memory. For more information, see Create an ACK managed cluster.
We recommend that you install a GPU driver of version 525. You can add the ack.aliyun.com/nvidia-driver-version:525.105.17 label to GPU-accelerated nodes to specify the GPU driver version as 525.105.17. For more information, see Specify an NVIDIA driver version for nodes by adding a label.
The latest version of the Arena client is installed. For more information, see Configure the Arena client.
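Before you proceed, you can optionally confirm that the cluster exposes GPU resources and check which driver-version label, if any, is set on the nodes. The following commands are a minimal sketch that uses standard kubectl output options and is not specific to this guide:
# Show the allocatable GPU count on each node.
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
# Show the NVIDIA driver-version label on each node. The column is empty if the label is not set.
kubectl get nodes -L ack.aliyun.com/nvidia-driver-version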
Step 1: Prepare model data
This section uses the Qwen1.5-4B-Chat model as an example to demonstrate how to download models from and upload models to Object Storage Service (OSS) and how to create persistent volumes (PVs) and persistent volume claims (PVCs) in ACK clusters.
For more information about how to deploy other models, see Supported Models. For more information about how to upload models to File Storage NAS (NAS), see Mount a statically provisioned NAS volume.
Download the model file.
Run the following command to install Git:
# Run yum install git or apt install git based on your operating system.
yum install git
Run the following command to install the Git Large File Storage (LFS) plug-in:
# Run yum install git-lfs or apt install git-lfs based on your operating system.
yum install git-lfs
Run the following command to clone the Qwen1.5-4B-Chat repository on ModelScope to the local environment:
GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git
Run the following command to enter the Qwen1.5-4B-Chat directory and pull large files managed by LFS:
cd Qwen1.5-4B-Chat
git lfs pull
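After the pull completes, you may want to confirm that the weight files were fully downloaded instead of remaining as LFS pointer files. This is an optional sanity check:
# The model weight files should total several GB; tiny files indicate unpulled LFS pointers.
ls -lh
# List the files tracked by Git LFS in this repository.
git lfs ls-files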
Upload the Qwen1.5-4B-Chat model file to OSS.
Log on to the OSS console, and view and record the name of the OSS bucket that you created.
For more information about how to create an OSS bucket, see Create a bucket.
Install and configure ossutil to manage OSS resources. For more information, see Install ossutil.
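For reference, a minimal ossutil configuration might look like the following sketch. The endpoint and credentials are placeholders; replace them with the endpoint of your bucket's region and the AccessKey pair of your account or Resource Access Management (RAM) user:
# Configure ossutil with the OSS endpoint and AccessKey pair (placeholders shown).
ossutil config -e oss-cn-beijing.aliyuncs.com -i <Your-AccessKey-ID> -k <Your-AccessKey-Secret>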
Run the following command to create a directory named Qwen1.5-4B-Chat in OSS:
ossutil mkdir oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
Run the following command to upload the model file to OSS:
ossutil cp -r ./Qwen1.5-4B-Chat oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
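You can optionally verify the upload by listing the objects in the directory:
# List the uploaded model files to confirm that the upload succeeded.
ossutil ls oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat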
Configure PVs and PVCs in the destination cluster. For more information, see Mount a statically provisioned OSS volume.
Configure the following parameters for the PV:
PV Type: OSS
Volume Name: llm-model
Access Certificate: Specify the AccessKey ID and AccessKey secret used to access the OSS bucket.
Bucket ID: Specify the name of the OSS bucket that you created.
OSS Path: Select the path of the model in the bucket, such as /Qwen1.5-4B-Chat if you followed the preceding upload step.
Configure the following parameters for the PVC:
PVC Type: OSS
Volume Name: llm-model
Allocation Mode: Select Existing Volumes.
Existing Volumes: Click the Existing Volumes hyperlink and select the PV that you created.
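After the PV and PVC are created, you can optionally confirm that the PVC named llm-model is in the Bound state before you deploy the service:
# The PVC must be Bound to the PV before it can be mounted by the inference service.
kubectl get pv llm-model
kubectl get pvc llm-model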
Step 2: Configure a Triton inference service framework
Create a Triton model configuration file named config.pbtxt and a vLLM engine configuration file named model.json, which are required to configure the Triton inference service framework.
Run the following command to create a working directory:
mkdir triton-vllm
Run the following command to create a Triton model configuration file named config.pbtxt:
cat << EOF > triton-vllm/config.pbtxt
backend: "vllm"

# The usage of device is deferred to the vLLM engine
instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]

version_policy: { all { }}
EOF
Run the following command to create a vLLM engine configuration file named model.json.
LLMs consume a large amount of GPU memory. We recommend that you use the high-performance A10 GPU to ensure optimal performance in a production environment. For testing purposes, you can use the T4 GPU, which is widely adopted and cost-effective. However, the A10 GPU greatly outperforms the T4 GPU.
Use a single A10 GPU
cat << EOF > triton-vllm/model.json { "model":"/model/Qwen1.5-4B-Chat", "disable_log_requests": "true", "gpu_memory_utilization": 0.95, "trust_remote_code": "true", "max_model_len": 16384 } EOF
In the preceding configuration file, the max_model_len parameter specifies the maximum number of tokens that the model can process. A larger value improves the conversation experience of the model but consumes more GPU memory. For more information about the configuration of the vLLM + Triton inference service framework, see Deploying a vLLM model in Triton.
Use a single T4 GPU
cat << EOF > triton-vllm/model.json { "model":"/model/Qwen1.5-4B-Chat", "disable_log_requests": "true", "gpu_memory_utilization": 0.95, "trust_remote_code": "true", "dtype": "half", "max_model_len": 8192 } EOF
In the preceding configuration file, the max_model_len parameter specifies the maximum number of tokens that the model can process. A larger value improves the conversation experience of the model but consumes more GPU memory. The dtype parameter specifies the floating-point precision used to load the model. The T4 GPU does not support the bfloat16 (bf16) precision. Therefore, the dtype parameter is set to half (half-precision floating point) in the preceding configuration. For more information about the configuration of the vLLM + Triton inference service framework, see Deploying a vLLM model in Triton.
Step 3: Deploy an inference service
In the following example, Arena is used to deploy an inference service from the Qwen1.5-4B-Chat model. The inference service uses Triton as the inference service framework and vLLM as the model inference backend.
Run the following command to export the triton_config_file and model_config_file environment variables, which point to the Triton and vLLM configuration files created in Step 2. This allows you to configure and deploy inference services in different environments without hard-coding file paths into each command or script.
export triton_config_file="triton-vllm/config.pbtxt"
export model_config_file="triton-vllm/model.json"
Run the following command to deploy an inference service:
arena serve triton \
  --name=triton-vllm \
  --version=v1 \
  --image=ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/tritonserver:24.04-vllm-python-py3-ubuntu22.04 \
  --gpus=1 \
  --cpu=6 \
  --memory=30Gi \
  --data="llm-model:/model/Qwen1.5-4B-Chat" \
  --model-repository /triton-config \
  --config-file="$model_config_file:/triton-config/qwen-4b/1/model.json" \
  --config-file="$triton_config_file:/triton-config/qwen-4b/config.pbtxt" \
  --http-port=8000 \
  --grpc-port=9000 \
  --allow-metrics=true
The parameters are described as follows:
--name: The name of the inference service.
--version: The version of the inference service.
--image: The address of the image used by the inference service.
--gpus: The number of GPUs used by each inference service replica.
--cpu: The number of CPUs used by each inference service replica.
--memory: The amount of memory used by each inference service replica.
--data: Mount a shared PVC to the runtime. The value consists of two parts separated by a colon (:). The part on the left side of the colon is the name of the PVC. You can run the arena data list command to view the PVCs in the current cluster. The part on the right side of the colon is the path to which the PVC is mounted in the container. The inference service reads the model from this path, which allows it to access the model data stored in the PV claimed by the PVC.
--config-file: Mount a local configuration file to the runtime. The value consists of two parts separated by a colon (:). The part on the left side of the colon is the path of the local configuration file, and the part on the right side is the path to which the file is mounted in the container.
--model-repository: The directory of the Triton model repository. The directory can contain multiple subdirectories, each of which corresponds to a model to be loaded into the Triton inference service framework and must contain the corresponding model configuration file. For more information, see the Triton official documentation.
--http-port: The HTTP port exposed by the Triton inference service.
--grpc-port: The gRPC port exposed by the Triton inference service.
--allow-metrics: Specifies whether to expose the metrics of the Triton inference service.
Expected output:
configmap/triton-vllm-v1-4bd5884e6b5b6a3 created
configmap/triton-vllm-v1-7815124a8204002 created
service/triton-vllm-v1-tritoninferenceserver created
deployment.apps/triton-vllm-v1-tritoninferenceserver created
INFO[0007] The Job triton-vllm has been submitted successfully
INFO[0007] You can run `arena serve get triton-vllm --type triton-serving -n default` to check the job status
The output indicates that the inference service is deployed.
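Loading the model weights can take a few minutes. If the pod is not ready yet, you can optionally follow the server logs while it starts. This is only a debugging aid; the Deployment name below is taken from the preceding output:
# Follow the logs of the Triton inference server pod.
kubectl logs -f deployment/triton-vllm-v1-tritoninferenceserver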
Run the following command to view the detailed information of the inference service. Wait until the service is ready.
arena serve get triton-vllm
Expected output:
Name:       triton-vllm
Namespace:  default
Type:       Triton
Version:    v1
Desired:    1
Available:  1
Age:        3m
Address:    172.16.XX.XX
Port:       RESTFUL:8000,GRPC:9000
GPU:        1

Instances:
  NAME                                                  STATUS   AGE  READY  RESTARTS  GPU  NODE
  ----                                                  ------   ---  -----  --------  ---  ----
  triton-vllm-v1-tritoninferenceserver-b69cb7759-gkwz6  Running  3m   1/1    0         1    cn-beijing.172.16.XX.XX
The output indicates that a pod (triton-vllm-v1-tritoninferenceserver-b69cb7759-gkwz6) is running for the inference service and ready to provide services.
Step 4: Verify the inference service
Run the following command to set up port forwarding between the inference service and local environment.
Important: Port forwarding set up by using kubectl port-forward is not reliable, secure, or scalable in production environments. It is intended only for development and debugging. Do not use this command to set up port forwarding in production environments. For more information about networking solutions used for production in ACK clusters, see Ingress overview.
kubectl port-forward svc/triton-vllm-v1-tritoninferenceserver 8000:8000
Expected output:
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
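Before you send an inference request, you can optionally check the readiness endpoint that Triton exposes as part of its HTTP/REST inference protocol:
# Returns HTTP 200 when the server and the loaded models are ready to serve requests.
curl -v localhost:8000/v2/health/ready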
Run the following command to send a request to the Triton inference service.
curl -X POST localhost:8000/v2/models/qwen-4b/generate -d '{"text_input": "What is AI? AI is", "parameters": {"stream": false, "temperature": 0}}'
Replace qwen-4b in the URL with the actual name of the inference model.
Expected output:
{"model_name":"qwen-4b","model_version":"1","text_output":"What is AI? AI is a branch of computer science that studies how to make computers intelligent. Purpose of AI"}
The output indicates that the model can provide the definition of AI.
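The Triton vLLM backend also supports streaming responses. As a sketch that follows the same request format as above, you can send the request to the generate_stream endpoint with the stream parameter set to true; the tokens are then returned incrementally as server-sent events:
# Send a streaming request to the qwen-4b model. Replace the model name if yours differs.
curl -X POST localhost:8000/v2/models/qwen-4b/generate_stream \
  -d '{"text_input": "What is AI? AI is", "parameters": {"stream": true, "temperature": 0}}'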
(Optional) Step 5: Clear the environment
If you no longer need the resources, delete the resources at the earliest opportunity.
Run the following command to delete the inference service:
arena serve del triton-vllm
Run the following command to delete the PV and PVC:
kubectl delete pvc llm-model
kubectl delete pv llm-model
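If you no longer need the model files that you uploaded to OSS, you can also delete them to avoid storage costs. The command below assumes the bucket and directory names used earlier in this topic; double-check the path before you run it:
# Recursively delete the model directory from the OSS bucket.
ossutil rm -r oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat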