This topic uses the Qwen2-1.5B-Instruct model and A10 GPUs as an example to demonstrate how to use Triton and TensorRT-LLM to deploy a Qwen2 model as an inference service in Container Service for Kubernetes (ACK). In this example, Fluid Dataflow is used to prepare data during model deployment, and Fluid is used to accelerate model loading.
Background information
Qwen2-1.5B-Instruct
Qwen2-1.5B-Instruct is a 1.5-billion-parameter large language model (LLM) developed by Alibaba Cloud based on the Transformer architecture. The model is trained on an ultra-large amount of data that covers a wide variety of web text, books from specialized sectors, and code.
For more information, see Qwen2 GitHub repository.
Triton (Triton Inference Server)
Triton (Triton Inference Server) is an open source inference service framework provided by NVIDIA to help you quickly develop AI inference applications. Triton supports various machine learning frameworks serving as backends, including TensorRT, TensorFlow, PyTorch, ONNX, and vLLM. Triton is optimized for real-time inference, batch inference, and audio/video streaming inference to provide improved performance.
For more information about the Triton inference service framework, see Triton Inference Server GitHub repository.
TensorRT-LLM
TensorRT-LLM is an open source engine provided by NVIDIA to optimize LLM inference performance. TensorRT-LLM is used to define LLMs and build TensorRT engines to optimize LLM inference performance on NVIDIA GPUs. TensorRT-LLM can be integrated with Triton to serve as the backend of Triton: TensorRT-LLM Backend. Models built with TensorRT-LLM can run on one or more GPUs and support Tensor Parallelism and Pipeline Parallelism.
For more information about TensorRT-LLM, see the TensorRT-LLM GitHub repository.
Prerequisites
An ACK Pro cluster that contains nodes equipped with A10 GPUs is created. The Kubernetes version of the cluster is 1.22 or later. For more information, see Create an ACK managed cluster.
We recommend that you install a GPU driver whose version is 525. You can add the ack.aliyun.com/nvidia-driver-version:525.105.17 label to GPU-accelerated nodes to specify the GPU driver version as 525.105.17. For more information, see Specify an NVIDIA driver version for nodes by adding a label.
The cloud-native AI suite is installed and the ack-fluid component is deployed.
Important: If you have already installed open source Fluid, uninstall Fluid and then deploy the ack-fluid component.
If you have not installed the cloud-native AI suite, enable Fluid acceleration when you install the suite. For more information, see Deploy the cloud-native AI suite.
If you have already installed the cloud-native AI suite, go to the Cloud-native AI Suite page of the ACK console and deploy the ack-fluid component.
The latest version of the Arena client is installed. For more information, see Configure the Arena client.
Object Storage Service (OSS) is activated and a bucket is created. For more information, see Activate OSS and Create a bucket.
Step 1: Create a Dataset and a JindoRuntime
A Dataset can be used to efficiently organize and process data. A JindoRuntime can further accelerate data access based on a data cache policy. You can use the Dataset and the JindoRuntime together to greatly improve the performance of data processing and model inference services.
Run the following command to create a Secret to store the AccessKey pair used to access the OSS bucket:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: fluid-oss-secret
stringData:
  fs.oss.accessKeyId: <YourAccessKey ID>
  fs.oss.accessKeySecret: <YourAccessKey Secret>
EOF
In the preceding code, the fs.oss.accessKeyId parameter specifies the AccessKey ID and the fs.oss.accessKeySecret parameter specifies the AccessKey secret. For more information about how to obtain an AccessKey pair, see Obtain an AccessKey pair.
Expected output:
secret/fluid-oss-secret created
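Optionally, you can confirm that the Secret exists before the Dataset references it. The following check is a minimal sketch; it only lists the Secret and does not display the AccessKey values.
kubectl get secret fluid-oss-secret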
Create a file named dataset.yaml and copy the following content to the file. The file is used to create a Dataset and a JindoRuntime for data caching. For more information about how to configure a Dataset and a JindoRuntime, see Use JindoFS to accelerate access to OSS.
# Create a Dataset that describes the dataset stored in the OSS bucket and the underlying file system (UFS).
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: qwen2-oss
spec:
  mounts:
    - mountPoint: oss://<oss_bucket>/qwen2-1.5b # Replace the value with the path of the OSS bucket directory where the model file is stored.
      name: qwen2
      path: /
      options:
        fs.oss.endpoint: <oss_endpoint> # Replace the value with the actual endpoint of the OSS bucket.
      encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: fluid-oss-secret
              key: fs.oss.accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: fluid-oss-secret
              key: fs.oss.accessKeySecret
  accessModes:
    - ReadWriteMany
# Create a JindoRuntime to enable JindoFS for data caching in the cluster.
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: qwen2-oss
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM
        volumeType: emptyDir
        path: /dev/shm
        quota: 20Gi
        high: "0.95"
        low: "0.7"
  fuse:
    properties:
      fs.oss.read.buffer.size: "8388608" # 8M
      fs.oss.download.thread.concurrency: "200"
      fs.oss.read.readahead.max.buffer.count: "200"
      fs.oss.read.sequence.ambiguity.range: "2147483647"
    args:
      - -oauto_cache
      - -oattr_timeout=1
      - -oentry_timeout=1
      - -onegative_timeout=1
Run the following command to create the Dataset and the JindoRuntime:
kubectl apply -f dataset.yaml
Expected output:
dataset.data.fluid.io/qwen2-oss created
jindoruntime.data.fluid.io/qwen2-oss created
The output shows that the Dataset and the JindoRuntime are created.
Run the following command to check whether the Dataset is deployed:
kubectl get dataset qwen2-oss
Expected output:
NAME        UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
qwen2-oss   0.00B            0.00B    20.00GiB         0.0%                Bound   57s
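Optionally, you can also check the status of the JindoRuntime that provides the cache. The following command is a minimal sketch; the exact columns in the output depend on the Fluid version that you use, and the runtime is ready when the master and worker phases are Ready.
kubectl get jindoruntime qwen2-oss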
Step 2: Create a Dataflow
When you use TensorRT-LLM to accelerate model inference, you must first download the model file. Then, you need to convert the model file format, build the TensorRT engine, and modify the configuration file. In this example, Fluid Dataflow is used to perform the preceding operations.
Create a file named dataflow.yaml and copy the following content to the file. The file is used to create a Dataflow that consists of the following steps:
Download the Qwen2-1.5B-Instruct model file from ModelScope.
Use TensorRT-LLM to convert the model file format and build the TensorRT engine.
Use a DataLoad to preload the converted model and configuration files into the cache.
# Download the Qwen2-1.5B-Instruct model file from ModelScope and save the file to the specified path.
apiVersion: data.fluid.io/v1alpha1
kind: DataProcess
metadata:
  name: step1-download-model
spec:
  dataset:
    name: qwen2-oss
    namespace: default
    mountPath: /mnt/models/
  processor:
    script:
      image: ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/base
      imageTag: ubuntu22.04
      imagePullPolicy: IfNotPresent
      restartPolicy: OnFailure
      command:
        - bash
      source: |
        #!/bin/bash
        echo "download model..."
        if [ -d "${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct" ]; then
          echo "directory ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct exists, skip download model"
        else
          apt update && apt install -y git git-lfs
          git clone https://www.modelscope.cn/qwen/Qwen2-1.5B-Instruct.git Qwen2-1.5B-Instruct
          mv Qwen2-1.5B-Instruct ${MODEL_MOUNT_PATH}
        fi
      env:
        - name: MODEL_MOUNT_PATH
          value: "/mnt/models"
# Convert the model file format to the format required by TensorRT-LLM and build the TensorRT engine.
---
apiVersion: data.fluid.io/v1alpha1
kind: DataProcess
metadata:
  name: step2-trtllm-convert
spec:
  runAfter:
    kind: DataProcess
    name: step1-download-model
    namespace: default
  dataset:
    name: qwen2-oss
    namespace: default
    mountPath: /mnt/models/
  processor:
    script:
      image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tritonserver-build
      imageTag: 24.07-trtllm-python-py3
      imagePullPolicy: IfNotPresent
      restartPolicy: OnFailure
      command:
        - bash
      source: |
        #!/bin/bash
        set -ex
        cd /tensorrtllm_backend/tensorrt_llm/examples/qwen
        # Convert the downloaded checkpoint to the TensorRT-LLM checkpoint format.
        if [ -d "${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-ckpt" ]; then
          echo "directory ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-ckpt exists, skip convert checkpoint"
        else
          echo "convert checkpoint..."
          python3 convert_checkpoint.py --model_dir ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct --output_dir /root/Qwen2-1.5B-Instruct-ckpt --dtype float16
          echo "Writing trtllm model ckpt to OSS Bucket..."
          mv /root/Qwen2-1.5B-Instruct-ckpt ${MODEL_MOUNT_PATH}
        fi
        sleep 2
        # Build the TensorRT engine from the converted checkpoint.
        if [ -d "${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-engine" ]; then
          echo "directory ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-engine exists, skip build engine"
        else
          echo "build trtllm engine..."
          trtllm-build --checkpoint_dir ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-ckpt \
            --gemm_plugin float16 \
            --paged_kv_cache enable \
            --output_dir /root/Qwen2-1.5B-Instruct-engine
          echo "Writing trtllm engine to OSS Bucket..."
          mv /root/Qwen2-1.5B-Instruct-engine ${MODEL_MOUNT_PATH}
        fi
        # Generate the Triton model repository configuration for the TensorRT-LLM backend.
        if [ -d "${MODEL_MOUNT_PATH}/tensorrtllm_backend" ]; then
          echo "directory ${MODEL_MOUNT_PATH}/tensorrtllm_backend exists, skip config tensorrtllm_backend"
        else
          echo "config model..."
          cd /tensorrtllm_backend
          cp all_models/inflight_batcher_llm/ qwen2_ifb -r
          export QWEN2_MODEL=${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct
          export ENGINE_PATH=${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-engine
          python3 tools/fill_template.py -i qwen2_ifb/preprocessing/config.pbtxt tokenizer_dir:${QWEN2_MODEL},triton_max_batch_size:8,preprocessing_instance_count:1
          python3 tools/fill_template.py -i qwen2_ifb/postprocessing/config.pbtxt tokenizer_dir:${QWEN2_MODEL},triton_max_batch_size:8,postprocessing_instance_count:1
          python3 tools/fill_template.py -i qwen2_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:8,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
          python3 tools/fill_template.py -i qwen2_ifb/ensemble/config.pbtxt triton_max_batch_size:8
          python3 tools/fill_template.py -i qwen2_ifb/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:8,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:1280,max_attention_window_size:1280,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
          echo "Writing trtllm config to OSS Bucket..."
          mkdir -p ${MODEL_MOUNT_PATH}/tensorrtllm_backend
          mv /tensorrtllm_backend/qwen2_ifb ${MODEL_MOUNT_PATH}/tensorrtllm_backend
        fi
      env:
        - name: MODEL_MOUNT_PATH
          value: "/mnt/models"
      resources:
        requests:
          cpu: 2
          memory: 10Gi
          nvidia.com/gpu: 1
        limits:
          cpu: 12
          memory: 30Gi
          nvidia.com/gpu: 1
# Load the converted and optimized model and model configurations to memory to deploy a responsive inference service.
---
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: step3-warmup-cache
spec:
  runAfter:
    kind: DataProcess
    name: step2-trtllm-convert
    namespace: default
  dataset:
    name: qwen2-oss
    namespace: default
  loadMetadata: true
  target:
    - path: /Qwen2-1.5B-Instruct-engine
    - path: /tensorrtllm_backend
The preceding code block orchestrates an automated and scalable model deployment procedure that consists of the following steps: downloading the model file, converting the model file format, optimizing the model, and preloading the model to cache.
Run the following command to create the Dataflow:
kubectl create -f dataflow.yaml
Expected output:
dataprocess.data.fluid.io/step1-download-model created
dataprocess.data.fluid.io/step2-trtllm-convert created
dataload.data.fluid.io/step3-warmup-cache created
The output shows that the custom resource objects defined in the dataflow.yaml file are created.
Run the following command to query the execution progress of the Dataflow. Wait until the execution is complete.
kubectl get dataprocess
Expected output:
NAME                   DATASET     PHASE      AGE   DURATION
step1-download-model   qwen2-oss   Complete   23m   3m2s
step2-trtllm-convert   qwen2-oss   Complete   23m   19m58s
The output shows that the two DataProcess tasks related to the qwen2-oss Dataset are complete. This means that the model file is downloaded and converted to the TensorRT-LLM format.
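Optionally, you can verify that the cache warmup step (the DataLoad in the Dataflow) is also complete and check how much of the engine data is cached. The following commands are a minimal sketch; the cached size and percentage that you see depend on the size of your model files.
kubectl get dataload step3-warmup-cache
kubectl get dataset qwen2-oss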
Step 3: Deploy an inference service
Run the following Arena command to deploy a custom serving job to run an inference service:
The inference service is named qwen2-chat and the service version is v1. The service uses one GPU and runs one replica. In addition, readiness probing is enabled for the service. A model is a special type of data. Therefore, the --data parameter is added to mount the persistent volume claim (PVC) qwen2-oss created by Fluid, which contains the model files, to the /mnt/models directory of the container.
arena serve custom \
  --name=qwen2-chat \
  --version=v1 \
  --gpus=1 \
  --replicas=1 \
  --restful-port=8000 \
  --readiness-probe-action="tcpSocket" \
  --readiness-probe-action-option="port: 8000" \
  --readiness-probe-option="initialDelaySeconds: 30" \
  --readiness-probe-option="periodSeconds: 30" \
  --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tritonserver:24.07-trtllm-python-py3 \
  --data=qwen2-oss:/mnt/models \
  "tritonserver --model-repository=/mnt/models/tensorrtllm_backend/qwen2_ifb --http-port=8000 --grpc-port=8001 --metrics-port=8002 --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_"
The following table describes the parameters in the preceding code block.
Parameter
Description
--name
The name of the inference service.
--version
The version of the inference service.
--gpus
The number of GPUs for each inference service replica.
--replicas
The number of inference service replicas.
--restful-port
The port of the inference service to be exposed.
--readiness-probe-action
The connection type of readiness probes. Valid values: HttpGet, Exec, gRPC, and TCPSocket.
--readiness-probe-action-option
The connection method of readiness probes.
--readiness-probe-option
The readiness probe configuration.
--data
Mount a shared PVC to the runtime environment. The value consists of two parts separated by a colon (:). Specify the name of the PVC on the left side of the colon. You can run the arena data list command to query the existing PVCs in the cluster. Specify the mount path in the container on the right side of the colon. This way, the command that runs in the container can access the data or model through the specified path.
--image
The address of the inference service image.
Expected output:
service/qwen2-chat-v1 created
deployment.apps/qwen2-chat-v1-custom-serving created
INFO[0003] The Job qwen2-chat has been submitted successfully
INFO[0003] You can run `arena serve get qwen2-chat --type custom-serving -n default` to check the job status
The output shows that the inference service is deployed.
Run the following command to query the details of the inference service:
arena serve get qwen2-chat
Expected output:
Name:       qwen2-chat
Namespace:  default
Type:       Custom
Version:    v1
Desired:    1
Available:  1
Age:        1m
Address:    192.XX.XX.XX
Port:       RESTFUL:8000
GPU:        1

Instances:
  NAME                                            STATUS   AGE  READY  RESTARTS  GPU  NODE
  ----                                            ------   ---  -----  --------  ---  ----
  qwen2-chat-v1-custom-serving-657869c698-hl665   Running  1m   1/1    0         1    ap-southeast-1.192.XX.XX.XX
The output shows that a pod (qwen2-chat-v1-custom-serving-657869c698-hl665) is running for the inference service and is ready to provide services.
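If the pod is not in the Running state or is not ready, you can inspect the Triton startup logs to troubleshoot. The following command is a minimal sketch; replace the pod name with the name shown in the preceding output.
kubectl logs qwen2-chat-v1-custom-serving-657869c698-hl665 --tail=100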
Step 4: Verify the inference service
Run the following command to set up port forwarding between the inference service and the local environment.
Important: Port forwarding set up by using kubectl port-forward is not reliable, secure, or extensible in production environments. It is only for development and debugging. Do not use this command to set up port forwarding in production environments. For more information about networking solutions used for production in ACK clusters, see Ingress overview.
kubectl port-forward svc/qwen2-chat-v1 8000:8000
Expected output:
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
Run the following command to send a request to the inference service:
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'
Expected output:
{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":" Machine learning is an AI technology that enables computer systems to learn from data without using specific programs"}
The output shows that the model can generate a response based on the given prompt.
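While the port forwarding is active, you can also use the standard HTTP endpoints of the KServe v2 inference protocol, which Triton implements, to check the service. The following commands are a minimal sketch that assumes the server was started with --http-port=8000 as in the preceding deployment.
# Returns HTTP 200 when the server is ready to serve requests.
curl -i localhost:8000/v2/health/ready
# Returns the metadata of the ensemble model, including its inputs and outputs.
curl localhost:8000/v2/models/ensemble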
(Optional) Step 5: Clear the environment
If you no longer need the model inference service, run the following command to delete the service:
arena serve delete qwen2-chat
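If you also no longer need the cached model data and the Dataflow resources, you can delete them by using the manifests that you created earlier. This is a minimal sketch; deleting the Dataset and the JindoRuntime releases the cache in the cluster but does not delete the model files stored in the OSS bucket.
kubectl delete -f dataflow.yaml
kubectl delete -f dataset.yaml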