This topic explains how to deploy a Qwen2 large language model (LLM) inference service on Alibaba Cloud Container Service for Kubernetes (ACK) using TensorRT-LLM and Triton Inference Server, and how to use Fluid data orchestration and caching to accelerate model loading.
Background information
Qwen2 large language model
Qwen2-1.5B-Instruct is a large language model developed by Tongyi Lab at Alibaba Group. It uses the Transformer architecture and has 1.5 billion parameters. The model is trained on massive, diverse pre-training data, including web text, professional books, and code, and delivers strong natural language understanding and generation capabilities.
The Qwen2 series has these features:
- Supports multiple inference tasks such as question answering, text generation, and code understanding
- Undergoes instruction tuning to better fit real-world use cases
- Has a moderate model size, making it suitable for GPU-based inference deployment
For more model details and technical specifications, see the Qwen2 official GitHub repository.
Triton Inference Server
Triton Inference Server is an open-source inference service framework developed by NVIDIA. It is designed for production environments. Triton supports multiple machine learning frameworks as backends—including TensorRT, TensorFlow, PyTorch, and ONNX Runtime—and unifies model deployment across frameworks.
Triton's key advantages:
- A unified inference interface that simplifies model deployment and management
- Dynamic batching support to improve GPU utilization
- Optimized concurrent processing for high-throughput inference
- Comprehensive monitoring and metrics collection
To learn more about Triton Inference Server, visit the Triton Inference Server official GitHub repository.
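Dynamic batching is the main lever here: Triton groups concurrently arriving requests into a single forward pass, so the fixed per-launch cost is paid once per batch instead of once per request. A toy back-of-the-envelope sketch (the millisecond figures are illustrative assumptions, not Triton measurements):

```python
# Toy illustration (not Triton code) of why dynamic batching raises GPU
# utilization: each forward pass pays a fixed launch overhead, so batching
# amortizes it across requests. The millisecond figures are assumptions.

LAUNCH_OVERHEAD_MS = 5.0  # hypothetical fixed cost per forward pass
PER_REQUEST_MS = 1.0      # hypothetical marginal cost per batched request

def total_time_ms(num_requests: int, batch_size: int) -> float:
    """Total GPU time to serve num_requests at the given batch size."""
    full, remainder = divmod(num_requests, batch_size)
    launches = full + (1 if remainder else 0)
    return launches * LAUNCH_OVERHEAD_MS + num_requests * PER_REQUEST_MS

print(total_time_ms(64, batch_size=1))  # 384.0 (64 separate launches)
print(total_time_ms(64, batch_size=8))  # 104.0 (8 launches)
```

With these assumptions, batching 8 requests together cuts total GPU time by more than 3x for the same workload.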
TensorRT-LLM optimization engine
TensorRT-LLM is an inference engine optimized by NVIDIA for large language models. It compiles LLMs into highly optimized TensorRT execution engines to deliver exceptional inference performance on NVIDIA GPUs.
Key features include the following:
- Model quantization and optimization to significantly boost inference speed
- Support for tensor parallelism and pipeline parallelism
- Deep integration with Triton through the TensorRT-LLM Backend
- Support for multiple precision modes, such as FP16 and INT8, to balance performance and accuracy
For more technical details, see the TensorRT-LLM official GitHub repository.
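To see why the precision mode matters for a model of this size, compare the weight memory at each precision. A quick sketch (weights only; the KV cache and activations require additional GPU memory):

```python
# Back-of-the-envelope weight memory for a 1.5B-parameter model at the
# precision modes mentioned above. Weights only; KV cache and activations
# need extra GPU memory on top of this.

PARAMS = 1.5e9
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1}

def weight_memory_gib(precision: str, params: float = PARAMS) -> float:
    """Approximate weight footprint in GiB at the given precision."""
    return params * BYTES_PER_PARAM[precision] / 1024**3

for precision in ("FP32", "FP16", "INT8"):
    print(f"{precision}: {weight_memory_gib(precision):.2f} GiB")
```

At FP16 the weights fit in roughly 2.8 GiB, which is why a single A10 GPU comfortably serves this model.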
Prerequisites
Before you begin deployment, ensure your environment meets these requirements:
- GPU environment preparation: You must have created an ACK Pro cluster that includes NVIDIA A10 GPUs. The cluster's Kubernetes version must be 1.22 or later. Recommended driver version: Use NVIDIA driver version 525. Specify this version by adding the label ack.aliyun.com/nvidia-driver-version:525.105.17 to your GPU node pool. For detailed steps, see the GPU driver upgrade documentation.
- Fluid component installation: You have installed the Fluid data orchestration and caching system in your cluster. If not, follow the Fluid installation guide.
- Arena tool configuration: You have installed and configured the Arena command-line interface (CLI) tool to deploy and manage model services. For installation steps, see the Arena installation documentation.
- OSS storage setup: You have activated Alibaba Cloud Object Storage Service (OSS) and created a bucket to store model files. For instructions, see the OSS Quick Start and Create a bucket.
- Permission configuration: Your Alibaba Cloud account has read and write permissions for OSS and management permissions for your ACK cluster.
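If you want to script the Kubernetes version prerequisite, a small helper can compare the version string reported by `kubectl version` against the 1.22 minimum. A minimal sketch; the handling of vendor suffixes such as `-aliyun.1` is an assumption about your cluster's version format:

```python
# Hedged helper: check whether a Kubernetes version string (for example,
# the serverVersion from `kubectl version`) meets the 1.22 minimum this
# guide requires. Vendor suffixes like "-aliyun.1" are ignored.

import re

def meets_minimum(version: str, minimum=(1, 22)) -> bool:
    """True if the major.minor part of `version` is >= `minimum`."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if not match:
        raise ValueError(f"unrecognized version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= tuple(minimum)

print(meets_minimum("v1.28.3-aliyun.1"))  # True
print(meets_minimum("v1.21.14"))          # False
```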
Step 1: Configure Fluid dataset and cache
In this step, create a Fluid Dataset and JindoRuntime to manage model data and provide high-performance caching. The Dataset organizes data. The JindoRuntime provides distributed caching to significantly improve model loading and inference performance.
Performance tip: With memory caching, model loading time drops from minutes to seconds.
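The size of that win is easy to estimate. A rough sketch; the bandwidth figures are illustrative assumptions for the comparison, not measurements from this setup:

```python
# Illustrative arithmetic: time to read the model artifacts from remote
# object storage versus a node-local memory cache. Both bandwidth figures
# below are assumptions, not measurements.

ENGINE_SIZE_GB = 3.0  # rough FP16 engine size for a 1.5B-parameter model

def load_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time to read `size_gb` at the given sustained bandwidth."""
    return size_gb / bandwidth_gb_per_s

remote = load_seconds(ENGINE_SIZE_GB, 0.1)   # ~100 MB/s remote object storage
cached = load_seconds(ENGINE_SIZE_GB, 10.0)  # ~10 GB/s node-local memory cache
print(f"remote OSS: {remote:.0f} s, memory cache: {cached:.1f} s")
```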
- Create OSS access credentials
Create a Kubernetes Secret to store your OSS authentication information.

kubectl apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: fluid-oss-secret
stringData:
  fs.oss.accessKeyId: <YourAccessKey ID>
  fs.oss.accessKeySecret: <YourAccessKey Secret>
EOF

Security reminder: Replace <YourAccessKey ID> and <YourAccessKey Secret> with your actual Alibaba Cloud AccessKey pair. To learn how to obtain your AccessKey pair, see the AccessKey management documentation.

On success, you see this output:

secret/fluid-oss-secret created
- Configure Dataset and JindoRuntime
Create a file named dataset.yaml to define the Dataset and JindoRuntime resources. The Dataset describes your data source. The JindoRuntime provides distributed caching.
For full configuration details, see the Fluid configuration documentation.

# Create the Dataset resource and configure the OSS data source.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: qwen2-oss
spec:
  mounts:
  - mountPoint: oss://<oss_bucket>/qwen2-1.5b  # Replace with your actual OSS bucket name.
    name: qwen2
    path: /
    options:
      fs.oss.endpoint: <oss_endpoint>  # Replace with your OSS endpoint.
    encryptOptions:
      - name: fs.oss.accessKeyId
        valueFrom:
          secretKeyRef:
            name: fluid-oss-secret
            key: fs.oss.accessKeyId
      - name: fs.oss.accessKeySecret
        valueFrom:
          secretKeyRef:
            name: fluid-oss-secret
            key: fs.oss.accessKeySecret
  accessModes:
    - ReadWriteMany
---
# Create the JindoRuntime resource and configure the cache policy.
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: qwen2-oss
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM
        volumeType: emptyDir
        path: /dev/shm
        quota: 20Gi
        high: "0.95"
        low: "0.7"
  fuse:
    properties:
      fs.oss.read.buffer.size: "8388608"                # 8 MB read buffer.
      fs.oss.download.thread.concurrency: "200"         # Number of concurrent download threads.
      fs.oss.read.readahead.max.buffer.count: "200"     # Number of read-ahead buffers.
      fs.oss.read.sequence.ambiguity.range: "2147483647" # Sequence read range.
    args:
      - -oauto_cache
      - -oattr_timeout=1
      - -oentry_timeout=1
      - -onegative_timeout=1

Configuration notes:
- mountPoint: Points to the path in OSS where your model files are stored.
- quota: 20Gi: Allocates 20 GB of memory for caching.
- replicas: 2: Deploys two cache instances to improve availability.
- Apply the resource configuration
Apply the configuration file to create the Dataset and JindoRuntime resources:

kubectl apply -f dataset.yaml

On success, you see:

dataset.data.fluid.io/qwen2-oss created
jindoruntime.data.fluid.io/qwen2-oss created

This confirms that the Dataset and JindoRuntime resources were created.
- Verify the deployment status
Check the Dataset deployment status and cache status:

kubectl get dataset qwen2-oss

Expected output:

NAME        UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
qwen2-oss   0.00B            0.00B    20.00GiB         0.0%                Bound   57s

Status notes:
- PHASE: Bound means the Dataset is successfully bound.
- CACHE CAPACITY: 20.00GiB shows the allocated 20 GB memory cache space.
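If you want to check readiness from a script rather than by eye, you can extract the PHASE column from the same command output. A minimal sketch that assumes the column layout shown above (alternatively, `kubectl get dataset qwen2-oss -o jsonpath='{.status.phase}'` should return the phase directly):

```python
# Minimal sketch: extract the PHASE column from `kubectl get dataset`
# output so a script can wait for "Bound". The column positions are
# assumed to match the sample output in this step.

SAMPLE = """\
NAME        UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
qwen2-oss   0.00B            0.00B    20.00GiB         0.0%                Bound   57s
"""

def dataset_phase(output: str, name: str) -> str:
    """Return the PHASE value for the named Dataset row."""
    for line in output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields and fields[0] == name:
            return fields[-2]  # PHASE is the second-to-last column
    raise KeyError(name)

print(dataset_phase(SAMPLE, "qwen2-oss"))  # Bound
```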
Step 2: Build the model inference environment
In this step, use Fluid Dataflow to automate key parts of model deployment: download the Qwen2 model from ModelScope, convert it to TensorRT-LLM format, build the inference engine, and preload cached data. This declarative approach ensures consistent and repeatable deployments.
Dataflow packages complex multi-step operations into automated workflows. This reduces manual work and improves deployment efficiency.
- Create the Dataflow configuration file
Create a file named dataflow.yaml to define a three-step automated workflow:
- Download the Qwen2-1.5B-Instruct base model from ModelScope
- Use the TensorRT-LLM toolchain to convert the model and build the inference engine
- Preload the optimized model data into cache using DataLoad
# Step 1: Download the Qwen2 model from ModelScope.
apiVersion: data.fluid.io/v1alpha1
kind: DataProcess
metadata:
  name: step1-download-model
spec:
  dataset:
    name: qwen2-oss
    namespace: default
    mountPath: /mnt/models/
  processor:
    script:
      image: ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/base:ubuntu22.04
      imagePullPolicy: IfNotPresent
      restartPolicy: OnFailure
      command:
      - bash
      source: |
        #!/bin/bash
        echo "Start downloading the model..."
        if [ -d "${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct" ]; then
          echo "Model directory exists. Skip download."
        else
          echo "Install Git LFS and download the model..."
          apt update && apt install -y git git-lfs
          git clone https://www.modelscope.cn/qwen/Qwen2-1.5B-Instruct.git Qwen2-1.5B-Instruct
          mv Qwen2-1.5B-Instruct ${MODEL_MOUNT_PATH}
          echo "Model download complete."
        fi
      env:
      - name: MODEL_MOUNT_PATH
        value: "/mnt/models"
---
# Step 2: Convert the model and build the TensorRT-LLM engine.
apiVersion: data.fluid.io/v1alpha1
kind: DataProcess
metadata:
  name: step2-trtllm-convert
spec:
  runAfter:
    kind: DataProcess
    name: step1-download-model
    namespace: default
  dataset:
    name: qwen2-oss
    namespace: default
    mountPath: /mnt/models/
  processor:
    script:
      image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tritonserver-build:24.07-trtllm-python-py3
      imagePullPolicy: IfNotPresent
      restartPolicy: OnFailure
      command:
      - bash
      source: |
        #!/bin/bash
        set -ex
        echo "Start model conversion..."
        cd /tensorrtllm_backend/tensorrt_llm/examples/qwen
        # Convert the checkpoint.
        if [ -d "${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-ckpt" ]; then
          echo "Checkpoint exists. Skip conversion."
        else
          echo "Convert the model checkpoint..."
          python3 convert_checkpoint.py \
            --model_dir ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct \
            --output_dir /root/Qwen2-1.5B-Instruct-ckpt \
            --dtype float16
          mv /root/Qwen2-1.5B-Instruct-ckpt ${MODEL_MOUNT_PATH}
          echo "Checkpoint conversion complete."
        fi
        sleep 2
        # Build the TensorRT engine.
        if [ -d "${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-engine" ]; then
          echo "Engine exists. Skip build."
        else
          echo "Build the TensorRT-LLM engine..."
          trtllm-build \
            --checkpoint_dir ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-ckpt \
            --gemm_plugin float16 \
            --paged_kv_cache enable \
            --output_dir /root/Qwen2-1.5B-Instruct-engine
          mv /root/Qwen2-1.5B-Instruct-engine ${MODEL_MOUNT_PATH}
          echo "Engine build complete."
        fi
        # Configure the Triton model.
        if [ -d "${MODEL_MOUNT_PATH}/tensorrtllm_backend" ]; then
          echo "Configuration exists. Skip configuration."
        else
          echo "Configure the Triton model..."
          cd /tensorrtllm_backend
          cp -r all_models/inflight_batcher_llm/ qwen2_ifb
          export QWEN2_MODEL=${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct
          export ENGINE_PATH=${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-engine
          # Generate configuration files for each component.
          python3 tools/fill_template.py -i qwen2_ifb/preprocessing/config.pbtxt \
            tokenizer_dir:${QWEN2_MODEL},triton_max_batch_size:8,preprocessing_instance_count:1
          python3 tools/fill_template.py -i qwen2_ifb/postprocessing/config.pbtxt \
            tokenizer_dir:${QWEN2_MODEL},triton_max_batch_size:8,postprocessing_instance_count:1
          python3 tools/fill_template.py -i qwen2_ifb/tensorrt_llm_bls/config.pbtxt \
            triton_max_batch_size:8,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
          python3 tools/fill_template.py -i qwen2_ifb/ensemble/config.pbtxt \
            triton_max_batch_size:8
          python3 tools/fill_template.py -i qwen2_ifb/tensorrt_llm/config.pbtxt \
            triton_backend:tensorrtllm,triton_max_batch_size:8,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:1280,max_attention_window_size:1280,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
          mkdir -p ${MODEL_MOUNT_PATH}/tensorrtllm_backend
          mv /tensorrtllm_backend/qwen2_ifb ${MODEL_MOUNT_PATH}/tensorrtllm_backend
          echo "Triton configuration complete."
        fi
      env:
      - name: MODEL_MOUNT_PATH
        value: "/mnt/models"
      resources:
        requests:
          cpu: 2
          memory: 10Gi
          nvidia.com/gpu: 1
        limits:
          cpu: 12
          memory: 30Gi
          nvidia.com/gpu: 1
---
# Step 3: Preload cached data.
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: step3-warmup-cache
spec:
  runAfter:
    kind: DataProcess
    name: step2-trtllm-convert
    namespace: default
  dataset:
    name: qwen2-oss
    namespace: default
  loadMetadata: true
  target:
    - path: /Qwen2-1.5B-Instruct-engine
    - path: /tensorrtllm_backend

This Dataflow configuration automates the end-to-end model deployment process, from raw model acquisition to production-ready inference service configuration.
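The `max_tokens_in_paged_kv_cache:1280` value in the tensorrt_llm configuration caps how many tokens the paged KV cache can hold. A rough sizing sketch follows; the layer and head counts in the example call are illustrative assumptions, not values read from the Qwen2 checkpoint:

```python
# Rough KV-cache sizing sketch. Per token, the cache stores one K and one
# V vector per layer: 2 x layers x kv_heads x head_dim x bytes_per_elem.
# The model-shape numbers in the example call are illustrative assumptions.

def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold `tokens` worth of FP16 K/V vectors."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem

# Hypothetical small-model shape, with the 1280-token budget from the config.
size = kv_cache_bytes(tokens=1280, layers=28, kv_heads=2, head_dim=128)
print(f"{size / 1024**2:.0f} MiB")  # 35 MiB
```

The `kv_cache_free_gpu_mem_fraction:0.5` setting then tells TensorRT-LLM how much of the remaining free GPU memory the cache may actually claim.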
- Deploy the Dataflow workflow
Apply the Dataflow configuration file to create the automated workflow:

kubectl create -f dataflow.yaml

On success, you see:

dataprocess.data.fluid.io/step1-download-model created
dataprocess.data.fluid.io/step2-trtllm-convert created
dataload.data.fluid.io/step3-warmup-cache created

This confirms that the custom resources for all three steps are created.
- Monitor execution progress
Track the Dataflow execution status until all steps finish:

kubectl get dataprocess

During execution, statuses change like this:

NAME                   DATASET     PHASE     AGE   DURATION
step1-download-model   qwen2-oss   Running   2m    -
step2-trtllm-convert   qwen2-oss   Pending   0s    -

When complete, you see:

NAME                   DATASET     PHASE      AGE   DURATION
step1-download-model   qwen2-oss   Complete   23m   3m2s
step2-trtllm-convert   qwen2-oss   Complete   20m   19m58s

Status notes:
- Running means the step is executing.
- Complete means the step succeeded.
- Pending means the step is waiting for its predecessor to finish.
The full model preparation process usually takes 20–30 minutes. Actual time depends on network conditions and GPU performance.
Step 3: Deploy the Triton inference service
Use Arena to deploy the Qwen2 inference service optimized with TensorRT-LLM. Triton Server exposes RESTful and gRPC interfaces.
- Deploy the service with Arena
Run this command to deploy a custom inference service:

arena serve custom \
  --name=qwen2-chat \
  --version=v1 \
  --gpus=1 \
  --replicas=1 \
  --restful-port=8000 \
  --readiness-probe-action="tcpSocket" \
  --readiness-probe-action-option="port: 8000" \
  --readiness-probe-option="initialDelaySeconds: 30" \
  --readiness-probe-option="periodSeconds: 30" \
  --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tritonserver:24.07-trtllm-python-py3 \
  --data=qwen2-oss:/mnt/models \
  "tritonserver \
  --model-repository=/mnt/models/tensorrtllm_backend/qwen2_ifb \
  --http-port=8000 \
  --grpc-port=8001 \
  --metrics-port=8002 \
  --disable-auto-complete-config \
  --backend-config=python,shm-region-prefix-name=prefix0_"

Key configuration notes:
- Service name: qwen2-chat. Version: v1
- Resource allocation: 1 GPU, 1 replica
- Port configuration: HTTP port 8000, gRPC port 8001, metrics port 8002
- Data mount: Use the --data flag to mount the Fluid PVC to /mnt/models

On successful deployment, you see:

service/qwen2-chat-v1 created
deployment.apps/qwen2-chat-v1-custom-serving created
INFO[0003] The Job qwen2-chat has been submitted successfully
INFO[0003] You can run `arena serve get qwen2-chat --type custom-serving -n default` to check the job status

This confirms that the inference service is submitted to the Kubernetes cluster.
- Verify the service status
Check the inference service details and runtime status:

arena serve get qwen2-chat

When the service runs normally, you see:

Name:       qwen2-chat
Namespace:  default
Type:       Custom
Version:    v1
Desired:    1
Available:  1
Age:        2m
Address:    192.168.10.15
Port:       RESTFUL:8000
GPU:        1

Instances:
  NAME                                           STATUS   AGE  READY  RESTARTS  GPU  NODE
  ----                                           ------   ---  -----  --------  ---  ----
  qwen2-chat-v1-custom-serving-657869c698-hl665  Running  2m   1/1    0         1    cn-hangzhou.192.168.10.15

Readiness indicators: Available: 1 and READY: 1/1 mean the service is fully ready to accept inference requests.

Service startup usually takes 1-2 minutes. During initialization, the Available field changes from 0 to 1.
Step 4: Test the inference service
Test the inference service functionality and performance using local port forwarding and API calls.
- Set up port forwarding
Create a local port-forwarding channel to test the service:

kubectl port-forward svc/qwen2-chat-v1 8000:8000

The port-forwarding command runs in your current terminal session. Press Ctrl+C to stop it.

On success, you see:

Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000

You can now access the inference service at localhost:8000.
- Send an inference request
Use curl to send an inference request to the model:

curl -X POST localhost:8000/v2/models/ensemble/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text_input": "What is machine learning?",
    "max_tokens": 50,
    "bad_words": "",
    "stop_words": "",
    "pad_id": 2,
    "end_id": 2
  }'

You can expect a response like this:

{
  "context_logits": 0.0,
  "cum_log_probs": 0.0,
  "generation_logits": 0.0,
  "model_name": "ensemble",
  "model_version": "1",
  "output_log_probs": [0.0, 0.0, 0.0, 0.0],
  "sequence_end": false,
  "sequence_id": 0,
  "sequence_start": false,
  "text_output": " Machine learning is an artificial intelligence technique that enables computer systems to learn patterns and rules from data without being explicitly programmed. By analyzing large amounts of data with algorithms, machine learning models can identify complex relationships and make predictions or decisions."
}

Success verification: The text_output field contains a relevant answer generated by the model. This confirms the inference service works correctly.
Testing tips:
- Try different questions to test the model's generalization ability
- Adjust the max_tokens parameter to control output length
- Observe response time and output quality
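If you prefer to script the test instead of using curl, a minimal Python client sketch follows. It builds the same request body as the curl example and assumes the port-forward from the previous step is active on localhost:8000:

```python
# Minimal client sketch for the Triton generate endpoint, mirroring the
# curl example. Assumes `kubectl port-forward` is active on localhost:8000.

import json
import urllib.request

def build_generate_payload(text: str, max_tokens: int = 50) -> dict:
    """Request body with the same shape as the curl example."""
    return {
        "text_input": text,
        "max_tokens": max_tokens,
        "bad_words": "",
        "stop_words": "",
        "pad_id": 2,
        "end_id": 2,
    }

def generate(text: str, max_tokens: int = 50,
             url: str = "http://localhost:8000/v2/models/ensemble/generate") -> str:
    """POST to the ensemble model and return its text_output field."""
    body = json.dumps(build_generate_payload(text, max_tokens)).encode()
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)["text_output"]

# Example (requires the running, port-forwarded service):
#   print(generate("What is machine learning?", max_tokens=50))
```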
(Optional) Step 5: Clean up the environment
When you no longer need the inference service, clean up related resources using these steps:
Important reminder: Cleanup deletes all related resources—including model data and service configurations. Back up important data before you proceed.
- Delete the inference service
Use Arena to delete the deployed service:

arena serve delete qwen2-chat

Confirm successful deletion:

INFO[0001] Deleting service: qwen2-chat
INFO[0002] Service qwen2-chat deleted successfully

- Clean up Fluid resources
Delete the Dataset and JindoRuntime resources:

kubectl delete dataset qwen2-oss
kubectl delete jindoruntime qwen2-oss

- Delete access credentials
Delete the Secret that stores your OSS AccessKey pair:

kubectl delete secret fluid-oss-secret
After cleanup, verify that all resources are removed:

kubectl get all -l app=qwen2-chat