Container Service for Kubernetes: Use TensorRT-LLM to deploy a Qwen2 model as an inference service

Last Updated: Sep 03, 2024

This topic uses the Qwen2-1.5B-Instruct model and A10 GPUs as an example to demonstrate how to use Triton and TensorRT-LLM to deploy a Qwen2 model as an inference service in Container Service for Kubernetes (ACK). In this example, Fluid Dataflow is used to prepare data during model deployment and Fluid is used to accelerate model loading.

Background information

Qwen2-1.5B-Instruct

Qwen2-1.5B-Instruct is a 1.5-billion-parameter large language model (LLM) developed by Alibaba Cloud based on the Transformer architecture. The model is trained on an ultra-large corpus that covers a wide variety of web text, books from specialized domains, and code.

For more information, see Qwen2 GitHub repository.

Triton (Triton Inference Server)

Triton (Triton Inference Server) is an open source inference service framework provided by NVIDIA to help you quickly develop AI inference applications. Triton supports various machine learning frameworks serving as backends, including TensorRT, TensorFlow, PyTorch, ONNX, and vLLM. Triton is optimized for real-time inference, batch inference, and audio/video streaming inference to provide improved performance.

For more information about the Triton inference service framework, see Triton Inference Server GitHub repository.

TensorRT-LLM

TensorRT-LLM is an open source engine provided by NVIDIA to optimize LLM inference performance. TensorRT-LLM is used to define LLMs and build TensorRT engines that optimize inference performance on NVIDIA GPUs. TensorRT-LLM can also be integrated with Triton as a backend (the TensorRT-LLM Backend). Models built with TensorRT-LLM can run on one or more GPUs and support tensor parallelism and pipeline parallelism.

For more information about TensorRT-LLM, see TensorRT-LLM GitHub repository.

Prerequisites

  • An ACK Pro cluster that contains nodes equipped with A10 GPUs is created. The Kubernetes version of the cluster is 1.22 or later. For more information, see Create an ACK managed cluster.

    We recommend that you install a GPU driver of version 525. You can add the ack.aliyun.com/nvidia-driver-version:525.105.17 label to GPU-accelerated nodes to specify GPU driver version 525.105.17. For more information, see Specify an NVIDIA driver version for nodes by adding a label. A quick way to check this label on existing nodes is shown after this list.

  • The cloud-native AI suite is installed and the ack-fluid component is deployed.

    Important

    If you have already installed open source Fluid, uninstall Fluid and deploy the ack-fluid component.

    • If you have not installed the cloud-native AI suite, enable Fluid acceleration when you install the suite. For more information, see Deploy the cloud-native AI suite.

    • If you have already installed the cloud-native AI suite, go to the Cloud-native AI Suite page of the ACK console and deploy the ack-fluid component.

  • The latest version of the Arena client is installed. For more information, see Configure the Arena client.

  • Object Storage Service (OSS) is activated and a bucket is created. For more information, see Activate OSS and Create a bucket.
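
To confirm which driver version label is set on your GPU-accelerated nodes, you can list the nodes together with the label as shown below. This is a minimal check; nodes that were added without the ack.aliyun.com/nvidia-driver-version label show an empty value in the corresponding column.

# List nodes and show the value of the driver-version label as an extra column.
kubectl get nodes -L ack.aliyun.com/nvidia-driver-version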

Step 1: Create a Dataset and a JindoRuntime

A Dataset can be used to efficiently organize and process data. A JindoRuntime can further accelerate data access based on a data cache policy. You can use the Dataset and the JindoRuntime together to greatly improve the performance of data processing and model inference services.

  1. Run the following command to create a Secret to store the AccessKey pair used to access the OSS bucket:

    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: Secret
    metadata:
      name: fluid-oss-secret
    stringData:
      fs.oss.accessKeyId: <YourAccessKey ID>
      fs.oss.accessKeySecret: <YourAccessKey Secret>
    EOF

    In the preceding code, the fs.oss.accessKeyId parameter specifies the AccessKey ID and the fs.oss.accessKeySecret parameter specifies the AccessKey secret. For more information about how to obtain an AccessKey pair, see Obtain an AccessKey pair.

    Expected output:

    secret/fluid-oss-secret created
  2. Create a file named dataset.yaml and copy the following content to the file. The file is used to create a Dataset and a JindoRuntime for data caching. For more information about how to configure a Dataset and a JindoRuntime, see Use JindoFS to accelerate access to OSS.

    # Create a Dataset that describes the dataset stored in the OSS bucket and the underlying file system (UFS). 
    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: qwen2-oss
    spec:
      mounts:
      - mountPoint: oss://<oss_bucket>/qwen2-1.5b # Replace the value with the endpoint of the OSS bucket where the model file is stored. 
        name: qwen2
        path: /
        options:
          fs.oss.endpoint: <oss_endpoint> # Replace the value with the actual endpoint of the OSS bucket. 
        encryptOptions:
          - name: fs.oss.accessKeyId
            valueFrom:
              secretKeyRef:
                name: fluid-oss-secret
                key: fs.oss.accessKeyId
          - name: fs.oss.accessKeySecret
            valueFrom:
              secretKeyRef:
                name: fluid-oss-secret
                key: fs.oss.accessKeySecret
      accessModes:
        - ReadWriteMany
    # Create a JindoRuntime to enable JindoFS for data caching in the cluster. 
    ---
    apiVersion: data.fluid.io/v1alpha1
    kind: JindoRuntime
    metadata:
      name: qwen2-oss
    spec:
      replicas: 2
      tieredstore:
        levels:
          - mediumtype: MEM
            volumeType: emptyDir
            path: /dev/shm
            quota: 20Gi
            high: "0.95"
            low: "0.7"
      fuse:
        properties:
          fs.oss.read.buffer.size: "8388608" # 8M
          fs.oss.download.thread.concurrency: "200"
          fs.oss.read.readahead.max.buffer.count: "200"
          fs.oss.read.sequence.ambiguity.range: "2147483647"
        args:
          - -oauto_cache
          - -oattr_timeout=1
          - -oentry_timeout=1
          - -onegative_timeout=1
  3. Run the following command to create the Dataset and the JindoRuntime:

    kubectl apply -f dataset.yaml

    Expected output:

    dataset.data.fluid.io/qwen2-oss created
    jindoruntime.data.fluid.io/qwen2-oss created

    The output shows that the Dataset and the JindoRuntime are created.

  4. Run the following command to check whether the Dataset is deployed:

    kubectl get dataset qwen2-oss

    Expected output:

    NAME        UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
    qwen2-oss   0.00B            0.00B    20.00GiB         0.0%                Bound   57s
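
Fluid also creates a PersistentVolumeClaim (PVC) named qwen2-oss for the Dataset; this PVC is mounted by the inference service in Step 3. The following commands are a minimal additional check, assuming all resources are created in the default namespace:

# Check that the JindoRuntime workers are ready and that the PVC exists.
kubectl get jindoruntime qwen2-oss
kubectl get pvc qwen2-oss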

Step 2: Create a Dataflow

When you use TensorRT-LLM to accelerate model inference, you must first download the model file. Then, you need to convert the model file format, build the TensorRT engine, and modify the configuration file. In this example, Fluid Dataflow is used to perform the preceding operations.

  1. Create a file named dataflow.yaml and copy the following content to the file. The file is used to create a Dataflow that consists of the following steps:

    1. Download the Qwen2-1.5B-Instruct model file from ModelScope.

    2. Use TensorRT-LLM to convert the model file format and build the TensorRT engine.

    3. Use a DataLoad to update the Dataset.

    # Download the Qwen2-1.5B-Instruct model file from ModelScope and save the file to the specified path. 
    apiVersion: data.fluid.io/v1alpha1
    kind: DataProcess
    metadata:
      name: step1-download-model
    spec:
      dataset:
        name: qwen2-oss
        namespace: default
        mountPath: /mnt/models/
      processor:
        script:
          image: ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/base
          imageTag: ubuntu22.04
          imagePullPolicy: IfNotPresent
          restartPolicy: OnFailure
          command:
          - bash
          source: |
            #!/bin/bash
            echo "download model..."
            if [ -d "${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct" ]; then
                echo "directory ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct exists, skip download model"
            else
                apt update && apt install -y git git-lfs
                git clone https://www.modelscope.cn/qwen/Qwen2-1.5B-Instruct.git Qwen2-1.5B-Instruct
                mv Qwen2-1.5B-Instruct ${MODEL_MOUNT_PATH}
            fi
          env:
          - name: MODEL_MOUNT_PATH
            value: "/mnt/models"
    # Convert the model file format to the format required by TensorRT-LLM and build the TensorRT engine. 
    ---
    apiVersion: data.fluid.io/v1alpha1
    kind: DataProcess
    metadata:
      name: step2-trtllm-convert
    spec:
      runAfter:
        kind: DataProcess
        name: step1-download-model
        namespace: default
      dataset:
        name: qwen2-oss
        namespace: default
        mountPath: /mnt/models/
      processor:
        script:
          image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tritonserver-build
          imageTag: 24.07-trtllm-python-py3
          imagePullPolicy: IfNotPresent
          restartPolicy: OnFailure
          command:
          - bash
          source: |
            #!/bin/bash
            set -ex
    
            cd /tensorrtllm_backend/tensorrt_llm/examples/qwen
            if [ -d "${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-ckpt" ]; then
                echo "directory ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-ckpt exists, skip convert checkpoint"
            else
                echo "covert checkpoint..."
                python3 convert_checkpoint.py --model_dir ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct --output_dir /root/Qwen2-1.5B-Instruct-ckpt --dtype float16
    
                echo "Writing trtllm model ckpt to OSS Bucket..."
                mv /root/Qwen2-1.5B-Instruct-ckpt ${MODEL_MOUNT_PATH}
            fi
    
            sleep 2
            
            if [ -d "${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-engine" ]; then
                echo "directory $OUTPUT_DIR/Qwen2-1.5B-Instruct-engine exists, skip build engine"
            else
                echo "build trtllm engine..."
                trtllm-build --checkpoint_dir ${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-ckpt \
                --gemm_plugin float16 \
                --paged_kv_cache enable \
                --output_dir /root/Qwen2-1.5B-Instruct-engine
    
                echo "Writing trtllm engine to OSS Bucket..."
                mv /root/Qwen2-1.5B-Instruct-engine ${MODEL_MOUNT_PATH}
            fi
    
            if [ -d "${MODEL_MOUNT_PATH}/tensorrtllm_backend" ]; then
                echo "directory $OUTPUT_DIR/tensorrtllm_backend exists, skip config tensorrtllm_backend"
            else
                echo "config model..."
                cd /tensorrtllm_backend
                cp all_models/inflight_batcher_llm/ qwen2_ifb -r
                export QWEN2_MODEL=${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct
                export ENGINE_PATH=${MODEL_MOUNT_PATH}/Qwen2-1.5B-Instruct-engine
    
                python3 tools/fill_template.py -i qwen2_ifb/preprocessing/config.pbtxt tokenizer_dir:${QWEN2_MODEL},triton_max_batch_size:8,preprocessing_instance_count:1
                python3 tools/fill_template.py -i qwen2_ifb/postprocessing/config.pbtxt tokenizer_dir:${QWEN2_MODEL},triton_max_batch_size:8,postprocessing_instance_count:1
                python3 tools/fill_template.py -i qwen2_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:8,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
                python3 tools/fill_template.py -i qwen2_ifb/ensemble/config.pbtxt triton_max_batch_size:8
                python3 tools/fill_template.py -i qwen2_ifb/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:8,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:1280,max_attention_window_size:1280,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
    
                echo "Writing trtllm config to OSS Bucket..."
                mkdir -p ${MODEL_MOUNT_PATH}/tensorrtllm_backend
                mv /tensorrtllm_backend/qwen2_ifb ${MODEL_MOUNT_PATH}/tensorrtllm_backend
            fi
          env:
          - name: MODEL_MOUNT_PATH
            value: "/mnt/models"
          resources:
            requests:
              cpu: 2
              memory: 10Gi
              nvidia.com/gpu: 1
            limits:
              cpu: 12
              memory: 30Gi
              nvidia.com/gpu: 1
    # Load the converted and optimized model and model configurations to memory to deploy a responsive inference service. 
    ---
    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
      name: step3-warmup-cache
    spec:
      runAfter:
        kind: DataProcess
        name: step2-trtllm-convert
        namespace: default
      dataset:
        name: qwen2-oss
        namespace: default
      loadMetadata: true
      target:
      - path: /Qwen2-1.5B-Instruct-engine
      - path: /tensorrtllm_backend

    The preceding code block orchestrates an automated and scalable model deployment procedure that consists of the following steps: downloading the model file, converting the model file format, optimizing the model, and preloading the model to cache.

  2. Run the following command to create the Dataflow:

    kubectl create -f dataflow.yaml

    Expected output:

    dataprocess.data.fluid.io/step1-download-model created
    dataprocess.data.fluid.io/step2-trtllm-convert created
    dataload.data.fluid.io/step3-warmup-cache created

    The output shows that the custom resource objects defined in the dataflow.yaml file are created.

  3. Run the following command to query the execution progress of the Dataflow. Wait until the execution is completed.

    kubectl get dataprocess

    Expected output:

    NAME                   DATASET     PHASE      AGE   DURATION
    step1-download-model   qwen2-oss   Complete   23m   3m2s
    step2-trtllm-convert   qwen2-oss   Complete   23m   19m58s

    The output shows that the two DataProcess tasks related to the qwen2-oss Dataset are completed. This means that the model file is downloaded, converted to the TensorRT-LLM checkpoint format, and built into a TensorRT engine.
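
The third step of the Dataflow is a DataLoad that warms up the cache with the built engine and the Triton model repository. You can check its status and the cache usage of the Dataset with the following commands. This is a minimal check; the CACHED PERCENTAGE column of the Dataset increases after the warm-up completes.

# Check the DataLoad status and the cache usage of the Dataset.
kubectl get dataload step3-warmup-cache
kubectl get dataset qwen2-oss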

Step 3: Deploy an inference service

  1. Run the following Arena command to deploy a custom serving job to run an inference service:

    The inference service is named qwen2-chat and its version is v1. The service uses one GPU, runs one replica, and has readiness probing enabled. Because the model is treated as a special type of data, the --data parameter mounts the persistent volume claim (PVC) qwen2-oss created by Fluid to the /mnt/models directory of the container.

    arena serve custom \
    --name=qwen2-chat \
    --version=v1 \
    --gpus=1 \
    --replicas=1 \
    --restful-port=8000 \
    --readiness-probe-action="tcpSocket" \
    --readiness-probe-action-option="port: 8000" \
    --readiness-probe-option="initialDelaySeconds: 30" \
    --readiness-probe-option="periodSeconds: 30" \
    --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/tritonserver:24.07-trtllm-python-py3 \
    --data=qwen2-oss:/mnt/models \
    "tritonserver --model-repository=/mnt/models/tensorrtllm_backend/qwen2_ifb --http-port=8000 --grpc-port=8001 --metrics-port=8002 --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_"

    The following list describes the parameters in the preceding code block.

    • --name: The name of the inference service.

    • --version: The version of the inference service.

    • --gpus: The number of GPUs used by each inference service replica.

    • --replicas: The number of inference service replicas.

    • --restful-port: The port of the inference service to be exposed.

    • --readiness-probe-action: The connection type of readiness probes. Valid values: HttpGet, Exec, gRPC, and TCPSocket.

    • --readiness-probe-action-option: The connection method of readiness probes.

    • --readiness-probe-option: The readiness probe configuration.

    • --data: Mount a shared PVC to the runtime environment. The value consists of two parts separated by a colon (:). The left side specifies the name of the PVC; you can run the arena data list command to query the existing PVCs in the cluster. The right side specifies the path inside the container where the PVC is mounted, which is where your script accesses the data or model stored in the corresponding PV.

    • --image: The address of the inference service image.

    Expected output:

    service/qwen2-chat-v1 created
    deployment.apps/qwen2-chat-v1-custom-serving created
    INFO[0003] The Job qwen2-chat has been submitted successfully
    INFO[0003] You can run `arena serve get qwen2-chat --type custom-serving -n default` to check the job status

    The output shows that the inference service is deployed.

  2. Run the following command to query the details of the inference service:

    arena serve get qwen2-chat

    Expected output:

    Name:       qwen2-chat
    Namespace:  default
    Type:       Custom
    Version:    v1
    Desired:    1
    Available:  1
    Age:        1m
    Address:    192.XX.XX.XX
    Port:       RESTFUL:8000
    GPU:        1
    
    Instances:
      NAME                                           STATUS   AGE  READY  RESTARTS  GPU  NODE
      ----                                           ------   ---  -----  --------  ---  ----
      qwen2-chat-v1-custom-serving-657869c698-hl665  Running  1m   1/1    0         1    ap-southeast-1.192.XX.XX.XX

    The output shows that a pod (qwen2-chat-v1-custom-serving-657869c698-hl665) is running for the inference service and is ready to provide services.
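
You can also inspect the Triton server logs to confirm that the models in the qwen2_ifb repository were loaded successfully. This is a minimal check that assumes the Deployment name qwen2-chat-v1-custom-serving shown in the preceding output and the default namespace:

# Print the most recent Triton server logs, which include the model load summary.
kubectl logs deployment/qwen2-chat-v1-custom-serving --tail=50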

Step 4: Verify the inference service

  1. Run the following command to set up port forwarding between the inference service and the local environment.

    Important

    Port forwarding set up by using kubectl port-forward is not reliable, secure, or scalable, and is intended only for development and debugging. Do not use it to expose services in production environments. For more information about networking solutions used for production in ACK clusters, see Ingress overview.

    kubectl port-forward svc/qwen2-chat-v1 8000:8000

    Expected output:

    Forwarding from 127.0.0.1:8000 -> 8000
    Forwarding from [::1]:8000 -> 8000
  2. Run the following command to send a request to the inference service:

    curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'

    Expected output:

    {"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":" Machine learning is an AI technology that enables computer systems to learn from data without using specific programs"}

    The output shows that the model can generate a response based on the given prompt.
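
In addition to sending inference requests, you can verify that the Triton server is healthy and that the ensemble model is loaded by calling Triton's standard HTTP endpoints. This is a minimal check that assumes the port forwarding from the previous step is still active:

# Check overall server readiness (returns HTTP 200 when ready).
curl -i localhost:8000/v2/health/ready
# Query the configuration of the ensemble model.
curl localhost:8000/v2/models/ensemble/config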

(Optional) Step 5: Clear the environment

If you no longer need the model inference service, run the following command to delete the service:

arena serve delete qwen2-chat
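
If you also want to release the cached data and the Fluid resources created in this topic, you can delete the Dataflow objects, the Dataset, and the JindoRuntime. The following commands are a minimal sketch that assumes the dataflow.yaml and dataset.yaml files created in the preceding steps are still available. The model files written to the OSS bucket are not deleted by these commands.

# Delete the DataProcess and DataLoad objects, then the Dataset and JindoRuntime.
kubectl delete -f dataflow.yaml
kubectl delete -f dataset.yaml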