By Zibai
This article uses the Llama-2-7b-hf model as an example to describe how to use KServe to deploy the Triton inference framework, with TensorRT-LLM as its backend, on Alibaba Cloud Container Service for Kubernetes (ACK).
KServe is an open source cloud-native model service platform designed to simplify the process of deploying and running machine learning (ML) models in Kubernetes. It supports multiple ML frameworks and provides scaling capabilities. KServe makes it easier to configure and manage model services by defining simple YAML files and providing declarative APIs to deploy models.
For more information about KServe, see the KServe documentation.
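For example, deploying a model with KServe can be as simple as applying a short InferenceService manifest. The following minimal sketch is the scikit-learn sample from the KServe documentation and is shown only to illustrate the declarative style; it is not part of the deployment described in this article:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model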
NVIDIA Triton Inference Server, or Triton for short, is an open source inference serving framework that helps users quickly build AI inference applications. Triton provides backend support for many ML frameworks, including TensorRT, TensorFlow, PyTorch, ONNX, and vLLM. Triton delivers optimized performance for many query types, including real time, batched, and audio or video streaming.
For more information about Triton, see the Triton Inference Server GitHub repository.
NVIDIA TensorRT-LLM is an open source library for optimizing LLM inference. The framework is used to define LLMs and build TensorRT engines that perform inference efficiently on NVIDIA GPUs. TensorRT-LLM can also serve as a backend for Triton through the tensorrtllm_backend integration. Models built with TensorRT-LLM can run on a single GPU or on multiple GPUs, with support for tensor parallelism and pipeline parallelism.
For more information about TensorRT-LLM, see the TensorRT-LLM GitHub repository.
Make sure that the following prerequisites are met:
• An ACK cluster that contains GPU-accelerated nodes is created. For more information, see Create an ACK cluster with GPU-accelerated nodes.
• The GPU-accelerated nodes provide 24 GB of GPU memory or more.
• KServe is installed. For more information, see Install the ack-kserve component.
For more information about the models supported by TensorRT-LLM, see TensorRT-LLM support matrix.
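The steps below assume that the Llama-2-7b-hf model files are already available on the machine from which you upload them to OSS. If you still need to obtain them, one option is to clone the gated meta-llama/Llama-2-7b-hf repository from Hugging Face; this sketch assumes that you have been granted access to the model and have configured Git credentials for Hugging Face:
# Download the Llama-2-7b-hf model files (requires approved access to the gated repository).
git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-7b-hf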
Create a trtllm-llama-2-7b.sh file with the following content:
#!/bin/sh
set -e
# The script is applicable to the nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3 image.
MODEL_MOUNT_PATH=/mnt/models
OUTPUT_DIR=/root/trt-llm
TRT_BACKEND_DIR=/root/tensorrtllm_backend
# clone tensorrtllm_backend
echo "clone tensorrtllm_backend..."
if [ -d "$TRT_BACKEND_DIR" ]; then
echo "directory $TRT_BACKEND_DIR exists, skip clone tensorrtllm_backend"
else
cd /root
git clone -b v0.9.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
cd $TRT_BACKEND_DIR
git submodule update --init --recursive
git lfs install
git lfs pull
fi
# convert checkpoint
if [ -d "$OUTPUT_DIR/llama-2-7b-ckpt" ]; then
    echo "directory $OUTPUT_DIR/llama-2-7b-ckpt exists, skip convert checkpoint"
else
    echo "convert checkpoint..."
    python3 $TRT_BACKEND_DIR/tensorrt_llm/examples/llama/convert_checkpoint.py \
        --model_dir $MODEL_MOUNT_PATH/Llama-2-7b-hf \
        --output_dir $OUTPUT_DIR/llama-2-7b-ckpt \
        --dtype float16
fi
# build trtllm engine
if [ -d "$OUTPUT_DIR/llama-2-7b-engine" ]; then
    echo "directory $OUTPUT_DIR/llama-2-7b-engine exists, skip build trtllm engine"
else
    echo "build trtllm engine..."
    trtllm-build --checkpoint_dir $OUTPUT_DIR/llama-2-7b-ckpt \
        --remove_input_padding enable \
        --gpt_attention_plugin float16 \
        --context_fmha enable \
        --gemm_plugin float16 \
        --output_dir $OUTPUT_DIR/llama-2-7b-engine \
        --paged_kv_cache enable \
        --max_batch_size 8
fi
# config model
echo "config model..."
cd $TRT_BACKEND_DIR
cp all_models/inflight_batcher_llm/ llama_ifb -r
export HF_LLAMA_MODEL=$MODEL_MOUNT_PATH/Llama-2-7b-hf
export ENGINE_PATH=$OUTPUT_DIR/llama-2-7b-engine
python3 tools/fill_template.py -i llama_ifb/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:8,preprocessing_instance_count:1
python3 tools/fill_template.py -i llama_ifb/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:8,postprocessing_instance_count:1
python3 tools/fill_template.py -i llama_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:8,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i llama_ifb/ensemble/config.pbtxt triton_max_batch_size:8
python3 tools/fill_template.py -i llama_ifb/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:8,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:1280,max_attention_window_size:1280,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
# run server
echo "run server..."
pip install SentencePiece
tritonserver --model-repository=$TRT_BACKEND_DIR/llama_ifb --http-port=8080 --grpc-port=9000 --metrics-port=8002 --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_
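Next, upload the model files and the script to an Object Storage Service (OSS) bucket so that they can be mounted into the serving pod later. The commands that follow assume ossutil has already been configured with your AccessKey pair and bucket endpoint; a minimal configuration sketch with placeholder values looks like this:
# Configure ossutil with your endpoint and AccessKey pair (placeholder values).
ossutil config -e oss-cn-hangzhou.aliyuncs.com -i <your-accesskey-id> -k <your-accesskey-secret>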
# Create a directory
ossutil mkdir oss://<your-bucket-name>/Llama-2-7b-hf
# Upload the model file
ossutil cp -r ./Llama-2-7b-hf oss://<your-bucket-name>/Llama-2-7b-hf
# Upload the script file
chmod +x trtllm-llama-2-7b.sh
ossutil cp -r ./trtllm-llama-2-7b.sh oss://<your-bucket-name>/trtllm-llama-2-7b.sh
The resulting directory structure in the OSS bucket is as follows:
tree -L 1
.
├── Llama-2-7b-hf
└── trtllm-llama-2-7b.sh
Run the following command to create a Secret for accessing OSS, together with a persistent volume (PV) and persistent volume claim (PVC) that mount the bucket. Replace the ${your-accesskey-id}, ${your-accesskey-secret}, ${your-bucket-name}, and ${your-bucket-endpoint} variables with your actual values.
kubectl apply -f- << EOF
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: ${your-accesskey-id} # The AccessKey ID used to access OSS.
  akSecret: ${your-accesskey-secret} # The AccessKey secret used to access OSS.
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: model-oss
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: ${your-bucket-name}
      url: ${your-bucket-endpoint} # e.g. oss-cn-hangzhou.aliyuncs.com
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: "/"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
EOF
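Optionally, verify that the PVC has been created and bound before deploying the runtime; the STATUS column of the following command should show Bound:
kubectl get pvc llm-model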
Run the following command to create a ClusterServingRuntime that runs the Triton image:
kubectl apply -f- <<EOF
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: triton-trtllm
spec:
  annotations:
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8002"
  containers:
  - args:
    - tritonserver
    - --model-store=/mnt/models
    - --grpc-port=9000
    - --http-port=8080
    - --allow-grpc=true
    - --allow-http=true
    image: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
    name: kserve-container
    resources:
      requests:
        cpu: "4"
        memory: 12Gi
  protocolVersions:
  - v2
  - grpc-v2
  supportedModelFormats:
  - name: triton
    version: "2"
EOF
Run the following command to deploy the Llama-2-7b model as an InferenceService:
kubectl apply -f- << EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-2-7b
spec:
  predictor:
    model:
      modelFormat:
        name: triton
        version: "2"
      runtime: triton-trtllm
      storageUri: pvc://llm-model/
      name: kserve-container
      resources:
        limits:
          nvidia.com/gpu: "1"
        requests:
          cpu: "4"
          memory: 12Gi
          nvidia.com/gpu: "1"
      command:
      - sh
      - -c
      - /mnt/models/trtllm-llama-2-7b.sh
EOF
Run the following command to check whether the application is ready:
kubectl get isvc llama-2-7b
Expected output:
NAME URL READY PREV LATEST PREVROLLEDOUTREVISION LATESTREADYREVISION AGE
llama-2-7b http://llama-2-7b-default.example.com True 29m
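To send a test request directly to the Triton HTTP port, one option is to forward port 8080 of the predictor pod to your local machine. The label selector below relies on KServe's default pod labels and is an assumption rather than output taken from this deployment; adjust it if your pods are labeled differently. The curl command that follows can then be issued against localhost:8080.
# Forward local port 8080 to the predictor pod (assumes KServe's default serving.kserve.io/inferenceservice label).
kubectl port-forward $(kubectl get pod -l serving.kserve.io/inferenceservice=llama-2-7b -o name | head -n 1) 8080:8080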
curl -X POST localhost:8080/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'
To access the service from within the cluster, use the cluster IP of the NGINX Ingress controller:
NGINX_INGRESS_IP=`kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.spec.clusterIP}'`
# If the service is not deployed in the default namespace, you must modify the namespace name.
SERVICE_HOSTNAME=$(kubectl get inferenceservice llama-2-7b -n default -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
http://$NGINX_INGRESS_IP:80/v2/models/ensemble/generate \
-d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'
To access the service over the Internet, use the public IP of the NGINX Ingress controller:
NGINX_INGRESS_IP=`kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'`
# If the service is not deployed in the default namespace, you must modify the namespace name.
SERVICE_HOSTNAME=$(kubectl get inferenceservice llama-2-7b -n default -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
http://$NGINX_INGRESS_IP:80/v2/models/ensemble/generate \
-d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'
Expected output:
{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"\nMachine learning is a type of artificial intelligence (AI) that allows software applications to become more accurate"}
Issue: The pod fails to start and the following image pull error is reported:
Failed to pull image "nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3": failed to pull and unpack image "nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3": failed to copy: httpReadSeeker: failed open: failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://authn.nvidia.com/token?scope=repository%3Anvidia%2Ftritonserver%3Apull&service=registry: 401
Cause: Anonymous authentication with the NVIDIA image registry (nvcr.io) failed.
Solution: Manually pull the nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3 image on a machine that can access nvcr.io, push it to your own image repository, and then change the image address in the ClusterServingRuntime to the address of your repository.
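For example, the following sketch pulls the image, re-tags it for a hypothetical Container Registry repository (the registry address and namespace are placeholders), and pushes it there:
docker pull nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
docker tag nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3 registry.cn-hangzhou.aliyuncs.com/<your-namespace>/tritonserver:24.04-trtllm-python-py3
docker push registry.cn-hangzhou.aliyuncs.com/<your-namespace>/tritonserver:24.04-trtllm-python-py3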