By Zibai
This article uses the Llama-2-7b-hf model as an example to describe how to use KServe to deploy the Triton inference framework, with TensorRT-LLM as its backend, on Alibaba Cloud Container Service for Kubernetes (ACK).
KServe is an open source cloud-native model service platform designed to simplify the process of deploying and running machine learning (ML) models in Kubernetes. It supports multiple ML frameworks and provides scaling capabilities. KServe makes it easier to configure and manage model services by defining simple YAML files and providing declarative APIs to deploy models.
For more information about KServe, see the KServe documentation.
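For example, deploying a model with KServe can be as simple as applying a short InferenceService manifest. The following minimal sketch is the scikit-learn sample from the KServe documentation and is shown only to illustrate the declarative style; it is not part of the deployment described in this article:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model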
NVIDIA Triton Inference Server, or Triton for short, is an open source inference serving framework that helps users quickly build AI inference applications. Triton provides backend support for many ML frameworks, including TensorRT, TensorFlow, PyTorch, ONNX, and vLLM. Triton delivers optimized performance for many query types, including real time, batched, and audio or video streaming.
For more information about Triton, see the Triton Inference Server GitHub repository.
NVIDIA TensorRT-LLM is an open source library for optimizing LLM inference. The framework is used to define LLMs and build TensorRT engines that perform inference efficiently on NVIDIA GPUs. TensorRT-LLM can also serve as a backend for Triton through the tensorrtllm_backend integration. Models built with TensorRT-LLM can run on a single GPU or on multiple GPUs, with support for tensor parallelism and pipeline parallelism.
For more information about TensorRT-LLM, see the TensorRT-LLM GitHub repository.
Make sure that the following prerequisites are met:
• An ACK cluster that contains GPU-accelerated nodes is created. For more information, see Create an ACK cluster with GPU-accelerated nodes.
• The GPU-accelerated nodes provide 24 GB of GPU memory or more.
• KServe is installed. For more information, see Install the ack-kserve component.
For more information about the models supported by TensorRT-LLM, see TensorRT-LLM support matrix.
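The steps below assume that the Llama-2-7b-hf model files are already available on the machine from which you upload them to OSS. If you still need to obtain them, one option is to clone the gated meta-llama/Llama-2-7b-hf repository from Hugging Face; this sketch assumes that you have been granted access to the model and have configured Git credentials for Hugging Face:
# Download the Llama-2-7b-hf model files (requires approved access to the gated repository).
git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-7b-hf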
Create a trtllm-llama-2-7b.sh file with the following content:
#!/bin/sh
set -e
# The script is applicable to the nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3 image.
MODEL_MOUNT_PATH=/mnt/models
OUTPUT_DIR=/root/trt-llm
TRT_BACKEND_DIR=/root/tensorrtllm_backend
# clone tensorrtllm_backend
echo "clone tensorrtllm_backend..."
if [ -d "$TRT_BACKEND_DIR" ]; then
echo "directory $TRT_BACKEND_DIR exists, skip clone tensorrtllm_backend"
else
cd /root
git clone -b v0.9.0 https://github.com/triton-inference-server/tensorrtllm_backend.git
cd $TRT_BACKEND_DIR
git submodule update --init --recursive
git lfs install
git lfs pull
fi
# convert checkpoint
if [ -d "$OUTPUT_DIR/llama-2-7b-ckpt" ]; then
    echo "directory $OUTPUT_DIR/llama-2-7b-ckpt exists, skip convert checkpoint"
else
    echo "convert checkpoint..."
    python3 $TRT_BACKEND_DIR/tensorrt_llm/examples/llama/convert_checkpoint.py \
        --model_dir $MODEL_MOUNT_PATH/Llama-2-7b-hf \
        --output_dir $OUTPUT_DIR/llama-2-7b-ckpt \
        --dtype float16
fi
# build trtllm engine
if [ -d "$OUTPUT_DIR/llama-2-7b-engine" ]; then
    echo "directory $OUTPUT_DIR/llama-2-7b-engine exists, skip build trtllm engine"
else
    echo "build trtllm engine..."
    trtllm-build --checkpoint_dir $OUTPUT_DIR/llama-2-7b-ckpt \
        --remove_input_padding enable \
        --gpt_attention_plugin float16 \
        --context_fmha enable \
        --gemm_plugin float16 \
        --output_dir $OUTPUT_DIR/llama-2-7b-engine \
        --paged_kv_cache enable \
        --max_batch_size 8
fi
# config model
echo "config model..."
cd $TRT_BACKEND_DIR
cp all_models/inflight_batcher_llm/ llama_ifb -r
export HF_LLAMA_MODEL=$MODEL_MOUNT_PATH/Llama-2-7b-hf
export ENGINE_PATH=$OUTPUT_DIR/llama-2-7b-engine
python3 tools/fill_template.py -i llama_ifb/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:8,preprocessing_instance_count:1
python3 tools/fill_template.py -i llama_ifb/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:8,postprocessing_instance_count:1
python3 tools/fill_template.py -i llama_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:8,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i llama_ifb/ensemble/config.pbtxt triton_max_batch_size:8
python3 tools/fill_template.py -i llama_ifb/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:8,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:1280,max_attention_window_size:1280,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
# run server
echo "run server..."
pip install SentencePiece
tritonserver --model-repository=$TRT_BACKEND_DIR/llama_ifb --http-port=8080 --grpc-port=9000 --metrics-port=8002 --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_
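Next, upload the model files and the script to an Object Storage Service (OSS) bucket so that they can be mounted into the serving pod later. The commands that follow assume ossutil has already been configured with your AccessKey pair and bucket endpoint; a minimal configuration sketch with placeholder values looks like this:
# Configure ossutil with your endpoint and AccessKey pair (placeholder values).
ossutil config -e oss-cn-hangzhou.aliyuncs.com -i <your-accesskey-id> -k <your-accesskey-secret>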
# Create a directory
ossutil mkdir oss://<your-bucket-name>/Llama-2-7b-hf
# Upload the model file
ossutil cp -r ./Llama-2-7b-hf oss://<your-bucket-name>/Llama-2-7b-hf
# Upload the script file
chmod +x trtllm-llama-2-7b.sh
ossutil cp -r ./trtllm-llama-2-7b.sh oss://<your-bucket-name>/trtllm-llama-2-7b.sh
The resulting directory structure in the OSS bucket is as follows:
tree -L 1
.
├── Llama-2-7b-hf
└── trtllm-llama-2-7b.sh
Run the following command to create a Secret for accessing OSS, together with a persistent volume (PV) and persistent volume claim (PVC) that mount the bucket. Replace the ${your-accesskey-id}, ${your-accesskey-secret}, ${your-bucket-name}, and ${your-bucket-endpoint} variables with your actual values.
kubectl apply -f- << EOF
apiVersion: v1
kind: Secret
metadata:
  name: oss-secret
stringData:
  akId: ${your-accesskey-id} # The AccessKey ID used to access OSS.
  akSecret: ${your-accesskey-secret} # The AccessKey secret used to access OSS.
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-model
  labels:
    alicloud-pvname: llm-model
spec:
  capacity:
    storage: 30Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: model-oss
    nodePublishSecretRef:
      name: oss-secret
      namespace: default
    volumeAttributes:
      bucket: ${your-bucket-name}
      url: ${your-bucket-endpoint} # e.g. oss-cn-hangzhou.aliyuncs.com
      otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
      path: "/"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-model
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 30Gi
  selector:
    matchLabels:
      alicloud-pvname: llm-model
EOF
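Optionally, verify that the PVC has been created and bound before deploying the runtime; the STATUS column of the following command should show Bound:
kubectl get pvc llm-model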
Run the following command to create a ClusterServingRuntime that runs the Triton image:
kubectl apply -f- <<EOF
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: triton-trtllm
spec:
  annotations:
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8002"
  containers:
  - args:
    - tritonserver
    - --model-store=/mnt/models
    - --grpc-port=9000
    - --http-port=8080
    - --allow-grpc=true
    - --allow-http=true
    image: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
    name: kserve-container
    resources:
      requests:
        cpu: "4"
        memory: 12Gi
  protocolVersions:
  - v2
  - grpc-v2
  supportedModelFormats:
  - name: triton
    version: "2"
EOF
Run the following command to deploy the Llama-2-7b model as an InferenceService:
kubectl apply -f- << EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-2-7b
spec:
  predictor:
    model:
      modelFormat:
        name: triton
        version: "2"
      runtime: triton-trtllm
      storageUri: pvc://llm-model/
      name: kserve-container
      resources:
        limits:
          nvidia.com/gpu: "1"
        requests:
          cpu: "4"
          memory: 12Gi
          nvidia.com/gpu: "1"
      command:
      - sh
      - -c
      - /mnt/models/trtllm-llama-2-7b.sh
EOF
Run the following command to check whether the application is ready:
kubectl get isvc llama-2-7b
Expected output:
NAME URL READY PREV LATEST PREVROLLEDOUTREVISION LATESTREADYREVISION AGE
llama-2-7b http://llama-2-7b-default.example.com True 29m
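To send a test request directly to the Triton HTTP port, one option is to forward port 8080 of the predictor pod to your local machine. The label selector below relies on KServe's default pod labels and is an assumption rather than output taken from this deployment; adjust it if your pods are labeled differently. The curl command that follows can then be issued against localhost:8080.
# Forward local port 8080 to the predictor pod (assumes KServe's default serving.kserve.io/inferenceservice label).
kubectl port-forward $(kubectl get pod -l serving.kserve.io/inferenceservice=llama-2-7b -o name | head -n 1) 8080:8080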
curl -X POST localhost:8080/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'
To access the service from within the cluster, use the cluster IP of the NGINX Ingress controller:
NGINX_INGRESS_IP=`kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.spec.clusterIP}'`
# If the service is not deployed in the default namespace, you must modify the namespace name.
SERVICE_HOSTNAME=$(kubectl get inferenceservice llama-2-7b -n default -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
http://$NGINX_INGRESS_IP:80/v2/models/ensemble/generate \
-d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'
To access the service over the Internet, use the public IP of the NGINX Ingress controller:
NGINX_INGRESS_IP=`kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}'`
# If the service is not deployed in the default namespace, you must modify the namespace name.
SERVICE_HOSTNAME=$(kubectl get inferenceservice llama-2-7b -n default -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
http://$NGINX_INGRESS_IP:80/v2/models/ensemble/generate \
-d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'
Expected output:
{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"\nMachine learning is a type of artificial intelligence (AI) that allows software applications to become more accurate"}
Issue: The pod fails to start and the following image pull error is reported:
Failed to pull image "nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3": failed to pull and unpack image "nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3": failed to copy: httpReadSeeker: failed open: failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://authn.nvidia.com/token?scope=repository%3Anvidia%2Ftritonserver%3Apull&service=registry: 401
Cause: Anonymous authentication with the NVIDIA image registry (nvcr.io) failed.
Solution: Manually pull the nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3 image on a machine that can access nvcr.io, push it to your own image repository, and then change the image address in the ClusterServingRuntime to the address of your repository.
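For example, the following sketch pulls the image, re-tags it for a hypothetical Container Registry repository (the registry address and namespace are placeholders), and pushes it there:
docker pull nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
docker tag nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3 registry.cn-hangzhou.aliyuncs.com/<your-namespace>/tritonserver:24.04-trtllm-python-py3
docker push registry.cn-hangzhou.aliyuncs.com/<your-namespace>/tritonserver:24.04-trtllm-python-py3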