This topic describes how to deploy a large language model (LLM) as an inference service on Alibaba Cloud Container Service for Kubernetes (ACK) using vLLM. The deployment uses Fluid for data orchestration and caching to accelerate model loading and deliver an efficient AI inference service.
Introduction
Large language models (LLMs) have become essential for various AI applications, from chatbots to content generation. Efficiently deploying these models for inference requires specialized frameworks that can handle the computational demands and memory requirements of LLMs. vLLM is an open-source LLM inference and serving engine that excels in these areas.
Frameworks overview
vLLM
vLLM is an open-source LLM inference and serving engine that focuses on fast and easy LLM serving. It provides high-throughput and memory-efficient inference for various LLM architectures. vLLM implements PagedAttention, a novel attention algorithm that reduces memory fragmentation and enables serving more concurrent requests.
Key features of vLLM include:
Continuous batching for improved throughput
Efficient memory management with PagedAttention
Support for various LLM architectures (LLaMA, Mistral, Falcon, etc.)
Easy integration with popular frameworks
Production-ready API endpoints
For more technical details, visit the vLLM GitHub repository.
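To illustrate the API that the rest of this topic builds on, the following is a minimal offline-inference sketch in Python. It is not part of the deployment steps: it assumes vLLM is installed locally (pip install vllm) and a GPU is available, and it uses facebook/opt-125m only as a small placeholder model.
# Minimal vLLM offline-inference sketch (assumes `pip install vllm` and a GPU).
# facebook/opt-125m is only a small placeholder; substitute your own model.
from vllm import LLM, SamplingParams

prompts = [
    "What is machine learning?",
    "Explain PagedAttention in one sentence.",
]
# Sampling settings control randomness and output length.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# LLM() loads the model; PagedAttention manages the KV cache internally.
llm = LLM(model="facebook/opt-125m")

# generate() batches all prompts in a single call (continuous batching).
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)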
Triton (Triton Inference Server)
Triton Inference Server is an open-source inference serving framework from NVIDIA that helps you quickly build AI inference applications. Triton supports various machine learning frameworks as its runtime backends, such as TensorRT, TensorFlow, PyTorch, ONNX, and vLLM. Triton is optimized for real-time inference, batch inference, and audio and video stream inference scenarios to achieve high performance.
For more information about the Triton inference serving framework, see the Triton Inference Server GitHub repository.
Prerequisites
GPU environment preparation: An ACK Pro cluster with NVIDIA A10 GPUs has been created. The cluster Kubernetes version must be 1.22 or later.
Use GPU driver version 470.82.01 or later to ensure compatibility with the latest CUDA versions.
Fluid installation: Fluid has been installed in the cluster. Fluid provides distributed data acceleration and caching capabilities. For installation instructions, see Install and configure Fluid.
Arena installation: Arena has been installed in the cluster. Arena is a command-line tool for managing machine learning jobs. For installation instructions, see Install and configure Arena.
Model access permissions: Ensure you have access to the LLM model you want to deploy. This example uses a publicly available model, but you may need to configure authentication for private models.
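Because meta-llama/Llama-2-7b-chat-hf is a gated model, you can verify that your Hugging Face token grants access before starting the deployment. The following Python sketch is an optional check that assumes the huggingface_hub package is installed; the token value is a placeholder.
# Sketch: verify that a Hugging Face token can access a gated model repository.
# Assumes `pip install huggingface_hub`; the token below is a placeholder.
from huggingface_hub import HfApi
from huggingface_hub.utils import HfHubHTTPError

repo_id = "meta-llama/Llama-2-7b-chat-hf"
token = "your-huggingface-token"  # replace with your real token

api = HfApi(token=token)
try:
    info = api.model_info(repo_id)
    print(f"Access OK: {info.id}, last modified {info.lastModified}")
except HfHubHTTPError as err:
    print(f"Cannot access {repo_id}: {err}")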
Step 1: Prepare the model environment
Create a Dataset custom resource to define the model storage location. This example uses Alibaba Cloud Object Storage Service (OSS) as the backend storage.
You can also use other storage types such as NAS or CPFS based on your requirements.
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: vllm-model-oss
spec:
  mounts:
  - mountPoint: oss://your-bucket-name/model-path/
    name: vllm-model
    options:
      fs.oss.endpoint: your-oss-endpoint
      fs.oss.accessKeyId: your-access-key-id
      fs.oss.accessKeySecret: your-access-key-secret
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: vllm-model-oss
spec:
  replicas: 1
  tieredstore:
    levels:
    - mediumtype: MEM
      path: /dev/shm
      quota: 2Gi
      high: "0.95"
      low: "0.7"
Save the preceding code as dataset.yaml, then apply the Dataset configuration:
kubectl apply -f dataset.yaml
Check the Dataset status:
kubectl get dataset vllm-model-oss
Step 2: Build the model inference environment
This step uses Fluid Dataflow to automate key stages of model deployment: downloading the LLM model, preparing the vLLM environment, and warming up the cache. The entire process is implemented through declarative configuration to ensure deployment consistency and reproducibility.
Dataflow encapsulates complex multi-step operations into an automated workflow, reducing manual intervention and improving deployment efficiency.
# Download the model from ModelScope or Hugging Face
---
apiVersion: data.fluid.io/v1alpha1
kind: DataProcess
metadata:
  name: step1-download-model
spec:
  dataset:
    name: vllm-model-oss
    namespace: default
    mountPath: /mnt/models/
  processor:
    script:
      image: python
      imageTag: 3.9-slim
      imagePullPolicy: IfNotPresent
      restartPolicy: OnFailure
      command:
      - bash
      source: |
        #!/bin/bash
        echo "Downloading model..."
        if [ -d "${MODEL_MOUNT_PATH}/llama-2-7b-chat-hf" ]; then
          echo "Directory ${MODEL_MOUNT_PATH}/llama-2-7b-chat-hf exists. Skipping model download."
        else
          pip install huggingface_hub
          # HF_ENDPOINT must be set before huggingface_hub is imported
          export HF_ENDPOINT=https://hf-mirror.com
          python3 -c "
        from huggingface_hub import snapshot_download
        snapshot_download(
            repo_id='meta-llama/Llama-2-7b-chat-hf',
            local_dir='${MODEL_MOUNT_PATH}/llama-2-7b-chat-hf',
            local_dir_use_symlinks=False
        )"
        fi
      env:
      - name: MODEL_MOUNT_PATH
        value: "/mnt/models"
      - name: HF_TOKEN
        value: "your-huggingface-token"
# Prepare the vLLM environment and warm up the model cache
---
apiVersion: data.fluid.io/v1alpha1
kind: DataProcess
metadata:
  name: step2-vllm-setup
spec:
  runAfter:
    kind: DataProcess
    name: step1-download-model
    namespace: default
  dataset:
    name: vllm-model-oss
    namespace: default
    mountPath: /mnt/models/
  processor:
    script:
      image: vllm/vllm-openai
      imageTag: latest
      imagePullPolicy: IfNotPresent
      restartPolicy: OnFailure
      command:
      - bash
      source: |
        #!/bin/bash
        set -ex
        MODEL_PATH="${MODEL_MOUNT_PATH}/llama-2-7b-chat-hf"
        echo "Setting up vLLM environment for model: $MODEL_PATH"
        # Test model loading to warm up the cache
        python3 -c "
        from vllm import LLM, SamplingParams
        # Initialize the model (this will cache it in memory)
        print('Loading model for warm-up...')
        llm = LLM(model='$MODEL_PATH', dtype='float16', max_model_len=2048)
        print('Model loaded successfully!')
        # Perform a simple inference to ensure everything works
        sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=50)
        outputs = llm.generate(['Hello, how are you?'], sampling_params)
        print('Warm-up inference completed successfully!')
        print(f'Response: {outputs[0].outputs[0].text}')
        "
      env:
      - name: MODEL_MOUNT_PATH
        value: "/mnt/models"
      resources:
        requests:
          cpu: 2
          memory: 8Gi
          nvidia.com/gpu: 1
        limits:
          cpu: 8
          memory: 20Gi
          nvidia.com/gpu: 1
# Load the prepared model into memory for a fast-response inference service
---
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: step3-warmup-cache
spec:
  runAfter:
    kind: DataProcess
    name: step2-vllm-setup
    namespace: default
  dataset:
    name: vllm-model-oss
    namespace: default
  loadMetadata: true
  target:
  - path: /llama-2-7b-chat-hf
Save the preceding code as dataflow.yaml.
Apply the Dataflow configuration to create the automated processing workflow:
kubectl create -f dataflow.yaml
Successful execution should show:
dataprocess.data.fluid.io/step1-download-model created
dataprocess.data.fluid.io/step2-vllm-setup created
dataload.data.fluid.io/step3-warmup-cache created
The three steps in the preceding code constitute a complete model deployment flow that automates everything from downloading and preparing the original model to preloading the cache.
Monitor the Dataflow execution progress:
kubectl get dataprocess -w
The workflow is complete when both DataProcess resources show the Complete status. You can check the final DataLoad step with kubectl get dataload step3-warmup-cache.
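If you prefer to monitor the workflow programmatically instead of watching kubectl output, a sketch like the following polls the Fluid custom resources through the Kubernetes Python client. It assumes the kubernetes package is installed and a valid kubeconfig is available; the plural resource names (dataprocesses, dataloads) and the status.phase field follow Fluid's CRD conventions and may need adjusting for your Fluid version.
# Sketch: poll Fluid DataProcess/DataLoad phases with the Kubernetes Python client.
# Assumes `pip install kubernetes` and a kubeconfig with access to the cluster.
import time
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# (plural, name) pairs for the three dataflow steps created above.
steps = [
    ("dataprocesses", "step1-download-model"),
    ("dataprocesses", "step2-vllm-setup"),
    ("dataloads", "step3-warmup-cache"),
]

while True:
    phases = {}
    for plural, name in steps:
        obj = api.get_namespaced_custom_object(
            group="data.fluid.io", version="v1alpha1",
            namespace="default", plural=plural, name=name,
        )
        phases[name] = obj.get("status", {}).get("phase", "Unknown")
    print(phases)
    if all(p == "Complete" for p in phases.values()):
        print("Dataflow finished.")
        break
    time.sleep(30)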
Step 3: Deploy the inference service
Run the following Arena command to deploy a custom serving job.
The service is named llama2-chat with version v1. It requires one GPU, runs one replica, and is configured with a readiness probe. The --data parameter mounts the persistent volume claim (PVC) vllm-model-oss, which was created by Fluid, to the /mnt/models directory in the container.
arena serve custom \
  --name=llama2-chat \
  --version=v1 \
  --gpus=1 \
  --replicas=1 \
  --restful-port=8000 \
  --readiness-probe-action="tcpSocket" \
  --readiness-probe-action-option="port: 8000" \
  --readiness-probe-option="initialDelaySeconds: 30" \
  --readiness-probe-option="periodSeconds: 30" \
  --image=vllm/vllm-openai:latest \
  --data=vllm-model-oss:/mnt/models \
  "python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model /mnt/models/llama-2-7b-chat-hf --dtype float16 --max-model-len 2048"
Successful deployment should show:
service/llama2-chat-v1 created
deployment.apps/llama2-chat-v1-custom-serving created
INFO[0003] The Job llama2-chat has been submitted successfully
INFO[0003] You can run `arena serve get llama2-chat --type custom-serving -n default` to check the job status
This indicates that the inference service has been successfully submitted to the Kubernetes cluster.
Check the detailed information and running status of the inference service:
arena serve get llama2-chat
When the service is running normally, it displays:
Name:       llama2-chat
Namespace:  default
Type:       Custom
Version:    v1
Desired:    1
Available:  1
Age:        2m
Address:    192.168.10.15
Port:       RESTFUL:8000
GPU:        1
Step 4: Test the inference service
Establish port forwarding to access the inference service locally:
Port forwarding will continue running in the current terminal session. To stop it, press Ctrl+C.
kubectl port-forward svc/llama2-chat-v1 8000:8000
After the forwarding is established, it displays:
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
In a new terminal, run the following command to send a model inference request:
curl -X POST localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/mnt/models/llama-2-7b-chat-hf",
    "prompt": "What is machine learning?",
    "max_tokens": 100,
    "temperature": 0.7
  }'
Expected output:
{
  "id": "cmpl-1234567890",
  "object": "text_completion",
  "created": 1700000000,
  "model": "/mnt/models/llama-2-7b-chat-hf",
  "choices": [
    {
      "index": 0,
      "text": " Machine learning is a type of artificial intelligence that enables computer systems to learn from data without being explicitly programmed. It involves algorithms that can identify patterns and make predictions or decisions based on input data.",
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 105,
    "completion_tokens": 100
  }
}
The output indicates that the model can generate a response based on the given input.
(Optional) Step 5: Clean up the environment
If you no longer need the deployed model inference service, run the following command to delete it.
arena serve delete llama2-chat