
Container Service for Kubernetes: Deploy a vLLM model as an inference service

Last Updated: Feb 09, 2026

This topic describes how to use vLLM to deploy a large language model (LLM) as an inference service on Alibaba Cloud Container Service for Kubernetes (ACK). The deployment uses Fluid for data orchestration and caching to accelerate model loading and streamline the rollout of the inference service.

Introduction

Large language models (LLMs) have become essential for various AI applications, from chatbots to content generation. Efficiently deploying these models for inference requires specialized frameworks that can handle the computational demands and memory requirements of LLMs. vLLM is an open-source LLM inference and serving engine that excels in these areas.

Frameworks overview

vLLM

vLLM is an open-source LLM inference and serving engine that focuses on fast and easy LLM serving. It provides high-throughput and memory-efficient inference for various LLM architectures. vLLM implements PagedAttention, a novel attention algorithm that reduces memory fragmentation and enables serving more concurrent requests.

Key features of vLLM include:

  • Continuous batching for improved throughput

  • Efficient memory management with PagedAttention

  • Support for various LLM architectures (LLaMA, Mistral, Falcon, etc.)

  • Easy integration with popular frameworks

  • Production-ready API endpoints

For more technical details, visit the vLLM GitHub repository.
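
If you want a quick feel for the vLLM Python API before moving on to the cluster deployment, the following minimal sketch uses the same offline inference interface as the warm-up script later in this topic. It assumes vLLM is installed locally (for example, pip install vllm), a GPU is available, and that you have access to the Llama 2 model; replace the model ID with the model you plan to serve.

from vllm import LLM, SamplingParams

# Load the model. vLLM manages the KV cache with PagedAttention,
# allocating memory in pages instead of one contiguous block per request.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", dtype="float16", max_model_len=2048)

# Sampling parameters for the generated completions.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Prompts submitted together are scheduled with continuous batching.
outputs = llm.generate(["What is machine learning?", "Explain PagedAttention briefly."], sampling_params)

for output in outputs:
    print(output.outputs[0].text)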

Triton (Triton Inference Server)

Triton Inference Server is an open-source inference serving framework from NVIDIA that helps you quickly build AI inference applications. Triton supports various machine learning frameworks as its runtime backends, such as TensorRT, TensorFlow, PyTorch, ONNX, and vLLM. Triton is optimized for real-time inference, batch inference, and audio and video stream inference scenarios to achieve high performance.

For more information about the Triton inference serving framework, see the Triton Inference Server GitHub repository.

Prerequisites

  • GPU environment preparation: An ACK Pro cluster with NVIDIA A10 GPUs has been created. The cluster Kubernetes version must be 1.22 or later.

    Use GPU driver version 470.82.01 or later to ensure compatibility with the latest CUDA versions.

  • Fluid installation: Fluid has been installed in the cluster. Fluid provides distributed data acceleration and caching capabilities. For installation instructions, see Install and configure Fluid.

  • Arena installation: Arena has been installed in the cluster. Arena is a command-line tool for managing machine learning jobs. For installation instructions, see Install and configure Arena.

  • Model access permissions: Ensure you have access to the LLM model you want to deploy. This example uses a publicly available model, but you may need to configure authentication for private models.

Step 1: Prepare the model environment

Create a Dataset custom resource to define the model storage location. This example uses Alibaba Cloud Object Storage Service (OSS) as the backend storage.

You can also use other storage types such as NAS or CPFS based on your requirements.

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: vllm-model-oss
spec:
  mounts:
  - mountPoint: oss://your-bucket-name/model-path/
    name: vllm-model
    options:
      fs.oss.endpoint: your-oss-endpoint
      fs.oss.accessKeyId: your-access-key-id
      fs.oss.accessKeySecret: your-access-key-secret
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: vllm-model-oss
spec:
  replicas: 1
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 2Gi
        high: "0.95"
        low: "0.7"

Apply the Dataset configuration:

kubectl apply -f dataset.yaml

Check the Dataset status:

kubectl get dataset vllm-model-oss
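
The Dataset is ready when its phase changes to Bound. If you want to wait for that condition programmatically instead of re-running the command, the following sketch polls the Dataset with the Kubernetes Python client. It assumes the kubernetes package is installed and a kubeconfig for the cluster is available; the namespace and resource name match this example.

import time

from kubernetes import client, config

# Load credentials from the local kubeconfig (use load_incluster_config() inside a pod).
config.load_kube_config()
api = client.CustomObjectsApi()

while True:
    # Read the Fluid Dataset custom resource created in this step.
    dataset = api.get_namespaced_custom_object(
        group="data.fluid.io",
        version="v1alpha1",
        namespace="default",
        plural="datasets",
        name="vllm-model-oss",
    )
    phase = dataset.get("status", {}).get("phase", "Unknown")
    print(f"Dataset phase: {phase}")
    if phase == "Bound":
        break
    time.sleep(10)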

Step 2: Build the model inference environment

This step uses Fluid Dataflow to automate key stages of model deployment: downloading the LLM model, preparing the vLLM environment, and warming up the cache. The entire process is implemented through declarative configuration to ensure deployment consistency and reproducibility.

Dataflow encapsulates complex multi-step operations into an automated workflow, reducing manual intervention and improving deployment efficiency.

# Download the model from Hugging Face
---
apiVersion: data.fluid.io/v1alpha1
kind: DataProcess
metadata:
  name: step1-download-model
spec:
  dataset:
    name: vllm-model-oss
    namespace: default
    mountPath: /mnt/models/
  processor:
    script:
      image: python
      imageTag: 3.9-slim
      imagePullPolicy: IfNotPresent
      restartPolicy: OnFailure
      command:
      - bash
      source: |
        #!/bin/bash
        echo "Downloading model..."
        if [ -d "${MODEL_MOUNT_PATH}/llama-2-7b-chat-hf" ]; then
            echo "Directory ${MODEL_MOUNT_PATH}/llama-2-7b-chat-hf exists. Skipping model download."
        else
            pip install huggingface_hub
            python3 -c "
        import os
        from huggingface_hub import snapshot_download
        os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
        # The Llama 2 repository on Hugging Face is gated; pass the access token explicitly.
        snapshot_download(
            repo_id='meta-llama/Llama-2-7b-chat-hf',
            local_dir='${MODEL_MOUNT_PATH}/llama-2-7b-chat-hf',
            local_dir_use_symlinks=False,
            token=os.environ.get('HF_TOKEN')
        )"
        fi
      env:
      - name: MODEL_MOUNT_PATH
        value: "/mnt/models"
      - name: HF_TOKEN
        value: "your-huggingface-token"
# Prepare the vLLM environment and warm up the model cache
---
apiVersion: data.fluid.io/v1alpha1
kind: DataProcess
metadata:
  name: step2-vllm-setup
spec:
  runAfter:
    kind: DataProcess
    name: step1-download-model
    namespace: default
  dataset:
    name: vllm-model-oss
    namespace: default
    mountPath: /mnt/models/
  processor:
    script:
      image: vllm/vllm-openai
      imageTag: latest
      imagePullPolicy: IfNotPresent
      restartPolicy: OnFailure
      command:
      - bash
      source: |
        #!/bin/bash
        set -ex
        
        MODEL_PATH="${MODEL_MOUNT_PATH}/llama-2-7b-chat-hf"
        echo "Setting up vLLM environment for model: $MODEL_PATH"
        
        # Test model loading to warm up cache
        python3 -c "
        import torch
        from vllm import LLM, SamplingParams

        # Initialize the model (this loads and caches the weights)
        print('Loading model for warm-up...')
        llm = LLM(model='$MODEL_PATH', dtype='float16', max_model_len=2048)
        print('Model loaded successfully!')

        # Perform a simple inference to ensure everything works
        sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=50)
        outputs = llm.generate(['Hello, how are you?'], sampling_params)
        print('Warm-up inference completed successfully!')
        print(f'Response: {outputs[0].outputs[0].text}')
        "
      env:
      - name: MODEL_MOUNT_PATH
        value: "/mnt/models"
      resources:
        requests:
          cpu: 2
          memory: 8Gi
          nvidia.com/gpu: 1
        limits:
          cpu: 8
          memory: 20Gi
          nvidia.com/gpu: 1
# Preload the prepared model into the Fluid cache so that the inference service can load it quickly
---
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: step3-warmup-cache
spec:
  runAfter:
    kind: DataProcess
    name: step2-vllm-setup
    namespace: default
  dataset:
    name: vllm-model-oss
    namespace: default
  loadMetadata: true
  target:
  - path: /llama-2-7b-chat-hf

  1. Save the preceding code as dataflow.yaml.

  2. Apply the Dataflow configuration to create the automated processing workflow:

    kubectl create -f dataflow.yaml

    Successful execution should show:

    dataprocess.data.fluid.io/step1-download-model created
    dataprocess.data.fluid.io/step2-vllm-setup created
    dataload.data.fluid.io/step3-warmup-cache created

    The three steps in the preceding code constitute a complete model deployment flow that automates everything from downloading and preparing the original model to preloading the cache.

  3. Monitor the Dataflow execution progress:

    kubectl get dataprocess -w

    The workflow is complete when both DataProcess resources show the Complete status. You can also run kubectl get dataload to confirm that the step3-warmup-cache DataLoad has completed.

Step 3: Deploy the inference service

  1. Run the following Arena command to deploy a custom Serve service.

    The service is named llama2-chat with version v1. It requires one GPU, has one replica, and is configured with a readiness probe. The --data parameter is used to mount the model persistent volume claim (PVC) vllm-model-oss, which was created by Fluid, to the /mnt/models directory in the container.

    arena serve custom \
    --name=llama2-chat \
    --version=v1 \
    --gpus=1 \
    --replicas=1 \
    --restful-port=8000 \
    --readiness-probe-action="tcpSocket" \
    --readiness-probe-action-option="port: 8000" \
    --readiness-probe-option="initialDelaySeconds: 30" \
    --readiness-probe-option="periodSeconds: 30" \
    --image=vllm/vllm-openai:latest \
    --data=vllm-model-oss:/mnt/models \
    "python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model /mnt/models/llama-2-7b-chat-hf --dtype float16 --max-model-len 2048"

    Successful deployment should show:

    service/llama2-chat-v1 created
    deployment.apps/llama2-chat-v1-custom-serving created
    INFO[0003] The Job llama2-chat has been submitted successfully
    INFO[0003] You can run `arena serve get llama2-chat --type custom-serving -n default` to check the job status

    This indicates that the inference service has been successfully submitted to the Kubernetes cluster.

  2. Check the detailed information and running status of the inference service:

    arena serve get llama2-chat

    When the service is running as expected, the following output is displayed:

    Name:       llama2-chat
    Namespace:  default
    Type:       Custom
    Version:    v1
    Desired:    1
    Available:  1
    Age:        2m
    Address:    192.168.10.15
    Port:       RESTFUL:8000
    GPU:        1

Step 4: Test the inference service

  1. Establish port forwarding to access the inference service locally:

    Port forwarding will continue running in the current terminal session. To stop it, press Ctrl+C.

    kubectl port-forward svc/llama2-chat-v1 8000:8000

    After the port forwarding is established, the following output is displayed:

    Forwarding from 127.0.0.1:8000 -> 8000
    Forwarding from [::1]:8000 -> 8000

  2. Run the following command to send a model inference request.

    curl -X POST localhost:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "/mnt/models/llama-2-7b-chat-hf",
        "prompt": "What is machine learning?",
        "max_tokens": 100,
        "temperature": 0.7
      }'

    Expected output:

    {
      "id": "cmpl-1234567890",
      "object": "text_completion",
      "created": 1700000000,
      "model": "/mnt/models/llama-2-7b-chat-hf",
      "choices": [
        {
          "index": 0,
          "text": " Machine learning is a type of artificial intelligence that enables computer systems to learn from data without being explicitly programmed. It involves algorithms that can identify patterns and make predictions or decisions based on input data.",
          "logprobs": null,
          "finish_reason": "length"
        }
      ],
      "usage": {
        "prompt_tokens": 5,
        "total_tokens": 105,
        "completion_tokens": 100
      }
    }

    The output indicates that the model can generate a response based on the given input.
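
    Because vLLM exposes an OpenAI-compatible API, you can also call the service with the openai Python SDK instead of curl. The following sketch assumes the openai package (1.x) is installed and that the port forwarding from the previous step is still active; the API key is a placeholder because the server in this example is not started with one.

    from openai import OpenAI

    # Point the client at the forwarded vLLM endpoint; the key is a placeholder.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    completion = client.completions.create(
        model="/mnt/models/llama-2-7b-chat-hf",
        prompt="What is machine learning?",
        max_tokens=100,
        temperature=0.7,
    )
    print(completion.choices[0].text)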

(Optional) Step 5: Clean up the environment

If you no longer need the deployed model inference service, run the following command to delete it.

arena serve delete llama2-chat