
Container Service for Kubernetes: Use vLLM to deploy a Qwen model as an inference service in ACK

Last Updated: Nov 01, 2024

This topic describes how to use the vLLM inference framework to deploy a Qwen model as an inference service in Container Service for Kubernetes (ACK). The Qwen1.5-4B-Chat model is used as an example, and both NVIDIA T4 and NVIDIA A10 GPUs are covered.

Background information

Qwen1.5-4B-Chat

Qwen1.5-4B-Chat is a 4-billion-parameter large language model (LLM) developed by Alibaba Cloud based on the Transformer architecture. The model is trained on a large corpus that covers web text, books from specialized domains, and code. For more information, see the Qwen GitHub repository.

vLLM

vLLM is a high-performance, easy-to-use LLM inference and serving framework. vLLM supports most commonly used LLMs, including Qwen models. It uses techniques such as PagedAttention, continuous batching, and model quantization to greatly improve LLM inference efficiency. For more information about the vLLM framework, see the vLLM GitHub repository.

Prerequisites

  • An ACK Pro cluster that contains GPU-accelerated nodes is created. The Kubernetes version of the cluster is 1.22 or later. Each GPU-accelerated node provides at least 16 GB of GPU memory. For more information, see Create an ACK managed cluster.

    We recommend that you install GPU driver version 525. You can add the ack.aliyun.com/nvidia-driver-version:525.105.17 label to GPU-accelerated nodes to specify GPU driver version 525.105.17. For more information, see Specify an NVIDIA driver version for nodes by adding a label. A quick kubectl check of the node configuration is shown after this list.

  • The latest version of the Arena client is installed. For more information, see Configure the Arena client.
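
As an optional check of the prerequisites, the following commands show the driver-version label (if you set it) and confirm that the GPU-accelerated nodes report the nvidia.com/gpu resource. This is a minimal sketch that only assumes kubectl access to the cluster.

    # Show the driver-version label on each node (the column is empty if the label is not set).
    kubectl get nodes -L ack.aliyun.com/nvidia-driver-version
    # Confirm that the nodes report the nvidia.com/gpu extended resource.
    kubectl describe nodes | grep "nvidia.com/gpu"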

Step 1: Prepare the model data

In this example, the Qwen1.5-4B-Chat model is used to show how to download a Qwen model, upload the model to Object Storage Service (OSS), and create a persistent volume (PV) and persistent volume claim (PVC) in an ACK cluster.

For more information about how to upload a model to File Storage NAS (NAS), see Mount a statically provisioned NAS volume.

  1. Download the model file.

    1. Run the following command to install Git:

      # Run yum install git or apt install git. 
      yum install git
    2. Run the following command to install the Git Large File Support (LFS) plug-in:

      # Run yum install git-lfs or apt install git-lfs. 
      yum install git-lfs
    3. Run the following command to clone the Qwen1.5-4B-Chat repository on ModelScope to the local environment:

      GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git
    4. Run the following command to enter the Qwen1.5-4B-Chat directory and pull large files managed by LFS:

      cd Qwen1.5-4B-Chat
      git lfs pull
  2. Upload the Qwen1.5-4B-Chat model file to OSS.

    1. Log on to the OSS console, and view and record the name of the OSS bucket that you created.

      For more information about how to create an OSS bucket, see Create a bucket.

    2. Install and configure ossutil to manage OSS resources. For more information, see Install ossutil.

    3. Run the following command to create a directory named Qwen1.5-4B-Chat in OSS:

      ossutil mkdir oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
    4. Run the following command to upload the model file to OSS:

      ossutil cp -r ./Qwen1.5-4B-Chat oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
  3. Configure PVs and PVCs in the destination cluster. For more information, see Mount a statically provisioned OSS volume. You can configure the volumes in the console by using the parameters in the following tables, or apply equivalent YAML with kubectl as sketched after this list.

    • The following table describes the parameters of the PV.

      Parameter           Description
      ---------           -----------
      PV Type             OSS
      Volume Name         llm-model
      Access Certificate  Specify the AccessKey ID and AccessKey secret used to access the OSS bucket.
      Bucket ID           Specify the name of the OSS bucket that you created.
      OSS Path            Select the path of the model in the bucket, such as /Qwen1.5-4B-Chat, which is the directory that the model was uploaded to in the previous step.

    • The following table describes the parameters of the PVC.

      Parameter         Description
      ---------         -----------
      PVC Type          OSS
      Volume Name       llm-model
      Allocation Mode   Select Existing Volumes.
      Existing Volumes  Click the Existing Volumes hyperlink and select the PV that you created.
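
Alternatively, you can create the same PV and PVC with kubectl instead of the console. The following is a minimal sketch under these assumptions: a Secret named oss-secret in the default namespace stores the AccessKey pair (keys akId and akSecret), the OSS endpoint placeholder matches your region, the storage size of 30Gi is nominal, and the model resides in the /Qwen1.5-4B-Chat directory of the bucket. Adjust the values to your environment.

    kubectl apply -f - <<'EOF'
    # Statically provisioned OSS volume that exposes the model directory read-only.
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: llm-model
      labels:
        alicloud-pvname: llm-model
    spec:
      capacity:
        storage: 30Gi
      accessModes:
        - ReadOnlyMany
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeHandle: llm-model
        nodePublishSecretRef:
          name: oss-secret            # Assumed Secret that stores akId and akSecret
          namespace: default
        volumeAttributes:
          bucket: "<Your-Bucket-Name>"
          url: "oss-cn-<region>-internal.aliyuncs.com"   # OSS endpoint of your region
          path: "/Qwen1.5-4B-Chat"
    ---
    # PVC named llm-model that binds to the PV above through the label selector.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: llm-model
      namespace: default
    spec:
      storageClassName: ""
      accessModes:
        - ReadOnlyMany
      resources:
        requests:
          storage: 30Gi
      selector:
        matchLabels:
          alicloud-pvname: llm-model
    EOF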

Step 2: Deploy an inference service

Large models place significant demands on GPU memory. To ensure optimal performance, we recommend that you use the high-performance NVIDIA A10 GPU in production environments. For testing purposes, you can use the widely available and cost-effective NVIDIA T4 GPU, but its performance may be considerably lower than that of the NVIDIA A10 GPU.

  1. Run the following command to deploy the Qwen1.5-4B-Chat model as an inference service by using vLLM:

    You can treat the model parameter files as a special type of dataset and use the --data parameter provided by Arena to mount the model to a specific path in the inference service container. In this example, the Qwen1.5-4B-Chat model is mounted to /model/Qwen1.5-4B-Chat. The --max-model-len parameter sets the maximum context length, in tokens, that the service accepts. Increasing this value allows longer interactions, but it consumes more GPU memory. In the NVIDIA T4 command, --dtype half is added because the T4 GPU does not support the bfloat16 data type that vLLM uses for this model by default. For more information about the parameters supported by vLLM, see the vLLM code repository on GitHub.

    Single NVIDIA A10 environment

    arena serve custom \
        --name=vllm-qwen-4b-chat \
        --version=v1 \
        --gpus=1 \
        --replicas=1 \
        --restful-port=8000 \
        --readiness-probe-action="tcpSocket" \
        --readiness-probe-action-option="port: 8000" \
        --readiness-probe-option="initialDelaySeconds: 30" \
        --readiness-probe-option="periodSeconds: 30" \
        --image=ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/vllm:0.4.1-ubuntu22.04 \
        --data=llm-model:/model/Qwen1.5-4B-Chat \
        "python3 -m vllm.entrypoints.openai.api_server --trust-remote-code --model /model/Qwen1.5-4B-Chat/ --gpu-memory-utilization 0.95 --max-model-len 16384"

    Single NVIDIA T4 environment

    arena serve custom \
        --name=vllm-qwen-4b-chat \
        --version=v1 \
        --gpus=1 \
        --replicas=1 \
        --restful-port=8000 \
        --readiness-probe-action="tcpSocket" \
        --readiness-probe-action-option="port: 8000" \
        --readiness-probe-option="initialDelaySeconds: 30" \
        --readiness-probe-option="periodSeconds: 30" \
        --image=ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/vllm:0.4.1-ubuntu22.04 \
        --data=llm-model:/model/Qwen1.5-4B-Chat \
        "python3 -m vllm.entrypoints.openai.api_server --trust-remote-code --model /model/Qwen1.5-4B-Chat/ --gpu-memory-utilization 0.95 --max-model-len 8192 --dtype half"

    The following table describes the parameters.

    Parameter                        Description
    ---------                        -----------
    --name                           The name of the inference service.
    --version                        The version of the inference service.
    --gpus                           The number of GPUs used by each inference service replica.
    --replicas                       The number of inference service replicas.
    --restful-port                   The port on which the inference service is exposed.
    --readiness-probe-action         The connection type of readiness probes. Valid values: HttpGet, Exec, gRPC, and TCPSocket.
    --readiness-probe-action-option  The connection method of readiness probes.
    --readiness-probe-option         The readiness probe configuration.
    --data                           Mount a shared PVC to the runtime environment. The value consists of two parts separated by a colon (:). The left side specifies the name of the PVC; you can run the arena data list command to query the PVCs available in the cluster. The right side specifies the path in the container to which the PVC is mounted, which is also the path from which your script reads the model or training data.
    --image                          The address of the inference service image.

    Expected output:

    service/vllm-qwen-4b-chat-v1 created
    deployment.apps/vllm-qwen-4b-chat-v1-custom-serving created
    INFO[0008] The Job vllm-qwen-4b-chat has been submitted successfully
    INFO[0008] You can run `arena serve get vllm-qwen-4b-chat --type custom-serving -n default` to check the job status

    The output indicates that the Service and Deployment for the vLLM inference service are created and that the job is submitted successfully.

  2. Run the following command to view the details of the inference service:

    arena serve get vllm-qwen-4b-chat

    Expected output:

    Name:       vllm-qwen-4b-chat
    Namespace:  default
    Type:       Custom
    Version:    v1
    Desired:    1
    Available:  1
    Age:        36m
    Address:    172.16.XX.XX
    Port:       RESTFUL:8000
    GPU:        1
    
    Instances:
      NAME                                                  STATUS   AGE  READY  RESTARTS  GPU  NODE
      ----                                                  ------   ---  -----  --------  ---  ----
      vllm-qwen-4b-chat-v1-custom-serving-6d7c786b9f-z6nfk  Running  36m  1/1    0         1    cn-beijing.192.168.XX.XX

    The output indicates that the inference service is running as expected and is ready to provide services. If the service is not ready, you can check its startup logs as shown below.
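
If the service does not become ready, a quick way to troubleshoot is to read the startup logs of the Deployment that Arena created for the service. Its name follows the <service-name>-<version>-custom-serving pattern shown in the output above.

    # Inspect the vLLM startup logs of the serving Deployment in the default namespace.
    kubectl logs deployment/vllm-qwen-4b-chat-v1-custom-serving -n default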

Step 3: Verify the inference service

  1. Run the following command to create a port forwarding rule between the inference service and the local environment:

    Important

    Port forwarding set up by using kubectl port-forward is not reliable, secure, or scalable in production environments. It is intended only for development and debugging. Do not use this command to set up port forwarding in production environments. For more information about networking solutions for production-grade services in ACK clusters, see Ingress overview.

    kubectl port-forward svc/vllm-qwen-4b-chat-v1 8000:8000

    Expected output:

    Forwarding from 127.0.0.1:8000 -> 8000
    Forwarding from [::1]:8000 -> 8000
  2. Run the following command to send an inference request to the inference service:

    curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json"  -d '{"model": "/model/Qwen1.5-4B-Chat/", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'

    Expected output:

    {"id":"cmpl-503270b21fa44db2b6b3c3e0abaa3c02","object":"chat.com pletion","created":1717141209,"model":"/model/Qwen1.5-4B-Chat/","options":[{"index":0,"message":{"role":"assistant","content":"OK. What do you want to test?"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":21,"total_tokens":30,"completion_tokens":9}}

    The output indicates that the model can generate an appropriate response to the given request, which in this example is a simple test message. Additional request examples are shown below.
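
Because the service exposes an OpenAI-compatible API, you can also list the served models or stream a response token by token. The following requests are a minimal sketch against the same port-forwarded address; the prompt text is only an example.

    # List the models served by the endpoint.
    curl http://localhost:8000/v1/models

    # Request a streamed chat completion; tokens are returned as server-sent events.
    curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "/model/Qwen1.5-4B-Chat/", "messages": [{"role": "user", "content": "Introduce yourself briefly."}], "max_tokens": 64, "stream": true}'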

(Optional) Step 4: Delete the environment

If you no longer need the resources, delete them at the earliest opportunity. An optional command for cleaning up the model data in OSS is provided after this list.

  • Run the following command to delete the inference service:

    arena serve delete vllm-qwen-4b-chat
  • Run the following command to delete the PV and PVC:

    kubectl delete pvc llm-model
    kubectl delete pv llm-model
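
Deleting the PV and PVC does not delete the model data that you uploaded to OSS. If you no longer need the data, you can remove it with ossutil. This permanently deletes the objects, so double-check the bucket name and path first.

    # Recursively delete the uploaded model directory from the OSS bucket.
    ossutil rm -r oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat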