Container Service for Kubernetes: Use TGI to deploy Qwen inference services in ACK

Last Updated: Nov 01, 2024

This topic uses the Qwen1.5-4B-Chat model and the A10 GPU as an example to demonstrate how to use the Text Generation Inference (TGI) framework of Hugging Face to deploy Qwen inference services in Container Service for Kubernetes (ACK).

Background information

Qwen1.5-4B-Chat

Qwen1.5-4B-Chat is a 4-billion-parameter large language model (LLM) developed by Alibaba Cloud based on the Transformer architecture. The model is trained on an ultra-large amount of data that covers a variety of web text, books from specialized fields, and code. For more information, see the Qwen GitHub repository.

Text Generation Inference (TGI)

TGI is an open source tool provided by Hugging Face for deploying large language models (LLMs) as inference services. It provides a variety of inference acceleration features, such as Flash Attention, Paged Attention, continuous batching, and tensor parallelism. For more information, see the TGI official documentation.
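
Most of these acceleration features are configured through flags of the text-generation-launcher command that starts the service (the same command that is used in Step 2 of this topic). The following sketch shows how the features map to commonly used flags. The flag names and values shown here are assumptions for illustration and can differ across TGI versions, so verify them against text-generation-launcher --help in the image that you use.

    # A minimal sketch of commonly used text-generation-launcher flags (assumed values;
    # verify against `text-generation-launcher --help` in your TGI image).
    #   --model-id                 local path or Hugging Face ID of the model to serve
    #   --num-shard                tensor parallelism: number of GPUs to shard the model across
    #   --max-concurrent-requests  upper bound on requests handled by continuous batching
    #   --max-input-length         maximum number of prompt tokens per request
    #   --max-total-tokens         maximum number of prompt plus generated tokens per request
    text-generation-launcher \
        --model-id /model/Qwen1.5-4B-Chat \
        --num-shard 1 \
        --max-concurrent-requests 128 \
        --max-input-length 2048 \
        --max-total-tokens 4096 \
        --port 8000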

Prerequisites

  • An ACK cluster that contains GPU-accelerated nodes (such as nodes with A10 GPUs) is created.
  • The Arena client is installed in the cluster.
  • An OSS bucket is created.

Step 1: Prepare model data

This section uses the Qwen1.5-4B-Chat model as an example to demonstrate how to download models from and upload models to Object Storage Service (OSS) and how to create persistent volumes (PVs) and persistent volume claims (PVCs) in ACK clusters.

For more information about how to upload a model to File Storage NAS (NAS), see Mount a statically provisioned NAS volume.

    1. Download the model file.

      1. Run the following command to install Git:

        # Run yum install git or apt install git. 
        yum install git
      2. Run the following command to install the Git Large File Support (LFS) plug-in:

        # Run yum install git-lfs or apt install git-lfs. 
        yum install git-lfs
      3. Run the following command to clone the Qwen1.5-4B-Chat repository on ModelScope to the local environment:

        GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-4B-Chat.git
      4. Run the following command to enter the Qwen1.5-4B-Chat directory and pull large files managed by LFS:

        cd Qwen1.5-4B-Chat
        git lfs pull
    2. Upload the Qwen1.5-4B-Chat model file to OSS.

      1. Log on to the OSS console, and view and record the name of the OSS bucket that you created.

        For more information about how to create an OSS bucket, see Create a bucket.

      2. Install and configure ossutil to manage OSS resources. For more information, see Install ossutil.

      3. Run the following command to create a directory named Qwen1.5-4B-Chat in OSS:

        ossutil mkdir oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
      4. Run the following command to upload the model file to OSS:

        ossutil cp -r ./Qwen1.5-4B-Chat oss://<Your-Bucket-Name>/Qwen1.5-4B-Chat
    3. Configure PVs and PVCs in the destination cluster. For more information, see Mount a statically provisioned OSS volume.

      • The following table describes the parameters of the PV.

        Parameter             Description
        PV Type               OSS
        Volume Name           llm-model
        Access Certificate    Specify the AccessKey ID and AccessKey secret used to access the OSS bucket.
        Bucket ID             Specify the name of the OSS bucket that you created.
        OSS Path              Select the path of the model, such as /Qwen1.5-4B-Chat (the path to which
                              you uploaded the model in the previous step).

      • The following table describes the parameters of the PVC.

        Parameter             Description
        PVC Type              OSS
        Volume Name           llm-model
        Allocation Mode       Select Existing Volumes.
        Existing Volumes      Click the Existing Volumes hyperlink and select the PV that you created.
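
      Alternatively, you can create the PV and PVC with kubectl instead of the console. The following manifest is a minimal sketch of a statically provisioned OSS volume that matches the parameters in the preceding tables. The file name llm-model.yaml, the Secret name oss-secret, the capacity, and the endpoint in the url field are assumptions for this example; replace the placeholders with the values of your environment and see Mount a statically provisioned OSS volume for the authoritative parameter reference. Save the content to a file named llm-model.yaml and run kubectl apply -f llm-model.yaml to create the resources.

        # Secret that stores the AccessKey pair used to mount the OSS bucket.
        # The name "oss-secret" is an assumption for this example.
        apiVersion: v1
        kind: Secret
        metadata:
          name: oss-secret
          namespace: default
        stringData:
          akId: <Your-AccessKey-ID>
          akSecret: <Your-AccessKey-Secret>
        ---
        # Statically provisioned OSS volume that exposes the uploaded model directory.
        apiVersion: v1
        kind: PersistentVolume
        metadata:
          name: llm-model
          labels:
            alicloud-pvname: llm-model
        spec:
          capacity:
            storage: 30Gi              # nominal value; OSS does not enforce capacity
          accessModes:
            - ReadOnlyMany
          persistentVolumeReclaimPolicy: Retain
          csi:
            driver: ossplugin.csi.alibabacloud.com
            volumeHandle: llm-model
            nodePublishSecretRef:
              name: oss-secret
              namespace: default
            volumeAttributes:
              bucket: <Your-Bucket-Name>
              url: oss-cn-hangzhou-internal.aliyuncs.com   # replace with the endpoint of your bucket's region
              path: /Qwen1.5-4B-Chat
              otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
        ---
        # PVC that binds to the PV above and is referenced as llm-model in Step 2.
        apiVersion: v1
        kind: PersistentVolumeClaim
        metadata:
          name: llm-model
          namespace: default
        spec:
          accessModes:
            - ReadOnlyMany
          resources:
            requests:
              storage: 30Gi
          selector:
            matchLabels:
              alicloud-pvname: llm-model

      After the resources are created, you can run kubectl get pvc llm-model to confirm that the PVC is in the Bound state.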

Step 2: Deploy an inference service

Important

TGI does not support earlier GPU models, such as V100 and T4. You must deploy the inference service on A10 GPUs or GPUs that use a newer architecture.

  1. Run the following command to use Arena to deploy a custom inference service:

    The inference service is named tgi-qwen-4b-chat and its version is v1. The service uses one GPU, runs one replica, and is configured with readiness probes. Models are considered a special type of data, so the --data parameter is used to mount the model PVC to the /model/Qwen1.5-4B-Chat directory in the container.

    arena serve custom \
        --name=tgi-qwen-4b-chat \
        --version=v1 \
        --gpus=1 \
        --replicas=1 \
        --restful-port=8000 \
        --readiness-probe-action="tcpSocket" \
        --readiness-probe-action-option="port: 8000" \
        --readiness-probe-option="initialDelaySeconds: 30" \
        --readiness-probe-option="periodSeconds: 30" \
        --image=ac2-registry.cn-hangzhou.cr.aliyuncs.com/ac2/text-generation-inference:2.0.2-ubuntu22.04 \
        --data=llm-model:/model/Qwen1.5-4B-Chat \
        "text-generation-launcher --model-id /model/Qwen1.5-4B-Chat --num-shard 1 -p 8000"

    The following table describes the parameters.

    Parameter                          Description
    --name                             The name of the inference service.
    --version                          The version of the inference service.
    --gpus                             The number of GPUs used by each inference service replica.
    --replicas                         The number of inference service replicas.
    --restful-port                     The port over which the inference service is exposed.
    --readiness-probe-action           The connection type of readiness probes. Valid values: HttpGet, Exec, gRPC, and TCPSocket.
    --readiness-probe-action-option    The connection method of readiness probes.
    --readiness-probe-option           The readiness probe configuration.
    --data                             Mount a shared PVC to the runtime environment. The value consists of two parts
                                       separated by a colon (:). The left side of the colon specifies the name of the
                                       PVC. You can run the arena data list command to query the PVCs that are
                                       available in the cluster. The right side of the colon specifies the path to
                                       which the PVC is mounted in the container, which is also the local path from
                                       which your script reads the training data or model.
    --image                            The address of the inference service image.

    Expected output:

    service/tgi-qwen-4b-chat-v1 created
    deployment.apps/tgi-qwen-4b-chat-v1-custom-serving created
    INFO[0001] The Job tgi-qwen-4b-chat has been submitted successfully
    INFO[0001] You can run `arena serve get tgi-qwen-4b-chat --type custom-serving -n default` to check the job status

    The output indicates that the inference service is deployed.

  2. Run the following command to query the details of the inference service:

    arena serve get tgi-qwen-4b-chat

    Expected output:

    Name:       tgi-qwen-4b-chat
    Namespace:  default
    Type:       Custom
    Version:    v1
    Desired:    1
    Available:  1
    Age:        3m
    Address:    172.16.XX.XX
    Port:       RESTFUL:8000
    GPU:        1
    
    Instances:
      NAME                                                 STATUS   AGE  READY  RESTARTS  GPU  NODE
      ----                                                 ------   ---  -----  --------  ---  ----
      tgi-qwen-4b-chat-v1-custom-serving-67b58c9865-m89lq  Running  3m   1/1    0         1    cn-beijing.192.168.XX.XX

    The output indicates that a pod (tgi-qwen-4b-chat-v1-custom-serving-67b58c9865-m89lq) is running for the inference service and is ready to provide services.
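
    Because Arena creates standard Kubernetes resources for the service, you can also inspect them directly with kubectl. The following commands are a sketch based on the resource names in the preceding output; the pod and Deployment names in your cluster may differ.

    # Check the Service and Deployment that were created for the inference service.
    kubectl get service tgi-qwen-4b-chat-v1
    kubectl get deployment tgi-qwen-4b-chat-v1-custom-serving

    # View the TGI startup logs to confirm that the model was loaded from
    # /model/Qwen1.5-4B-Chat and that the server is listening on port 8000.
    kubectl logs deployment/tgi-qwen-4b-chat-v1-custom-serving --tail=50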

Step 3: Verify the inference service

  1. Run the following command to set up port forwarding between the inference service and local environment:

    Important

    Port forwarding set up by using kubectl port-forward is not reliable, secure, or extensible in production environments. It is only for development and debugging. Do not use this command to set up port forwarding in production environments. For more information about networking solutions used for production in ACK clusters, see Ingress overview.

    kubectl port-forward svc/tgi-qwen-4b-chat-v1 8000:8000

    Expected output:

    Forwarding from 127.0.0.1:8000 -> 8000
    Forwarding from [::1]:8000 -> 8000
  2. Run the following command to send a request to the inference service:

    curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json"  -d '{"model": "/model/Qwen1.5-4B-Chat/", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'

    Expected output:

    {"id":"","object":"text_completion","created":1716274541,"model":"/model/Qwen1.5-4B-Chat","system_fingerprint":"2.0.2-sha-6073ece","choices":[{"index":0,"message":{"role":"assistant","content":"OK. What test do you want me to run?"},"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":21,"completion_tokens":10,"total_tokens":31}}

    The output indicates that the model can generate a response based on the given prompt. In this example, the prompt is a test request.
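
    The OpenAI-compatible API of TGI also supports streaming responses. The following request is a sketch that reuses the parameters of the previous request and adds "stream": true, so that tokens are returned as server-sent events instead of a single JSON object.

    curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "/model/Qwen1.5-4B-Chat", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10, "stream": true}'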

(Optional) Step 4: Clear the environment

If you no longer need the resources, clear the environment promptly.

  • Run the following command to delete the inference service:

    arena serve delete tgi-qwen-4b-chat
  • Run the following command to delete the PV and PVC:

    kubectl delete pvc llm-model
    kubectl delete pv llm-model
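  • If you also created the oss-secret Secret from the manifest sketch in Step 1, run the following command to delete it:

    kubectl delete secret oss-secret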