Container Service for Kubernetes:Deploy inference services that share a GPU

Last Updated: Nov 01, 2024

In some scenarios, you may want multiple inference tasks to share the same GPU to improve GPU utilization. This topic uses the Qwen1.5-0.5B-Chat model and a V100 GPU as an example to describe how to use KServe to deploy inference services that share a GPU.

Prerequisites

  • A Container Service for Kubernetes (ACK) cluster that contains GPU-accelerated nodes is created, and shared GPU scheduling is enabled for the cluster.
  • The Arena client is installed.
  • KServe is installed in the cluster.

Step 1: Prepare model data

You can use an OSS bucket or a File Storage NAS (NAS) file system to prepare model data. For more information, see Mount a statically provisioned OSS volume or Mount a statically provisioned NAS volume. In this example, an OSS bucket is used.

  1. Download the model. In this example, the Qwen1.5-0.5B-Chat model is used.

    1. Run the following command to install Git:

      sudo yum install git
    2. Run the following command to install the Git Large File Support (LFS) plug-in:

      sudo yum install git-lfs
    3. Run the following command to clone the Qwen1.5-0.5B-Chat repository on ModelScope to your on-premises machine:

      GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-0.5B-Chat.git
    4. Run the following command to go to the directory of the Qwen1.5-0.5B-Chat repository:

      cd Qwen1.5-0.5B-Chat
    5. Run the following command to download large files managed by LFS to the directory of Qwen1.5-0.5B-Chat:

      git lfs pull
  2. Upload the Qwen1.5-0.5B-Chat files to Object Storage Service (OSS).

    1. Log on to the OSS console and record the name of the OSS bucket that you created.

      For more information about how to create an OSS bucket, see Create a bucket.

    2. Install and configure ossutil. For more information, see Install ossutil.

    3. Run the following command to create a directory named Qwen1.5-0.5B-Chat in OSS:

      ossutil mkdir oss://<Your-Bucket-Name>/Qwen1.5-0.5B-Chat
    4. Run the following command to upload the model files to OSS:

      ossutil cp -r ./Qwen1.5-0.5B-Chat oss://<Your-Bucket-Name>/Qwen1.5-0.5B-Chat
  3. Configure a persistent volume (PV) named llm-model and a persistent volume claim (PVC) named llm-model for the cluster where you want to deploy the inference services. For more information, see Mount a statically provisioned OSS volume. A command-line sketch of the PV and PVC is also provided after this list.

    • The following table describes the parameters of the PV.

      Parameter             Description
      PV Type               OSS
      Volume Name           llm-model
      Access Certificate    Specify the AccessKey ID and the AccessKey secret used to access the OSS bucket.
      Bucket ID             Select the OSS bucket that you created in the previous step.
      OSS Path              Select the path of the model, such as /Qwen1.5-0.5B-Chat.

    • The following table describes the parameters of the PVC.

      Parameter             Description
      PVC Type              OSS
      Volume Name           llm-model
      Allocation Mode       Select Existing Volumes.
      Existing Volumes      Click the Existing Volumes hyperlink and select the PV that you created.
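
If you prefer to create the PV and PVC from the command line instead of the console, the following commands are a minimal sketch. They assume the ossplugin.csi.alibabacloud.com CSI driver, an AccessKey pair stored in a Secret named oss-secret, and the internal OSS endpoint of the bucket's region. Replace the placeholder values with your own before you apply the manifests.

    kubectl apply -f - <<'EOF'
    # Secret that holds the AccessKey pair used to mount the OSS bucket.
    apiVersion: v1
    kind: Secret
    metadata:
      name: oss-secret
      namespace: default
    stringData:
      akId: <Your-AccessKey-ID>
      akSecret: <Your-AccessKey-Secret>
    ---
    # Statically provisioned OSS volume that exposes the model directory.
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: llm-model
      labels:
        alicloud-pvname: llm-model
    spec:
      capacity:
        storage: 30Gi
      accessModes:
        - ReadOnlyMany
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeHandle: llm-model
        nodePublishSecretRef:
          name: oss-secret
          namespace: default
        volumeAttributes:
          bucket: <Your-Bucket-Name>
          url: oss-cn-beijing-internal.aliyuncs.com   # use the endpoint of your bucket's region
          path: /Qwen1.5-0.5B-Chat
          otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
    ---
    # Claim that binds to the llm-model volume by label.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: llm-model
      namespace: default
    spec:
      accessModes:
        - ReadOnlyMany
      resources:
        requests:
          storage: 30Gi
      selector:
        matchLabels:
          alicloud-pvname: llm-model
    EOF

After you apply the manifests, run kubectl get pvc llm-model and confirm that the STATUS column shows Bound.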

Step 2: Deploy inference services

  1. Run the following command to query the GPU resources available in the cluster:

    arena top node

    Make sure that the cluster contains a GPU-accelerated node to run inference services.

  2. Run the following commands to start two Qwen inference services. Each inference service requests 6 GB of GPU memory:

    1. Start the first inference service.

      arena serve kserve \
          --name=qwen1 \
          --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
          --gpumemory=6 \
          --cpu=3 \
          --memory=8Gi \
          --data="llm-model:/mnt/models/Qwen1.5-0.5B-Chat" \
          "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen1.5-0.5B-Chat --dtype=half --max-model-len=4096"

      Expected output:

      inferenceservice.serving.kserve.io/qwen1 created
      INFO[0003] The Job qwen1 has been submitted successfully 
      INFO[0003] You can run `arena serve get qwen1 --type kserve -n default` to check the job status 

      The output indicates that the inference service is deployed.

    2. Start the second inference service.

      arena serve kserve \
          --name=qwen2 \
          --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
          --gpumemory=6 \
          --cpu=3 \
          --memory=8Gi \
          --data="llm-model:/mnt/models/Qwen1.5-0.5B-Chat" \
          "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen1.5-0.5B-Chat --dtype=half --max-model-len=4096"

      Expected output:

      inferenceservice.serving.kserve.io/qwen2 created
      INFO[0001] The Job qwen2 has been submitted successfully 
      INFO[0001] You can run `arena serve get qwen2 --type kserve -n default` to check the job status 

      The output indicates that the inference service is deployed.

    Configure the parameters described in the following table.

    Parameter     Required   Description
    --name        Yes        The name of the inference service. The name must be globally unique.
    --image       Yes        The address of the inference service image.
    --gpumemory   No         The amount of GPU memory requested by the inference service. To allow
                             multiple inference services to share one GPU, make sure that the total
                             amount of GPU memory requested by all of the services does not exceed
                             the memory capacity of the GPU. For example, if the GPU provides 8 GB
                             of memory and the first inference service requests 3 GB
                             (--gpumemory=3), 5 GB remains. If a second inference service on the
                             same GPU requests 4 GB (--gpumemory=4), the two services request 7 GB
                             in total, which is less than the 8 GB capacity, so the two services
                             can share the GPU.
    --cpu         No         The number of vCPUs requested by the inference service.
    --memory      No         The amount of memory requested by the inference service.
    --data        No         The address of the model used by the inference service. In this
                             example, the PV named llm-model is mounted to the /mnt/models/
                             directory of the container.
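
    To check how the --gpumemory value is expressed on the generated pods, you can inspect the resource limits of a predictor pod. The following command is a sketch: it assumes that KServe labels predictor pods with serving.kserve.io/inferenceservice and that shared GPU scheduling in the cluster exposes GPU memory as the extended resource aliyun.com/gpu-mem (in GiB); the resource name may differ in your environment.

    # Print the resource limits of the first service's predictor pod.
    kubectl get pod -l serving.kserve.io/inferenceservice=qwen1 \
        -o jsonpath='{.items[0].spec.containers[0].resources.limits}{"\n"}'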

Step 3: Verify the inference services

  1. Run the following command to query the status of the inference services:

    kubectl get pod -o wide | grep qwen

    Expected output:

    qwen1-predictor-856568bdcf-5pfdq   1/1     Running   0          7m10s   10.130.XX.XX   cn-beijing.172.16.XX.XX   <none>           <none>
    qwen2-predictor-6b477b587d-dpdnj   1/1     Running   0          4m3s    10.130.XX.XX   cn-beijing.172.16.XX.XX   <none>           <none>

    The output indicates that qwen1 and qwen2 are deployed on the same GPU-accelerated node (cn-beijing.172.16.XX.XX).

  2. Run the following commands to log on to the pods where the inference services are deployed and view the amount of GPU memory allocated to the pods:

    kubectl exec -it qwen1-predictor-856568bdcf-5pfdq -- nvidia-smi # Log on to the pod where the first inference service is deployed. 
    kubectl exec -it qwen2-predictor-6b477b587d-dpdnj -- nvidia-smi # Log on to the pod where the second inference service is deployed.

    Expected output:

    • The GPU memory allocated to the first inference service

      Fri Jun 28 06:20:43 2024       
      +---------------------------------------------------------------------------------------+
      | NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
      |-----------------------------------------+----------------------+----------------------+
      | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
      |                                         |                      |               MIG M. |
      |=========================================+======================+======================|
      |   0  Tesla V100-SXM2-16GB           On  | 00000000:00:07.0 Off |                    0 |
      | N/A   39C    P0              53W / 300W |   5382MiB /  6144MiB |      0%      Default |
      |                                         |                      |                  N/A |
      +-----------------------------------------+----------------------+----------------------+
                                                                                               
      +---------------------------------------------------------------------------------------+
      | Processes:                                                                            |
      |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
      |        ID   ID                                                             Usage      |
      |=======================================================================================|
      +---------------------------------------------------------------------------------------+
    • The GPU memory allocated to the second inference service

      Fri Jun 28 06:40:17 2024       
      +---------------------------------------------------------------------------------------+
      | NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
      |-----------------------------------------+----------------------+----------------------+
      | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
      |                                         |                      |               MIG M. |
      |=========================================+======================+======================|
      |   0  Tesla V100-SXM2-16GB           On  | 00000000:00:07.0 Off |                    0 |
      | N/A   39C    P0              53W / 300W |   5382MiB /  6144MiB |      0%      Default |
      |                                         |                      |                  N/A |
      +-----------------------------------------+----------------------+----------------------+
                                                                                               
      +---------------------------------------------------------------------------------------+
      | Processes:                                                                            |
      |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
      |        ID   ID                                                             Usage      |
      |=======================================================================================|
      +---------------------------------------------------------------------------------------+

    The output shows that each pod can use at most 6 GB of GPU memory. The GPU memory capacity of the node is 16 GB. Therefore, the node has sufficient GPU memory for the pods where the two inference services are deployed.

  3. Run the following command to access one of the inference services by using the IP address of the NGINX Ingress:

    # Obtain the IP address of the NGINX Ingress. 
    NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
    # Obtain the hostname of the inference service. 
    SERVICE_HOSTNAME=$(kubectl get inferenceservice qwen1 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    # Send a request to the inference service. 
    curl -H "Host: $SERVICE_HOSTNAME" -H "Content-Type: application/json" http://$NGINX_INGRESS_IP:80/v1/chat/completions -d '{"model": "qwen", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 10, ": 0.7, "top_p": 0.9, "seed": 10}' 

    Expected output:

    {"id":"cmpl-bbca59499ab244e1aabfe2c354bf6ad5","object":"chat.com pletion","created":1719303373,"model":"qwen","options":[{"index":0,"message":{"role":"assistant","content":"OK. What do you want to test?"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":21,"total_tokens":31,"completion_tokens":10}}

    The output indicates that the model can generate a response based on the given prompt. In this example, the prompt is a test request.
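
    You can send the same request to the second inference service to confirm that both services respond while sharing the GPU. The following commands reuse the NGINX_INGRESS_IP variable obtained in the previous step; only the InferenceService name changes.

    # Obtain the hostname of the second inference service. 
    SERVICE_HOSTNAME_2=$(kubectl get inferenceservice qwen2 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    # Send the same test request to the second inference service. 
    curl -H "Host: $SERVICE_HOSTNAME_2" -H "Content-Type: application/json" http://$NGINX_INGRESS_IP:80/v1/chat/completions -d '{"model": "qwen", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'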

(Optional) Step 4: Clear the environment

If you no longer need the resources, clear the environment promptly.

  • Run the following commands to delete the inference services:

    arena serve delete qwen1
    arena serve delete qwen2
  • Run the following commands to delete the PV and the PVC:

    kubectl delete pvc llm-model
    kubectl delete pv llm-model
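
To confirm that the cleanup is complete, you can optionally run the following commands. The arena serve list output should no longer contain qwen1 or qwen2, and kubectl is expected to return NotFound errors for the deleted PV and PVC.

    # List the remaining inference services. 
    arena serve list
    # Check that the PV and PVC are deleted; a NotFound error is expected. 
    kubectl get pv llm-model
    kubectl get pvc llm-model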