Container Service for Kubernetes:Deploy inference services that share a GPU

Last Updated: Nov 01, 2024

In some scenarios, you may want multiple inference tasks to share the same GPU to improve GPU utilization. This topic uses the Qwen1.5-0.5B-Chat model and a V100 GPU as an example to describe how to use KServe to deploy inference services that share a GPU.

Prerequisites

  • A Container Service for Kubernetes (ACK) cluster that contains GPU-accelerated nodes is created, and shared GPU scheduling is enabled for the cluster.
  • The Arena client is installed.
  • KServe is installed in the cluster.

Step 1: Prepare model data

You can use an OSS bucket or a File Storage NAS (NAS) file system to prepare model data. For more information, see Mount a statically provisioned OSS volume or Mount a statically provisioned NAS volume. In this example, an OSS bucket is used.

  1. Download the model. In this example, the Qwen1.5-0.5B-Chat model is used.

    1. Run the following command to install Git:

      sudo yum install git
    2. Run the following command to install the Git Large File Support (LFS) plug-in:

      sudo yum install git-lfs
    3. Run the following command to clone the Qwen1.5-0.5B-Chat repository on ModelScope to your on-premises machine:

      GIT_LFS_SKIP_SMUDGE=1 git clone https://www.modelscope.cn/qwen/Qwen1.5-0.5B-Chat.git
    4. Run the following command to go to the directory of the Qwen1.5-0.5B-Chat repository:

      cd Qwen1.5-0.5B-Chat
    5. Run the following command to download large files managed by LFS to the directory of Qwen1.5-0.5B-Chat:

      git lfs pull
  2. Upload the Qwen1.5-0.5B-Chat files to Object Storage Service (OSS).

    1. Log on to the OSS console and record the name of the OSS bucket that you created.

      For more information about how to create an OSS bucket, see Create a bucket.

    2. Install and configure ossutil. For more information, see Install ossutil.

    3. Run the following command to create a directory named Qwen1.5-0.5B-Chat in OSS:

      ossutil mkdir oss://<Your-Bucket-Name>/Qwen1.5-0.5B-Chat
    4. Run the following command to upload the model files to OSS:

      ossutil cp -r ./Qwen1.5-0.5B-Chat oss://<Your-Bucket-Name>/Qwen1.5-0.5B-Chat
  3. Configure a persistent volume (PV) named llm-model and a persistent volume claim (PVC) named llm-model for the cluster where you want to deploy the inference services. For more information, see Mount a statically provisioned OSS volume. A command-line sketch of the PV and PVC is also provided after this list.

    • The following table describes the parameters of the PV.

      Parameter             Description
      PV Type               OSS
      Volume Name           llm-model
      Access Certificate    Specify the AccessKey ID and the AccessKey secret used to access the OSS bucket.
      Bucket ID             Select the OSS bucket that you created in the previous step.
      OSS Path              Select the path of the model, such as /Qwen1.5-0.5B-Chat.

    • The following table describes the parameters of the PVC.

      Parameter             Description
      PVC Type              OSS
      Volume Name           llm-model
      Allocation Mode       Select Existing Volumes.
      Existing Volumes      Click the Existing Volumes hyperlink and select the PV that you created.
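
If you prefer to create the PV and PVC from the command line instead of the console, the following commands are a minimal sketch. They assume the ossplugin.csi.alibabacloud.com CSI driver, an AccessKey pair stored in a Secret named oss-secret, and the internal OSS endpoint of the bucket's region. Replace the placeholder values with your own before you apply the manifests.

    kubectl apply -f - <<'EOF'
    # Secret that holds the AccessKey pair used to mount the OSS bucket.
    apiVersion: v1
    kind: Secret
    metadata:
      name: oss-secret
      namespace: default
    stringData:
      akId: <Your-AccessKey-ID>
      akSecret: <Your-AccessKey-Secret>
    ---
    # Statically provisioned OSS volume that exposes the model directory.
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: llm-model
      labels:
        alicloud-pvname: llm-model
    spec:
      capacity:
        storage: 30Gi
      accessModes:
        - ReadOnlyMany
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeHandle: llm-model
        nodePublishSecretRef:
          name: oss-secret
          namespace: default
        volumeAttributes:
          bucket: <Your-Bucket-Name>
          url: oss-cn-beijing-internal.aliyuncs.com   # use the endpoint of your bucket's region
          path: /Qwen1.5-0.5B-Chat
          otherOpts: "-o umask=022 -o max_stat_cache_size=0 -o allow_other"
    ---
    # Claim that binds to the llm-model volume by label.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: llm-model
      namespace: default
    spec:
      accessModes:
        - ReadOnlyMany
      resources:
        requests:
          storage: 30Gi
      selector:
        matchLabels:
          alicloud-pvname: llm-model
    EOF

After you apply the manifests, run kubectl get pvc llm-model and confirm that the STATUS column shows Bound.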

Step 2: Deploy inference services

  1. Run the following command to query the GPU resources available in the cluster:

    arena top node

    Make sure that the cluster contains a GPU-accelerated node to run inference services.

  2. Run the following commands to start two Qwen inference services. Each inference service requests 6 GB of GPU memory:

    1. Start the first inference service.

      arena serve kserve \
          --name=qwen1 \
          --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
          --gpumemory=6 \
          --cpu=3 \
          --memory=8Gi \
          --data="llm-model:/mnt/models/Qwen1.5-0.5B-Chat" \
          "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen1.5-0.5B-Chat --dtype=half --max-model-len=4096"

      Expected output:

      inferenceservice.serving.kserve.io/qwen1 created
      INFO[0003] The Job qwen1 has been submitted successfully 
      INFO[0003] You can run `arena serve get qwen1 --type kserve -n default` to check the job status 

      The output indicates that the inference service is deployed.

    2. Start the second inference service.

      arena serve kserve \
          --name=qwen2 \
          --image=kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1 \
          --gpumemory=6 \
          --cpu=3 \
          --memory=8Gi \
          --data="llm-model:/mnt/models/Qwen1.5-0.5B-Chat" \
          "python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen1.5-0.5B-Chat --dtype=half --max-model-len=4096"

      Expected output:

      inferenceservice.serving.kserve.io/qwen2 created
      INFO[0001] The Job qwen2 has been submitted successfully 
      INFO[0001] You can run `arena serve get qwen2 --type kserve -n default` to check the job status 

      The output indicates that the inference service is deployed.

    Configure the parameters described in the following table.

    Parameter     Required   Description
    --name        Yes        The name of the inference service. The name must be globally unique.
    --image       Yes        The address of the inference service image.
    --gpumemory   No         The amount of GPU memory requested by the inference service. To allow
                             multiple inference services to share one GPU, make sure that the total
                             amount of GPU memory requested by all of the services does not exceed
                             the memory capacity of the GPU. For example, if the GPU provides 8 GB
                             of memory and the first inference service requests 3 GB
                             (--gpumemory=3), 5 GB remains. If a second inference service on the
                             same GPU requests 4 GB (--gpumemory=4), the two services request 7 GB
                             in total, which is less than the 8 GB capacity, so the two services
                             can share the GPU.
    --cpu         No         The number of vCPUs requested by the inference service.
    --memory      No         The amount of memory requested by the inference service.
    --data        No         The address of the model used by the inference service. In this
                             example, the PV named llm-model is mounted to the /mnt/models/
                             directory of the container.
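
    To check how the --gpumemory value is expressed on the generated pods, you can inspect the resource limits of a predictor pod. The following command is a sketch: it assumes that KServe labels predictor pods with serving.kserve.io/inferenceservice and that shared GPU scheduling in the cluster exposes GPU memory as the extended resource aliyun.com/gpu-mem (in GiB); the resource name may differ in your environment.

    # Print the resource limits of the first service's predictor pod.
    kubectl get pod -l serving.kserve.io/inferenceservice=qwen1 \
        -o jsonpath='{.items[0].spec.containers[0].resources.limits}{"\n"}'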

Step 3: Verify the inference services

  1. Run the following command to query the status of the inference services:

    kubectl get pod -o wide | grep qwen

    Expected output:

    qwen1-predictor-856568bdcf-5pfdq   1/1     Running   0          7m10s   10.130.XX.XX   cn-beijing.172.16.XX.XX   <none>           <none>
    qwen2-predictor-6b477b587d-dpdnj   1/1     Running   0          4m3s    10.130.XX.XX   cn-beijing.172.16.XX.XX   <none>           <none>

    The output indicates that qwen1 and qwen2 are deployed on the same GPU-accelerated node (cn-beijing.172.16.XX.XX).

  2. Run the following commands to log on to the pods where the inference services are deployed and view the amount of GPU memory allocated to the pods:

    kubectl exec -it qwen1-predictor-856568bdcf-5pfdq -- nvidia-smi # Log on to the pod where the first inference service is deployed. 
    kubectl exec -it qwen2-predictor-6b477b587d-dpdnj -- nvidia-smi # Log on to the pod where the second inference service is deployed.

    Expected output:

    • The GPU memory allocated to the first inference service

      Fri Jun 28 06:20:43 2024       
      +---------------------------------------------------------------------------------------+
      | NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
      |-----------------------------------------+----------------------+----------------------+
      | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
      |                                         |                      |               MIG M. |
      |=========================================+======================+======================|
      |   0  Tesla V100-SXM2-16GB           On  | 00000000:00:07.0 Off |                    0 |
      | N/A   39C    P0              53W / 300W |   5382MiB /  6144MiB |      0%      Default |
      |                                         |                      |                  N/A |
      +-----------------------------------------+----------------------+----------------------+
                                                                                               
      +---------------------------------------------------------------------------------------+
      | Processes:                                                                            |
      |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
      |        ID   ID                                                             Usage      |
      |=======================================================================================|
      +---------------------------------------------------------------------------------------+
    • The GPU memory allocated to the second inference service

      Fri Jun 28 06:40:17 2024       
      +---------------------------------------------------------------------------------------+
      | NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
      |-----------------------------------------+----------------------+----------------------+
      | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
      |                                         |                      |               MIG M. |
      |=========================================+======================+======================|
      |   0  Tesla V100-SXM2-16GB           On  | 00000000:00:07.0 Off |                    0 |
      | N/A   39C    P0              53W / 300W |   5382MiB /  6144MiB |      0%      Default |
      |                                         |                      |                  N/A |
      +-----------------------------------------+----------------------+----------------------+
                                                                                               
      +---------------------------------------------------------------------------------------+
      | Processes:                                                                            |
      |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
      |        ID   ID                                                             Usage      |
      |=======================================================================================|
      +---------------------------------------------------------------------------------------+

    The output shows that each pod can use at most 6 GB of GPU memory. The GPU memory capacity of the node is 16 GB. Therefore, the node has sufficient GPU memory for the pods where the two inference services are deployed.

  3. Run the following command to access one of the inference services by using the IP address of the NGINX Ingress:

    # Obtain the IP address of the NGINX Ingress. 
    NGINX_INGRESS_IP=$(kubectl -n kube-system get svc nginx-ingress-lb -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
    # Obtain the hostname of the inference service. 
    SERVICE_HOSTNAME=$(kubectl get inferenceservice qwen1 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    # Send a request to the inference service. 
    curl -H "Host: $SERVICE_HOSTNAME" -H "Content-Type: application/json" http://$NGINX_INGRESS_IP:80/v1/chat/completions -d '{"model": "qwen", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 10, ": 0.7, "top_p": 0.9, "seed": 10}' 

    Expected output:

    {"id":"cmpl-bbca59499ab244e1aabfe2c354bf6ad5","object":"chat.com pletion","created":1719303373,"model":"qwen","options":[{"index":0,"message":{"role":"assistant","content":"OK. What do you want to test?"},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":21,"total_tokens":31,"completion_tokens":10}}

    The output indicates that the model can generate a response based on the given prompt. In this example, the prompt is a test request.
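
    You can send the same request to the second inference service to confirm that both services respond while sharing the GPU. The following commands reuse the NGINX_INGRESS_IP variable obtained in the previous step; only the InferenceService name changes.

    # Obtain the hostname of the second inference service. 
    SERVICE_HOSTNAME_2=$(kubectl get inferenceservice qwen2 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
    # Send the same test request to the second inference service. 
    curl -H "Host: $SERVICE_HOSTNAME_2" -H "Content-Type: application/json" http://$NGINX_INGRESS_IP:80/v1/chat/completions -d '{"model": "qwen", "messages": [{"role": "user", "content": "Test"}], "max_tokens": 10, "temperature": 0.7, "top_p": 0.9, "seed": 10}'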

(Optional) Step 4: Clear the environment

If you no longer need the resources, clear the environment promptly.

  • Run the following commands to delete the inference services:

    arena serve delete qwen1
    arena serve delete qwen2
  • Run the following commands to delete the PV and the PVC:

    kubectl delete pvc llm-model
    kubectl delete pv llm-model
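
To confirm that the cleanup is complete, you can optionally run the following commands. The arena serve list output should no longer contain qwen1 or qwen2, and kubectl is expected to return NotFound errors for the deleted PV and PVC.

    # List the remaining inference services. 
    arena serve list
    # Check that the PV and PVC are deleted; a NotFound error is expected. 
    kubectl get pv llm-model
    kubectl get pvc llm-model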