If you use dedicated resource groups to deploy services in Elastic Algorithm Service (EAS) of Platform for AI (PAI), you can enable GPU sharing to increase resource utilization. If you enable GPU sharing when you deploy a service, the system deploys virtualized GPU resources for the service. This allows EAS to allocate the resources required by each instance based on the computing power ratio and GPU memory that you specified. This topic describes how to configure GPU sharing.
Prerequisites
A dedicated resource group is created and resources are purchased. For more information, see Work with dedicated resource groups.
Limits
The GPU sharing feature is available only for users in the whitelist. If you want to use the GPU sharing feature, submit a ticket.
The GPU sharing feature is available only for services deployed by using dedicated resource groups.
Configure GPU sharing when you create a service
Use the console
Log on to the PAI console. In the top navigation bar, select a region. Then, select the desired workspace and click Enter Elastic Algorithm Service (EAS).
Click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.
In the Resource Deployment section, configure the following key parameters. For more information about other parameters, see Deploy a model service in the PAI console.
Resource Type: Select EAS Resource Group.
GPU Sharing: Select GPU Sharing.
Deployment: Configure the following parameters:
Single-GPU Memory (GB): the GPU memory required by each instance. The value must be an integer. Unit: GB. PAI allows the memory of a single GPU to be allocated to multiple instances.
Important: The GPU memory of multiple instances is not strictly isolated. To prevent out-of-memory (OOM) errors, make sure that the GPU memory used by each instance does not exceed the requested amount.
Computing Power per GPU (%): the computing power of a single GPU required by each instance. The value must be an integer from 1 to 100. For example, if you enter 10, the system allocates 10% of the computing power of a single GPU to the instance. This enables flexible scheduling of computing power and allows multiple instances to share a single GPU.
After you configure the parameters, click Deploy.
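For reference, the two Deployment settings above correspond to the gpu_memory and gpu_core_percentage fields used in EASCMD service configuration files, which are described in the "Use an on-premises client" section of this topic. A minimal, illustrative metadata fragment (the field values are examples only):

```json
{
    "metadata": {
        "gpu_memory": 8,
        "gpu_core_percentage": 20
    }
}
```

With these example values, each instance requests 8 GB of GPU memory and 20% of the computing power of a single GPU.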
Use an on-premises client
Download the EASCMD client and complete identity authentication. In this example, Windows 64 is used.
Create a service configuration file named service.json in the directory in which the client is located. Sample content of the configuration file:
{
    "containers": [
        {
            "image": "eas-registry-vpc.cn-beijing.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4",
            "port": 8000,
            "script": "python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen1.5-7B-Chat"
        }
    ],
    "metadata": {
        "cpu": 8,
        "enable_webservice": true,
        "gpu_core_percentage": 5,
        "gpu_memory": 20,
        "instance": 1,
        "memory": 20000,
        "name": "testchatglm",
        "resource": "eas-r-fky7kxiq4l2zzt****",
        "resource_burstable": false
    },
    "name": "test"
}
Take note of the following parameters. For information about other parameters, see All Parameters of model services.
gpu_memory: The amount of GPU memory required by each instance. The value must be an integer. Unit: GB. PAI allows the memory of a single GPU to be allocated to multiple instances. To schedule GPU memory, set the gpu field to 0. If you set the gpu field to 1, the instance occupies an entire GPU, and the gpu_memory field is ignored.
Important: The GPU memory of multiple instances is not strictly isolated. To prevent out-of-memory (OOM) errors, make sure that the GPU memory used by each instance does not exceed the requested amount.
gpu_core_percentage: The percentage of the computing power of a single GPU required by each instance. The value must be an integer from 1 to 100. For example, if you set this parameter to 10, the system allocates 10% of the computing power of a single GPU to the instance. This enables flexible scheduling of computing power and allows multiple instances to share a single GPU. If you configure this parameter, you must also configure the gpu_memory parameter; otherwise, this parameter does not take effect.
resource: The ID of the dedicated resource group. For more information about how to view the ID of a dedicated resource group, see Manage dedicated resource groups.
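As noted above, GPU memory scheduling requires the gpu field to be set to 0. An illustrative metadata fragment for a shared-GPU instance (the field values are examples only):

```json
{
    "metadata": {
        "gpu": 0,
        "gpu_memory": 10,
        "gpu_core_percentage": 20
    }
}
```

If gpu were set to 1 instead, the instance would occupy an entire GPU and the gpu_memory setting would be ignored.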
Run the following command in the directory in which the JSON file is located to create the service. For more information, see Run commands to use the EASCMD client.
eascmdwin64.exe create <service.json>
Replace <service.json> with the name of the JSON file that you created.
Configure GPU sharing when you update a service
If you did not enable the GPU sharing feature when you deployed a service by using a dedicated resource group, you can enable GPU sharing by updating the service configuration.
Update the service in the console
On the Elastic Algorithm Service (EAS) page, find the service that you want to update and click Update Service in the Actions column.
In the Resource Deployment section of the Update Service page, configure the Resource Type, GPU Sharing, and Deployment parameters. For more information, see the "Use the console" section of this topic.
After you configure the parameters, click Deploy.
Update the service by using the on-premises client
Download the EASCMD client and complete identity authentication. In this example, Windows 64 is used.
Create a file named instances.json in the directory in which the client is located. Sample content of the file:
{
    "metadata": {
        "gpu_memory": 2,
        "gpu_core_percentage": 5
    }
}
For more information about the parameters in the preceding code, see the "Use an on-premises client" section of this topic.
Open the terminal tool. In the directory in which the JSON file is located, run the following command to enable GPU sharing for the EAS service:
eascmdwin64.exe modify <service_name> -s <instances.json>
Replace <service_name> with the name of the EAS service and <instances.json> with the name of the JSON file that you created.