Configure GPU sharing for EAS services

If you use dedicated resource groups to deploy services in Elastic Algorithm Service (EAS) of Platform for AI (PAI), you can enable GPU sharing to increase resource utilization. When GPU sharing is enabled for a service, the system provisions virtualized GPU resources for it, and EAS allocates resources to each instance based on the computing power ratio and GPU memory that you specify. This topic describes how to configure GPU sharing.

Prerequisites

A dedicated resource group is created and resources are purchased. For more information, see Work with dedicated resource groups.

Limits

The GPU sharing feature is available only to users on the whitelist. To use the feature, submit a ticket.
The GPU sharing feature is available only for services deployed in EAS dedicated resource groups and does not support GPUs of the GU type. Make sure that the EAS dedicated resources you purchase use GPUs of non-GU types.

Configure GPU sharing when you create a service

Use the console

Log on to the PAI console. Select a region at the top of the page. Then, select the desired workspace and click Enter Elastic Algorithm Service (EAS).
Click Deploy Service. In the Custom Model Deployment section, click Custom Deployment.

In the Resource Deployment section, configure the following key parameters. For more information about other parameters, see Deploy a model service in the PAI console.

Parameter	Description
Resource Type	Select EAS Resource Group.
GPU Sharing	Select GPU Sharing.
Deployment	Configure the following parameters. Single-GPU Memory (GB): Required. The GPU memory that each instance requires on a single GPU. The value must be an integer. Unit: GB. PAI can allocate the memory of one GPU to multiple instances. GPU memory is not strictly isolated between instances, so to prevent out-of-memory (OOM) errors, make sure that each instance uses no more GPU memory than it requested. Computing Power per GPU (%): Optional. The share of a single GPU's computing power that each instance requires. The value must be an integer from 1 to 100. For example, if you enter 10, the system allocates 10% of the computing power of a single GPU to each instance, which allows multiple instances to share one GPU. Single-GPU Memory and Computing Power per GPU apply at the same time (an AND relationship). For example, if you set Single-GPU Memory to 48 GB and Computing Power per GPU to 10%, an instance can use at most 48 GB of GPU memory and at most 10% of the GPU's computing power.

After you configure the parameters, click Deploy.
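Because the memory limit and the computing power limit apply at the same time, the tighter of the two determines how many instances can share a GPU. The following Python sketch illustrates that arithmetic. It assumes that instances are packed purely by the requested values. The function name and the 80 GB GPU size are hypothetical and are not part of PAI, and the actual EAS scheduler may behave differently.

```python
def max_instances_per_gpu(gpu_total_memory_gb, memory_per_instance_gb,
                          compute_percent=None):
    """Estimate how many instances fit on one GPU (illustrative only)."""
    if not 1 <= memory_per_instance_gb <= gpu_total_memory_gb:
        raise ValueError("requested memory must be between 1 GB and the GPU size")
    # Limit imposed by GPU memory alone.
    by_memory = gpu_total_memory_gb // memory_per_instance_gb
    if compute_percent is None:
        return by_memory
    if not 1 <= compute_percent <= 100:
        raise ValueError("computing power must be an integer from 1 to 100")
    # Limit imposed by computing power alone.
    by_compute = 100 // compute_percent
    # Both limits apply together, so the smaller one wins.
    return min(by_memory, by_compute)


# Hypothetical 80 GB GPU: memory is the bottleneck here.
print(max_instances_per_gpu(80, 20, 5))
# Hypothetical 80 GB GPU: computing power is the bottleneck here.
print(max_instances_per_gpu(80, 4, 10))
```

In the first call, memory allows 4 instances and computing power allows 20, so the estimate is 4. In the second call, memory allows 20 and computing power allows 10, so the estimate is 10.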

Use an on-premises client

Download the EASCMD client and complete identity authentication. This example uses the 64-bit Windows client.

Create a service configuration file named service.json in the directory in which the client is located. Sample content of the configuration file:

{
    "containers": [
        {
            "image": "eas-registry-vpc.cn-beijing.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4",
            "port": 8000,
            "script": "python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen1.5-7B-Chat"
        }
    ],
    "metadata": {
        "cpu": 8,
        "enable_webservice": true,
        "gpu_core_percentage": 5,
        "gpu_memory": 20,
        "instance": 1,
        "memory": 20000,
        "name": "testchatglm",
        "resource": "eas-r-fky7kxiq4l2zzt****",
        "resource_burstable": false
    },
    "name": "test"
}
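If you prefer to generate the configuration file instead of writing it by hand, a script along the following lines may help. This is a minimal sketch: the function name `build_service_config` is hypothetical, and the image, script, and resource values are copied from the sample above and must be replaced with your own.

```python
import json


def build_service_config(service_name, resource_group_id, gpu_memory_gb,
                         gpu_core_percentage=None):
    """Build a service configuration dict that requests a share of one GPU."""
    metadata = {
        "name": service_name,
        "instance": 1,
        "cpu": 8,
        "memory": 20000,
        "enable_webservice": True,
        "resource": resource_group_id,
        "resource_burstable": False,
        "gpu_memory": gpu_memory_gb,  # GB of GPU memory per instance
    }
    # The computing power ratio is optional; add it only when requested.
    if gpu_core_percentage is not None:
        metadata["gpu_core_percentage"] = gpu_core_percentage
    container = {
        "image": "eas-registry-vpc.cn-beijing.cr.aliyuncs.com/pai-eas/chat-llm-webui:3.0.4",
        "port": 8000,
        "script": "python webui/webui_server.py --port=8000 --model-path=Qwen/Qwen1.5-7B-Chat",
    }
    return {"containers": [container], "metadata": metadata}


# The resource group ID is the masked placeholder from the sample above.
config = build_service_config("testchatglm", "eas-r-fky7kxiq4l2zzt****", 20, 5)
service_json = json.dumps(config, indent=4)
print(service_json)
```

Save the printed text as service.json in the directory in which the client is located.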

Take note of the following parameters. For information about other parameters, see All Parameters of model services.

Parameter	Description
gpu_memory	The amount of GPU memory required by each instance. The value must be an integer. Unit: GB. PAI allows the memory of one GPU to be allocated to multiple instances. To schedule by GPU memory, set the gpu field to 0. If you set the gpu field to 1, the instance occupies the entire GPU and the gpu_memory field is ignored. Important: The GPU memory of multiple instances is not strictly isolated. To prevent out-of-memory (OOM) errors, make sure that the GPU memory used by each instance does not exceed the requested amount.
gpu_core_percentage	The percentage of a single GPU's computing power required by each instance. The value must be an integer from 1 to 100. For example, if you set this parameter to 10, each instance can use 10% of the computing power of a GPU. This facilitates flexible scheduling of computing power and allows multiple instances to share a single GPU. If you configure this parameter, you must also configure the gpu_memory parameter. Otherwise, this parameter does not take effect.
resource	The ID of the existing dedicated resource group. For more information about how to view the ID of a dedicated resource group, see Manage dedicated resource groups.
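The rules in the preceding table can be checked locally before you submit a configuration. The following Python sketch is one way to do so. The function name `check_gpu_sharing_metadata` is hypothetical, and the sketch assumes that a missing gpu field behaves like 0 and that gpu_memory must be positive, neither of which is stated in this topic.

```python
def check_gpu_sharing_metadata(metadata):
    """Return a list of problems found in the GPU sharing fields."""
    problems = []
    gpu = metadata.get("gpu", 0)  # assumption: a missing gpu field acts like 0
    memory = metadata.get("gpu_memory")
    percent = metadata.get("gpu_core_percentage")

    def is_int(value):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)

    if memory is not None and (not is_int(memory) or memory <= 0):
        problems.append("gpu_memory must be a positive integer (GB)")
    if percent is not None:
        if not is_int(percent) or not 1 <= percent <= 100:
            problems.append("gpu_core_percentage must be an integer from 1 to 100")
        if memory is None:
            problems.append("gpu_core_percentage has no effect without gpu_memory")
    if gpu == 1 and memory is not None:
        problems.append("gpu is 1, so the whole GPU is used and gpu_memory is ignored")
    return problems


# The metadata values from the sample configuration pass the checks.
print(check_gpu_sharing_metadata({"gpu_memory": 20, "gpu_core_percentage": 5}))
# A computing power ratio without gpu_memory is reported.
print(check_gpu_sharing_metadata({"gpu_core_percentage": 5}))
```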

Run the following command in the directory in which the JSON file is located to create the service. For more information, see Run commands to use the EASCMD client.
```
eascmdwin64.exe create <service.json>
```
Replace <service.json> with the name of the JSON file that you created.
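If you create services from a script, you may want to confirm that the file is well-formed JSON before the client is called. The following Python sketch wraps the create command shown above. The helper names are hypothetical, and the sketch assumes that the client is in the current directory or on the PATH and that it returns a non-zero exit code on failure.

```python
import json
import subprocess

CLIENT = "eascmdwin64.exe"  # client binary name used in this topic


def build_create_command(config_path, client=CLIENT):
    """Return the argument list for `<client> create <config_path>`."""
    return [client, "create", str(config_path)]


def create_service(config_path, client=CLIENT):
    """Check that the file parses as JSON, then run the create command."""
    with open(config_path, encoding="utf-8") as handle:
        json.load(handle)  # raises json.JSONDecodeError on malformed input
    # check=True raises CalledProcessError if the client exits with a non-zero code.
    return subprocess.run(build_create_command(config_path, client), check=True)


print(build_create_command("service.json"))
```

Calling `create_service("service.json")` runs the same command as the manual step, so it requires the client and a completed identity authentication.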

Configure GPU sharing when you update a service

If you did not enable GPU sharing when you deployed a service by using a dedicated resource group, you can enable it by updating the service configuration.

Update the service in the console

On the Elastic Algorithm Service (EAS) page, find the service that you want to update and click Update Service in the Actions column.
In the Resource Deployment section of the Update Service page, configure the Resource Type, GPU Sharing, and Deployment parameters. For more information, see the "Use the console" section of this topic.
After you configure the parameters, click Deploy.

Update the service by using the on-premises client

Download the EASCMD client and complete identity authentication. This example uses the 64-bit Windows client.
Create a file named instances.json in the directory in which the client is located. Sample content of the file:
```
{
    "metadata": {
        "gpu_memory": 2,
        "gpu_core_percentage": 5
    }
}
```
For more information about the parameters in the preceding code, see the "Use an on-premises client" section of this topic.
Open the terminal tool. In the directory in which the JSON file is located, run the following command to enable GPU sharing for the EAS service:
```
eascmdwin64.exe modify <service_name> -s <instances.json>
```
Replace <service_name> with the name of the EAS service and <instances.json> with the name of the JSON file that you created.
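The update file only needs the fields that change, but it must still be a complete JSON object. The following Python sketch generates such a file body and the matching modify command. The helper names are hypothetical, and the service name `testchatglm` is taken from the earlier sample only for illustration.

```python
import json


def build_update_fragment(gpu_memory_gb, gpu_core_percentage=None):
    """Build the partial configuration that enables GPU sharing."""
    metadata = {"gpu_memory": gpu_memory_gb}
    # The computing power ratio only takes effect together with gpu_memory.
    if gpu_core_percentage is not None:
        metadata["gpu_core_percentage"] = gpu_core_percentage
    return {"metadata": metadata}


def build_modify_command(service_name, fragment_path, client="eascmdwin64.exe"):
    """Return the argument list for `<client> modify <service_name> -s <file>`."""
    return [client, "modify", service_name, "-s", str(fragment_path)]


fragment_json = json.dumps(build_update_fragment(2, 5), indent=4)
print(fragment_json)
print(build_modify_command("testchatglm", "instances.json"))
```

Save the printed JSON as instances.json, then run the printed command from the directory in which the file is located.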