Container Service for Kubernetes: Use MPS to implement GPU sharing and GPU memory isolation

Last Updated: Dec 10, 2024

GPU Sharing uses NVIDIA Multi-Process Service (MPS) as the underlying GPU isolation module to enable multiple pods to share one GPU while isolating GPU memory among the pods. This topic describes how to enable NVIDIA MPS and use it together with the GPU Sharing component to implement GPU sharing and GPU memory isolation.

Background Information

MPS allows multiple processes to run kernels on a GPU in parallel and balances GPU resource allocation among them, so that multiple computing tasks are processed concurrently and your workloads are accelerated. When you use Compute Unified Device Architecture (CUDA) kernels to accelerate MPI processes, each MPI process may be assigned only a small amount of work. In this case, the execution of each MPI process is accelerated, but the GPU is underutilized. If an application runs only a small number of tasks on a GPU, a proportion of the GPU resources may remain idle. To resolve this issue, we recommend that you enable NVIDIA MPS to run multiple CUDA applications on one NVIDIA GPU. This feature is suitable for multi-tenant environments or scenarios in which you need to run multiple tasks that each require a small amount of resources. This helps improve GPU utilization and application throughput.

MPS uses a client-server architecture to run multiple applications on a GPU in parallel and improve GPU utilization. MPS is also binary-compatible, which means that you do not need to make major code modifications to your CUDA applications to use it. MPS consists of the following components:

  • Control Daemon Process: This component starts and stops the MPS server and coordinates connections between clients and the MPS server, so that clients can connect to the MPS server and request GPU resources as usual.

  • Client Runtime: This component is built into the CUDA driver library, so you can use MPS without major code modifications to your CUDA applications. When multiple applications use the CUDA driver to perform operations on a GPU, Client Runtime automatically interacts with the MPS server to ensure that the applications share the GPU in an efficient and secure manner.

  • Server Process: This component receives requests from different clients, uses scheduling policies to efficiently distribute the requests to one GPU, and allows the GPU to concurrently process the requests of different clients.
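
For reference, the following shell sketch shows how these components interact on a plain GPU machine outside Kubernetes, assuming that the NVIDIA driver and CUDA are installed. The directory paths and the my_cuda_app program are placeholders for illustration. In this topic, the control daemon is managed for you by the ack-mps-control component, so you do not need to run these commands on ACK nodes.

    # Start the MPS control daemon in the background (Control Daemon Process).
    # The pipe and log directories below are illustrative choices.
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
    export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
    nvidia-cuda-mps-control -d

    # Any CUDA program launched with the same CUDA_MPS_PIPE_DIRECTORY becomes an
    # MPS client: its Client Runtime connects to the control daemon, which starts
    # nvidia-cuda-mps-server (Server Process) on first use and lets the clients
    # share the GPU.
    ./my_cuda_app &   # my_cuda_app is a placeholder for your CUDA application
    ./my_cuda_app &

    # Stop the control daemon when you are done.
    echo quit | nvidia-cuda-mps-control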

Usage notes

  • In the NVIDIA MPS architecture, MPS clients need to interact with MPS Control Daemon. MPS clients are the applications that require GPU resources and have MPS enabled. If MPS Control Daemon is restarted, errors may occur on the MPS clients and the clients may exit.

  • In this example, MPS Control Daemon runs in containers and is deployed as a DaemonSet to run on each GPU-accelerated node. Each node runs an MPS Control Daemon pod. The following list describes the usage notes for the MPS Control Daemon pod:

    • Do not delete or restart the MPS Control Daemon pod. If you delete the MPS Control Daemon pod, applications that require GPU resources may become unavailable. You can run the kubectl get po -l app.aliyun.com/name=mps-control-daemon -A command to check the status of the MPS Control Daemon pod.

    • When you run MPS Control Daemon in containers, the containers must have the privileged, hostIPC, and hostPID permissions. Security risks may arise from the permissions. Exercise caution before you enable MPS.

    • The MPS Control Daemon pod uses the priorityClassName: system-node-critical configuration to apply for a high priority. This prevents the system from terminating the MPS Control Daemon pod when resources on a node become insufficient. If the MPS Control Daemon pod is terminated, your applications that use MPS may not work as expected. If resources on a node are insufficient when you deploy MPS Control Daemon, MPS Control Daemon may preempt the resources used by pods that are assigned lower priorities. In this case, the pods are evicted from the node. Before you deploy MPS Control Daemon on a node, we recommend that you ensure that the node has sufficient CPU and memory resources. You can check the status and priority class of the MPS Control Daemon pods as shown in the sketch after this list.

  • For GPU-accelerated nodes that are managed in Container Service for Kubernetes (ACK) clusters, pay attention to the following items when you request and use GPU resources for applications:

    • Do not run GPU-heavy applications directly on nodes.

    • Do not use tools, such as Docker, Podman, or nerdctl, to create containers and request GPU resources for the containers. For example, do not run the docker run --gpus all or docker run -e NVIDIA_VISIBLE_DEVICES=all command and run GPU-heavy applications.

    • Do not add the NVIDIA_VISIBLE_DEVICES=all or NVIDIA_VISIBLE_DEVICES=<GPU ID> environment variable to the env section in the pod YAML file. Do not use the NVIDIA_VISIBLE_DEVICES environment variable to request GPU resources for pods and run GPU-heavy applications.

    • Do not build container images in which NVIDIA_VISIBLE_DEVICES=all is set by default and then run GPU-heavy applications in pods whose YAML files do not specify the NVIDIA_VISIBLE_DEVICES environment variable.

    • Do not add privileged: true to the securityContext section in the pod YAML file and run GPU-heavy applications.

  • The following potential risks may exist when you use the preceding methods to request GPU resources for your application:

    • If you use one of the preceding methods to request GPU resources on a node, the requested resources are not recorded in the device resource ledger of the scheduler. As a result, the actual GPU resource allocation on the node may differ from the information in the ledger, and the scheduler can still schedule other pods that request GPU resources to the node. Consequently, your applications may compete for the resources of the same GPU, and some applications may fail to start due to insufficient GPU resources.

    • Using the preceding methods may also cause other unknown issues, such as the issues reported by the NVIDIA community.
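
The following sketch shows how to check the status and the priority class of the MPS Control Daemon pods from the command line. The label comes from the preceding notes; the pod name and namespace are placeholders that you replace with the values returned by the first command.

    # List the MPS Control Daemon pods on all nodes and confirm that they are Running.
    kubectl get po -l app.aliyun.com/name=mps-control-daemon -A -o wide

    # Confirm that a pod requests the system-node-critical priority class.
    # Replace <pod-name> and <namespace> with values from the preceding output.
    kubectl get po <pod-name> -n <namespace> -o jsonpath='{.spec.priorityClassName}'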

Prerequisites

An ACK Pro cluster that runs Kubernetes 1.20 or later is created. For more information, see Create an ACK managed cluster and Update an ACK cluster.
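
To confirm that your cluster meets the version requirement, you can check the Kubernetes version reported by the cluster and its nodes, for example:

    # The Server Version field shows the Kubernetes version of the cluster.
    kubectl version

    # The VERSION column shows the kubelet version of each node.
    kubectl get nodes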

Procedure

Step 1: Install MPS Control Daemon

  1. Log on to the ACK console. In the left-side navigation pane, choose Marketplace > Marketplace.

  2. Go to the Marketplace page, enter ack-mps-control in the search box, and then click the search icon. Then, click the component that is displayed.

  3. On the ack-mps-control page, click Deploy. In the Deploy panel, select the cluster where you want to deploy the component and click Next.

  4. In the Create panel, select the chart version that you want to install and click OK.

    Important

    If you uninstall or update ack-mps-control on a node, errors may occur in running applications that require GPU resources on the node and the applications may exit. We recommend that you perform these operations during off-peak hours.
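
    After the chart is deployed, you can confirm that the MPS Control Daemon DaemonSet exists. The following check is a sketch; the exact DaemonSet name may differ, so filtering on "mps" is only an assumption. The MPS Control Daemon pods are created only on nodes that have the label described in Step 3.

    # Look for the MPS control daemon DaemonSet in all namespaces.
    kubectl get ds -A | grep -i mps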

Step 2: Install ack-ai-installer

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Applications > Cloud-native AI Suite.

  3. On the Cloud-native AI Suite page, click Deploy.

  4. On the Cloud-native AI Suite page, select Scheduling Policy Extension (Batch Task Scheduling, GPU Sharing, Topology-aware GPU Scheduling).

  5. In the lower part of the Cloud-native AI Suite page, click Deploy Cloud-native AI Suite.

    After you install the cloud-native AI suite, the GPU Sharing component ack-ai-installer is displayed in the Components section on the Cloud-native AI Suite page.

Step 3: Enable GPU sharing and GPU memory isolation

  1. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Nodes > Node Pools.

  2. On the Node Pools page, click Create Node Pool.

  3. In the Create Node Pool dialog box, configure the parameters and click Confirm Order.

    The following list describes the key parameters. For more information about other parameters, see Create a node pool.

    • Expected Nodes: Specify the initial number of nodes in the node pool. If you do not want to create nodes in the node pool, set this parameter to 0.

      Note

      After the node pool is created, you can add GPU-accelerated nodes to the node pool. To add GPU-accelerated nodes, you need to select ECS instances that use the GPU-accelerated architecture. For more information, see Add existing ECS instances to an ACK cluster or Create a node pool.

    • Node Labels: Click the icon next to Node Labels. Set the key to ack.node.gpu.schedule and the value to mps. After the nodes are added, you can verify the configuration as shown in the sketch at the end of this step.

    Important
    • The system deploys the MPS Control Daemon pod on a GPU-accelerated node only if the node has the ack.node.gpu.schedule=mps label. After you deploy ack-ai-installer in your cluster, if you add the ack.node.gpu.schedule=mps label to a node, GPU sharing and MPS-based GPU memory isolation are enabled for the node.

    • After you add the label for enabling GPU sharing to a node, do not run the kubectl label nodes command to change the label value or use the label management feature to change the node label on the Nodes page in the ACK console. This prevents potential issues. For more information, see the Issues that may occur if you use the kubectl label nodes command or use the label management feature to change label values in the ACK console section of the "Labels for enabling GPU scheduling policies" topic. We recommend that you configure GPU sharing based on node pools. For more information, see the Configure GPU scheduling policies for node pools section of the "Labels for enabling GPU scheduling policies" topic.
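
    After GPU-accelerated nodes with this label are added to the node pool, you can confirm that GPU sharing with MPS is in effect. The following sketch uses the label described above and the aliyun.com/gpu-mem resource name that appears later in this topic; the node name is a placeholder.

    # Nodes for which GPU sharing and MPS-based GPU memory isolation are enabled.
    kubectl get nodes -l ack.node.gpu.schedule=mps

    # The MPS Control Daemon pod that runs on each labeled node.
    kubectl get po -l app.aliyun.com/name=mps-control-daemon -A -o wide

    # The shareable GPU memory (in GiB) reported by a labeled node.
    kubectl get node <node-name> -o jsonpath='{.status.allocatable.aliyun\.com/gpu-mem}'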

Step 4: Install a GPU inspection tool

  1. Download kubectl-inspect-cgpu.

    • If you use Linux, run the following command to download kubectl-inspect-cgpu:

      wget http://aliacs-k8s-cn-beijing.oss-cn-beijing.aliyuncs.com/gpushare/kubectl-inspect-cgpu-linux -O /usr/local/bin/kubectl-inspect-cgpu
    • If you use macOS, run the following command to download kubectl-inspect-cgpu:

      wget http://aliacs-k8s-cn-beijing.oss-cn-beijing.aliyuncs.com/gpushare/kubectl-inspect-cgpu-darwin -O /usr/local/bin/kubectl-inspect-cgpu
  2. Run the following command to grant the execute permissions to kubectl-inspect-cgpu:

    chmod +x /usr/local/bin/kubectl-inspect-cgpu
  3. Run the following command to query the GPU usage of the cluster:

    kubectl inspect cgpu

    Expected output:

    NAME                       IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
    cn-shanghai.192.168.6.104  192.168.6.104  0/15                   0/15
    ----------------------------------------------------------------------
    Allocated/Total GPU Memory In Cluster:
    0/15 (0%)

Step 5: Deploy an application

  1. Use the following YAML template to create an application:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: mps-sample
    spec:
      parallelism: 1
      template:
        metadata:
          labels:
            app: mps-sample
        spec:
          hostIPC: true  # This parameter is required. If you do not specify it, the pod fails to start.
          hostPID: true  # This parameter is optional. In this example, it makes it easier to view the effect of MPS.
          containers:
          - name: mps-sample
            image: registry.cn-hangzhou.aliyuncs.com/ai-samples/gpushare-sample:tensorflow-1.5
            command:
            - python
            - tensorflow-sample-code/tfjob/docker/mnist/main.py
            - --max_steps=100000
            - --data_dir=tensorflow-sample-code/data
            resources:
              limits: 
                aliyun.com/gpu-mem: 7  # The pod requests 7 GiB of GPU memory.
            workingDir: /root
          restartPolicy: Never
    Note

    After you enable MPS for a node, you must add the hostIPC: true setting to the configurations of pods that require GPU resources on the node. Otherwise, the pods fail to start.

  2. After the pod is created and enters the Running state, run the following command to check whether MPS is enabled.

    kubectl exec -ti mps-sample-xxxxx -- nvidia-smi

    Expected output:

    Tue Nov 12 11:09:35 2024
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  Tesla xxxxxxxxxxxxxx           On  | 00000000:00:07.0 Off |                    0 |
    | N/A   37C    P0              56W / 300W |    345MiB / 32768MiB |      0%   E. Process |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+
    
    +---------------------------------------------------------------------------------------+
    | Processes:                                                                            |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
    |        ID   ID                                                             Usage      |
    |=======================================================================================|
    |    0   N/A  N/A    197792      C   nvidia-cuda-mps-server                       30MiB |
    |    0   N/A  N/A    387820    M+C   python                                      312MiB |
    +---------------------------------------------------------------------------------------+

    The output of the nvidia-smi command shows that mps-server is started and the process ID (PID) of mps-server on the node is 197792. In addition, a Python program (PID: 387820) is started, which indicates that MPS is enabled.
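
    The following sketch shows one way to submit the Job from step 1 and find the generated pod name before you run the nvidia-smi check. The file name mps-sample.yaml is an assumption for this example; the app=mps-sample label comes from the Job template.

    # Save the YAML from step 1 as mps-sample.yaml (assumed file name) and submit it.
    kubectl apply -f mps-sample.yaml

    # Wait for the pod to enter the Running state and read its generated name.
    kubectl get po -l app=mps-sample -w

    # Run nvidia-smi inside the pod. Replace <pod-name> with the name shown above.
    kubectl exec -ti <pod-name> -- nvidia-smi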