Container Service for Kubernetes (ACK) allows you to use node pools to control cGPU. This way, you can create more flexible GPU sharing and memory isolation policies. In this topic, two labeled node pools are created in an ACK Pro cluster to demonstrate how to use node pools to control the GPU sharing and memory isolation capabilities of cGPU.
Limits
Only ACK dedicated clusters that contain GPU-accelerated nodes support the ack-cgpu component. ACK managed clusters that contain GPU-accelerated nodes do not support the ack-cgpu component.
If you want to install ack-cgpu in ACK Pro clusters, see Install and use ack-ai-installer and the GPU inspection tool.
Prerequisites
Before you start, make sure that the following operations are completed:
Node pools are configured.
You can customize the names of the node pools. In this example, two node pools named cgpu and cgpu-no-isolation are used.
Node pool name      | GPU sharing | Memory isolation | Labels
cgpu                | Enabled     | Enabled          | cgpu=true, cgpu.disable.isolation=false
cgpu-no-isolation   | Enabled     | Disabled         | cgpu=true, cgpu.disable.isolation=true
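After GPU-accelerated nodes are added to the two node pools, you can verify that the labels are applied. The following command is a minimal check that assumes the nodes have already joined the cluster. It lists all nodes that carry the cgpu=true label and shows the value of the cgpu.disable.isolation label for each node:
kubectl get nodes -l cgpu=true -L cgpu.disable.isolation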
Background information
When you use cGPU in an ACK cluster, you may come across the following scenarios:
The amount of GPU memory that can be allocated to Job A is already specified in the script. In this case, the ACK cluster needs only to enable GPU sharing for Job A. Memory isolation is not required.
The amount of GPU memory that can be allocated to Job B is not specified in the script. In this case, the ACK cluster must enable both GPU sharing and memory isolation for Job B.
How do I configure an ACK cluster to support both scenarios?
To resolve this problem, you can use node pools to control cGPU. You need only to create two node pools:
Create a node pool that supports only GPU sharing. This node pool is used to run Job A.
Create another node pool that supports both GPU sharing and memory isolation. This node pool is used to run Job B.
Usage notes
When you use node pools to control cGPU, take note of the following limits:
If a job is not configured with a node selector, its pods may be scheduled to nodes in other node pools. This may cause job execution errors.
Important: We recommend that you configure a node selector for each job.
After you change the label of a node, for example, from cgpu.disable.isolation=false to cgpu.disable.isolation=true, you must restart the gpushare-device-plugin pod on the node for the change to take effect.
To do this, delete the gpushare-device-plugin pod on the node. ACK then automatically creates a new pod. You can perform the following operations:
Run the following command to query the pods of gpushare-device-plugin in the ACK cluster:
kubectl get po -n kube-system -l name=gpushare-device-plugin-ds -o wide
Expected output:
NAME                              READY   STATUS    RESTARTS   AGE   IP              NODE                        NOMINATED NODE   READINESS GATES
gpushare-device-plugin-ds-6r8gs   1/1     Running   0          18h   192.168.7.157   cn-shanghai.192.168.7.157   <none>           <none>
gpushare-device-plugin-ds-pjrvn   1/1     Running   0          15h   192.168.7.158   cn-shanghai.192.168.7.158   <none>           <none>
In this example, node cn-shanghai.192.168.7.157 is used. Run the following command to delete the pod of gpushare-device-plugin on this node:
kubectl delete po gpushare-device-plugin-ds-6r8gs -n kube-system
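If you manage many nodes, you can also delete the gpushare-device-plugin pod on a specific node in a single command by combining a label selector with a field selector. The following command is a sketch that reuses the node name from the preceding example; replace it with the name of your node:
kubectl delete po -n kube-system -l name=gpushare-device-plugin-ds --field-selector spec.nodeName=cn-shanghai.192.168.7.157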
Step 1: Create node pools
Log on to the ACK console.
In the left-side navigation pane of the ACK console, click Clusters.
On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
In the left-side navigation pane of the details page, choose Nodes > Node Pools.
On the right side of the Node Pools page, click Create Node Pool.
In the Create Node Pool dialog box, set the parameters.
For more information, see Create an ACK managed cluster. The following list describes some of the parameters:
Quantity: Specify the initial number of nodes in the node pool. If you do not want to add nodes to the node pool, set this parameter to 0.
Operating System: Select the operating system of the nodes. CentOS 7.x and Alibaba Cloud Linux 2.x are supported.
Node Label: You can add labels to the nodes.
ECS Label: You can add labels to the Elastic Compute Service (ECS) instances.
Custom Resource Group: You can specify the resource group to which the nodes in the node pool belong.
In the Node Label section, add the required labels to the nodes in the node pool.
Add the following labels to the nodes in the cgpu node pool: cgpu=true and cgpu.disable.isolation=false.
Add the following labels to the nodes in the cgpu-no-isolation node pool: cgpu=true and cgpu.disable.isolation=true.
The following figure shows the labels that are added to the nodes in the cgpu-no-isolation node pool.
Click Confirm Order.
On the Node Pools page, check the Status column of the node pool. The Initializing state indicates that the node pool is being created. After the node pool is created, the state changes to Active.
If you want to add GPU-accelerated nodes to the node pool, you can scale out the node pool. For more information, see Manage node pools.
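If you prefer to manage the labels from the command line after the nodes are created, you can also add or change them with kubectl. The following commands are a sketch; replace <NODE_NAME> with the name of a GPU-accelerated node. Labels added this way apply only to existing nodes; nodes that are added to a node pool later receive the labels configured for that node pool. Also remember to restart the gpushare-device-plugin pod on a node after you change cgpu.disable.isolation, as described in the usage notes:
# Label a node for the cgpu node pool (GPU sharing and memory isolation).
kubectl label nodes <NODE_NAME> cgpu=true cgpu.disable.isolation=false --overwrite
# Label a node for the cgpu-no-isolation node pool (GPU sharing only).
kubectl label nodes <NODE_NAME> cgpu=true cgpu.disable.isolation=true --overwrite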
Step 2: Submit jobs
Submit two jobs named cgpu-test and cgpu-test-no-isolation. You must set nodeSelector in the YAML files of both jobs.
cgpu-test: The amount of GPU memory that can be allocated to this job is not specified in the script of the job. Therefore, memory isolation is required for running this job. The following YAML template is an example:
apiVersion: batch/v1
kind: Job
metadata:
  name: cgpu-test
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: cgpu-test
    spec:
      nodeSelector:
        cgpu.disable.isolation: "false"  # Add a node selector to select the cgpu node pool.
      containers:
      - name: cgpu-test
        image: registry.cn-hangzhou.aliyuncs.com/ai-samples/gpushare-sample:tensorflow-1.5
        command:
        - python
        - tensorflow-sample-code/tfjob/docker/mnist/main.py
        - --max_steps=100000
        - --data_dir=tensorflow-sample-code/data
        resources:
          limits:
            # The pod requests 3 GiB of GPU memory in total.
            aliyun.com/gpu-mem: 3
        workingDir: /root
      restartPolicy: Never
Note:
nodeSelector: selects the cgpu node pool.
cgpu.disable.isolation=false: schedules the job to nodes in the cgpu node pool.
aliyun.com/gpu-mem: specifies the amount of GPU memory requested by the job.
cgpu-test-no-isolation: The amount of memory that can be allocated to the job per GPU is specified in the script of the job. Therefore, memory isolation is not required for running this job. The following YAML template is an example:
apiVersion: batch/v1
kind: Job
metadata:
  name: cgpu-test-no-isolation
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: cgpu-test-no-isolation
    spec:
      nodeSelector:
        cgpu.disable.isolation: "true"  # Add a node selector to select the cgpu-no-isolation node pool.
      containers:
      - name: cgpu-test-no-isolation
        image: registry.cn-hangzhou.aliyuncs.com/ai-samples/gpushare-sample:tensorflow-1.5
        command:
        - python
        - tensorflow-sample-code/tfjob/docker/mnist/main.py
        - --max_steps=100000
        - --data_dir=tensorflow-sample-code/data
        resources:
          limits:
            # The pod requests 3 GiB of GPU memory in total.
            aliyun.com/gpu-mem: 3
        workingDir: /root
      restartPolicy: Never  # A Job requires restartPolicy to be set to Never or OnFailure.
Note:
nodeSelector: specifies the label that is used to select the cgpu-no-isolation node pool.
cgpu.disable.isolation=true: the label that is used to schedule the job to the nodes in the cgpu-no-isolation node pool.
aliyun.com/gpu-mem: specifies the amount of GPU memory requested by the job.
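To submit the two jobs, save the preceding templates to files and apply them. The file names cgpu-test.yaml and cgpu-test-no-isolation.yaml are only examples:
kubectl apply -f cgpu-test.yaml
kubectl apply -f cgpu-test-no-isolation.yaml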
Step 3: Check the result
Run the following command to query the status of the job:
kubectl get po
Expected output:
NAME                       READY   STATUS    RESTARTS   AGE
cgpu-test-0                1/1     Running   0          5m55s
cgpu-test-no-isolation-0   1/1     Running   0          6m42s
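To also confirm that each pod is scheduled to the intended node pool, you can add the -o wide flag, which shows the node that runs each pod:
kubectl get po -o wide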
Run the nvidia-smi command in pod cgpu-test-0 (requires memory isolation) to query the amount of GPU memory that can be used by the containers in the pod:
kubectl exec cgpu-test-0 nvidia-smi
Expected output:
Mon Nov  2 11:33:10 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:07.0 Off |                    0 |
| N/A   34C    P0    54W / 300W |   3039MiB /  3226MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
The output shows that the containers can use only 3,226 MiB of GPU memory, whereas the total memory of the GPU is 16 GiB. This indicates that GPU memory isolation is enabled.
Run the nvidia-smi command in pod cgpu-test-no-isolation-0 (does not require GPU memory isolation) to query the amount of GPU memory that can be used by the containers in the pod:
kubectl exec cgpu-test-no-isolation-0 nvidia-smi
Expected output:
Mon Nov  2 11:39:59 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:07.0 Off |                    0 |
| N/A   37C    P0    56W / 300W |   1929MiB / 16130MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
The output shows that the containers can discover 16,130 MiB of GPU memory, which is the total memory of the GPU (16 GiB). This indicates that GPU memory isolation is disabled. In this case, the application must read the following environment variables to determine the amount of GPU memory that it can use. Run the following command to query the environment variables:
kubectl exec cgpu-test-no-isolation-0 env | grep ALIYUN
Expected output:
ALIYUN_COM_GPU_MEM_CONTAINER=3    # The amount of GPU memory that can be used by the containers in the pod. The amount is 3 GiB in this example.
ALIYUN_COM_GPU_MEM_DEV=15         # The total amount of memory provided by the GPU.
...
Compare the nvidia-smi results that are returned from pod cgpu-test-no-isolation-0 and pod cgpu-test-0. The result of pod cgpu-test-no-isolation-0 shows the total amount of GPU memory, whereas the result of pod cgpu-test-0 shows only the amount of GPU memory requested by the pod. This indicates that you can use node pools to control cGPU for GPU sharing and memory isolation.
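After you finish the test, you can delete the two jobs. Deleting the jobs also deletes their pods and releases the GPU memory that was allocated to them:
kubectl delete job cgpu-test cgpu-test-no-isolation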