This topic describes how to use cGPU in a Container Service for Kubernetes (ACK) dedicated cluster that contains GPU-accelerated nodes after you install the ack-cgpu component.
Prerequisites
The ack-cgpu component is installed in your cluster. For more information, see Install ack-cgpu.
Usage notes
When you request and use GPU resources for applications on GPU-accelerated nodes that are managed in Container Service for Kubernetes (ACK) clusters, take note of the following items:
- Do not run GPU-heavy applications directly on nodes.
- Do not use tools such as Docker, Podman, or nerdctl to create containers and request GPU resources for the containers. For example, do not run the docker run --gpus all or docker run -e NVIDIA_VISIBLE_DEVICES=all command and then run GPU-heavy applications.
- Do not add the NVIDIA_VISIBLE_DEVICES=all or NVIDIA_VISIBLE_DEVICES=<GPU ID> environment variable to the env section in the pod YAML file. Do not use the NVIDIA_VISIBLE_DEVICES environment variable to request GPU resources for pods and then run GPU-heavy applications.
- Do not set NVIDIA_VISIBLE_DEVICES=all when you build container images and then run GPU-heavy applications if the NVIDIA_VISIBLE_DEVICES environment variable is not specified in the pod YAML file.
- Do not add privileged: true to the securityContext section in the pod YAML file and then run GPU-heavy applications.
Requesting GPU resources for your applications by using the preceding methods poses the following risks:
- GPU resources that are requested in these ways are not recorded in the device resource ledger of the scheduler. As a result, the actual GPU allocation on the node may differ from what the scheduler records, and the scheduler may continue to schedule pods that request GPU resources to the node. Your applications may then compete for the resources of the same GPU, and some applications may fail to start up due to insufficient GPU resources.
- Using the preceding methods may also cause other unknown issues, such as the issues reported by the NVIDIA community.
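The recommended way to request GPU resources on a shared GPU is to declare them through the aliyun.com/gpu-mem resource in the pod specification, as the sample pod later in this topic does, so that the scheduler can track the allocation. The following is a minimal sketch only; the pod name, image, and the requested amount of 3 GiB are placeholders for illustration.
# Minimal sketch: request shared GPU memory through the scheduler instead of
# environment variables or privileged mode. The pod name, image, and the
# 3 GiB amount are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-app                              # hypothetical pod name
spec:
  containers:
  - name: my-gpu-app
    image: my-registry/my-gpu-app:latest        # hypothetical image
    resources:
      limits:
        aliyun.com/gpu-mem: 3                   # request 3 GiB of GPU memory
    # Do not set NVIDIA_VISIBLE_DEVICES in the env section or add
    # privileged: true to the securityContext section.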
Procedure
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the name of the cluster that you want to manage and choose Workloads > Jobs in the left-side navigation pane.
Log on to a master node and run the following command to query the GPU sharing status of the cluster:
Note: For more information about how to log on to a master node, see Connect to an instance by using VNC or Connect to a Windows instance by using a password.
If you want to query the GPU sharing status of a cluster from an on-premises machine, you must install ack-cgpu and a GPU inspection tool. For more information, see Step 4: Install and use the GPU inspection tool.
kubectl inspect cgpu
Expected output:
NAME                       IPADDRESS      GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU Memory(GiB)
cn-beijing.192.168.XX.XX   192.168.XX.XX  0/7                    0/7                    0/14
---------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
0/14 (0%)
Note: To query detailed information about GPU sharing, run the kubectl inspect cgpu -d command.
In the left-side navigation pane, choose Workloads > Jobs. In the upper-right corner of the Jobs page, click Create from YAML. On the Create page, select a namespace from the Namespace drop-down list and select an existing template or a custom template from the Sample Template drop-down list. Enter the following content into the code editor and click Create.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-share-sample
spec:
  containers:
  - name: gpu-share-sample
    image: registry.cn-hangzhou.aliyuncs.com/ai-samples/gpushare-sample:tensorflow-1.5
    command:
    - python
    - tensorflow-sample-code/tfjob/docker/mnist/main.py
    - --max_steps=100000
    - --data_dir=tensorflow-sample-code/data
    resources:
      limits:
        # The pod requests 3 GiB of GPU memory in total.
        aliyun.com/gpu-mem: 3 # Specify the requested amount of GPU memory.
    workingDir: /root
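Alternatively, if you prefer kubectl over the console, you can save the preceding YAML to a file on the master node and create the pod from the command line. The file name gpu-share-sample.yaml below is only an assumed example.
# Save the YAML above as gpu-share-sample.yaml (assumed file name), then create the pod.
kubectl apply -f gpu-share-sample.yaml
# Confirm that the pod is in the Running state before you check GPU memory usage.
kubectl get pod gpu-share-sample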
Run the following command again on the master node to query the memory usage of the GPU:
kubectl inspect cgpu
Expected output:
NAME                       IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
cn-beijing.192.168.XX.XX   192.168.XX.XX  3/14                   3/14
---------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
3/14 (21%)
The output shows that the total GPU memory of the node cn-beijing.192.168.XX.XX is 14 GiB and 3 GiB of GPU memory has been allocated.
Verify the result
You can use the following method to check whether GPU memory isolation is enabled for the node.
Log on to the master node.
Run the following command to print the log of the deployed application to check whether GPU memory isolation is enabled:
kubectl logs gpu-share-sample --tail=1
Expected output:
2023-08-07 09:08:13.931003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2832 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:07.0, compute capability: 7.5)
The output indicates that 2,832 MiB of GPU memory is requested by the container.
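If the device creation entry has not been written yet, for example because the pod has just started, you can stream the log and wait for the entry to appear. This uses a standard kubectl option and is not specific to cGPU.
# Stream the application log until the TensorFlow device creation entry appears.
kubectl logs -f gpu-share-sample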
Run the following command to log on to the container and view the amount of GPU memory that is allocated to the container:
kubectl exec -it gpu-share-sample -- nvidia-smi
Expected output:
Mon Aug  7 08:52:18 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:07.0 Off |                    0 |
| N/A   41C    P0    26W /  70W |   3043MiB /  3231MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
The output indicates that the amount of GPU memory allocated to the container is 3,231 MiB.
Run the following command on the GPU-accelerated node where the application is deployed to query the total GPU memory of the node:
nvidia-smi
Expected output:
Mon Aug  7 09:18:26 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:07.0 Off |                    0 |
| N/A   40C    P0    26W /  70W |   3053MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      8796      C   python3                                     3043MiB |
+-----------------------------------------------------------------------------+
The output indicates that the total GPU memory of the node is 15,079 MiB and 3,053 MiB of GPU memory is allocated to the container.