This topic describes how to use cGPU in a Container Service for Kubernetes (ACK) dedicated cluster that contains GPU-accelerated nodes after you install the ack-cgpu component.
Prerequisites
The ack-cgpu component is installed in your cluster. For more information, see Install ack-cgpu.
Usage notes
When you request and use GPU resources for applications on GPU-accelerated nodes that are managed in Container Service for Kubernetes (ACK) clusters, take note of the following items:
- Do not run GPU-heavy applications directly on nodes.
- Do not use tools such as Docker, Podman, or nerdctl to create containers and request GPU resources for the containers. For example, do not run the docker run --gpus all or docker run -e NVIDIA_VISIBLE_DEVICES=all command and then run GPU-heavy applications.
- Do not add the NVIDIA_VISIBLE_DEVICES=all or NVIDIA_VISIBLE_DEVICES=<GPU ID> environment variable to the env section in the pod YAML file. Do not use the NVIDIA_VISIBLE_DEVICES environment variable to request GPU resources for pods and then run GPU-heavy applications.
- Do not set NVIDIA_VISIBLE_DEVICES=all when you build container images and then run GPU-heavy applications if the NVIDIA_VISIBLE_DEVICES environment variable is not specified in the pod YAML file.
- Do not add privileged: true to the securityContext section in the pod YAML file and then run GPU-heavy applications.
Requesting GPU resources for your applications by using the preceding methods poses the following risks:
- GPU resources that are requested in these ways are not recorded in the device resource ledger of the scheduler. As a result, the actual GPU allocation on the node may differ from what the scheduler records, and the scheduler may continue to schedule pods that request GPU resources to the node. Your applications may then compete for the resources of the same GPU, and some applications may fail to start up due to insufficient GPU resources.
- Using the preceding methods may also cause other unknown issues, such as the issues reported by the NVIDIA community.
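The recommended way to request GPU resources on a shared GPU is to declare them through the aliyun.com/gpu-mem resource in the pod specification, as the sample pod later in this topic does, so that the scheduler can track the allocation. The following is a minimal sketch only; the pod name, image, and the requested amount of 3 GiB are placeholders for illustration.
# Minimal sketch: request shared GPU memory through the scheduler instead of
# environment variables or privileged mode. The pod name, image, and the
# 3 GiB amount are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-app                              # hypothetical pod name
spec:
  containers:
  - name: my-gpu-app
    image: my-registry/my-gpu-app:latest        # hypothetical image
    resources:
      limits:
        aliyun.com/gpu-mem: 3                   # request 3 GiB of GPU memory
    # Do not set NVIDIA_VISIBLE_DEVICES in the env section or add
    # privileged: true to the securityContext section.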
Procedure
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the name of the cluster that you want to manage and choose Workloads > Jobs in the left-side navigation pane.
Log on to a master node and run the following command to query the GPU sharing status of the cluster:
Note: For more information about how to log on to a master node, see Connect to an instance by using VNC or Connect to a Windows instance by using a password.
If you want to query the GPU sharing status of a cluster from an on-premises machine, you must install ack-cgpu and a GPU inspection tool. For more information, see Step 4: Install and use the GPU inspection tool.
kubectl inspect cgpu
Expected output:
NAME                       IPADDRESS      GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU Memory(GiB)
cn-beijing.192.168.XX.XX   192.168.XX.XX  0/7                    0/7                    0/14
---------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
0/14 (0%)
Note: To query detailed information about GPU sharing, run the kubectl inspect cgpu -d command.
In the left-side navigation pane, choose Workloads > Jobs. In the upper-right corner of the Jobs page, click Create from YAML. On the Create page, select a namespace from the Namespace drop-down list and select an existing template or a custom template from the Sample Template drop-down list. Enter the following content into the code editor and click Create.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-share-sample
spec:
  containers:
  - name: gpu-share-sample
    image: registry.cn-hangzhou.aliyuncs.com/ai-samples/gpushare-sample:tensorflow-1.5
    command:
    - python
    - tensorflow-sample-code/tfjob/docker/mnist/main.py
    - --max_steps=100000
    - --data_dir=tensorflow-sample-code/data
    resources:
      limits:
        # The pod requests 3 GiB of GPU memory in total.
        aliyun.com/gpu-mem: 3 # Specify the requested amount of GPU memory.
    workingDir: /root
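Alternatively, if you prefer kubectl over the console, you can save the preceding YAML to a file on the master node and create the pod from the command line. The file name gpu-share-sample.yaml below is only an assumed example.
# Save the YAML above as gpu-share-sample.yaml (assumed file name), then create the pod.
kubectl apply -f gpu-share-sample.yaml
# Confirm that the pod is in the Running state before you check GPU memory usage.
kubectl get pod gpu-share-sample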
Run the following command again on the master node to query the memory usage of the GPU:
kubectl inspect cgpu
Expected output:
NAME                       IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
cn-beijing.192.168.XX.XX   192.168.XX.XX  3/14                   3/14
---------------------------------------------------------------------
Allocated/Total GPU Memory In Cluster:
3/14 (21%)
The output shows that the total GPU memory of the node cn-beijing.192.168.XX.XX is 14 GiB and 3 GiB of GPU memory has been allocated.
Verify the result
You can use the following method to check whether GPU memory isolation is enabled for the node.
Log on to the master node.
Run the following command to print the log of the deployed application to check whether GPU memory isolation is enabled:
kubectl logs gpu-share-sample --tail=1
Expected output:
2023-08-07 09:08:13.931003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2832 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:07.0, compute capability: 7.5)
The output indicates that 2,832 MiB of GPU memory is requested by the container.
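If the device creation entry has not been written yet, for example because the pod has just started, you can stream the log and wait for the entry to appear. This uses a standard kubectl option and is not specific to cGPU.
# Stream the application log until the TensorFlow device creation entry appears.
kubectl logs -f gpu-share-sample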
Run the following command to log on to the container and view the amount of GPU memory that is allocated to the container:
kubectl exec -it gpu-share-sample -- nvidia-smi
Expected output:
Mon Aug  7 08:52:18 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:07.0 Off |                    0 |
| N/A   41C    P0    26W /  70W |   3043MiB /  3231MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
The output indicates that the amount of GPU memory allocated to the container is 3,231 MiB.
Run the following command on the GPU-accelerated node where the application is deployed to query the total GPU memory of the node:
nvidia-smi
Expected output:
Mon Aug  7 09:18:26 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:07.0 Off |                    0 |
| N/A   40C    P0    26W /  70W |   3053MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      8796      C   python3                                     3043MiB |
+-----------------------------------------------------------------------------+
The output indicates that the total GPU memory of the node is 15,079 MiB and 3,053 MiB of GPU memory is allocated to the container.