ACK Pro clusters support GPU sharing, which allows containers to share GPU computing resources while GPU memory remains isolated. This topic describes how to configure a multiple GPU sharing policy.
Prerequisites
Introduction to multiple GPU sharing
You can use multiple GPU sharing only in scenarios where containers share computing power and GPU memory is isolated. Multiple GPU sharing is not supported if computing power is not shared.
When developing models, developers may need more than one GPU, but the development platform may not need the full resources of each GPU. If entire GPUs are exclusively allocated to the development platform, resources are wasted. Multiple GPU sharing avoids this problem.
Multiple GPU sharing works in the following way: an application requests a total of N GiB of GPU memory and requires M GPUs to allocate the requested memory. Each GPU allocates N/M GiB of memory, the value of N/M must be an integer, and all M GPUs must be installed on the same node. For example, if an application requests 8 GiB of memory and requires 2 GPUs, a node must allocate 2 GPUs to the application and each GPU must allocate 4 GiB of memory.
Differences between single GPU sharing and multiple GPU sharing (see the pod spec sketch after this list):
Single GPU sharing: A pod can request GPU resources that are allocated by only one GPU.
Multiple GPU sharing: A pod can request GPU resources that are evenly allocated by multiple GPUs.
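The following minimal pod spec fragments illustrate the difference. Only the GPU sharing fields described in this topic are shown; all other fields are omitted:

# Single GPU sharing: one GPU allocates the entire 8 GiB of GPU memory.
spec:
  containers:
  - resources:
      limits:
        aliyun.com/gpu-mem: 8

# Multiple GPU sharing: the aliyun.com/gpu-count pod label splits the request
# across two GPUs on the same node, so each GPU allocates 4 GiB.
metadata:
  labels:
    aliyun.com/gpu-count: "2"
spec:
  containers:
  - resources:
      limits:
        aliyun.com/gpu-mem: 8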
Configure a multiple GPU sharing policy
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the name of the cluster that you want to manage, and then go to the workloads page from the left-side navigation pane.
Click Create from YAML in the upper-right part of the page. Copy the following content to the Template section and click Create:
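The following is a minimal sketch of such a job. The job name is an example, and the image and command are placeholders that you must replace with your own TensorFlow MNIST training image and entry point:

apiVersion: batch/v1
kind: Job
metadata:
  name: tensorflow-mnist-multigpu
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        # Pod label: split the requested GPU memory across two GPUs on the same node.
        aliyun.com/gpu-count: "2"
    spec:
      containers:
      - name: tensorflow-mnist-multigpu
        # Placeholders: replace with your own TensorFlow MNIST training image and entry point.
        image: <your-tensorflow-mnist-image>
        command: ["python", "main.py"]
        resources:
          limits:
            # Request 8 GiB of GPU memory in total; each of the two GPUs allocates 4 GiB.
            aliyun.com/gpu-mem: 8
      restartPolicy: Never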
YAML template description:
The YAML template defines a TensorFlow MNIST job. The job requests 8 GiB of memory allocated by 2 GPUs. Each GPU allocates 4 GiB of memory.
Add the aliyun.com/gpu-count=2 pod label to request two GPUs.
Add the aliyun.com/gpu-mem: 8 resource limit to request 8 GiB of GPU memory.
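If you prefer the command line, you can also save the template to a file and create the job by running kubectl apply. The file name below is an example:

kubectl apply -f tensorflow-mnist-multigpu.yaml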
Verify the multiple GPU sharing policy
On the Clusters page, click the name of the cluster that you want to manage, and then go to the pod list from the left-side navigation pane.
Find the pod that you created, such as tensorflow-mnist-multigpu-***, click Terminal in the Actions column to log on to the pod, and then run the following command:
nvidia-smi
Expected output:
Wed Jun 14 03:24:14 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:09.0 Off |                    0 |
| N/A   38C    P0    61W / 300W |    569MiB /  4309MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:0A.0 Off |                    0 |
| N/A   36C    P0    61W / 300W |    381MiB /  4309MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
The output indicates that the pod can use only two GPUs, and that each GPU provides only 4,309 MiB of memory, which corresponds to the amount requested by the pod. The actual memory capacity of each GPU is 16,160 MiB, which shows that GPU memory is isolated.
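If you have kubectl access to the cluster, you can run the same check without opening a console terminal. Replace <pod-name> with the name of your pod:

kubectl exec -it <pod-name> -- nvidia-smi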
Click Logs in the Actions column of the pod to view its logs. The following information is displayed:
totalMemory: 4.21GiB freeMemory: 3.91GiB
totalMemory: 4.21GiB freeMemory: 3.91GiB
The device information indicates that each GPU provides about 4 GiB of memory to the pod, not the actual capacity of 16,160 MiB per GPU. This confirms that GPU memory isolation is implemented.
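You can also view the same log output with kubectl. Replace <pod-name> with the name of your pod:

kubectl logs <pod-name>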