In some scenarios, you may need to use GPU sharing without GPU memory isolation. For example, some applications, such as Java applications, allow you to specify the maximum amount of GPU memory that the applications can use. If you use the GPU memory isolation module provided by GPU sharing, conflicts occur. To avoid this problem, you can choose not to install the GPU memory isolation module on nodes where GPU sharing is configured. This topic describes how to configure GPU sharing without GPU memory isolation.
Prerequisites
Step 1: Create a node pool
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the name of the cluster that you want to manage. In the left-side navigation pane, choose Nodes > Node Pools.
In the upper-right corner of the Node Pools page, click Create Node Pool.
In the Create Node Pool dialog box, configure the node pool and click Confirm Order. The following table describes the key parameters. For more information about other parameters, see Create a node pool.
Parameter | Description
Instance Type | Set Architecture to GPU-accelerated and select multiple GPU-accelerated instance types. In this example, instance types that use the V100 GPU are selected.
Expected Nodes | Specify the initial number of nodes in the node pool. If you do not want to create nodes in the node pool, set this parameter to 0.
Node Label | Set Key to ack.node.gpu.schedule and set Value to share. This label enables GPU sharing and scheduling on the nodes in the node pool; you can verify it after the nodes are created, as shown after this table. For more information about node labels, see Labels for enabling GPU scheduling policies.
Step 2: Submit a job
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Workloads > Jobs.
Click Create from YAML in the upper-right part of the page, copy the following content to the Template section, and then click Create:
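The full template is not reproduced here. The following is a minimal sketch of such a Job; the image name registry.example.com/tensorflow-mnist-sample:v1.5 is a placeholder, so replace it with the TensorFlow MNIST image that you actually use. The aliyun.com/gpu-mem: 4 resource limit is the part that enables GPU sharing:

apiVersion: batch/v1
kind: Job
metadata:
  name: tensorflow-mnist-multigpu
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: tensorflow-mnist-multigpu
    spec:
      containers:
      - name: tensorflow-mnist-multigpu
        # Placeholder image name. Replace it with the TensorFlow MNIST image that you use.
        image: registry.example.com/tensorflow-mnist-sample:v1.5
        resources:
          limits:
            # Request 4 GiB of GPU memory from the GPU sharing scheduler.
            aliyun.com/gpu-mem: 4
      restartPolicy: Never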
YAML template description:
This YAML template defines a TensorFlow MNIST job. The job creates one pod, and the pod requests 4 GiB of GPU memory.
The aliyun.com/gpu-mem: 4 resource limit is added to request 4 GiB of GPU memory.
Step 3: Verify the configuration
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Workloads > Pods.
Click Terminal in the Actions column of the pod that you created, such as tensorflow-mnist-multigpu-***, to log on to the pod and run the following command:
nvidia-smi
Expected output:
Wed Jun 14 06:45:56 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:09.0 Off |                    0 |
| N/A   35C    P0    59W / 300W |    334MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
In this example, a V100 GPU is used. The output indicates that the pod can use all memory provided by the GPU, which is 16,384 MiB in size. This means that GPU sharing is implemented without GPU memory isolation. If GPU memory isolation is enabled, the memory size displayed in the output will equal the amount of memory requested by the pod, which is 4 GiB in this example.
The pod determines the amount of GPU memory that it can use based on the following environment variables:
ALIYUN_COM_GPU_MEM_CONTAINER=4 # The amount of GPU memory that the pod can use.
ALIYUN_COM_GPU_MEM_DEV=16      # The memory size of each GPU.
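For example, you can print these variables from the same pod terminal that you opened in the previous step:

env | grep ALIYUN_COM_GPU_MEM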
To calculate the ratio of the GPU memory that the pod can use to the total GPU memory, use the following formula:
percentage = ALIYUN_COM_GPU_MEM_CONTAINER / ALIYUN_COM_GPU_MEM_DEV = 4 / 16 = 0.25
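If your application entrypoint needs this ratio, for example to pass it to a framework flag that caps GPU memory usage, a minimal shell sketch that computes it from the two environment variables might look like the following. The FRACTION variable name and the idea of forwarding it to the application are assumptions, not part of the GPU sharing component:

# Compute the fraction of GPU memory that the pod is allowed to use.
FRACTION=$(awk -v c="$ALIYUN_COM_GPU_MEM_CONTAINER" -v d="$ALIYUN_COM_GPU_MEM_DEV" 'BEGIN { printf "%.2f", c / d }')
echo "$FRACTION"   # Prints 0.25 when ALIYUN_COM_GPU_MEM_CONTAINER=4 and ALIYUN_COM_GPU_MEM_DEV=16.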