In some scenarios, you may require GPU sharing without GPU memory isolation. For example, some applications, such as Java applications, let you specify the maximum amount of GPU memory that they can use. If GPU memory isolation is also enforced for these applications, exceptions may occur. To address this problem, you can disable GPU memory isolation on nodes that support GPU sharing. This topic describes how to configure GPU sharing without GPU memory isolation.
Prerequisites
A Container Service for Kubernetes (ACK) dedicated cluster that contains GPU-accelerated nodes is created. For more information, see Create an ACK dedicated cluster with GPU-accelerated nodes.
The ack-cgpu component is installed. For more information, see Install the ack-cgpu component.
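Optionally, you can confirm from the command line that the prerequisites are met before you continue. The following check is a minimal sketch: it assumes that you have kubectl access to the cluster and that the ack-cgpu workloads run in the kube-system namespace with names that contain "cgpu" (an assumption, not guaranteed by this topic).

# List the nodes registered in the cluster.
kubectl get nodes -o wide

# Check that the ack-cgpu workloads are running (pod names containing "cgpu" are an assumption).
kubectl get pods -n kube-system | grep -i cgpu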
Step 1: Create a node pool
Perform the following steps to create a node pool that has GPU memory isolation disabled.
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the name of the cluster that you want to manage and choose Nodes > Node Pools in the left-side navigation pane.
On the Node Pools page, click Create Node Pool. In the Create Node Pool dialog box, configure the parameters and click Confirm Order.
The following table describes the key parameters. For more information, see Create a node pool.
Expected Nodes: Specify the initial number of nodes in the node pool. If you do not want to add nodes to the node pool, set this parameter to 0.
Node Label: Add GPU sharing labels to nodes. For more information about the labels, see Labels for enabling GPU scheduling policies and methods for changing label values.
Click the icon, set Key to gpushare, and set Value to true.
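After nodes are added to the node pool, you can verify from the command line that the GPU sharing label is applied. This is a quick check, assuming kubectl access to the cluster; the label key and value (gpushare=true) are the ones configured above.

# List the nodes that carry the GPU sharing label configured for the node pool.
kubectl get nodes -l gpushare=true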
Step 2: Submit a job
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the name of the cluster that you want to manage and choose Workloads > Jobs in the left-side navigation pane.
On the Jobs page, click Create from YAML. In the code editor on the Create page, paste the following content and click Create.
apiVersion: batch/v1
kind: Job
metadata:
  name: tensorflow-mnist-share
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: tensorflow-mnist-share
    spec:
      containers:
      - name: tensorflow-mnist-share
        image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:tensorflow-1.5
        command:
        - python
        - tensorflow-sample-code/tfjob/docker/mnist/main.py
        - --max_steps=100000
        - --data_dir=tensorflow-sample-code/data
        resources:
          limits:
            aliyun.com/gpu-mem: 4 # Request 4 GiB of GPU memory.
        workingDir: /root
      restartPolicy: Never
Code description:
The YAML content defines a TensorFlow job. The job creates one pod, and the pod requests 4 GiB of GPU memory.
The aliyun.com/gpu-mem: 4 setting under resources.limits requests the 4 GiB of GPU memory.
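If you prefer the command line to the console, you can submit the same job with kubectl. The sketch below assumes that the preceding YAML content is saved to a file named tensorflow-mnist-share.yaml (a hypothetical file name) and that kubectl is configured for the cluster.

# Submit the job defined in the YAML file above (the file name is an assumption).
kubectl apply -f tensorflow-mnist-share.yaml

# Watch the pod created by the job until it reaches the Running state.
kubectl get pods -l app=tensorflow-mnist-share -w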
Step 3: Verify the configuration
Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the name of the cluster that you want to manage and choose Workloads > Pods in the left-side navigation pane.
On the Pods page, find the pod that you created in Step 2 and choose Terminal in the Actions column to log on to the pod. Then, run the following command to query GPU memory information:
nvidia-smi
Expected output:
Wed Jun 14 06:45:56 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...   On  | 00000000:00:09.0 Off |                    0 |
| N/A   35C    P0    59W / 300W |    334MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
The output indicates that the pod can use the full 16,384 MiB of memory provided by the GPU, which is a V100 in this example. If GPU memory isolation were enabled, this value would equal the amount of memory requested by the pod, which is 4 GiB. This indicates that GPU memory isolation is disabled and the configuration has taken effect.
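If you prefer not to use the console terminal, you can run the same check with kubectl exec. This is a sketch; the pod name shown is a placeholder that you must replace with the actual name reported by kubectl get pods.

# Find the pod created by the job.
kubectl get pods -l app=tensorflow-mnist-share

# Run nvidia-smi inside the pod (replace the placeholder pod name with the actual one).
kubectl exec -it tensorflow-mnist-share-xxxxx -- nvidia-smi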
The application must read the GPU memory allocation information from the following environment variables:
ALIYUN_COM_GPU_MEM_CONTAINER=4    # The amount of GPU memory available to the pod.
ALIYUN_COM_GPU_MEM_DEV=16         # The total amount of GPU memory provided by each GPU.
If the application needs to know what fraction of the GPU memory it can use, it can calculate the ratio of the GPU memory allocated to the pod to the total GPU memory of the GPU from the preceding environment variables:
percentage = ALIYUN_COM_GPU_MEM_CONTAINER / ALIYUN_COM_GPU_MEM_DEV = 4 / 16 = 0.25
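Inside the pod, you can compute this ratio directly from the environment variables, for example with a short shell command. This is only an illustration of the formula above; it assumes that both environment variables are set in the container.

# Print the environment variables and the resulting ratio (for example, 4 / 16 = 0.25).
env | grep ALIYUN_COM_GPU_MEM
awk "BEGIN {print $ALIYUN_COM_GPU_MEM_CONTAINER / $ALIYUN_COM_GPU_MEM_DEV}"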