Container Service for Kubernetes: Configure GPU sharing without GPU memory isolation

Last Updated: Oct 29, 2024

In some scenarios, you may need to use GPU sharing without GPU memory isolation. For example, some applications, such as Java applications, allow you to specify the maximum amount of GPU memory that they can use. If these applications also run with the GPU memory isolation module provided by GPU sharing, the two memory limits conflict. To avoid this problem, you can choose not to install the GPU memory isolation module on nodes that are configured for GPU sharing. This topic describes how to configure GPU sharing without GPU memory isolation.

Prerequisites

Step 1: Create a node pool

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, click the name of the cluster that you want to manage and choose Nodes > Node Pools in the left-side navigation pane.

  3. In the upper-right corner of the Node Pools page, click Create Node Pool.

  4. In the Create Node Pool dialog box, configure the node pool and click Confirm Order. The following list describes the key parameters. For more information about other parameters, see Create a node pool.

    • Instance Type: Set Architecture to GPU-accelerated and select multiple GPU-accelerated instance types. In this example, instance types that use the V100 GPU are selected.

    • Expected Nodes: Specify the initial number of nodes in the node pool. If you do not want to create nodes in the node pool, set this parameter to 0.

    • Node Label: Click the add icon, set Key to ack.node.gpu.schedule, and set Value to share. This label enables GPU sharing and scheduling on the nodes in the node pool.

    For more information about node labels, see Labels for enabling GPU scheduling policies.
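
    If you manage nodes with kubectl, the same label can also be applied to an existing GPU-accelerated node. The following command is a minimal sketch; <NODE_NAME> is a placeholder for the name of your node:

    # Label an existing node to enable GPU sharing without memory isolation.
    kubectl label node <NODE_NAME> ack.node.gpu.schedule=share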

Step 2: Submit a job

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Workloads > Jobs.

  3. Click Create from YAML in the upper-right part of the page, copy the following content to the Template section, and then click Create:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: tensorflow-mnist-share
    spec:
      parallelism: 1
      template:
        metadata:
          labels:
            app: tensorflow-mnist-share
        spec:
          containers:
          - name: tensorflow-mnist-share
            image: registry.cn-beijing.aliyuncs.com/ai-samples/gpushare-sample:tensorflow-1.5
            command:
            - python
            - tensorflow-sample-code/tfjob/docker/mnist/main.py
            - --max_steps=100000
            - --data_dir=tensorflow-sample-code/data
            resources:
              limits:
                aliyun.com/gpu-mem: 4 # Request 4 GiB of GPU memory.
            workingDir: /root
          restartPolicy: Never

    YAML template description:

    • This YAML template defines a TensorFlow MNIST job. The job creates one pod, and the pod requests 4 GiB of GPU memory.

    • The aliyun.com/gpu-mem: 4 resource limit is added to request 4 GiB of GPU memory.
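
    If you prefer to submit the Job from the command line, the following sketch assumes that the YAML template above is saved as tensorflow-mnist-share.yaml (the file name is only an example) and that kubectl is configured for the cluster:

    # Submit the Job defined in the YAML template above.
    kubectl apply -f tensorflow-mnist-share.yaml
    # Check the pod created by the Job. The app label matches the YAML template.
    kubectl get pods -l app=tensorflow-mnist-share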

Step 3: Verify the configuration

  1. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Workloads > Pods.

  2. Click Terminal in the Actions column of the pod that you created, such as tensorflow-mnist-share-***, to log on to the pod and run the following command:

    nvidia-smi

    Expected output:

    Wed Jun 14 06:45:56 2023
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  On   | 00000000:00:09.0 Off |                    0 |
    | N/A   35C    P0    59W / 300W |    334MiB / 16384MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    +-----------------------------------------------------------------------------+

    In this example, a V100 GPU is used. The output indicates that the pod can use all memory provided by the GPU, which is 16,384 MiB in size. This means that GPU sharing is implemented without GPU memory isolation. If GPU memory isolation is enabled, the memory size displayed in the output will equal the amount of memory requested by the pod, which is 4 GiB in this example.
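
    You can also run the same check without the console terminal. The following command is a sketch that assumes kubectl access to the cluster; <POD_NAME> is a placeholder for the actual name of the pod:

    # Run nvidia-smi inside the pod created by the Job.
    kubectl exec -it <POD_NAME> -- nvidia-smi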

    The pod determines the amount of GPU memory that it can use based on the following environment variables:

    ALIYUN_COM_GPU_MEM_CONTAINER=4 # The amount of GPU memory that the pod can use, in GiB.
    ALIYUN_COM_GPU_MEM_DEV=16 # The total memory size of each GPU, in GiB.

    To calculate the ratio of the GPU memory that the pod can use to the total GPU memory, use the following formula:

    percentage = ALIYUN_COM_GPU_MEM_CONTAINER / ALIYUN_COM_GPU_MEM_DEV = 4 / 16 = 0.25
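
    For example, you can print these variables and compute the ratio from the pod's terminal. The following is a minimal sketch that uses awk only to perform the division:

    echo $ALIYUN_COM_GPU_MEM_CONTAINER # 4
    echo $ALIYUN_COM_GPU_MEM_DEV # 16
    # Divide the requested GPU memory by the total GPU memory.
    awk "BEGIN {print $ALIYUN_COM_GPU_MEM_CONTAINER / $ALIYUN_COM_GPU_MEM_DEV}" # 0.25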