Container Service for Kubernetes: Disable the memory isolation feature of cGPU

Last Updated: Oct 20, 2023

This topic provides an example of how to disable the memory isolation feature of cGPU for a Container Service for Kubernetes (ACK) cluster.

Scenarios

This topic applies to ACK dedicated clusters and ACK Pro clusters that have the memory isolation feature of cGPU enabled.

Prerequisites

The ack-cgpu component is installed in your cluster. For more information, see Install ack-cgpu or Install and use ack-ai-installer and the GPU inspection tool.
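
If you are not sure whether ack-cgpu is installed, you can check whether its workloads exist in the cluster. The following command is only a quick check that assumes the component's pods run in the kube-system namespace and that their names contain "cgpu"; the exact names can vary with the component version.

    kubectl get pods -n kube-system | grep -i cgpu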

Procedure

  1. Run the following command to query the status of GPU sharing in your cluster:

    kubectl inspect cgpu

    Expected output:

    NAME                      IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
    cn-beijing.192.16x.x.xx3  192.16x.x.xx3  0/15                   0/15
    cn-beijing.192.16x.x.xx1  192.16x.x.xx1  0/15                   0/15
    cn-beijing.192.16x.x.xx2  192.16x.x.xx2  0/15                   0/15
    ---------------------------------------------------------------------
    Allocated/Total GPU Memory In Cluster:
    0/45 (0%)

    Note

    To query detailed information about GPU sharing, run the kubectl inspect cgpu -d command.

  2. Use the following YAML template to create a container for which GPU sharing is enabled and memory isolation is disabled:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: disable-cgpu
    spec:
      parallelism: 1
      template:
        metadata:
          labels:
            app: disable-cgpu
        spec:
          containers:
          - name: disable-cgpu
            image: registry.cn-hangzhou.aliyuncs.com/ai-samples/gpushare-sample:tensorflow-1.5
            env:
            - name: CGPU_DISABLE # Disable the memory isolation feature of cGPU. 
              value: "true"
            command:
            - python
            - tensorflow-sample-code/tfjob/docker/mnist/main.py
            - --max_steps=100000
            - --data_dir=tensorflow-sample-code/data
            resources:
              limits:
                # The pod requests 3 GiB of GPU memory in total. 
                aliyun.com/gpu-mem: 3
            workingDir: /root
          restartPolicy: Never

    Note

    • aliyun.com/gpu-mem: specifies the amount of GPU memory requested by the job.

    • To disable the memory isolation feature of cGPU, set CGPU_DISABLE to true.

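    Save the template to a file and create the job with kubectl. The file name disable-cgpu.yaml used below is only an example:

    kubectl apply -f disable-cgpu.yaml
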
  3. Run the following command to query the result of GPU scheduling performed by cGPU:

    kubectl inspect cgpu

    Expected output:

    NAME                      IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
    cn-beijing.192.16x.x.xx1  192.16x.x.xx1  0/15                   0/15
    cn-beijing.192.16x.x.xx2  192.16x.x.xx2  0/15                   0/15
    cn-beijing.192.16x.x.xx3  192.16x.x.xx3  3/15                   3/15
    ---------------------------------------------------------------------
    Allocated/Total GPU Memory In Cluster:
    3/45 (6%)

    The newly created container is allocated 3 GiB of GPU memory on the cn-beijing.192.16x.x.xx3 node.
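
    To confirm the node to which the pod is scheduled, you can also list the pod together with its node name. The app=disable-cgpu label comes from the pod template in Step 2:

    kubectl get pods -l app=disable-cgpu -o wide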

Check the result

You can use one of the following methods to check whether the memory isolation feature of cGPU is disabled. In the following commands, disable-cgpu-xxxx is a placeholder for the name of the pod that the job created; see the example after this list for one way to look it up.

  • Method 1: Run the following command to query the application log:

    kubectl logs disable-cgpu-xxxx --tail=1

    Expected output:

    2020-08-25 08:14:54.927965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15024 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:07.0, compute capability: 7.0)

    The log entry shows that the containerized application can use 15,024 MiB of GPU memory, which indicates that the memory isolation feature of cGPU is disabled. If memory isolation were enabled, the application would discover only 3 GiB of GPU memory.

  • Method 2: Run the following command to run nvidia-smi in the container and view the amount of GPU memory that is allocated to the container:

    kubectl exec disable-cgpu-xxxx -- nvidia-smi

    Expected output:

    Tue Aug 25 08:23:33 2020
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  Off  | 00000000:00:07.0 Off |                    0 |
    | N/A   33C    P0    55W / 300W |  15453MiB / 16130MiB |      1%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    +-----------------------------------------------------------------------------+

    The output shows that the GPU memory capacity of the host is 16,130 MiB and that 15,453 MiB of it is allocated to the container, which indicates that the memory isolation feature of cGPU is disabled. If memory isolation were enabled, the container would be allocated only 3 GiB of GPU memory.
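
To look up the actual pod name for the commands above, you can filter by the app=disable-cgpu label from the job template. The following lines are only an example of how to store the name in a shell variable and reuse it:

    POD_NAME=$(kubectl get pods -l app=disable-cgpu -o jsonpath='{.items[0].metadata.name}')
    kubectl logs "$POD_NAME" --tail=1
    kubectl exec "$POD_NAME" -- nvidia-smi

After you finish the test, you can delete the job to release the GPU memory that it requested:

    kubectl delete job disable-cgpu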