Elastic GPU Service: Use cGPU by running the Docker command line

Last Updated: Oct 17, 2024

You can use cGPU to isolate GPU resources so that multiple containers can share a single GPU. cGPU is provided as a component of Container Service for Kubernetes (ACK) and is applicable to scenarios that require high performance computing (HPC) capabilities, such as machine learning, deep learning, and scientific computing. You can use cGPU to efficiently utilize GPU resources and accelerate computing tasks. This topic describes how to install and use cGPU.

Note

If you use cGPU to isolate GPU resources, you cannot request GPU memory by using Unified Virtual Memory (UVM). Therefore, you cannot request GPU memory by calling cudaMallocManaged() of the Compute Unified Device Architecture (CUDA) API. You can request GPU memory by using other methods. For example, you can call cudaMalloc(). For more information, see Unified Memory for CUDA Beginners.

Prerequisites

Before you perform the operations, make sure that your GPU-accelerated instance meets the following requirements. You can verify the requirements by running the commands after this list:

  • The instance belongs to one of the following instance families: gn7i, gn6i, gn6v, gn6e, gn5i, gn5, ebmgn7i, ebmgn6i, ebmgn7e, and ebmgn6e.

  • The instance runs CentOS, Ubuntu, or Alibaba Cloud Linux (Alinux).

  • A Tesla driver of version 418.87.01 or later is installed on the instance.

  • Docker 19.03.5 or later is installed on the instance.
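
As a quick check, the following commands are one way to confirm the versions listed above. They use only standard nvidia-smi, Docker, and Linux options.

# Check the installed NVIDIA Tesla driver version (418.87.01 or later is required).
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Check the Docker version (19.03.5 or later is required).
docker --version

# Check the operating system release.
cat /etc/os-release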

Install cGPU

We recommend that you install and use cGPU by using the Docker runtime environment of ACK, regardless of whether you are an enterprise user or an individual user.

Important

If you install cGPU 1.5.7, parallel processes may interfere with each other and lock the cGPU kernel driver, which triggers a Linux kernel panic. To prevent kernel issues in new business, we recommend that you install cGPU 1.5.8 or later, or update cGPU to 1.5.8 or later.
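
If cGPU is already installed on an instance, one way to check the installed version is to read the version procfs node that is described later in this topic:

cat /proc/cgpu_km/version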

  1. Create a cluster.

    For more information, see Create an ACK managed cluster.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Applications > Cloud-native AI Suite.

  3. On the Cloud-native AI Suite page, click Deploy.

  4. In the Basic Capabilities section, select Scheduling Policy Extension (Batch Task Scheduling, GPU Sharing, Topology-aware GPU Scheduling).

  5. At the bottom of the page, click Deploy Cloud-native AI Suite.

    After cGPU is installed, the ack-ai-installer component is displayed in the Deployed state on the Cloud-native AI Suite page. You can also check the component pods from the command line, as shown in the following sketch.
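
    If you prefer the command line, the following kubectl sketch is one way to confirm that the component pods are running after deployment. The kube-system namespace and the ack-ai-installer name are assumptions; the actual pod names and namespace may differ in your cluster.

    kubectl get pods -n kube-system | grep -i ack-ai-installer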

Use cGPU

This section provides an example of how to use cGPU to allow two containers to share one GPU. In this example, an ecs.gn6i-c4g1.xlarge instance is used.

Environment variables that can affect cGPU

When you create a container, you can specify values for the following environment variables to control the GPU memory and computing power that the container obtains through cGPU. A usage sketch for CGPU_DISABLE follows this list.

  • CGPU_DISABLE (Boolean): Specifies whether to disable cGPU. Valid values:

    • false: enables cGPU.

    • true: disables cGPU and uses the default NVIDIA container service.

    Default value: false. Example: true.

  • ALIYUN_COM_GPU_MEM_DEV (Integer): The total memory of each GPU on the GPU-accelerated instance. The value varies based on the instance type and must be an integer. Unit: GiB. For example, if you use an ecs.gn6i-c4g1.xlarge instance that is configured with an NVIDIA Tesla T4 GPU, you can run the nvidia-smi command on the instance to obtain the total memory of the GPU. Example: 15. The sample command output shows that the total memory of the GPU is 15,109 MiB, which is rounded to 15 GiB.

  • ALIYUN_COM_GPU_MEM_CONTAINER (Integer): The GPU memory that is allocated to the container. Unit: GiB. This variable is used together with ALIYUN_COM_GPU_MEM_DEV. For a GPU whose total memory is 15 GiB, if you set ALIYUN_COM_GPU_MEM_DEV to 15 and ALIYUN_COM_GPU_MEM_CONTAINER to 1, the container is allocated 1 GiB of GPU memory. Note: If this variable is left empty or set to 0, the default NVIDIA container service is used instead of cGPU. Example: 1.

  • ALIYUN_COM_GPU_VISIBLE_DEVICES (Integer or UUID): The GPUs that are allocated to the container. For example, if you run the nvidia-smi -L command on a GPU-accelerated instance that is configured with four GPUs, the command output returns the device numbers and UUIDs of the GPUs:

    GPU 0: Tesla T4 (UUID: GPU-b084ae33-e244-0959-cd97-83****)
    GPU 1: Tesla T4 (UUID: GPU-3eb465ad-407c-4a23-0c5f-bb****)
    GPU 2: Tesla T4 (UUID: GPU-2fce61ea-2424-27ec-a2f1-8b****)
    GPU 3: Tesla T4 (UUID: GPU-22401369-db12-c6ce-fc48-d7****)

    Then, you can set ALIYUN_COM_GPU_VISIBLE_DEVICES to one of the following values:

    • 0,1: the first and second GPUs are allocated to the container.

    • GPU-b084ae33-e244-0959-cd97-83****,GPU-3eb465ad-407c-4a23-0c5f-bb****,GPU-2fce61ea-2424-27ec-a2f1-8b****: the three GPUs with the specified UUIDs are allocated to the container.

    Example: 0,1.

  • ALIYUN_COM_GPU_SCHD_WEIGHT (Integer): The weight based on which the container obtains computing power. Valid values: 1 to max_inst. Example: none.

  • ALIYUN_COM_GPU_HIGH_PRIO (Integer): Specifies whether to configure a high priority for the container. Valid values:

    • 0: configures a regular priority for the container.

    • 1: configures a high priority for the container.

    Default value: 0. Example: 0.

    Note: We recommend that you configure at least one high-priority container for each GPU. GPU computing power is allocated to high-priority containers based on the scheduling policy that is specified by the policy parameter.

    • If a GPU task runs in a high-priority container, the container is not restricted by the scheduling policy and can preempt GPU computing power.

    • If no GPU task runs in a high-priority container, the container is not involved in the scheduling process and is not allocated GPU computing power.
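
For example, the following sketch creates a container that bypasses cGPU by setting CGPU_DISABLE to true, so the default NVIDIA container service is used. The container name gpu_no_cgpu is an arbitrary choice, and the TensorFlow image is the one used in the Run cGPU section; replace both as needed.

sudo docker run -d -t --gpus all --name gpu_no_cgpu -e CGPU_DISABLE=true nvcr.io/nvidia/tensorflow:19.10-py3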

Run cGPU

  1. Run the following commands to create containers and specify the GPU memory that is allocated to the containers.

    In this example, ALIYUN_COM_GPU_MEM_CONTAINER, which specifies the GPU memory allocated to the container, and ALIYUN_COM_GPU_MEM_DEV, which specifies the total GPU memory, are configured. The following containers are created:

    • gpu_test1: This container is allocated 6 GiB of GPU memory.

      sudo docker run -d -t --gpus all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --name gpu_test1 -v /mnt:/mnt -e ALIYUN_COM_GPU_MEM_CONTAINER=6 -e ALIYUN_COM_GPU_MEM_DEV=15 nvcr.io/nvidia/tensorflow:19.10-py3
    • gpu_test2: This container is allocated 8 GiB of GPU memory.

      sudo docker run -d -t --gpus all --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --name gpu_test2 -v /mnt:/mnt -e ALIYUN_COM_GPU_MEM_CONTAINER=8 -e ALIYUN_COM_GPU_MEM_DEV=15 nvcr.io/nvidia/tensorflow:19.10-py3
    Note

    In the preceding commands, the TensorFlow image nvcr.io/nvidia/tensorflow:19.10-py3 is used. Replace the image with your container image based on your business requirements. For more information about how to use the TensorFlow image to build a TensorFlow deep learning framework, see Deploy an NGC environment for deep learning development.

  2. Run the following command to view the GPU information about the container such as the GPU memory:

    sudo docker exec -i gpu_test1 nvidia-smi

    In this example, the gpu_test1 container is used. The following figure shows that the GPU memory of the container is 6,043 MiB.
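
    You can also query only the memory fields to compare both containers at a glance. The following commands use standard nvidia-smi query options; the totals are expected to reflect the 6 GiB and 8 GiB allocations from the previous step.

    sudo docker exec -i gpu_test1 nvidia-smi --query-gpu=memory.total,memory.used --format=csv
    sudo docker exec -i gpu_test2 nvidia-smi --query-gpu=memory.total,memory.used --format=csv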

View cGPU by using procfs nodes

The cGPU runtime generates multiple proc filesystem (procfs) nodes in the /proc/cgpu_km directory and automatically manages the nodes. You can view cGPU information and configure cGPU settings by using the procfs nodes.

  1. Run the following command to view information about the procfs nodes:

    ls /proc/cgpu_km/

    The following figure shows the command output.


    The following list describes the procfs nodes:

    • 0 (read and write): The directory of a GPU. cGPU generates a directory for each GPU on the GPU-accelerated instance, and uses numbers such as 0, 1, and 2 as the directory names. In this example, a single GPU is used and the directory name of the GPU is 0.

    • default_memsize (read and write): The default amount of memory that is allocated to a newly created container if ALIYUN_COM_GPU_MEM_CONTAINER is left empty.

    • inst_ctl (read and write): The control node.

    • upgrade (read and write): Controls the hot update of cGPU.

    • version (read-only): The version of cGPU.

  2. Run the following command to view the parameters in the directory of the GPU.

    In this example, the directory of GPU 0 is used.

    ls /proc/cgpu_km/0

    The following figure shows the command output.


    The following list describes the parameters in the GPU directory:

    • 012b2edccd7a or 0852a381c0cf (read and write): The directory of a container. cGPU generates a directory for each container that runs on the GPU-accelerated instance, and uses the container IDs as the directory names. Note: You can run the docker ps command to view the created containers.

    • free_weight (read-only): The available weight of the GPU. You can use this parameter to query the available weight. If free_weight is 0, the weight of a newly created container is 0. In this case, the container cannot obtain GPU computing power and cannot be used to run applications that require GPU computing power.

    • major (read-only): The major device number of cGPU. Different values indicate different device types.

    • max_inst (read and write): The maximum number of containers. Valid values: 1 to 25.

    • policy (read and write): The scheduling policy for computing power. cGPU supports the following scheduling policies:

      • 0: fair-share scheduling. Each container occupies a fixed time slice. The percentage of the time slice is 1/max_inst.

      • 1: preemptive scheduling. Each container occupies as many time slices as possible. The percentage of the time slices is 1/Number of containers.

      • 2: weight-based preemptive scheduling. If ALIYUN_COM_GPU_SCHD_WEIGHT is set to a value greater than 1, weight-based preemptive scheduling is used.

      • 3: fixed scheduling. Computing power is scheduled at a fixed percentage.

      • 4: soft scheduling. Compared with preemptive scheduling, soft scheduling isolates GPU resources in a softer manner.

      • 5: native scheduling. The built-in scheduling policy of the GPU driver.

      You can change the value of the policy parameter to adjust the scheduling policy. For more information about scheduling policies, see Usage examples of cGPU.

    • prio_ratio (read and write): The maximum computing power that a high-priority container can preempt when workloads of multiple types are colocated. Valid values: 20 to 99.

  3. Run the following command to view the parameters in the directory of the container.

    In this example, the 012b2edccd7a container directory is used.

    ls /proc/cgpu_km/0/012b2edccd7a

    The following figure shows the command output.


    The following list describes the parameters in the container directory:

    • highprio (read and write): The priority of the container. Default value: 0. If ALIYUN_COM_GPU_HIGH_PRIO is set to 1 and highprio is specified, the container can preempt the maximum computing power that is specified by prio_ratio. Note: This parameter is used when workloads of multiple types are colocated.

    • id (read-only): The container ID.

    • memsize (read and write): The GPU memory of the container. cGPU generates a value for this parameter based on the value of ALIYUN_COM_GPU_MEM_DEV.

    • meminfo (read-only): The information about the GPU memory, including the remaining GPU memory in the container, the ID of the process that is using the GPU, and the GPU memory usage of the process. Sample output:

      Free: 6730809344
      PID: 19772 Mem: 200278016

    • weight (read and write): The weight based on which the container obtains the maximum GPU computing power. Default value: 1. The sum of the weights of all running containers cannot exceed the value of max_inst.

  4. (Optional) Run the following commands to configure cGPU:

    After you are familiar with the procfs nodes, you can run commands on the GPU-accelerated instance to adjust cGPU settings, such as the scheduling policy and container weights. The following commands are examples; a combined walkthrough follows this list.

    • echo 2 > /proc/cgpu_km/0/policy: Changes the scheduling policy to weight-based preemptive scheduling.

    • cat /proc/cgpu_km/0/free_weight: Queries the available weight on the GPU. If free_weight is 0, the weight of a newly created container is 0. In this case, the container cannot obtain GPU computing power and cannot be used to run applications that require GPU computing power.

    • cat /proc/cgpu_km/0/$dockerid/weight: Queries the weight of the specified container.

    • echo 4 > /proc/cgpu_km/0/$dockerid/weight: Changes the weight based on which the container obtains GPU computing power.

View cGPU containers by using cgpu-smi

You can use cgpu-smi to view information about a container for which cGPU is used. The information includes the container ID, GPU utilization, computing power limit, GPU memory usage, and total allocated memory.

Note

cgpu-smi provides sample monitoring information about cGPU. When you deploy Kubernetes applications, you can refer to or use the sample monitoring information to perform custom development and integration.

The following figure shows the sample monitoring information provided by cgpu-smi.

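One way to monitor the containers continuously is to rerun cgpu-smi at a fixed interval, for example with watch. This assumes that cgpu-smi is installed on the instance and is available in the PATH.

# Refresh the cGPU container overview every second.
watch -n 1 cgpu-smi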

Update or uninstall cGPU

Update cGPU

cGPU supports cold updates and hot updates.

  • Cold update

    If no Docker containers are using cGPU, you can perform a cold update on cGPU. Perform the following operations:

    1. Run the following command to stop all running containers:

      sudo docker stop $(docker ps -a | awk '{ print $1}' | tail -n +2) 
    2. Run the following command to update cGPU to the latest version:

      sudo sh upgrade.sh
  • Hot update

    If Docker containers are using cGPU, you can perform a hot update on the cGPU kernel driver. Specific limits are imposed on the versions that can be updated in this way. If you require assistance, contact Alibaba Cloud after-sales engineers.

Uninstall cGPU

For more information about how to uninstall cGPU of an earlier version from a node, see Upgrade the cGPU version on a node by using a CLI.

Usage examples of cGPU

Use cGPU to schedule computing power

When cGPU loads the cgpu_km module, cGPU sets time slices (X ms each) for each GPU based on the maximum number of containers (max_inst) to allocate GPU computing power to the containers. In the following examples, the time slices are denoted Slice 1, Slice 2, ..., and Slice N. The following examples show how GPU computing power is allocated by using different scheduling policies.

  • Fair-share scheduling (policy = 0)

    When you create containers, cGPU allocates time slices to the containers. cGPU starts scheduling from Slice 1. The scheduling task is submitted to the physical GPU and executed in the container within a time slice (X ms). Then, cGPU moves to the next time slice. Each container obtains the same computing power, which is 1/max_inst. The following figure shows the details.

  • Preemptive scheduling (policy = 1)

    When you create containers, cGPU allocates time slices to the containers. cGPU starts scheduling from Slice 1. However, if no container is used within Slice 1 or if the GPU device is not started by a process in the container, cGPU skips scheduling within Slice 1 and moves to the next time slice.

    Examples:

    1. You create only one container, Docker 1, allocate Slice 1 to the container, and run two TensorFlow processes in the container. In this case, Docker 1 can obtain the computing power of the entire physical GPU.

    2. Then, you create a container named Docker 2 and allocate Slice 2 to the container. If the GPU device is not started by a process in Docker 2, cGPU skips scheduling for Docker 2 within Slice 2.

    3. If the GPU device is started by a process in Docker 2, cGPU performs scheduling within Slice 1 and Slice 2. Docker 1 and Docker 2 can obtain up to half of the computing power of the physical GPU. The following figure shows the details.

  • Weight-based preemptive scheduling (policy = 2)

    If ALIYUN_COM_GPU_SCHD_WEIGHT is set to a value greater than 1 when you create a container, weight-based preemptive scheduling is used. cGPU divides the computing power of the physical GPU into max_inst portions based on the maximum number of containers (max_inst). When ALIYUN_COM_GPU_SCHD_WEIGHT is greater than 1, cGPU combines multiple time slices into a larger time slice and allocates the combined time slice to the container.

    Sample configurations:

    • Docker 1: ALIYUN_COM_GPU_SCHD_WEIGHT = m

    • Docker 2: ALIYUN_COM_GPU_SCHD_WEIGHT = n

    Scheduling results:

    • If only Docker 1 is running, Docker 1 preempts the computing power of the entire physical GPU.

    • If Docker 1 and Docker 2 are running, Docker 1 and Docker 2 obtain the computing power at a theoretical ratio of m:n. Compared with preemptive scheduling, Docker 2 consumes n time slices even if the GPU device is not started by a process in Docker 2.

      Note

      The running performance of the containers differs when m:n is set to 2:1 and 8:4. The number of time slices within 1 second when m:n is set to 2:1 is four times the number of time slices within 1 second when m:n is set to 8:4.


    Weight-based preemptive scheduling limits the theoretical maximum GPU computing power that containers can obtain. However, on GPUs that provide strong computing power, such as NVIDIA V100 GPUs, a computing task may be completed within a single time slice if only a small amount of GPU memory is used. In this case, if m:n is set to 8:4, the GPU computing power becomes idle during the remaining time slices, and the limit on the theoretical maximum GPU computing power becomes invalid.

  • Fixed scheduling (policy = 3)

    You can use ALIYUN_COM_GPU_SCHD_WEIGHT together with max_inst to fix the percentage of computing power.

  • Soft scheduling (policy = 4)

    When you create containers, cGPU allocates time slices to the containers. Compared with preemptive scheduling, soft scheduling isolates GPU resources in a softer manner. For more information, see Preemptive scheduling (policy = 1).

  • Native scheduling (policy = 5)

    You can use this policy to isolate only GPU memory. When the policy is used, computing power is scheduled based on the built-in scheduling methods of NVIDIA GPU drivers.

The scheduling policies for computing power are supported for all Alibaba Cloud heterogeneous GPU-accelerated instances and NVIDIA GPUs that are used for the instances, including Tesla P4, Tesla P100, Tesla T4, Tesla V100, and Tesla A10 GPUs. In this example, two containers that share a GPU-accelerated instance configured with a Tesla A10 GPU are tested. The computing power ratio of the containers is 1:2. Each container obtains 12 GiB of GPU memory.
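
The following sketch shows one way to set up a similar two-container test under weight-based preemptive scheduling (policy = 2). The container names, the weights 1 and 2, the assumption that the A10 GPU exposes 24 GiB of total memory, and the reuse of the TensorFlow image from the multi-GPU example below are illustrative choices, not fixed requirements.

# Run as the root user. Switch GPU 0 to weight-based preemptive scheduling.
echo 2 > /proc/cgpu_km/0/policy

# Docker 1: weight 1, 12 GiB of GPU memory.
sudo docker run -d -t --gpus all --name docker1 \
  -e ALIYUN_COM_GPU_MEM_DEV=24 -e ALIYUN_COM_GPU_MEM_CONTAINER=12 \
  -e ALIYUN_COM_GPU_SCHD_WEIGHT=1 nvcr.io/nvidia/tensorflow:21.03-tf1-py3

# Docker 2: weight 2, 12 GiB of GPU memory.
sudo docker run -d -t --gpus all --name docker2 \
  -e ALIYUN_COM_GPU_MEM_DEV=24 -e ALIYUN_COM_GPU_MEM_CONTAINER=12 \
  -e ALIYUN_COM_GPU_SCHD_WEIGHT=2 nvcr.io/nvidia/tensorflow:21.03-tf1-py3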

Note

The following performance test data is provided for reference only.

  • Test 1: The performance data of the ResNet50 model that is trained by using the TensorFlow framework at different batch_size values is compared. The FP16 precision is used. The following table shows the test results.

    Framework  | Model    | batch_size | Precision | Images per second of Docker 1 | Images per second of Docker 2
    TensorFlow | ResNet50 | 16         | FP16      | 151                           | 307
    TensorFlow | ResNet50 | 32         | FP16      | 204                           | 418
    TensorFlow | ResNet50 | 64         | FP16      | 247                           | 503
    TensorFlow | ResNet50 | 128        | FP16      | 257                           | 516

  • Test 2: The performance data of the ResNet50 model that is run by using the TensorRT framework at different batch_size values is compared. The FP16 precision is used. The following table shows the test results.

    Framework | Model    | batch_size | Precision | Images per second of Docker 1 | Images per second of Docker 2
    TensorRT  | ResNet50 | 1          | FP16      | 568.05                        | 1132.08
    TensorRT  | ResNet50 | 2          | FP16      | 940.36                        | 1884.12
    TensorRT  | ResNet50 | 4          | FP16      | 1304.03                       | 2571.91
    TensorRT  | ResNet50 | 8          | FP16      | 1586.87                       | 3055.66
    TensorRT  | ResNet50 | 16         | FP16      | 1783.91                       | 3381.72
    TensorRT  | ResNet50 | 32         | FP16      | 1989.28                       | 3695.88
    TensorRT  | ResNet50 | 64         | FP16      | 2105.81                       | 3889.35
    TensorRT  | ResNet50 | 128        | FP16      | 2205.25                       | 3901.94

Use cGPU to allocate memory to multiple GPUs

In the following example, four GPUs are configured. GPU 0, GPU 1, GPU 2, and GPU 3 are separately allocated 3 GiB, 4 GiB, 5 GiB, and 6 GiB of memory. Sample code:

docker run -d -t --runtime=nvidia  --name gpu_test0123 --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -v /mnt:/mnt -e ALIYUN_COM_GPU_MEM_CONTAINER=3,4,5,6 -e ALIYUN_COM_GPU_MEM_DEV=23 -e NVIDIA_VISIBLE_DEVICES=0,1,2,3 nvcr.io/nvidia/tensorflow:21.03-tf1-py3
docker exec -i gpu_test0123   nvidia-smi

The following command output shows the memory details of the GPUs.

You can use ALIYUN_COM_GPU_MEM_CONTAINER to allocate memory to multiple GPUs. The following list describes the values of ALIYUN_COM_GPU_MEM_CONTAINER:

  • ALIYUN_COM_GPU_MEM_CONTAINER=3: The memory of each of the four GPUs is set to 3 GiB.

  • ALIYUN_COM_GPU_MEM_CONTAINER=3,1: The memory of the four GPUs is set to 3 GiB, 1 GiB, 1 GiB, and 1 GiB in sequence.

  • ALIYUN_COM_GPU_MEM_CONTAINER=3,4,5,6: The memory of the four GPUs is set to 3 GiB, 4 GiB, 5 GiB, and 6 GiB in sequence.

  • ALIYUN_COM_GPU_MEM_CONTAINER not specified, ALIYUN_COM_GPU_MEM_CONTAINER=0, or ALIYUN_COM_GPU_MEM_CONTAINER=1,0,0: cGPU is disabled and the default NVIDIA container service is used.
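
To confirm the amount of memory that each GPU exposes inside the container, you can also query the per-GPU memory with standard nvidia-smi query options. The following command targets the gpu_test0123 container from the preceding example.

docker exec -i gpu_test0123 nvidia-smi --query-gpu=index,memory.total --format=csv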