All Products
Search
Document Center

Container Service for Kubernetes:GPU sharing overview

Last Updated:Nov 08, 2024

This topic introduces the GPU sharing solution provided by Alibaba Cloud, describes the benefits of GPU Sharing Professional Edition, and compares the features and use scenarios of GPU Sharing Basic Edition and GPU Sharing Professional Edition. This helps you better understand and use GPU sharing.

Background information

GPU sharing allows you to run multiple containers on the same GPU device. After Container Service for Kubernetes (ACK) makes GPU sharing open source, you can use GPU sharing in Kubernetes clusters that run on Alibaba Cloud or run GPU sharing in self-managed Kubernetes clusters. GPU sharing reduces the expenses on GPUs. However, when you run multiple containers on one GPU, the stability of the containers cannot be guaranteed.

To ensure container stability, you must isolate the GPU resources that are allocated to each container. When you run multiple containers on one GPU, GPU resources are allocated to each container as requested. However, if one container occupies excessive GPU resources, the performance of other containers may be affected. To solve this issue, many solutions are provided in the computing industry. For example, NVIDIA vGPU, Multi-Process Service (MPS), and vCUDA enable fine-grained sharing of GPUs.

ACK provides the GPU sharing solution to meet the preceding requirements. GPU sharing enables a GPU to be shared by multiple tasks. GPU sharing also allows you to isolate the GPU memory that is allocated to each application and partition the computing capacity of the GPU.

Features and benefits

The GPU sharing solution uses the server kernel driver that is developed by Alibaba Cloud to provide more efficient use of the underlying drivers of NVIDIA GPUs. GPU sharing provides the following features:

  • High compatibility: GPU sharing is compatible with standard open source solutions, such as Kubernetes and NVIDIA Docker.

  • Ease of use: GPU sharing provides excellent user experience. To replace a Compute Unified Device Architecture (CUDA) library of an AI application, you do not need to recompile the application or create a new container image.

  • Stability: GPU sharing provides stable underlying operations on NVIDIA GPUs. API operations on CUDA libraries and some private API operations on CUDA Deep Neural Network (cuDNN) are difficult to call.

  • Resource isolation: GPU sharing ensures that the allocated GPU memory and computing power do not affect each other.

GPU sharing provides a cost-effective, reliable, and user-friendly solution that allows you to enable GPU scheduling and memory isolation.

Benefit

Description

Supports GPU sharing, scheduling, and memory isolation.

  • Supports GPU sharing, scheduling, and memory isolation on a one-pod-one-GPU basis. This is commonly used in model inference scenarios.

  • Supports GPU sharing, scheduling, and memory isolation on a one-pod-multi-GPU basis. This is commonly used to build the code to train distributed models.

Supports flexible GPU sharing and memory isolation policies.

  • Supports GPU allocation by using the binpack and spread algorithms.

    • Binpack: The system preferentially shares one GPU with multiple pods. This applies to scenarios where high GPU utilization is required.

    • Spread: The system attempts to allocate one GPU to each pod. This applies to scenarios where the high availability of GPUs is required. The system attempts to avoid allocating the same GPU to different replicated pods of an application.

  • Supports GPU sharing without memory isolation. This applies to deep learning scenarios where applications are configured with user-defined isolation systems at the application layer.

  • Supports GPU sharing on multiple GPUs and memory isolation.

Supports comprehensive monitoring of GPU resources.

Supports monitoring of both exclusive GPUs and shared GPUs.

Billing

GPU sharing is a charged service. You need to activate the cloud-native AI suite before you can use GPU sharing. For more information about the billing details, see Billing of the cloud-native AI suite.

Usage notes

GPU sharing supports only ACK Pro clusters. For more information about how to install and use GPU sharing, see the following topics:

You can also use the following advanced features provided by GPU sharing:

Terms

Share mode and exclusive mode

The share mode allows multiple pods to share one GPU, as shown in the following figure. TU2.png

The exclusive mode allows a pod to occupy one or more GPUs exclusively, as shown in the following figure. TU1.png

GPU memory isolation

GPU sharing can only ensure that multiple pods run on one GPU but cannot prevent resource contention among the pods when GPU memory isolation is disabled. The following section shows an example.

Pod 1 requests 5 GiB of GPU memory and Pod 2 requests 10 GiB of GPU memory. When GPU memory isolation is disabled, Pod 1 can use up to 10 GiB of GPU memory, including the 5 GiB of GPU memory requested by Pod 2. Consequently, Pod 2 fails to launch due to insufficient GPU memory. After GPU memory isolation is enabled, when Pod 1 attempts to use GPU memory greater than the requested value, the GPU memory isolation module forces Pod 1 to fail. TU3.png

GPU scheduling policies: binpack and spread

If a node with the GPU sharing feature enabled has multiple GPUs, you can choose one of the following GPU selection policies:

  • Binpack: By default, the binpack policy is used. The scheduler allocates all resources of a GPU to pods before you switch to another GPU. This helps prevent GPU fragments.

  • Spread: The scheduler attempts to spread pods to different GPUs on the node in case business interruptions occur when a GPU is faulty.

In this example, a node has two GPUs. Each GPU provides 15 GiB of memory. Pod1 requests 2 GiB of memory and Pod2 requests 3 GiB of memory.

image

Single GPU sharing and multiple GPU sharing

  • Single GPU sharing: A pod can request GPU resources that are allocated by only one GPU.

  • Multiple GPU sharing: A pod can request GPU resources that are evenly allocated by multiple GPUs.

图片1.png