Container Service for Kubernetes: Usage notes for the memory isolation capability of cGPU

Last Updated: Dec 11, 2024

cGPU is a GPU memory and computing power isolation module developed by Alibaba Cloud. It ensures that when multiple containers share a single GPU, the memory and compute resources used by each container do not interfere with one another. This topic provides answers to some frequently asked questions (FAQs) about using cGPU.

Before you begin

Before you use cGPU, take note of the following items:

  • If the GPU-accelerated nodes in your cluster have the ack.node.gpu.schedule=cgpu, ack.node.gpu.schedule=core_mem, or cgpu=true label, the isolation feature of cGPU is enabled on those nodes. You can check for these labels by using the command shown after this list.

  • See the release notes for ack-ai-installer to find the mapping between ack-ai-installer versions and cGPU versions.

  • For more information about cGPU, see the official Alibaba Cloud documentation for cGPU.
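
To check whether these labels are present on your GPU-accelerated nodes, you can list them with kubectl. This is a minimal, read-only check; the -L flag only adds a column for each label key:

kubectl get nodes -L ack.node.gpu.schedule -L cgpu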

FAQs

What should I do if a Linux kernel panic occurs when I use cGPU?

If you install cGPU version 1.5.7, concurrent processes may deadlock in the cGPU kernel driver, which results in a Linux kernel panic. To prevent this issue, we recommend that you install or update to cGPU version 1.5.10 or later. For more information about how to update, see Update the cGPU version on a node.
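
If you are unsure which cGPU version is installed on a node, one hedged way to inspect it is to query the kernel module metadata with modinfo. The module name cgpu_km is an assumption; adjust it if your environment uses a different name for the cGPU kernel module.

# cgpu_km is assumed to be the name of the cGPU kernel module
modinfo cgpu_km | grep -i version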

What should I do if a Failed to initialize NVML error occurs when I execute nvidia-smi in a cGPU pod?

If you install cGPU version 1.5.2 or earlier with a driver version released after July 2023, you may encounter incompatibility issues between the cGPU and GPU driver versions. To verify the release date of your GPU driver, find it in the Linux AMD64 Display Driver Archive. For a list of default GPU driver versions supported by each ACK cluster type, see NVIDIA driver versions supported by ACK.
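
To identify the GPU driver version installed on a node before you look up its release date in the archive, you can query it with nvidia-smi (assuming nvidia-smi is available on the node):

nvidia-smi --query-gpu=driver_version --format=csv,noheader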

After a pod that requests shared GPU resources enters the Running state, run the nvidia-smi command in the pod and check whether the following output is returned:

Failed to initialize NVML: GPU access blocked by operating system

If this output is returned, update the AI suite to the latest version to resolve the issue. For more information, see Update the GPU sharing component.
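
For reference, you can also run this check from outside the pod with kubectl exec. The pod name gpu-share-pod and the default namespace below are placeholders for your own workload:

# gpu-share-pod and default are placeholders; replace them with your pod name and namespace
kubectl exec gpu-share-pod -n default -- nvidia-smi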

What should I do if a failure or timeout occurs when I create a container for a cGPU pod?

If you install a cGPU version earlier than 1.0.10 together with NVIDIA Container Toolkit 1.11 or later, container creation for the pod may fail or time out.
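
To check which NVIDIA Container Toolkit version is installed on the node, you can query the container CLI that ships with it (assuming the nvidia-container-cli binary is on the node's PATH):

nvidia-container-cli --version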

To resolve this issue, update the AI suite to the latest version. For more information, see Update the GPU sharing component.