Container Service for Kubernetes:Manage the GPU sharing component

Last Updated: Feb 26, 2026

Alibaba Cloud Container Service for Kubernetes (ACK) provides GPU sharing so that multiple model inference workloads can run on a single GPU, and supports GPU memory isolation at the NVIDIA driver kernel level. If the GPU sharing component is incompatible with a node's GPU driver or operating system, you must upgrade the component. This topic describes how to manage the GPU sharing component on GPU nodes to enable GPU sharing scheduling and memory isolation.

Limitations

  • Do not set the CPU Policy to static for nodes that use GPU sharing.

  • If you need to customize the KubeConfig file path, use export KUBECONFIG=<kubeconfig> to specify the path, because the kubectl inspect cgpu command does not support the --kubeconfig parameter.

  • The isolation feature of the cGPU service does not support GPU memory allocation using Unified Virtual Memory (UVM), which involves calling the cudaMallocManaged() CUDA API. Allocate GPU memory using other methods, such as calling cudaMalloc(). For more information, see the official NVIDIA documentation.

  • The GPU sharing DaemonSet Pod does not have the highest priority on a node. This means its resources can be preempted by other high-priority Pods, leading to eviction. To prevent this, modify the DaemonSet you are using (for example, gpushare-device-plugin-ds for shared GPU memory) by adding priorityClassName: system-node-critical to grant it the highest priority.

  • For performance considerations, create a maximum of 20 Pods on a single physical GPU card when using cGPU. If you exceed this limit, subsequent Pods scheduled to the card will fail to run and generate the error Error occurs when creating cGPU instance: unknown.

  • You can install the GPU sharing component in any region. However, only the following regions support GPU memory isolation. Ensure your cluster is in one of these regions.

    Regions

    Region                 Region ID
    China (Beijing)        cn-beijing
    China (Shanghai)       cn-shanghai
    China (Hangzhou)       cn-hangzhou
    China (Zhangjiakou)    cn-zhangjiakou
    China (Ulanqab)        cn-wulanchabu
    China (Shenzhen)       cn-shenzhen
    China (Chengdu)        cn-chengdu
    China (Heyuan)         cn-heyuan
    China (Hong Kong)      cn-hongkong
    Japan (Tokyo)          ap-northeast-1
    Indonesia (Jakarta)    ap-southeast-5
    Singapore              ap-southeast-1
    US (Virginia)          us-east-1
    US (Silicon Valley)    us-west-1
    Germany (Frankfurt)    eu-central-1

  • Version requirements.

    Configuration              Supported version
    Kubernetes version         ack-ai-installer earlier than 1.12.0: clusters of 1.18.8 or later.
                               ack-ai-installer 1.12.0 or later: clusters of 1.20 or later only.
    NVIDIA driver version      418.87.01 or later
    Container runtime version  Docker 19.03.5 or later; containerd 1.4.3 or later
    Operating system           Alibaba Cloud Linux 3.x (container-optimized versions require
                               ack-ai-installer 1.12.6 or later), Alibaba Cloud Linux 2.x,
                               CentOS 7.6, CentOS 7.7, CentOS 7.9, Ubuntu 22.04
    Supported GPU cards        P-series, T-series, V-series, A-series, H-series
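To prevent the eviction described in the limitations above, add priorityClassName to the DaemonSet's Pod template. A minimal sketch of the relevant fragment, assuming the shared GPU memory DaemonSet gpushare-device-plugin-ds runs in the kube-system namespace (verify the name and namespace in your cluster with kubectl get ds -A):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpushare-device-plugin-ds
  namespace: kube-system        # assumed namespace; verify with `kubectl get ds -A`
spec:
  template:
    spec:
      # Grants the DaemonSet Pods the highest node-level priority so they are
      # not preempted and evicted by other high-priority Pods.
      priorityClassName: system-node-critical
```

You can make the same change in place with kubectl -n kube-system edit daemonset gpushare-device-plugin-ds.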

Install the GPU sharing component

Step 1: Install the GPU sharing component

Cloud-native AI Suite not deployed

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, find the cluster you want and click its name. In the left-side navigation pane, choose Applications > Cloud-native AI Suite.

  3. On the Cloud-native AI Suite page, click Deploy.

  4. On the Deploy Cloud-native AI Suite page, select Scheduling Policy Extension (Batch Task Scheduling, GPU Sharing, Topology-aware GPU Scheduling).

  5. (Optional) Click Advanced to the right of Scheduling Policy Extension (Batch Task Scheduling, GPU Sharing, Topology-aware GPU Scheduling). In the Parameters dialog box that appears, modify the policy parameter for cGPU. After making changes, click OK.

    If you do not have special requirements for cGPU compute sharing, we recommend the default policy 5 (native scheduling). For supported cGPU policies, see Install and use the cGPU component.


  6. At the bottom of the Cloud-native AI Suite page, click Deploy Cloud-native AI Suite.

    After the installation succeeds, the ack-ai-installer component appears in the component list on the Cloud-native AI Suite page.

Cloud-native AI Suite deployed

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, find the cluster you want and click its name. In the left-side navigation pane, choose Applications > Cloud-native AI Suite.

  3. In the row for the scheduling component ack-ai-installer, click Deploy in the Actions column.

  4. (Optional) In the Parameters dialog box that appears, modify the policy field for cGPU.

    If you do not have special requirements for cGPU compute sharing, we recommend the default policy 5 (native scheduling). For supported cGPU policies, see Install and use the cGPU component.


  5. After making changes, click OK.

    After installation, the Status of ack-ai-installer changes to Deployed.

Step 2: Enable GPU sharing and memory isolation

  1. On the Clusters page, find the cluster to manage and click its name. In the left navigation pane, choose Nodes > Node Pools.

  2. On the Node Pools page, click Create Node Pool and complete the configurations as described in Create and manage node pools.

  3. On the Create Node Pool page, configure the settings for the node pool, then click Confirm. Key configurations are described below:

    Parameter       Description
    Expected Nodes  The initial number of nodes in the node pool. To create a node pool
                    without nodes, enter 0.
    Node Labels     Add labels based on your business requirements. For details about
                    node labels, see Enable scheduling features.
                    This example sets the Key to ack.node.gpu.schedule and the Value to
                    cgpu (click the Node Label icon next to Node Labels to add it). The
                    label enables GPU sharing on the node: each Pod only requests GPU
                    memory, and Pods on the same card get memory isolation while sharing
                    compute power.

    Important
    • For notes on cGPU isolation, see cGPU FAQ.

    • After adding the GPU sharing label, do not change the GPU scheduling label with the kubectl label nodes command or through label management on the Nodes page in the console. Doing so may cause issues. To change the label safely, follow Enable scheduling features.
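After a node carries the ack.node.gpu.schedule=cgpu label, workloads request GPU memory rather than whole GPUs. A minimal Pod sketch, assuming the GPU sharing component exposes the extended resource aliyun.com/gpu-mem in GiB (the Pod name and image below are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cgpu-demo                      # illustrative name
spec:
  containers:
  - name: main
    image: your-inference-image:latest # placeholder image
    resources:
      limits:
        # Request 3 GiB of GPU memory instead of an exclusive GPU; the
        # scheduler places the Pod on a shared GPU with enough free memory.
        aliyun.com/gpu-mem: 3
```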

Step 3: Add GPU nodes

Note

If you created GPU nodes when you added the node pool, you can skip this step.

After creating a node pool, you can add GPU nodes to it. When adding GPU nodes, select GPU-accelerated instances as the instance type architecture. For details, see Add existing nodes or Create and manage node pools.

Step 4: Install and use the GPU resource query tool

  1. Download the kubectl-inspect-cgpu executable to a directory included in your PATH environment variable. This example uses /usr/local/bin/.

    • If you are using a Linux system, download kubectl-inspect-cgpu:

      wget http://aliacs-k8s-cn-beijing.oss-cn-beijing.aliyuncs.com/gpushare/kubectl-inspect-cgpu-linux -O /usr/local/bin/kubectl-inspect-cgpu
    • On macOS:

      wget http://aliacs-k8s-cn-beijing.oss-cn-beijing.aliyuncs.com/gpushare/kubectl-inspect-cgpu-darwin -O /usr/local/bin/kubectl-inspect-cgpu
  2. Add executable permissions to kubectl-inspect-cgpu.

    chmod +x /usr/local/bin/kubectl-inspect-cgpu
  3. View the GPU usage in your cluster.

    kubectl inspect cgpu

    Expected output:

    NAME                       IPADDRESS      GPU0(Allocated/Total)  GPU Memory(GiB)
    cn-shanghai.192.168.6.104  192.168.6.104  0/15                   0/15
    ----------------------------------------------------------------------
    Allocated/Total GPU Memory In Cluster:
    0/15 (0%)
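The cluster-wide summary in the last line is simply the per-node GPU Memory columns summed up. A small sketch that reproduces the calculation from the sample row above; in a live cluster, pipe the node rows of kubectl inspect cgpu output in place of the printf:

```shell
# Sum the Allocated/Total GPU memory column and print the cluster-wide ratio.
# The printf stands in for the node rows of `kubectl inspect cgpu`.
printf 'cn-shanghai.192.168.6.104 192.168.6.104 0/15 0/15\n' |
awk '{
  split($NF, m, "/")              # last column: GPU Memory (allocated/total)
  alloc += m[1]; total += m[2]
}
END { printf "%d/%d (%d%%)\n", alloc, total, total ? alloc * 100 / total : 0 }'
```

For the sample row this prints 0/15 (0%), matching the summary above.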

Update the GPU sharing component

Step 1: Determine the upgrade method for the component

The upgrade method depends on how the ack-ai-installer component was originally installed. There are two installation methods.

  • Via Cloud-native AI Suite (Recommended): Install the ack-ai-installer component from the Cloud-native AI Suite page.

  • Via App Catalog (Deprecated): Install the ack-ai-installer component from the App Catalog page in the App Marketplace. This installation method is now deprecated. However, for existing components installed this way, you can still upgrade them using this method.

    Important

    If you uninstall a component that you installed this way, you will need to activate the Cloud-native AI Suite to reinstall it.

Determine the installation method

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, find the cluster you want and click its name. In the left-side navigation pane, choose Applications > Cloud-native AI Suite.

  3. Check if the ack-ai-installer component is listed on the Cloud-native AI Suite page.

    If it is, the component was installed via Cloud-native AI Suite. Otherwise, it was installed via App Catalog.

Step 2: Upgrade the component

Cloud-native AI Suite

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, find the cluster you want and click its name. In the left-side navigation pane, choose Applications > Cloud-native AI Suite.

  3. In the Components area, locate the ack-ai-installer component and click Upgrade in the Actions column.

App Catalog

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, find the cluster you want and click its name. In the left-side navigation pane, choose Applications > Helm.

  3. In the Helm release list, locate the ack-ai-installer component, click Update in the Actions column, and follow the on-screen instructions to select the latest Chart version and complete the update.

    Important

    If you need to customize the Chart configuration, modify the configuration before you confirm the update.

    After the update, verify that the ack-ai-installer component shows the latest Chart version in the Helm release list.

Step 3: Upgrade existing nodes

Upgrading the ack-ai-installer component does not automatically upgrade the cGPU version on existing nodes. Use the following instructions to determine if cGPU isolation is enabled on your nodes.

  • If your cluster contains GPU nodes with cGPU isolation enabled, you must also upgrade the cGPU version on the existing nodes. For instructions, see Upgrade the cGPU version of a node.

  • If your cluster has no nodes with cGPU isolation enabled, you can skip this step.

    Note
    • A node has cGPU isolation enabled if it has the label ack.node.gpu.schedule=cgpu or ack.node.gpu.schedule=core_mem.

    • Upgrading the cGPU version on existing nodes requires stopping all application Pods on them. Perform this operation during off-peak hours to minimize business impact.
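To find the nodes that need the cGPU upgrade, you can filter on the two labels above. In a live cluster you can run kubectl get nodes -l 'ack.node.gpu.schedule in (cgpu, core_mem)'; the sketch below applies the same filter to sample --show-labels output (node names are illustrative):

```shell
# Sample lines standing in for `kubectl get nodes --show-labels` output.
printf '%s\n' \
  'cn-beijing.192.168.1.10  Ready  ack.node.gpu.schedule=cgpu' \
  'cn-beijing.192.168.1.11  Ready  ack.node.gpu.schedule=default' \
  'cn-beijing.192.168.1.12  Ready  ack.node.gpu.schedule=core_mem' |
grep -E 'ack\.node\.gpu\.schedule=(cgpu|core_mem)$' |
awk '{ print $1 }'                # keep only the node names
```

Only the first and third sample nodes have cGPU isolation enabled, so only their names are printed.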