In cloud environments, GPUs are scarce and valuable computing resources. Acquiring them on demand involves uncertainty: when resources are not available in time, critical business operations may be interrupted or delayed. To address this, Alibaba Cloud Container Service (ACS) Serverless Kubernetes provides two resource reservation modes that give your GPU applications deterministic resource guarantees.
GPU Pod capacity reservation (Pod-level Reservation)
How it works: GPU Pod capacity reservation is a workload-oriented, standardized capacity reservation. You specify the Pod specification (for example, 2×A10 GPU, 16 vCPU, 32 GiB memory) and the number of Pods to reserve (for example, 12). The platform then reserves computing capacity that can accommodate exactly 12 Pods of that specification.
Determinism provided: This mode provides workload capacity determinism. Whenever you initiate a creation request, the system guarantees that it can run the 12 Pods of the specification you reserved. This greatly simplifies capacity planning: you do not need to worry about underlying node specifications or resource fragmentation, only about your application's Pod requirements.
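As a sketch, a Deployment whose Pod template matches the reserved specification from the example above (2×A10 GPU, 16 vCPU, 32 GiB, 12 Pods) might look like the following. The image name is a placeholder, and the GPU resource name `nvidia.com/gpu` is an illustrative assumption; consult the ACS documentation for the exact resource names and annotations your cluster expects.

```yaml
# Illustrative only: the image and the GPU resource name are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-server
spec:
  replicas: 12                      # matches the 12 reserved Pods
  selector:
    matchLabels:
      app: inference-server
  template:
    metadata:
      labels:
        app: inference-server
    spec:
      containers:
        - name: worker
          image: registry.example.com/inference:latest   # placeholder image
          resources:
            requests:
              cpu: "16"
              memory: 32Gi
              nvidia.com/gpu: "2"   # 2×A10; the actual resource name may differ on ACS
            limits:
              cpu: "16"
              memory: 32Gi
              nvidia.com/gpu: "2"
```

Because the Pod template matches the reserved specification exactly, scaling the Deployment up to the reserved count consumes the reserved capacity deterministically.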
Scenarios:
Homogeneous workloads: This mode is the best choice when your application (such as large-scale distributed training or online inference service) consists of many Pods with identical specifications.
Simplified operations: When you want to completely delegate the complexity of underlying resource planning to the platform and focus only on application-level capacity requirements.
GPU-HPN capacity reservation (Node-level Reservation)
How it works: This mode reserves and locks dedicated GPU computing node capacity for you in the underlying ACS resource pool. These resources are locked for exclusive use by your account, ensuring that whenever you create new GPU Pods, warm hardware resources are available to host them. This avoids Pod scheduling failures (Pods stuck in the Pending state) caused by insufficient capacity in the resource pool.
Determinism provided: This mode provides physical resource determinism. It ensures that when you need to scale out, the underlying infrastructure (GPU nodes) is definitely available. You decide how to schedule and combine Pods of different specifications onto these nodes (known as bin packing).
Scenarios:
Heterogeneous workloads: This mode provides maximum flexibility when you need to run GPU Pods with various specifications in the same resource pool.
Fine-grained resource control: When you want to use custom scheduling policies (such as Taints/Tolerations, Node Affinity) to control Pod physical layout precisely for performance optimization or resource isolation.
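As an illustration of such custom placement, a Pod can tolerate a taint applied to the reserved GPU nodes and require node affinity to them. The taint key, node label, image, and GPU resource name below are hypothetical names chosen for the sketch; ACS may use different keys and labels for reserved node pools.

```yaml
# Illustrative only: the taint key, node label, image, and GPU resource name are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: training-worker
spec:
  tolerations:
    - key: "gpu-reserved"            # hypothetical taint placed on the reserved nodes
      operator: "Exists"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "node-pool"     # hypothetical label identifying the reserved pool
                operator: In
                values: ["gpu-hpn-reserved"]
  containers:
    - name: trainer
      image: registry.example.com/train:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: "1"        # resource name may differ on ACS
```

Tainting the reserved nodes keeps unrelated Pods off them, while the affinity rule pins this workload to the reserved pool, giving you precise control over physical layout.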
Summary and comparison
| Attribute | GPU Pod capacity reservation (Pod-level) | GPU-HPN capacity reservation (Node-level) |
| --- | --- | --- |
| Reservation object | A number of Pods of a specific specification. | Underlying GPU computing node capacity. |
| Reservation granularity | Logical workload (for example, 12 Pods of 1×A10 GPU, 8 vCPU, 16 GiB). | Physical node resources (for example, 2 P16EN nodes). |
| Guarantee level | Workload capacity determinism. | Physical node resource determinism. |
| Flexibility | Lower (bound to a specific Pod specification). | Very high (can run Pods of flexible specifications). |
| Management complexity | Low (the platform handles resource matching). | Higher (you must respond to node management events). |
| Selection recommendation | Applications composed of many Pods with identical specifications. | Medium to large-scale applications with complex and variable specifications. |
By choosing the reservation mode that matches your business's requirements for determinism, you can effectively mitigate the risks of GPU resource acquisition and ensure the stable, reliable operation of your AI applications.