Node Pool: Concepts, Classification, and Operational Guidelines - Container Service for Kubernetes

A node pool is a collection of nodes with the same attributes, enabling streamlined management and operation, such as node upgrades and auto scaling. Node pools also support automated operations, including OS CVE vulnerability repairs, automatic node recovery, and kubelet and containerd version upgrades, to minimize operational costs.

Introduction to node pools

Consider a node pool as a configuration template; nodes added to the pool inherit its configuration. Within a cluster, you can establish multiple node pools with diverse configurations and types. The configuration of a node pool configuration includes properties such as instance specifications, billing type, zone (vSwitch), operating system image, CPU architecture, tags, and taints. These properties are set during node pool creation or can be modified later. For more information about how to create a node pool, see Create and manage a node pool.

Clusters created before the introduction of node pools contain free nodes. To continue using these nodes, we recommend managing them by incorporating them into a node pool. For instructions, see Add free nodes to a node pool.

A single node pool simplifies management and configuration, while multiple node pools enable refined resource isolation and mixed deployment of different node types.

Single node pool

Multiple node pools

Manage computing resources for multiple teams or various workloads through a single node pool, simplifying operations and maintenance. A single node pool can support the following features.

Manage computing resources for multiple teams simultaneously.
Configure multiple instance types, such as regular Elastic Compute Service (ECS) instances, GPU-accelerated instances, elastic bare metal instances, and high-performance computing optimized instances, to meet the needs of different workloads.
Distribute nodes across multiple zones to improve high availability.

Currently, mixing instances of different operating system types and CPU architectures (Arm and x86) is not supported.

Create multiple node pools to provide independent computing resources for different workloads or teams, thereby avoiding resource contention and potential security risks. Suitable for the following scenarios.

Tenant isolation, providing independent computing resources for different teams, and facilitating billing management.
Isolation of machines with different hardware specifications (such as CPU architecture, GPU, FPGA, etc.) to ensure the reasonable allocation of hardware resources.
Enhance security isolation for sensitive applications.
Deploy different operating systems.

By utilizing multiple node pools, you can prioritize different node pools through scheduling policies to enhance resource and cost management. Consider the following scenarios:

Manage the priority order of computing resources with varying costs, such as preemptible instances and subscription instances, to minimize expenses.
Allocate various instance types based on workload requirements, including the proportion of x86 and Arm architectures utilized.

Node pool features

Container Service for Kubernetes (ACK) offers a range of features at the node pool level to simplify node management. For those who prefer to concentrate on developing higher-level applications without the burden of extensive worker node maintenance, the managed node pool feature can be enabled to utilize various automated operations and maintenance capabilities.

Basic features

Feature item	Description	Related documents
Create, edit, delete, and view	Support creating node pools through the console, configuring basic information, network configuration, instance specifications, storage configuration, expected number of nodes, etc. Support adjusting some configurations of existing node pools. For the configuration items that can be edited and the operation notes, see the document for details. If nodes are no longer needed, you can delete the node pool. Whether the expected number of nodes is enabled and the billing mode of the node pool will affect the behavior of node release. Support viewing node pool details, including basic configuration information, resource dashboard, node list, scaling activities, etc.	Create and manage a node pool
Manual or automatic scaling	Support manually adjusting the expected number of nodes in the node pool to achieve scaling, keeping the number of nodes at the expected quantity and saving resource costs. Some non-standard operations such as removal, modification, and release may cause the node pool to not scale as expected. See the document for details. Support configuring node auto scaling solutions. When the capacity planning of the cluster cannot meet the scheduling of application pods, automatically scale node resources.	Manually scale a node pool Node scaling
Add existing nodes	If you need to add purchased ECS instances to an ACK cluster as worker nodes or re-add worker nodes to a node pool after removal, you can use the add existing nodes feature. This feature has some usage restrictions and notes. See the document for details.	Add existing ECS instances to an ACK cluster
Remove nodes	If certain nodes are no longer needed, you can remove them from the cluster or node pool. Perform standardized operations to remove them to avoid unexpected behavior.	Remove a node
Upgrade kubelet version	Enable automated O&M capabilities Upgrade the kubelet version and containerd version of nodes in the node pool.	Update a node pool
Replace operating system	Support upgrading the operating system version and replacing the operating system type (such as switching an EOL operating system to ContainerOS or Alibaba Cloud Linux).	Change the operating system
CVE vulnerability repair	Enable automated O&M capabilities Manually scan and repair security vulnerabilities in the node operating system. Some CVE vulnerabilities require node restart to be repaired. See the document for feature description and notes.	Patch OS CVE vulnerabilities for node pools
Customize node pool kubelet parameters	Log on to the console to customize the kubelet parameter configuration of nodes at the node pool level, adjusting node behavior, such as reserving resources for the entire cluster to allocate resource usage. We recommend that you do not use the CLI to customize kubelet parameters that are unavailable in the ACK console.	Customize the kubelet parameters of a node pool
Customize node pool OS parameters	Log on to the console to customize the OS parameter configuration of nodes at the node pool level to optimize system performance. We recommend that you do not use the CLI to customize OS parameters that are unavailable in the ACK console.	Customize the OS parameters of a node pool

Automated O&M capabilities

Activating automated O&M features for a node pool reduces the maintenance burden, allowing ACK to handle tasks such as OS CVE vulnerability repairs, kubelet upgrades, and node self-recovery. However, if your applications are sensitive to underlying node changes and cannot tolerate restarts or pod migrations, enabling these features is not recommended.

Preparations

Ensure your operating system is one of the following: Alibaba Cloud Linux 3 Container-optimized, ContainerOS, Alibaba Cloud Linux, Red Hat, or Ubuntu.
For supported operating system images and their usage restrictions in ACK clusters, see OS images.
Before using automated O&M features in a node pool, complete the following on the ACK console Node Pools page (settings can be adjusted afterward).
For instructions on creating and editing node pools, see Create and manage a node pool.
Expand to View Preparations
- Enable managed capability for the node pool
  - New node pool: Select Managed Node Pool during creation.
  - Existing node pool: In the Actions column of the node pool list, click More > Enable Managed Node Pool for the desired node pool.
- Enable automated O&M features as needed
  This includes automatic node recovery, kubelet and runtime version upgrades, and OS CVE vulnerability repairs.
- Configure the cluster maintenance window
  Automated O&M tasks are performed within the configured maintenance window.
  - New node pool: Set the Maintenance Window for the Managed Node Pool during creation.
  - Existing node pool: In the Actions column of the node pool list, click More > Configure Managed Node Pool for the existing managed node pool and set the maintenance window.
The configuration example is shown in the following figure.

Feature introduction

Feature Item	Description
Automatic upgrade of the kubelet and runtime versions	Automatically maintain kubelet version and containerd runtime upgrades. kubelet upgrade: Automatically updates the kubelet on all nodes within a node pool to match the version of the control plane, using an in-place upgrade by default. Container runtime upgrade: The container runtime of a node is upgraded to the newest version following the release of an update. In the event of significant OS CVE vulnerabilities in containerd, ACK by default automatically performs the necessary upgrades and patches. For an introduction to related features and important notes, see Update a node pool.
OS CVE vulnerability auto repair	ACK scans for security vulnerabilities on nodes and schedules CVE vulnerability repairs according to the cluster maintenance window, enhancing cluster stability, security, and compliance. For related notes, see Patch OS CVE vulnerabilities for node pools.
Automatic responses to ECS system events	Support automatic responses to ECS system events. The supported system event types include the following: Instance restart due to system maintenance (SystemMaintenance.Reboot) Automatic Response Flow Upon event receipt, ACK sends a text or internal message notification. Monitor these notifications closely. Drain the affected ECS instance, migrate the pods to other active nodes, and restart the ECS instance. If the node pool has an available maintenance window before the scheduled execution time of ECS, ACK will perform the automatic response during this window. Otherwise, ACK will execute the process one hour before the scheduled execution time of ECS. The drainage operation evicts pods from the node. To ensure service continuity, deploy multiple replicas across nodes and consider using PodDisruptionBudgets (PDBs) for critical applications. If drainage fails, ACK will not forcibly restart the instance.
Node auto repair	ACK continuously monitors node health, such as `condition` information. If a node becomes unresponsive, ACK automatically initiates recovery actions to maintain node functionality. Note that some complex node issues may still require manual intervention. For more on the triggers, events, and recovery process, see Enable node auto repair.

Node pool lifecycle

The lifecycle of a node pool in an ACK cluster encompasses several stages, from creation and deployment to ongoing maintenance activities like scaling, upgrading, and node removal, and ultimately, deletion. The various states and their transitions are outlined below.

Node pool status	Description
Initializing (initial)	The node pool is being created.
Activated (active)	The node pool is successfully created and running.
Failed (failed)	The node pool creation failed.
Scaling Out (scaling)	The node pool is scaling out or adding nodes.
Updating (updating)	The node pool configuration is being updated.
Removing (removing_nodes)	Nodes are being removed from the node pool.
Upgrading (upgrading)	The node pool is being upgraded.
Repairing (repairing)	Repairing in the node pool, such as repairing node pool nodes, node pool CVE vulnerabilities, etc.
Deleting (deleting)	The node pool is being deleted.
Deleted (deleted, this status is invisible to you)	The node pool is successfully deleted.

Node pool billing

While using node pools and their automated O&M capabilities is free, the cloud resources within the node pool, such as ECS instances, incur charges from the respective cloud products.

For more information about the billing rules of ECS instances, see Billing overview.
To change the billing type of existing nodes in the node pool, see Change the billing method of an instance from pay-as-you-go to subscription. Note that changing the billing type of the node pool only affects newly added nodes and does not alter the billing type of existing nodes.
For details on scaling group billing, see Billing overview.

Related terms

Familiarize yourself with the following concepts and terms related to node pools before using them for the first time:

Scaling group: ACK leverages the Auto Scaling service for node pool expansion (scale-out) and reduction (scale-in) activities. Each node pool is directly associated with a single scaling group instance. A scaling group consists of one or more ECS instances, which serve as worker nodes.
Scaling configuration: Node pools use scaling configurations to manage node properties at the foundational level. Auto Scaling configurations serve as templates for ECS instances during auto scaling, automatically creating instances based on these templates when scaling activities are triggered.
Scaling activity: Each node addition or removal in a node pool initiates a scaling activity. The system automatically completes all scaling actions and logs the activity, allowing you to review historical scaling records through the scaling activities of node pool.
Replace system disk: Some node pool operations, such as adding existing nodes or updating the container runtime, involve reinitializing nodes by replacing the system disk. This process does not alter node-specific properties like name, instance ID, or IP, but it does erase data on the system disk. Attached data disks remain unaffected.
When ACK conducts disk replacement, it initiates node drainage to relocate pods from the affected node to other available nodes, adhering to the PDB. To maintain high service availability, we recommend that you implement a multi-replica deployment strategy, spreading workloads over several nodes, and to set up PDB for critical services to manage the simultaneous disruption of pods.
In-place upgrade: An alternative to disk replacement, this method updates and replaces necessary components directly on the original node without reinitializing it or affecting the data present on the node.

References

For more information about the best practices for node pools, see Best practices for nodes and node pools.
Clusters that run Kubernetes 1.24 will no longer support Docker as the built-in container runtime. For more information about how to migrate to containerd, see Migrate the container runtime from Docker to containerd.
As cluster load increases, consider enabling the auto scaling solution to dynamically adjust node resources. For more information, see Auto scaling.
For more information about how to specify the node pool for application pods, see Schedule application pods to a specific node pool.
An ACK Basic cluster supports up to 10 nodes. If you need to increase the limit, we recommend that you perform a hot migration to an ACK Pro cluster. For more information, see Hot migration from ACK basic clusters to ACK Pro clusters.
For more information about how to troubleshoot issue related to nodes or node pools, see FAQ about nodes and node pools.
ACK offers a cluster cost management solution. For more information, see Cost suite.