Update a node pool

Updated at: 2025-01-10 09:57

When you update the Kubernetes version of your cluster, you must update the node pool after you update the control plane. Perform the node pool update during off-peak hours. During a node pool update, the kubelet and the container runtime are updated. Container Service for Kubernetes (ACK) performs a precheck before it updates a node pool and notifies you of the risks of the update to help ensure a seamless update.

Usage notes

  • Node scaling

    • If node scaling is enabled for the cluster, ACK automatically updates the cluster-autoscaler component to the latest version after the cluster is updated. This ensures that the auto scaling feature can function as expected. After the cluster is updated, check whether cluster-autoscaler is updated to the latest version. For more information, see Enable node auto scaling.

    • During a cluster update, nodes whose scaling mode is set to swift mode may fail to be updated because the nodes are shut down. If nodes in swift mode fail to be updated after the cluster is updated, we recommend that you manually remove the nodes.

  • After you update the Kubernetes version of the cluster to 1.18, ACK automatically configures resource reservation. If resource reservation is not configured for the cluster and the resource usage of nodes is high, ACK may fail to schedule evicted pods to the nodes after the cluster is updated. Reserve sufficient resources on the nodes: we recommend that you keep CPU utilization below 50% and memory utilization below 70%. For more information, see Resource reservation policy. A sketch of what resource reservation looks like at the kubelet level is provided after this list.

  • If the pods in a cluster that runs Kubernetes 1.24 or earlier are configured only with a startup probe, the pods may temporarily remain in the NotReady state after the kubelet is restarted. We recommend that you use a multi-replica deployment strategy and distribute the replicas across multiple nodes so that enough pods remain available while a node restarts (see the Deployment sketch after this list).

  • If a pod accesses another pod on the same node by using the IP address of the Server Load Balancer (SLB) instance exposed by a LoadBalancer Service whose externalTrafficPolicy is set to Local, the two pods may no longer reside on the same node after the node is updated. This may cause a network failure. A sketch of this Service configuration is provided after this list.

  • Custom OS images are not strictly validated by ACK. ACK does not guarantee the success of cluster updates for clusters that use a custom OS image.

  • The update process uses Yum to download the required software packages. If your cluster uses custom network configurations or a custom OS image, make sure that Yum can run as expected. You can run the yum makecache command to check the status of Yum.

  • If your cluster uses other custom configurations, such as swap partitions, kubelet configurations modified by using the CLI, or runtime configurations, the cluster may fail to be updated or the custom configurations may be overwritten during the update.

  • When you update a node by replacing its system disk, ACK drains the node and evicts its pods to other available nodes while honoring PodDisruptionBudgets (PDBs). To ensure high service availability, we recommend that you use a multi-replica deployment strategy to distribute workloads across multiple nodes. You can also configure PDBs for key services to limit the number of pods that are interrupted at the same time (see the PodDisruptionBudget sketch after this list).

    The default timeout period for node draining is 30 minutes. If pod migration is not completed within this period, ACK terminates the update to ensure service stability.

  • When you update a node by replacing the system disk, ACK reinitializes the node according to the current node pool configurations, including node logon methods, labels, taints, operating system images, and runtime versions. Node pool configurations are normally changed by editing the node pool. If you modified the node in other ways, those changes are overwritten during the update.

  • If pods on a node use hostPath volumes and a hostPath volume points to a path on the system disk, the data in that volume is lost after the node is updated by replacing the system disk.

  • During a node pool update, you can only scale out the node pool. Scale-in is not allowed.

  • If your node is a free node, which is a worker node not managed by a node pool, you must migrate the node. For more information, see Add free nodes to a node pool.
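
For reference, the resource reservation mentioned in the notes above corresponds to the kubelet's kubeReserved and systemReserved settings. The following is a minimal, illustrative KubeletConfiguration sketch; the values are placeholders and are not the defaults that ACK calculates for a node. For the actual behavior, see Resource reservation policy.

    # Illustrative sketch only: the values below are placeholders, not ACK defaults.
    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    kubeReserved:              # resources reserved for Kubernetes system daemons
      cpu: 500m
      memory: 1Gi
    systemReserved:            # resources reserved for OS system daemons
      cpu: 500m
      memory: 1Gi
    evictionHard:              # hard eviction thresholds
      memory.available: "300Mi"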
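
The following Deployment sketch illustrates the multi-replica recommendation in the note about startup probes: multiple replicas spread across nodes, with a readiness probe configured alongside the startup probe. The names, image, port, and probe paths are hypothetical; adjust them for your workload.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web-demo                      # hypothetical name
    spec:
      replicas: 3                         # multiple replicas so a single node restart does not take the app down
      selector:
        matchLabels:
          app: web-demo
      template:
        metadata:
          labels:
            app: web-demo
        spec:
          topologySpreadConstraints:      # spread replicas across nodes
          - maxSkew: 1
            topologyKey: kubernetes.io/hostname
            whenUnsatisfiable: ScheduleAnyway
            labelSelector:
              matchLabels:
                app: web-demo
          containers:
          - name: web
            image: nginx:1.25             # hypothetical image
            ports:
            - containerPort: 80
            startupProbe:                 # protects slow-starting containers
              httpGet:
                path: /healthz
                port: 80
              failureThreshold: 30
              periodSeconds: 5
            readinessProbe:               # keeps traffic-routing decisions accurate after restarts
              httpGet:
                path: /healthz
                port: 80
              periodSeconds: 10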
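
The following sketch shows the LoadBalancer Service configuration described in the note about externalTrafficPolicy. The Service name, selector, and port are hypothetical.

    apiVersion: v1
    kind: Service
    metadata:
      name: web-demo-slb               # hypothetical name
    spec:
      type: LoadBalancer
      externalTrafficPolicy: Local     # traffic is delivered only to pods on the node that receives it
      selector:
        app: web-demo
      ports:
      - port: 80
        targetPort: 80

If pods on the same node reach the application through the SLB address, one way to avoid depending on pod placement is to have them use the cluster-internal Service address instead.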
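
The following PodDisruptionBudget sketch is of the kind mentioned in the note about node draining. The name, selector, and minAvailable value are hypothetical; choose a value that keeps enough replicas of your service running while nodes are drained.

    apiVersion: policy/v1              # use policy/v1beta1 for clusters earlier than Kubernetes 1.21
    kind: PodDisruptionBudget
    metadata:
      name: web-demo-pdb               # hypothetical name
    spec:
      minAvailable: 2                  # keep at least 2 pods running during voluntary disruptions such as node draining
      selector:
        matchLabels:
          app: web-demo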

Feature description

During a node pool update, the kubelet and container runtime are updated.

  • kubelet: Update the kubelet on all nodes in a node pool to the same version as the control plane. An in-place update is performed on the node pool by default.

  • Container runtime: If a new container runtime version is available, you can update the container runtime of the nodes in a node pool to the new version.

    • If you change the container runtime from Docker to containerd, the change is applied by replacing the system disks of the nodes in the node pool. All data on the system disks is cleared. Back up important data on the system disks before you update the node pool. For more information, see Migrate the container runtime from Docker to containerd.

    • If you update the containerd version, an in-place update is performed on the node pool by default. The /etc/containerd/config.toml file on each node is replaced with the new version provided by ACK.

      Note

      During a runtime update, pod probes and lifecycle hooks may fail to run, and pods may restart in place.

    • In clusters that run Kubernetes versions earlier than 1.24, when you update Docker, the nodes in the node pool are updated by replacing their system disks by default. This process erases all data on the system disks. Therefore, back up important data on the system disks before you perform the update.

Procedure

  1. Log on to the ACK console. In the left-side navigation pane, click Clusters.

  2. On the Clusters page, find the cluster that you want to manage and click its name. In the left-side navigation pane, choose Nodes > Node Pools.

  3. On the Node Pools page, find the node pool that you want to update and choose More > Kubelet Update in the Actions column.

  4. View the update objects (container runtime and kubelet), specify the nodes that you want to update (all nodes or specific nodes), select update methods, and configure the batch update policy.

    • Update Method

      • Upgrade Node Pool by Replacing System Disk: The system replaces the system disks of the nodes in the node pool and reconfigures the node components based on the node pool configuration. This ensures that the node component configurations are consistent with the node pool configuration.

        If you select this update method, back up the data before you update the node pool or do not store important data on the system disks. The data disks are not affected by the update.

      • Create Snapshot before Update: If your nodes have important data on the system disks, you can select Create Snapshot before Update to back up the data so that it can be restored later. You are charged for using snapshots. For more information, see Snapshots. The snapshot creation progress is updated dynamically in the console. If you no longer need a snapshot after the update, delete it at the earliest opportunity.

    • Batch Update Policy

      • Maximum Number of Nodes per Batch: You can specify the maximum number of nodes that can be concurrently updated in a batch. You can specify up to 10 nodes. For more information about the update process, see Reference: In-place updates and updates by replacing system disks.

      • Automatic Pause Policy: specifies whether and when the update is automatically paused between batches.

      • Interval Between Batches: If you do not configure the Automatic Pause Policy, you can specify whether to set an interval between update batches. The interval between batches can be set to 5 to 120 minutes.

  5. Click Precheck. After the precheck is complete, follow the instructions on the page to start the update.

    Note

    If the cluster fails the precheck or the precheck result contains warnings, refer to Cluster check items and suggestions on how to fix cluster issues, or follow the instructions on the page to check the Pre-check Details section and troubleshoot the issues.

    During the update, you can perform the following operations in the Event Rotation section:

    • Pause: pauses the update. No further operations are performed after you pause the update. We recommend that you resume and complete the update at your earliest convenience. If the update is paused for more than 7 days, the system automatically terminates the update process and deletes the events and log data generated during the update.

      Pausing the update does not roll back nodes that have already been updated. The kubelet and container runtime cannot be rolled back after they are updated.

    • Cancel: cancels the update. Canceling the update does not roll back nodes that have already been updated. The kubelet and container runtime cannot be rolled back after they are updated.

    After the update, click the node name on the Nodes page. On the Overview tab, check whether the kubelet version and container runtime version of the node are updated to the desired versions.

Reference: In-place updates and updates by replacing system disks

In-place updates and updates by replacing system disks

The following section describes the procedure for in-place updates and updates by replacing system disks. You can specify the maximum number of nodes that can be concurrently updated in a batch. You can specify up to 10 nodes. The number of nodes updated per batch increases by batch in the following sequence: 1, 2, 4, 8... After the maximum concurrency is reached, the number of nodes updated in each batch is equal to the maximum concurrency. For example, if you set the maximum concurrency to 4, one node is updated in the first batch, two nodes are concurrently updated in the second batch, and four nodes are concurrently updated in the third batch and subsequent batches.

Figure: Batch update process when the maximum concurrency is N. The number of nodes updated per batch increases in the sequence 1, 2, 4, 8, ..., N.

How an in-place update is performed on a node

  1. A precheck is performed before the update. If the container runtime has critical issues, such as ttrpc request processing failures or container processes that do not respond to signals, the update is suspended.

  2. The current container and pod status is saved to the tmp temporary directory.

  3. containerd, crictl, and the related configuration files are updated to the latest versions provided by ACK. The containerd restart during this step does not affect running containers. If you modified the /etc/containerd/config.toml file on the node, the update overwrites your changes.

  4. The system verifies that the kubelet runs as expected and that the node is ready.

How a node is updated by replacing the system disk

  1. The node is drained. When the node is drained, it is set to unschedulable.

  2. The Elastic Compute Service (ECS) instance is stopped.

  3. The system disk is replaced and the disk ID is changed. The category of the system disk, the IP addresses of the ECS instance, and the MAC addresses of the elastic network interfaces (ENIs) that are bound to the ECS instance remain unchanged.

  4. The node is re-initialized.

  5. The node is restarted and ready. When the node is ready, it is set to schedulable.

    If a node is set to unschedulable before the node is drained, the node does not automatically revert to schedulable after the node is updated by replacing the system disk.

FAQ

Can I roll back a node pool after I update the node pool?

You cannot roll back the kubelet and container runtime after they are updated. You can roll back only the OS image after it is updated, and the image that you roll back to must be supported by the node pool.

Are applications affected during the update?

In-place update: Pods are not restarted and applications are not affected.

Update by replacing the system disk: Nodes are drained during the update. If an application runs in multiple pods that are spread across multiple nodes and graceful shutdown is enabled for the pods, the application is not affected. For more information about graceful shutdown, see Graceful shutdown and zero downtime deployments in Kubernetes; a minimal configuration sketch is also provided below. To prevent multiple replicas of the same application from being interrupted in the same batch, we recommend that you set the maximum number of nodes per batch to a value smaller than the number of pods in which the application runs.
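
As a reference for graceful shutdown, the following fragment of a Deployment pod template shows a preStop hook and a termination grace period. The container name, image, sleep duration, and grace period are illustrative; set them according to how long your application needs to finish in-flight requests.

    # Fragment of a Deployment pod template; values are illustrative.
    spec:
      terminationGracePeriodSeconds: 60       # time allowed for shutdown after SIGTERM
      containers:
      - name: web                             # hypothetical container
        image: nginx:1.25                     # hypothetical image
        lifecycle:
          preStop:
            exec:
              command: ["sh", "-c", "sleep 15"]   # give load balancers time to remove the endpoint before shutdown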

How long does it take to update the nodes in a batch?

In-place update: within 5 minutes.

Update by replacing the system disk: within 8 minutes if snapshots are not created. If you select Create Snapshot before Update, ACK starts to update the nodes after the snapshots are created. The timeout period of snapshot creation is 40 minutes. If snapshot creation times out, the node update fails to start. If no business data is stored on the system disks, we recommend that you clear Create Snapshot before Update.

Does data loss occur when a node is updated?

If the update is performed by replacing system disks, back up the data before you update the node pool, or do not store important data on the system disks. The data disks are not affected by the update.

Does the IP address of a node change after the system disk of the node is replaced?

After the system disk is replaced, the disk ID is changed but the category of the system disk, the IP addresses of the ECS instance, and the MAC addresses of the ENIs that are bound to the ECS instance remain unchanged. For more information, see Replace the system disk (operating system) of an instance.

How do I update free nodes?

Nodes that are not added to node pools are called free nodes. Free nodes exist in clusters that were created before the node pool feature was released. To update a free node, add the node to a node pool and then update the node pool. For more information about how to add a free node to a node pool, see Add free nodes to a node pool.

What do I do if the Docker directory still exists and occupies disk space after I change the container runtime of a node from Docker to containerd?

The Docker directory contains cluster-related containers, images, and logs, as well as files and directories that you created. If you no longer require the data in the Docker directory, you can manually delete the directory.

How do I restore data from snapshots?

You can create snapshots for nodes in a node pool when you update the node pool. By default, the snapshot is retained for 30 days. You can manually delete the snapshot before the retention period ends. If your data is lost after you update a node pool, you can use the following methods to restore the data:

  • If an in-place update is performed to update only the kubelet, you can use a snapshot to roll back the disk. For more information, see Roll back a disk by using a snapshot.

  • If the operating system or container runtime is updated by replacing the system disks of the nodes in the node pool, you can create a disk from the snapshot. For more information, see Create a disk from a snapshot.
