Automatically repair the nodes in a managed node pool based on automated O&M capabilities

Container Service for Kubernetes (ACK) monitors the status of nodes in managed node pools so that the nodes in an ACK cluster keep running as expected. When an exception occurs on a node in a managed node pool, ACK automatically repairs the node. After you configure a node pool as a managed node pool, auto repair is automatically enabled for its nodes. This topic describes the conditions that trigger auto repair and the auto repair procedure.

Prerequisites

The managed node pool feature is enabled. For more information, see Create a node pool.
The Kubernetes event center is enabled. For more information, see Event monitoring.

Conditions to trigger auto repair

Important

If you enabled Restart Nodes if Necessary to Patch CVE Vulnerabilities when you created the node pool, operations such as node draining and system disk replacement may be performed during auto repair. To avoid data loss, we recommend that you store data on data disks.

ACK determines whether to run auto repair tasks based on the status (conditions) of nodes. To check the status of a node, run the kubectl describe node command and check the Conditions section of the output. If a node remains in an abnormal state for longer than the threshold, ACK automatically runs repair tasks on the node. The threshold is the maximum period of time that a node can remain in an abnormal state after an exception occurs before auto repair is triggered.
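The condition check can also be scripted. The following is a minimal sketch, assuming that you have the JSON output of kubectl get node <node-name> -o json and want to list the conditions that look abnormal. The function name and the sample data are hypothetical. The rule used here (Ready is healthy only when its status is True, and other conditions report a problem when their status is True) follows the general Kubernetes convention and is not a description of how ACK diagnoses nodes internally.

```python
import json
from typing import List


def abnormal_conditions(node: dict) -> List[str]:
    """Return the types of the conditions that indicate a problem on the node."""
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        ctype, status = cond.get("type"), cond.get("status")
        if ctype == "Ready":
            # Ready is healthy only when its status is "True".
            if status != "True":
                problems.append(ctype)
        elif status == "True":
            # Other conditions (for example, MemoryPressure) report a problem when "True".
            problems.append(ctype)
    return problems


# Hypothetical sample, shaped like the output of `kubectl get node <node-name> -o json`.
sample_node = json.loads("""
{
  "status": {
    "conditions": [
      {"type": "Ready", "status": "False", "reason": "KubeletNotReady"},
      {"type": "MemoryPressure", "status": "False"},
      {"type": "ReadonlyFilesystem", "status": "True"}
    ]
  }
}
""")

print(abnormal_conditions(sample_node))  # ['Ready', 'ReadonlyFilesystem']
```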

The following table describes the trigger conditions.

Check Item	Description	Severity	Threshold	Fix
KubeletNotReady(KubeletHung)	The node enters the NotReady state because the kubelet stops running.	High	180s	Restart the kubelet. Restart the Elastic Compute Service (ECS) instance if Restart Nodes if Necessary to Patch CVE Vulnerabilities is enabled.
KubeletNotReady(PLEG)	The node enters the NotReady state because the Pod Lifecycle Event Generator (PLEG) module fails to pass health checks.	Medium	180s	Restart containerd or Docker. Restart the kubelet. Restart the ECS instance if Restart Nodes if Necessary to Patch CVE Vulnerabilities is enabled.
KubeletNotReady(SandboxError)	The kubelet cannot be started because no sandboxed pod is found.	High	180s	Delete the sandboxed pod. Restart the kubelet.
RuntimeOffline	Docker or containerd stops running and the node is unavailable.	High	90s	Restart containerd or Docker. Restart the ECS instance if Restart Nodes if Necessary to Patch CVE Vulnerabilities is enabled.
NTPProblem	The time synchronization service (ntpd or chronyd) is in an abnormal state.	High	10s	Restart ntpd or chronyd.
SystemdOffline	Systemd is in an abnormal state and cannot launch or destroy containers.	High	90s	Restart the ECS instance if Restart Nodes if Necessary to Patch CVE Vulnerabilities is enabled.
ReadonlyFilesystem	The node file system becomes read-only.	High	90s	Restart the ECS instance if Restart Nodes if Necessary to Patch CVE Vulnerabilities is enabled.
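The threshold rule can be illustrated with a short sketch. The threshold values are copied from the preceding table. The function name and the idea of passing in the elapsed abnormal time in seconds are assumptions made for illustration only; this is not ACK's implementation.

```python
# Thresholds in seconds, copied from the trigger condition table.
THRESHOLDS = {
    "KubeletNotReady(KubeletHung)": 180,
    "KubeletNotReady(PLEG)": 180,
    "KubeletNotReady(SandboxError)": 180,
    "RuntimeOffline": 90,
    "NTPProblem": 10,
    "SystemdOffline": 90,
    "ReadonlyFilesystem": 90,
}


def should_trigger_repair(check_item: str, abnormal_seconds: float) -> bool:
    """Return True if the abnormal state has lasted longer than the threshold."""
    return abnormal_seconds > THRESHOLDS[check_item]


print(should_trigger_repair("RuntimeOffline", 60))   # False: still within 90s
print(should_trigger_repair("RuntimeOffline", 120))  # True: longer than 90s
```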

Procedure

The auto repair feature includes the following phases: diagnose node exceptions, determine whether to trigger auto repair, and run auto repair tasks.

Important

Node diagnostics are based on data provided by node-problem-detector (NPD) and the Kubernetes event center. Before you use the auto repair feature, make sure that NPD is installed and the Kubernetes event center is enabled. For more information, see Event monitoring.

During a complete auto repair procedure, a node transitions between the following states:

Normal: The node runs without exceptions.
Error: Exceptions occur on the node.
Failed to Recover: The node fails to recover from the exceptions.

[Figure: Node auto recovery]

If a node enters an abnormal state and remains in the abnormal state for a period of time that is longer than the threshold, ACK determines that the node is in the Error state.
After ACK determines that a node is in the Error state, ACK runs specific auto repair tasks to fix the exceptions and generates events.
- If the node exceptions are fixed after the repair tasks are completed, the node changes to the Normal state.
- If the node exceptions persist after the repair tasks are completed, the node changes to the Failed to Recover state.
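The state transitions described above can be sketched as a small state model. The enum and function names are hypothetical, and the sketch only mirrors the documented transitions. It does not reflect how ACK implements them.

```python
from enum import Enum


class NodeState(Enum):
    NORMAL = "Normal"
    ERROR = "Error"
    FAILED_TO_RECOVER = "Failed to Recover"


def diagnose(abnormal_seconds: float, threshold_seconds: float) -> NodeState:
    """A node is in the Error state once it stays abnormal for longer than the threshold."""
    if abnormal_seconds > threshold_seconds:
        return NodeState.ERROR
    return NodeState.NORMAL


def after_repair(exceptions_fixed: bool) -> NodeState:
    """State of a node in the Error state after the repair tasks are completed."""
    if exceptions_fixed:
        return NodeState.NORMAL
    return NodeState.FAILED_TO_RECOVER


print(diagnose(200, 180).value)    # Error
print(after_repair(False).value)   # Failed to Recover
```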

Note

If node exceptions occur in multiple node pools in a cluster, ACK runs auto repair tasks on the node pools in parallel.
If exceptions occur on multiple nodes in a node pool, ACK runs auto repair tasks on the nodes in sequence. If ACK fails to repair one of the nodes, ACK stops running auto repair tasks on the remaining nodes.
If ACK fails to repair a node, ACK stops triggering auto repair for the node. This means that auto repair is resumed for the node only after the node recovers from the exception.
Some complex cases may still require manual intervention.
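The ordering rules in the preceding note (node pools in parallel, nodes within a node pool in sequence, stop at the first failure) can be sketched as follows. All names are hypothetical, and the repair callback stands in for whatever repair tasks ACK runs. This is an illustration of the documented behavior under those assumptions, not ACK's scheduler.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def repair_pool(nodes: List[str], repair: Callable[[str], bool]) -> List[str]:
    """Repair nodes one at a time and stop at the first failure.

    Returns the nodes on which a repair was attempted.
    """
    attempted = []
    for node in nodes:
        attempted.append(node)
        if not repair(node):
            break  # Skip the remaining nodes in this node pool.
    return attempted


def repair_cluster(pools: Dict[str, List[str]],
                   repair: Callable[[str], bool]) -> Dict[str, List[str]]:
    """Run repair_pool for every node pool in parallel."""
    with ThreadPoolExecutor() as executor:
        futures = {name: executor.submit(repair_pool, nodes, repair)
                   for name, nodes in pools.items()}
        return {name: future.result() for name, future in futures.items()}


# Hypothetical data: the repair of node "a2" fails, so "a3" is never attempted.
pools = {"pool-a": ["a1", "a2", "a3"], "pool-b": ["b1", "b2"]}
result = repair_cluster(pools, lambda node: node != "a2")
print(result)  # {'pool-a': ['a1', 'a2'], 'pool-b': ['b1', 'b2']}
```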

Auto repair events

After auto repair is triggered, ACK writes the relevant events to the Kubernetes event center. To view the repair records and operations in the Kubernetes event center, go to the cluster details page and choose Operations > Event Center in the left-side navigation pane.

Cause	Level	Description
NodeRepairStart	Normal	The system starts to repair the node.
NodeRepairAction	Normal	The repair operation performed on the node, such as restarting the kubelet.
NodeRepairSucceed	Normal	The node recovers from exceptions after the repair operation is completed.
NodeRepairFailed	Warning	The node fails to recover from exceptions after the repair operation is completed. For more information, see FAQ.
NodeRepairIgnore	Normal	The system skips the repair operation for the node. For example, if an ECS instance is not in the Running state, the system does not perform auto repair operations on the node.

FAQ

What do I do if ACK fails to repair a node?

The auto repair feature may fail to fix complicated node exceptions in some scenarios. If an auto repair task fails on a node or the node exception persists after an auto repair task is completed, the node enters the Failed to Recover state.

If ACK fails to repair a node in a node pool, it stops triggering auto repair for all nodes in the node pool until the node recovers from exceptions. In this case, you can submit a ticket to request technical support to manually repair the node.

How do I disable auto repair for a node?

To disable auto repair for a node in a node pool, add the following label to the node:

alibabacloud.com/repair.policy=disable
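One way to add the label is the kubectl label command. The following sketch builds that command for a given node. The function names and the node name my-node are placeholders, and the sketch only prints the command unless you call apply_disable_auto_repair yourself.

```python
import subprocess
from typing import List

# Label from this topic that disables auto repair for a node.
REPAIR_POLICY_LABEL = "alibabacloud.com/repair.policy=disable"


def disable_auto_repair_command(node_name: str) -> List[str]:
    """Build the kubectl command that adds the label to the node."""
    return ["kubectl", "label", "node", node_name, REPAIR_POLICY_LABEL]


def apply_disable_auto_repair(node_name: str) -> None:
    """Run the command. Requires kubectl and access to the cluster."""
    subprocess.run(disable_auto_repair_command(node_name), check=True)


# "my-node" is a placeholder; replace it with a real node name.
print(" ".join(disable_auto_repair_command("my-node")))
```

The printed command is kubectl label node my-node alibabacloud.com/repair.policy=disable.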

References

If you want to resolve the issue by removing the abnormal node and re-adding it, log on to the ACK console and go to the Node Pools page. For more information, see Remove a node and Add existing ECS instances to an ACK cluster.

Do not perform the following operations:

Remove nodes by running the kubectl delete node command.
Remove or add nodes in the ECS console or the Auto Scaling (ESS) console.