Container Intelligence Service provides the periodic inspection feature. You can configure inspection rules to periodically inspect your cluster and identify potential risks. This topic describes the alerts that are generated by the cluster inspection feature for common issues and the solutions to these issues.
Check items supported by cluster inspection
For more information about the cluster inspection feature, see Work with the cluster inspection feature.
The check items may vary based on the configuration of your cluster. The check items listed in your inspection report take precedence.
Insufficient quota on VPC route entries
Alert description: The number of route entries that you can add to the route table of the cluster virtual private cloud (VPC) is less than five. In a cluster that has Flannel installed, each node occupies one VPC route entry. If the quota on VPC route entries is exhausted, you cannot add nodes to the cluster. Clusters that have Terway installed do not use VPC route entries.
Solution: By default, you can add at most 200 route entries to the route table of a VPC. To increase the quota, log on to the Quota Center console and submit an application. For more information about the quota limit, see Quotas.
Insufficient quota on SLB instances that can be associated with an ECS instance
Alert description: Check the maximum number of backend server groups that can be associated with each Elastic Compute Service (ECS) instance. The number of Server Load Balancer (SLB) instances that can be associated with an ECS instance is limited. For pods that are connected to a LoadBalancer Service, the ECS instances on which the pods are deployed are associated with the SLB instance of the LoadBalancer Service. If the quota is exhausted, the new pods that you deploy and associate with the LoadBalancer Services cannot process requests as expected.
Solution: By default, you can add an ECS instance to at most 50 SLB server groups. To increase the quota, log on to the Quota Center console and submit an application. For more information about the quota limit, see Quotas. For more information about the considerations for configuring load balancing, see Considerations for configuring a LoadBalancer Service.
Insufficient quota on SLB backend servers
Alert description: Check the maximum number of backend servers that can be associated with an SLB instance. The number of ECS instances that can be associated with an SLB instance is limited. If your LoadBalancer Service serves a large number of pods that are distributed across many ECS instances and this limit is exceeded, the excess ECS instances cannot be associated with the SLB instance.
Solution: By default, you can associate at most 200 backend servers with an SLB instance. To increase the quota, log on to the Quota Center console and submit an application. For more information about the quota limit, see Quotas. For more information about the considerations for configuring load balancing, see Considerations for configuring a LoadBalancer Service.
Insufficient quota on SLB listeners
Alert description: Check the maximum number of listeners that you can add to an SLB instance. The number of listeners that you can add to an SLB instance is limited. A LoadBalancer Service listens on specific ports. Each port corresponds to an SLB listener. If the number of ports on which a LoadBalancer Service listens exceeds the quota, the ports that are not monitored by listeners cannot provide services as expected.
Solution: By default, you can add at most 50 listeners to an SLB instance. To increase the quota, log on to the Quota Center console and submit an application. For more information about the quota limit, see Quotas. For more information about the considerations for configuring load balancing, see Considerations for configuring a LoadBalancer Service.
Insufficient quota on SLB instances
Alert description: Check whether the number of SLB instances that you can create is less than five. An SLB instance is created for each LoadBalancer Service. When the SLB instance quota is exhausted, newly created LoadBalancer Services cannot work as expected.
Solution: By default, you can have at most 60 SLB instances within each Alibaba Cloud account. To increase the quota, submit an application in the Quota Center console. For more information about the considerations for configuring load balancing, see Considerations for configuring a LoadBalancer Service.
Excessive SLB bandwidth usage
Alert description: Check whether the peak value of the outbound bandwidth usage within the previous three days is higher than 80% of the bandwidth limit. If the bandwidth resources of the SLB instance are exhausted, the SLB instance may drop packets. This causes network jitters or increases the response latency.
Solution: If the bandwidth usage of the SLB instance is excessively high, upgrade the SLB instance. For more information, see Use an existing SLB instance.
Excessive number of SLB connections
Alert description: Check whether the peak value of SLB connections within the previous three days is higher than 80% of the upper limit. If the number of SLB connections reaches the upper limit, new connections cannot be established within a short period of time. Consequently, clients fail to establish connections to the SLB instance.
Solution: If the number of connections that are established to the SLB instance is excessively high and exceeds 80% of the upper limit within the previous three days, upgrade the SLB instance to avoid service interruptions. For more information, see Use an existing SLB instance.
Excessively high rate of new SLB connections per second
Alert description: Check whether the highest rate of new SLB connections per second within the previous three days is higher than 80% of the upper limit. If the rate reaches the upper limit, clients cannot establish new connections to the SLB instance within a short period of time.
Solution: If the rate of new SLB connections per second is excessively high and exceeds 80% of the upper limit, upgrade the SLB instance to avoid service interruptions. For more information, see Use an existing SLB instance.
Excessively high SLB QPS
Alert description: The highest QPS value of the SLB instance within the previous three days is higher than 80% of the upper limit. If the QPS value reaches the upper limit, clients cannot connect to the SLB instance.
Solution: If the QPS value of the SLB instance is excessively high and exceeds 80% of the upper limit, upgrade the SLB instance to avoid service interruptions. For more information, see Use an existing SLB instance.
Insufficient number of available pod CIDR blocks
Alert description: Check whether the number of available pod CIDR blocks in an ACK cluster that has Flannel installed is less than five. Each node in the cluster is assigned one pod CIDR block, so if fewer than five pod CIDR blocks are available, fewer than five nodes can still be added to the cluster. If all of the pod CIDR blocks are used, new nodes that you add to the cluster cannot work as expected.
Solution: Submit a ticket.
Excessively high CPU usage on nodes
Alert description: Check the CPU usage on nodes within the previous seven days. If the CPU usage on nodes is excessively high and a large number of pods are scheduled, the pods compete for resources. This may result in service interruptions.
Solution: To avoid service interruptions, set proper resource requests and limits for your pods so that an excessive number of pods is not scheduled onto a single node. For more information, see Modify the upper and lower limits of CPU and memory resources for a pod.
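The following is a minimal sketch of a pod spec that sets CPU and memory requests and limits. The pod name, image, and resource values are placeholders; adjust them to the actual resource profile of your workloads.

apiVersion: v1
kind: Pod
metadata:
  name: resource-limited-app        # placeholder name
spec:
  containers:
  - name: app
    image: nginx:1.25               # placeholder image
    resources:
      requests:
        cpu: "250m"                 # the scheduler reserves this much CPU on the node
        memory: "256Mi"
      limits:
        cpu: "500m"                 # the container is throttled above this value
        memory: "512Mi"             # the container is OOM-killed above this value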
Excessively high memory usage on nodes
Alert description: Check the memory usage on nodes within the previous seven days. If the memory usage on nodes is excessively high and a large number of pods are scheduled, the pods compete for resources. This may lead to out of memory (OOM) errors and result in service interruptions.
Solution: To avoid service interruptions, set proper resource requests and limits for your pods so that an excessive number of pods is not scheduled onto a single node. For more information, see Modify the upper and lower limits of CPU and memory resources for a pod.
Insufficient number of idle vSwitch IP addresses
Alert description: Check whether the number of idle vSwitch IP addresses in a cluster that has Terway installed is less than 10. Each pod occupies one vSwitch IP address. If the vSwitch IP addresses are exhausted, new pods are not assigned IP addresses and cannot start as expected.
Solution: Create a vSwitch for the cluster or change the vSwitch that is specified for the cluster. For more information, see What do I do if an ACK cluster in which Terway is installed has insufficient idle vSwitch IP addresses?
Excessively high rate of new SLB connections per second of the Ingress controller
Alert description: The highest rate of new SLB connections per second within the previous three days is higher than 80% of the upper limit. If the rate reaches the upper limit, clients cannot establish new connections to the SLB instance within a short period of time.
Solution: If the rate of new SLB connections per second is excessively high and exceeds 80% of the upper limit, upgrade the SLB instance to avoid service interruptions. For more information, see Use an existing SLB instance.
Excessively high SLB QPS of the Ingress controller
Alert description: The highest QPS value of the SLB instance within the previous three days is higher than 80% of the upper limit. If the QPS value reaches the upper limit, clients cannot connect to the SLB instance.
Solution: If the QPS value of the SLB instance is excessively high, upgrade the SLB instance to avoid service interruptions. For more information, see Use an existing SLB instance.
Insufficient number of idle vSwitch IP addresses of control planes
Alert description: The number of idle vSwitch IP addresses of control planes is less than 10. Consequently, newly created pods cannot run as expected.
Solution: Submit a ticket.
Outdated Kubernetes version of a cluster
Alert description: The Kubernetes version of a cluster is outdated or will be outdated soon. Container Service for Kubernetes (ACK) clusters can stably run the latest three versions of Kubernetes. Stability issues or update failures may arise in ACK clusters that run an outdated Kubernetes major version. For more information about the release notes of Kubernetes versions supported by ACK, see Support for Kubernetes versions.
Solution: If your cluster runs an outdated Kubernetes major version, update the cluster at the earliest opportunity. For more information, see Update an ACK cluster or update only the control planes or node pools in an ACK cluster.
Outdated CoreDNS version
Alert description: The version of the CoreDNS component that is installed in the cluster is outdated. The latest CoreDNS version provides higher stability and new features.
Solution: To avoid DNS resolution errors, update the CoreDNS component at the earliest opportunity. For more information, see Manually update CoreDNS.
Outdated node systemd version
Alert description: The systemd version is outdated and has stability issues that can cause the Docker and containerd components to malfunction.
Solution: For more information about how to fix this issue, see What do I do if the error message "Reason:KubeletNotReady Message:PLEG is not healthy:" appears in the logs of the kubelets in a Kubernetes cluster that runs CentOS 7.6?
Outdated node OS version
Alert description: The OS version is outdated and has stability issues that can cause the Docker and containerd components to malfunction.
Solution: Create a new node pool, temporarily migrate the workloads to the new node pool, and then update the OS of the nodes in the current node pool. For more information, see Create a node pool.
Outdated cluster component version
Alert description: The versions of the key components in the cluster are outdated.
Solution: Update the key components to the latest versions in the ACK console at the earliest opportunity.
Docker hang error on nodes
Alert description: Docker hangs on nodes.
Solution: Log on to the nodes and run the sudo systemctl restart docker command to restart Docker. For more information, see Dockerd exceptions - RuntimeOffline.
Incorrect maximum number of pods supported by the node
Alert description: The maximum number of pods supported by the node is different from the theoretical value.
Solution: If the maximum number of pods supported by a node is different from the theoretical value and you have never modified this limit, submit a ticket.
Errors in the CoreDNS ConfigMap configuration
Alert description: Check whether the configuration of the CoreDNS ConfigMap contains errors. Configuration errors can cause the CoreDNS component to malfunction.
Solution: Check the configuration of the CoreDNS ConfigMap. For more information, see Best practices for DNS services.
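For reference, a typical default Corefile in the coredns ConfigMap in the kube-system namespace looks roughly like the following. The exact plugin list and options in your cluster may differ; treat this as a sketch to compare against rather than the exact ACK default.

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf   # forward non-cluster queries to the upstream resolvers of the node
        cache 30
        loop
        reload                       # reload the Corefile after the ConfigMap is modified
        loadbalance
    }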
CoreDNS deployment errors
Alert description: Check whether CoreDNS is deployed on a master node. If CoreDNS is deployed on a master node, the bandwidth usage of the master node may be excessively high, which affects the control planes.
Solution: Deploy CoreDNS pods on a worker node. For more information, see DNS troubleshooting.
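One way to keep CoreDNS pods off master nodes is to add a nodeAffinity rule to the pod template of the coredns Deployment in the kube-system namespace. The following excerpt is a sketch that assumes master nodes carry the conventional node-role.kubernetes.io/master label; verify the node labels that are used in your cluster before you apply it.

# Excerpt of the pod template in the coredns Deployment (kube-system namespace).
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-role.kubernetes.io/master
                operator: DoesNotExist    # schedule only onto nodes without the master role label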
High availability issues in nodes with CoreDNS installed
Alert description: Check how CoreDNS pods are deployed. When all CoreDNS pods are deployed on the same node, a single point of failure may occur. If the node fails or restarts, CoreDNS may fail to provide services, which causes business interruptions.
Solution: Deploy the CoreDNS pods on different nodes. For more information, see DNS troubleshooting.
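To spread CoreDNS replicas across different nodes, you can add a podAntiAffinity rule to the pod template of the coredns Deployment. The following excerpt is a sketch that assumes the CoreDNS pods carry the k8s-app: kube-dns label; confirm the label that is used in your cluster.

# Excerpt of the pod template in the coredns Deployment (kube-system namespace).
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                k8s-app: kube-dns                    # label carried by the CoreDNS pods
            topologyKey: kubernetes.io/hostname      # allow at most one CoreDNS pod per node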
No backend DNS servers available for the DNS service
Alert description: Check the number of backend DNS servers that are associated with the DNS service in the cluster. If the number is 0, the DNS service is unavailable.
Solution: Check the status and logs of the CoreDNS pods to troubleshoot DNS issues. For more information, see DNS troubleshooting.
Abnormal ClusterIP of the DNS service
Alert description: Check whether the ClusterIP of the DNS service in the cluster is successfully assigned. An abnormal DNS service can cause exceptions in cluster features and further impact your businesses.
Solution: Check the status and logs of the CoreDNS pods to troubleshoot DNS issues. For more information, see DNS troubleshooting.
Abnormal NAT gateway status in the cluster
Alert description: Check the status of the NAT gateway in the cluster.
Solution: Log on to the NAT Gateway console to check whether the NAT gateway is locked due to overdue payments.
Excessively high packet loss rate in the cluster because the NAT gateway exceeds the maximum number of concurrent sessions
Alert description: Check whether the packet loss rate in the cluster is excessively high because the NAT gateway exceeds the maximum number of concurrent sessions.
Solution: If the packet loss rate in the cluster is excessively high because the NAT gateway exceeds the maximum number of concurrent sessions, you can resolve the issue by upgrading the NAT gateway. For more information, see FAQ about upgrading standard Internet NAT gateways to enhanced Internet NAT gateways.
Automatic DNSConfig injection disabled for NodeLocal DNSCache
Alert description: Check whether the automatic DNSConfig injection is enabled. NodeLocal DNSCache takes effect only when automatic DNSConfig injection is enabled.
Solution: Enable automatic DNSConfig injection. For more information, see Configure NodeLocal DNSCache.
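In ACK, automatic DNSConfig injection is typically enabled per namespace by adding a label to the namespace. The following sketch assumes the node-local-dns-injection label that is described in Configure NodeLocal DNSCache; confirm the exact label key in that topic before you use it.

apiVersion: v1
kind: Namespace
metadata:
  name: my-app                               # placeholder namespace
  labels:
    node-local-dns-injection: "enabled"      # assumed label that enables automatic DNSConfig injection for pods in this namespace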
Errors in the access control configuration of the SLB instance associated with the API server
Alert description: Check whether the access control configuration of the SLB instance associated with the API server allows access from the VPC CIDR block of the cluster and 100.104.0.0/16. If access from these CIDR blocks is not allowed, the cluster becomes unavailable.
Solution: Modify the access control configuration of the SLB instance associated with the API server.
Abnormal status of backend servers of the SLB instance associated with the API server
Alert description: Check the status of the backend servers of the SLB instance associated with the API server in an ACK dedicated cluster. The backend servers of the SLB instance associated with the API server in an ACK dedicated cluster must include master nodes. Otherwise, traffic forwarding exceptions occur.
Solution: Add master nodes to the forwarding rules of the SLB instance.
Errors in the configuration of the listener that listens on port 6443 for the SLB instance associated with the API server
Alert description: Check the configuration of the listener that listens on port 6443 for the SLB instance associated with the API server. If the configuration contains errors, the cluster is inaccessible.
Solution: Modify the listener configuration to restore it to the configuration that was used when the cluster was created.
The SLB instance associated with the API server does not exist
Alert description: Check whether the SLB instance associated with the API server exists in the cluster. If the SLB instance does not exist, the cluster is unavailable.
Solution: If the SLB instance is accidentally deleted, submit a ticket.
Abnormal status of the SLB instance associated with the API server
Alert description: Check the status of the SLB instance associated with the API server. If the status of the SLB instance is abnormal, the cluster is unavailable.
Solution: Make sure that the status of the SLB instance is normal.
SLB health check failures of the Ingress controller
Alert description: The SLB instance failed health checks within the previous three days. The failures may be caused by high component loads or incorrect component configurations.
Solution: To avoid service interruptions, check whether abnormal events are generated for the Ingress controller Service and whether the component loads are excessively high. For more information about how to troubleshoot issues, see NGINX Ingress controller troubleshooting.
Low percentage of ready Ingress pods
Alert description: The percentage of ready pods among the pods created for the Ingress Deployment is lower than 100%. In this case, the Ingress Deployment cannot be started and fails health checks.
Solution: Use the pod diagnostics feature or refer to the Ingress troubleshooting documentation to identify the pods that are not ready. For more information, see NGINX Ingress controller troubleshooting.
Error logs of the Ingress controller pod
Alert description: The Ingress controller pod generates error logs. This indicates that the Ingress controller does not work as expected.
Solution: Troubleshoot the issues based on the error logs. For more information, see NGINX Ingress controller troubleshooting.
Use of rewrite-target annotation without specifying capture groups in NGINX Ingresses
Alert description: The rewrite-target annotation is specified in the rules of the NGINX Ingress but capture groups are not specified. In Ingress controller 0.22.0 or later, you must specify capture groups if the rewrite-target annotation is configured. Otherwise, traffic forwarding is interrupted.
Solution: Reconfigure the rules of the NGINX Ingress and specify capture groups. For more information, see Advanced NGINX Ingress configurations.
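For example, in Ingress controller 0.22.0 and later, the rewrite-target value can reference a capture group that is defined in the path, as shown in the following sketch. The host, Service name, and paths are placeholders.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rewrite-example
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /$2    # $2 refers to the second capture group in the path
spec:
  ingressClassName: nginx
  rules:
  - host: example.com                  # placeholder host
    http:
      paths:
      - path: /app(/|$)(.*)            # (.*) is the capture group that rewrite-target reuses
        pathType: ImplementationSpecific
        backend:
          service:
            name: app-svc              # placeholder Service
            port:
              number: 80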
Improper canary release rules of NGINX Ingresses
Alert description: The service-match or service-weight annotation is configured for more than two Services. These annotations support traffic distribution across at most two Services. If more than two Services are configured, the additional Services are ignored and traffic is not forwarded as expected.
Solution: Reduce the number of Services to two.
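The following sketch shows a configuration in which the service-weight annotation distributes traffic between exactly two Services. The host, Service names, and weights are placeholders; see the ACK documentation for the exact annotation semantics.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: canary-example
  annotations:
    # Distribute traffic between at most two Services: 80% to the old version, 20% to the new version.
    nginx.ingress.kubernetes.io/service-weight: "old-svc: 80, new-svc: 20"
spec:
  ingressClassName: nginx
  rules:
  - host: example.com
    http:
      paths:
      - path: /                        # old version backend
        pathType: ImplementationSpecific
        backend:
          service:
            name: old-svc
            port:
              number: 80
      - path: /                        # new version backend
        pathType: ImplementationSpecific
        backend:
          service:
            name: new-svc
            port:
              number: 80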
Incorrect NGINX Ingress annotations
Alert description: The Ingress uses annotations that start with nginx.com or nginx.org instead of annotations that start with nginx.ingress.kubernetes.io. Annotations that start with nginx.com or nginx.org belong to the open source NGINX Ingress controller from NGINX Inc. and cannot be recognized by the NGINX Ingress controller that is deployed in the cluster. If these annotations are used, the relevant configurations are not applied.
Solution: Use the annotations supported by the NGINX Ingress controller. For more information about NGINX Ingress annotations, see Alibaba Cloud documentation or Community documentation.
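For example, an SSL redirect must be configured with an annotation that uses the nginx.ingress.kubernetes.io prefix. The following sketch contrasts a recognized annotation with an unrecognized one; the host and Service name are placeholders.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: annotation-example
  annotations:
    # Recognized by the NGINX Ingress controller:
    nginx.ingress.kubernetes.io/ssl-redirect: "false"
    # Not recognized (NGINX Inc. controller prefix); the configuration would be silently ignored:
    # nginx.org/ssl-services: "app-svc"
spec:
  ingressClassName: nginx
  rules:
  - host: example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app-svc              # placeholder Service
            port:
              number: 80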
Use of deprecated components
Alert description: Deprecated components are installed in the cluster.
Solution: The alicloud-application-controller component is discontinued. If deprecated components such as this one remain installed in the cluster, you may fail to update or use the cluster as expected. Uninstall the deprecated components. For more information, see Manage components.
Connectivity errors to the Kubernetes API server
Alert description: Nodes cannot connect to the Kubernetes API server of the cluster.
Solution: Check cluster configurations. For more information, see Troubleshoot ACK clusters.
Issues related to the pod CIDR blocks of nodes and VPC route table
Alert description: Check whether the pod CIDR block is included in the VPC route table.
Solution: If the pod CIDR block of nodes is not included in the VPC route table, add a route whose next hop is the current node for the pod CIDR block. For more information, see Step 2: Add a custom route to the custom route table.
Issues related to the read-only status of the node file system
Alert description: If the file system of a node is read-only, it is typically caused by a disk failure. This issue can prevent the node from writing data and may result in service interruptions.
Solution: Run the fsck command on the node to repair the file system, and then restart the node.
Issues related to the version of the kubelet on nodes
Alert description: Check whether the version of the kubelet on a node is earlier than the version of control planes.
Solution: If the version of the kubelet on a node is earlier than the version of control planes, we recommend that you manually remove the node to avoid stability issues. For more information, see Features and custom configurations.
Issues related to the outbound rules of the security group specified for nodes
Alert description: Check whether the outbound rules of the security group specified for nodes meet the access permission requirements of the cluster.
Solution: If the outbound rules of the security group do not meet the access permission requirements of the cluster, modify the outbound rules. For more information, see Configure security groups for clusters.
Issues related to the inbound rules of the security group specified for nodes
Alert description: Check whether the inbound rules of the security group specified for nodes meet the access permission requirements of the cluster.
Solution: If the inbound rules of the security group do not meet the access permission requirements of the cluster, modify the inbound rules. For more information, see Configure security groups for clusters.
Node inaccessibility to the Internet
Alert description: Check whether the node can access the Internet.
Solution: Check whether SNAT is enabled for the cluster. For more information about how to enable SNAT, see Enable an existing ACK cluster to access the Internet.
One SLB port shared by multiple Services
Alert description: A port of an SLB instance is shared by multiple Services. This causes a Service exception.
Solution: Delete or modify the conflicting Services that share the same SLB port. Make sure that they use different ports when they share the same SLB instance.
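For example, when two LoadBalancer Services reuse the same existing SLB instance, they must expose different frontend ports, as shown in the following sketch. The alibaba-cloud-loadbalancer-id annotation, SLB instance ID, and Service definitions are assumptions for illustration; verify the annotation names in the ACK documentation before you use them.

apiVersion: v1
kind: Service
metadata:
  name: svc-a
  annotations:
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-id: "lb-xxxxxxxx"   # placeholder ID of the shared SLB instance
spec:
  type: LoadBalancer
  selector:
    app: app-a
  ports:
  - port: 80            # frontend port 80 on the shared SLB instance
    targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: svc-b
  annotations:
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-id: "lb-xxxxxxxx"   # same SLB instance
spec:
  type: LoadBalancer
  selector:
    app: app-b
  ports:
  - port: 443           # different frontend port, so the two Services do not conflict
    targetPort: 8443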