Container Intelligence Service provides the periodic inspection feature. You can configure inspection rules to periodically inspect your cluster and identify potential risks. This topic describes the alerts that are generated by the cluster inspection feature for common issues and the solutions to these issues.
Check items supported by cluster inspection
For more information about the cluster inspection feature, see Work with the cluster inspection feature.
The check items may vary based on the configuration of your cluster. The actual check items displayed in your inspection report shall prevail.
Check item | Alert |
Resource quotas (ResourceQuotas) | |
Resource watermarks (ResourceLevel) | Excessively high rate of new SLB connections per second of the Ingress controller; Insufficient number of idle vSwitch IP addresses of control planes |
Versions and certificates (Versions&Certificates) | |
Cluster risks (ClusterRisk) | Automatic DNSConfig injection disabled for NodeLocal DNSCache; Errors in the access control configuration of the SLB instance associated with the API server; Abnormal status of backend servers of the SLB instance associated with the API server; Abnormal status of the SLB instance associated with the API server; Use of rewrite-target annotation without specifying capture groups in NGINX Ingresses |
Insufficient quota on SLB backend servers
Alert description: Check the maximum number of backend servers that can be associated with an SLB instance. The number of ECS instances that can be associated with an SLB instance is limited. If your LoadBalancer Service serves a large number of pods that are distributed across multiple ECS instances and the maximum number of ECS instances that can be associated with an SLB instance is exceeded, the excess ECS instances cannot be associated with the SLB instance.
Solution: By default, you can associate at most 200 backend servers with an SLB instance. To increase the quota, log on to the Quota Center console and submit an application. For more information about the quota limit, see Quotas. For more information about the considerations for configuring load balancing, see Considerations for configuring a LoadBalancer Service.
Insufficient quota on SLB listeners
Alert description: Check the maximum number of listeners that you can add to an SLB instance. The number of listeners that you can add to an SLB instance is limited. A LoadBalancer Service listens on specific ports. Each port corresponds to an SLB listener. If the number of ports on which a LoadBalancer Service listens exceeds the quota, the ports that are not monitored by listeners cannot provide services as expected.
Solution: By default, you can add at most 50 listeners to an SLB instance. To increase the quota, log on to the Quota Center console and submit an application. For more information about the quota limit, see Quotas. For more information about the considerations for configuring load balancing, see Considerations for configuring a LoadBalancer Service.
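Each port that is declared in a LoadBalancer Service corresponds to one SLB listener. The following minimal sketch shows a Service that declares two ports and therefore requires two listeners on the associated SLB instance; the Service name, selector, and ports are placeholders:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: multi-port-demo      # placeholder name
spec:
  type: LoadBalancer
  selector:
    app: demo                # placeholder selector
  ports:
    - name: http             # maps to an SLB listener on port 80
      port: 80
      targetPort: 8080
    - name: https            # maps to a second SLB listener on port 443
      port: 443
      targetPort: 8443
```
If a Service declares more ports than the remaining listener quota allows, the extra ports are left without listeners and cannot serve traffic.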
Insufficient quota on SLB instances
Alert description: Check whether the number of SLB instances that you can still create is less than five. An SLB instance is created for each LoadBalancer Service. When the SLB instance quota is exhausted, newly created LoadBalancer Services cannot work as expected.
Solution: By default, you can have at most 60 SLB instances within each Alibaba Cloud account. To increase the quota, submit an application in the Quota Center console. For more information about the considerations for configuring load balancing, see Considerations for configuring a LoadBalancer Service.
Excessive SLB bandwidth usage
Alert description: Check whether the peak value of the outbound bandwidth usage within the previous three days is higher than 80% of the bandwidth limit. If the bandwidth resources of the SLB instance are exhausted, the SLB instance may drop packets. This causes network jitters or increases the response latency.
Solution: If the bandwidth usage of the SLB instance is excessively high, upgrade the SLB instance. For more information, see Use an existing SLB instance.
Excessive number of SLB connections
Alert description: Check whether the peak value of SLB connections within the previous three days is higher than 80% of the upper limit. If the number of SLB connections reaches the upper limit, new connections cannot be established within a short period of time. Consequently, clients fail to establish connections to the SLB instance.
Solution: If the number of connections that are established to the SLB instance is excessively high and exceeds 80% of the upper limit within the previous three days, upgrade the SLB instance to avoid service interruptions. For more information, see Use an existing SLB instance.
Excessively high rate of new SLB connections per second
Alert description: Check whether the highest rate of new SLB connections per second within the previous three days is higher than 80% of the upper limit. If the rate reaches the upper limit, clients cannot establish new connections to the SLB instance within a short period of time.
Solution: If the rate of new SLB connections per second is excessively high and exceeds 80% of the upper limit, upgrade the SLB instance to avoid service interruptions. For more information, see Use an existing SLB instance.
Excessively high SLB QPS
Alert description: The highest QPS value of the SLB instance within the previous three days is higher than 80% of the upper limit. If the QPS value reaches the upper limit, clients cannot connect to the SLB instance.
Solution: If the QPS value of the SLB instance is excessively high and exceeds 80% of the upper limit, upgrade the SLB instance to avoid service interruptions. For more information, see Use an existing SLB instance.
Insufficient number of available pod CIDR blocks
Alert description: Check whether the number of available pod CIDR blocks in an ACK cluster that has Flannel installed is less than five. Each node in the cluster is assigned a pod CIDR block. If fewer than five pod CIDR blocks are available, fewer than five nodes can still be added to the cluster. If all of the pod CIDR blocks are used, new nodes that you add to the cluster cannot work as expected.
Solution: Submit a ticket.
Insufficient number of idle vSwitch IP addresses
Alert description: Check whether the number of idle vSwitch IP addresses in a cluster that has Terway installed is less than 10. Each pod occupies one vSwitch IP address. If the vSwitch IP addresses are exhausted, new pods cannot be assigned IP addresses and cannot be started as expected.
Solution: Create a vSwitch for the cluster or change the vSwitch that is specified for the cluster. For more information, see What do I do if an ACK cluster in which Terway is installed has insufficient idle vSwitch IP addresses?
Excessively high rate of new SLB connections per second of the Ingress controller
Alert description: The highest rate of new SLB connections per second within the previous three days is higher than 80% of the upper limit. If the rate reaches the upper limit, clients cannot establish new connections to the SLB instance within a short period of time.
Solution: If the rate of new SLB connections per second is excessively high, upgrade the SLB instance to avoid service interruptions. For more information, see Use an existing SLB instance.
Excessively high SLB QPS of the Ingress controller
Alert description: The highest QPS value of the SLB instance within the previous three days is higher than 80% of the upper limit. If the QPS value reaches the upper limit, clients cannot connect to the SLB instance.
Solution: If the QPS value of the SLB instance is excessively high, upgrade the SLB instance to avoid service interruptions. For more information, see Use an existing SLB instance.
Insufficient number of idle vSwitch IP addresses of control planes
Alert description: The number of idle vSwitch IP addresses of control planes is less than 10. Consequently, newly created pods cannot run as expected.
Solution: Submit a ticket.
Outdated Kubernetes version of a cluster
Alert description: The Kubernetes version of a cluster is outdated or will be outdated soon. Container Service for Kubernetes (ACK) clusters can stably run the latest three versions of Kubernetes. Stability issues or update failures may arise in ACK clusters that run an outdated Kubernetes major version. For more information about the release notes of Kubernetes versions supported by ACK, see Support for Kubernetes versions.
Solution: If your cluster runs an outdated Kubernetes major version, update the cluster at the earliest opportunity. For more information, see Update an ACK cluster or update only the control planes or node pools in an ACK cluster.
Outdated CoreDNS version
Alert description: The version of the CoreDNS component that is installed in the cluster is outdated. The latest CoreDNS version provides higher stability and new features.
Solution: To avoid DNS resolution errors, update the CoreDNS component at the earliest opportunity. For more information, see Manually update CoreDNS.
Outdated cluster component version
Alert description: The versions of the key components in the cluster are outdated.
Solution: Update the key components to the latest versions in the ACK console at the earliest opportunity.
Errors in the CoreDNS ConfigMap configuration
Alert description: Check whether the configuration of the CoreDNS ConfigMap contains errors. Configuration errors can cause the CoreDNS component to malfunction.
Solution: Check the configuration of the CoreDNS ConfigMap. For more information, see Best practices for DNS services.
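For reference, the following is a minimal sketch of what a typical CoreDNS Corefile in the coredns ConfigMap in the kube-system namespace looks like. The exact plugins and values in your cluster may differ, so compare it with your inspection report and the DNS best practices rather than copying it verbatim:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors                        # log errors to stdout
        health                        # liveness endpoint
        ready                         # readiness endpoint
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153              # expose metrics
        forward . /etc/resolv.conf    # forward unresolved queries upstream
        cache 30
        loop                          # detect forwarding loops
        reload                        # reload the Corefile when it changes
        loadbalance
    }
```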
CoreDNS deployment errors
Alert description: Check whether CoreDNS is deployed on a master node. If CoreDNS is deployed on a master node, the bandwidth usage of the master node may be excessively high, which affects the control planes.
Solution: Deploy CoreDNS pods on a worker node. For more information, see DNS troubleshooting.
High availability issues in nodes with CoreDNS installed
Alert description: Check how CoreDNS pods are deployed. When all CoreDNS pods are deployed on the same node, a single point of failure may occur. If the node fails or restarts, CoreDNS may fail to provide services, which causes business interruptions.
Solution: Deploy the CoreDNS pods on different nodes. For more information, see DNS troubleshooting.
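One common way to spread CoreDNS pods across nodes is to add pod anti-affinity to the CoreDNS Deployment. The following sketch assumes that the pods carry the usual k8s-app: kube-dns label; verify the labels used in your cluster before applying it:
```yaml
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  k8s-app: kube-dns               # label of the CoreDNS pods (assumption)
              topologyKey: kubernetes.io/hostname  # schedule at most one matching pod per node
```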
No backend DNS servers available for the DNS service
Alert description: Check the number of backend DNS servers that are associated with the DNS service in the cluster. If the number is 0, the DNS service is unavailable.
Solution: Check the status and logs of the CoreDNS pods to troubleshoot DNS issues. For more information, see DNS troubleshooting.
Abnormal ClusterIP of the DNS service
Alert description: Check whether the ClusterIP of the DNS service in the cluster is successfully assigned. An abnormal DNS service can cause exceptions in cluster features and further impact your businesses.
Solution: Check the status and logs of the CoreDNS pods to troubleshoot DNS issues. For more information, see DNS troubleshooting.
Abnormal NAT gateway status in the cluster
Alert description: Check the status of the NAT gateway in the cluster.
Solution: Log on to the NAT Gateway console to check whether the NAT gateway is locked due to overdue payments.
Excessively high packet loss rate in the cluster because the NAT gateway exceeds the maximum number of concurrent sessions
Alert description: Check whether the packet loss rate in the cluster is excessively high because the NAT gateway exceeds the maximum number of concurrent sessions.
Solution: If the packet loss rate in the cluster is excessively high because the NAT gateway exceeds the maximum number of concurrent sessions, you can resolve the issue by upgrading the NAT gateway. For more information, see FAQ about upgrading standard Internet NAT gateways to enhanced Internet NAT gateways.
Automatic DNSConfig injection disabled for NodeLocal DNSCache
Alert description: Check whether the automatic DNSConfig injection is enabled. NodeLocal DNSCache takes effect only when automatic DNSConfig injection is enabled.
Solution: Enable automatic DNSConfig injection. For more information, see Configure NodeLocal DNSCache.
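In ACK, automatic DNSConfig injection is typically enabled per namespace by adding an injection label. The label name in the sketch below follows the Configure NodeLocal DNSCache documentation and should be verified against it; the namespace name is a placeholder:
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: demo                                 # placeholder namespace
  labels:
    node-local-dns-injection: "enabled"      # enables automatic DNSConfig injection (verify the label name)
```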
Errors in the access control configuration of the SLB instance associated with the API server
Alert description: Check whether the access control configuration of the SLB instance associated with the API server allows access from the VPC CIDR block of the cluster and 100.104.0.0/16. If not, the cluster becomes unavailable.
Solution: Modify the access control configuration of the SLB instance associated with the API server.
Abnormal status of backend servers of the SLB instance associated with the API server
Alert description: Check the status of the backend servers of the SLB instance associated with the API server in an ACK dedicated cluster. The backend servers of the SLB instance associated with the API server in an ACK dedicated cluster must include master nodes. Otherwise, traffic forwarding exceptions occur.
Solution: Add master nodes to the forwarding rules of the SLB instance.
Errors in the configuration of the listener that listens on port 6443 for the SLB instance associated with the API server
Alert description: Check the configuration of the listener that listens on port 6443 for the SLB instance associated with the API server. If the configuration contains errors, the cluster is inaccessible.
Solution: Modify the listener configuration to restore it to the configuration that was used when the cluster was created.
The SLB instance associated with the API server does not exist
Alert description: Check whether the SLB instance associated with the API server exists in the cluster. If the SLB instance does not exist, the cluster is unavailable.
Solution: If the SLB instance is accidentally deleted, submit a ticket.
Abnormal status of the SLB instance associated with the API server
Alert description: Check the status of the SLB instance associated with the API server. If the status of the SLB instance is abnormal, the cluster is unavailable.
Solution: Make sure that the status of the SLB instance is normal.
SLB health check failures of the Ingress controller
Alert description: The SLB instance failed health checks within the previous three days. The failures may be caused by high component loads or incorrect component configurations.
Solution: To avoid service interruptions, check whether abnormal events are generated for the Ingress controller Service and whether the component loads are excessively high. For more information about how to troubleshoot the issue, see NGINX Ingress controller troubleshooting.
Low percentage of ready Ingress pods
Alert description: The percentage of ready pods among the pods created for the Ingress Deployment is lower than 100%. This indicates that some Ingress pods cannot be started or fail health checks.
Solution: Use the pod diagnostics feature or refer to the Ingress troubleshooting documentation to identify the pods that are not ready. For more information, see NGINX Ingress controller troubleshooting.
Error logs of the Ingress controller pod
Alert description: The Ingress controller pod generates error logs. This indicates that the Ingress controller does not work as expected.
Solution: Troubleshoot the issues based on the error logs. For more information, see NGINX Ingress controller troubleshooting.
Use of rewrite-target annotation without specifying capture groups in NGINX Ingresses
Alert description: The rewrite-target annotation is specified in the rules of the NGINX Ingress but capture groups are not specified. In Ingress controller 0.22.0 or later, you must specify capture groups if the rewrite-target annotation is configured. Otherwise, traffic forwarding is interrupted.
Solution: Reconfigure the rules of the NGINX Ingress and specify capture groups. For more information, see Advanced NGINX Ingress configurations.
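The following sketch shows a rewrite-target annotation that references the capture group $2 defined in the path regular expression; the host, Service name, and port are placeholders:
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: rewrite-demo
  annotations:
    # $2 refers to the second capture group in the path below
    nginx.ingress.kubernetes.io/rewrite-target: /$2
spec:
  ingressClassName: nginx
  rules:
    - host: demo.example.com
      http:
        paths:
          - path: /app(/|$)(.*)           # capture groups required by rewrite-target
            pathType: ImplementationSpecific
            backend:
              service:
                name: demo-service        # placeholder Service
                port:
                  number: 80
```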
Improper canary release rules of NGINX Ingresses
Alert description: The service-match or service-weight annotation is configured for more than two Services. These annotations support traffic distribution across at most two Services. If more than two Services are configured, the additional Services are ignored and traffic is not forwarded as expected.
Solution: Reduce the number of Services that are referenced by the annotation to two, as shown in the sketch below.
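As a sketch of a valid configuration, the service-weight annotation below distributes traffic between exactly two Services. The Service names, weights, and host are placeholders, and the annotation syntax should be confirmed against the ACK Ingress documentation:
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: canary-demo
  annotations:
    # at most two Services can be referenced here
    nginx.ingress.kubernetes.io/service-weight: "new-nginx: 20, old-nginx: 80"
spec:
  ingressClassName: nginx
  rules:
    - host: demo.example.com
      http:
        paths:
          - path: /
            pathType: ImplementationSpecific
            backend:
              service:
                name: new-nginx       # receives 20% of the traffic
                port:
                  number: 80
          - path: /
            pathType: ImplementationSpecific
            backend:
              service:
                name: old-nginx       # receives 80% of the traffic
                port:
                  number: 80
```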
Incorrect NGINX Ingress annotations
Alert description: The NGINX Ingress uses annotations that start with nginx.com or nginx.org, which belong to the NGINX Ingress controller provided by NGINX Inc., instead of annotations that start with nginx.ingress.kubernetes.io. Annotations that start with nginx.com or nginx.org cannot be recognized by the NGINX Ingress controller. If these annotations are used, the relevant configurations are not applied to the NGINX Ingress controller.
Solution: Use the annotations supported by the NGINX Ingress controller. For more information about NGINX Ingress annotations, see Alibaba Cloud documentation or Community documentation.
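For example, a proxy timeout configured with an nginx.org annotation is ignored, while the equivalent nginx.ingress.kubernetes.io annotation is applied. The annotation below exists in the open source ingress-nginx project, but check the linked annotation references for the exact names supported by your controller version:
```yaml
metadata:
  annotations:
    # Not recognized by the NGINX Ingress controller (NGINX Inc. style annotation):
    # nginx.org/proxy-connect-timeout: "30s"
    # Recognized equivalent:
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "30"
```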
Use of deprecated components
Alert description: Deprecated components are installed in the cluster.
Solution: The alicloud-application-controller component is discontinued. If this or other deprecated components are installed in the cluster, you may fail to update or use the cluster as expected. Uninstall deprecated components at the earliest opportunity. For more information, see Manage components.
One SLB port shared by multiple Services
Alert description: A port of an SLB instance is shared by multiple Services. This causes a Service exception.
Solution: Delete or modify the conflicting Services that share the same SLB port. Make sure that Services that share the same SLB instance use different ports, as shown in the sketch below.
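The sketch below assumes that the service.beta.kubernetes.io/alibaba-cloud-loadbalancer-id annotation is used to attach both Services to an existing SLB instance; the SLB instance ID, Service names, selectors, and ports are placeholders. Each Service uses its own port, so their listeners do not conflict:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: service-a
  annotations:
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-id: "lb-xxxxxxxx"   # placeholder SLB instance ID
spec:
  type: LoadBalancer
  selector:
    app: app-a
  ports:
    - port: 80          # listener port used only by service-a
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: service-b
  annotations:
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-id: "lb-xxxxxxxx"   # same SLB instance
spec:
  type: LoadBalancer
  selector:
    app: app-b
  ports:
    - port: 443         # a different listener port, so the two Services do not conflict
      targetPort: 8443
```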