Alibaba Cloud Container Compute Service (ACS) supports the periodic inspection feature provided by Container Intelligence Service (CIS). You can configure inspection rules to periodically inspect your cluster and identify potential risks. This topic describes the alerts that are generated by the cluster inspection feature for common issues and the solutions to these issues.
Check items and alerts
For more information about how to use the cluster inspection feature, see Work with the cluster inspection feature.
The check items may vary based on the configuration of your cluster. The check items shown in the inspection report take precedence.
| Check item | Inspection item | Alert |
| --- | --- | --- |
| Resource quotas (ResourceQuotas) | | |
| | Quota on SLB backend servers | |
| | Quota on SLB listeners | |
| Resource watermarks (ResourceLevel) | SLB bandwidth usage | |
| | Number of SLB connections | |
| Versions and certificates (Versions&Certificates) | Kubernetes version of a cluster | |
| Cluster risks (ClusterRisk) | Whether an SLB instance is associated with the API server | |
| | Status of the SLB instance associated with the API server | Abnormal status of the SLB instance associated with the API server |
| | Configuration of the listener that listens on port 6443 for the SLB instance associated with the API server | |
| | Access control configuration of the SLB instance associated with the API server | Errors in the access control configuration of the SLB instance associated with the API server |
| | Cluster IP address of the DNS service | |
| | Endpoints of the DNS service | |
| | Whether one SLB port is shared by multiple Services | |
Insufficient quota on SLB instances in a VPC
Alert description: The number of SLB instances that you can create in the cluster VPC is less than five. Each LoadBalancer Service in an ACK cluster occupies one SLB instance. If the quota is exhausted, the new LoadBalancer Services that you create cannot work as expected.
Solution: By default, you can have at most 60 SLB instances within each Alibaba Cloud account. To increase the quota, log on to the Quota Center console and submit an application. For more information, see Quotas.
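The alert condition above amounts to a headroom check: the alert fires when fewer than five SLB instances can still be created. The following Python sketch illustrates that rule only. The function name and parameters are hypothetical, and it assumes that you already know how many SLB instances are in use. It is not the implementation used by the inspection feature.

```python
def slb_quota_alert(used: int, quota: int = 60, min_headroom: int = 5) -> bool:
    """Return True when fewer than `min_headroom` SLB instances can still be created.

    `quota` defaults to 60, the default per-account limit stated in this topic.
    `used` is the number of SLB instances that already exist (supplied by the caller).
    """
    if used < 0 or quota < 0:
        raise ValueError("used and quota must be non-negative")
    remaining = max(quota - used, 0)
    return remaining < min_headroom


if __name__ == "__main__":
    print(slb_quota_alert(used=56))  # 4 instances remaining -> True
    print(slb_quota_alert(used=55))  # 5 instances remaining -> False
```

If the function returns True, request a quota increase before you create more LoadBalancer Services.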
Insufficient quota on SLB backend servers
Alert description: The quota on the number of ECS instances that can be associated with an SLB instance is insufficient. If you create a large number of LoadBalancer Services, the backend pods are distributed across multiple ECS instances. If the quota on ECS instances that can be associated with an SLB instance is exhausted, you cannot associate ECS instances with the SLB instance.
Solution: By default, you can associate at most 200 backend servers with an SLB instance. To increase the quota, log on to the Quota Center console and submit an application. For more information, see Quotas.
Insufficient quota on SLB listeners
Alert description: The quota on the number of listeners that you can add to an SLB instance is insufficient. A LoadBalancer Service listens on specific ports. Each port corresponds to an SLB listener. If the number of ports on which a LoadBalancer Service listens exceeds the quota, the ports that are not monitored by listeners cannot provide services as expected.
Solution: By default, you can add at most 50 listeners to an SLB instance. To increase the quota, log on to the Quota Center console and submit an application. For more information, see Quotas.
Insufficient quota on SLB instances
Alert description: The number of SLB instances that you can create is less than five. An SLB instance is created for each LoadBalancer Service. When the SLB instance quota is exhausted, newly created LoadBalancer Services cannot work as expected.
Solution: By default, you can have at most 60 SLB instances within each Alibaba Cloud account. To increase the quota, submit an application in the Quota Center console.
Excessive SLB bandwidth usage
Alert description: The maximum outbound bandwidth in the previous three days is higher than 80% of the bandwidth limit. If the bandwidth resources of the SLB instance are exhausted, the SLB instance may drop packets. This causes network jitters or increases the response latency.
Solution: If the bandwidth usage of the SLB instance is excessively high, upgrade the SLB instance. For more information, see Use an existing SLB instance.
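The watermark alerts in this topic share one rule: the peak value observed in the previous three days is compared with 80% of the upper limit. The following Python sketch shows that comparison. The function name, the sample values, and the way the samples are collected are assumptions made for illustration. The inspection feature may compute the peak differently.

```python
from typing import Iterable


def exceeds_watermark(samples: Iterable[float], limit: float, ratio: float = 0.8) -> bool:
    """Return True if the peak sample is higher than `ratio` * `limit`.

    `samples` holds the metric values collected over the inspection window
    (for example, outbound bandwidth in Mbit/s over three days).
    `limit` is the upper limit of the SLB instance for the same metric.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    peak = max(samples, default=0.0)  # an empty window counts as no usage
    return peak > ratio * limit


if __name__ == "__main__":
    # Hypothetical samples against a 100 Mbit/s bandwidth limit.
    print(exceeds_watermark([42.0, 61.5, 85.0], limit=100.0))  # True
    print(exceeds_watermark([42.0, 61.5, 79.0], limit=100.0))  # False
```

The same function applies to the connection count, new connection rate, and QPS checks if you pass the matching limit.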
Excessive number of SLB connections
Alert description: The peak value of SLB connections within the previous three days is higher than 80% of the upper limit. If the number of SLB connections reaches the upper limit, clients cannot establish connections to the SLB instance.
Solution: If the peak number of connections to the SLB instance exceeds 80% of the upper limit within the previous three days, upgrade the SLB instance to avoid service interruptions. For more information, see Use an existing SLB instance.
Excessively high rate of new SLB connections
Alert description: The highest rate of new SLB connections within the previous three days is higher than 80% of the upper limit. If the rate reaches the upper limit, clients cannot establish new connections to the SLB instance within a short period of time.
Solution: If the rate of new SLB connections is excessively high and exceeds 80% of the upper limit, upgrade the SLB instance to avoid service interruptions. For more information, see Use an existing SLB instance.
Excessively high SLB QPS
Alert description: The highest QPS value of the SLB instance within the previous three days is higher than 80% of the upper limit. If the QPS value reaches the upper limit, clients cannot connect to the SLB instance.
Solution: If the QPS value of the SLB instance is excessively high and exceeds 80% of the upper limit, upgrade the SLB instance to avoid service interruptions. For more information, see Use an existing SLB instance.
Rate of new SLB connections check
Alert description: Check whether the highest rate of new SLB connections within the previous three days is higher than 80% of the upper limit. If the rate reaches the upper limit, clients cannot establish new connections to the SLB instance within a short period of time.
Solution: If the rate of new SLB connections is excessively high, upgrade the SLB instance to avoid service interruptions. For more information, see Use an existing SLB instance.
Whether an SLB instance is associated with the API server
Alert description: Check whether the SLB instance associated with the API server exists in the cluster. If no SLB instance exists and only one API server is running, the API server becomes a single point of failure (SPOF). If this API server fails, the cluster fails.
Solution: Associate an SLB instance with the API server.
Abnormal status of the SLB instance associated with the API server
Alert description: The SLB instance associated with the API server is in an abnormal state. As a result, cluster operations such as pod scheduling, service deployment, and scale-out may be interrupted or delayed. The service discovery mechanism of the cluster depends on the API server. If the SLB instance is abnormal, service discovery may fail.
Solution: Check whether the configurations of the SLB instance are correct, including the configurations of backend servers, listening ports, and health checks.
Errors in the configuration of the listener that listens on port 6443 for the SLB instance associated with the API server
Alert description: Errors exist in the configurations of the listener that listens on port 6443 for the SLB instance associated with the API server. All requests to access the API server through SLB fail, including kubectl operations, dashboard access, and API requests from other services. Services in the cluster may fail to resolve other services by service names because resolution requests also depend on the API server.
Solution: Check whether the configurations of the SLB instance are correct, including the configurations of backend servers, listening ports, and health checks. Verify that an HTTPS listener that listens on port 6443 is configured.
Errors in the access control configurations of the SLB instance associated with the API server
Alert description: Errors exist in the access control configurations of the SLB instance associated with the API server. Cluster management operations, such as node management, pod scheduling, and service deployment, are interrupted or limited. Services in a cluster depend on the API server for communication and service discovery. Access control exceptions cause requests that depend on the API server to fail.
Solution: Check the access control configurations of the SLB instance, including security groups and access control lists (ACLs). Verify that valid IP addresses and ports are allowed to access the API server, especially port 6443. Verify that the TLS/SSL configurations of the SLB instance and API server are correct and that the certificate is valid.
Abnormal cluster IP address of the DNS service
Alert description: Check whether the cluster IP address of the DNS service is assigned. An abnormal DNS service can cause exceptions in cluster features and further impact your business.
Solution: Check the network plug-ins and configurations of the Kubernetes cluster for conflicts or errors. Redeploy the DNS service, such as CoreDNS, to ensure that the service is properly configured and the cluster IP address is assigned.
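To confirm whether a cluster IP address is assigned, you can inspect the `spec.clusterIP` field of the DNS Service. The following Python sketch assumes that the Service manifest was exported as JSON, for example with `kubectl get service kube-dns -n kube-system -o json`. The Service name `kube-dns` and the namespace are assumptions that may differ in your cluster, and the helper name is hypothetical.

```python
def has_cluster_ip(service: dict) -> bool:
    """Return True if the Service manifest carries an assigned cluster IP address.

    A missing or empty `spec.clusterIP`, or the literal string "None"
    (used by headless Services), is treated as not assigned.
    """
    cluster_ip = service.get("spec", {}).get("clusterIP")
    return bool(cluster_ip) and cluster_ip != "None"


if __name__ == "__main__":
    # Hypothetical manifests, trimmed to the fields that matter here.
    assigned = {"metadata": {"name": "kube-dns"}, "spec": {"clusterIP": "172.21.0.10"}}
    headless = {"metadata": {"name": "kube-dns"}, "spec": {"clusterIP": "None"}}
    print(has_cluster_ip(assigned))  # True
    print(has_cluster_ip(headless))  # False
```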
No backend servers available for the DNS service
Alert description: Check the number of backend servers that are associated with the DNS service in the cluster. If the number is 0, the DNS service is unavailable.
Solution: Check whether the Corefile is correctly configured. Make sure that the forward or proxy directive points to a valid set of backend DNS servers.
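The alert condition is that the DNS service has zero backends. One way to reproduce that count is to read the Endpoints object of the DNS Service and add up the ready addresses. The following Python sketch assumes that the object was exported as JSON, for example with `kubectl get endpoints kube-dns -n kube-system -o json`. The name `kube-dns` is an assumption, and the helper name is hypothetical.

```python
def count_ready_backends(endpoints: dict) -> int:
    """Count the ready addresses across all subsets of an Endpoints object.

    Addresses listed under `notReadyAddresses` are deliberately ignored,
    because they do not receive traffic.
    """
    subsets = endpoints.get("subsets") or []
    return sum(len(subset.get("addresses") or []) for subset in subsets)


if __name__ == "__main__":
    # Hypothetical Endpoints object with two ready CoreDNS pods and one unready pod.
    sample = {
        "subsets": [
            {
                "addresses": [{"ip": "10.0.0.11"}, {"ip": "10.0.0.12"}],
                "notReadyAddresses": [{"ip": "10.0.0.13"}],
            }
        ]
    }
    print(count_ready_backends(sample))  # 2
    print(count_ready_backends({}))      # 0 -> the DNS service is unavailable
```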
SLB QPS check
Alert description: Check whether the highest QPS value of the SLB instance within the previous three days is higher than 80% of the upper limit. If the QPS value reaches the upper limit, clients cannot connect to the SLB instance.
Solution: If the QPS value of the SLB instance is excessively high, upgrade the SLB instance to avoid service interruptions. For more information, see Use an existing SLB instance.
Outdated Kubernetes version of a cluster
Alert description: The Kubernetes version of a cluster is outdated or will become outdated soon.
Solution: If your cluster runs an outdated Kubernetes major version, update the cluster at the earliest opportunity.
One SLB port shared by multiple Services
Alert description: A port of an SLB instance is shared by multiple Services. This causes a Service exception.
Solution: Delete or modify the conflicting Services that share the same SLB port. Make sure that they use different ports when they share the same SLB instance.
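To find the conflicting Services, group them by SLB instance and port, and report every group with more than one member. The following Python sketch shows that grouping. The input format (tuples of Service name, SLB instance ID, and port) and all names in the example are hypothetical. How you collect these bindings from your cluster is outside the scope of the sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Binding = Tuple[str, str, int]  # (service_name, slb_instance_id, port)


def find_shared_slb_ports(bindings: Iterable[Binding]) -> Dict[Tuple[str, int], List[str]]:
    """Return every (slb_instance_id, port) pair that is used by more than one Service."""
    users = defaultdict(set)
    for service, slb_id, port in bindings:
        users[(slb_id, port)].add(service)
    return {key: sorted(names) for key, names in users.items() if len(names) > 1}


if __name__ == "__main__":
    bindings = [
        ("web", "lb-a", 80),
        ("api", "lb-a", 80),   # conflicts with "web" on lb-a:80
        ("api", "lb-a", 443),
        ("db", "lb-b", 80),    # same port, different SLB instance: no conflict
    ]
    print(find_shared_slb_ports(bindings))  # {('lb-a', 80): ['api', 'web']}
```

Each key in the result identifies a port that must be freed by deleting or modifying one of the listed Services.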