Container Service for Kubernetes (ACK) supports open source Prometheus and Managed Service for Prometheus. If the predefined Prometheus metrics cannot meet your business requirements, you can use custom PromQL statements to create alert rules to monitor the health of cluster nodes, hosts, replicated pods, and workloads. Alerts are triggered and sent to you when the metric values exceed the thresholds or the specified conditions are met.
Prerequisite
Prometheus monitoring is enabled for the ACK cluster. For more information, see Managed Service for Prometheus (recommended) and Use open source Prometheus to monitor an ACK cluster.
Use custom PromQL statements to configure Prometheus alert rules
By default, ACK clusters are compatible with Managed Service for Prometheus and open source Prometheus. You can use custom PromQL statements to define alert rules for Prometheus monitoring. If the conditions specified in the alert rules are met, the system generates alerts and sends notifications.
Managed Service for Prometheus
To use custom PromQL statements to configure alert rules in Managed Service for Prometheus, see Create an alert rule for a Prometheus instance.
Open source Prometheus
Configure alert notification policies.
Open source Prometheus supports various alert notification methods, including webhook URLs, DingTalk chatbots, and emails. You can set the alert notification method by configuring the receiver parameter in the configuration of the ack-prometheus-operator application, as shown in the sketch below. For more information, see Configure alerts.
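The following is a minimal sketch of a webhook receiver. It assumes that you manage the Alertmanager configuration through the Helm values of ack-prometheus-operator and that the chart exposes this configuration under alertmanager.config, as the upstream prometheus-operator chart does; the receiver name and webhook URL are placeholders, and the structure of your values file may differ.
# Minimal webhook receiver sketch (assumed Helm values structure; adjust to your deployment).
alertmanager:
  config:
    route:
      receiver: webhook-demo                # default receiver for all alerts
    receivers:
    - name: webhook-demo
      webhook_configs:
      - url: http://alert-webhook.example.com/notify   # hypothetical endpoint
        send_resolved: true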
Create an alert rule.
Create a PrometheusRule custom resource to define an alert rule. For more information, see Deploying Prometheus Rules.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    # The labels must be the same as the labels specified in the ruleSelector -> matchLabels parameter of the Prometheus CRD.
    prometheus: example
    role: alert-rules
  name: prometheus-example-rules
spec:
  groups:
  - name: example.rules
    rules:
    - alert: ExampleAlert
      # The expr parameter specifies the PromQL query and trigger condition. For more information, see the PromQL statement column of the alert rule tables in this topic.
      expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 90
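For the rule to be loaded, the labels above must be matched by the ruleSelector of the Prometheus custom resource. The following is a minimal sketch of the relevant fields; the resource name and namespace are illustrative and should be taken from the Prometheus resource deployed by ack-prometheus-operator in your cluster.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example            # illustrative; use the Prometheus resource deployed in your cluster
  namespace: monitoring    # illustrative namespace
spec:
  # matchLabels must be the same as the labels of the PrometheusRule above.
  ruleSelector:
    matchLabels:
      prometheus: example
      role: alert-rules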
Check whether the alert rule takes effect.
Run the following command to map Prometheus in the cluster to local port 9090:
kubectl port-forward svc/ack-prometheus-operator-prometheus 9090:9090 -n monitoring
Enter localhost:9090 into the address bar of your web browser to go to the Prometheus Server console.
In the upper part of the Prometheus Server console, choose Status > Rules. On the Rules page, you can view alert rules. If the alert rule that you created is displayed on the Rules page, the alert rule has taken effect.
Alert rules
Based on the O&M experience of clusters and applications, ACK provides a set of Prometheus alert rules. You can use these rules to identify cluster stability issues, node exceptions, node resource utilization issues, pod errors, workload errors, storage errors, and network errors.
The alert rules are classified into the following severities based on the level of impact caused by pod errors and workload errors:
Critical: Clusters, applications, or business are severely affected. You must troubleshoot the issues immediately.
Warning: Clusters, applications, or business may be affected. You need to troubleshoot the issues at the earliest opportunity.
Normal: Important feature changes are involved.
The portal in the Description column refers to the Alert Rules tab of the Alerts page. To open it, log on to the ACK console and click the name of your cluster on the Clusters page. In the left-side navigation pane, go to the Alerts page. On the Alerts page, click the Alert Rules tab and update the corresponding alert rules.
Pod errors
Alert rule | Severity | PromQL statement | Description | SOP for handling alerts |
Abnormal pod status | Critical | min_over_time(sum by (namespace, pod, phase) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[5m:1m]) > 0 | This rule triggers alerts if abnormal pod status is detected within the last 5 minutes. In the portal, click Alert Rule Set for Pod Exceptions and set the Pod anomaly alert rule. For more information, see Alert management. | For more information about how to troubleshoot abnormal pod status, see Pod troubleshooting. |
Pod launch failures | Critical | sum_over_time(increase(kube_pod_container_status_restarts_total{}[1m])[5m:1m]) > 3 | This rule triggers alerts if the number of pod launch failures exceeds 3 within the last 5 minutes. In the portal, click Alert Rule Set for Pod Exceptions and set the Pod startup failures alert rule. For more information, see Alert management. | For more information about how to troubleshoot pod launch failures, see Pod troubleshooting. |
Over 1,000 pending pods | Critical | sum(sum(max_over_time(kube_pod_status_phase{ phase=~"Pending"}[5m])) by (pod)) > 1000 | This rule triggers alerts if the number of pending pods exceeds 1,000 within the last 5 minutes. | This issue occurs because the specifications of the cluster cannot meet the requirements for scheduling more than 1,000 pods. ACK Pro clusters provide enhanced capabilities for scheduling pods and are covered by SLAs. We recommend that you upgrade the cluster to an ACK Pro cluster. For more information, see Overview of ACK Pro clusters. |
Frequent CPU throttling | Warning | rate(container_cpu_cfs_throttled_seconds_total[3m]) * 100 > 25 | CPU throttling is frequently enforced on pods. This rule triggers alerts if the percentage of throttled CPU time slices within the last 3 minutes exceeds 25%. | CPU throttling limits the CPU time slices that the processes in pods can use. This reduces the running time of the processes in the pods and may slow them down. If this issue occurs, check whether the CPU limit of the pod is set to a small value. To resolve this issue, we recommend that you enable CPU Burst. For more information, see CPU Burst. If your cluster contains multi-core ECS instances, we recommend that you enable topology-aware CPU scheduling to maximize the utilization of CPU fragments. For more information, see Topology-aware CPU scheduling. |
CPU usage of pods higher than 85% | Warning | sum(irate(container_cpu_usage_seconds_total{pod=~"{{PodName}}.*",namespace=~"{{Namespace}}.*",container!="",container!="POD"}[1m])) by (namespace,pod) / sum(container_spec_cpu_quota{pod=~"{{PodName}}.*",namespace=~"{{Namespace}}.*",container!="",container!="POD"}/100000) by (namespace,pod) * 100 <= 100 or on() vector(0) >= 85 | This rule triggers alerts if the CPU usage of a pod or the pods in a namespace exceeds 85% of the pod CPU limit. The value is 0 when no CPU limit is set. The default threshold 85% is a suggested value. You can change the value based on your business requirements. To specify a pod or namespace, replace {{PodName}} and {{Namespace}} in the statement with the actual values. | When the CPU usage of a pod is high, CPU throttling is triggered. As a result, CPU time slices are insufficient, which further affects the processes running in the pod. To avoid this issue, check whether the CPU limit of the pod is set to a proper value. |
Memory usage of pods higher than 85% | Warning | (sum(container_memory_working_set_bytes{pod=~"{{PodName}}.*",namespace=~"{{Namespace}}.*",container!="",container!="POD"}) by (pod,namespace)/ sum(container_spec_memory_limit_bytes{pod=~"{{PodName}}.*",namespace=~"{{Namespace}}.*",container!="",container!="POD"}) by (pod, namespace) * 100) <= 100 or on() vector(0) >= 85 | This rule triggers alerts if the memory usage of a pod exceeds 85% of the pod memory limit. The value is 0 when no memory limit is set. The default threshold 85% is a suggested value. You can change the value based on your business requirements. To specify a pod or namespace, replace {{PodName}} and {{Namespace}} in the statement with the actual values. | When the memory usage of a pod is high, the pod may be killed due to an OOM error. Consequently, the pod is restarted. To avoid this issue, check whether the memory limit of the pod is set to a proper value. |
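The PromQL statements in the preceding table can be used directly as the expr value of a PrometheusRule. The following is a minimal sketch that wraps the abnormal pod status statement; the resource name, group name, duration, and severity label are illustrative, and the labels must still match the ruleSelector of your Prometheus resource.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-status-rules        # illustrative name
  labels:
    prometheus: example
    role: alert-rules
spec:
  groups:
  - name: pod.rules
    rules:
    - alert: PodStatusAbnormal
      # Abnormal pod status detected within the last 5 minutes (statement from the table above).
      expr: min_over_time(sum by (namespace, pod, phase) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[5m:1m]) > 0
      for: 1m                   # illustrative duration
      labels:
        severity: critical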
Workload exceptions
Alert rule | Severity | PromQL statement | Description | SOP for handling alerts |
Deployment pod anomalies | Critical | kube_deployment_spec_replicas{} != kube_deployment_status_replicas_available{} | This rule triggers alerts if the number of replicated pods created by a Deployment is less than the specified value. In the portal, click Alert Rule Set for Workload Exceptions and set the Deployment pod anomaly alert rule. For more information, see Alert management. | Check whether pods that are provisioned by Deployments fail to be launched. |
DaemonSet pod anomaly | Critical | ((100 - kube_daemonset_status_number_ready{} / kube_daemonset_status_desired_number_scheduled{} * 100) or (kube_daemonset_status_desired_number_scheduled{} - kube_daemonset_status_current_number_scheduled{})) > 0 | This rule triggers alerts if the number of replicated pods created by a DaemonSet is less than the specified value. In the portal, click Alert Rule Set for Workload Exceptions and set the DaemonSet pod anomaly alert rule. For more information, see Alert management. | Check whether pods that are provisioned by DaemonSets fail to be launched. |
DaemonSet pod scheduling errors | Critical | kube_daemonset_status_number_misscheduled{} > 0 | This rule triggers alerts if scheduling errors occur on the pods that are provisioned by a DaemonSet. In the portal, click Alert Rule Set for Workload Exceptions and set the DaemonSet pod scheduling errors alert rule. For more information, see Alert management. | Check whether pods that are provisioned by DaemonSets fail to be launched. |
Job execution failures | Critical | kube_job_status_failed{} > 0 | This rule triggers alerts if a Job fails. In the portal, click Alert Rule Set for Workload Exceptions and set the Job execution failures alert rule. For more information, see Alert management. | Check the logs of the pods created by the Job. |
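When you turn the workload statements above into PrometheusRule entries, annotations can reference the labels returned by the query so that notifications identify the affected workload. The following is a minimal sketch for the Deployment statement; the resource name, duration, severity, and annotation text are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: workload-rules          # illustrative name
  labels:
    prometheus: example
    role: alert-rules
spec:
  groups:
  - name: workload.rules
    rules:
    - alert: DeploymentReplicasMismatch
      # Available replicas differ from the desired replicas (statement from the table above).
      expr: kube_deployment_spec_replicas{} != kube_deployment_status_replicas_available{}
      for: 5m                   # illustrative duration to ignore transient rollouts
      labels:
        severity: critical
      annotations:
        summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has unavailable replicas"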
Storage errors
Alert rule | Severity | PromQL statement | Description | SOP for handling alerts |
PV anomalies | Critical | kube_persistentvolume_status_phase{phase=~"Failed|Pending"} > 0 | This rule triggers alerts if a persistent volume (PV) is in an abnormal state. In the portal, click Alert Rule Set for Storage Exceptions and set the PV anomaly alert rule. For more information, see Alert management. | For more information about how to troubleshoot PV anomalies, see the disk mounting section in FAQ about disk volumes. |
Disk space less than 10% | Critical | ((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes) < 10 | This rule triggers alerts if the free space of a disk is less than 10%. In the portal, click Alert Rule Set for Workload Exceptions and set the Node - Disk usage ≥ 85% alert rule. For more information, see Alert management. | Add nodes and disks. For more information, see the disk mounting section in FAQ about disk volumes. |
Node anomalies
Alert rule | Severity | PromQL statement | Description | SOP for handling alerts |
Node remaining in the NotReady state for 3 minutes | Critical | (sum(max_over_time(kube_node_status_condition{condition="Ready",status="true"}[3m]) <= 0) by (node)) or (absent(kube_node_status_condition{condition="Ready",status="true"})) > 0 | This rule triggers alerts if a node remains in the NotReady state for 3 minutes. In the portal, click Alert Rule Set for Node Exceptions and set the Node changes to the unschedulable state alert rule. For more information, see Alert management. | |
Abnormal resource watermarks of hosts
Host resource metrics and node resource metrics differ in the following way:
The metrics in this table are host resource metrics. They display the resource statistics of the physical machine or virtual machine on which a node resides.
The values are calculated based on the following formula: Resource usage of all processes on the host / Resource capacity of the host.
Alert rule | Severity | PromQL statement | Description | SOP for handling alerts |
Memory usage of the host higher than 85% | Warning | (100 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) >= 85 | This rule triggers alerts if the memory usage of a host exceeds 85%. In the portal, click Alert Rule Set for Resource Exceptions and set the Node - Memory usage ≥ 85% alert rule. For more information, see Alert management. Note Alert rules used in ACK are provided by CloudMonitor. This metric is the same as the metric used in Prometheus alert rules. The default threshold 85% is a suggested value. You can change the value based on your business requirements. | |
Memory usage of a host higher than 90% | Critical | (100 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) >= 90 | This rule triggers alerts if the memory usage of a host exceeds 90%. | |
CPU usage of the host higher than 85% | Warning | 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) >= 85 | This rule triggers alerts if the CPU usage of a host exceeds 85%. In the portal, click Alert Rule Set for Resource Exceptions and set the Node - CPU usage ≥ 85% alert rule. Note Alert rules used in ACK are provided by CloudMonitor (ECS monitoring). This metric is the same as the metric used in Prometheus alert rules. The default threshold 85% is a suggested value. You can change the value based on your business requirements. For more information, see Alert management. | |
CPU usage of the host higher than 90% | Critical | 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) >= 90 | This rule triggers alerts if the CPU usage of a host exceeds 90%. | |
Abnormal resource watermarks of nodes
Host resource metrics and node resource metrics differ in the following way:
The metrics in this table are node resource metrics. They display the statistics of the resources occupied by containers and the allocatable resources on a node. The values are calculated based on the following formula: Resources occupied by containers on the node / Allocatable resources on the node.
Take memory resources as an example:
Occupied resources: the total working set memory of all containers on a node. The working set includes memory allocated to and used by containers and memory used for page caches.
Allocatable resources: the amount of memory that can be allocated on a node, excluding memory occupied by the container engine of the host (ACK reserved node resources). For more information, see Resource reservation policy.
Alert rule | Severity | PromQL statement | Description | SOP for handling alerts |
CPU usage of a node higher than 85% | Warning | sum(irate(container_cpu_usage_seconds_total{pod!=""}[1m])) by (node) / sum(kube_node_status_allocatable{resource="cpu"}) by (node) * 100 >= 85 | This rule triggers alerts if the CPU usage of a node exceeds 85%. Formula: CPU used by all containers on the node / Allocatable CPU on the node. | |
CPU allocation rate of a node higher than 85% | Normal | (sum(sum(kube_pod_container_resource_requests{resource="cpu"}) by (pod, node) * on (pod) group_left max(kube_pod_status_ready) by (pod, node)) by (node)) / sum(kube_node_status_allocatable{resource="cpu"}) by (node) * 100 >= 85 | This rule triggers alerts if the CPU allocation rate of a node exceeds 85%. The CPU allocation rate is calculated based on the following formula: Total CPU requests of the pods on the node / Allocatable CPU on the node. | |
CPU overcommitment rate of a node higher than 300% | Warning | (sum(sum(kube_pod_container_resource_limits{resource="cpu"}) by (pod, node) * on (pod) group_left max(kube_pod_status_ready) by (pod, node)) by (node)) / sum(kube_node_status_allocatable{resource="cpu"}) by (node) * 100 >= 300 | This rule triggers alerts if the CPU overcommitment rate of a node exceeds 300%. The CPU overcommitment rate is calculated based on the following formula: Total CPU limits of the pods on the node / Allocatable CPU on the node. The default threshold 300% is a suggested value. You can modify the value based on your business requirements. | |
Memory usage of a node higher than 85% | Warning | sum(container_memory_working_set_bytes{pod!=""}) by (node) / sum(kube_node_status_allocatable{resource="memory"}) by (node) * 100 >= 85 | This rule triggers alerts if the memory usage of a node exceeds 85%. Formula: Working set memory of all containers on the node / Allocatable memory on the node. | |
Memory allocation rate of a node higher than 85% | Normal | (sum(sum(kube_pod_container_resource_requests{resource="memory"}) by (pod, node) * on (pod) group_left max(kube_pod_status_ready) by (pod, node)) by (node)) / sum(kube_node_status_allocatable{resource="memory"}) by (node) * 100 >= 85 | This rule triggers alerts if the memory allocation rate of a node exceeds 85%. The memory allocation rate is calculated based on the following formula: Total memory requests of the pods on the node / Allocatable memory on the node. | |
Memory overcommitment rate of a node higher than 300% | Warning | (sum(sum(kube_pod_container_resource_limits{resource="memory"}) by (pod, node) * on (pod) group_left max(kube_pod_status_ready) by (pod, node)) by (node)) / sum(kube_node_status_allocatable{resource="memory"}) by (node) * 100 >= 300 | This rule triggers alerts if the memory overcommitment rate of a node exceeds 300%. The memory overcommitment rate is calculated based on the following formula: Total memory limits of the pods on the node / Allocatable memory on the node. The default threshold 300% is a suggested value. You can modify the value based on your business requirements. | |
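The node watermark statements above can also be grouped into a single PrometheusRule. The following is a minimal sketch that contrasts usage (working set of containers divided by allocatable resources) with allocation rate (requests of scheduled pods divided by allocatable resources); the resource name, durations, and severities are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-watermark-rules    # illustrative name
  labels:
    prometheus: example
    role: alert-rules
spec:
  groups:
  - name: node-watermark.rules
    rules:
    - alert: NodeMemoryUsageHigh
      # Usage: working set memory of containers on the node / allocatable memory on the node.
      expr: sum(container_memory_working_set_bytes{pod!=""}) by (node) / sum(kube_node_status_allocatable{resource="memory"}) by (node) * 100 >= 85
      for: 5m                   # illustrative duration
      labels:
        severity: warning
    - alert: NodeMemoryRequestRatioHigh
      # Allocation rate: memory requests of ready pods on the node / allocatable memory on the node.
      expr: (sum(sum(kube_pod_container_resource_requests{resource="memory"}) by (pod, node) * on (pod) group_left max(kube_pod_status_ready) by (pod, node)) by (node)) / sum(kube_node_status_allocatable{resource="memory"}) by (node) * 100 >= 85
      for: 5m                   # illustrative duration
      labels:
        severity: info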
Network errors
Alert rule | Severity | PromQL statement | Description | SOP for handling alerts |
CoreDNS Unavailability - Number of requests drops to 0 | Critical | (sum(rate(coredns_dns_request_count_total{}[1m]))by(server,zone)<=0) or (sum(rate(coredns_dns_requests_total{}[1m]))by(server,zone)<=0) | This rule applies only to ACK managed clusters (ACK Pro clusters and ACK basic clusters). | Check whether CoreDNS pods in the cluster run as expected. |
CoreDNS Unavailability - Panics | Critical | sum(rate(coredns_panic_count_total{}[3m])) > 0 | This rule applies only to ACK managed clusters (ACK Pro clusters and ACK basic clusters). | Check whether CoreDNS pods in the cluster run as expected. If CoreDNS pods in the cluster do not run as expected, submit a ticket. |
Ingress controller certificates about to expire | Warning | ((nginx_ingress_controller_ssl_expire_time_seconds - time()) / 24 / 3600) < 14 | This rule triggers alerts if a certificate used by the Ingress controller expires within 14 days. You must create Ingresses and install the ACK Ingress controller. | Renew or reissue the certificates that are about to expire. |
Scaling issues
Alert rule | Severity | PromQL statement | Description | SOP for handling alerts |
Maximum number of pods in the HPA configuration reached | Warning | max(kube_horizontalpodautoscaler_spec_max_replicas) by (namespace, horizontalpodautoscaler) - max(kube_horizontalpodautoscaler_status_current_replicas) by (namespace, horizontalpodautoscaler) <= 0 | You need to enable the Horizontal Pod Autoscaler (HPA) for this rule to apply. | Check whether the HPA scaling policy meets the requirements. |
References
For more information about how to use the console or API to obtain Prometheus monitoring data, see Use PromQL to query Prometheus monitoring data.
For more information about how to use ACK Net Exporter to identify and troubleshoot container network issues, see Use ACK Net Exporter to troubleshoot network issues.
For more information about how to troubleshoot issues related to Managed Service for Prometheus, see FAQ about observability.