Container Service for Kubernetes (ACK) supports open source Prometheus and Managed Service for Prometheus. If the predefined Prometheus metrics cannot meet your business requirements, you can use custom PromQL statements to create alert rules to monitor the health of cluster nodes, hosts, replicated pods, and workloads. Alerts are triggered and sent to you when the metric values exceed the thresholds or the specified conditions are met.
Prerequisite
Prometheus monitoring is enabled for the ACK cluster. For more information, see Managed Service for Prometheus (recommended) and Use open source Prometheus to monitor an ACK cluster.
Use custom PromQL statements to configure Prometheus alert rules
By default, ACK clusters are compatible with Managed Service for Prometheus and open source Prometheus. You can use custom PromQL statements to define alert rules for Prometheus monitoring. If the conditions specified in the alert rules are met, the system generates alerts and sends notifications.
Managed Service for Prometheus
To use custom PromQL statements to configure alert rules in Managed Service for Prometheus, see Create an alert rule for a Prometheus instance.
Open source Prometheus
Configure alert notification policies.
Open source Prometheus supports various alert notification methods, including webhook URLs, DingTalk chatbots, and emails. You can set the alert notification method by configuring the receiver parameter in the configuration of the ack-prometheus-operator application, as shown in the sketch below. For more information, see Configure alerts.
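The following is a minimal sketch of a webhook receiver. It assumes that you manage the Alertmanager configuration through the Helm values of ack-prometheus-operator and that the chart exposes this configuration under alertmanager.config, as the upstream prometheus-operator chart does; the receiver name and webhook URL are placeholders, and the structure of your values file may differ.
# Minimal webhook receiver sketch (assumed Helm values structure; adjust to your deployment).
alertmanager:
  config:
    route:
      receiver: webhook-demo                # default receiver for all alerts
    receivers:
    - name: webhook-demo
      webhook_configs:
      - url: http://alert-webhook.example.com/notify   # hypothetical endpoint
        send_resolved: true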
Create an alert rule.
Create a PrometheusRule custom resource to define an alert rule. For more information, see Deploying Prometheus Rules.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    # The labels must be the same as the labels specified in the ruleSelector -> matchLabels parameter of the Prometheus CRD.
    prometheus: example
    role: alert-rules
  name: prometheus-example-rules
spec:
  groups:
  - name: example.rules
    rules:
    - alert: ExampleAlert
      # The expr parameter specifies the PromQL query and trigger condition. For more information, see the PromQL statement column of the alert rule tables in this topic.
      expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 90
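For the rule to be loaded, the labels above must be matched by the ruleSelector of the Prometheus custom resource. The following is a minimal sketch of the relevant fields; the resource name and namespace are illustrative and should be taken from the Prometheus resource deployed by ack-prometheus-operator in your cluster.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example            # illustrative; use the Prometheus resource deployed in your cluster
  namespace: monitoring    # illustrative namespace
spec:
  # matchLabels must be the same as the labels of the PrometheusRule above.
  ruleSelector:
    matchLabels:
      prometheus: example
      role: alert-rules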
Check whether the alert rule takes effect.
Run the following command to map Prometheus in the cluster to local port 9090:
kubectl port-forward svc/ack-prometheus-operator-prometheus 9090:9090 -n monitoring
Enter localhost:9090 into the address bar of your web browser to go to the Prometheus Server console.
In the upper part of the Prometheus Server console, choose Status > Rules. On the Rules page, you can view alert rules. If the alert rule that you created is displayed on the Rules page, the alert rule has taken effect.
Alert rules
Based on the O&M experience of clusters and applications, ACK provides a set of Prometheus alert rules. You can use these rules to identify cluster stability issues, node exceptions, node resource utilization issues, pod errors, workload errors, storage errors, and network errors.
The alert rules are classified into the following severities based on the level of impact caused by pod errors and workload errors:
Critical: Clusters, applications, or business are severely affected. You must troubleshoot the issues immediately.
Warning: Clusters, applications, or business may be affected. You need to troubleshoot the issues at the earliest opportunity.
Normal: Important feature changes are involved.
The portal in the Description column refers to the Alert Rules tab of the Alerts page. To open it, log on to the ACK console and click the name of your cluster on the Clusters page. In the left-side navigation pane, go to the Alerts page. On the Alerts page, click the Alert Rules tab and update the corresponding alert rules.
Pod errors
Alert rule | Severity | PromQL statement | Description | SOP for handling alerts |
Abnormal pod status | Critical | min_over_time(sum by (namespace, pod, phase) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[5m:1m]) > 0 | This rule triggers alerts if abnormal pod status is detected within the last 5 minutes. In the portal, click Alert Rule Set for Pod Exceptions and set the Pod anomaly alert rule. For more information, see Alert management. | For more information about how to troubleshoot abnormal pod status, see Pod troubleshooting. |
Pod launch failures | Critical | sum_over_time(increase(kube_pod_container_status_restarts_total{}[1m])[5m:1m]) > 3 | This rule triggers alerts if the number of pod launch failures exceeds 3 within the last 5 minutes. In the portal, click Alert Rule Set for Pod Exceptions and set the Pod startup failures alert rule. For more information, see Alert management. | For more information about how to troubleshoot pod launch failures, see Pod troubleshooting. |
Over 1,000 pending pods | Critical | sum(sum(max_over_time(kube_pod_status_phase{ phase=~"Pending"}[5m])) by (pod)) > 1000 | This rule triggers alerts if the number of pending pods exceeds 1,000 within the last 5 minutes. | This issue occurs because the specifications of the cluster cannot meet the requirements for scheduling more than 1,000 pods. ACK Pro clusters provide enhanced capabilities for scheduling pods and are covered by SLAs. We recommend that you upgrade the cluster to an ACK Pro cluster. For more information, see Overview of ACK Pro clusters. |
Frequent CPU throttling | Warning | rate(container_cpu_cfs_throttled_seconds_total[3m]) * 100 > 25 | CPU throttling is frequently enforced on pods. This rule triggers alerts if the percentage of throttled CPU time slices within the last 3 minutes exceeds 25%. | CPU throttling limits the CPU time slices that the processes in pods can use. This reduces the running time of the processes in the pods and may slow them down. If this issue occurs, check whether the CPU limit of the pod is set to a small value. To resolve this issue, we recommend that you enable CPU Burst. For more information, see CPU Burst. If your cluster contains multi-core ECS instances, we recommend that you enable topology-aware CPU scheduling to maximize the utilization of CPU fragments. For more information, see Topology-aware CPU scheduling. |
CPU usage of pods higher than 85% | Warning | sum(irate(container_cpu_usage_seconds_total{pod=~"{{PodName}}.*",namespace=~"{{Namespace}}.*",container!="",container!="POD"}[1m])) by (namespace,pod) / sum(container_spec_cpu_quota{pod=~"{{PodName}}.*",namespace=~"{{Namespace}}.*",container!="",container!="POD"}/100000) by (namespace,pod) * 100 <= 100 or on() vector(0) >= 85 | This rule triggers alerts if the CPU usage of a pod or the pods in a namespace exceeds 85% of the pod CPU limit. The value is 0 when no CPU limit is set. The default threshold 85% is a suggested value. You can change the value based on your business requirements. To specify a pod or namespace, replace {{PodName}} and {{Namespace}} in the statement with the actual values. | When the CPU usage of a pod is high, CPU throttling is triggered. As a result, CPU time slices are insufficient, which further affects the processes running in the pod. To avoid this issue, check whether the CPU limit of the pod is set to a proper value. |
Memory usage of pods higher than 85% | Warning | (sum(container_memory_working_set_bytes{pod=~"{{PodName}}.*",namespace=~"{{Namespace}}.*",container!="",container!="POD"}) by (pod,namespace)/ sum(container_spec_memory_limit_bytes{pod=~"{{PodName}}.*",namespace=~"{{Namespace}}.*",container!="",container!="POD"}) by (pod, namespace) * 100) <= 100 or on() vector(0) >= 85 | This rule triggers alerts if the memory usage of a pod exceeds 85% of the pod memory limit. The value is 0 when no memory limit is set. The default threshold 85% is a suggested value. You can change the value based on your business requirements. To specify a pod or namespace, replace {{PodName}} and {{Namespace}} in the statement with the actual values. | When the memory usage of a pod is high, the pod may be killed due to an OOM error. Consequently, the pod is restarted. To avoid this issue, check whether the memory limit of the pod is set to a proper value. |
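The PromQL statements in the preceding table can be used directly as the expr value of a PrometheusRule. The following is a minimal sketch that wraps the abnormal pod status statement; the resource name, group name, duration, and severity label are illustrative, and the labels must still match the ruleSelector of your Prometheus resource.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-status-rules        # illustrative name
  labels:
    prometheus: example
    role: alert-rules
spec:
  groups:
  - name: pod.rules
    rules:
    - alert: PodStatusAbnormal
      # Abnormal pod status detected within the last 5 minutes (statement from the table above).
      expr: min_over_time(sum by (namespace, pod, phase) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[5m:1m]) > 0
      for: 1m                   # illustrative duration
      labels:
        severity: critical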
Workload exceptions
Alert rule | Severity | PromQL statement | Description | SOP for handling alerts |
Deployment pod anomalies | Critical | kube_deployment_spec_replicas{} != kube_deployment_status_replicas_available{} | This rule triggers alerts if the number of replicated pods created by a Deployment is less than the specified value. In the portal, click Alert Rule Set for Workload Exceptions and set the Deployment pod anomaly alert rule. For more information, see Alert management. | Check whether pods that are provisioned by Deployments fail to be launched. |
DaemonSet pod anomaly | Critical | ((100 - kube_daemonset_status_number_ready{} / kube_daemonset_status_desired_number_scheduled{} * 100) or (kube_daemonset_status_desired_number_scheduled{} - kube_daemonset_status_current_number_scheduled{})) > 0 | This rule triggers alerts if the number of replicated pods created by a DaemonSet is less than the specified value. In the portal, click Alert Rule Set for Workload Exceptions and set the DaemonSet pod anomaly alert rule. For more information, see Alert management. | Check whether pods that are provisioned by DaemonSets fail to be launched. |
DaemonSet pod scheduling errors | Critical | kube_daemonset_status_number_misscheduled{} > 0 | This rule triggers alerts if scheduling errors occur on the pods that are provisioned by a DaemonSet. In the portal, click Alert Rule Set for Workload Exceptions and set the DaemonSet pod scheduling errors alert rule. For more information, see Alert management. | Check whether pods that are provisioned by DaemonSets fail to be launched. |
Job execution failures | Critical | kube_job_status_failed{} > 0 | This rule triggers alerts if a Job fails. In the portal, click Alert Rule Set for Workload Exceptions and set the Job execution failures alert rule. For more information, see Alert management. | Check the logs of the pods created by the Job. |
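When you turn the workload statements above into PrometheusRule entries, annotations can reference the labels returned by the query so that notifications identify the affected workload. The following is a minimal sketch for the Deployment statement; the resource name, duration, severity, and annotation text are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: workload-rules          # illustrative name
  labels:
    prometheus: example
    role: alert-rules
spec:
  groups:
  - name: workload.rules
    rules:
    - alert: DeploymentReplicasMismatch
      # Available replicas differ from the desired replicas (statement from the table above).
      expr: kube_deployment_spec_replicas{} != kube_deployment_status_replicas_available{}
      for: 5m                   # illustrative duration to ignore transient rollouts
      labels:
        severity: critical
      annotations:
        summary: "Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has unavailable replicas"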
Storage errors
Alert rule | Severity | PromQL statement | Description | SOP for handling alerts |
PV anomalies | Critical | kube_persistentvolume_status_phase{phase=~"Failed|Pending"} > 0 | This rule triggers alerts if a persistent volume (PV) is in an abnormal state. In the portal, click Alert Rule Set for Storage Exceptions and set the PV anomaly alert rule. For more information, see Alert management. | For more information about how to troubleshoot PV anomalies, see the disk mounting section in FAQ about disk volumes. |
Disk space less than 10% | Critical | ((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes) < 10 | This rule triggers alerts if the free space of a disk is less than 10%. In the portal, click Alert Rule Set for Workload Exceptions and set the Node - Disk usage ≥ 85% alert rule. For more information, see Alert management. | Add nodes and disks. For more information, see the disk mounting section in FAQ about disk volumes. |
Node anomalies
Alert rule | Severity | PromQL statement | Description | SOP for handling alerts |
Node remaining in the NotReady state for 3 minutes | Critical | (sum(max_over_time(kube_node_status_condition{condition="Ready",status="true"}[3m]) <= 0) by (node)) or (absent(kube_node_status_condition{condition="Ready",status="true"})) > 0 | This rule triggers alerts if a node remains in the NotReady state for 3 minutes. In the portal, click Alert Rule Set for Node Exceptions and set the Node changes to the unschedulable state alert rule. For more information, see Alert management. | |
Abnormal resource watermarks of hosts
Host resource metrics and node resource metrics differ in the following way:
The metrics in this table are host resource metrics. They display the resource statistics of the physical machine or virtual machine on which a node resides.
The values are calculated based on the following formula: Resource usage of all processes on the host / Resource capacity of the host.
Alert rule | Severity | PromQL statement | Description | SOP for handling alerts |
Memory usage of the host higher than 85% | Warning | (100 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) >= 85 | This rule triggers alerts if the memory usage of a host exceeds 85%. In the portal, click Alert Rule Set for Resource Exceptions and set the Node - Memory usage ≥ 85% alert rule. For more information, see Alert management. Note Alert rules used in ACK are provided by CloudMonitor. This metric is the same as the metric used in Prometheus alert rules. The default threshold 85% is a suggested value. You can change the value based on your business requirements. | |
Memory usage of a host higher than 90% | Critical | (100 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) >= 90 | This rule triggers alerts if the memory usage of a host exceeds 90%. | |
CPU usage of the host higher than 85% | Warning | 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) >= 85 | This rule triggers alerts if the CPU usage of a host exceeds 85%. In the portal, click Alert Rule Set for Resource Exceptions and set the Node - CPU usage ≥ 85% alert rule. Note Alert rules used in ACK are provided by CloudMonitor (ECS monitoring). This metric is the same as the metric used in Prometheus alert rules. The default threshold 85% is a suggested value. You can change the value based on your business requirements. For more information, see Alert management. | |
CPU usage of the host higher than 90% | Critical | 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) >= 90 | This rule triggers alerts if the CPU usage of a host exceeds 90%. | |
Abnormal resource watermarks of nodes
Host resource metrics and node resource metrics differ in the following way:
The metrics in this table are node resource metrics. They display the statistics of the resources occupied by containers and the allocatable resources on a node. The values are calculated based on the following formula: Resources occupied by containers on the node / Allocatable resources on the node.
Take memory resources as an example:
Occupied resources: the total working set memory of all containers on a node. The working set includes memory allocated to and used by containers and memory used for page caches.
Allocatable resources: the amount of memory that can be allocated on a node, excluding memory occupied by the container engine of the host (ACK reserved node resources). For more information, see Resource reservation policy.
Alert rule | Severity | PromQL statement | Description | SOP for handling alerts |
CPU usage of a node higher than 85% | Warning | sum(irate(container_cpu_usage_seconds_total{pod!=""}[1m])) by (node) / sum(kube_node_status_allocatable{resource="cpu"}) by (node) * 100 >= 85 | This rule triggers alerts if the CPU usage of a node exceeds 85%. Formula: CPU used by all containers on the node / Allocatable CPU on the node. | |
CPU allocation rate of a node higher than 85% | Normal | (sum(sum(kube_pod_container_resource_requests{resource="cpu"}) by (pod, node) * on (pod) group_left max(kube_pod_status_ready) by (pod, node)) by (node)) / sum(kube_node_status_allocatable{resource="cpu"}) by (node) * 100 >= 85 | This rule triggers alerts if the CPU allocation rate of a node exceeds 85%. The CPU allocation rate is calculated based on the following formula: Total CPU requests of the pods on the node / Allocatable CPU on the node. | |
CPU overcommitment rate of a node higher than 300% | Warning | (sum(sum(kube_pod_container_resource_limits{resource="cpu"}) by (pod, node) * on (pod) group_left max(kube_pod_status_ready) by (pod, node)) by (node)) / sum(kube_node_status_allocatable{resource="cpu"}) by (node) * 100 >= 300 | This rule triggers alerts if the CPU overcommitment rate of a node exceeds 300%. The CPU overcommitment rate is calculated based on the following formula: Total CPU limits of the pods on the node / Allocatable CPU on the node. The default threshold 300% is a suggested value. You can modify the value based on your business requirements. | |
Memory usage of a node higher than 85% | Warning | sum(container_memory_working_set_bytes{pod!=""}) by (node) / sum(kube_node_status_allocatable{resource="memory"}) by (node) * 100 >= 85 | This rule triggers alerts if the memory usage of a node exceeds 85%. Formula: Working set memory of all containers on the node / Allocatable memory on the node. | |
Memory allocation rate of a node higher than 85% | Normal | (sum(sum(kube_pod_container_resource_requests{resource="memory"}) by (pod, node) * on (pod) group_left max(kube_pod_status_ready) by (pod, node)) by (node)) / sum(kube_node_status_allocatable{resource="memory"}) by (node) * 100 >= 85 | This rule triggers alerts if the memory allocation rate of a node exceeds 85%. The memory allocation rate is calculated based on the following formula: Total memory requests of the pods on the node / Allocatable memory on the node. | |
Memory overcommitment rate of a node higher than 300% | Warning | (sum(sum(kube_pod_container_resource_limits{resource="memory"}) by (pod, node) * on (pod) group_left max(kube_pod_status_ready) by (pod, node)) by (node)) / sum(kube_node_status_allocatable{resource="memory"}) by (node) * 100 >= 300 | This rule triggers alerts if the memory overcommitment rate of a node exceeds 300%. The memory overcommitment rate is calculated based on the following formula: Total memory limits of the pods on the node / Allocatable memory on the node. The default threshold 300% is a suggested value. You can modify the value based on your business requirements. | |
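The node watermark statements above can also be grouped into a single PrometheusRule. The following is a minimal sketch that contrasts usage (working set of containers divided by allocatable resources) with allocation rate (requests of scheduled pods divided by allocatable resources); the resource name, durations, and severities are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-watermark-rules    # illustrative name
  labels:
    prometheus: example
    role: alert-rules
spec:
  groups:
  - name: node-watermark.rules
    rules:
    - alert: NodeMemoryUsageHigh
      # Usage: working set memory of containers on the node / allocatable memory on the node.
      expr: sum(container_memory_working_set_bytes{pod!=""}) by (node) / sum(kube_node_status_allocatable{resource="memory"}) by (node) * 100 >= 85
      for: 5m                   # illustrative duration
      labels:
        severity: warning
    - alert: NodeMemoryRequestRatioHigh
      # Allocation rate: memory requests of ready pods on the node / allocatable memory on the node.
      expr: (sum(sum(kube_pod_container_resource_requests{resource="memory"}) by (pod, node) * on (pod) group_left max(kube_pod_status_ready) by (pod, node)) by (node)) / sum(kube_node_status_allocatable{resource="memory"}) by (node) * 100 >= 85
      for: 5m                   # illustrative duration
      labels:
        severity: info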
Network errors
Alert rule | Severity | PromQL statement | Description | SOP for handling alerts |
CoreDNS Unavailability - Number of requests drops to 0 | Critical | (sum(rate(coredns_dns_request_count_total{}[1m]))by(server,zone)<=0) or (sum(rate(coredns_dns_requests_total{}[1m]))by(server,zone)<=0) | This rule applies only to ACK managed clusters (ACK Pro clusters and ACK basic clusters). | Check whether CoreDNS pods in the cluster run as expected. |
CoreDNS Unavailability - Panics | Critical | sum(rate(coredns_panic_count_total{}[3m])) > 0 | This rule applies only to ACK managed clusters (ACK Pro clusters and ACK basic clusters). | Check whether CoreDNS pods in the cluster run as expected. If CoreDNS pods in the cluster do not run as expected, submit a ticket. |
Ingress controller certificates about to expire | Warning | ((nginx_ingress_controller_ssl_expire_time_seconds - time()) / 24 / 3600) < 14 | This rule triggers alerts if a certificate used by the Ingress controller expires within 14 days. You must create Ingresses and install the ACK Ingress controller. | Renew or reissue the certificates that are about to expire. |
Scaling issues
Alert rule | Severity | PromQL statement | Description | SOP for handling alerts |
Maximum number of pods in the HPA configuration reached | Warning | max(kube_horizontalpodautoscaler_spec_max_replicas) by (namespace, horizontalpodautoscaler) - max(kube_horizontalpodautoscaler_status_current_replicas) by (namespace, horizontalpodautoscaler) <= 0 | You need to enable the Horizontal Pod Autoscaler (HPA) for this rule to apply. | Check whether the HPA scaling policy meets the requirements. |
References
For more information about how to use the console or API to obtain Prometheus monitoring data, see Use PromQL to query Prometheus monitoring data.
For more information about how to use ACK Net Exporter to identify and troubleshoot container network issues, see Use ACK Net Exporter to troubleshoot network issues.
For more information about how to troubleshoot issues related to Managed Service for Prometheus, see FAQ about observability.