
Container Service for Kubernetes: Best practices for configuring alert rules in Prometheus

Last Updated: Sep 10, 2024

Container Service for Kubernetes (ACK) supports open source Prometheus and Managed Service for Prometheus. If the predefined Prometheus metrics cannot meet your business requirements, you can use custom PromQL statements to create alert rules that monitor the health of cluster nodes, hosts, replicated pods, and workloads. Alerts are triggered and sent to you when metric values exceed the thresholds or the specified conditions are met.

Prerequisite

Prometheus monitoring is enabled for the ACK cluster. For more information, see Managed Service for Prometheus (recommended) and Use open source Prometheus to monitor an ACK cluster.

Use custom PromQL statements to configure Prometheus alert rules

By default, ACK clusters are compatible with Managed Service for Prometheus and open source Prometheus. You can use custom PromQL statements to define alert rules for Prometheus monitoring. If the conditions specified in the alert rules are met, the system generates alerts and sends notifications.

Managed Service for Prometheus

To use custom PromQL statements to configure alert rules in Managed Service for Prometheus, see Create an alert rule for a Prometheus instance.

Open source Prometheus

  1. Configure alert notification policies.

    Open source Prometheus supports various alert notification methods, including webhook URLs, DingTalk chatbots, and emails. You can specify the notification method by configuring the receiver parameter in the configuration of the ack-prometheus-operator application. For more information, see Configure alerts.
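
    The following snippet is a minimal sketch of a webhook receiver in the Alertmanager configuration that ack-prometheus-operator uses. The receiver name and webhook URL are placeholders, and the exact location of this block in the chart values may differ; adapt it to your environment.

    # Sketch of an Alertmanager configuration: route all alerts to a single webhook receiver.
    # The receiver name and URL are examples only. Replace them with your own values.
    route:
      receiver: "webhook"
      group_by: ["alertname", "namespace"]
    receivers:
      - name: "webhook"
        webhook_configs:
          - url: "https://example.com/prometheus-alerts"   # Your webhook endpoint.
            send_resolved: true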

  2. Create an alert rule.

    Create a PrometheusRule CustomResourceDefinition (CRD) to define an alert rule. For more information, see Deploying Prometheus Rules.

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      labels:
        # The labels must be the same as the labels specified in the ruleSelector.matchLabels field of the Prometheus CRD.
        prometheus: example
        role: alert-rules
      name: prometheus-example-rules
    spec:
      groups:
      - name: example.rules
        rules:
        - alert: ExampleAlert
          # The expr field specifies the PromQL query and the trigger condition. For more information, see the PromQL statement column in the alert rule tables in this topic.
          expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 90
  3. Check whether the alert rule takes effect.

    1. Run the following command to map Prometheus in the cluster to local port 9090:

      kubectl port-forward svc/ack-prometheus-operator-prometheus 9090:9090 -n monitoring
    2. Enter localhost:9090 into the address bar of your web browser to go to the Prometheus Server console.

    3. In the upper part of the Prometheus Server console, choose Status > Rules.

      On the Rules page, you can view alert rules. If the alert rule that you created is displayed on the Rules page, the alert rule has taken effect.

Alert rules

Based on the O&M experience of clusters and applications, ACK provides a set of Prometheus alert rules. You can use these rules to identify cluster stability issues, node exceptions, node resource utilization issues, pod errors, workload errors, storage errors, and network errors.

The alert rules are classified into the following severities based on the level of impact caused by pod errors and workload errors:

  • Critical: Clusters, applications, and even your business are affected. You need to troubleshoot the issues immediately.

  • Warning: Clusters, applications, and even your business may be affected. You need to troubleshoot the issues at the earliest opportunity.

  • Normal: Important feature changes are involved.

Note

The portal mentioned in the SOP for handling alerts column refers to the Alert Rules tab of the Alerts page. You can log on to the ACK console and click the name of your cluster on the Clusters page. In the left-side navigation pane, choose Operations > Alerts. On the Alerts page, click the Alert Rules tab and update the corresponding alert rules.

Pod errors

Alert rule

Severity

PromQL statement

Description

SOP for handling alerts

Abnormal pod status

Critical

min_over_time(sum by (namespace, pod, phase) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[5m:1m]) > 0

This rule triggers alerts if abnormal pod status is detected within the last 5 minutes.

In the portal, click Alert Rule Set for Pod Exceptions and set the Pod anomaly alert rule. For more information, see Alert management.

For more information about how to troubleshoot abnormal pod status, see Pod troubleshooting.
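
To use a row of these tables with open source Prometheus, you can wrap its PromQL statement in a PrometheusRule, as described in the preceding section. The following sketch does this for the abnormal pod status rule; the metadata, group name, severity label, and annotation are illustrative, and only the expr is taken from this row.

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      labels:
        prometheus: example
        role: alert-rules
      name: pod-status-alert-rules
    spec:
      groups:
      - name: pod-status.rules
        rules:
        - alert: PodStatusAbnormal
          # The expr is copied from the PromQL statement column of this row.
          expr: min_over_time(sum by (namespace, pod, phase) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[5m:1m]) > 0
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has been in an abnormal state for the last 5 minutes."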

Pod launch failures

Critical

sum_over_time(increase(kube_pod_container_status_restarts_total{}[1m])[5m:1m]) > 3

This rule triggers alerts if the number of pod launch failures exceeds 3 within the last 5 minutes.

In the portal, click Alert Rule Set for Pod Exceptions and set the Pod startup failures alert rule. For more information, see Alert management.

For more information about how to troubleshoot pod launch failures, see Pod troubleshooting.

Over 1,000 pending pods

Critical

sum(sum(max_over_time(kube_pod_status_phase{phase=~"Pending"}[5m])) by (pod)) > 1000

This rule triggers alerts if the number of pending pods exceeds 1,000 within the last 5 minutes.

This issue occurs because the specifications of the cluster cannot meet the requirements for scheduling more than 1,000 pods. ACK Pro clusters provide enhanced capabilities for scheduling pods and are covered by SLAs. We recommend that you upgrade the cluster to an ACK Pro cluster. For more information, see Overview of ACK Pro clusters.

Frequent CPU throttling

Warning

rate(container_cpu_cfs_throttled_seconds_total[3m]) * 100 > 25

CPU throttling is frequently enforced on pods. This rule triggers alerts if the percentage of throttled CPU time slices within the last 3 minutes exceeds 25%.

CPU throttling limits the CPU time slices that the processes in a pod can use. This reduces the running time of the processes in the pod and may slow them down.

If this issue occurs, check whether the CPU limit of the pod is set to an overly small value. To resolve this issue, we recommend that you enable CPU Burst. For more information, see CPU Burst. If your cluster contains multi-core ECS instances, we recommend that you enable topology-aware CPU scheduling to maximize the utilization of CPU fragments. For more information, see Topology-aware CPU scheduling.

CPU usage of pods higher than 85%

Warning

sum(irate(container_cpu_usage_seconds_total{pod=~"{{PodName}}.*",namespace=~"{{Namespace}}.*",container!="",container!="POD"}[1m])) by (namespace,pod) / sum(container_spec_cpu_quota{pod=~"{{PodName}}.*",namespace=~"{{Namespace}}.*",container!="",container!="POD"}/100000) by (namespace,pod) * 100 <= 100 or on() vector(0) >= 85

This rule triggers alerts if the CPU usage of a pod or the pods in a namespace exceeds 85% of the pod CPU limit.

If no CPU limit is set, the expression evaluates to 0 and no alert is triggered.

The default threshold 85% is a suggested value. You can change the value based on your business requirements.

To specify a pod or namespace, replace pod=~"{{PodName}}.*",namespace=~"{{Namespace}}.*" with the actual values. To query all pods in the current cluster, delete the filter condition.
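
For example, the following PrometheusRule sketch replaces the placeholders with a hypothetical namespace (demo) and pod name prefix (nginx); both names and the severity label are illustrative, and the expr otherwise matches the statement in this row.

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      labels:
        prometheus: example
        role: alert-rules
      name: pod-cpu-usage-rules
    spec:
      groups:
      - name: pod-cpu.rules
        rules:
        - alert: PodCpuUsageHigh
          # {{PodName}} and {{Namespace}} are replaced with the example values "nginx" and "demo".
          expr: sum(irate(container_cpu_usage_seconds_total{pod=~"nginx.*",namespace=~"demo.*",container!="",container!="POD"}[1m])) by (namespace,pod) / sum(container_spec_cpu_quota{pod=~"nginx.*",namespace=~"demo.*",container!="",container!="POD"}/100000) by (namespace,pod) * 100 <= 100 or on() vector(0) >= 85
          labels:
            severity: warning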

When the CPU usage of a pod is high, CPU throttling is triggered. As a result, CPU time slices are insufficient, which further affects the processes running in the pod.

To avoid this issue, check whether the CPU resource limit of the pod is set to an overly small value. We recommend that you use CPU Burst to avoid CPU throttling. For more information, see CPU Burst. If your cluster contains multi-core ECS instances, we recommend that you enable topology-aware CPU scheduling to maximize the utilization of CPU fragments. For more information, see Topology-aware CPU scheduling.

Memory usage of pods higher than 85%

Warning

(sum(container_memory_working_set_bytes{pod=~"{{PodName}}.*",namespace=~"{{Namespace}}.*",container!="",container!="POD"}) by (pod,namespace) / sum(container_spec_memory_limit_bytes{pod=~"{{PodName}}.*",namespace=~"{{Namespace}}.*",container!="",container!="POD"}) by (pod, namespace) * 100) <= 100 or on() vector(0) >= 85

This rule triggers alerts if the memory usage of a pod exceeds 85% of the pod memory limit.

If no memory limit is set, the expression evaluates to 0 and no alert is triggered.

The default threshold 85% is a suggested value. You can change the value based on your business requirements.

To specify a pod or namespace, replace pod=~"{{PodName}}.*",namespace=~"{{Namespace}}.*" with the actual values. To query all pods in the current cluster, delete the filter condition.

When the memory usage of a pod is high, the pod may be killed due to an OOM error. Consequently, the pod is restarted.

To avoid this issue, check whether the memory resource limit of the pod is set to a small value. We recommend that you use the resource profiling feature to configure the memory limit. For more information, see Resource profiling.

Workload exceptions

Alert rule

Severity

PromQL statement

Description

SOP for handling alerts

Deployment pod anomalies

Critical

kube_deployment_spec_replicas{} != kube_deployment_status_replicas_available{}

This rule triggers alerts if the number of available replicated pods created by a Deployment is less than the specified number of replicas.

In the portal, click Alert Rule Set for Workload Exceptions and set the Deployment pod anomaly alert rule. For more information, see Alert management.

Check whether pods that are provisioned by Deployments fail to be launched.

  • If the pods fail to be launched or are in an abnormal state, troubleshoot the pods. For more information, see Pod troubleshooting.

  • If the pods are launched and run as expected but the alert persists, submit a ticket, provide the cluster ID, and describe the issue to the technical support.

DaemonSet pod anomaly

Critical

((100 - kube_daemonset_status_number_ready{} / kube_daemonset_status_desired_number_scheduled{} * 100) or (kube_daemonset_status_desired_number_scheduled{} - kube_daemonset_status_current_number_scheduled{})) > 0

This rule triggers alerts if the number of ready pods created by a DaemonSet is less than the desired number.

In the portal, click Alert Rule Set for Workload Exceptions and set the DaemonSet pod anomaly alert rule. For more information, see Alert management.

Check whether pods that are provisioned by DaemonSets fail to be launched.

  • If the pods fail to be launched or are in an abnormal state, troubleshoot the pods. For more information, see Pod troubleshooting.

  • If the pods are launched and run as expected but the alert persists, submit a ticket, provide the cluster ID, and describe the issue to the technical support.

DaemonSet pod scheduling errors

Critical

kube_daemonset_status_number_misscheduled{} > 0

This rule triggers alerts if scheduling errors occur on the pods that are provisioned by a DaemonSet.

In the portal, click Alert Rule Set for Workload Exceptions and set the DaemonSet pod scheduling errors alert rule. For more information, see Alert management.

Check whether pods that are provisioned by DaemonSets fail to be launched.

  • If the pods fail to be launched or are in an abnormal state, troubleshoot the pods. For more information, see Pod troubleshooting.

  • If the pods are launched and run as expected but the alert persists, submit a ticket, provide the cluster ID, and describe the issue to the technical support.

Job execution failures

Critical

kube_job_status_failed{} > 0

This rule triggers alerts if a Job fails.

In the portal, click Alert Rule Set for Workload Exceptions and set the Job execution failures alert rule. For more information, see Alert management.

Check the logs of the pods created by the Job.

  • If the pods fail to be launched or are in an abnormal state, troubleshoot the pods. For more information, see Pod troubleshooting.

  • If the pods run as expected but the Job still fails, submit a ticket, provide the cluster ID, and describe the issue to the technical support.

Storage errors

Alert rule

Severity

PromQL statement

Description

SOP for handling alerts

PV anomalies

Critical

kube_persistentvolume_status_phase{phase=~"Failed|Pending"} > 0

This rule triggers alerts if a persistent volume (PV) is in an abnormal state.

In the portal, click Alert Rule Set for Storage Exceptions and set the PV anomaly alert rule. For more information, see Alert management.

For more information about how to troubleshoot PV anomalies, see the disk mounting section in FAQ about disk volumes.

Disk space less than 10%

Critical

((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes) < 10

This rule triggers alerts if the free space of a disk is less than 10%.

In the portal, click Alert Rule Set for Resource Exceptions and set the Node - Disk usage ≥ 85% alert rule. For more information, see Alert management.

Add nodes and disks. For more information, see the disk mounting section in FAQ about disk volumes.

Node anomalies

Alert rule

Severity

PromQL statement

Description

SOP for handling alerts

Node remaining in the NotReady state for 3 minutes

Critical

(sum(max_over_time(kube_node_status_condition{condition="Ready",status="true"}[3m]) <= 0) by (node)) or (absent(kube_node_status_condition{condition="Ready",status="true"})) > 0

This rule triggers alerts if a node remains in the NotReady state for 3 minutes.

In the portal, click Alert Rule Set for Node Exceptions and set the Node changes to the unschedulable state alert rule. For more information, see Alert management.

  • Check whether the node is being replaced, removed, or manually set to unavailable.

    If the node is not in any of the preceding states, you must evict the pods on the node to avoid business interruptions.

  • Check the node conditions to identify the cause. For example, check whether the memory resources or disk space on the node is insufficient.
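
For open source Prometheus, this check can also be expressed as a PrometheusRule, as in the following sketch. The metadata, group name, severity label, and annotation are illustrative; the expr is copied from this row.

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      labels:
        prometheus: example
        role: alert-rules
      name: node-status-alert-rules
    spec:
      groups:
      - name: node-status.rules
        rules:
        - alert: NodeNotReady
          # The expr is copied from the PromQL statement column of this row.
          expr: (sum(max_over_time(kube_node_status_condition{condition="Ready",status="true"}[3m]) <= 0) by (node)) or (absent(kube_node_status_condition{condition="Ready",status="true"})) > 0
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.node }} has been NotReady for 3 minutes."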

Abnormal resource watermarks of hosts

Note

Host resource metrics and node resource metrics have the following differences:

  • This metric is a host resource metric. It displays the resource statistics of the physical machine or virtual machine where a node resides.

  • The value of the metric is calculated based on the following formula: Resource usage of all processes on the host / Resource capacity of the host

Alert rule

Severity

PromQL statement

Description

SOP for handling alerts

Memory usage of the host higher than 85%

Warning

(100 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) >= 85

This rule triggers alerts if the memory usage of a host exceeds 85%.

In the portal, click Alert Rule Set for Resource Exceptions and set the Node - Memory usage ≥ 85% alert rule. For more information, see Alert management.

Note

The alert rules used in ACK are provided by CloudMonitor. The CloudMonitor metric is the same as the metric used in this Prometheus alert rule.

The default threshold 85% is a suggested value. You can change the value based on your business requirements.

  • Release resources.

    We recommend that you use the cost insights feature to check whether some pods occupy the schedulable resources and whether the memory requests of pods are proper. For more information, see Enable cost insights. We recommend that you use the resource profiling feature to set the memory request. For more information, see Resource profiling.

  • Plan the capacity of the cluster and add nodes to the cluster. For more information, see Increase the number of nodes in an ACK cluster.

Memory usage of a host higher than 90%

Critical

(100 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100) >= 90

This rule triggers alerts if the memory usage of a host exceeds 90%.

  • Release resources.

    We recommend that you use the cost insights feature to check whether some pods occupy the schedulable resources and whether the memory requests of pods are proper. For more information, see Enable cost insights. We recommend that you use the resource profiling feature to set the memory request. For more information, see Resource profiling.

  • Plan the capacity of the cluster and add nodes to the cluster. For more information, see Increase the number of nodes in an ACK cluster.

CPU usage of the host higher than 85%

Warning

100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) >= 85

This rule triggers alerts if the CPU usage of a host exceeds 85%.

In the portal, click Alert Rule Set for Resource Exceptions and set the Node - CPU usage ≥ 85% alert rule.

Note

The alert rules used in ACK are provided by CloudMonitor (ECS monitoring). The CloudMonitor metric is the same as the metric used in this Prometheus alert rule.

The default threshold 85% is a suggested value. You can change the value based on your business requirements.

For more information, see Alert management.

  • Release resources.

    We recommend that you use the cost insights feature to check whether some pods occupy the schedulable resources and whether the CPU requests of pods are proper. For more information, see Enable cost insights. We recommend that you use the resource profiling feature to set the CPU request. For more information, see Resource profiling.

  • Plan the capacity of the cluster and add nodes to the cluster. For more information, see Increase the number of nodes in an ACK cluster.

CPU usage of the host higher than 90%

Critical

100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) >= 90

This rule triggers alerts if the CPU usage of a host exceeds 90%.

  • Release resources. We recommend that you use the cost insights feature to check whether some pods occupy the schedulable resources and whether the CPU requests of pods are proper. For more information, see Enable cost insights. We recommend that you use the resource profiling feature to set the CPU request. For more information, see Resource profiling.

  • Plan the capacity of the cluster and add nodes to the cluster. For more information, see Increase the number of nodes in an ACK cluster.

Abnormal resource watermarks of nodes

Note

Host resource metrics and node resource metrics have the following differences:

  • This metric is a node resource metric. It displays the statistics of the resources occupied by containers and the allocatable resources on a node.

    Take memory resources as an example:

    • Occupied resources: the total amount of working memory of all containers on a node. Working memory includes memory allocated to and used by containers and memory allocated to page caches.

    • Allocatable resources: the amount of memory that can be allocated on a node, excluding memory occupied by the container engine of the host (ACK reserved node resources). For more information, see Resource reservation policy.

  • The value of this metric is calculated based on the following formula: Resources occupied by containers on the node / Allocatable resources on the node.

Alert rule

Severity

PromQL statement

Description

SOP for handling alerts

CPU usage of a node higher than 85%

Warning

sum(irate(container_cpu_usage_seconds_total{pod!=""}[1m])) by (node) / sum(kube_node_status_allocatable{resource="cpu"}) by (node) * 100 >= 85

This rule triggers alerts if the CPU usage of a node exceeds 85%.

Formula:

Resources used on the node / Allocatable resources on the node.

  • Release resources.

    We recommend that you use the cost insights feature to check whether some pods occupy the schedulable resources and whether the CPU requests of pods are proper. For more information, see Enable cost insights. We recommend that you use the resource profiling feature to set the CPU request in order to spread pods to different nodes. This way, resource usage can be balanced among nodes. For more information, see Resource profiling.

  • Plan the capacity of the cluster and add nodes to the cluster. For more information, see Increase the number of nodes in an ACK cluster.

CPU allocation rate of a node higher than 85%

Normal

(sum(sum(kube_pod_container_resource_requests{resource="cpu"}) by (pod, node) * on (pod) group_left max(kube_pod_status_ready) by (pod, node)) by (node)) / sum(kube_node_status_allocatable{resource="cpu"}) by (node) * 100 >= 85

This rule triggers alerts if the CPU allocation rate of a node exceeds 85%.

The CPU allocation rate is calculated based on the following formula: Sum of resource requests of all pods on the node / Allocatable resources on the node.

  • When the node does not have sufficient allocatable resources, pods are scheduled to other nodes.

  • Check whether the pods on the node have unused resources. If yes, the actual resource usage will be much lower than the sum of resource requests. We recommend that you use the cost insights feature to check whether some pods occupy the schedulable resources and whether the CPU requests of pods are proper. For more information, see Enable cost insights. We recommend that you use the resource profiling feature to set the CPU request. For more information, see Resource profiling.

  • Plan the capacity of the cluster and add nodes to the cluster. For more information, see Increase the number of nodes in an ACK cluster.

CPU overcommitment rate of a node higher than 300%

Warning

(sum(sum(kube_pod_container_resource_limits{resource="cpu"}) by (pod, node) * on (pod) group_left max(kube_pod_status_ready) by (pod, node)) by (node)) / sum(kube_node_status_allocatable{resource="cpu"}) by (node) * 100 >= 300

This rule triggers alerts if the CPU overcommitment rate of a node exceeds 300%.

The CPU overcommitment rate is calculated based on the following formula: Sum of resource limits of all pods on the node / Allocatable resources on the node. For example, if the CPU limits of the pods on a node with 4 allocatable vCPUs add up to 12 vCPUs, the overcommitment rate is 300%.

The default threshold 300% is a suggested value. You can modify the value based on your business requirements.

  • The sum of the resource limits of the pods on a node is much higher than the amount of allocatable resources on the node. During peak hours, resource usage spikes may result in insufficient CPU time slices. As a result, pods compete for resources and CPU throttling is triggered, which further affects the performance of the processes running in the pods.

  • We recommend that you use the cost insights feature to check whether some pods occupy the schedulable resources and whether the CPU requests of pods are proper. For more information, see Enable cost insights. We recommend that you use the resource profiling feature to set the CPU request and limit. For more information, see Resource profiling.

  • Plan the capacity of the cluster and add nodes to the cluster. For more information, see Increase the number of nodes in an ACK cluster.

Memory usage of a node higher than 85%

Warning

sum(container_memory_working_set_bytes{pod!=""}) by (node) / sum(kube_node_status_allocatable{resource="memory"}) by (node) * 100 >= 85

This rule triggers alerts if the memory usage of a node exceeds 85%.

Formula:

Resources used on the node / Allocatable resources on the node.

  • Release resources.

    We recommend that you use the cost insights feature to check whether some pods occupy the schedulable resources and whether the memory requests of pods are proper. For more information, see Enable cost insights. We recommend that you use the resource profiling feature to set the memory request in order to spread pods to different nodes. This way, resource usage can be balanced among nodes. For more information, see Resource profiling.

  • Plan the capacity of the cluster and add nodes to the cluster. For more information, see Increase the number of nodes in an ACK cluster.

Memory allocation rate of a node higher than 85%

Normal

(sum(sum(kube_pod_container_resource_requests{resource="memory"}) by (pod, node) * on (pod) group_left max(kube_pod_status_ready) by (pod, node)) by (node)) / sum(kube_node_status_allocatable{resource="memory"}) by (node) * 100 >= 85

This rule triggers alerts if the memory allocation rate of a node exceeds 85%.

The memory allocation rate is calculated based on the following formula: Sum of resource requests of all pods on the node / Allocatable resources on the node.

  • When the node does not have sufficient allocatable resources, pods are scheduled to other nodes.

  • Check whether the pods on the node have unused resources. If yes, the actual resource usage will be much lower than the sum of resource requests. We recommend that you use the cost insights feature to check whether some pods occupy the schedulable resources and whether the memory requests of pods are proper. For more information, see Enable cost insights. We recommend that you use the resource profiling feature to set the memory request. For more information, see Resource profiling.

  • Plan the capacity of the cluster and add nodes to the cluster. For more information, see Increase the number of nodes in an ACK cluster.

Memory overcommitment rate of a node higher than 300%

Warning

(sum(sum(kube_pod_container_resource_limits{resource="memory"}) by (pod, node) * on (pod) group_left max(kube_pod_status_ready) by (pod, node)) by (node)) / sum(kube_node_status_allocatable{resource="memory"}) by (node) * 100 >= 300

This rule triggers alerts if the memory overcommitment rate of a node exceeds 300%.

The memory overcommitment rate is calculated based on the following formula: Sum of resource limits of all pods on the node / Allocatable resources on the node.

The default threshold 300% is a suggested value. You can modify the value based on your business requirements.

  • The sum of resource limits of pods on a node is much higher than the amount of allocatable resources on the node. During peak hours, resource usage spikes may result in memory bottlenecks. As a result, OOM errors occur. Processes may be killed and businesses may be interrupted.

  • Configure proper pod resource limits. We recommend that you use the cost insights feature to check whether some pods occupy the schedulable resources and whether the memory requests of pods are proper. For more information, see Enable cost insights. We recommend that you use the resource profiling feature to set the memory request and limit. For more information, see Resource profiling.

  • Plan the capacity of the cluster and add nodes to the cluster. For more information, see Increase the number of nodes in an ACK cluster.

Network errors

Alert rule

Severity

PromQL statement

Description

SOP for handling alerts

CoreDNS Unavailability - Number of requests drops to 0

Critical

(sum(rate(coredns_dns_request_count_total{}[1m]))by(server,zone)<=0) or (sum(rate(coredns_dns_requests_total{}[1m]))by(server,zone)<=0)

This rule applies only to ACK managed clusters (ACK Pro clusters and ACK basic clusters).

Check whether CoreDNS pods in the cluster run as expected.

CoreDNS Unavailability - Panics

Critical

sum(rate(coredns_panic_count_total{}[3m])) > 0

This rule applies only to ACK managed clusters (ACK Pro clusters and ACK basic clusters).

Check whether CoreDNS pods in the cluster run as expected. If CoreDNS pods in the cluster do not run as expected, submit a ticket.

Ingress controller certificates about to expire

Warning

((nginx_ingress_controller_ssl_expire_time_seconds - time()) / 24 / 3600) < 14

This rule applies only if Ingresses are created in the cluster and the ACK Ingress controller is installed.

Issue the Ingress controller certificates again.

Scaling issues

Alert rule

Severity

PromQL statement

Description

SOP for handling alerts

Maximum number of pods in the HPA configuration reached

Warning

max(kube_horizontalpodautoscaler_spec_max_replicas) by (namespace, horizontalpodautoscaler) - max(kube_horizontalpodautoscaler_status_current_replicas) by (namespace, horizontalpodautoscaler) <= 0

You must enable the horizontalpodautoscaler metric of Application Real-Time Monitoring Service (ARMS) Prometheus. By default, this metric is disabled. This metric is free of charge.

Check whether the HPA scaling policy meets the requirements.

References