Managed Service for Prometheus provides predefined alert rules for Application Real-Time Monitoring Service (ARMS), Kubernetes, MongoDB, MySQL, NGINX, and Redis.
ARMS alert rules
Name | Expression | Data collection interval (minutes) | Trigger condition |
--- | --- | --- | --- |
PodCpu75 | 100 * (sum(rate(container_cpu_usage_seconds_total[1m])) by (pod_name) / sum(label_replace(kube_pod_container_resource_limits_cpu_cores, "pod_name", "$1", "pod", "(.*)")) by (pod_name))>75 | 7 | The CPU utilization of a pod is greater than 75%. |
PodMemory75 | 100 * (sum(container_memory_working_set_bytes) by (pod_name) / sum(label_replace(kube_pod_container_resource_limits_memory_bytes, "pod_name", "$1", "pod", "(.*)")) by (pod_name))>75 | 5 | The memory usage of a pod is greater than 75%. |
pod_status_no_running | sum (kube_pod_status_phase{phase!="Running"}) by (pod,phase) | 5 | A pod is not running. |
PodMem4GbRestart | (sum (container_memory_working_set_bytes{id!="/"}) by (pod_name,container_name) /1024/1024/1024)>4 | 5 | The memory usage of a pod exceeds 4 GB. |
PodRestart | sum (increase (kube_pod_container_status_restarts_total{}[2m])) by (namespace,pod) >0 | 5 | A pod is restarted. |
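Each row in these tables corresponds to a rule that can be expressed in a standard Prometheus rule file. The following is a minimal sketch for the PodRestart rule; the group name, `for` duration, labels, and annotations are illustrative assumptions, not values shipped by the service:

```yaml
groups:
  - name: arms-pod-alerts            # illustrative group name
    rules:
      - alert: PodRestart
        # Expression copied from the PodRestart row above
        expr: sum (increase (kube_pod_container_status_restarts_total{}[2m])) by (namespace,pod) > 0
        for: 5m                      # assumption: mirrors the 5-minute collection interval
        labels:
          severity: warning          # illustrative label
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarted"
```

The `for` clause holds the alert in a pending state until the expression has been true for the stated duration, which reduces noise from one-off restarts.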
Kubernetes alert rules
Name | Expression | Data collection interval (minutes) | Trigger condition |
--- | --- | --- | --- |
KubeStateMetricsListErrors | (sum(rate(kube_state_metrics_list_total{job="kube-state-metrics",result="error"}[5m])) / sum(rate(kube_state_metrics_list_total{job="kube-state-metrics"}[5m]))) > 0.01 | 15 | An error occurs in a metric list. |
KubeStateMetricsWatchErrors | (sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics",result="error"}[5m])) / sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics"}[5m]))) > 0.01 | 15 | An error occurs in a metric watch. |
NodeFilesystemAlmostOutOfSpace | ( node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 5 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 ) | 60 | A node file system is running out of space. |
NodeFilesystemSpaceFillingUp | ( node_filesystem_avail_bytes{job="node-exporter",fstype!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!=""} * 100 < 40 and predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 ) | 60 | A node file system is about to be fully occupied. |
NodeFilesystemFilesFillingUp | ( node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 40 and predict_linear(node_filesystem_files_free{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 ) | 60 | Files in a node file system are about to be fully occupied. |
NodeFilesystemAlmostOutOfFiles | ( node_filesystem_files_free{job="node-exporter",fstype!=""} / node_filesystem_files{job="node-exporter",fstype!=""} * 100 < 3 and node_filesystem_readonly{job="node-exporter",fstype!=""} == 0 ) | 60 | Almost no files exist in a node file system. |
NodeNetworkReceiveErrs | increase(node_network_receive_errs_total[2m]) > 10 | 60 | A network reception error occurs in a node. |
NodeNetworkTransmitErrs | increase(node_network_transmit_errs_total[2m]) > 10 | 60 | A network transmission error occurs in a node. |
NodeHighNumberConntrackEntriesUsed | (node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.75 | None | A large number of conntrack entries are used. |
NodeClockSkewDetected | ( node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0 ) or ( node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0 ) | 10 | Time deviation occurs. |
NodeClockNotSynchronising | min_over_time(node_timex_sync_status[5m]) == 0 | 10 | The node clock is not synchronizing. |
KubePodCrashLooping | rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[15m]) * 60 * 5 > 0 | 15 | A pod is crash looping. |
KubePodNotReady | sum by (namespace, pod) (max by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics", phase=~"Pending|Unknown"}) * on(namespace, pod) group_left(owner_kind) max by(namespace, pod, owner_kind) (kube_pod_owner{owner_kind!="Job"})) > 0 | 15 | A pod is not ready. |
KubeDeploymentGenerationMismatch | kube_deployment_status_observed_generation{job="kube-state-metrics"} != kube_deployment_metadata_generation{job="kube-state-metrics"} | 15 | Deployment versions do not match. |
KubeDeploymentReplicasMismatch | ( kube_deployment_spec_replicas{job="kube-state-metrics"} != kube_deployment_status_replicas_available{job="kube-state-metrics"} ) and ( changes(kube_deployment_status_replicas_updated{job="kube-state-metrics"}[5m]) == 0 ) | 15 | Deployment replicas do not match. |
KubeStatefulSetReplicasMismatch | ( kube_statefulset_status_replicas_ready{job="kube-state-metrics"} != kube_statefulset_status_replicas{job="kube-state-metrics"} ) and ( changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics"}[5m]) == 0 ) | 15 | StatefulSet replicas do not match. |
KubeStatefulSetGenerationMismatch | kube_statefulset_status_observed_generation{job="kube-state-metrics"} != kube_statefulset_metadata_generation{job="kube-state-metrics"} | 15 | StatefulSet versions do not match. |
KubeStatefulSetUpdateNotRolledOut | max without (revision) ( kube_statefulset_status_current_revision{job="kube-state-metrics"} unless kube_statefulset_status_update_revision{job="kube-state-metrics"} ) * ( kube_statefulset_replicas{job="kube-state-metrics"} != kube_statefulset_status_replicas_updated{job="kube-state-metrics"} ) | 15 | A StatefulSet update is not rolled out. |
KubeDaemonSetRolloutStuck | kube_daemonset_status_number_ready{job="kube-state-metrics"} / kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} < 1.00 | 15 | A DaemonSet rollout is stuck. |
KubeContainerWaiting | sum by (namespace, pod, container) (kube_pod_container_status_waiting_reason{job="kube-state-metrics"}) > 0 | 60 | A container is waiting. |
KubeDaemonSetNotScheduled | kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} - kube_daemonset_status_current_number_scheduled{job="kube-state-metrics"} > 0 | 10 | A DaemonSet is not scheduled. |
KubeDaemonSetMisScheduled | kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 0 | 15 | A DaemonSet is misscheduled. |
KubeCronJobRunning | time() - kube_cronjob_next_schedule_time{job="kube-state-metrics"} > 3600 | 60 | A cron job takes more than 1 hour to complete. |
KubeJobCompletion | kube_job_spec_completions{job="kube-state-metrics"} - kube_job_status_succeeded{job="kube-state-metrics"} > 0 | 60 | A job has not completed. |
KubeJobFailed | kube_job_failed{job="kube-state-metrics"} > 0 | 15 | A job failed. |
KubeHpaReplicasMismatch | (kube_hpa_status_desired_replicas{job="kube-state-metrics"} != kube_hpa_status_current_replicas{job="kube-state-metrics"}) and changes(kube_hpa_status_current_replicas[15m]) == 0 | 15 | Horizontal Pod Autoscaler (HPA) replicas do not match. |
KubeHpaMaxedOut | kube_hpa_status_current_replicas{job="kube-state-metrics"} == kube_hpa_spec_max_replicas{job="kube-state-metrics"} | 15 | The maximum number of HPA replicas is reached. |
KubeCPUOvercommit | sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum{}) / sum(kube_node_status_allocatable_cpu_cores) > (count(kube_node_status_allocatable_cpu_cores)-1) / count(kube_node_status_allocatable_cpu_cores) | 5 | The CPU is overcommitted. |
KubeMemoryOvercommit | sum(namespace:kube_pod_container_resource_requests_memory_bytes:sum{}) / sum(kube_node_status_allocatable_memory_bytes) > (count(kube_node_status_allocatable_memory_bytes)-1) / count(kube_node_status_allocatable_memory_bytes) | 5 | Memory is overcommitted. |
KubeCPUQuotaOvercommit | sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="cpu"}) / sum(kube_node_status_allocatable_cpu_cores) > 1.5 | 5 | The CPU quota is overcommitted. |
KubeMemoryQuotaOvercommit | sum(kube_resourcequota{job="kube-state-metrics", type="hard", resource="memory"}) / sum(kube_node_status_allocatable_memory_bytes{job="node-exporter"}) > 1.5 | 5 | The memory quota is overcommitted. |
KubeQuotaExceeded | kube_resourcequota{job="kube-state-metrics", type="used"} / ignoring(instance, job, type) (kube_resourcequota{job="kube-state-metrics", type="hard"} > 0) > 0.90 | 15 | The quota is exceeded. |
CPUThrottlingHigh | sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (container, pod, namespace) / sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container, pod, namespace) > ( 25 / 100 ) | 15 | Processes experience elevated CPU throttling. |
KubePersistentVolumeFillingUp | kubelet_volume_stats_available_bytes{job="kubelet", metrics_path="/metrics"} / kubelet_volume_stats_capacity_bytes{job="kubelet", metrics_path="/metrics"} < 0.03 | 1 | The volume capacity is insufficient. |
KubePersistentVolumeErrors | kube_persistentvolume_status_phase{phase=~"Failed|Pending",job="kube-state-metrics"} > 0 | 5 | A persistent volume is in the Failed or Pending state. |
KubeVersionMismatch | count(count by (gitVersion) (label_replace(kubernetes_build_info{job!~"kube-dns|coredns"},"gitVersion","$1","gitVersion","(v[0-9]*.[0-9]*.[0-9]*).*"))) > 1 | 15 | Versions do not match. |
KubeClientErrors | (sum(rate(rest_client_requests_total{code=~"5.."}[5m])) by (instance, job) / sum(rate(rest_client_requests_total[5m])) by (instance, job)) > 0.01 | 15 | An error occurs in the client. |
KubeAPIErrorBudgetBurn | sum(apiserver_request:burnrate1h) > (14.40 * 0.01000) and sum(apiserver_request:burnrate5m) > (14.40 * 0.01000) | 2 | Excessive API errors occur. |
KubeAPILatencyHigh | ( cluster:apiserver_request_duration_seconds:mean5m{job="apiserver"} > on (verb) group_left() ( avg by (verb) (cluster:apiserver_request_duration_seconds:mean5m{job="apiserver"} >= 0) + 2*stddev by (verb) (cluster:apiserver_request_duration_seconds:mean5m{job="apiserver"} >= 0) ) ) > on (verb) group_left() 1.2 * avg by (verb) (cluster:apiserver_request_duration_seconds:mean5m{job="apiserver"} >= 0) and on (verb,resource) cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{job="apiserver",quantile="0.99"} > 1 | 5 | The API latency is high. |
KubeAPIErrorsHigh | sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[5m])) by (resource,subresource,verb) / sum(rate(apiserver_request_total{job="apiserver"}[5m])) by (resource,subresource,verb) > 0.05 | 10 | Excessive API errors occur. |
KubeClientCertificateExpiration | apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 604800 | None | A client certificate is about to expire. |
AggregatedAPIErrors | sum by(name, namespace)(increase(aggregator_unavailable_apiservice_count[5m])) > 2 | None | An error occurs in the aggregated API. |
AggregatedAPIDown | sum by(name, namespace)(sum_over_time(aggregator_unavailable_apiservice[5m])) > 0 | 5 | The aggregated API is offline. |
KubeAPIDown | absent(up{job="apiserver"} == 1) | 15 | The API server is offline. |
KubeNodeNotReady | kube_node_status_condition{job="kube-state-metrics",condition="Ready",status="true"} == 0 | 15 | A node is not ready. |
KubeNodeUnreachable | kube_node_spec_taint{job="kube-state-metrics",key="node.kubernetes.io/unreachable",effect="NoSchedule"} == 1 | 2 | A node is unreachable. |
KubeletTooManyPods | max(max(kubelet_running_pod_count{job="kubelet", metrics_path="/metrics"}) by(instance) * on(instance) group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"}) by(node) / max(kube_node_status_capacity_pods{job="kube-state-metrics"} != 1) by(node) > 0.95 | 15 | Excessive pods exist. |
KubeNodeReadinessFlapping | sum(changes(kube_node_status_condition{status="true",condition="Ready"}[15m])) by (node) > 2 | 15 | The readiness status changes frequently. |
KubeletPlegDurationHigh | node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile{quantile="0.99"} >= 10 | 5 | The pod lifecycle event generator (PLEG) relist duration is too long. |
KubeletPodStartUpLatencyHigh | histogram_quantile(0.99, sum(rate(kubelet_pod_worker_duration_seconds_bucket{job="kubelet", metrics_path="/metrics"}[5m])) by (instance, le)) * on(instance) group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"} > 60 | 15 | The startup latency of a pod is high. |
KubeletDown | absent(up{job="kubelet", metrics_path="/metrics"} == 1) | 15 | The kubelet is offline. |
KubeSchedulerDown | absent(up{job="kube-scheduler"} == 1) | 15 | The Kubernetes scheduler is offline. |
KubeControllerManagerDown | absent(up{job="kube-controller-manager"} == 1) | 15 | The controller manager is offline. |
TargetDown | 100 * (count(up == 0) BY (job, namespace, service) / count(up) BY (job, namespace, service)) > 10 | 10 | The target is offline. |
NodeNetworkInterfaceFlapping | changes(node_network_up{job="node-exporter",device!~"veth.+"}[2m]) > 2 | 2 | The network interface status changes frequently. |
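Note that several Kubernetes expressions reference precomputed recording rules rather than raw metrics, for example `apiserver_request:burnrate1h` in KubeAPIErrorBudgetBurn and `cluster:apiserver_request_duration_seconds:mean5m` in KubeAPILatencyHigh; these must be defined in the same Prometheus instance for the alerts to fire. The following is a deliberately simplified sketch of how such a burn-rate recording rule could look, covering only the 5xx error fraction (the upstream kubernetes-mixin definition also folds in slow read and write requests, so treat this as an illustration, not the shipped definition):

```yaml
groups:
  - name: apiserver-burnrate.rules   # illustrative group name
    rules:
      # Fraction of API server requests that failed with a 5xx status
      # over the last hour. Error-fraction part only; the real rule
      # additionally counts requests that exceed latency thresholds.
      - record: apiserver_request:burnrate1h
        expr: |
          sum(rate(apiserver_request_total{job="apiserver",code=~"5.."}[1h]))
          /
          sum(rate(apiserver_request_total{job="apiserver"}[1h]))
```

Precomputing the ratio keeps the alert expression cheap to evaluate and lets multiple burn-rate windows (5m, 1h, and so on) share one convention.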
MongoDB alert rules
Name | Expression | Data collection interval (minutes) | Trigger condition |
--- | --- | --- | --- |
MongodbReplicationLag | avg(mongodb_replset_member_optime_date{state="PRIMARY"}) - avg(mongodb_replset_member_optime_date{state="SECONDARY"}) > 10 | 5 | The replication latency is long. |
MongodbReplicationHeadroom | (avg(mongodb_replset_oplog_tail_timestamp - mongodb_replset_oplog_head_timestamp) - (avg(mongodb_replset_member_optime_date{state="PRIMARY"}) - avg(mongodb_replset_member_optime_date{state="SECONDARY"}))) <= 0 | 5 | The replication limit is reached. |
MongodbReplicationStatus3 | mongodb_replset_member_state == 3 | 5 | The replication status is 3. |
MongodbReplicationStatus6 | mongodb_replset_member_state == 6 | 5 | The replication status is 6. |
MongodbReplicationStatus8 | mongodb_replset_member_state == 8 | 5 | The replication status is 8. |
MongodbReplicationStatus10 | mongodb_replset_member_state == 10 | 5 | The replication status is 10. |
MongodbNumberCursorsOpen | mongodb_metrics_cursor_open{state="total_open"} > 10000 | 5 | Excessive cursors exist. |
MongodbCursorsTimeouts | sum(increase(mongodb_metrics_cursor_timed_out_total[10m])) > 100 | 5 | Excessive cursor timeouts occur. |
MongodbTooManyConnections | mongodb_connections{state="current"} > 500 | 5 | Excessive connections exist. |
MongodbVirtualMemoryUsage | (sum(mongodb_memory{type="virtual"}) BY (ip) / sum(mongodb_memory{type="mapped"}) BY (ip)) > 3 | 5 | The virtual memory usage is high. |
MySQL alert rules
Name | Expression | Data collection interval (minutes) | Trigger condition |
--- | --- | --- | --- |
MySQL is down | mysql_up == 0 | 1 | MySQL is offline. |
open files high | mysql_global_status_innodb_num_open_files > (mysql_global_variables_open_files_limit) * 0.75 | 1 | Excessive files are opened. |
Read buffer size is bigger than max. allowed packet size | mysql_global_variables_read_buffer_size > mysql_global_variables_slave_max_allowed_packet | 1 | The size of the read buffer exceeds the maximum allowed packet size. |
Sort buffer possibly missconfigured | mysql_global_variables_innodb_sort_buffer_size <256*1024 or mysql_global_variables_read_buffer_size > 4*1024*1024 | 1 | A configuration error may exist in the sort buffer. |
Thread stack size is too small | mysql_global_variables_thread_stack <196608 | 1 | The thread stack size is small. |
Used more than 80% of max connections limited | mysql_global_status_max_used_connections > mysql_global_variables_max_connections * 0.8 | 1 | More than 80% of the maximum connections are in use. |
InnoDB Force Recovery is enabled | mysql_global_variables_innodb_force_recovery != 0 | 1 | Forcible recovery is enabled. |
InnoDB Log File size is too small | mysql_global_variables_innodb_log_file_size < 16777216 | 1 | The log file size is small. |
InnoDB Flush Log at Transaction Commit | mysql_global_variables_innodb_flush_log_at_trx_commit != 1 | 1 | Logs are not flushed at each transaction commit. |
Table definition cache too small | mysql_global_status_open_table_definitions > mysql_global_variables_table_definition_cache | 1 | The number of cached table definitions is small. |
Table open cache too small | mysql_global_status_open_tables >mysql_global_variables_table_open_cache * 99/100 | 1 | The number of cached open tables is small. |
Thread stack size is possibly too small | mysql_global_variables_thread_stack < 262144 | 1 | The thread stack size may be small. |
InnoDB Buffer Pool Instances is too small | mysql_global_variables_innodb_buffer_pool_instances == 1 | 1 | The number of instances in the buffer pool is small. |
InnoDB Plugin is enabled | mysql_global_variables_ignore_builtin_innodb == 1 | 1 | The plug-in is enabled. |
Binary Log is disabled | mysql_global_variables_log_bin != 1 | 1 | Binary logs are disabled. |
Binlog Cache size too small | mysql_global_variables_binlog_cache_size < 1048576 | 1 | The cache size is small. |
Binlog Statement Cache size too small | mysql_global_variables_binlog_stmt_cache_size <1048576 and mysql_global_variables_binlog_stmt_cache_size > 0 | 1 | The statement cache size is small. |
Binlog Transaction Cache size too small | mysql_global_variables_binlog_cache_size <1048576 | 1 | The transaction cache size is small. |
Sync Binlog is enabled | mysql_global_variables_sync_binlog == 1 | 1 | Synchronous binary logging is enabled. |
IO thread stopped | mysql_slave_status_slave_io_running != 1 | 1 | I/O threads are stopped. |
SQL thread stopped | mysql_slave_status_slave_sql_running == 0 | 1 | SQL threads are stopped. |
Mysql_Too_Many_Connections | rate(mysql_global_status_threads_connected[5m])>200 | 5 | Excessive connections exist. |
Mysql_Too_Many_slow_queries | rate(mysql_global_status_slow_queries[5m])>3 | 5 | Excessive slow queries exist. |
Slave lagging behind Master | rate(mysql_slave_status_seconds_behind_master[1m]) >30 | 1 | A secondary node lags behind the primary node. |
Slave is NOT read only (this warning can be ignored) | mysql_global_variables_read_only != 0 | 1 | The secondary nodes are not read-only. |
NGINX alert rules
Name | Expression | Data collection interval (minutes) | Trigger condition |
--- | --- | --- | --- |
NginxHighHttp4xxErrorRate | sum(rate(nginx_http_requests_total{status=~"^4.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5 | 5 | The rate of HTTP 4xx errors is high. |
NginxHighHttp5xxErrorRate | sum(rate(nginx_http_requests_total{status=~"^5.."}[1m])) / sum(rate(nginx_http_requests_total[1m])) * 100 > 5 | 5 | The rate of HTTP 5xx errors is high. |
NginxLatencyHigh | histogram_quantile(0.99, sum(rate(nginx_http_request_duration_seconds_bucket[30m])) by (host, node)) > 10 | 5 | The latency is high. |
Redis alert rules
Name | Expression | Data collection interval (minutes) | Trigger condition |
--- | --- | --- | --- |
RedisDown | redis_up == 0 | 5 | Redis is offline. |
RedisMissingMaster | count(redis_instance_info{role="master"}) == 0 | 5 | The primary node is missing. |
RedisTooManyMasters | count(redis_instance_info{role="master"}) > 1 | 5 | Excessive primary nodes exist. |
RedisDisconnectedSlaves | count without (instance, job) (redis_connected_slaves) - sum without (instance, job) (redis_connected_slaves) - 1 > 1 | 5 | Secondary nodes are disconnected. |
RedisReplicationBroken | delta(redis_connected_slaves[1m]) < 0 | 5 | The replication is interrupted. |
RedisClusterFlapping | changes(redis_connected_slaves[5m]) > 2 | 5 | The number of connected replicas changes frequently. |
RedisMissingBackup | time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 24 | 5 | No RDB backup has been saved in the last 24 hours. |
RedisOutOfMemory | redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90 | 5 | The memory is insufficient. |
RedisTooManyConnections | redis_connected_clients > 100 | 5 | Excessive connections exist. |
RedisNotEnoughConnections | redis_connected_clients < 5 | 5 | Connections are insufficient. |
RedisRejectedConnections | increase(redis_rejected_connections_total[1m]) > 0 | 5 | The connection is rejected. |
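Thresholds such as `redis_connected_clients > 100` are generic defaults and usually need tuning for a specific workload. A hedged sketch of how two of the Redis rules could be grouped into one rule file with severity labels and templated annotations (the group name, `for` durations, labels, and annotations are illustrative assumptions):

```yaml
groups:
  - name: redis-alerts               # illustrative group name
    rules:
      - alert: RedisDown
        expr: redis_up == 0
        for: 5m                      # assumption: mirrors the 5-minute interval
        labels:
          severity: critical         # illustrative label
        annotations:
          summary: "Redis instance {{ $labels.instance }} is down"
      - alert: RedisTooManyConnections
        # 100 is the default from the table above; tune for your workload.
        expr: redis_connected_clients > 100
        for: 5m
        labels:
          severity: warning
```

Grouping related rules lets Prometheus evaluate them on a shared interval and keeps severity routing in Alertmanager consistent across a service.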