Alert rule set | Alert rule | Description | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
Alert rule set for critical events in the cluster. | Errors | An alert is triggered when an error occurs in the cluster. | event | error-event | sls.app.ack.error |
Warnings | An alert is triggered when a warning occurs in the cluster, except for warnings that can be ignored. | event | warn-event | sls.app.ack.warn |
Alert rule set for cluster exceptions | Docker process exceptions on nodes | An alert is triggered when a dockerd exception or a containerd exception occurs on a node. | event | docker-hang | sls.app.ack.docker.hang |
Evictions in the cluster | An alert is triggered when a pod is evicted. | event | eviction-event | sls.app.ack.eviction |
GPU Xid errors | An alert is triggered when a GPU Xid error occurs. | event | gpu-xid-error | sls.app.ack.gpu.xid_error |
Node changes to the unschedulable state | An alert is triggered when the status of a node changes to unschedulable. | event | node-down | sls.app.ack.node.down |
Node restarts | An alert is triggered when a node restarts. | event | node-restart | sls.app.ack.node.restart |
NTP service failures on nodes | An alert is triggered when the Network Time Protocol (NTP) service fails. | event | node-ntp-down | sls.app.ack.ntp.down |
PLEG errors on nodes | An alert is triggered when a Pod Lifecycle Event Generator (PLEG) error occurs on a node. | event | node-pleg-error | sls.app.ack.node.pleg_error |
Process errors on nodes | An alert is triggered when a process error occurs on a node. | event | ps-hang | sls.app.ack.ps.hang |
Alert rule set for resource exceptions | Node - CPU usage ≥ 85% | An alert is triggered when the CPU usage of a node exceeds the threshold. The default threshold is 85%. If less than 15% of the CPU resources remain available, the CPU resources reserved for components may become insufficient. As a result, CPU throttling may be triggered frequently and processes may respond slowly. For more information, see Resource reservation policy. We recommend that you optimize the CPU usage or adjust the threshold at the earliest opportunity. For more information about how to adjust the threshold, see Example - Modify the alert threshold for basic cluster resources by using a CRD. | metric-cms | node_cpu_util_high | cms.host.cpu.utilization |
Node - Memory usage ≥ 85% | An alert is triggered when the memory usage of a node exceeds the threshold. The default threshold is 85%. If less than 15% of the memory resources remain available, the memory resources reserved for components may become insufficient. In this scenario, kubelet forcibly evicts pods from the node. For more information, see Resource reservation policy. We recommend that you optimize the memory usage or adjust the threshold at the earliest opportunity. For more information about how to adjust the threshold, see Example - Modify the alert threshold for basic cluster resources by using a CRD. | metric-cms | node_mem_util_high | cms.host.memory.utilization |
Node - Disk usage ≥ 85% | An alert is triggered when the disk usage of a node exceeds the threshold. The default threshold is 85%. For more information about how to adjust the threshold, see Example - Modify the alert threshold for basic cluster resources by using a CRD. | metric-cms | node_disk_util_high | cms.host.disk.utilization |
Node - Usage of outbound public bandwidth ≥ 85% | An alert is triggered when the usage of the outbound public bandwidth of a node exceeds the threshold. The default threshold is 85%. For more information about how to adjust the threshold, see Example - Modify the alert threshold for basic cluster resources by using a CRD. | metric-cms | node_public_net_util_high | cms.host.public.network.utilization |
Node - Inode usage ≥ 85% | An alert is triggered when the inode usage of a node exceeds the threshold. The default threshold is 85%. For more information about how to adjust the threshold, see Example - Modify the alert threshold for basic cluster resources by using a CRD. | metric-cms | node_fs_inode_util_high | cms.host.fs.inode.utilization |
Resources - QPS usage of an SLB instance ≥ 85% | An alert is triggered when the queries per second (QPS) usage of a Server Load Balancer (SLB) instance exceeds the threshold. The default threshold is 85%. Note This rule applies to all SLB instances that are created for the Kubernetes API server and Ingresses. For more information about how to adjust the threshold, see Example - Modify the alert threshold for basic cluster resources by using a CRD. | metric-cms | slb_qps_util_high | cms.slb.qps.utilization |
Resources - Usage of SLB outbound bandwidth ≥ 85% | An alert is triggered when the usage of the outbound bandwidth of an SLB instance exceeds the threshold. The default threshold is 85%. Note This rule applies to all SLB instances that are created for the Kubernetes API server and Ingresses. For more information about how to adjust the threshold, see Example - Modify the alert threshold for basic cluster resources by using a CRD. | metric-cms | slb_traff_tx_util_high | cms.slb.traffic.tx.utilization |
Resources - Usage of the maximum connections of an SLB instance ≥ 85% | An alert is triggered when the usage of the maximum number of connections of an SLB instance exceeds the threshold. The default threshold is 85%. Note This rule applies to all SLB instances that are created for the Kubernetes API server and Ingresses. For more information about how to adjust the threshold, see Example - Modify the alert threshold for basic cluster resources by using a CRD. | metric-cms | slb_max_con_util_high | cms.slb.max.connection.utilization |
Resources - Connection drops per second of the listeners of an SLB instance remain ≥ 1 | An alert is triggered when the number of connections dropped per second by the listeners of an SLB instance remains at 1 or more. The default threshold is 1. Note This rule applies to all SLB instances that are created for the Kubernetes API server and Ingresses. For more information about how to adjust the threshold, see Example - Modify the alert threshold for basic cluster resources by using a CRD. | metric-cms | slb_drop_con_high | cms.slb.drop.connection |
Excessive file handles on nodes | An alert is triggered when excessive file handles exist on a node. | event | node-fd-pressure | sls.app.ack.node.fd_pressure |
Insufficient node disk space | An alert is triggered when the disk space of a node is insufficient. | event | node-disk-pressure | sls.app.ack.node.disk_pressure |
Excessive processes on nodes | An alert is triggered when excessive processes run on a node. | event | node-pid-pressure | sls.app.ack.node.pid_pressure |
Insufficient node resources for scheduling | An alert is triggered when a node has insufficient resources for scheduling. | event | node-res-insufficient | sls.app.ack.resource.insufficient |
Insufficient node IP addresses | An alert is triggered when node IP addresses are insufficient. | event | node-ip-pressure | sls.app.ack.ip.not_enough |
Alert rule set for pod exceptions | Pod OOM errors | An alert is triggered when an out-of-memory (OOM) error occurs in a pod. | event | pod-oom | sls.app.ack.pod.oom |
Pod restart failures | An alert is triggered when a pod fails to restart. | event | pod-failed | sls.app.ack.pod.failed |
Image pull failures | An alert is triggered when an image fails to be pulled. | event | image-pull-back-off | sls.app.ack.image.pull_back_off |
Alert rule set for O&M exceptions | No available SLB instance | An alert is triggered when an SLB instance fails to be created. In this case, submit a ticket to contact the ACK technical team. | event | slb-no-ava | sls.app.ack.ccm.no_ava_slb |
SLB instance update failures | An alert is triggered when an SLB instance fails to be updated. In this case, submit a ticket to contact the ACK technical team. | event | slb-sync-err | sls.app.ack.ccm.sync_slb_failed |
SLB instance deletion failures | An alert is triggered when an SLB instance fails to be deleted. In this case, submit a ticket to contact the ACK technical team. | event | slb-del-err | sls.app.ack.ccm.del_slb_failed |
Node deletion failures | An alert is triggered when a node fails to be deleted. In this case, submit a ticket to contact the ACK technical team. | event | node-del-err | sls.app.ack.ccm.del_node_failed |
Node adding failures | An alert is triggered when a node fails to be added to the cluster. In this case, submit a ticket to contact the ACK technical team. | event | node-add-err | sls.app.ack.ccm.add_node_failed |
Route creation failures | An alert is triggered when a cluster fails to create a route in the virtual private cloud (VPC). In this case, submit a ticket to contact the ACK technical team. | event | route-create-err | sls.app.ack.ccm.create_route_failed |
Route update failures | An alert is triggered when a cluster fails to update the routes of the VPC. In this case, submit a ticket to contact the ACK technical team. | event | route-sync-err | sls.app.ack.ccm.sync_route_failed |
Command execution failures in managed node pools | An alert is triggered when a command fails to run in a managed node pool. In this case, submit a ticket to contact the ACK technical team. | event | nlc-run-cmd-err | sls.app.ack.nlc.run_command_fail |
Node removal failures in managed node pools | An alert is triggered when a node fails to be removed from a managed node pool. In this case, submit a ticket to contact the ACK technical team. | event | nlc-empty-cmd | sls.app.ack.nlc.empty_task_cmd |
Unimplemented URL mode in managed node pools | An alert is triggered when a node pool error occurs. In this case, submit a ticket to contact the ACK technical team. | event | nlc-url-m-unimp | sls.app.ack.nlc.url_mode_unimpl |
Unknown repairing operations in managed node pools | An alert is triggered when a node pool error occurs. In this case, submit a ticket to contact the ACK technical team. | event | nlc-opt-no-found | sls.app.ack.nlc.op_not_found |
Node draining and removal failures in managed node pools | An alert is triggered when a node in a managed node pool fails to be drained and removed. In this case, submit a ticket to contact the ACK technical team. | event | nlc-des-node-err | sls.app.ack.nlc.destroy_node_fail |
Node draining failures in managed node pools | An alert is triggered when a node in a managed node pool fails to be drained. In this case, submit a ticket to contact the ACK technical team. | event | nlc-drain-node-err | sls.app.ack.nlc.drain_node_fail |
ECS restart timeouts in managed node pools | An alert is triggered when the restart of an ECS instance in a managed node pool times out. In this case, submit a ticket to contact the ACK technical team. | event | nlc-restart-ecs-wait | sls.app.ack.nlc.restart_ecs_wait_fail |
ECS restart failures in managed node pools | An alert is triggered when an ECS instance in a managed node pool fails to restart. In this case, submit a ticket to contact the ACK technical team. | event | nlc-restart-ecs-err | sls.app.ack.nlc.restart_ecs_fail |
ECS reset failures in managed node pools | An alert is triggered when an ECS instance in a managed node pool fails to be reset. In this case, submit a ticket to contact the ACK technical team. | event | nlc-reset-ecs-err | sls.app.ack.nlc.reset_ecs_fail |
Auto-repair task failures in managed node pools | An alert is triggered when an auto-repair task fails in a managed node pool. In this case, submit a ticket to contact the ACK technical team. | event | nlc-sel-repair-err | sls.app.ack.nlc.repair_fail |
Alert rule set for network exceptions | Invalid Terway resources | An alert is triggered when a Terway resource is invalid. In this case, submit a ticket to contact the ACK technical team. | event | terway-invalid-res | sls.app.ack.terway.invalid_resource |
IP allocation failures of Terway | An alert is triggered when an IP address fails to be allocated in Terway mode. In this case, submit a ticket to contact the ACK technical team. | event | terway-alloc-ip-err | sls.app.ack.terway.alloc_ip_fail |
Ingress bandwidth configuration parsing failures | An alert is triggered when the bandwidth configuration of an Ingress fails to be parsed. In this case, submit a ticket to contact the ACK technical team. | event | terway-parse-err | sls.app.ack.terway.parse_fail |
Network resource allocation failures of Terway | An alert is triggered when a network resource fails to be allocated in Terway mode. In this case, submit a ticket to contact the ACK technical team. | event | terway-alloc-res-err | sls.app.ack.terway.allocate_failure |
Network resource reclaiming failures of Terway | An alert is triggered when a network resource fails to be reclaimed in Terway mode. In this case, submit a ticket to contact the ACK technical team. | event | terway-dispose-err | sls.app.ack.terway.dispose_failure |
Terway virtual mode changes | An alert is triggered when the Terway virtual mode is changed. | event | terway-virt-mod-err | sls.app.ack.terway.virtual_mode_change |
Pod IP checks executed by Terway | An alert is triggered when a pod IP is checked in Terway mode. | event | terway-ip-check | sls.app.ack.terway.config_check |
Ingress configuration reload failures | An alert is triggered when the configuration of an Ingress fails to be reloaded. In this case, check whether the Ingress configuration is valid. | event | ingress-reload-err | sls.app.ack.ingress.err_reload_nginx |
Alert rule set for storage exceptions | Cloud disk size less than 20 GiB | ACK does not allow you to mount a disk of less than 20 GiB. You can check the sizes of the disks that are attached to your cluster. | event | csi_invalid_size | sls.app.ack.csi.invalid_disk_size |
Subscription cloud disks cannot be mounted | ACK does not allow you to mount a subscription disk. You can check the billing methods of the disks that are attached to your cluster. | event | csi_not_portable | sls.app.ack.csi.disk_not_portable |
Unmount failures because the mount target is in use | An alert is triggered when an unmount operation fails because the mount target is in use. | event | csi_device_busy | sls.app.ack.csi.deivce_busy |
No available cloud disk | An alert is triggered when no disk is available. In this case, submit a ticket to contact the ACK technical team. | event | csi_no_ava_disk | sls.app.ack.csi.no_ava_disk |
I/O hangs of cloud disks | An alert is triggered when I/O hangs occur on a disk. In this case, submit a ticket to contact the ACK technical team. | event | csi_disk_iohang | sls.app.ack.csi.disk_iohang |
Slow I/O of cloud disks mounted by using a PVC | An alert is triggered when the I/O of a disk that is mounted by using a persistent volume claim (PVC) is slow. In this case, submit a ticket to contact the ACK technical team. | event | csi_latency_high | sls.app.ack.csi.latency_too_high |
Disk usage exceeds the threshold | An alert is triggered when the usage of a disk exceeds the specified threshold. You can check the usage of a disk that is mounted to your cluster. | event | disk_space_press | sls.app.ack.csi.no_enough_disk_space |
Alert rule set for cluster security events | High-risk configurations detected in inspections | An alert is triggered when a high-risk configuration is detected during a cluster inspection. In this case, submit a ticket to contact the ACK technical team. | event | si-c-a-risk | sls.app.ack.si.config_audit_high_risk |
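
The threshold semantics of the metric-cms rules in the table can be illustrated with a minimal Python sketch. It is not ACK or CloudMonitor code: the `DEFAULT_THRESHOLDS` mapping only restates the defaults listed above, and the `firing_rules` helper, its `samples` input, and its `overrides` parameter are hypothetical names introduced for illustration. The sketch assumes that a rule fires when the observed value is greater than or equal to the threshold, as the "≥" in the rule names suggests.

```python
# Default thresholds restated from the table (ACK_CR_Rule_Name -> threshold).
# Utilization rules are in percent; slb_drop_con_high is in dropped
# connections per second.
DEFAULT_THRESHOLDS = {
    "node_cpu_util_high": 85,
    "node_mem_util_high": 85,
    "node_disk_util_high": 85,
    "node_public_net_util_high": 85,
    "node_fs_inode_util_high": 85,
    "slb_qps_util_high": 85,
    "slb_traff_tx_util_high": 85,
    "slb_max_con_util_high": 85,
    "slb_drop_con_high": 1,
}


def firing_rules(samples, overrides=None):
    """Return the sorted names of rules whose sample meets or exceeds its threshold.

    samples:   hypothetical mapping of rule name -> latest observed value.
    overrides: optional mapping of rule name -> custom threshold, standing in
               for a threshold that was adjusted (for example, through a CRD).
    Rule names that are not in the table are ignored.
    """
    thresholds = {**DEFAULT_THRESHOLDS, **(overrides or {})}
    return sorted(
        name
        for name, value in samples.items()
        if name in thresholds and value >= thresholds[name]
    )


if __name__ == "__main__":
    observed = {
        "node_cpu_util_high": 91.5,
        "node_mem_util_high": 60.0,
        "slb_drop_con_high": 0,
    }
    # With the defaults, only the CPU rule fires.
    print(firing_rules(observed))
    # Raising the CPU threshold to 95 silences it.
    print(firing_rules(observed, overrides={"node_cpu_util_high": 95}))
```

Keeping the defaults in one mapping and applying overrides on top mirrors the idea of adjusting a single rule's threshold without touching the others.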