Container Service for Kubernetes (ACK) allows you to configure alerts to centrally manage exceptions in clusters and provides various metrics for different scenarios. You can deploy Custom Resource Definitions (CRDs) in clusters to configure and manage alert rules. This topic describes how to set up alerting and configure alert rules for a registered cluster.
Prerequisites
An external Kubernetes cluster is registered in the Container Service for Kubernetes (ACK) console. For more information, see Create a registered cluster.
A kubectl client is connected to the registered cluster. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.
Scenarios
ACK allows you to centrally configure alert rules and manage alerts in various scenarios. The alert management feature is commonly used in the following scenarios:
Cluster O&M
You can configure alert rules to detect exceptions in cluster management, storage, networking, and elastic scaling at the earliest opportunity. For example, you can enable the following alert rule sets:
Use the alert rule set for resource exceptions to get notified when key metrics of basic cluster resources, such as CPU usage, memory usage, and network latency, exceed the specified thresholds. If you receive alert notifications, you can take measures to ensure cluster stability.
Use the alert rule set for cluster exceptions to get notified of node or container exceptions. Alerts are triggered upon events such as Docker process exceptions, node process exceptions, or pod restart failures.
Use the alert rule set for storage exceptions to get notified of storage changes and exceptions.
Use the alert rule set for network exceptions to get notified of network changes and exceptions.
Use the alert rule set for O&M exceptions to get notified of changes and exceptions that are related to cluster control.
Application development
You can configure alert rules to get notified of exceptions and abnormal metrics of running applications in the cluster at the earliest opportunity, such as exceptions of replicated pods or a Deployment whose CPU or memory usage exceeds the thresholds. For example, you can use the default alert rule template to configure and enable the alert rule set for pod exceptions to get notified of exceptions in the pods of your application.
Application management
To get notified of the issues that occur throughout the lifecycle of an application, we recommend that you take note of application health, capacity planning, cluster stability, exceptions, and errors. You can configure and enable the alert rule set for critical events to get notified of warnings and errors in the cluster. You can configure and enable the alert rule set for resource exceptions to get notified of abnormal resource usage in the cluster and optimize capacity planning.
Multi-cluster management
When you manage multiple clusters, you may find it a complex task to configure and synchronize alert rules across the clusters. ACK allows you to deploy CRDs in the cluster to manage alert rules. You can configure the same CRDs to synchronize alert rules across multiple clusters.
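As a sketch of this workflow, you can keep one shared AckAlertRule manifest in version control and apply it to every registered cluster. The kubeconfig context names (cluster-a, cluster-b) and the manifest filename below are placeholders; the loop only prints the kubectl commands so that you can review them before piping the output to sh.

```shell
# Placeholder kubeconfig contexts for your registered clusters.
contexts="cluster-a cluster-b"
# Shared AckAlertRule manifest (placeholder filename).
manifest="ack-alert-rule.yaml"

# Print one apply command per cluster; pipe the output to sh to run them.
for ctx in $contexts; do
  echo "kubectl --context $ctx -n kube-system apply -f $manifest"
done
```

Because every cluster receives the same manifest, the alert rule sets stay synchronized: updating the file and rerunning the loop propagates the change everywhere.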
Configure alicloud-monitor-controller in the registered cluster
Step 1: Grant RAM permissions to alicloud-monitor-controller
Use onectl
Install onectl on your on-premises machine. For more information, see Use onectl to manage registered clusters.
Run the following command to grant Resource Access Management (RAM) permissions to alicloud-monitor-controller:
onectl ram-user grant --addon alicloud-monitor-controller
Expected output:
Ram policy ack-one-registered-cluster-policy-alicloud-monitor-controller granted to ram user ack-one-user-ce313528c3 successfully.
Use the console
Before you install a component in a registered cluster, you must configure an AccessKey pair to grant the cluster the permissions to access Alibaba Cloud resources. To do this, first create a RAM user and grant the RAM user the required permissions.
Create a RAM user. For more information, see Create a RAM user.
Create a custom policy. For more information, see Create custom policies.
Example:
{
  "Action": [
    "log:*",
    "arms:*",
    "cms:*",
    "cs:UpdateContactGroup"
  ],
  "Resource": [
    "*"
  ],
  "Effect": "Allow"
}
Attach the custom policy to the RAM user. For more information, see Grant permissions to a RAM user.
Create an AccessKey pair for the RAM user. For more information, see Obtain an AccessKey pair.
Use the AccessKey pair to create a Secret named alibaba-addon-secret in the registered cluster.
The system automatically uses the AccessKey pair to access cloud resources when you install alicloud-monitor-controller.
kubectl -n kube-system create secret generic alibaba-addon-secret --from-literal='access-key-id=<your access key id>' --from-literal='access-key-secret=<your access key secret>'
Note: Replace <your access key id> and <your access key secret> with the AccessKey ID and AccessKey secret that you obtained in the previous step.
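Because Kubernetes stores Secret data base64-encoded, a stray trailing newline in the values you pass can silently corrupt the credentials. The snippet below is a local sanity check with a placeholder value (not a real AccessKey ID); it verifies that a value round-trips through base64 unchanged when written with printf instead of echo.

```shell
# Placeholder value; never hard-code a real AccessKey pair in scripts.
access_key_id='LTAI4ExampleId0123'

# printf avoids the trailing newline that echo would append.
encoded=$(printf '%s' "$access_key_id" | base64)
decoded=$(printf '%s' "$encoded" | base64 -d)

printf '%s\n' "$decoded"   # prints the original value unchanged
```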
Step 2: Install and update alicloud-monitor-controller
Use onectl
Run the following command to install alicloud-monitor-controller:
onectl addon install alicloud-monitor-controller
Expected output:
Addon alicloud-monitor-controller, version **** installed.
Use the console
The console automatically checks whether the alerting configuration meets the requirements and guides you to activate, install, or update alicloud-monitor-controller.
Log on to the ACK console.
In the left-side navigation pane of the ACK console, click Clusters.
On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
In the left-side navigation pane, choose Operations > Alerts.
On the Alerts page, the console automatically checks whether the following conditions are met.
If not all conditions are met, follow the on-screen instructions to install or update the required components.
Simple Log Service is activated. If Simple Log Service is not activated, log on to the Simple Log Service console and follow the on-screen instructions to activate the service.
Note: For more information about the billing rules of Simple Log Service, see Billable items of pay-by-feature.
Event Center is installed. For more information, see Event monitoring.
The alicloud-monitor-controller component is updated to the latest version. For more information, see alicloud-monitor-controller.
Set up alerting
Step 1: Enable default alert rules
Log on to the ACK console.
In the left-side navigation pane of the ACK console, click Clusters.
On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
In the left-side navigation pane, choose Operations > Alerts.
On the Alert Rules tab, enable the alert rule set.
Step 2: Configure alert rules
Log on to the ACK console.
In the left-side navigation pane of the ACK console, click Clusters.
On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
In the left-side navigation pane, choose Operations > Alerts.
Feature | Description
Alert Rules | By default, ACK provides an alert rule template that you can use to generate alerts based on exceptions and metrics. Alert rules are classified into several alert rule sets. You can enable or disable an alert rule set and configure multiple alert contact groups for it. An alert rule set contains multiple alert rules, and each alert rule corresponds to an alert item. You can create a YAML file to configure multiple alert rule sets in a cluster, and modify the YAML file to update alert rules. For more information about how to configure alert rules by using a YAML file, see Configure alert rules by using CRDs. For more information about default alert templates, see Default alert rule templates.
Alert History | You can view up to 100 historical alerts. You can select an alert and click the link in the Alert Rule column to view rule details in the monitoring system, or click Details to go to the resource page where the alert was triggered. An alert may be triggered by an exception or an abnormal metric.
Alert Contacts | You can create, edit, or delete alert contacts.
Alert Contact Groups | You can create, edit, or delete alert contact groups. If no alert contact group exists, the ACK console automatically creates a default alert contact group based on the information that you provided during registration.
On the Alert Rules tab, click Modify Contacts to specify the contact groups to which the alerts are sent. You can turn on or turn off Status to enable or disable the alert rule set.
Configure alert rules by using CRDs
When the alerting feature is enabled, the system automatically creates an AckAlertRule object named default in the kube-system namespace. This object contains the default alert rule templates. You can modify it to configure alert rule sets.
Log on to the ACK console.
In the left-side navigation pane of the ACK console, click Clusters.
On the Clusters page, find the cluster that you want to manage and click the name of the cluster or click Details in the Actions column. The details page of the cluster appears.
In the left-side navigation pane, choose Operations > Alerts.
In the upper-right corner of the Alert Rules tab, click Configure Alert Rule. You can view the configuration of the AckAlertRule object and modify the YAML file to update the configuration.
Example:
apiVersion: alert.alibabacloud.com/v1beta1
kind: AckAlertRule
metadata:
  name: default
spec:
  groups:
  # The following rule set is an example that is based on cluster events.
  - name: pod-exceptions # The name of the alert rule set. This parameter corresponds to the Group_Name field in the alert rule template.
    rules:
    - name: pod-oom # The name of the alert rule.
      type: event # The type of the alert rule, which corresponds to the Rule_Type parameter. Valid values: event and metric-cms.
      expression: sls.app.ack.pod.oom # The alert rule expression, which is set to the value of Rule_Expression_Id in the default alert rule template.
      enable: enable # The status of the alert rule. Valid values: enable and disable.
    - name: pod-failed
      type: event
      expression: sls.app.ack.pod.failed
      enable: enable
  # The following rule set is an example for basic cluster resources.
  - name: res-exceptions # The name of the alert rule set. This parameter corresponds to the Group_Name field in the alert rule template.
    rules:
    - name: node_cpu_util_high # The name of the alert rule.
      type: metric-cms # The type of the alert rule, which corresponds to the Rule_Type parameter. Valid values: event and metric-cms.
      expression: cms.host.cpu.utilization # The alert rule expression, which is set to the value of Rule_Expression_Id in the default alert rule template.
      contactGroups: # The contact groups that are associated with the alert rule. The contacts created by an Alibaba Cloud account are shared by all clusters within the account.
      enable: enable # The status of the alert rule. Valid values: enable and disable.
      thresholds: # The alert threshold. For more information, see the "Modify the alert threshold for basic cluster resources" section of this topic.
      - key: CMS_ESCALATIONS_CRITICAL_Threshold
        unit: percent
        value: '1'
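If you prefer the command line to the console, the same object can be inspected and edited with kubectl. This is a sketch: ackalertrule is the assumed kubectl resource name for the AckAlertRule CRD, so confirm it in your cluster first, for example with kubectl api-resources | grep -i alertrule.

```shell
# Assumed resource name for the AckAlertRule CRD; verify it first with:
#   kubectl api-resources | grep -i alertrule
kubectl -n kube-system get ackalertrule default -o yaml   # view the current rules
kubectl -n kube-system edit ackalertrule default          # edit them interactively
```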
Default alert rule templates
ACK creates default alert rules in a registered cluster when either of the following conditions is met:
Default alert rules are enabled.
You go to the Alert Rules tab for the first time and default alert rules are not enabled.
The following table describes default alert rules.
Alert rule set | Alert rule | Description | Rule_Type | ACK_CR_Rule_Name | SLS_Event_ID |
Alert rule set for critical events in the cluster | Errors | An alert is triggered when an error occurs in the cluster. | event | error-event | sls.app.ack.error |
Warnings | An alert is triggered when a warning occurs in the cluster, except for warnings that can be ignored. | event | warn-event | sls.app.ack.warn | |
Alert rule set for cluster exceptions | Docker process exceptions on nodes | An alert is triggered when a dockerd exception or a containerd exception occurs on a node. | event | docker-hang | sls.app.ack.docker.hang |
Evictions in the cluster | An alert is triggered when a pod is evicted. | event | eviction-event | sls.app.ack.eviction | |
GPU XID errors | An alert is triggered when a GPU XID error occurs. | event | gpu-xid-error | sls.app.ack.gpu.xid_error | |
Node changes to the unschedulable state | An alert is triggered when the status of a node changes to unschedulable. | event | node-down | sls.app.ack.node.down | |
Node restarts | An alert is triggered when a node restarts. | event | node-restart | sls.app.ack.node.restart | |
NTP service failures on nodes | An alert is triggered when the Network Time Protocol (NTP) service fails. | event | node-ntp-down | sls.app.ack.ntp.down | |
PLEG errors on nodes | An alert is triggered when a Pod Lifecycle Event Generator (PLEG) error occurs on a node. | event | node-pleg-error | sls.app.ack.node.pleg_error | |
Process errors on nodes | An alert is triggered when a process error occurs on a node. | event | ps-hang | sls.app.ack.ps.hang | |
Alert rule set for resource exceptions | Node - CPU usage ≥ 85% | An alert is triggered when the CPU usage of a node exceeds the threshold. The default threshold is 85%. If the percentage of available CPU resources is less than 15%, the CPU resources reserved for components may become insufficient. For more information, see Resource reservation policy. Consequently, CPU throttling may be frequently triggered and processes may respond slowly. We recommend that you optimize the CPU usage or adjust the threshold at the earliest opportunity. For more information about how to adjust the threshold, see Modify the alert threshold for basic cluster resources. | metric-cms | node_cpu_util_high | cms.host.cpu.utilization |
Node - Memory usage ≥ 85% | An alert is triggered when the memory usage of a node exceeds the threshold. The default threshold is 85%. If the percentage of available memory resources is less than 15%, the memory resources reserved for components may become insufficient. For more information, see Resource reservation policy. In this scenario, kubelet forcibly evicts pods from the node. We recommend that you optimize the memory usage or adjust the threshold at the earliest opportunity. For more information about how to adjust the threshold, see Modify the alert threshold for basic cluster resources. | metric-cms | node_mem_util_high | cms.host.memory.utilization | |
Node - Disk usage ≥ 85% | An alert is triggered when the disk usage of a node exceeds the threshold. The default threshold is 85%. For more information about how to adjust the threshold, see Modify the alert threshold for basic cluster resources. | metric-cms | node_disk_util_high | cms.host.disk.utilization | |
Node - Usage of outbound public bandwidth ≥ 85% | An alert is triggered when the usage of the outbound public bandwidth of a node exceeds the threshold. The default threshold is 85%. For more information about how to adjust the threshold, see Modify the alert threshold for basic cluster resources. | metric-cms | node_public_net_util_high | cms.host.public.network.utilization | |
Node - Inode usage ≥ 85% | An alert is triggered when the inode usage of a node exceeds the threshold. The default threshold is 85%. For more information about how to adjust the threshold, see Modify the alert threshold for basic cluster resources. | metric-cms | node_fs_inode_util_high | cms.host.fs.inode.utilization | |
Resources - SLB QPS usage ≥ 85% | An alert is triggered when the queries-per-second (QPS) usage of a Server Load Balancer (SLB) instance exceeds the threshold. The default threshold is 85%. Note This rule applies to all SLB instances that are created for the Kubernetes API server and Ingresses. For more information about how to adjust the threshold, see Modify the alert threshold for basic cluster resources. | metric-cms | slb_qps_util_high | cms.slb.qps.utilization | |
Resources - Usage of SLB outbound bandwidth ≥ 85% | An alert is triggered when the usage of the outbound bandwidth of an SLB instance exceeds the threshold. The default threshold is 85%. Note This rule applies to all SLB instances that are created for the Kubernetes API server and Ingresses. For more information about how to adjust the threshold, see Modify the alert threshold for basic cluster resources. | metric-cms | slb_traff_tx_util_high | cms.slb.traffic.tx.utilization | |
Resources - Usage of the maximum connections of an SLB instance ≥ 85% | An alert is triggered when the usage of the maximum number of connections of an SLB instance exceeds the threshold. The default threshold is 85%. Note This rule applies to all SLB instances that are created for the Kubernetes API server and Ingresses. For more information about how to adjust the threshold, see Modify the alert threshold for basic cluster resources. | metric-cms | slb_max_con_util_high | cms.slb.max.connection.utilization | |
Resources - Dropped connections per second on SLB listeners ≥ 1 | An alert is triggered when the listeners of an SLB instance keep dropping one or more connections per second. The default threshold is 1. Note This rule applies to all SLB instances that are created for the Kubernetes API server and Ingresses. For more information about how to adjust the threshold, see Modify the alert threshold for basic cluster resources. | metric-cms | slb_drop_con_high | cms.slb.drop.connection | |
Excessive file handles on nodes | An alert is triggered when excessive file handles exist on a node. | event | node-fd-pressure | sls.app.ack.node.fd_pressure | |
Insufficient node disk space | An alert is triggered when the disk space of a node is insufficient. | event | node-disk-pressure | sls.app.ack.node.disk_pressure | |
Excessive processes on nodes | An alert is triggered when excessive processes run on a node. | event | node-pid-pressure | sls.app.ack.node.pid_pressure | |
Insufficient node resources for scheduling | An alert is triggered when a node has insufficient resources for scheduling. | event | node-res-insufficient | sls.app.ack.resource.insufficient | |
Insufficient node IP addresses | An alert is triggered when node IP addresses are insufficient. | event | node-ip-pressure | sls.app.ack.ip.not_enough | |
Alert rule set for pod exceptions | Pod OOM errors | An alert is triggered when an out of memory (OOM) error occurs in a pod. | event | pod-oom | sls.app.ack.pod.oom |
Pod restart failures | An alert is triggered when a pod fails to restart. | event | pod-failed | sls.app.ack.pod.failed | |
Image pull failures | An alert is triggered when an image fails to be pulled. | event | image-pull-back-off | sls.app.ack.image.pull_back_off | |
Alert rule set for O&M exceptions | No available SLB instance | An alert is triggered when an SLB instance fails to be created. In this case, submit a ticket to contact the ACK technical team. | event | slb-no-ava | sls.app.ack.ccm.no_ava_slb |
SLB instance update failures | An alert is triggered when an SLB instance fails to be updated. In this case, submit a ticket to contact the ACK technical team. | event | slb-sync-err | sls.app.ack.ccm.sync_slb_failed | |
SLB instance deletion failures | An alert is triggered when an SLB instance fails to be deleted. In this case, submit a ticket to contact the ACK technical team. | event | slb-del-err | sls.app.ack.ccm.del_slb_failed | |
Node deletion failures | An alert is triggered when a node fails to be deleted. In this case, submit a ticket to contact the ACK technical team. | event | node-del-err | sls.app.ack.ccm.del_node_failed | |
Node adding failures | An alert is triggered when a node fails to be added to the cluster. In this case, submit a ticket to contact the ACK technical team. | event | node-add-err | sls.app.ack.ccm.add_node_failed | |
Route creation failures | An alert is triggered when a cluster fails to create a route in the virtual private cloud (VPC). In this case, submit a ticket to contact the ACK technical team. | event | route-create-err | sls.app.ack.ccm.create_route_failed | |
Route update failures | An alert is triggered when a cluster fails to update the routes of the VPC. In this case, submit a ticket to contact the ACK technical team. | event | route-sync-err | sls.app.ack.ccm.sync_route_failed | |
Command execution failures in managed node pools | An alert is triggered when a node pool error occurs. In this case, submit a ticket to contact the ACK technical team. | event | nlc-run-cmd-err | sls.app.ack.nlc.run_command_fail | |
Node removal failures in managed node pools | An alert is triggered when a node pool error occurs. In this case, submit a ticket to contact the ACK technical team. | event | nlc-empty-cmd | sls.app.ack.nlc.empty_task_cmd | |
Unimplemented URL mode in managed node pools | An alert is triggered when a node pool error occurs. In this case, submit a ticket to contact the ACK technical team. | event | nlc-url-m-unimp | sls.app.ack.nlc.url_mode_unimpl | |
Unknown repairing operations in managed node pools | An alert is triggered when a node pool error occurs. In this case, submit a ticket to contact the ACK technical team. | event | nlc-opt-no-found | sls.app.ack.nlc.op_not_found | |
Node draining and removal failures in managed node pools | An alert is triggered when a node pool error occurs. In this case, submit a ticket to contact the ACK technical team. | event | nlc-des-node-err | sls.app.ack.nlc.destroy_node_fail | |
Node draining failures in managed node pools | An alert is triggered when a node in a managed node pool fails to be drained. In this case, submit a ticket to contact the ACK technical team. | event | nlc-drain-node-err | sls.app.ack.nlc.drain_node_fail | |
ECS restart timeouts in managed node pools | An alert is triggered when a node pool error occurs. In this case, submit a ticket to contact the ACK technical team. | event | nlc-restart-ecs-wait | sls.app.ack.nlc.restart_ecs_wait_fail | |
ECS restart failures in managed node pools | An alert is triggered when a node pool error occurs. In this case, submit a ticket to contact the ACK technical team. | event | nlc-restart-ecs-err | sls.app.ack.nlc.restart_ecs_fail | |
ECS reset failures in managed node pools | An alert is triggered when a node pool error occurs. In this case, submit a ticket to contact the ACK technical team. | event | nlc-reset-ecs-err | sls.app.ack.nlc.reset_ecs_fail | |
Auto-repair task failures in managed node pools | An alert is triggered when a node pool error occurs. In this case, submit a ticket to contact the ACK technical team. | event | nlc-sel-repair-err | sls.app.ack.nlc.repair_fail | |
Alert rule set for network exceptions | Invalid Terway resources | An alert is triggered when a Terway resource is invalid. In this case, submit a ticket to contact the ACK technical team. | event | terway-invalid-res | sls.app.ack.terway.invalid_resource |
IP allocation failures of Terway | An alert is triggered when an IP address fails to be allocated in Terway mode. In this case, submit a ticket to contact the ACK technical team. | event | terway-alloc-ip-err | sls.app.ack.terway.alloc_ip_fail | |
Ingress bandwidth configuration parsing failures | An alert is triggered when the bandwidth configuration of an Ingress fails to be parsed. In this case, submit a ticket to contact the ACK technical team. | event | terway-parse-err | sls.app.ack.terway.parse_fail | |
Network resource allocation failures of Terway | An alert is triggered when a network resource fails to be allocated in Terway mode. In this case, submit a ticket to contact the ACK technical team. | event | terway-alloc-res-err | sls.app.ack.terway.allocate_failure | |
Network resource reclaiming failures of Terway | An alert is triggered when a network resource fails to be reclaimed in Terway mode. In this case, submit a ticket to contact the ACK technical team. | event | terway-dispose-err | sls.app.ack.terway.dispose_failure | |
Terway virtual mode changes | An alert is triggered when the Terway virtual mode is changed. | event | terway-virt-mod-err | sls.app.ack.terway.virtual_mode_change | |
Pod IP checks executed by Terway | An alert is triggered when a pod IP is checked in Terway mode. | event | terway-ip-check | sls.app.ack.terway.config_check | |
Ingress configuration reload failures | An alert is triggered when the configuration of an Ingress fails to be reloaded. In this case, check whether the Ingress configuration is valid. | event | ingress-reload-err | sls.app.ack.ingress.err_reload_nginx | |
Alert rule set for storage exceptions | Cloud disk size less than 20 GiB | ACK does not allow you to mount a disk of less than 20 GiB. You can check the sizes of the disks that are attached to your cluster. | event | csi_invalid_size | sls.app.ack.csi.invalid_disk_size |
Subscription cloud disks cannot be mounted | ACK does not allow you to mount a subscription disk. You can check the billing methods of the disks that are attached to your cluster. | event | csi_not_portable | sls.app.ack.csi.disk_not_portable | |
Mount target unmounting failures because the mount target is being used | An alert is triggered when an unmount failure occurs because the mount target is in use. | event | csi_device_busy | sls.app.ack.csi.deivce_busy | |
No available cloud disk | An alert is triggered when no disk is available. In this case, submit a ticket to contact the ACK technical team. | event | csi_no_ava_disk | sls.app.ack.csi.no_ava_disk | |
I/O hangs of cloud disks | An alert is triggered when I/O hangs occur on a disk. In this case, submit a ticket to contact the ACK technical team. | event | csi_disk_iohang | sls.app.ack.csi.disk_iohang | |
Slow I/O rate of PVC used to mount cloud disks | An alert is triggered when the I/O of a disk that is mounted by using a persistent volume claim (PVC) is slow. In this case, submit a ticket to contact the ACK technical team. | event | csi_latency_high | sls.app.ack.csi.latency_too_high | |
Disk usage exceeds the threshold | An alert is triggered when the usage of a disk exceeds the specified threshold. You can check the usage of a disk that is mounted to your cluster. | event | disk_space_press | sls.app.ack.csi.no_enough_disk_space | |
Alert rule set for cluster security events | High-risk configurations detected in inspections | An alert is triggered when a high-risk configuration is detected during a cluster inspection. In this case, submit a ticket to contact the ACK technical team. | event | si-c-a-risk | sls.app.ack.si.config_audit_high_risk |