Alert management

Container Service for Kubernetes (ACK) provides the alert management feature to allow you to centrally configure alerting for containers. You can configure alert rules to get notified when a service exception occurs or one of the following metrics exceeds the threshold: key metrics of basic cluster resources, metrics of core cluster components, and application metrics. You can modify the default alert rules of a cluster by deploying CustomResourceDefinitions (CRDs) in the cluster. This allows you to detect abnormal changes in the cluster.

Feature introduction

The alert management feature supports the following types of alerts:

- Alerts that are triggered by events of cluster exceptions. The event data is synchronized from the event center of ACK. You must enable the Kubernetes event center feature of Simple Log Service and Managed Service for Prometheus. For more information, see Event monitoring and Managed Service for Prometheus.
- Alerts that are triggered when the key metrics of basic cluster resources exceed thresholds. The metrics are synchronized from CloudMonitor. For more information, see Monitor basic resources.

Scenarios

Cluster O&M

You can configure alert rule sets to detect exceptions in cluster management, storage, networks, and elastic scaling at the earliest opportunity:

Alert rule set for resource exceptions: notifies you when key metrics of basic cluster resources, such as CPU usage, memory usage, and network latency, exceed the specified thresholds. If you receive alert notifications, you can take measures to ensure cluster stability.
Alert rule set for cluster exceptions: notifies you of node or container exceptions. Alerts are triggered upon events such as Docker process exceptions, node process exceptions, or pod startup failures.
Alert rule set for storage exceptions: notifies you of storage changes and exceptions.
Alert rule set for network exceptions: notifies you of network changes and exceptions.
Alert rule set for O&M exceptions: notifies you of changes and exceptions that are related to cluster control.

Application development

You can configure alert rules to get notified of exceptions and abnormal metrics of applications that run in the cluster, such as pod replica exceptions or the CPU and memory usage of a Deployment exceeding the thresholds. To set up alerts quickly, use the default alert rule template. For example, you can configure and enable the alert rule set for pod exceptions to get notified of exceptions in the pods of your application.

Application management

To get notified of the issues that occur throughout the lifecycle of an application, we recommend that you take note of application health, capacity planning, cluster stability, exceptions, and errors. You can configure and enable the alert rule set for critical events to get notified of warnings and errors in the cluster. You can configure and enable the alert rule set for resource exceptions to get notified of abnormal resource usage in the cluster and optimize capacity planning.

Multi-cluster management

When you manage multiple clusters, you may find it a complex task to configure and synchronize alert rules across the clusters. ACK allows you to deploy CRDs in the cluster to manage alert rules. You can configure the same CRDs to synchronize alert rules across multiple clusters.
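
As a rough sketch of this workflow, the snippet below serializes one AckAlertRule manifest and builds a `kubectl apply` command for each cluster. The kubeconfig context names (`cluster-a`, `cluster-b`), the file name `ack-alert-rule.json`, and the helper `build_apply_commands` are hypothetical. The manifest fields mirror the example in Step 2, and the `kube-system` namespace is the one in which this topic says the object is created. The commands are only printed, not executed.

```python
import json

# One shared manifest; the layout follows the AckAlertRule example in this topic.
MANIFEST = {
    "apiVersion": "alert.alibabacloud.com/v1beta1",
    "kind": "AckAlertRule",
    "metadata": {"name": "default"},
    "spec": {
        "groups": [
            {
                "name": "pod-exceptions",
                "rules": [
                    {
                        "name": "pod-oom",
                        "type": "event",
                        "expression": "sls.app.ack.pod.oom",
                        "enable": "enable",
                    }
                ],
            }
        ]
    },
}


def build_apply_commands(contexts, manifest_path="ack-alert-rule.json"):
    """Return one `kubectl apply` argument list per kubeconfig context."""
    return [
        ["kubectl", "--context", ctx, "-n", "kube-system", "apply", "-f", manifest_path]
        for ctx in contexts
    ]


# kubectl accepts JSON manifests, so the standard library is enough here.
# Write manifest_json to the manifest path before running the commands.
manifest_json = json.dumps(MANIFEST, indent=2)
commands = build_apply_commands(["cluster-a", "cluster-b"])
for cmd in commands:
    print(" ".join(cmd))
```

Keeping a single manifest as the source of truth means that a rule change is made once and then applied to every cluster in the list.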

Step 1: Enable alert management

You can enable alert management only for ACK managed clusters and ACK dedicated clusters.

ACK managed cluster

ACK dedicated cluster

Enable alert management when you create a cluster

On the Component Configurations wizard page, select Use Default Alert Rule Template on the right of Alerts and select a contact group. For more information, see Create an ACK managed cluster.

After the cluster is created, the system automatically enables default alert rules for the cluster and sends notifications to the default contact group when the default alert rules are triggered. You can modify the information of an alert contact or alert contact group. For more information, see Modify an alert contact or alert contact group.

Enable alert management for an existing cluster

To enable alert management for an existing cluster, perform the following steps:

Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Operations > Alerts.
On the Alerts page, the ACK console automatically checks whether the cluster meets the following conditions. If a condition is not met, follow the on-screen instructions to activate, install, or update the required components.
- Simple Log Service is activated. If Simple Log Service is not activated, log on to the Simple Log Service console and follow the on-screen instructions to activate the service.
  Note
  For more information about the billing rules of Simple Log Service, see Billable items of pay-by-feature.
- Event Center is installed. For more information, see Event monitoring.
- The alicloud-monitor-controller component is updated to the latest version. For more information, see alicloud-monitor-controller.

After you install and update the required components, you can configure alert rules on the Alerts page.

On the Alert Rules tab, select an alert rule set and turn on Status to enable the alert rule set. You can click Modify Contacts to specify the contact groups to which the alerts are sent.
- By default, ACK provides an alert rule template that you can use to generate alerts based on exceptions and metrics.
- Alert rules are classified into several alert rule sets. You can enable an alert rule set, disable an alert rule set, and configure multiple alert contact groups for an alert rule set.
- An alert rule set contains multiple alert rules. Each alert rule corresponds to an alert item. You can create a YAML file to configure multiple alert rule sets in a cluster. You can also modify the YAML file to update alert rules.
- For more information about how to configure alert rules by using a YAML file, see Step 2: Configure alert rules by using CRDs. For more information about the default alert rule template, see Default alert rule template.

The following table describes the tabs on the Alerts page.

Tab	Description
Alert History	You can view up to 100 historical alerts. You can select an alert and click the link in the Alert Rule column to view rule details in the monitoring system. You can click Details to go to the resource page on which the alert is triggered. The alert may be triggered by an exception or an abnormal metric.
Alert Contacts	You can create, edit, or delete alert contacts. The alert rule set for resource exceptions includes alert rules for basic node resources. Before an alert contact can receive alerts on basic cluster resources, the mobile phone number and email address of the contact must be verified in the CloudMonitor console. You can view and update information about an alert contact in the CloudMonitor console. If the verification has expired, delete the contact in the CloudMonitor console, and then refresh the Alert Contacts page in the ACK console.
Alert Contact Groups	You can create, edit, or delete alert contact groups. If no alert contact group exists, the ACK console automatically creates a default alert contact group based on the information that you provided during registration.

ACK dedicated cluster

Before you enable alert management and use the default alert rules in an ACK dedicated cluster, you must grant the required permissions to the worker Resource Access Management (RAM) role of the cluster.

Note

The system automatically grants ACK managed clusters the permissions to access resources that are related to the alerting feature of Simple Log Service.

1. Grant permissions to the worker RAM role

Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, click Cluster Information.
On the Cluster Information page, copy the role name on the right of Worker RAM Role in the Cluster Resources section and click the name to go to the role details page in the RAM console. You can grant permissions to the role in the RAM console.
1. Create a custom RAM policy based on the following code block. For more information, see Create a custom policy on the JSON tab.
```
{
    "Version": "1",
    "Statement": [
        {
            "Action": [
                "log:*",
                "arms:*",
                "cms:*",
                "cs:UpdateContactGroup"
            ],
            "Resource": [
                "*"
            ],
            "Effect": "Allow"
        }
    ]
}
```
2. On the Roles page, find the worker RAM role of the cluster and attach the preceding custom policy to the role. For more information, see Method 1: Grant permissions to a RAM role by clicking Grant Permission on the Roles page.
Check the component logs to verify that the permissions are granted.
1. In the left-side navigation pane of the details page, choose Workloads > Deployments.
2. Set Namespace to kube-system, find alicloud-monitor-controller in the Deployments list, and then click the link in the Name column.
3. Click the Logs tab and check whether the logs include information that indicates successful authorization.
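
Before you attach the custom policy, a small check like the following can confirm that the policy document lists every action from the code block above. This is a minimal sketch that assumes the policy is wrapped in a `Statement` list. The function name `missing_actions` is illustrative and is not part of any ACK or RAM SDK. The check compares action strings literally and does not evaluate wildcards.

```python
import json

# Actions listed in the custom policy shown in this section.
REQUIRED_ACTIONS = {"log:*", "arms:*", "cms:*", "cs:UpdateContactGroup"}


def missing_actions(policy_text):
    """Return the required actions that no Allow statement in the policy lists."""
    policy = json.loads(policy_text)
    granted = set()
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        # Accept either a single action string or a list of action strings.
        if isinstance(actions, str):
            actions = [actions]
        granted.update(actions)
    return sorted(REQUIRED_ACTIONS - granted)


# Example: a policy that lists only two of the four actions.
incomplete_policy = json.dumps(
    {
        "Version": "1",
        "Statement": [
            {"Effect": "Allow", "Action": ["log:*", "cms:*"], "Resource": ["*"]}
        ],
    }
)
print(missing_actions(incomplete_policy))
```

An empty list means that all four actions are present. The component logs remain the authoritative check that the permissions took effect.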

2. Enable alert management and configure the default alert rules

In the left-side navigation pane, choose Operations > Alerts.

On the Alerts page, perform the following operations to configure the default alert rules:

On the Alert Rules tab, select an alert rule set and turn on Status to enable the alert rule set. You can click Modify Contacts to specify the contact groups to which the alerts are sent.
- By default, ACK provides an alert rule template that you can use to generate alerts based on exceptions and metrics.
- Alert rules are classified into several alert rule sets. You can enable an alert rule set, disable an alert rule set, and configure multiple alert contact groups for an alert rule set.
- An alert rule set contains multiple alert rules. Each alert rule corresponds to an alert item. You can create a YAML file to configure multiple alert rule sets in a cluster. You can also modify the YAML file to update alert rules.
- For more information about how to configure alert rules by using a YAML file, see Step 2: Configure alert rules by using CRDs. For more information about the default alert rule template, see Default alert rule template.

The following table describes the tabs on the Alerts page.

Tab	Description
Alert History	You can view up to 100 historical alerts. You can select an alert and click the link in the Alert Rule column to view rule details in the monitoring system. You can click Details to go to the resource page on which the alert is triggered. The alert may be triggered by an exception or an abnormal metric.
Alert Contacts	You can create, edit, or delete alert contacts. The alert rule set for resource exceptions includes alert rules for basic node resources. Before an alert contact can receive alerts on basic cluster resources, the mobile phone number and email address of the contact must be verified in the CloudMonitor console. You can view and update information about an alert contact in the CloudMonitor console. If the verification has expired, delete the contact in the CloudMonitor console, and then refresh the Alert Contacts page in the ACK console.
Alert Contact Groups	You can create, edit, or delete alert contact groups. If no alert contact group exists, the ACK console automatically creates a default alert contact group based on the information that you provided during registration.

Step 2: Configure alert rules by using CRDs

When the alerting feature is enabled, the system automatically creates an AckAlertRule object in the kube-system namespace. The AckAlertRule object contains the default alert rule template. You can modify the AckAlertRule object to change the default alert rules based on your business requirements.

Default alert rule template

The following table describes the alert rules in the default alert rule template.

Alert rule set	Alert rule	Description	Rule_Type	ACK_CR_Rule_Name	SLS_Event_ID
Alert rule set for critical events	Errors	An alert is triggered when an error occurs in the cluster.	event	error-event	sls.app.ack.error
Alert rule set for critical events	Warnings	An alert is triggered when a warning occurs in the cluster, except for warnings that can be ignored.	event	warn-event	sls.app.ack.warn
Alert rule set for cluster exceptions	Docker process exceptions on nodes	An alert is triggered when a dockerd exception or a containerd exception occurs on a node.	event	docker-hang	sls.app.ack.docker.hang
	Evictions in the cluster	An alert is triggered when a pod is evicted.	event	eviction-event	sls.app.ack.eviction
	GPU Xid errors	An alert is triggered when a GPU Xid error occurs.	event	gpu-xid-error	sls.app.ack.gpu.xid_error
	Node changes to the unschedulable state	An alert is triggered when the status of a node changes to unschedulable.	event	node-down	sls.app.ack.node.down
	Node restarts	An alert is triggered when a node restarts.	event	node-restart	sls.app.ack.node.restart
	NTP service failures on nodes	An alert is triggered when the Network Time Protocol (NTP) service fails.	event	node-ntp-down	sls.app.ack.ntp.down
	PLEG errors on nodes	An alert is triggered when a Pod Lifecycle Event Generator (PLEG) error occurs on a node.	event	node-pleg-error	sls.app.ack.node.pleg_error
	Process errors on nodes	An alert is triggered when a process error occurs on a node.	event	ps-hang	sls.app.ack.ps.hang
Alert rule set for resource exceptions	Node - CPU usage ≥ 85%	An alert is triggered when the CPU usage of a node exceeds the threshold. The default threshold is 85%. If the percentage of available CPU resources is less than 15%, the CPU resources reserved for components may become insufficient. For more information, see Resource reservation policy. Consequently, CPU throttling may be frequently triggered and processes may respond slowly. We recommend that you optimize the CPU usage or adjust the threshold at the earliest opportunity. For more information about how to adjust the threshold, see Example - Modify the alert threshold for basic cluster resources by using a CRD.	metric-cms	node_cpu_util_high	cms.host.cpu.utilization
	Node - Memory usage ≥ 85%	An alert is triggered when the memory usage of a node exceeds the threshold. The default threshold is 85%. If the percentage of available memory resources is less than 15%, the memory resources reserved for components may become insufficient. For more information, see Resource reservation policy. In this scenario, kubelet forcibly evicts pods from the node. We recommend that you optimize the memory usage or adjust the threshold at the earliest opportunity. For more information about how to adjust the threshold, see Example - Modify the alert threshold for basic cluster resources by using a CRD.	metric-cms	node_mem_util_high	cms.host.memory.utilization
	Node - Disk usage ≥ 85%	An alert is triggered when the disk usage of a node exceeds the threshold. The default threshold is 85%. For more information about how to adjust the threshold, see Example - Modify the alert threshold for basic cluster resources by using a CRD.	metric-cms	node_disk_util_high	cms.host.disk.utilization
	Node - Usage of outbound public bandwidth ≥ 85%	An alert is triggered when the usage of the outbound public bandwidth of a node exceeds the threshold. The default threshold is 85%. For more information about how to adjust the threshold, see Example - Modify the alert threshold for basic cluster resources by using a CRD.	metric-cms	node_public_net_util_high	cms.host.public.network.utilization
	Node - Inode usage ≥ 85%	An alert is triggered when the inode usage of a node exceeds the threshold. The default threshold is 85%. For more information about how to adjust the threshold, see Example - Modify the alert threshold for basic cluster resources by using a CRD.	metric-cms	node_fs_inode_util_high	cms.host.fs.inode.utilization
	Resources - QPS usage of an SLB instance ≥ 85%	An alert is triggered when the queries per second (QPS) usage of a Server Load Balancer (SLB) instance exceeds the threshold. The default threshold is 85%. Note: This rule applies to all SLB instances that are created for the Kubernetes API server and Ingresses. For more information about how to adjust the threshold, see Example - Modify the alert threshold for basic cluster resources by using a CRD.	metric-cms	slb_qps_util_high	cms.slb.qps.utilization
	Resources - Usage of SLB outbound bandwidth ≥ 85%	An alert is triggered when the usage of the outbound bandwidth of an SLB instance exceeds the threshold. The default threshold is 85%. Note This rule applies to all SLB instances that are created for the Kubernetes API server and Ingresses. For more information about how to adjust the threshold, see Example - Modify the alert threshold for basic cluster resources by using a CRD.	metric-cms	slb_traff_tx_util_high	cms.slb.traffic.tx.utilization
	Resources - Usage of the maximum connections of an SLB instance ≥ 85%	An alert is triggered when the usage of the maximum number of connections of an SLB instance exceeds the threshold. The default threshold is 85%. Note This rule applies to all SLB instances that are created for the Kubernetes API server and Ingresses. For more information about how to adjust the threshold, see Example - Modify the alert threshold for basic cluster resources by using a CRD.	metric-cms	slb_max_con_util_high	cms.slb.max.connection.utilization
	Resources - Connection drops per second of the listeners of an SLB instance remains ≥ 1	An alert is triggered when the number of connections dropped per second by the listeners of an SLB instance remains at 1 or more. The default threshold is 1. Note This rule applies to all SLB instances that are created for the Kubernetes API server and Ingresses. For more information about how to adjust the threshold, see Example - Modify the alert threshold for basic cluster resources by using a CRD.	metric-cms	slb_drop_con_high	cms.slb.drop.connection
	Excessive file handles on nodes	An alert is triggered when excessive file handles exist on a node.	event	node-fd-pressure	sls.app.ack.node.fd_pressure
	Insufficient node disk space	An alert is triggered when the disk space of a node is insufficient.	event	node-disk-pressure	sls.app.ack.node.disk_pressure
	Excessive processes on nodes	An alert is triggered when excessive processes run on a node.	event	node-pid-pressure	sls.app.ack.node.pid_pressure
	Insufficient node resources for scheduling	An alert is triggered when a node has insufficient resources for scheduling.	event	node-res-insufficient	sls.app.ack.resource.insufficient
	Insufficient node IP addresses	An alert is triggered when node IP addresses are insufficient.	event	node-ip-pressure	sls.app.ack.ip.not_enough
Alert rule set for pod exceptions	Pod OOM errors	An alert is triggered when an out of memory (OOM) error occurs in a pod.	event	pod-oom	sls.app.ack.pod.oom
	Pod restart failures	An alert is triggered when a pod fails to restart.	event	pod-failed	sls.app.ack.pod.failed
	Image pull failures	An alert is triggered when an image fails to be pulled.	event	image-pull-back-off	sls.app.ack.image.pull_back_off
Alert rule set for O&M exceptions	No available SLB instance	An alert is triggered when an SLB instance fails to be created. In this case, submit a ticket to contact the ACK technical team.	event	slb-no-ava	sls.app.ack.ccm.no_ava_slb
	SLB instance update failures	An alert is triggered when an SLB instance fails to be updated. In this case, submit a ticket to contact the ACK technical team.	event	slb-sync-err	sls.app.ack.ccm.sync_slb_failed
	SLB instance deletion failures	An alert is triggered when an SLB instance fails to be deleted. In this case, submit a ticket to contact the ACK technical team.	event	slb-del-err	sls.app.ack.ccm.del_slb_failed
	Node deletion failures	An alert is triggered when a node fails to be deleted. In this case, submit a ticket to contact the ACK technical team.	event	node-del-err	sls.app.ack.ccm.del_node_failed
	Node adding failures	An alert is triggered when a node fails to be added to the cluster. In this case, submit a ticket to contact the ACK technical team.	event	node-add-err	sls.app.ack.ccm.add_node_failed
	Route creation failures	An alert is triggered when a cluster fails to create a route in the virtual private cloud (VPC). In this case, submit a ticket to contact the ACK technical team.	event	route-create-err	sls.app.ack.ccm.create_route_failed
	Route update failures	An alert is triggered when a cluster fails to update the routes of the VPC. In this case, submit a ticket to contact the ACK technical team.	event	route-sync-err	sls.app.ack.ccm.sync_route_failed
	Command execution failures in managed node pools	An alert is triggered when a node pool error occurs. In this case, submit a ticket to contact the ACK technical team.	event	nlc-run-cmd-err	sls.app.ack.nlc.run_command_fail
	Node removal failures in managed node pools	An alert is triggered when a node pool error occurs. In this case, submit a ticket to contact the ACK technical team.	event	nlc-empty-cmd	sls.app.ack.nlc.empty_task_cmd
	Unimplemented URL mode in managed node pools	An alert is triggered when a node pool error occurs. In this case, submit a ticket to contact the ACK technical team.	event	nlc-url-m-unimp	sls.app.ack.nlc.url_mode_unimpl
	Unknown repairing operations in managed node pools	An alert is triggered when a node pool error occurs. In this case, submit a ticket to contact the ACK technical team.	event	nlc-opt-no-found	sls.app.ack.nlc.op_not_found
	Node draining and removal failures in managed node pools	An alert is triggered when a node pool error occurs. In this case, submit a ticket to contact the ACK technical team.	event	nlc-des-node-err	sls.app.ack.nlc.destroy_node_fail
	Node draining failures in managed node pools	An alert is triggered when a node in a managed node pool fails to be drained. In this case, submit a ticket to contact the ACK technical team.	event	nlc-drain-node-err	sls.app.ack.nlc.drain_node_fail
	ECS restart timeouts in managed node pools	An alert is triggered when a node pool error occurs. In this case, submit a ticket to contact the ACK technical team.	event	nlc-restart-ecs-wait	sls.app.ack.nlc.restart_ecs_wait_fail
	ECS restart failures in managed node pools	An alert is triggered when a node pool error occurs. In this case, submit a ticket to contact the ACK technical team.	event	nlc-restart-ecs-err	sls.app.ack.nlc.restart_ecs_fail
	ECS reset failures in managed node pools	An alert is triggered when a node pool error occurs. In this case, submit a ticket to contact the ACK technical team.	event	nlc-reset-ecs-err	sls.app.ack.nlc.reset_ecs_fail
	Auto-repair task failures in managed node pools	An alert is triggered when a node pool error occurs. In this case, submit a ticket to contact the ACK technical team.	event	nlc-sel-repair-err	sls.app.ack.nlc.repair_fail
Alert rule set for network exceptions	Invalid Terway resources	An alert is triggered when a Terway resource is invalid. In this case, submit a ticket to contact the ACK technical team.	event	terway-invalid-res	sls.app.ack.terway.invalid_resource
	IP allocation failures of Terway	An alert is triggered when an IP address fails to be allocated in Terway mode. In this case, submit a ticket to contact the ACK technical team.	event	terway-alloc-ip-err	sls.app.ack.terway.alloc_ip_fail
	Ingress bandwidth configuration parsing failures	An alert is triggered when the bandwidth configuration of an Ingress fails to be parsed. In this case, submit a ticket to contact the ACK technical team.	event	terway-parse-err	sls.app.ack.terway.parse_fail
	Network resource allocation failures of Terway	An alert is triggered when a network resource fails to be allocated in Terway mode. In this case, submit a ticket to contact the ACK technical team.	event	terway-alloc-res-err	sls.app.ack.terway.allocate_failure
	Network resource reclaiming failures of Terway	An alert is triggered when a network resource fails to be reclaimed in Terway mode. In this case, submit a ticket to contact the ACK technical team.	event	terway-dispose-err	sls.app.ack.terway.dispose_failure
	Terway virtual mode changes	An alert is triggered when the Terway virtual mode is changed.	event	terway-virt-mod-err	sls.app.ack.terway.virtual_mode_change
	Pod IP checks executed by Terway	An alert is triggered when a pod IP is checked in Terway mode.	event	terway-ip-check	sls.app.ack.terway.config_check
	Ingress configuration reload failures	An alert is triggered when the configuration of an Ingress fails to be reloaded. In this case, check whether the Ingress configuration is valid.	event	ingress-reload-err	sls.app.ack.ingress.err_reload_nginx
Alert rule set for storage exceptions	Cloud disk size less than 20 GiB	ACK does not allow you to mount a disk of less than 20 GiB. You can check the sizes of the disks that are attached to your cluster.	event	csi_invalid_size	sls.app.ack.csi.invalid_disk_size
	Subscription cloud disks cannot be mounted	ACK does not allow you to mount a subscription disk. You can check the billing methods of the disks that are attached to your cluster.	event	csi_not_portable	sls.app.ack.csi.disk_not_portable
	Mount target unmounting failures because the mount target is being used	An alert is triggered when an unmount failure occurs because the mount target is in use.	event	csi_device_busy	sls.app.ack.csi.deivce_busy
	No available cloud disk	An alert is triggered when no disk is available. In this case, submit a ticket to contact the ACK technical team.	event	csi_no_ava_disk	sls.app.ack.csi.no_ava_disk
	I/O hangs of cloud disks	An alert is triggered when I/O hangs occur on a disk. In this case, submit a ticket to contact the ACK technical team.	event	csi_disk_iohang	sls.app.ack.csi.disk_iohang
	Slow I/O rate of PVC used to mount cloud disks	An alert is triggered when the I/O of a disk that is mounted by using a persistent volume claim (PVC) is slow. In this case, submit a ticket to contact the ACK technical team.	event	csi_latency_high	sls.app.ack.csi.latency_too_high
	Disk usage exceeds the threshold	An alert is triggered when the usage of a disk exceeds the specified threshold. You can check the usage of a disk that is mounted to your cluster.	event	disk_space_press	sls.app.ack.csi.no_enough_disk_space
Alert rule set for cluster security events	High-risk configurations detected in inspections	An alert is triggered when a high-risk configuration is detected during a cluster inspection. In this case, submit a ticket to contact the ACK technical team.	event	si-c-a-risk	sls.app.ack.si.config_audit_high_risk
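
One pattern is visible in the table above: every rule of the `event` type has an expression that starts with `sls.`, and every rule of the `metric-cms` type has an expression that starts with `cms.`. The sketch below uses that observation to catch a mismatched type and expression before a rule is applied. The prefix check is an assumption drawn from the table, not a documented guarantee, and `rule_problems` is a hypothetical helper.

```python
# Expression prefixes observed in the default alert rule template (an assumption).
PREFIX_BY_TYPE = {"event": "sls.", "metric-cms": "cms."}


def rule_problems(rule):
    """Return a list of human-readable problems found in one alert rule dict."""
    problems = []
    rule_type = rule.get("type")
    expression = rule.get("expression", "")
    if rule_type not in PREFIX_BY_TYPE:
        problems.append(f"unknown type: {rule_type!r}")
    elif not expression.startswith(PREFIX_BY_TYPE[rule_type]):
        problems.append(
            f"expression {expression!r} does not look like a {rule_type} rule"
        )
    if rule.get("enable") not in ("enable", "disable"):
        problems.append("enable must be 'enable' or 'disable'")
    return problems


good_rule = {
    "name": "pod-oom",
    "type": "event",
    "expression": "sls.app.ack.pod.oom",
    "enable": "enable",
}
# Type and expression taken from different rows of the table.
mixed_rule = {
    "name": "node_cpu_util_high",
    "type": "event",
    "expression": "cms.host.cpu.utilization",
    "enable": "enable",
}
print(rule_problems(good_rule))
print(rule_problems(mixed_rule))
```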

Configure alert rules

Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, find the cluster that you want to manage and click its name. In the left-side pane, choose Operations > Alerts.
In the upper-right corner of the Alert Rules tab, click Configure Alert Rule. In the Alert Rules panel, click YAML in the Actions column to view the configuration of the AckAlertRule object.

You can modify the YAML file based on the preceding description of the default alert rule template.

Example:

apiVersion: alert.alibabacloud.com/v1beta1
kind: AckAlertRule
metadata:
  name: default
spec:
  groups:
    # The following code is a sample alert rule based on cluster events.
    - name: pod-exceptions                             # The name of the alert rule set. This parameter corresponds to an alert rule set in the default alert rule template.
      rules:
        - name: pod-oom                                # The name of the alert rule.
          type: event                                  # The type of the alert rule, which corresponds to the Rule_Type column. Valid values: event and metric-cms.
          expression: sls.app.ack.pod.oom              # The alert rule expression. For an event rule, use the value in the SLS_Event_ID column of the default alert rule template.
          enable: enable                               # The status of the alert rule. Valid values: enable and disable.
        - name: pod-failed
          type: event
          expression: sls.app.ack.pod.failed
          enable: enable
    # The following code is a sample alert rule for basic cluster resources.
    - name: res-exceptions                              # The name of the alert rule set. This parameter corresponds to an alert rule set in the default alert rule template.
      rules:
        - name: node_cpu_util_high                      # The name of the alert rule.
          type: metric-cms                              # The type of the alert rule, which corresponds to the Rule_Type column. Valid values: event and metric-cms.
          expression: cms.host.cpu.utilization          # The alert rule expression, taken from the SLS_Event_ID column of the default alert rule template.
          contactGroups:                                # The contact group that is associated with the alert rule. The contacts created by an Alibaba Cloud account are shared by all clusters within the account.
          enable: enable                                # The status of the alert rule. Valid values: enable and disable.
          thresholds:                                   # The alert threshold. For more information, see the "Example - Modify the alert threshold for basic cluster resources by using a CRD" section of this topic.
            - key: CMS_ESCALATIONS_CRITICAL_Threshold
              unit: percent
              value: '1'
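
If the manifest is kept in a script or in version control, enabling or disabling a single rule can be automated. The following sketch works on the manifest as a plain Python dictionary with the same structure as the example above. `set_rule_status` is a hypothetical helper, not an ACK API, and it returns a modified copy instead of changing the input.

```python
import copy


def set_rule_status(manifest, group_name, rule_name, status):
    """Return a copy of the manifest with one rule's `enable` field changed."""
    if status not in ("enable", "disable"):
        raise ValueError("status must be 'enable' or 'disable'")
    updated = copy.deepcopy(manifest)  # leave the caller's manifest untouched
    for group in updated["spec"]["groups"]:
        if group["name"] != group_name:
            continue
        for rule in group["rules"]:
            if rule["name"] == rule_name:
                rule["enable"] = status
                return updated
    raise KeyError(f"rule {rule_name!r} not found in group {group_name!r}")


manifest = {
    "apiVersion": "alert.alibabacloud.com/v1beta1",
    "kind": "AckAlertRule",
    "metadata": {"name": "default"},
    "spec": {
        "groups": [
            {
                "name": "pod-exceptions",
                "rules": [
                    {"name": "pod-oom", "type": "event",
                     "expression": "sls.app.ack.pod.oom", "enable": "enable"},
                    {"name": "pod-failed", "type": "event",
                     "expression": "sls.app.ack.pod.failed", "enable": "enable"},
                ],
            }
        ]
    },
}

# Disable only the pod-failed rule; pod-oom stays enabled.
changed = set_rule_status(manifest, "pod-exceptions", "pod-failed", "disable")
print([rule["enable"] for rule in changed["spec"]["groups"][0]["rules"]])
```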

Example - Modify the alert threshold for basic cluster resources by using a CRD

The threshold-based rules in the alert rule set for resource exceptions are of the metric-cms type, which indicates that the rules are synchronized from CloudMonitor. The following example shows how to add the thresholds parameter to the AckAlertRule object for the Node - CPU usage rule. You can use this parameter to configure the alert threshold, the number of times that the CPU usage must exceed the threshold before an alert is triggered, and the silence period after an alert is triggered.

apiVersion: alert.alibabacloud.com/v1beta1
kind: AckAlertRule
metadata:
  name: default
spec:
  groups:
    # The following code is a sample alert rule for basic cluster resources.
    - name: res-exceptions                                        # The name of the alert rule set. This parameter corresponds to an alert rule set in the default alert rule template.
      rules:
        - name: node_cpu_util_high                                # The name of the alert rule.
          type: metric-cms                                        # The type of the alert rule. Valid values: event and metric-cms.
          expression: cms.host.cpu.utilization                    # The alert rule expression, taken from the SLS_Event_ID column of the default alert rule template.
          contactGroups:                                          # The contact group associated with the alert rule. You can add contact groups in the ACK console. The contacts created by an Alibaba Cloud account are shared by all clusters within the account.
          enable: enable                                          # The status of the alert rule. Valid values: enable and disable.
          thresholds:                                             # The alert threshold. The keys are described in the table that follows this example.
            - key: CMS_ESCALATIONS_CRITICAL_Threshold
              unit: percent
              value: '1'
            - key: CMS_ESCALATIONS_CRITICAL_Times
              value: '3'
            - key: CMS_RULE_SILENCE_SEC
              value: '900'

Parameter	Description	Default
`CMS_ESCALATIONS_CRITICAL_Threshold`	The alert threshold. `unit`: The unit of the threshold. Valid values: percent, count, and qps. `value`: The value of the threshold. This parameter is required. If you leave this parameter empty, the modification does not take effect and the alert rule is disabled.	The default value is the same as the default value specified in the default alert rule template.
`CMS_ESCALATIONS_CRITICAL_Times`	The number of times that the alert threshold is exceeded before an alert is triggered. This parameter is optional. If you leave this parameter empty, the default value is used.	3
`CMS_RULE_SILENCE_SEC`	The silence period after an alert is triggered. This parameter is used to prevent frequent alerting. Unit: seconds. This parameter is optional. If you leave this parameter empty, the default value is used.	900
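
The rules in the table above can be checked locally before the CRD is applied. The following Python sketch is only an illustration: `resolve_thresholds` is a hypothetical helper, not part of ACK or CloudMonitor, and it assumes the `thresholds` list has already been loaded from the YAML file into a list of dictionaries. It fills in the documented defaults for the optional keys and rejects a list in which the required threshold value is missing or empty.

```python
# Hypothetical helper (an assumption for illustration, not an ACK API).
# It mirrors the parameter table: the threshold value is required, and the
# two optional keys fall back to the documented defaults of 3 and 900.
DEFAULTS = {
    "CMS_ESCALATIONS_CRITICAL_Times": "3",
    "CMS_RULE_SILENCE_SEC": "900",
}
REQUIRED_KEY = "CMS_ESCALATIONS_CRITICAL_Threshold"
VALID_UNITS = {"percent", "count", "qps"}


def resolve_thresholds(thresholds):
    """Return a key -> value dict with the documented defaults filled in.

    Raises ValueError if the required threshold value is missing or empty,
    or if its unit is not one of the documented valid values.
    """
    resolved = dict(DEFAULTS)
    for entry in thresholds:
        key = entry.get("key")
        value = entry.get("value")
        if key == REQUIRED_KEY:
            unit = entry.get("unit")
            # The unit is validated only when it is present.
            if unit is not None and unit not in VALID_UNITS:
                raise ValueError(f"unsupported unit: {unit!r}")
        # An empty optional value keeps the default, as the table describes.
        if value not in (None, ""):
            resolved[key] = str(value)
    if REQUIRED_KEY not in resolved:
        raise ValueError(f"{REQUIRED_KEY} requires a non-empty value")
    return resolved


if __name__ == "__main__":
    # Same values as the Node - CPU usage example above.
    sample = [
        {"key": "CMS_ESCALATIONS_CRITICAL_Threshold", "unit": "percent", "value": "1"},
        {"key": "CMS_ESCALATIONS_CRITICAL_Times", "value": "3"},
        {"key": "CMS_RULE_SILENCE_SEC", "value": "900"},
    ]
    print(resolve_thresholds(sample))
```

A check like this only catches mistakes that the table describes. Whether a modification actually takes effect is still decided by the cluster when the CRD is applied.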

FAQ

What do I do if I fail to update an alert rule and the following error message is returned: The Project does not exist : k8s-log-xxx?

Issue:

When the system updates an alert rule, the following error message is returned: The Project does not exist : k8s-log-xxx.

Cause:

You did not create an event center in Simple Log Service for your cluster.

Solution:

Go to the Simple Log Service console. Check whether the number of projects has reached the quota limit. If the quota limit is reached, delete excessive projects or submit a ticket to apply for a quota increase. For more information about how to delete a Simple Log Service project, see Manage a project.
Reinstall ack-node-problem-detector by performing the following steps:
1. In the left-side navigation pane of the cluster details page in the ACK console, choose Applications > Helm.
2. If you want to reinstall ack-node-problem-detector by using a YAML file, perform the following steps to obtain a copy of the YAML template of ack-node-problem-detector:
  On the Helm page, find ack-node-problem-detector and click Update in the Actions column. After ack-node-problem-detector is updated, click View Details in the Actions column. On the details page of ack-node-problem-detector, select a resource and click View in YAML to copy the YAML content to your on-premises machine. Perform the same operation for each resource to obtain a copy of the YAML template.
3. On the Helm page, select ack-node-problem-detector and click Delete in the Actions column.
4. In the left-side navigation pane of the details page, choose Operations > Add-ons.
5. Click the Log and Monitoring tab, find ack-node-problem-detector, and then click Install.
  In the Note message, confirm the versions of the plug-ins and click OK. After ack-node-problem-detector is installed, the word "Installed" and the version information are displayed in the ack-node-problem-detector section.

What do I do if I fail to update an alert rule because no contact group subscribes to the alert rule?

Issue:

When the system updates an alert rule, the following error message is returned: this rule have no xxx contact groups reference.

Cause:

No contact group subscribes to the alert rule.

Solution:

Create a contact group and add contacts.
Find the alert rule and click Modify Contacts. In the Modify Contacts panel, add the contact group that you created as the subscriber.

