CloudMonitor provides dynamic threshold-triggered alert rules to monitor the metrics of Alibaba Cloud resources. Dynamic threshold-triggered alert rules automatically fit historical metric data and display the resulting threshold boundaries. This helps you identify anomalies such as sudden increases or decreases in metric values and helps ensure business stability.
What are dynamic thresholds?
Dynamic thresholds use machine learning algorithms to identify the characteristics of historical metric data, such as periodicity, trend, and fluctuation. Based on these characteristics, dynamic thresholds integrate the metrics of specific cloud services to automatically calculate upper and lower threshold boundaries for each instance.
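CloudMonitor does not publish the algorithm itself, but the idea can be illustrated with a minimal, hypothetical sketch. The Python code below is only an approximation of the concept: it buckets historical samples by hour of day and derives an upper and a lower boundary for each hour, so the boundaries follow the metric's daily pattern. The `width` multiplier is an illustrative stand-in for how wide the boundaries are drawn; it is not an actual CloudMonitor parameter.

```python
# Hypothetical sketch of dynamic thresholds: derive time-varying upper and lower
# boundaries from an instance's own metric history. This is NOT CloudMonitor's
# actual algorithm; it only illustrates how boundaries can be fit per instance.
from collections import defaultdict
from statistics import mean, stdev

def dynamic_boundaries(history, width=3.0):
    """history: iterable of (hour_of_day, value) samples from past weeks.
    Returns {hour: (lower, upper)} boundaries that follow the daily pattern."""
    buckets = defaultdict(list)
    for hour, value in history:
        buckets[hour].append(value)
    boundaries = {}
    for hour, values in buckets.items():
        center = mean(values)
        spread = stdev(values) if len(values) > 1 else 0.0
        boundaries[hour] = (center - width * spread, center + width * spread)
    return boundaries

# An instance whose CPU utilization peaks in the daytime and drops at night gets a
# higher upper boundary at 14:00 than at 03:00, so one rule fits both periods.
```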
Limits
The dynamic threshold-triggered alerting feature is in invitational preview. To use the feature, submit a ticket.
Scenarios
The statistical characteristics of the metrics of cloud resources, such as resource usage, periodic changes, and variance fluctuations, vary in different business scenarios. For example, if your traffic is heavy in the daytime and light at night, metrics such as the gateway traffic of an Elastic Compute Service (ECS) instance or Alibaba Cloud CDN domain name and the number of accumulated ApsaraMQ messages show corresponding peak and off-peak values. I/O-intensive services and compute-intensive tasks also result in different CPU utilization and load levels (load.1m, load.5m, and load.15m) on different ECS instances.
Alert thresholds are fixed in single-metric alert rules, which makes such rules a poor fit for the complex business scenarios described above. Constant alerting occurs on some high-load instances, whereas on low-load instances some business exceptions never reach the alert thresholds, or the thresholds are reached only after the exceptions have lasted for more than half an hour. To improve your alerting experience and shorten exception detection time, CloudMonitor provides the dynamic threshold-triggered alerting feature, which is based on machine learning algorithms and expert experience with alert rules. The core algorithm dynamically identifies the historical characteristics of data patterns, such as the periodicity, fluctuation, and usage of metrics, and integrates the metrics of specific cloud services to automatically generate upper and lower threshold boundaries for each instance. The feature also lets you view the threshold effect in a visualized manner and tune sensitivity parameters, so the thresholds do not work as a black box.
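As a rough illustration of how periodicity in historical data can be identified (an assumed approach for this sketch only, not CloudMonitor's published implementation), the following code estimates the dominant period of a metric series from its autocorrelation:

```python
# Hypothetical sketch: estimate the dominant period of a metric series (for example,
# a daily cycle) from its autocorrelation. Not CloudMonitor's actual implementation.
import numpy as np

def dominant_period(values):
    """Return the lag (in samples) with the strongest autocorrelation peak."""
    x = np.asarray(values, dtype=float)
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]  # non-negative lags only
    acf = acf / acf[0]                                   # normalize so lag 0 == 1
    below = np.nonzero(acf < 0)[0]                       # skip the initial lag-0 peak
    start = int(below[0]) if below.size else 1
    return start + int(np.argmax(acf[start:len(x) // 2]))

# One-minute samples with a strong daily pattern should yield a period near 1440.
minutes = np.arange(7 * 1440)
series = 50 + 20 * np.sin(2 * np.pi * minutes / 1440) + np.random.normal(0, 2, minutes.size)
print(dominant_period(series))  # ~1440
```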
Benefits
Compared with single-metric or multi-metric alert rules, dynamic threshold-triggered alert rules have the following advantages:
Alert denoising
The dynamic threshold-triggered alerting feature collects the metric data of each instance and uses models such as robust time series decomposition and prediction to fit the usage patterns and business changes of each instance's metrics. The feature also filters out noise based on historical alert clustering and similarity matching to improve alerting accuracy. This feature is suitable for the following scenarios:
Various usage levels of instance metrics
For example, a gaming company uses different ECS instances for offline computing and online services but applies the same alert template to both. Alerts are triggered when metrics such as CPU utilization, load, and memory usage exceed 80%. As a result, alerts are constantly and unexpectedly triggered on the high-load instances.
Load spikes caused by scheduled tasks
For example, a user uses ApsaraDB RDS for storage and sets a scheduled task that clears historical data older than 30 days at 00:00:00 every day. While the scheduled task is running, the IOPS usage of the ApsaraDB RDS instance instantly approaches 100% and an alert is generated. The IOPS usage then quickly returns to normal. This false positive is generated on schedule every day.
In the preceding scenarios, the dynamic threshold-triggered alerting feature can effectively reduce the false positive rate by 80% to 90%. This feature enables you to focus on business exceptions and improve your monitoring and O&M experience.
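To make the scheduled-task scenario above concrete, here is a minimal, hypothetical sketch. It uses simple per-hour medians rather than CloudMonitor's robust decomposition and prediction models: boundaries learned from history already contain the recurring spike at 00:00, so the dynamic check stays silent while a fixed threshold fires every day.

```python
# Hypothetical denoising sketch: per-hour boundaries learned from history absorb a
# recurring scheduled spike, while a fixed threshold flags it every day. This is a
# simplification, not CloudMonitor's actual robust time series decomposition models.
from statistics import median

def hourly_bands(history, margin=1.5):
    """history: list of (hour, value). Returns {hour: (lower, upper)} using a median
    and a median-absolute-deviation-style spread, which is robust to outliers."""
    by_hour = {}
    for hour, value in history:
        by_hour.setdefault(hour, []).append(value)
    bands = {}
    for hour, values in by_hour.items():
        med = median(values)
        mad = median(abs(v - med) for v in values) or 1.0
        bands[hour] = (med - margin * mad, med + margin * mad)
    return bands

# Two weeks of history: IOPS usage stays around 20% except for a cleanup job at 00:00
# that pushes it to about 95% (the ApsaraDB RDS scheduled-task scenario above).
history = [(hour, 95 if hour == 0 else 20) for _ in range(14) for hour in range(24)]
bands = hourly_bands(history)

fixed_threshold = 80
today = {0: 96, 12: 21}  # the scheduled spike recurs today; midday load is normal
for hour, value in today.items():
    lower, upper = bands[hour]
    static_alert = value > fixed_threshold         # fires at 00:00 every day (false positive)
    dynamic_alert = not (lower <= value <= upper)  # stays silent: the spike is expected
    print(hour, static_alert, dynamic_alert)
```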
Automatic exception detection
Exceptions on the metrics of cloud service instances are usually caused by changes in upstream and downstream services and traffic, or by changes in the applications and data deployed on the instances. The dynamic threshold-triggered alerting feature can quickly detect service exceptions, such as anomalies in the number of connections to an Internet-facing Server Load Balancer (SLB) instance or a large number of accumulated ApsaraMQ messages. You can also use this feature to detect exceptions in basic ECS resource metrics and locate the root causes of business exceptions.
When you configure a single-metric or multi-metric alert rule, you typically set high thresholds to prevent excessive false positives. In addition, such a rule applies to an application group or to all resources, so its parameters cannot be tuned for specific services or instances. Dynamic threshold-triggered alert rules can quickly and accurately detect sudden increases or decreases in metric values. Such alert rules are suitable for the following scenarios:
Detection of metric exceptions after code changes
For example, after a developer changes the application code, a memory leak occurs in the program, which causes frequent full garbage collections and a significant increase in CPU utilization. However, the increase is not large enough to trigger a single-metric alert rule for high CPU utilization.
Timely warning before services become unavailable
For example, if the upstream service traffic increases suddenly, a dynamic threshold-triggered alert rule can help quickly detect the exception and trigger an alert before the usage threshold specified in a single-metric alert rule is reached. This prevents the downstream services from becoming unavailable due to continuous high traffic.
In the preceding business scenarios, dynamic threshold-triggered alert rules can be used to monitor the core metrics of cloud services. This way, CloudMonitor can recall 85% or more of issues and faults within 3 minutes after a metric exception occurs.
Reduced threshold configuration and maintenance costs
You do not need to specify values for dynamic thresholds. You only need to create a dynamic threshold-triggered alert rule and select the corresponding alert condition (beyond the boundary, higher than the upper boundary, or below the lower boundary) to complete the threshold settings. The dynamic threshold-triggered alerting feature significantly reduces configuration and maintenance costs and is suitable for the following scenarios:
Difficulty in specifying threshold values
When you configure alert rules for metrics that have no physical upper limit, such as the traffic, bandwidth, and queries per second (QPS) of an ECS instance, the metric values may vary by orders of magnitude, which makes it difficult to specify a common, appropriate threshold. In addition, when the actual metric values shift because of O&M or business changes, you must adjust the thresholds accordingly to prevent false positives and false negatives.
Configuration of multiple rules to trigger alerts with different thresholds and in different time periods
If the values of some metrics show significant peak and off-peak values in different time periods, you need to configure multiple rules and specify a different effective period for each rule.
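The three alert conditions named above map to simple comparisons against the generated boundaries. The following minimal sketch is illustrative only; the helper function is not part of any CloudMonitor SDK, and only the condition names are taken from the feature description.

```python
# Illustrative mapping of the three dynamic threshold alert conditions to boundary
# checks. The helper below is hypothetical and is not a CloudMonitor API.
def breaches(value, lower, upper, condition):
    if condition == "beyond the boundary":
        return value < lower or value > upper
    if condition == "higher than the upper boundary":
        return value > upper
    if condition == "below the lower boundary":
        return value < lower
    raise ValueError(f"unknown condition: {condition}")

# For a metric whose learned boundaries at the current point in time are (200, 800):
print(breaches(950, 200, 800, "higher than the upper boundary"))  # True: sudden increase
print(breaches(0, 200, 800, "below the lower boundary"))          # True: drop to zero
```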
Best practices
Alert denoising for ECS basic resources
A user uses one ECS instance for offline rendering and other ECS instances for online business support. The memory usage of the ECS instance used for offline rendering is significantly higher than that of the ECS instances used for online tasks. The user configures a single-metric alert rule that triggers when memory usage exceeds 80%. As a result, the ECS instance used for offline rendering triggers alerts constantly and generates a total of 200 alerts in one week. After the user switches to a dynamic threshold-triggered alert rule, fewer than five alerts are generated in a week, a false positive convergence rate of more than 95%.
The best practices of alert denoising apply to other metrics besides the memory usage of ECS instances. We recommend that you configure dynamic threshold-triggered alert rules for the metrics described in the following table.
| Common exception | Possible cause | Metric | Alert condition |
| --- | --- | --- | --- |
| The load is excessively high, the load fluctuates significantly, or the peak load lasts for a long period of time. | System resources are insufficient, processes experience exceptions (such as endless loops and memory leaks), the number of processes increases suddenly, or a large number of requests or data processing operations are suddenly generated by some applications or system services. | | Higher than the upper boundary |
| The number of requests increases suddenly, the number of requests fluctuates significantly, or the peak number of requests lasts for a long period of time. | An exception occurs on an application or a system service. The I/O performance of disks is insufficient, or the disk capacity is insufficient. A large number of disk read or write operations are performed on some applications or services. | | Beyond the boundary |
| The number of connections is excessively high, the number of connections fluctuates significantly, or the peak number of connections lasts for a long period of time. | The system load is excessively high, TCP connection pools are insufficient, exceptions occur on applications or services, or a large number of TCP connection operations are performed at a certain time on some applications or services. | (Agent)network.tcp.connection_state | Beyond the boundary |
False positive convergence caused by scheduled tasks of ApsaraDB RDS
Every day in the early morning, a user-specified scheduled task is executed to clear historical data, and the QPS of an ApsaraDB RDS for MySQL database spikes immediately. A single-metric alert rule triggers a false positive each time the scheduled task runs. After the rule is changed to a dynamic threshold-triggered alert rule, these scheduled false positives no longer occur.
The best practices of false positive convergence caused by scheduled tasks apply to other metrics besides the QPS of ApsaraDB RDS for MySQL databases. We recommend that you configure dynamic threshold-triggered alert rules for the metrics described in the following table.
| Common exception | Possible cause | Metric | Alert condition |
| --- | --- | --- | --- |
| The performance of an ApsaraDB RDS instance fluctuates significantly. | The system load is excessively high, or database connection pools are insufficient. A large number of queries are performed at a certain time on an application or a service. | | Higher than the upper boundary |
Detection of exceptions on OSS or CDN
Object Storage Service (OSS) and Alibaba Cloud CDN (CDN) typically serve as the storage and content delivery acceleration components of a service, respectively. Exceptions on OSS or CDN directly affect the availability of service features. However, application availability monitoring generally does not cover the availability of OSS and CDN. As a result, alerts may not be triggered when exceptions occur on OSS or CDN.
For example, if the BPS of CDN drops to zero, the dynamic threshold-triggered alerting feature can detect and recall the exception in time and send an alert notification.
Dynamic threshold-triggered alert rules can be used to quickly cover the monitoring alerts for OSS and CDN, and detect exceptions in advance before the services become unavailable. We recommend that you configure dynamic threshold-triggered alert rules for the metrics described in the following table.
| Alibaba Cloud service | Common exception | Possible cause | Metric | Alert condition |
| --- | --- | --- | --- | --- |
| OSS | The number of successful requests decreases suddenly, or the number of request errors increases suddenly. | The network connection is unstable or experiences an exception. You do not have permissions on OSS objects or OSS objects do not exist. An error occurs when an API operation is called. | | Below the lower boundary |
| | | | | Higher than the upper boundary |
| | The traffic increases suddenly, the traffic decreases suddenly, the traffic fluctuates significantly, or the peak traffic lasts for a long period of time. | The network connection is unstable or experiences an exception. A large number of requests are sent by some applications or services at a certain time. | | Beyond the boundary |
| CDN | The QPS increases suddenly, the QPS decreases suddenly, the QPS fluctuates significantly, the peak QPS lasts for a long period of time, or the response time increases. | The system load is excessively high, the cache is insufficient, and CDN nodes are insufficient. The number of user visits increases suddenly. A large number of requests are retried after a request fails. | BPS_isp, QPS_isp, InternetOut | Beyond the boundary |
| | | | rt | Higher than the upper boundary |
| | The hit ratio decreases. | Requests are redirected to the origin server and acceleration fails. | hitRate | Below the lower boundary |
Simplified O&M configuration of ApsaraMQ for Kafka
The values of some ApsaraMQ for Kafka metrics depend on the business, such as the number of messages sent on an instance and the number of messages consumed on an instance. In addition, message consumption varies significantly among groups and topics. As a result, it is difficult to set a common threshold to monitor the message queues of different services, which may lead to issues such as false negatives and delayed detection.
With automated alerting capabilities, the dynamic threshold-triggered alerting feature can simplify alert rule configurations and reduce maintenance costs. This feature can quickly detect exceptions within 2 to 3 minutes, effectively reducing the mean time to repair (MTTR) of services.
For example, if the number of accumulated messages for ApsaraMQ for Kafka increases suddenly, this feature recalls the exception and sends an alert notification.
We recommend that you configure dynamic threshold-triggered alert rules for the metrics of ApsaraMQ for Kafka listed in the following table.
| Common exception | Possible cause | Metric | Alert condition |
| --- | --- | --- | --- |
| The traffic increases or decreases suddenly. | Many users access an application or perform a large number of data transmission operations. Exceptions occur on applications, or the network bandwidth is consumed by malicious programs. | | Beyond the boundary |
| Messages are accumulated. | System resources are insufficient, processes experience exceptions (such as endless loops and memory leaks), the number of processes increases suddenly, or a large number of requests or data processing operations are suddenly generated by some applications or system services. | | Higher than the upper boundary |
| The number of connections is excessively high, the number of connections fluctuates significantly, or the peak number of connections lasts for a long period of time. | The system load is excessively high, TCP connection pools are insufficient, exceptions occur on applications or services, or a large number of TCP connection operations are performed at a certain time on some applications or services. | | Beyond the boundary |