Configure the monitoring and alerting feature for risk warning

Updated at: 2025-01-10 02:51

ApsaraMQ for RocketMQ allows you to configure alert rules by using CloudMonitor. This helps you monitor the status and key metrics of your instance in real time and enables you to receive exception notifications at the earliest opportunity to implement risk warnings in production environments.

Background information

ApsaraMQ for RocketMQ provides fully managed messaging services and a Service Level Agreement (SLA) for each instance edition. The actual capabilities of each instance edition, such as messaging transactions per second (TPS) and message storage, match the specifications that are defined for the edition. For information about the SLAs of different instance editions, see Limits on instance specifications.

You do not need to worry about the instance performance. However, you must monitor your instance usage in production environments to make sure that you do not exceed the thresholds that are specified for your instance. ApsaraMQ for RocketMQ integrates with CloudMonitor to provide monitoring and alerting services free of charge and for immediate use. You can use the services to monitor the following items:

  • Instance usage

    If your actual instance usage exceeds the specification limit, ApsaraMQ for RocketMQ forcibly throttles the instance. To prevent faults that are caused by instance throttling, you can configure the instance usage alert in advance and upgrade your instance configurations when an excess usage risk is detected.

  • Business logic errors

    Errors may occur when you send and receive messages. You can configure the invocation error alert to detect and fix errors and prevent negative impacts on your business.

  • Performance metrics

    If performance metrics such as response time (RT) and message delay are required for your message system, you can configure the corresponding metric alerts in advance to prevent business risks.

Rules for configuring alerts

ApsaraMQ for RocketMQ provides a variety of metrics and monitoring and alerting items. For more information, see Metric details and Metrics. Monitoring items fall into the following categories: resource usage, messaging performance, and messaging exceptions.

Based on accumulated best practices in production environments, we recommend that you follow the rules that are described in the following table to configure alerts.

Note

The following monitoring items are basic configurations that are recommended by Alibaba Cloud. ApsaraMQ for RocketMQ also provides other monitoring items. You can configure alerts in a fine-grained manner based on your business requirements. For more information, see Monitoring and alerting.

Resource usage

  • Monitoring items:

    • The number of API calls initiated to send messages on an instance

    • The number of API calls initiated to receive messages on an instance

    • Internet outbound bandwidth

  • Configuration timing and reason: We recommend that you configure these items immediately after an instance is created. The resource usage of an instance is not determined by a single topic or group; you must consider the overall resource usage of the instance.

  • Related personnel: Resource O&M engineers

Messaging performance

  • Monitoring items:

    • Message sending TPS in a topic

    • Message receiving TPS in a consumer group

    • Message accumulation in a consumer group

    • Consumption delay time in a consumer group

  • Configuration timing and reason: We recommend that you configure these items immediately after your business is launched, because you can estimate the messaging performance that your business requires only after launch.

  • Related personnel: Resource O&M engineers and business developers

Messaging exceptions

  • Monitoring items:

    • Generation of dead-letter messages

    • Number of times that throttling occurs

  • Configuration timing and reason: We recommend that you configure these items immediately after your business is launched. Predicting the failures that may occur during message production helps you troubleshoot issues.

  • Related personnel: Resource O&M engineers and business developers

Procedure for configuring alerts

  1. Log on to the ApsaraMQ for RocketMQ console. In the left-side navigation pane, click Instances.

  2. In the top navigation bar, select a region, such as China (Hangzhou). On the Instances page, click the name of the instance that you want to manage.

  3. In the left-side navigation pane, click Monitoring and Alerts. In the upper-left corner of the page that appears, click Create Alert Rule.
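The preceding steps create a rule in the console. Alert rules can also be managed programmatically through the CloudMonitor PutResourceMetricRule API. The following Python sketch uses the Alibaba Cloud core SDK's CommonRequest interface; the Namespace and MetricName values are placeholders that we assume for illustration, so look up the exact identifiers for your instance edition in the Metrics reference before use.

```python
# A minimal sketch, not a drop-in script: create a CloudMonitor threshold
# alert rule for an ApsaraMQ for RocketMQ instance via PutResourceMetricRule.
# Requires: pip install aliyun-python-sdk-core
from aliyunsdkcore.client import AcsClient
from aliyunsdkcore.request import CommonRequest

client = AcsClient("<accessKeyId>", "<accessKeySecret>", "cn-hangzhou")

request = CommonRequest(
    domain="metrics.cn-hangzhou.aliyuncs.com",  # CloudMonitor endpoint
    version="2019-01-01",
    action_name="PutResourceMetricRule",
)
request.set_method("POST")
request.add_query_param("RuleId", "rmq-send-tps-alert")           # your unique rule ID
request.add_query_param("RuleName", "RocketMQ send TPS over 70%")
request.add_query_param("Namespace", "acs_rocketmq5")             # assumption: verify in the Metrics reference
request.add_query_param("MetricName", "SendMessageCountPerInstance")  # placeholder metric name
request.add_query_param("Resources", '[{"instanceId":"rmq-cn-xxxxxxxx"}]')
request.add_query_param("ContactGroups", "rocketmq-oncall")       # an existing CloudMonitor contact group
# Fire a critical alert when the 1-minute average exceeds the threshold
# in three consecutive periods.
request.add_query_param("Escalations.Critical.Statistics", "Average")
request.add_query_param("Escalations.Critical.ComparisonOperator", "GreaterThanThreshold")
request.add_query_param("Escalations.Critical.Threshold", "3500")  # 70% of a 5,000 peak TPS
request.add_query_param("Escalations.Critical.Times", "3")

print(client.do_action_with_exception(request))
```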

Best practices

Configure alerts about the number of API calls initiated to send and receive messages

  • Background: In ApsaraMQ for RocketMQ, the number of API calls initiated to send and receive messages is measured by messaging transactions per second (TPS). A peak messaging TPS is specified for each ApsaraMQ for RocketMQ 5.0 instance. If the number of API calls that are initiated to send and receive messages on an instance exceeds the peak messaging TPS, the instance is throttled. For information about limits on messaging TPS, see Limits on instance specifications.

  • Risk caused by not configuring the alerts: If you do not configure the alerts, you cannot receive alerts before the number of API calls exceeds the specification limit. As a result, your instance is throttled and specific messages fail to be sent or received.

  • Configuration timing: We recommend that you configure the alerts after the instance is created and the ratio between the TPS for sending messages and the TPS for receiving messages is specified. To modify this ratio, perform the following steps:

    1. On the Instance Details page, click the Basic Information tab.

    2. In the upper-right corner of the page that appears, click Edit. In the Modify Messaging Request Ratio section of the Modify Configurations panel, modify the ratio between the TPS for sending messages and the TPS for receiving messages.

Configure alerts about the number of API calls initiated to send messages on an instance

  • Recommended threshold: We recommend that you set the alert threshold to 70% of the peak TPS for message sending. For example, if the peak TPS for message sending is 5,000, set the threshold to 3,500. The calculation for each instance type is shown in the sketch after this list.

    • Professional Edition and Enterprise Platinum Edition instances support the elastic TPS feature. You can enable the feature and set the alert threshold to 70% of the sum of the peak TPS for message sending and the peak elastic TPS for message sending.

    • Serverless instances support the adaptive elasticity feature. You can enable the feature and set the alert threshold to 70% of the peak elastic TPS for message sending.

    • You can view the peak TPS for message sending and the peak elastic TPS for message sending on the Instance Details page in the ApsaraMQ for RocketMQ console.
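As a quick illustration of the threshold rules above (which apply to message receiving in the same way), the following sketch computes the recommended value for each case. The helper name and parameters are our own, not part of any SDK.

```python
def tps_alert_threshold(peak_tps: float, elastic_peak_tps: float = 0.0) -> float:
    """Recommended alert threshold: 70% of the total peak messaging TPS.

    peak_tps:         peak TPS for message sending (or receiving) of the
                      instance edition
    elastic_peak_tps: peak elastic TPS, if the elastic TPS feature is
                      enabled; for serverless instances, pass the peak
                      elastic TPS as peak_tps instead
    """
    return 0.7 * (peak_tps + elastic_peak_tps)

print(tps_alert_threshold(5000))        # 3500.0 -> fixed specification
print(tps_alert_threshold(5000, 2000))  # 4900.0 -> elastic TPS enabled
```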

  • Alert handling: After you receive an alert about the number of API calls initiated for message sending, we recommend that you perform the following steps to handle the alert:

    1. On the Instance Details page, click the Dashboard tab.

    2. In the Current Limiting Related Indicators section, view the Production TPS Max value curve in Production TPS water level to determine the time when the alert threshold is reached.

    3. In the Instance Overview section, view the curve in Rate of messages sent by the producer to the server (bars/minute). Then, find the topic whose TPS for message sending is abnormal based on the time when the alert threshold is reached and determine whether the business changes are normal.

    4. If the business changes are abnormal, contact the relevant business team for further analysis.

    5. If the business changes are normal, the specification of your instance is insufficient to maintain normal business operations. In this case, we recommend that you upgrade your instance configurations. For more information, see Upgrade or downgrade instance configurations.
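If you prefer to locate the time when the threshold was reached in a script rather than on the Dashboard tab (steps 2 and 3 above), CloudMonitor exposes the underlying datapoints through the DescribeMetricList API. A minimal sketch, under the same Namespace and MetricName assumptions as the rule-creation example:

```python
# Sketch: fetch recent 1-minute datapoints for the production-TPS metric so
# that the moment the alert threshold was crossed can be found in a script.
# Namespace and MetricName are assumptions; verify them against the Metrics
# reference before use.
import json

from aliyunsdkcore.client import AcsClient
from aliyunsdkcore.request import CommonRequest

client = AcsClient("<accessKeyId>", "<accessKeySecret>", "cn-hangzhou")

request = CommonRequest(
    domain="metrics.cn-hangzhou.aliyuncs.com",
    version="2019-01-01",
    action_name="DescribeMetricList",
)
request.add_query_param("Namespace", "acs_rocketmq5")                 # assumption
request.add_query_param("MetricName", "SendMessageCountPerInstance")  # placeholder
request.add_query_param("Period", "60")                               # 1-minute granularity
request.add_query_param("Dimensions", '[{"instanceId":"rmq-cn-xxxxxxxx"}]')

body = json.loads(client.do_action_with_exception(request))
# Datapoints is returned as a JSON-encoded string; timestamps are in ms.
for point in json.loads(body.get("Datapoints", "[]")):
    if point.get("Average", 0) > 3500:  # the alert threshold from above
        print("threshold crossed at", point["timestamp"])
```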

Configure alerts about the number of API calls initiated to receive messages on an instance

  • Recommended threshold: We recommend that you set the alert threshold to 70% of the peak TPS for message receiving. For example, if the peak TPS for message receiving is 5,000, set the threshold to 3,500.

    • Professional Edition and Enterprise Platinum Edition instances support the elastic TPS feature. You can enable the feature and set the alert threshold to 70% of the sum of the peak TPS for message receiving and the peak elastic TPS for message receiving.

    • Serverless instances support the adaptive elasticity feature. You can enable the feature and set the alert threshold to 70% of the peak elastic TPS for message receiving.

    • You can view the peak TPS for message receiving and the peak elastic TPS for message receiving on the Instance Details page in the ApsaraMQ for RocketMQ console.

  • Alert handling: After you receive an alert about the number of API calls initiated for message receiving, we recommend that you perform the following steps to handle the alert:

    1. On the Instance Details page, click the Dashboard tab.

    2. In the Current Limiting Related Indicators section, view the Consumption TPS Max value curve in Consumption TPS water level to determine the time when the alert threshold is reached.

    3. In the Instance Overview section, view the curve in Rate of messages delivered by the server to the consumer (per minute). Then, find the group whose TPS for message receiving is abnormal based on the time when the alert threshold is reached and determine whether the business changes are normal.

    4. If the business changes are abnormal, contact the relevant business team for further analysis.

    5. If the business changes are normal, the specification of your instance is insufficient to maintain normal business operations. In this case, we recommend that you upgrade your instance configurations. For more information, see Upgrade or downgrade instance configurations.

Configure alerts about the number of messages sent by producers or received by consumers per minute

  • Background: ApsaraMQ for RocketMQ provides metrics to monitor messaging TPS by topic and consumer group. You can use the metrics to monitor messaging TPS in a specific business item and understand your business scale.

  • Risks caused by not configuring the alerts: Messaging TPS in a topic reflects the number of API calls that are initiated to send and receive messages in the topic. If you do not configure the alerts, you are not notified when traffic drops to zero or when traffic spikes occur. This may cause unexpected risks.

  • Configuration timing: We recommend that you configure the alerts after your business stabilizes.
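One hedged way to turn "traffic after your business stabilizes" into the concrete thresholds recommended in the following subsections is to derive an upper and a lower bound from recent per-minute message counts, for example a three-sigma band whose lower bound also catches a drop to zero. The helper below is our own illustration, not a CloudMonitor feature:

```python
import statistics

def suggest_minute_thresholds(per_minute_counts: list[int]) -> tuple[float, float]:
    """Suggest (lower, upper) alert thresholds from observed stable traffic.

    A spike alert fires above the upper bound; a zero-traffic/drop alert
    fires below the lower bound. The 3-sigma band is a common starting
    point, not a RocketMQ requirement - tune it to your business.
    """
    mean = statistics.fmean(per_minute_counts)
    std = statistics.pstdev(per_minute_counts)
    lower = max(0.0, mean - 3 * std)
    upper = mean + 3 * std
    return lower, upper

# Example: a window of observed per-minute production counts would go here.
print(suggest_minute_thresholds([980, 1010, 1005, 990, 1020, 995]))
```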

Configure alerts about the number of messages sent by producers per minute

  • Recommended threshold: We recommend that you configure the threshold based on the traffic volume after your business stabilizes.

  • Alert handling: After you receive an alert about the number of messages sent by producers per minute, we recommend that you perform the following steps to handle the alert:

    1. On the Topics page, click the name of the topic configured in the alert rule.

    2. On the Topic Details page, click the Dashboard tab.

    3. View the Production curve in Message volume (pieces/minute). Then, determine whether the changes are normal based on the business model.

Configure alerts about the number of messages received by consumers per minute

  • Recommended threshold: We recommend that you configure the threshold based on the traffic volume after your business stabilizes.

  • Alert handling: After you receive an alert about the number of messages received by consumers per minute, we recommend that you perform the following steps to handle the alert:

    1. On the Groups page, click the ID of the group configured in the alert rule.

    2. On the Group Details page, click the Dashboard tab.

    3. View the Consumption rate (bars/min) curve in Trends in message production and consumption rates. Then, determine whether the changes are normal based on the business model.

Configure Internet outbound bandwidth alerts

  • Background: ApsaraMQ for RocketMQ 5.0 instances support the Internet access feature. Internet access is constrained by the Internet outbound bandwidth limit. If the limit is exceeded, access over the Internet may be degraded.

  • Risk caused by not configuring the alerts: If you do not configure the alerts, you cannot receive an alert when the Internet traffic usage of the instance exceeds the bandwidth limit. This causes issues such as packet loss and timeout or failures during client invocation.

  • Configuration timing: We recommend that you configure the alert after you create the non-serverless instance and enable the Internet access feature.

    Note

    Serverless instances support elastic bandwidth. You do not need to configure Internet outbound bandwidth alerts for serverless instances.


  • Recommended threshold: We recommend that you set the alert threshold to 70% of the specification limit. However, the tool that collects traffic bandwidth captures only 50% of the actual traffic. Therefore, you can set the threshold to 35% of the specification limit. For example, if the bandwidth limit of the instance that you purchased is 1 Mbit/s, set the alert threshold to 43,750 bytes/s. You can view the Internet bandwidth in the Running Information section of the Basic Information tab on the Instance Details page in the ApsaraMQ for RocketMQ console.

    Note

    When you calculate the threshold, first convert the bandwidth limit into bytes per second. In the preceding example, 1 Mbit/s is converted by using the following formula: 1 Mbit/s = 1 × 10^6 bits/s = (1 × 10^6)/8 bytes/s = 125,000 bytes/s. The threshold is then calculated by using the following formula: 125,000 bytes/s × 0.7 × 0.5 = 43,750 bytes/s.
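The following sketch writes out the same conversion so that the unit handling stays explicit; the 0.5 factor is the 50% collection behavior described above.

```python
def bandwidth_alert_threshold_bytes(purchased_mbit_per_s: float) -> float:
    """Alert threshold in bytes/s for an Internet outbound bandwidth alarm.

    1 Mbit/s = 1e6 bits/s = 1e6 / 8 bytes/s = 125,000 bytes/s.
    Threshold = limit x 0.7 (recommended headroom) x 0.5 (the collection
    tool captures only half of the actual traffic).
    """
    bytes_per_s = purchased_mbit_per_s * 1_000_000 / 8
    return bytes_per_s * 0.7 * 0.5

print(bandwidth_alert_threshold_bytes(1))  # 43750.0
```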

  • Alert handling: After you receive an Internet outbound bandwidth alert, we recommend that you perform the following steps to handle the alert:

    1. On the Instance Details page, click the Dashboard tab.

    2. In the Billing Metrics Overview section, view the downlink bandwidth curve in public network downlink traffic bandwidth to determine the time when the alert threshold is reached. Take note that the unit of the threshold must be consistent with the unit of the metric.

    3. In the Instance Overview section, view the curves in Rate of messages sent by the producer to the server (bars/minute) and Rate of messages delivered by the server to the consumer (per minute). Then, find the topic and group whose data is abnormal based on the time when the alert threshold is reached and analyze whether the business changes are normal.

    4. If the business changes are abnormal, contact the relevant business team for further analysis.

    5. If the business changes are normal, the specification of your instance is insufficient to maintain normal business operations. In this case, we recommend that you upgrade your instance configurations. For more information, see Upgrade or downgrade instance configurations.

Configure message accumulation alerts

Note

Fluctuation and errors may exist in the statistics about message accumulation. We recommend that you do not set the threshold for accumulated messages to less than 100. If your business is affected even if the number of accumulated messages is small, we recommend that you configure consumption delay time alerts to monitor message accumulation.

  • Background: ApsaraMQ for RocketMQ allows you to monitor message accumulation by consumer group. You can use message accumulation alerts to prevent faults that are caused by message accumulation.

  • Risks caused by not configuring the alerts: Message accumulation is a typical scenario and capability of ApsaraMQ for RocketMQ. However, in scenarios in which messages must be processed in real time, you must monitor and manage the number of accumulated messages to prevent message accumulation from negatively affecting your business.

  • Configuration timing: We recommend that you configure the alerts after your business stabilizes.


  • Recommended threshold: We recommend that you configure the threshold based on the actual performance of your business.

  • Alert handling: After you receive a message accumulation alert, we recommend that you perform the following steps to handle the alert:

    1. On the Groups page, click the ID of the group configured in the alert rule.

    2. On the Group Details page, click the Dashboard tab.

    3. View the Accumulation amount curve in Accumulation related indicators. Then, analyze the change trend of accumulated messages and find the start time of message accumulation.

    4. Analyze the cause of message accumulation based on business changes and application logs. For information about the consumption mechanism of accumulated messages, see Consumer types.

    5. Determine whether to scale out consumer applications or fix the consumption logic defect based on the cause of message accumulation.
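For step 5, a back-of-the-envelope estimate of how long the backlog needs to drain at current rates can help you decide between scaling out consumer applications and fixing the consumption logic. A minimal sketch with names of our own choosing:

```python
def estimated_drain_seconds(backlog: int,
                            consume_tps: float,
                            produce_tps: float) -> float:
    """Estimate seconds until the accumulated messages are consumed.

    If consumers are not outpacing producers, the backlog never drains,
    which usually means the consumption logic is broken or the consumer
    group needs more instances.
    """
    net_rate = consume_tps - produce_tps
    if net_rate <= 0:
        return float("inf")  # backlog grows: scale out or fix consumers
    return backlog / net_rate

# Example: 120,000 accumulated messages, consuming 500 TPS, producing 300 TPS.
print(estimated_drain_seconds(120_000, 500, 300))  # 600.0 seconds
```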

Configure consumption delay time alerts

Note

Consumption delay time is calculated based on the delay time of the first unconsumed message in a consumer group. Consumption delay time is cumulative and sensitive to business changes. After you receive a consumption delay time alert, you must determine whether a small number of messages or all messages are delayed.

  • Background: ApsaraMQ for RocketMQ allows you to monitor consumption delay by consumer group. The consumption delay time alert provides a detailed metric for analyzing message accumulation.

  • Risks caused by not configuring the alerts: Message accumulation is a typical scenario and capability of ApsaraMQ for RocketMQ. However, in scenarios in which messages must be processed in real time, you must monitor and manage the number of accumulated messages to prevent message accumulation from negatively affecting your business.

  • Configuration timing: We recommend that you configure the alerts after your business stabilizes.


  • Recommended threshold: We recommend that you configure the threshold based on the actual performance of your business.

  • Alert handling: After you receive a consumption delay time alert, we recommend that you perform the following steps to handle the alert:

    1. On the Groups page, click the ID of the group configured in the alert rule.

    2. On the Group Details page, click the Dashboard tab.

    3. View the Accumulation amount curve in Accumulation related indicators. Then, analyze the change trend of accumulated messages and find the start time of message accumulation.

    4. Analyze the cause of message accumulation based on business changes and application logs. For information about the consumption mechanism of accumulated messages, see Consumer types.

    5. Determine whether to scale out consumer applications or fix the consumption logic defect based on the cause of message accumulation.
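Because the delay metric follows the single oldest unconsumed message (see the note at the beginning of this section), a few stuck messages can produce a large delay while the backlog stays small. The following sketch illustrates that triage; the function and its inputs are our own assumptions, not values returned by any ApsaraMQ for RocketMQ API.

```python
import time

def classify_delay_alert(oldest_unconsumed_born_ms: int,
                         accumulated_count: int,
                         small_backlog: int = 100) -> str:
    """Rough triage for a consumption delay time alert.

    Delay is measured from the first (oldest) unconsumed message, so a
    large delay with a small backlog points at a few stuck messages,
    while a large delay with a large backlog points at overall slowness.
    """
    delay_s = (time.time() * 1000 - oldest_unconsumed_born_ms) / 1000
    if accumulated_count <= small_backlog:
        return f"delay {delay_s:.0f}s, tiny backlog: inspect stuck messages"
    return f"delay {delay_s:.0f}s, backlog {accumulated_count}: consumers too slow"
```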

Configure alerts about the number of times that throttling occurs

  • Background: ApsaraMQ for RocketMQ allows you to use events that trigger throttling on a specific instance as alert metrics. This helps you understand negative impacts on your business.

  • Risks caused by not configuring the alerts: If you do not configure the alerts, you may fail to notice that throttling occurs frequently. Frequent throttling indicates that your traffic usage frequently exceeds the specification limit, in which case we recommend that you upgrade your instance configurations.

  • Configuration timing: The recommended timing varies based on the scope of the alert:

    • We recommend that you configure alerts about the number of times that throttling occurs on an instance after the instance is created.

    • We recommend that you configure alerts about the number of times that throttling occurs in a topic or consumer group after your business stabilizes.


  • Recommended threshold: We recommend that you configure the threshold based on the actual performance of your business.

  • Alert handling: After you receive an alert about the number of times that throttling occurs, we recommend that you perform the following steps to handle the alert:

    1. On the Instance Details page, click the Dashboard tab.

    2. In the Current Limiting Related Indicators section, view the curve in Restricted Request Distribution (Production). Then, analyze the time when throttling occurs and the rules for throttling.

    3. In the Instance Overview section, view the curve in Rate of messages sent by the producer to the server (bars/minute). Then, find the topic whose data is abnormal based on the time when throttling occurs and the rules for throttling and view the curve of the topic to determine whether the traffic increase meets your business requirements.

    4. If the traffic increase meets your business requirements, upgrade your instance configurations. Otherwise, troubleshoot the issue.
