Best practices for fault handling based on the observability feature - ApsaraMQ for RocketMQ

ApsaraMQ for RocketMQ provides a dashboard feature and a monitoring and alerting feature. You can use them to monitor broker status and key metrics at each stage of messaging, and to configure alert rules on key metrics so that you are notified of exceptions promptly. This topic describes how to use these two features to handle faults in ApsaraMQ for RocketMQ during routine O&M and troubleshooting.

Implementation

Core issues

The following items describe the core issues in troubleshooting:

How to detect service exceptions and send alerts about them.
How to quickly locate exceptions.

Solutions

The metrics and traces provided by ApsaraMQ for RocketMQ cover the status of each stage of messaging and the throughput of ApsaraMQ for RocketMQ brokers and resources. Metrics can be broadly divided into the following categories:

Level-1 metrics: We recommend that you use metrics that can measure the operation of your business as level-1 metrics. Exceptions in such metrics indicate issues in the business system. In most cases, these metrics can be used as monitoring and alerting metrics.
For example, an instance is throttled when the messaging transactions per second (TPS) exceeds the specification limit. You can use TPS as a monitoring metric and create an alert rule so that you can take action before throttling is triggered.
Level-2 metrics: We recommend that you use metrics that can be used to locate faults as level-2 metrics.
For example, message accumulation indicates that a fault occurred during message consumption, and the message sending success rate indicates whether exceptions occurred during message sending.
Level-3 metrics: Such metrics can be used to further analyze level-2 metrics. Level-3 metrics help identify causes for the changes in level-2 metrics.

Solution for consumption exceptions

Figure: Consumption exceptions

Use the ConsumerLagLatencyPerGidTopic metric, which indicates the delay time for message processing, as the monitoring metric and create an alert rule. For more information, see Monitoring and alerting.
This metric indicates the health of the consumption system and reflects how severely the business is affected. It is more informative than the number of accumulated messages for the following reasons:
- If message traffic is low, the number of accumulated messages may stay below the alert threshold even when an issue occurs.
- If message traffic is high, the number of accumulated messages may exceed the alert threshold and generate false alerts.
- If message traffic fluctuates greatly, you cannot configure an accurate alert threshold for message accumulation.
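The following minimal Python sketch illustrates the three points above. It is not how ConsumerLagLatencyPerGidTopic is computed: the function estimated_lag_seconds, both thresholds, and the sample numbers are hypothetical, and the delay is only approximated as backlog divided by the consumption rate.

```python
def estimated_lag_seconds(backlog: int, consume_tps: float) -> float:
    """Rough time needed to drain the backlog at the current consumption rate.

    This is an illustrative approximation, not the broker's own calculation.
    """
    if consume_tps <= 0:
        return float("inf") if backlog > 0 else 0.0
    return backlog / consume_tps


BACKLOG_THRESHOLD = 10_000  # hypothetical count-based alert threshold
LAG_THRESHOLD_S = 60.0      # hypothetical time-based alert threshold

# (accumulated messages, messages consumed per second): made-up sample values
scenarios = {
    "low traffic, stuck consumer": (500, 0.5),
    "high traffic, healthy consumer": (50_000, 20_000.0),
}

for name, (backlog, tps) in scenarios.items():
    lag = estimated_lag_seconds(backlog, tps)
    count_alert = backlog > BACKLOG_THRESHOLD
    lag_alert = lag > LAG_THRESHOLD_S
    print(f"{name}: lag={lag:.1f}s count_alert={count_alert} lag_alert={lag_alert}")
```

In this sketch, the count-based rule misses the stuck low-traffic consumer and fires on the healthy high-traffic consumer, whereas the time-based rule alerts only on the stuck consumer.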
Check whether the following two values are normal: the rocketmq_process_time metric, which indicates the time consumed for message processing, and the message processing success rate, which is the ratio of rocketmq_process_time_count{invocation_status="success"} to rocketmq_process_time_count over both invocation_status values (success and failure). This helps you determine whether the exception occurred on the consumer client.
The message processing success rate is calculated by using the following formula: Message processing success rate = Number of times that messages are successfully processed/(Number of times that messages are successfully processed + Number of times that messages fail to be processed).
You can go to the Dashboard page in the ApsaraMQ for RocketMQ console to view the statistics of the preceding metrics. For information about the dashboard, see Dashboard.
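As a minimal sketch, the success rate can be computed from two counter values as follows. The function name success_rate and the sample counts are hypothetical. In practice the counts would come from the rocketmq_process_time_count series for each invocation_status value.

```python
from typing import Optional


def success_rate(success_count: int, failure_count: int) -> Optional[float]:
    """Return success / (success + failure), or None if nothing was processed."""
    total = success_count + failure_count
    if total == 0:
        # No messages in the window: the rate is undefined, not 0% or 100%.
        return None
    return success_count / total


# Made-up sample: 980 successful and 20 failed processing attempts.
rate = success_rate(980, 20)
print(f"processing success rate: {rate:.2%}")
```

Returning None for an empty window is a design choice in this sketch. It avoids reporting an idle consumer as either fully healthy or fully failing.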
Identify the specific cause based on the business logic or the change trend of metrics. For example, if the message processing duration becomes longer, you can check whether the memory and CPU of the consumer service are overloaded. Alternatively, you can check the running status of the downstream business logic that the consumption logic depends on for further analysis.

Solution for production exceptions

Figure: Production exceptions

Check whether the message sending success rate is normal. The rate is the ratio of rocketmq_send_cost_time_count{invocation_status="success"} to rocketmq_send_cost_time_count over both invocation_status values (success and failure), and is calculated by using the following formula: Message sending success rate = Number of times that messages are successfully sent/(Number of times that messages are successfully sent + Number of times that messages fail to be sent).
You can go to the Dashboard page in the ApsaraMQ for RocketMQ console to view the statistics of the preceding metric. For information about the dashboard, see Dashboard.
Check whether the network is working as expected, or whether a broker restart caused a short-lived transmission failure.
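One way to tell a short-lived failure apart from a persistent one is to look at how long the sending success rate stays low. The following Python sketch is only an illustration under stated assumptions: the function classify_send_failures, the 0.99 threshold, the two-interval limit, and the sample series are all hypothetical and are not part of ApsaraMQ for RocketMQ.

```python
from typing import Sequence


def classify_send_failures(
    rates: Sequence[float],
    threshold: float = 0.99,           # hypothetical "normal" success rate
    max_transient_intervals: int = 2,  # hypothetical limit for a short dip
) -> str:
    """Classify a series of per-interval sending success rates."""
    longest = current = 0
    for rate in rates:
        if rate < threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    if longest == 0:
        return "healthy"
    # A dip counts as transient only if it was short and the rate has recovered.
    if longest <= max_transient_intervals and current == 0:
        return "transient"
    return "sustained"


# Made-up per-minute success rates.
print(classify_send_failures([1.0, 0.6, 1.0, 1.0]))       # short dip, recovered
print(classify_send_failures([1.0, 0.7, 0.6, 0.5, 0.6]))  # long dip, not recovered
```

A result of "transient" is consistent with an event such as a broker restart, while "sustained" suggests checking the network path between the producer and the broker. Neither result proves the cause.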