Dashboard metrics and query methods - ApsaraMQ for RocketMQ

ApsaraMQ for RocketMQ integrates with Managed Service for Prometheus and Managed Service for Grafana, which are provided by Application Real-Time Monitoring Service (ARMS), to provide the dashboard feature. Managed Service for Prometheus collects and stores metrics, and Managed Service for Grafana displays them. The dashboard feature lets you monitor metrics and collect metric data across multiple dimensions in one place, which helps you quickly understand your business status. This topic describes the scenarios, background information, metric details, billing, and query methods of the dashboard feature.

Scenarios

Scenario 1: You need to receive alerts and locate issues in a timely manner when exceptions occur during online message consumption.
Scenario 2: You need to check whether messages are sent as expected in the messaging system when the status of specific online orders is abnormal.
Scenario 3: You need to analyze the change trend of message traffic, the characteristics of traffic distribution, or message volume to help you analyze the business trend and make business plans.
Scenario 4: You need to view and analyze the upstream and downstream dependency topologies of applications to upgrade, optimize, or transform the architecture.

Background information

When you use ApsaraMQ for RocketMQ to send and receive messages, key metrics, such as accumulated messages, buffering, and processing duration in a queue, can reflect the business performance and broker status. The key metrics of ApsaraMQ for RocketMQ are used in the following business scenarios.

Message accumulation

The following figure shows the status of each message in a queue of a specific topic.

Figure: Status of messages in a queue

In the preceding figure, ApsaraMQ for RocketMQ calculates the number of messages and the processing duration at different processing stages. The metrics that are used in this process reflect the processing rate and message accumulation in the queue. By monitoring the metrics, you can determine whether exceptions occur during consumption. The following table describes the details of the metrics and the formulas that are used to calculate the metrics.

Category	Metric	Description	Calculation formula
Message quantity	Inflight messages	The messages that a consumer client is processing and for which the client has not returned the consumption results.	Number of inflight messages = Offset of the latest pulled message - Offset of the latest acknowledged message
	Ready messages	The messages that are visible to consumers and are ready for consumption on the ApsaraMQ for RocketMQ broker.	Number of ready messages = Maximum offset - Offset of the latest pulled message
	Consumer lag	The total number of messages that are being processed or are ready to be processed.	Consumer lag = Number of inflight messages + Number of ready messages
Duration	Ready time	For a normal message or an ordered message, the ready time is the time when the message is stored in the broker. For a scheduled message, the ready time is the time that is scheduled for the broker to deliver the message. For a delayed message, the ready time is the time when the specified delay period elapses. For a transactional message, the ready time is the time when a transaction is committed.	N/A
	Ready message queue time	The interval between the current point in time and the ready time of the earliest ready message. This metric indicates how soon a consumer pulls messages.	Ready message queue time = Current time - Ready time of the earliest ready message
	Consumer lag time	The interval between the ready time of the earliest unacknowledged message and the current time. This metric indicates how soon a consumer processes messages.	Consumer lag time = Current time - Ready time of the earliest unacknowledged message
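The formulas in the preceding table can be sketched in a few lines of Python. This is a minimal illustration, not broker code: the `QueueOffsets` class and its field names are hypothetical and only stand in for the three offsets that the formulas mention.

```python
from dataclasses import dataclass


@dataclass
class QueueOffsets:
    """Hypothetical snapshot of one queue's offsets for one consumer group."""
    max_offset: int     # maximum offset in the queue
    pulled_offset: int  # offset of the latest pulled message
    acked_offset: int   # offset of the latest acknowledged message


def inflight_messages(o: QueueOffsets) -> int:
    # Pulled by a consumer, but the consumption result is not yet returned.
    return o.pulled_offset - o.acked_offset


def ready_messages(o: QueueOffsets) -> int:
    # Visible on the broker, but not yet pulled.
    return o.max_offset - o.pulled_offset


def consumer_lag(o: QueueOffsets) -> int:
    return inflight_messages(o) + ready_messages(o)


def ready_message_queue_time(now_ms: int, earliest_ready_ms: int) -> int:
    # earliest_ready_ms: ready time of the earliest ready message.
    return now_ms - earliest_ready_ms


def consumer_lag_time(now_ms: int, earliest_unacked_ready_ms: int) -> int:
    # earliest_unacked_ready_ms: ready time of the earliest unacknowledged message.
    return now_ms - earliest_unacked_ready_ms
```

For example, with a maximum offset of 100, a pulled offset of 80, and an acknowledged offset of 70, the sketch yields 10 inflight messages, 20 ready messages, and a consumer lag of 30.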

Consumption in push mode

For PushConsumer, real-time message processing is based on the typical Reactor thread model of the SDK. The SDK has a built-in long polling thread, which pulls messages and stores them in a local buffer queue. The messages are then dispatched from the queue to individual message consumption threads, where the message listener runs your message consumption logic. The following figure shows the message consumption process of PushConsumer.

Figure: Message consumption process of PushConsumer

For more information, see PushConsumer.

The following items describe metrics that are related to local buffer queues when you consume messages in push mode:

Message quantity: the total number of messages in local buffer queues.
Message size: the total size of messages in local buffer queues.
Waiting duration: the duration for which a message is stored in a local buffer queue before the message is processed.

Metric details

Important

The values of metrics that are related to messaging transactions per second (TPS), API calls for messaging, and message volume are calculated based on a normal message whose size is 4 KB. When you calculate metric values for large messages and featured messages, multiples are used. For more information, see Computing specifications.

The following table describes the fields that are related to the metrics of ApsaraMQ for RocketMQ.

Field	Valid value
Metric type	Counter: a cumulative metric whose value only increases. Example: the number of produced messages. Gauge: a metric whose value can increase or decrease. The value of a gauge indicates the instantaneous value of a statistical object. Example: the TPS for API calls. Histogram: a histogram that measures the value distribution of a metric. Example: the distribution of message sizes.
Label	instance_id: the ID of the ApsaraMQ for RocketMQ instance. topic: the ApsaraMQ for RocketMQ topic. message_type: the message type. Valid values: normal (normal message), fifo (ordered message), transaction (transactional message), and delay (delayed or scheduled message). fifo_enable: indicates whether the ApsaraMQ for RocketMQ broker delivers messages for consumption in the same order as they are produced. Valid values: true (messages are delivered in order) and false (messages are delivered concurrently). uid: the ID of your Alibaba Cloud account. client_id: the ID of the ApsaraMQ for RocketMQ client. invocation_status: the result of the API call that is initiated to send messages. Valid values: success (the call succeeded) and failure (the call failed).
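Because a Counter only increases, you usually read it as an increase over a time window rather than as a raw value. The following sketch shows one way to do that on a list of samples. It is an illustration under assumptions: the function name is hypothetical, and treating a drop in value as a counter reset follows common Prometheus practice rather than anything stated in this topic.

```python
def counter_increase(samples: list[float]) -> float:
    """Return the total increase across time-ordered counter samples.

    Assumption: if a sample is lower than the previous one, the counter
    was reset to zero in between, so the new value is counted as is.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        total += (curr - prev) if curr >= prev else curr
    return total
```

For a Gauge, no such conversion is needed because each sample is already an instantaneous value.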

Metrics related to brokers

Type	Name	Unit	Description	Label
Gauge	rocketmq_instance_requests_max	count/s	The maximum value of messaging TPS in the instance per minute. Throttled requests are excluded. Rule for determining the value: The system collects one sample every second based on a 1-minute cycle. The maximum value among the 60 samples is used.	uid instance_id
Gauge	rocketmq_instance_requests_in_max	count/s	The maximum value of message sending TPS in the instance per minute. Throttled requests are excluded. Rule for determining the value: The system collects one sample every second based on a 1-minute cycle. The maximum value among the 60 samples is used.	uid instance_id
Gauge	rocketmq_instance_requests_out_max	count/s	The maximum value of message receiving TPS in the instance per minute. Throttled requests are excluded. Rule for determining the value: The system collects one sample every second based on a 1-minute cycle. The maximum value among the 60 samples is used.	uid instance_id
Gauge	rocketmq_topic_requests_max	count/s	The maximum value of message sending TPS in the topics of the instance per minute. Throttled requests are excluded. Rule for determining the value: The system collects one sample every second based on a 1-minute cycle. The maximum value among the 60 samples is used.	uid instance_id topic
Gauge	rocketmq_group_requests_max	count/s	The maximum value of message receiving TPS in the consumer groups of the instance per minute. Throttled requests are excluded. Rule for determining the value: The system collects one sample every second based on a 1-minute cycle. The maximum value among the 60 samples is used.	uid instance_id consumer_group
Gauge	rocketmq_instance_requests_in_threshold	count/s	The throttling threshold for message sending in the instance.	uid instance_id
Gauge	rocketmq_instance_requests_out_threshold	count/s	The throttling threshold for message receiving in the instance.	uid instance_id
Gauge	rocketmq_throttled_requests_in	count	The number of throttled requests during message sending.	uid instance_id topic message_type
Gauge	rocketmq_throttled_requests_out	count	The number of throttled requests during message receiving.	uid instance_id topic fifo_enable consumer_group
Gauge	rocketmq_instance_elastic_requests_max	count/s	The maximum scaling value of messaging TPS in the instance.	uid instance_id
Counter	rocketmq_requests_in_total	count	The number of API calls initiated to send messages.	uid instance_id topic message_type
Counter	rocketmq_requests_out_total	count	The number of API calls initiated to receive messages.	uid instance_id topic consumer_group fifo_enable
Counter	rocketmq_messages_in_total	message	The number of messages that producers send to the broker.	uid instance_id topic message_type
Counter	rocketmq_messages_out_total	message	The number of messages that the broker delivers to consumers. The messages include messages that are being processed, successfully processed, and failed to be processed.	uid instance_id topic consumer_group fifo_enable
Counter	rocketmq_throughput_in_total	byte	The throughput when producers send messages to the broker.	uid instance_id topic message_type
Counter	rocketmq_throughput_out_total	byte	The throughput when the broker delivers messages to consumers. The messages include messages that are being processed, successfully processed, and failed to be processed.	uid instance_id topic consumer_group fifo_enable
Counter	rocketmq_internet_throughput_out_total	byte	The amount of outbound Internet traffic that is used for messaging.	uid instance_id topic message_type
Histogram	rocketmq_message_size	byte	The distribution of message sizes. Data is collected for this metric only when messages are sent. The following items describe the distribution ranges: le_1_kb: ≤ 1 KB le_4_kb: ≤ 4 KB le_512_kb: ≤ 512 KB le_1_mb: ≤ 1 MB le_2_mb: ≤ 2 MB le_4_mb: ≤ 4 MB le_overflow: > 4 MB	uid instance_id topic message_type
Gauge	rocketmq_consumer_ready_messages	message	The number of ready messages. Ready messages are messages that are ready on the broker and can be consumed by consumers. This metric reflects the number of messages that are not processed by consumers.	uid instance_id topic consumer_group
Gauge	rocketmq_consumer_inflight_messages	message	The number of inflight messages. This metric reflects the total number of messages that consumer clients are processing and for which the client has not returned the consumption results.	uid instance_id topic consumer_group
Gauge	rocketmq_consumer_queueing_latency	ms	The queuing time for ready messages in a consumer group. The value is the interval between the ready time of the earliest ready message and the current time. This metric indicates how soon a consumer pulls messages.	uid instance_id topic consumer_group
Gauge	rocketmq_consumer_lag_latency	ms	The delay before messages are consumed. The value is the interval between the ready time of the earliest unacknowledged message and the current time. This metric indicates how soon a consumer processes messages.	uid instance_id topic consumer_group
Counter	rocketmq_send_to_dlq_messages	message	The number of new dead-letter messages per minute. A dead-letter message is a message that fails to be delivered after the maximum number of retries is reached. Dead-letter messages are saved to a specific topic or discarded based on the dead-letter policy that is configured for the consumer group.	uid instance_id topic consumer_group
Gauge	rocketmq_storage_size	byte	The size of the storage space that is used by the instance, including the storage space that is used by all files.	uid instance_id

Metrics related to producers

Type	Name	Unit	Description	Label
Histogram	rocketmq_send_cost_time	ms	The distribution of the time consumed to successfully call the API operation to send messages. The following items describe the distribution ranges: le_1_ms le_5_ms le_10_ms le_20_ms le_50_ms le_200_ms le_500_ms le_overflow	uid instance_id topic client_id invocation_status

Metrics related to consumers

Type	Name	Unit	Description	Label
Histogram	rocketmq_process_time	ms	The distribution of the time consumed by push consumers to process messages, including successful and failed processing. The value of this metric is calculated by using the following formula: `rocketmq_process_time = Process end time - Process start time` The following items describe the distribution ranges: le_1_ms le_5_ms le_10_ms le_100_ms le_10000_ms le_60000_ms le_overflow	uid instance_id consumer_group topic client_id invocation_status
Gauge	rocketmq_consumer_cached_messages	message	The number of messages in the local buffer queues of push consumers.	uid instance_id consumer_group topic client_id
Gauge	rocketmq_consumer_cached_bytes	byte	The total size of messages in the local buffer queues of push consumers.	uid instance_id consumer_group topic client_id
Histogram	rocketmq_await_time	ms	The distribution of queuing time for messages in the local buffer queues of push consumers. The value of this metric is calculated by using the following formula: `rocketmq_await_time = Process start time - Arrival time` The following items describe the distribution ranges: le_1_ms le_5_ms le_20_ms le_100_ms le_1000_ms le_5000_ms le_10000_ms le_overflow	uid instance_id consumer_group topic client_id
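The consumer-side formulas and the local buffer metrics can be sketched as follows. This is a minimal illustration, assuming a hypothetical `CachedMessage` record that holds a message body and the time at which the message arrived in the local buffer queue. It is not the SDK implementation.

```python
from dataclasses import dataclass


@dataclass
class CachedMessage:
    """Hypothetical record of a message held in a local buffer queue."""
    body: bytes
    arrival_ms: int  # time at which the message arrived in the local buffer queue


def cached_messages(buffer: list[CachedMessage]) -> int:
    # Corresponds to the message quantity in the local buffer queues.
    return len(buffer)


def cached_bytes(buffer: list[CachedMessage]) -> int:
    # Corresponds to the total message size in the local buffer queues.
    return sum(len(m.body) for m in buffer)


def await_time(process_start_ms: int, arrival_ms: int) -> int:
    # rocketmq_await_time = Process start time - Arrival time
    return process_start_ms - arrival_ms


def process_time(process_end_ms: int, process_start_ms: int) -> int:
    # rocketmq_process_time = Process end time - Process start time
    return process_end_ms - process_start_ms
```

A long await time with a short process time suggests that messages wait in the buffer for a free consumption thread, whereas a long process time points at the consumption logic itself.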

Billing

Dashboard metrics that are used in ApsaraMQ for RocketMQ are basic metrics in Managed Service for Prometheus. You are not charged for basic metrics in Managed Service for Prometheus. Therefore, you can use the dashboard feature of ApsaraMQ for RocketMQ free of charge.

For more information, see Metrics and Pay-as-you-go.

Prerequisites

Managed Service for Prometheus is activated. For more information, see Activate ARMS.
The following service-linked role is created:
- Role name: AliyunServiceRoleForOns.
- Role policy name: AliyunServiceRolePolicyForOns.
- Permission description: Allow ApsaraMQ for RocketMQ to assume the role to access CloudMonitor and ARMS to implement the monitoring, alerting, and dashboard features.
- For more information, see Service-linked roles.

View dashboard metrics

You can view dashboard metrics on the following pages in the ApsaraMQ for RocketMQ console:

Dashboard page: displays metrics about all topics and consumer groups on an instance.
Instance Details page: displays the producer overview, billing metrics, and throttling metrics of the specified instance.
Topic Details page: displays metrics that are related to message production and producer clients of the specified topic.
Group Details page: displays metrics that are related to message accumulation and consumer clients of the specified consumer group.

Log on to the ApsaraMQ for RocketMQ console. In the left-side navigation pane, click Instances.
In the top navigation bar, select a region, such as China (Hangzhou). On the Instances page, click the name of the instance that you want to manage.
Use one of the following methods to view the dashboard:
- On the Instance Details page, click the Dashboard tab.
- In the left-side navigation pane of the Instance Details page, click Dashboard.
- In the left-side navigation pane of the Instance Details page, click Topics. On the page that appears, click the name of the topic that you want to manage. On the Topic Details page, click the Dashboard tab.
- In the left-side navigation pane of the Instance Details page, click Groups. On the page that appears, click the name of the group that you want to manage. On the Group Details page, click the Dashboard tab.

FAQ about the dashboard

How do I obtain metrics on the dashboard?

Log on to the ARMS console by using your Alibaba Cloud account.
In the left-side navigation pane, click Integration Center.
On the Integration Center page, enter RocketMQ in the search field and click the search icon.
In the search result, select the cloud service whose monitoring data you want to integrate into ARMS. Example: Aliyun RocketMQ (5.0) Service. For more information, see Step 1: Integrate the monitoring data of the cloud service into Managed Service for Prometheus.
After you integrate the monitoring data of the cloud service into ARMS, click Integration Management in the left-side navigation pane.
On the Cloud Service Region tab, click the name of the environment that you want to manage.
In the Basic Information section of the Component Management tab, click the cloud service region next to Default Metric Storage.
On the Settings tab of the page that appears, view the methods used to access different types of data.

How do I integrate metric data provided by the dashboard of ApsaraMQ for RocketMQ into a self-managed Grafana system?

All metric data on the dashboard of ApsaraMQ for RocketMQ is stored in Alibaba Cloud Managed Service for Prometheus. Follow the procedure in the "How do I obtain metrics on the dashboard?" section to integrate the monitoring data of ApsaraMQ for RocketMQ into Managed Service for Prometheus and obtain the environment name and HTTP API URL. Then, use the HTTP API URL to integrate the metric data on the dashboard of ApsaraMQ for RocketMQ into a self-managed Grafana system. For more information, see Use an HTTP API URL to connect a Prometheus instance to a self-managed Grafana system.
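Outside Grafana, the same HTTP API URL can in principle be queried directly. The following sketch assumes that the URL accepts the standard Prometheus instant-query path `/api/v1/query` and that no extra authentication is required. Both points are assumptions that you should verify against the Settings tab of your environment. The function names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urlencode


def build_query_url(http_api_url: str, promql: str) -> str:
    """Build an instant-query URL (assumes the standard /api/v1/query path)."""
    return http_api_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})


def query_instant(http_api_url: str, promql: str, timeout: float = 10.0) -> list:
    """Run an instant query and return the list of result series."""
    url = build_query_url(http_api_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]
```

For example, passing the HTTP API URL and the expression `rocketmq_consumer_ready_messages{instance_id="<your instance ID>"}` to `query_instant` would return the ready-message series for that instance, provided that the assumptions above hold.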

What is the maximum TPS of an instance?

Maximum TPS: The system collects one TPS value every second based on a 1-minute cycle. The maximum value among the 60 values is known as the maximum TPS of the minute.

Example:

An ApsaraMQ for RocketMQ instance produces 60 normal messages in a specific minute. If each message is 4 KB in size, the message production rate of the instance is 60 messages per minute. The following items describe how to calculate the maximum TPS of the instance:

If all 60 messages are sent in the first second, the TPS value for the first second is 60, and the TPS values for the other 59 seconds are all 0.
In this case, the maximum TPS of the instance is 60.
If 40 messages are sent in the first second and 20 messages are sent in the second second, the TPS value for the first second is 40, the TPS value for the second second is 20, and the TPS values for the other 58 seconds are all 0.
In this case, the maximum TPS of the instance is 40.
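The two cases above can be reproduced with a short sketch. The function name `max_tps` is hypothetical, and the input is assumed to be the 60 per-second TPS values that are collected in one minute.

```python
def max_tps(per_second_tps: list[int]) -> int:
    """Return the maximum TPS of one minute from its 60 per-second values."""
    if len(per_second_tps) != 60:
        raise ValueError("expected exactly 60 per-second values")
    return max(per_second_tps)


# Case 1: all 60 messages are sent in the first second.
case_1 = [60] + [0] * 59
# Case 2: 40 messages in the first second, 20 in the second second.
case_2 = [40, 20] + [0] * 58

print(max_tps(case_1))  # 60
print(max_tps(case_2))  # 40
```

Both cases have the same per-minute message count, but the maximum TPS differs because it depends on how the traffic is distributed across the 60 seconds.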