Use Managed Service for Prometheus to monitor self-managed Kafka clusters and ApsaraMQ for Kafka instances

This topic describes how to use Managed Service for Prometheus to monitor ApsaraMQ for Kafka instances and self-managed Kafka clusters.

Challenges of using a self-managed Prometheus service to monitor ApsaraMQ for Kafka instances and self-managed Kafka clusters

If you use a self-managed Prometheus service to monitor ApsaraMQ for Kafka instances and self-managed Kafka clusters, you may need to handle the following challenges:

For security and organizational management, your business is likely deployed in separate virtual private clouds (VPCs). A self-managed Prometheus service must then be deployed in each VPC, which increases deployment and O&M costs.
You must configure Prometheus, Grafana, and Alertmanager in each independent self-managed monitoring system, which is complex and time-consuming.
In some cases, the JMX agent of open source Apache Kafka consumes a large amount of CPU resources, which can affect the performance of self-managed Kafka clusters.
A self-managed Prometheus service cannot monitor ApsaraMQ for Kafka instances. As a result, you cannot monitor all your messaging clusters from one centralized system.
If your self-managed Kafka cluster is deployed on Elastic Compute Service (ECS) instances, a self-managed Prometheus service cannot flexibly discover and select scrape targets based on ECS tags, because it lacks a service discovery mechanism for ECS. To implement a similar mechanism, you must write Go code that calls the Alibaba Cloud ECS POP API, integrate the code into open source Prometheus, compile and package it, and then deploy the customized Prometheus service. This process is complex and makes version upgrades difficult.
Commonly used open source Grafana dashboards do not let you customize monitoring metrics based on the principles and best practices of Apache Kafka.
No alert templates are available for Apache Kafka. You must configure alert rules yourself, which requires manual effort and technical expertise.

Comparison between a self-managed Prometheus service and Managed Service for Prometheus

The following table compares a self-managed Prometheus service with Managed Service for Prometheus in monitoring ApsaraMQ for Kafka instances and self-managed Kafka clusters.

Item	Self-managed Prometheus service	Managed Service for Prometheus
Deployment costs and O&M costs	You must purchase ECS instances and deploy Prometheus, Grafana, and Alertmanager in multiple VPCs, which incurs high deployment and O&M costs.	Managed Service for Prometheus is a fully managed, ready-to-use service that integrates Prometheus, Grafana, and Alertmanager.
Availability, performance, and storage capacity	Overall performance and availability are low, and storage capacity is limited.	Overall performance and availability are high, and storage capacity is large.
Exporter performance	In some cases, the JMX agent of open source Apache Kafka consumes a large amount of CPU resources, which can affect the performance of self-managed Kafka clusters.	Managed Service for Prometheus continuously optimizes the performance and stability of the JMX agent of open source Apache Kafka.
Service discovery	ECS instances are discovered by using open source static configurations or a third-party service registry, which makes service discovery complex and increases O&M costs.	Managed Service for Prometheus is compatible with open source service discovery features and also provides aliyun_sd_configs. Similar to label selectors in Kubernetes service discovery, aliyun_sd_configs lets you select target ECS instances by ECS tags, which simplifies the configuration and O&M of service discovery.
Grafana dashboard	The Grafana dashboard displays only the collected metrics. You cannot customize the monitoring metrics based on the principles and best practices of Apache Kafka.	Managed Service for Prometheus provides a professional dashboard template for Apache Kafka, which helps you quickly and accurately understand the running status of the entire Apache Kafka process and troubleshoot issues.
Alert rule	No alert templates are available for Apache Kafka. You must configure alert rules yourself.	Managed Service for Prometheus provides professional and flexible alert templates based on the best practices of monitoring Apache Kafka. You can configure alert rules in the GUI.
Unified service	The self-managed Prometheus service is deployed in multiple VPCs and cannot monitor ApsaraMQ for Kafka instances, so you cannot monitor your messaging clusters from one centralized system.	Managed Service for Prometheus is a fully managed service that is integrated into ApsaraMQ for Kafka, which gives ApsaraMQ for Kafka a native, unified monitoring system.
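
Without tag-based service discovery, a self-managed setup typically has to maintain target lists by hand or with custom tooling. The following Python sketch illustrates the idea behind selecting ECS instances by tags: it filters a hypothetical in-memory inventory and emits the target-group JSON format that the open source Prometheus file_sd_configs mechanism reads. The inventory records, the app and env tags, and port 5556 for the JMX exporter are assumptions for illustration; the sketch does not call any Alibaba Cloud API and is not how aliyun_sd_configs is implemented.

```python
import json
from typing import Dict, List

# Hypothetical ECS inventory for illustration only; a real tool would
# fetch this from the ECS API instead of hard-coding it.
INSTANCES: List[Dict] = [
    {"id": "i-001", "private_ip": "192.168.0.10", "tags": {"app": "kafka", "env": "prod"}},
    {"id": "i-002", "private_ip": "192.168.0.11", "tags": {"app": "kafka", "env": "test"}},
    {"id": "i-003", "private_ip": "192.168.0.12", "tags": {"app": "redis", "env": "prod"}},
]


def matches(tags: Dict[str, str], selector: Dict[str, str]) -> bool:
    """True if every key/value pair in the selector appears in the tags."""
    return all(tags.get(key) == value for key, value in selector.items())


def build_file_sd(instances: List[Dict], selector: Dict[str, str], port: int) -> List[Dict]:
    """Build Prometheus file_sd target groups for instances whose tags match."""
    groups = []
    for inst in instances:
        if not matches(inst["tags"], selector):
            continue
        # Copy ECS tags into target labels so queries can filter on them.
        labels = {f"ecs_tag_{key}": value for key, value in inst["tags"].items()}
        labels["instance_id"] = inst["id"]
        groups.append({"targets": [f"{inst['private_ip']}:{port}"], "labels": labels})
    return groups


if __name__ == "__main__":
    # Port 5556 for the JMX exporter is an assumption, not a documented value.
    print(json.dumps(build_file_sd(INSTANCES, {"app": "kafka", "env": "prod"}, 5556), indent=2))
```

Prometheus watches file_sd files for changes, so a script like this would have to be rerun whenever instances or tags change. That ongoing maintenance is the overhead that tag-based discovery removes.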

Use Managed Service for Prometheus to monitor ApsaraMQ for Kafka

Managed Service for Prometheus is integrated into ApsaraMQ for Kafka. The main metrics include:

The traffic of instances, groups, and topics
The message accumulation of groups and topics
The disk usage of instances
The rebalance metrics of groups
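
Message accumulation is commonly measured as consumer lag. As a minimal sketch, assuming you already have each partition's log end offset and the group's committed offset (the offsets below are made up), per-partition lag is the difference between the two, and the accumulation of a group on a topic is the sum over partitions. This illustrates the general definition; it is not necessarily the exact formula that ApsaraMQ for Kafka uses.

```python
from typing import Dict


def partition_lag(log_end_offset: int, committed_offset: int) -> int:
    """Messages written to a partition but not yet consumed by the group."""
    # Clamp at zero in case the two offsets were sampled at slightly different times.
    return max(log_end_offset - committed_offset, 0)


def topic_lag(log_end: Dict[int, int], committed: Dict[int, int]) -> int:
    """Sum partition lag over all partitions of one topic for one group.

    Partitions with no committed offset are skipped in this sketch; how they
    should be counted depends on the consumer's auto.offset.reset setting.
    """
    return sum(
        partition_lag(end, committed[p])
        for p, end in log_end.items()
        if p in committed
    )


if __name__ == "__main__":
    # Hypothetical offsets for a three-partition topic.
    log_end = {0: 120, 1: 200, 2: 50}
    committed = {0: 100, 1: 200, 2: 45}
    print(topic_lag(log_end, committed))  # 25
```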

View ApsaraMQ for Kafka dashboards

ApsaraMQ for Kafka provides three monitoring dashboards for instances, groups, and topics. You can view data on the dashboards to understand the production and consumption of messages and quickly identify issues.

Instance dashboard

Log on to the ApsaraMQ for Kafka console. In the left-side navigation pane, click Instances.
Click the name of the ApsaraMQ for Kafka instance that you want to view. In the left-side navigation pane, click Prometheus Monitoring to view the monitoring data of the instance.

Consumer group dashboard

Log on to the ApsaraMQ for Kafka console. In the left-side navigation pane, click Instances.
Click the name of the ApsaraMQ for Kafka instance that you want to view. In the left-side navigation pane, click Groups. On the page that appears, click the ID of the group that you want to view and click the Prometheus Monitoring tab to view the monitoring data of the group.

Topic dashboard

Log on to the ApsaraMQ for Kafka console. In the left-side navigation pane, click Instances.
Click the name of the ApsaraMQ for Kafka instance that you want to view. In the left-side navigation pane, click Topics. On the page that appears, click the name of the topic that you want to view and click the Prometheus Monitoring tab to view the monitoring data of the topic.

Use Managed Service for Prometheus to configure alert rules for ApsaraMQ for Kafka

Log on to the ARMS console.
In the left-side navigation pane, choose Managed Service for Prometheus > Instances.
Click the name of the Prometheus instance that you want to manage to go to the Integration Center page.
Click the ApsaraMQ for Kafka card in the Installed section. In the panel that appears, click the Alerts tab to view Prometheus alerts of ApsaraMQ for Kafka. Managed Service for Prometheus provides key alert metrics for ApsaraMQ for Kafka instances, groups, and topics. You can add alert rules based on your business requirements. For more information, see Create an alert rule for a Prometheus instance.
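
Prometheus alert rules typically pair a condition with a for duration: the alert is pending while the condition holds and fires only after the condition has held continuously for that duration. The sketch below simulates this behavior for a hypothetical message accumulation alert (threshold of 1,000 messages, for duration of 120 seconds, made-up samples). It illustrates open source Prometheus alerting semantics and is not the ARMS evaluation engine.

```python
from typing import List, Tuple


def alert_states(samples: List[Tuple[int, float]], threshold: float, for_seconds: int) -> List[str]:
    """Return the alert state at each evaluation: inactive, pending, or firing.

    samples: (unix_timestamp, value) pairs in ascending time order.
    The condition is value > threshold, and it must hold continuously
    for for_seconds before the alert fires.
    """
    states = []
    active_since = None  # timestamp at which the condition last became true
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts
            states.append("firing" if ts - active_since >= for_seconds else "pending")
        else:
            # Any evaluation where the condition is false resets the timer.
            active_since = None
            states.append("inactive")
    return states


if __name__ == "__main__":
    # Hypothetical accumulation samples taken every 60 seconds.
    samples = [(0, 10), (60, 1500), (120, 1600), (180, 1700), (240, 20)]
    print(alert_states(samples, 1000, 120))
    # ['inactive', 'pending', 'pending', 'firing', 'inactive']
```

A longer for duration suppresses alerts for short spikes, such as brief accumulation during a rebalance, at the cost of slower notification.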

Use Managed Service for Prometheus to monitor self-managed Kafka clusters

You can also use Managed Service for Prometheus to monitor self-managed Kafka clusters that are deployed on ECS instances or in container environments, such as Container Service for Kubernetes (ACK) clusters, Serverless Kubernetes (ASK) clusters, and registered clusters. Basic metrics such as the number of brokers, the number of topic partitions, and consumer group lag are collected. You do not need to modify the configurations of the Kafka brokers or restart them.

When you use Managed Service for Prometheus to monitor self-managed Kafka clusters, you must also pay attention to the internal O&M metrics of the clusters.

Deploy the Kafka application component for self-managed Kafka clusters

Log on to the ARMS console.
In the left-side navigation pane, choose Integration Center, and click the Kafka card on the right.

Set the parameter values as prompted, and click OK. The following tables describe the key parameters.

ACK

Parameter	Description
Pod Label Selector	The label key and value that are configured for the pod in which the Kafka JMX agent is deployed. Managed Service for Prometheus uses this label for service discovery. For more information, see How do I deploy and configure a Kafka JMX agent?
Metric Collect Interval	The interval at which you want the service to collect monitoring data.

ECS (VPC)

Parameter	Description
Kafka cluster name	The name of the Kafka cluster, which must be unique. If the name is duplicated, the same metrics may be collected more than once, which causes dashboard errors.
Service address	The endpoint of the self-managed Kafka broker, which can be an IP address or a domain name. Separate multiple broker addresses with commas (,) or semicolons (;).
Enable SASL	Specifies whether to enable the Simple Authentication and Security Layer (SASL) feature on the Apache Kafka broker.
Enable TLS	Specifies whether to enable the Transport Layer Security (TLS) feature on the Apache Kafka broker.
Metric collection interval (seconds)	The interval at which you want the service to collect monitoring data.
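
The Service address parameter accepts several brokers separated by commas or semicolons. The following sketch shows one way to split such a value into host and port pairs, for example to check a list of brokers for typos before you enter it. The parse_broker_addresses helper is hypothetical, and falling back to port 9092 (the conventional Kafka listener port) for entries without a port is an assumption; IPv6 literals are not handled.

```python
import re
from typing import List, Tuple


def parse_broker_addresses(value: str, default_port: int = 9092) -> List[Tuple[str, int]]:
    """Split a Service address value into (host, port) pairs.

    Commas and semicolons are both treated as separators, and empty entries
    and surrounding whitespace are ignored. Entries without a port get
    default_port (an assumption). A non-numeric port raises ValueError.
    """
    brokers = []
    for entry in re.split(r"[,;]", value):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if sep:
            brokers.append((host, int(port)))
        else:
            brokers.append((entry, default_port))
    return brokers


if __name__ == "__main__":
    print(parse_broker_addresses("192.168.0.10:9092; 192.168.0.11:9093,kafka-3.example.internal"))
    # [('192.168.0.10', 9092), ('192.168.0.11', 9093), ('kafka-3.example.internal', 9092)]
```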

View the dashboards of self-managed Kafka clusters

Log on to the ARMS console.
In the left-side navigation pane, choose Managed Service for Prometheus > Instances.
Click the name of the Prometheus instance that you want to manage to go to the Integration Center page.
Click the Kafka card in the Installed section. In the panel that appears, click the Dashboards tab and then click the Grafana dashboard that you want to view.
The dashboards of the Kafka application component display the following information:
- The number of Kafka brokers.
- The number of partitions in each topic.
- The numbers of inbound messages, outbound messages, and accumulated messages in each topic.
- The number of in-sync replicas (ISRs) in each topic.
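
The ISR count is most informative when compared with the replica count: in Kafka, a partition whose ISR set is smaller than its replica set is under-replicated. The sketch below finds such partitions in hypothetical topic metadata (the broker IDs and replica layouts are made up) to show how ISR figures relate to under-replicated partitions.

```python
from typing import Dict, List


def under_replicated(partitions: Dict[int, Dict[str, List[int]]]) -> List[int]:
    """Return the IDs of partitions whose ISR is smaller than the replica set.

    partitions maps a partition ID to {"replicas": [...], "isr": [...]},
    where both lists contain broker IDs.
    """
    return sorted(
        pid for pid, meta in partitions.items()
        if len(meta["isr"]) < len(meta["replicas"])
    )


if __name__ == "__main__":
    # Hypothetical metadata for a topic with replication factor 3.
    topic = {
        0: {"replicas": [1, 2, 3], "isr": [1, 2, 3]},
        1: {"replicas": [1, 2, 3], "isr": [1, 3]},
        2: {"replicas": [2, 3, 1], "isr": [2]},
    }
    print(under_replicated(topic))  # [1, 2]
```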

Configure alert rules for self-managed Kafka clusters

On the Integration Center page that appears in the Managed Service for Prometheus console, click the Kafka card in the Installed section. In the panel that appears, click the Alerts tab to view the Prometheus alerts.

Managed Service for Prometheus provides four key alert metrics: Number of active brokers decrease, Number of Under Replicated partitions, Number of partitions, and Consumer topic lag. You can also add alert rules based on your business requirements. For more information, see Create an alert rule for a Prometheus instance.
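
The Number of active brokers decrease alert reacts to brokers leaving the cluster. A minimal sketch of the underlying check, assuming you have a time series of active broker counts (the samples below are hypothetical), flags every evaluation at which the count is lower than at the previous evaluation. The built-in alert template may use a different expression or time window.

```python
from typing import List, Tuple


def broker_drops(samples: List[Tuple[int, int]]) -> List[Tuple[int, int, int]]:
    """Find evaluations at which the active broker count fell.

    samples: (unix_timestamp, active_broker_count) pairs in ascending time order.
    Returns (timestamp, previous_count, current_count) for each decrease.
    """
    drops = []
    # Compare each sample with the one immediately before it.
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        if cur < prev:
            drops.append((ts, prev, cur))
    return drops


if __name__ == "__main__":
    # Hypothetical counts: one broker leaves at t=120 and rejoins at t=240.
    samples = [(0, 3), (60, 3), (120, 2), (180, 2), (240, 3)]
    print(broker_drops(samples))  # [(120, 3, 2)]
```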