This topic describes how to use Alibaba Cloud Managed Service for Prometheus to monitor Cassandra.
Prerequisites
A Prometheus instance for ECS is created. For more information, see Create a Prometheus instance to monitor an ECS instance.
Limits
You can install the component only for Prometheus instances for ECS.
Step 1: Deploy a Cassandra JMX agent
Based on the version of Cassandra, download a Cassandra JMX agent to the Elastic Compute Service (ECS) instance where Cassandra resides.
Decompress the package to the MCAC_ROOT directory. Add the following information to the cassandra-env.sh file:
MCAC_ROOT=/path/to/directory
JVM_OPTS="$JVM_OPTS -javaagent:${MCAC_ROOT}/lib/datastax-mcac-agent.jar"
Important: By default, the port number that the Cassandra JMX agent exposes to Managed Service for Prometheus is 9103. To change the port number, modify the port configuration in the ${MCAC_ROOT}/config/collectd.conf.tmpl file.
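For reference, the listening port is typically defined by the collectd write_prometheus plugin in that template file. The following snippet is illustrative only and assumes a default configuration; the exact content may differ across agent versions:
LoadPlugin write_prometheus
<Plugin "write_prometheus">
  # Port on which metrics are exposed; 9103 is the plugin default
  Port "9103"
</Plugin>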
Restart Cassandra and run the curl localhost:{jmx port}/metrics command on the ECS instance to check whether data is returned. If data is returned, the Cassandra JMX agent is installed.
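For example, if you keep the default port 9103, you can verify the installation on the ECS instance as follows. The restart command is only an example; restart Cassandra in the way that matches your deployment:
# Restart Cassandra so that the -javaagent option takes effect (example command)
sudo systemctl restart cassandra
# Check whether the agent returns metric data on the default port 9103
curl -s localhost:9103/metrics | head -n 20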
Step 2: Integrate Cassandra into Managed Service for Prometheus
Procedure
Entry point 1: Integration center of the Prometheus instance
Log on to the Managed Service for Prometheus console.
In the left-side navigation pane, click Instances.
Click the name of the Prometheus instance that you want to manage to go to the Integration Center page.
Entry point 2: Integration center in the ARMS console
Log on to the Application Real-Time Monitoring Service (ARMS) console.
In the left-side navigation pane, click Integration Center. In the Components section, find Cassandra and click Add. In the panel that appears, integrate Cassandra as prompted.
Integrate Cassandra
This section describes how to integrate the Cassandra component in the integration center of the Prometheus instance.
Install or add the Cassandra component.
If this is the first time that you install the Cassandra component, perform the following operation.
In the Not Installed section of the Integration Center page, find Cassandra and click Install.
Note: You can click the card to view the common Cassandra metrics and dashboard thumbnails in the panel that appears. The metrics listed are for reference only. After you install the Cassandra component, you can view the actual metrics collected by Managed Service for Prometheus. For more information, see Key metrics.
If you have already installed the Cassandra component, perform the following operation to add it again.
In the Installed section of the Integration Center page, find Cassandra and click Add.
On the Settings tab in the STEP2 section, configure the parameters and click OK. The following table describes the parameters.
Parameter | Description |
Instance name | The name of the exporter. The name can contain only lowercase letters, digits, and hyphens (-), cannot start or end with a hyphen (-), and must be unique. |
ECS Label Key (service discovery) | The key of the ECS tag that is attached to the ECS instances on which the exporter is deployed. Managed Service for Prometheus uses this tag for service discovery. Valid values: acs:emr:nodeGroupType and acs:emr:hostGroupType. |
ECS Label value | The values of the ECS tag. Default values: CORE,MASTER. Separate multiple values with commas (,). |
JMX Agent listening port | The port on which the Cassandra JMX agent exposes metrics. Managed Service for Prometheus accesses this port to obtain metric data. Default value: 9103. |
Metrics path | The HTTP path from which Managed Service for Prometheus collects metric data from the exporter. Default value: /metrics. |
Metrics scrape interval (seconds) | The interval at which Managed Service for Prometheus collects monitoring data. Default value: 30. |
Note: You can view the monitoring metrics on the Metrics tab in the STEP2 section.
The installed components are displayed in the Installed section of the Integration Center page. Click the component. In the panel that appears, you can view information such as targets, metrics, dashboard, alerts, service discovery configurations, and exporters. For more information, see Integration center.
You can also view the status of the exporter on the Targets tab.
Step 3: View the dashboards of Cassandra
On the Dashboards tab, you can view monitoring data such as the availability, client read and write latency, and client throughput. You can also view the CPU utilization, memory usage, and disk usage of nodes.
On the Integration Center page, click the Cassandra component in the Installed section. In the panel that appears, click the Dashboards tab to view the thumbnails and hyperlinks of Cassandra dashboards. Click a hyperlink to go to the Grafana page and view the dashboard. This section describes the monitoring metrics of common dashboards.
Cluster/Node Information section
Client Read Latency, Write Delay, and Throughput section
Exceptions and Errors section
Caching and Bloom Filters section
Hardware resource usage section
Storage occupancy details section
Thread Pool Status section
JVM and Garbage Collection section
Step 4: Configure alerting
On the Integration Center page, click the Cassandra component in the Installed section. In the panel that appears, click the Alerts tab to view all Cassandra alert rules configured in Managed Service for Prometheus.
Managed Service for Prometheus allows you to enable Cassandra exporters with simple configurations and provides out-of-the-box dedicated dashboards and alerting. You can manage the exporters in the ARMS console, which reduces O&M workloads.
Managed Service for Prometheus provides multiple default alert rules for the key metrics of Cassandra. Common Cassandra alert rules are preset as templates to help O&M personnel build dashboards and alert systems. The following table lists the default alert rules.
Category | Metric | Description |
Node status | Proportion of inactive nodes in the cluster | If the value is greater than 10%, one or more nodes in the cluster are down. |
Resource usage | CPU utilization | If the CPU utilization of a node exceeds 85% in the last 5 minutes, the CPU utilization reaches the upper limit. |
Resource usage | Memory usage | If the memory usage of a node exceeds 85%, the memory usage reaches the upper limit. |
Resource usage | Hard disk usage | If the hard disk usage of a node exceeds 85%, the hard disk usage reaches the upper limit. |
Read and write latency and throughput | Read latency | If the read latency of a node exceeds 200 ms in the last 1 minute, the read latency is high. |
Read and write latency and throughput | Write latency | If the write latency of a node exceeds 200 ms in the last 1 minute, the write latency is high. |
Read and write latency and throughput | Read throughput | If the number of read operations of a node exceeds 1,000 in the last 1 minute, the read throughput is high. |
Read and write latency and throughput | Write throughput | If the number of write operations of a node exceeds 1,000 in the last 1 minute, the write throughput is high. |
Exceptions and errors | Timed-out requests | If the number of timed-out requests of a node exceeds 10 in the last 1 minute, the node is overloaded. |
Exceptions and errors | Failed requests | If the number of failed requests of a node exceeds 10 in the last 1 minute, the node is overloaded. |
Exceptions and errors | Dropped messages | If the number of dropped messages of a node exceeds 10 in the last 1 minute, the node is overloaded. |
JVM | GC time ratio | If the GC time of a node in the last 5 minutes accounts for more than 1% of the total time, garbage collection is too frequent. |
You can also create alert rules based on your business requirements. For more information, see Create an alert rule for a Prometheus instance.
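For example, to create a custom rule similar to the preset timed-out request rule, you can use a PromQL expression such as the following. This is only a sketch based on the metric described in this topic; adjust the threshold and time window to your workload:
increase(mcac_client_request_timeouts_total[1m]) > 10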
Key metrics
Cluster and node information
Metric | Level | Description | Remarks |
mcac_client_connected_native_clients | Major | Number of CQL connections | If the value is too large, lots of system resources are occupied, which causes prolonged client latency. |
mcac_table_live_disk_space_used_total | Major | Space occupied by Cassandra | If the value is too large, storage space may be insufficient, causing prolonged access latency. |
mcac_table_snapshots_size | Recommended | Size of Cassandra snapshot files | Snapshots are used to restore data. If the value is too large, storage space may be insufficient to store complete snapshots. |
collectd_uptime | Major | Node uptime | If the value is too large, the system has not been restarted for a long time and may be vulnerable to security risks. |
Key performance metrics
Metric | Level | Description | Remarks |
mcac_table_read_latency | Critical | Client read latency | If the value is too large, the read speed of the application is slow, which affects user experience. |
mcac_table_write_latency | Critical | Client write latency | If the value is too large, the write speed of the application is slow, which affects user experience. |
Exceptions and errors
Metric | Level | Description | Remarks |
mcac_client_request_timeouts_total | Critical | Timed out client requests | If the value is too large, the system is overloaded, which severely affects user experience. |
mcac_client_request_failures_total | Critical | Abnormal client requests | If the value is too large, the system is overloaded, which severely affects user experience. |
mcac_dropped_message_dropped_total | Critical | Dropped messages | If the value is too large, the system is overloaded, which severely affects user experience. |
Caching and Bloom filters
Metric | Level | Description | Remarks |
mcac_table_key_cache_hit_rate | Major | Hit rate of key_cache | If the value is too small, the read speed of the application may be slow, which affects user experience. |
mcac_table_row_cache_hit_total | Major | Number of hits of row_cache | If the value is too small, the read speed of the application may be slow, which affects user experience. |
mcac_table_row_cache_miss_total | Recommended | Number of row_cache misses | If the value is too large, the read speed of the application may be slow, which affects user experience. |
mcac_table_row_cache_hit_out_of_range_total | Recommended | Number of times that row_cache hits but the disk is still accessed | If the value is too large, the read speed of the application may be slow, which affects user experience. |
mcac_table_bloom_filter_false_ratio | Major | False-positive rate of the Bloom filter | If the value is too large, non-existent elements in the query result are misjudged as existent, which wastes query time and resources. This degrades query performance and increases query costs. |
Usage trends in CPU, memory, and disks
Metric | Level | Description | Remarks |
collectd_cpu_total | Critical | CPU utilization | If the value is too large, the system is overloaded, which prolongs client request latency and severely affects user experience. |
collectd_memory | Critical | Memory usage | If the value is too large, the system is overloaded, which prolongs the client request latency and severely affects user experience. |
collectd_df_df_complex | Critical | Hard disk usage | If the value is too large, the hard disk space is insufficient. Data cannot be stored persistently, and the system may crash. |
SSTable compaction and compression
Metric | Level | Description | Remarks |
mcac_table_pending_compactions | Major | Number of pending SSTable compaction tasks | If the value is too large, the system is overloaded, which prolongs the client request latency. We recommend that you adjust the SSTable compaction settings. |
mcac_table_compaction_bytes_written_total | Major | SSTable compaction throughput | If the value is too small, compaction is slow, which causes task accumulation. We recommend that you upgrade the hardware configuration of the node. |
mcac_table_compression_ratio | Major | SSTable compression ratio | If the value is too large, the compressed files are still large, and compression does not achieve the expected results. |
Disk file
Metric | Level | Description | Remarks |
mcac_table_live_ss_table_count | Major | Number of SSTables | If the value is too large, the hard disk usage is high, and the read/write latency is prolonged. We recommend that you adjust the SSTable compaction strategy. |
mcac_table_live_disk_space_used_total | Major | Hard disk space occupied by SSTable | If the value is too large, the hard disk usage is high, and the read/write latency is prolonged. We recommend that you configure the compression policy of SSTable. |
mcac_table_ss_tables_per_read_histogram | Major | Number of SSTables for each read operation | If the value is too large, the client read latency is high. |
mcac_commit_log_total_commit_log_size | Major | Hard disk space occupied by Commit Log | If the value is too large, the hard disk space is insufficient, the read/write performance is degraded, and the data recovery time is increased. |
mcac_table_memtable_live_data_size | Major | Space occupied by the MemTable | If the value is too large, the data write performance and node stability are degraded. |
mcac_table_waiting_on_free_memtable_space | Major | Time spent waiting for the MemTable to be released | If the value is too large, the data write performance and node stability are degraded. |
Thread pool status
Metric | Level | Description | Remarks |
mcac_thread_pools_active_tasks | Critical | Number of active tasks in the thread pool | If the value is too large, system resources are occupied, which may cause a reduced response speed and even a system crash. |
mcac_thread_pools_total_blocked_tasks_total | Critical | Number of blocked tasks in the thread pool | If the value is too large, system resources are occupied, which may cause a reduced response speed and even a system crash. |
mcac_thread_pools_pending_tasks | Critical | Number of pending tasks in the thread pool | If the value is too large, lots of system resources are occupied. If the requests that correspond to pending tasks time out, the system may crash. |
mcac_thread_pools_completed_tasks | Major | Number of completed tasks in the thread pool | This metric indicates the throughput of the system. The higher the value, the better the system performs. |
JVM
Metric | Level | Description | Remarks |
mcac_jvm_memory_used | Critical | Size of the used JVM heap memory | If the value is too large, the memory may be insufficient, which triggers frequent garbage collection, and reduces the throughput of the application. |
mcac_jvm_gc_time | Critical | Time spent by the application in GC | If the value is too large, GC is too frequent, and the system has less time to execute user tasks, which may lead to client request timeout or even a system crash. |
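As a reference for the GC time ratio rule described in Step 4, the ratio can be approximated by a PromQL expression similar to the following. The expression assumes that mcac_jvm_gc_time is a cumulative counter measured in milliseconds; verify the type and unit of the metric in your environment before you use it:
increase(mcac_jvm_gc_time[5m]) / 300000 > 0.01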