This topic describes how to use Alibaba Cloud Managed Service for Prometheus to monitor Cassandra.
Prerequisites
A Prometheus instance for ECS is created. For more information, see Create a Prometheus instance to monitor an ECS instance.
Limits
You can install the component only for Prometheus instances for ECS.
Step 1: Deploy a Cassandra JMX agent
Based on the version of Cassandra, download a Cassandra JMX agent to the Elastic Compute Service (ECS) instance where Cassandra resides.
Decompress the package to the MCAC_ROOT directory. Add the following information to the cassandra-env.sh file:
MCAC_ROOT=/path/to/directory
JVM_OPTS="$JVM_OPTS -javaagent:${MCAC_ROOT}/lib/datastax-mcac-agent.jar"
Important: By default, the port number that the Cassandra JMX agent exposes to Managed Service for Prometheus is 9103. To change the port number, modify the port configuration in the ${MCAC_ROOT}/config/collectd.conf.tmpl file.
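For reference, the listening port is typically defined by the collectd write_prometheus plugin in that template file. The following snippet is illustrative only and assumes a default configuration; the exact content may differ across agent versions:
LoadPlugin write_prometheus
<Plugin "write_prometheus">
  # Port on which metrics are exposed; 9103 is the plugin default
  Port "9103"
</Plugin>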
Restart Cassandra and run the curl localhost:{jmx port}/metrics command on the ECS instance to check whether data is returned. If data is returned, the Cassandra JMX agent is installed.
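For example, if you keep the default port 9103, you can verify the installation on the ECS instance as follows. The restart command is only an example; restart Cassandra in the way that matches your deployment:
# Restart Cassandra so that the -javaagent option takes effect (example command)
sudo systemctl restart cassandra
# Check whether the agent returns metric data on the default port 9103
curl -s localhost:9103/metrics | head -n 20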
Step 2: Integrate Cassandra into Managed Service for Prometheus
Procedure
Entry point 1: Integration center of the Prometheus instance
Log on to the Managed Service for Prometheus console.
In the left-side navigation pane, click Instances.
Click the name of the Prometheus instance that you want to manage to go to the Integration Center page.
Entry point 2: Integration center in the ARMS console
Log on to the Application Real-Time Monitoring Service (ARMS) console.
In the left-side navigation pane, click Integration Center. In the Components section, find Cassandra and click Add. In the panel that appears, integrate Cassandra as prompted.
Integrate Cassandra
This section describes how to integrate the Cassandra component in the integration center of the Prometheus instance.
Install or add the Cassandra component.
If this is the first time that you install the Cassandra component, perform the following operation.
In the Not Installed section of the Integration Center page, find Cassandra and click Install.
Note: You can click the card to view the common Cassandra metrics and dashboard thumbnails in the panel that appears. The metrics listed are for reference only. After you install the Cassandra component, you can view the actual metrics collected by Managed Service for Prometheus. For more information, see Key metrics.
If you have already installed the Cassandra component, perform the following operation to add it again.
In the Installed section of the Integration Center page, find Cassandra and click Add.
On the Settings tab in the STEP2 section, configure the parameters and click OK. The following table describes the parameters.
Parameter | Description |
Instance name | The name of the exporter. The name can contain only lowercase letters, digits, and hyphens (-), cannot start or end with a hyphen (-), and must be unique. |
ECS Label Key (service discovery) | The key of the ECS tag that is attached to the ECS instances on which the exporter is deployed. Managed Service for Prometheus uses this tag for service discovery. Valid values: acs:emr:nodeGroupType and acs:emr:hostGroupType. |
ECS Label value | The values of the ECS tag. Default values: CORE,MASTER. Separate multiple values with commas (,). |
JMX Agent listening port | The port on which the Cassandra JMX agent exposes metrics. Managed Service for Prometheus accesses this port to obtain metric data. Default value: 9103. |
Metrics path | The HTTP path from which Managed Service for Prometheus collects metric data from the exporter. Default value: /metrics. |
Metrics scrape interval (seconds) | The interval at which Managed Service for Prometheus collects monitoring data. Default value: 30. |
Note: You can view the monitoring metrics on the Metrics tab in the STEP2 section.
The installed components are displayed in the Installed section of the Integration Center page. Click the component. In the panel that appears, you can view information such as targets, metrics, dashboard, alerts, service discovery configurations, and exporters. For more information, see Integration center.
You can also view the status of the exporter on the Targets tab.
Step 3: View the dashboards of Cassandra
On the Dashboards tab, you can view monitoring data such as the availability, client read and write latency, and client throughput. You can also view the CPU utilization, memory usage, and disk usage of nodes.
On the Integration Center page, click the Cassandra component in the Installed section. In the panel that appears, click the Dashboards tab to view the thumbnails and hyperlinks of Cassandra dashboards. Click a hyperlink to go to the Grafana page and view the dashboard. This section describes the monitoring metrics of common dashboards.
Cluster/Node Information section
Client Read Latency, Write Delay, and Throughput section
Exceptions and Errors section
Caching and Bloom Filters section
Hardware resource usage section
Storage occupancy details section
Thread Pool Status section
JVM and Garbage Collection section
Step 4: Configure alerting
On the Integration Center page, click the Cassandra component in the Installed section. In the panel that appears, click the Alerts tab to view all Cassandra alert rules configured in Managed Service for Prometheus.
Managed Service for Prometheus allows you to enable Cassandra exporters with simple configurations and provides out-of-the-box dedicated dashboards and alerting. You can manage the exporters in the ARMS console, which reduces O&M workloads.
Managed Service for Prometheus provides multiple default alert rules for the key metrics of Cassandra. Common Cassandra alert rules are preset as templates to help O&M personnel build dashboards and alert systems. The following table lists the default alert rules.
Category | Metric | Description |
Node status | Proportion of inactive nodes in the cluster | If the value is greater than 10%, one or more nodes in the cluster are down. |
Resource usage | CPU utilization | If the CPU utilization of a node exceeds 85% in the last 5 minutes, the CPU utilization reaches the upper limit. |
Resource usage | Memory usage | If the memory usage of a node exceeds 85%, the memory usage reaches the upper limit. |
Resource usage | Hard disk usage | If the hard disk usage of a node exceeds 85%, the hard disk usage reaches the upper limit. |
Read and write latency and throughput | Read latency | If the read latency of a node exceeds 200 ms in the last 1 minute, the read latency is high. |
Read and write latency and throughput | Write latency | If the write latency of a node exceeds 200 ms in the last 1 minute, the write latency is high. |
Read and write latency and throughput | Read throughput | If the number of read operations of a node exceeds 1,000 in the last 1 minute, the read throughput is high. |
Read and write latency and throughput | Write throughput | If the number of write operations of a node exceeds 1,000 in the last 1 minute, the write throughput is high. |
Exceptions and errors | Timed-out requests | If the number of timed-out requests of a node exceeds 10 in the last 1 minute, the node is overloaded. |
Exceptions and errors | Failed requests | If the number of failed requests of a node exceeds 10 in the last 1 minute, the node is overloaded. |
Exceptions and errors | Dropped messages | If the number of dropped messages of a node exceeds 10 in the last 1 minute, the node is overloaded. |
JVM | GC time ratio | If the GC time of a node in the last 5 minutes accounts for more than 1% of the total time, garbage collection is too frequent. |
You can also create alert rules based on your business requirements. For more information, see Create an alert rule for a Prometheus instance.
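For example, to create a custom rule similar to the preset timed-out request rule, you can use a PromQL expression such as the following. This is only a sketch based on the metric described in this topic; adjust the threshold and time window to your workload:
increase(mcac_client_request_timeouts_total[1m]) > 10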
Key metrics
Cluster and node information
Metric | Level | Description | Remarks |
mcac_client_connected_native_clients | Major | Number of CQL connections | If the value is too large, lots of system resources are occupied, which causes prolonged client latency. |
mcac_table_live_disk_space_used_total | Major | Space occupied by Cassandra | If the value is too large, storage space may be insufficient, causing prolonged access latency. |
mcac_table_snapshots_size | Recommended | Size of Cassandra snapshot files | Snapshots are used to restore data. If the value is too large, storage space may be insufficient to store complete snapshots. |
collectd_uptime | Major | Node uptime | If the value is too large, the system has not been restarted for a long time and may be vulnerable to security risks. |
Key performance metrics
Metric | Level | Description | Remarks |
mcac_table_read_latency | Critical | Client read latency | If the value is too large, the read speed of the application is slow, which affects user experience. |
mcac_table_write_latency | Critical | Client write latency | If the value is too large, the write speed of the application is slow, which affects user experience. |
Exceptions and errors
Metric | Level | Description | Remarks |
mcac_client_request_timeouts_total | Critical | Timed out client requests | If the value is too large, the system is overloaded, which severely affects user experience. |
mcac_client_request_failures_total | Critical | Abnormal client requests | If the value is too large, the system is overloaded, which severely affects user experience. |
mcac_dropped_message_dropped_total | Critical | Dropped messages | If the value is too large, the system is overloaded, which severely affects user experience. |
Caching and Bloom filters
Metric | Level | Description | Remarks |
mcac_table_key_cache_hit_rate | Major | Hit rate of key_cache | If the value is too small, the read speed of the application may be slow, which affects user experience. |
mcac_table_row_cache_hit_total | Major | Number of hits of row_cache | If the value is too small, the read speed of the application may be slow, which affects user experience. |
mcac_table_row_cache_miss_total | Recommended | Number of row_cache misses | If the value is too large, the read speed of the application may be slow, which affects user experience. |
mcac_table_row_cache_hit_out_of_range_total | Recommended | Number of times that row_cache hits but the disk is still accessed | If the value is too large, the read speed of the application may be slow, which affects user experience. |
mcac_table_bloom_filter_false_ratio | Major | False-positive rate of the Bloom filter | If the value is too large, non-existent elements in the query result are misjudged as existent, which wastes query time and resources. This degrades query performance and increases query costs. |
Usage trends in CPU, memory, and disks
Metric | Level | Description | Remarks |
collectd_cpu_total | Critical | CPU utilization | If the value is too large, the system is overloaded, which prolongs client request latency and severely affects user experience. |
collectd_memory | Critical | Memory usage | If the value is too large, the system is overloaded, which prolongs the client request latency and severely affects user experience. |
collectd_df_df_complex | Critical | Hard disk usage | If the value is too large, the hard disk space is insufficient. Data cannot be stored persistently, and the system may crash. |
SSTable compaction and compression
Metric | Level | Description | Remarks |
mcac_table_pending_compactions | Major | Number of pending SSTable compaction tasks | If the value is too large, the system is overloaded, which prolongs the client request latency. We recommend that you adjust the SSTable compaction settings. |
mcac_table_compaction_bytes_written_total | Major | SSTable compaction throughput | If the value is too small, compaction is slow, which causes task accumulation. We recommend that you upgrade the hardware configuration of the node. |
mcac_table_compression_ratio | Major | SSTable compression ratio | If the value is too large, the compressed files are still large, and compression does not achieve the expected results. |
Disk file
Metric | Level | Description | Remarks |
mcac_table_live_ss_table_count | Major | Number of SSTables | If the value is too large, the hard disk usage is high, and the read/write latency is prolonged. We recommend that you adjust the SSTable compaction strategy. |
mcac_table_live_disk_space_used_total | Major | Hard disk space occupied by SSTable | If the value is too large, the hard disk usage is high, and the read/write latency is prolonged. We recommend that you configure the compression policy of SSTable. |
mcac_table_ss_tables_per_read_histogram | Major | Number of SSTables for each read operation | If the value is too large, the client read latency is high. |
mcac_commit_log_total_commit_log_size | Major | Hard disk space occupied by Commit Log | If the value is too large, the hard disk space is insufficient, the read/write performance is degraded, and the data recovery time is increased. |
mcac_table_memtable_live_data_size | Major | Space occupied by the MemTable | If the value is too large, the data write performance and node stability are degraded. |
mcac_table_waiting_on_free_memtable_space | Major | Time spent waiting for the MemTable to be released | If the value is too large, the data write performance and node stability are degraded. |
Thread pool status
Metric | Level | Description | Remarks |
mcac_thread_pools_active_tasks | Critical | Number of active tasks in the thread pool | If the value is too large, system resources are occupied, which may cause a reduced response speed and even a system crash. |
mcac_thread_pools_total_blocked_tasks_total | Critical | Number of blocked tasks in the thread pool | If the value is too large, system resources are occupied, which may cause a reduced response speed and even a system crash. |
mcac_thread_pools_pending_tasks | Critical | Number of pending tasks in the thread pool | If the value is too large, lots of system resources are occupied. If the requests that correspond to pending tasks time out, the system may crash. |
mcac_thread_pools_completed_tasks | Major | Number of completed tasks in the thread pool | This metric indicates the throughput of the system. The higher the value, the better the system performs. |
JVM
Metric | Level | Description | Remarks |
mcac_jvm_memory_used | Critical | Size of the used JVM heap memory | If the value is too large, the memory may be insufficient, which triggers frequent garbage collection, and reduces the throughput of the application. |
mcac_jvm_gc_time | Critical | Time spent by the application in GC | If the value is too large, GC is too frequent, and the system has less time to execute user tasks, which may lead to client request timeout or even a system crash. |
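As a reference for the GC time ratio rule described in Step 4, the ratio can be approximated by a PromQL expression similar to the following. The expression assumes that mcac_jvm_gc_time is a cumulative counter measured in milliseconds; verify the type and unit of the metric in your environment before you use it:
increase(mcac_jvm_gc_time[5m]) / 300000 > 0.01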