Alibaba Cloud Elasticsearch provides monitoring metrics to help you understand cluster health, identify performance issues, and take corrective action. This topic describes available metrics, their meanings, common exception causes, and recommended handling procedures.
Metric values in this document are for reference and may differ from those displayed in the console. Always refer to the console for the most current and authoritative information.
Quick reference: Recommended thresholds
| Metric | Warning | Critical |
| --- | --- | --- |
| Disk usage | > 75% | > 85% |
| CPU utilization | > 80% | > 95% |
| Heap memory | > 75% | > 85% |
| Node Load_1m | > vCPU count | > 2 × vCPU count |
View metrics
1. Log on to the Alibaba Cloud Elasticsearch console.
2. In the left navigation menu, choose Elasticsearch Clusters.
3. Navigate to the target cluster:
   1. In the top navigation bar, select the resource group to which the cluster belongs and the region where the cluster resides.
   2. On the Elasticsearch Clusters page, find the cluster and click its ID.
4. In the left-side navigation pane, choose Cluster Monitoring.
5. View monitoring details.
View Basic Monitoring details
On the Basic Monitoring tab, select a Group Name category and a monitoring period as needed to view the monitoring details of resources in that category during the specified period.
Note:
- Click Custom to view monitoring details within a custom time period.
- The monitoring and alerting feature for Elasticsearch clusters is enabled by default, so you can view historical monitoring data on the Cluster Monitoring page. Monitoring data is available by minute and is retained for only 30 days.
Understand monitoring differences
The cluster monitoring feature provided by Alibaba Cloud Elasticsearch may differ from the monitoring feature provided by Kibana or third-party services in the following aspects:
- Sampling period differences: The sampling period of cluster monitoring differs from that of Kibana or third-party monitoring, so the collected data points, and therefore the displayed results, differ.
- Query algorithm differences: Both Alibaba Cloud Elasticsearch cluster monitoring and Kibana monitoring are affected by cluster stability when they collect data. The QPS metric in cluster monitoring may show sudden increases, negative values, or no data due to cluster jitter, whereas Kibana monitoring may show empty values.
  Note: If cluster monitoring provides more metrics than Kibana, use both features for comprehensive monitoring.
- Collection interface differences: Kibana monitoring metrics depend on the Elasticsearch API, whereas some node-level metrics in cluster monitoring (such as CPU utilization, Load_1m, and disk usage) call the underlying system interfaces of Alibaba Cloud Elasticsearch. Therefore, cluster monitoring covers not only the Elasticsearch process but also system-level resource usage.
Comprehensive metric reference
This section provides detailed information for each available monitoring metric, organized by category.
Cluster metrics
ClusterStatus
Description
This metric displays the health status of the cluster. A value of 0.00 indicates that the cluster is normal. This metric is essential to cluster monitoring. For detailed instructions, see Configure cluster alerts. The following table describes the values of the metric.
| Value | Color | Status | Description |
| --- | --- | --- | --- |
| 0.00 | Green | All primary and replica shards are available. | All indexes stored on the cluster are healthy and have no unassigned shards. |
| 1.00 | Yellow | All primary shards are available, but not all replica shards are available. | One or more indexes have unassigned replica shards. |
| 2.00 | Red | Not all primary shards are available. | One or more indexes have unassigned primary shards. |
The colors listed above correspond to the cluster status displayed on your instance's Basic Information page.
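The relationship between the health status string returned by the Elasticsearch `_cluster/health` API and the numeric values in the table above can be sketched as follows. The helper function and the sample response fragment are illustrative only, not part of the console.

```python
# Map the status string from GET _cluster/health to the numeric
# ClusterStatus metric values shown in the console:
# green -> 0.00, yellow -> 1.00, red -> 2.00.
STATUS_TO_METRIC = {"green": 0.0, "yellow": 1.0, "red": 2.0}

def cluster_status_metric(health_response: dict) -> float:
    """Translate a _cluster/health response into the metric value."""
    return STATUS_TO_METRIC[health_response["status"]]

# Hypothetical _cluster/health response fragment:
health = {"status": "yellow", "unassigned_shards": 3}
print(cluster_status_metric(health))  # 1.0: some replica shards are unassigned
```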
Common causes for exceptions
During monitoring, a metric value other than 0.00 indicates an abnormal state. Common causes include:
- The CPU utilization or heap memory usage of the nodes in the cluster is excessively high or reaches 100%.
- The disk usage of the nodes in the cluster is excessively high. For example, the disk usage is higher than 85% or reaches 100%.
- The Load_1m of the nodes is excessively high.
- The statuses of the indexes stored on the cluster are abnormal (not green).
Troubleshooting
- View the monitoring information on the Monitoring page of the Kibana console, or view the instance logs to obtain details about the issue and troubleshoot it. For example, if an index uses too much memory, you can delete some indexes.
- For cluster exceptions caused by high disk usage, troubleshoot based on Methods to troubleshoot and handle high disk usage and read_only issues.
- For instances with 1 vCPU and 2 GB of memory, if the instance status is abnormal, first upgrade the cluster to a specification with a CPU-to-memory ratio of 1:4. If the cluster is still abnormal after the upgrade, troubleshoot based on the preceding two solutions.
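After you free disk space, indexes that Elasticsearch blocked at the flood-stage disk watermark can be unblocked by clearing the `index.blocks.read_only_allow_delete` setting. The sketch below only builds the request body for `PUT /<index>/_settings`; sending it requires your cluster's endpoint and credentials, which are not shown.

```python
import json

def build_clear_read_only_body() -> str:
    """Build the _settings request body that removes the read_only block.

    Setting the value to null (None in Python) resets it to the default,
    which removes the write block.
    """
    settings = {"index.blocks.read_only_allow_delete": None}
    return json.dumps(settings)

print(build_clear_read_only_body())
# {"index.blocks.read_only_allow_delete": null}
```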
ClusterAutoSnapshotLatestStatus
Description
This metric displays the snapshot status of the automatic backup feature in the Elasticsearch console. When the metric value is 0, it indicates that snapshots are created. The following table describes the values of the metric.
| Snapshot status | Description |
| --- | --- |
| 0 | Snapshots are created. |
| -1 | No snapshots are created. |
| 1 | The system is creating a snapshot. |
| 2 | The system failed to create a snapshot. |
Common causes for exceptions
When the metric value is 2, the service is abnormal. Common causes include:
- The disk usage of the nodes in the cluster is excessively high or close to 100%.
- The cluster is abnormal.
ClusterNodeCount
This metric indicates the total number of nodes in the cluster, which is used to monitor whether the node scale meets expectations.
ClusterDisconnectedNodeCount
This metric indicates the total number of disconnected nodes in the cluster. Disconnected nodes may cause shards to be reassigned or increase query latency.
Cluster Index Count
This metric indicates the number of indexes in the cluster. Too many indexes may lead to resource contention (for example, memory, CPU).
Cluster Shard Count
This metric indicates the number of shards in the cluster. Too many shards increase management costs (for example, metadata operations). Too few shards may affect query performance (for example, uneven load distribution).
Cluster Primary Shard Count
This metric indicates the number of primary shards in the cluster. Insufficient primary shards may cause write bottlenecks.
Cluster Slow Searching Count
This metric indicates the number of slow queries in the cluster. You can use this metric to identify performance bottlenecks (such as complex queries or index design issues).
ClusterIndexQPS
If the write QPS of a cluster spikes, the CPU utilization, heap memory usage, or minute-average node load of the cluster may reach a high level. This may affect your services that run on the cluster. You should avoid this situation.
This metric shows the number of documents written to the cluster per second:
- If the cluster receives a write request that contains only one document within 1 second, the value of this metric is 1. The value increases with the number of write requests received per second.
- If multiple documents are written in a batch by using the _bulk API within 1 second, the write QPS is calculated based on the total number of documents in the request. If multiple _bulk requests are sent within 1 second, the document counts are accumulated.
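The _bulk accounting described above can be sketched by counting the action lines in an NDJSON payload. The payload below is a made-up example; a real request would be sent to your cluster's `_bulk` endpoint.

```python
def count_bulk_docs(ndjson_payload: str) -> int:
    """Count index/create/update/delete actions in a _bulk body.

    Each action line contributes one document to the write QPS
    calculation described in this topic.
    """
    actions = ("index", "create", "update", "delete")
    count = 0
    for line in ndjson_payload.strip().splitlines():
        line = line.strip()
        if any(line.startswith('{"%s"' % a) for a in actions):
            count += 1
    return count

bulk_body = """\
{"index": {"_index": "logs"}}
{"message": "first doc"}
{"index": {"_index": "logs"}}
{"message": "second doc"}
"""
# Two documents in one request sent within 1 second -> write QPS of 2.
print(count_bulk_docs(bulk_body))  # 2
```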
ClusterQueryQPS
If the query QPS of a cluster spikes, the CPU utilization, heap memory usage, or minute-average node load of the cluster may reach a high level. This may affect your services that run on the cluster. Try to avoid this situation.
This metric shows the number of queries per second (QPS) that are executed on the cluster. The number of queries per second is related to the number of primary shards in the index that you want to query.
For example, if the index from which you want to query data has five primary shards, your cluster can process five queries per second.
Cluster Slow Searching Distribution
Description
This metric is based on the logs of index.search.slowlog.query and index.search.slowlog.fetch in the slow query log. It aggregates the time taken (took_millis) and displays the distribution in intervals of 1 second (0-1s, 1-2s, up to 10s). You can configure the threshold for slow logs. For related parameters, see the index.search.slowlog.threshold.xxx parameter in Index template configuration.
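The slow log thresholds that feed this metric are index settings. A minimal sketch of a `PUT /<index>/_settings` request body follows; the 10s/5s/1s values are arbitrary examples, so choose thresholds that fit your workload.

```python
import json

# Example slow search log thresholds. The setting names follow the
# index.search.slowlog.threshold.<phase>.<level> pattern referenced above.
slowlog_settings = {
    "index.search.slowlog.threshold.query.warn": "10s",
    "index.search.slowlog.threshold.query.info": "5s",
    "index.search.slowlog.threshold.fetch.warn": "1s",
}

print(json.dumps(slowlog_settings, indent=2))
```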
Common causes for exceptions
During the monitoring period, if slow queries take longer and their number increases, service exceptions may occur. Common causes are described in the following table.

| Exception cause | Description |
| --- | --- |
| QPS surges | Query or write QPS surges or fluctuates significantly, causing high cluster pressure and longer query response times. |
| Aggregate queries or script queries | Aggregate queries require a large amount of computing resources, especially memory, for data aggregation. |
| Term queries on numeric fields | In term queries on many numeric fields (byte, short, integer, and long), constructing the bitset for document ID collections is time-consuming and slows queries. If a numeric field does not require range or aggregate queries, we recommend that you change it to a keyword field. |
| Fuzzy matching | Wildcard, regular expression, and fuzzy queries must traverse the term list in the inverted index to find all matching terms and then collect the document IDs for each term. Without prior stress testing, large-scale queries of this kind consume a large amount of computing resources. We recommend that you stress test these features in your scenario before you use them and select an appropriate scale. |
| The cluster receives a few slow query or write requests | In this case, the query and write QPS fluctuations are small or not obvious. You can click Search Slow Logs on the Query Logs page in the Alibaba Cloud Elasticsearch console to view and analyze the logs. |
| The cluster stores many indexes or shards | Elasticsearch monitors the indexes in the cluster and writes logs. If the total number of indexes or shards is excessive, CPU utilization, heap memory usage, or Load_1m can become high, which slows queries across the cluster. |
| Merge operations are performed on the cluster | Merge operations consume CPU resources, and the segment count of the corresponding node drops sharply. You can check this on the Overview page of the node in the Kibana console. |
| GC operations are performed on the cluster | GC operations attempt to release memory (for example, full GC) and consume CPU resources, which may cause CPU utilization to surge and slow queries. |
| Scheduled tasks are performed on the cluster | Data backups or other custom tasks require a large amount of I/O resources, which slows queries. |
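The recommendation above to map identifier-like numeric fields as keyword can be sketched as an index mapping. The index and field names here are made up for illustration.

```python
import json

# Example mapping: an ID that only needs exact-match term queries is
# stored as keyword; a field that needs range queries stays numeric.
mapping = {
    "mappings": {
        "properties": {
            # keyword: fast exact-match term queries; range queries
            # are not possible on this field.
            "device_id": {"type": "keyword"},
            # long: kept numeric because range queries are needed.
            "response_time_ms": {"type": "long"},
        }
    }
}

print(json.dumps(mapping, indent=2))
```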
FielddataMemoryUsedBytes
Description
This metric shows the Fielddata memory usage in the cluster. The higher the monitoring curve, the more Fielddata data is cached in heap memory. Excessive Fielddata memory usage can trigger Fielddata memory circuit breaking, affecting cluster stability.
Common causes for exceptions
During the monitoring period, if this metric occupies a large amount of heap memory, service exceptions may occur. Common causes include:
- Queries contain many sort or aggregation operations on text fields. Fielddata for such queries is not evicted by default. We recommend that you use keyword or numeric field types instead.
- Query or write QPS surges or fluctuates significantly, causing fielddata to be cached frequently.
- The cluster stores many indexes or shards. Elasticsearch monitors the indexes in the cluster and writes logs. If the total number of indexes or shards is excessive, CPU utilization, heap memory usage, or Load_1m can become high.
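A common way to avoid loading fielddata into heap is a keyword sub-field, which uses on-disk doc values for sorting and aggregation. A minimal mapping sketch follows; the index and field names are made up.

```python
import json

# Multi-field mapping: "title" supports full-text search, while
# "title.raw" supports sorting/aggregation without heap fielddata.
mapping = {
    "mappings": {
        "properties": {
            "title": {
                "type": "text",  # analyzed for full-text search
                "fields": {
                    "raw": {"type": "keyword"}  # sort/aggregate on title.raw
                },
            }
        }
    }
}

print(json.dumps(mapping, indent=2))
```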
Index metrics
BulkTotalOperation
Description
This metric displays the number of bulk requests per second for the index.
Common causes for exceptions
During the monitoring period, this metric may have no data. Common causes include:
- High cluster pressure affects the normal collection of cluster monitoring data.
- Monitoring data failed to be pushed.
IndexSearchQPS
Description
This metric shows the number of QPS on an index. The QPS value is related to the number of primary shards in the index being queried.
For example, if the index from which you want to query data has five primary shards, your cluster can process five queries per second.
Common causes for exceptions
During the monitoring period, this metric may have no data. Common causes include:
- High cluster pressure affects the normal collection of cluster monitoring data.
- Monitoring data failed to be pushed.
A sudden increase in index query QPS may cause high CPU utilization, HeapMemory usage, or Load_1m in the cluster, affecting the entire cluster service. You can optimize the index to address these issues.
IndexSearchDelayMax
Indicates the maximum time consumed by query requests to the index, measured in milliseconds.
Node resource metrics
Node CPU Utilization_ES Business
Description
This metric displays the CPU utilization percentage of each node in the cluster. If the CPU utilization is high or close to 100%, the services that run on the cluster are affected.
Common causes for exceptions
If the value of the metric spikes or significantly fluctuates, an error occurs. This issue may be caused by one or more reasons described in the following table.
| Exception cause | Description |
| --- | --- |
| QPS surges | Query or write QPS spikes or fluctuates significantly. |
| The cluster receives a few slow query or write requests | In this case, the query and write QPS fluctuates slightly or not noticeably. You can click Search Slow Logs on the Query Logs page in the Alibaba Cloud Elasticsearch console to view and analyze the logs. |
| The cluster stores many indexes or shards | Elasticsearch monitors the indexes in the cluster and writes logs. If the total number of indexes or shards is excessive, CPU utilization, heap memory usage, or Load_1m can become high. |
| Merge operations are performed on the cluster | Merge operations consume CPU resources. The segment count of the corresponding node drops sharply. You can check this on the Overview page of the node in the Kibana console. |
| GC operations are performed | GC operations attempt to free memory (for example, full GC) and consume CPU resources. As a result, CPU utilization may spike. |
| Scheduled tasks are performed on the cluster | Scheduled tasks, such as data backups or custom tasks, are performed on the cluster. |
The NodeCPUUtilization(%) metric monitors the usage of system resources of Alibaba Cloud Elasticsearch and the resource usage of tasks that run on Elasticsearch clusters.
Node Disk Usage
This metric displays the disk usage percentage of each node. Keep disk usage below 75%. Do not exceed 85%. Otherwise, the following situations may occur, which can affect your services that run on the cluster.
| Node disk usage | Description |
| --- | --- |
| > 85% | New shards cannot be assigned. |
| > 90% | The cluster attempts to migrate shards from the node to other data nodes with lower disk usage. |
| > 95% | Elasticsearch forcibly sets the index.blocks.read_only_allow_delete property on the indexes, and data can no longer be written to them. |
- We strongly recommend that you configure alerts for this metric. When alerts are triggered, resize disks, add nodes, or delete index data promptly to avoid service impact.
- The NodeDiskUtilization(%) metric monitors the usage of system resources of Alibaba Cloud Elasticsearch and the resource usage of tasks that run on Elasticsearch clusters.
Node Heap Memory Usage_ES Business
Description
This metric displays the heap memory usage percentage of each node in the cluster. When the heap memory usage is high or large memory objects exist, cluster services are affected, and GC operations are automatically triggered.
Common causes for exceptions
If the value of the metric spikes or significantly fluctuates, an error occurs. This issue may be caused by one or more reasons described in the following table.
| Exception cause | Description |
| --- | --- |
| QPS surges | Query or write QPS spikes or fluctuates significantly. |
| The cluster receives a few slow query or write requests | In this case, the query and write QPS fluctuates slightly or not noticeably. You can click Search Slow Logs on the Query Logs page in the Alibaba Cloud Elasticsearch console to view and analyze the logs. |
| The cluster receives many slow query or write requests | In this case, the query and write QPS fluctuates significantly or noticeably. You can click Indexing Slow Logs on the Query Logs page in the Alibaba Cloud Elasticsearch console to view and analyze the logs. |
| The cluster stores many indexes or shards | Elasticsearch monitors the indexes in the cluster and writes logs. If the total number of indexes or shards is excessive, CPU utilization, heap memory usage, or Load_1m can become high. |
| Merge operations are performed on the cluster | Merge operations consume CPU resources, and the segment count of the corresponding node drops sharply. You can check this on the Overview page of the node in the Kibana console. |
| GC operations are performed | GC operations attempt to free memory (for example, full GC) and consume CPU resources. This may cause heap memory usage to drop sharply. |
| Scheduled tasks are performed on the cluster | Scheduled tasks, such as data backups or custom tasks, are performed on the cluster. |
Node Memory Usage_Total
This metric displays the system memory usage of the node.
NodeStatsCpuIOWaitPercentage
This metric displays the CPU IO wait percentage of the node.
NodeLoad_1m
Description
This metric shows the 1-minute load of each node, indicating system busyness. Normally, this value is less than the number of vCPUs on the node. The following table describes the values of the metric for a node that has only one vCPU.
| Node Load_1m | Description |
| --- | --- |
| < 1 | No pending processes exist. |
| = 1 | The system has no idle resources to run additional processes. |
| > 1 | Processes are queuing for resources. |
- This metric includes not only the system-level resource usage of Alibaba Cloud Elasticsearch but also the resource usage of Elasticsearch tasks.
- Fluctuations in the NodeLoad_1m metric are typically normal. The node CPU utilization metric provides more information about such fluctuations.
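The vCPU-based thresholds from the quick-reference table at the top of this topic can be sketched as a small classifier. This is an illustrative helper, not a console feature.

```python
def classify_load(load_1m: float, vcpus: int) -> str:
    """Classify NodeLoad_1m against the thresholds in this topic:
    > 2x vCPU count is critical, > vCPU count is a warning."""
    if load_1m > 2 * vcpus:
        return "critical"
    if load_1m > vcpus:
        return "warning"
    return "normal"

print(classify_load(0.7, 1))   # normal: no pending processes
print(classify_load(1.5, 1))   # warning: processes queuing for resources
print(classify_load(9.0, 4))   # critical: load exceeds 2x the vCPU count
```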
Common causes for exceptions
If the value of this metric exceeds the number of vCPUs on a node, an error occurs. This issue may be caused by one or more of the following reasons:
- The CPU utilization or heap memory usage of the nodes in the cluster is excessively high or reaches 100%.
- The query or write QPS surges or increases significantly.
- The cluster receives slow query requests.
You can go to the Query Logs page in the Alibaba Cloud Elasticsearch console to view and analyze the corresponding logs.
Node network metrics
Node Network Plan_Input
This metric displays the number of inbound traffic packets for each node in the cluster. The monitoring cycle of the metric is 1 minute.
Node Network Plan_Output
This metric displays the number of outbound traffic packets from the data transfer plan for each node in the cluster. The monitoring cycle of the metric is 1 minute.
Node Network Bandwidth_Input
This metric displays the inbound rate of data packets per second for each node in the cluster. The monitoring cycle of the metric is 1 minute. Unit: KB/s.
Node Network Bandwidth_Output
This metric displays the outbound data packet rate per second for each node in the cluster. The monitoring cycle of the metric is 1 minute. Unit: KB/s.
NodeStatsTcpEstablished
Description
This metric displays the number of TCP connection requests initiated by clients to each node in the cluster.
Common causes for exceptions
During monitoring, when the metric value spikes or significantly fluctuates, a service error occurs. A common cause is that TCP connections initiated by clients are not released for an extended period, causing a sudden increase in TCP connections on nodes. Configure client policies to release connections.
NodeStatsIOUtil
Description
This metric displays the IO usage percentage of each node in the cluster.
Common causes for exceptions
If the metric value spikes or significantly fluctuates during monitoring, a service error occurs. This may be caused by high disk usage, which increases the average wait time for data read and write operations, causing IO usage to spike, potentially reaching 100%. Troubleshoot based on your cluster configuration and other metrics. For example, upgrade the cluster configuration.
NodeStatsNetworkRetransRate
This metric displays the network retransmission rate of the node.
Node Network Bandwidth
Node network bandwidth (KiB/s) = Node Network Bandwidth_Input (KiB/s) + Node Network Bandwidth_Output (KiB/s).
Node Network Bandwidth Usage
Node network bandwidth usage (%) = (Node Network Bandwidth_Input + Node Network Bandwidth_Output) / Node network base bandwidth (converted from Gbit/s to the same unit).
Node Network Plan
Node network plan (count) = Node Network Plan_Input (count) + Node Network Plan_Output (count).
Node Network Plan Usage
Node network plan usage (%) = (Node Network Plan_Input (count) + Node Network Plan_Output (count)) / Packet forwarding PPS.
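The bandwidth usage formula above mixes units (KiB/s metrics against a Gbit/s base bandwidth), so a conversion is needed. A minimal sketch, assuming 1 Gbit/s = 10^9 bits per second:

```python
def network_bandwidth_usage(in_kib_s: float, out_kib_s: float,
                            base_gbit_s: float) -> float:
    """Node network bandwidth usage in percent.

    Converts the base bandwidth from Gbit/s to KiB/s
    (10**9 bits/s -> bytes/s -> KiB/s) before dividing.
    """
    base_kib_s = base_gbit_s * (10**9 / 8) / 1024
    return (in_kib_s + out_kib_s) / base_kib_s * 100

# A node pushing about 61,035 KiB/s in each direction on a 1 Gbit/s link
# is at roughly 100% of its base bandwidth.
usage = network_bandwidth_usage(61035.16, 61035.16, 1.0)
print(round(usage, 1))  # 100.0
```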
Node disk metrics
Disk Bandwidth_Read
This metric displays the amount of data read from each node in the cluster per second.
Disk Bandwidth_Write
This metric displays the amount of data written to each node in the cluster per second.
Disk IOPS_Read
This metric displays the number of read requests completed per second on each node in the cluster.
Disk IOPS_Write
This metric displays the number of write requests completed per second by each node in the cluster.
DiskAverageQueueSize
This metric displays the average length of the request queue.
Disk Bandwidth
Disk bandwidth (MiB/s) = Disk Bandwidth_Read + Disk Bandwidth_Write.
Disk Bandwidth Usage_Disk
Disk bandwidth usage_disk (%) = (Disk Bandwidth_Read + Disk Bandwidth_Write) / Single disk throughput performance (MB/s).
Disk Bandwidth Usage_Node
Disk bandwidth usage_node (%) = (Disk Bandwidth_Read + Disk Bandwidth_Write) / Disk base bandwidth (converted from Gbit/s to the same unit).
NodeStatsDiskIops
Disk IOPS (count) = Disk IOPS_Read + Disk IOPS_Write.
Disk IOPS Usage_Disk
Disk IOPS usage_disk (%) = (Disk IOPS_Read + Disk IOPS_Write) / Single disk IOPS performance.
Disk IOPS Usage_Node
Disk IOPS usage_node (%) = (Disk IOPS_Read + Disk IOPS_Write) / Disk base IOPS.
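The IOPS usage formulas above reduce to a simple ratio. A minimal sketch follows; the 10,000 baseline IOPS figure is just an example, since the actual baseline depends on your disk category and size.

```python
def disk_iops_usage(read_iops: float, write_iops: float,
                    baseline_iops: float) -> float:
    """Disk IOPS usage in percent: (read + write) / baseline."""
    return (read_iops + write_iops) / baseline_iops * 100

# 3,000 read IOPS + 2,000 write IOPS against a 10,000 IOPS baseline:
print(disk_iops_usage(3000, 2000, 10_000))  # 50.0
```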
Node JVM metrics
JVMMemoryOldUsedBytes
Description
This metric shows the size of the old generation heap memory usage for each node in the cluster. When the old generation occupies a high percentage or contains large memory objects, it affects cluster services and automatically triggers GC. The collection of large objects may result in long GC durations or full GC.
Common causes for exceptions
If the value of the metric spikes or significantly fluctuates, an error occurs. This issue may be caused by one or more reasons described in the following table.
| Cause | Description |
| --- | --- |
| QPS surges | Query or write QPS spikes or fluctuates significantly. |
| Aggregation queries or script queries | Aggregation queries require a large amount of computing resources, especially memory, for data aggregation. Use them with caution. |
| Term queries on numeric fields | In term queries on many numeric fields (byte, short, integer, and long), constructing the bitset for document ID collections is time-consuming and slows queries. If a numeric field does not require range or aggregation operations, change it to a keyword field. |
| Fuzzy matching | Wildcard, regular expression, and fuzzy queries must traverse the term list in the inverted index to find all matching terms and then collect the document IDs for each term. Without prior stress testing, large-scale queries of this kind consume a large amount of computing resources. We recommend that you stress test these features in your scenario before you use them and select an appropriate scale. |
| The cluster receives a few slow query or write requests | In this case, the query and write QPS fluctuations are small or not obvious. You can go to the Query Logs page in the Alibaba Cloud Elasticsearch console and click Search Slow Logs to view and analyze the logs. |
| The cluster receives many slow query or write requests | In this case, the query and write QPS fluctuations are large or obvious. You can go to the Query Logs page in the Alibaba Cloud Elasticsearch console and click Indexing Slow Logs to view and analyze the logs. |
| The cluster stores many indexes or shards | The system monitors indexes stored on the cluster and logs index changes. If the cluster stores excessive indexes or shards, CPU utilization, heap memory usage, or the minute-average node load may reach a high level. |
| Merge operations are performed on the cluster | Merge operations consume CPU resources, and the segment count of the corresponding node drops sharply. You can check this on the Overview page of the node in the Kibana console. |
| GC operations are performed | GC operations attempt to free memory (for example, full GC), consume CPU resources, and may cause a sharp decrease in heap memory usage. |
| Scheduled tasks are performed on the cluster | Scheduled tasks, such as data backups or custom tasks, are performed on the cluster. |
NodeStatsFullGcCollectionCount
Description
This metric displays the total number of full GC operations in the cluster within 1 minute. Frequent full GC occurrences affect cluster service performance.
Common causes for exceptions
If the value of this metric is not 0, an error has occurred. This issue may be caused by one or more of the following reasons:
- High heap memory usage in the cluster.
- Large objects stored in the cluster memory.
JVMGCOldCollectionCount
Description
This metric indicates the number of Old Generation garbage collections on each node in the cluster. When the Old Generation occupies a high percentage or contains large memory objects, it affects cluster services and automatically triggers garbage collection operations. The collection of large objects may result in long GC durations or Full GC.
The full GC basic monitoring metric is obtained from logs, whereas memory metrics in advanced monitoring are collected by the Elasticsearch engine. The two methods differ in how data is acquired and applied. Evaluate cluster performance by combining all metrics.
Common causes for exceptions
For more information, see JVMMemoryOldUsedBytes.
JVMGCOldCollectionDuration
Description
This metric indicates the average time spent on Old generation garbage collection for each node in the cluster. When the Old generation area usage is high or large memory objects exist, GC operations are automatically triggered. The collection of large objects may result in longer GC durations or Full GC.
Common causes for exceptions
For more information, see JVMMemoryOldUsedBytes.
Thread pool metrics
SearchThreadpoolActiveThreads
Indicates the number of threads in the query thread pool that are currently executing tasks in the cluster.
SearchThreadpoolRejectedV2
Indicates the number of rejected requests in the query thread pool within the cluster. When all threads in the thread pool are processing tasks and the task queue is full, new query requests are rejected and exceptions are thrown.
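The rejection behavior described above can be sketched with a simplified model: a request is rejected when every thread is busy and the task queue is full. The pool size and queue capacity below are example figures, not the actual values of your cluster.

```python
def rejected_requests(incoming: int, pool_size: int, queue_capacity: int) -> int:
    """Simplified model: requests beyond (threads + queue slots) are rejected."""
    capacity = pool_size + queue_capacity
    return max(0, incoming - capacity)

# 1,500 concurrent search requests against 13 threads + 1,000 queue slots:
print(rejected_requests(1500, 13, 1000))  # 487
# Within capacity, nothing is rejected:
print(rejected_requests(100, 13, 1000))   # 0
```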
Other metrics
NodeStatsExceptionLogCount
Description
This metric shows the number of exception log entries generated on each node in the cluster. A higher value indicates that more exceptions are recorded in the cluster logs, which may affect cluster services.
Common causes for exceptions
During monitoring, if the metric value is not 0, the service is abnormal. Common causes include:
- The cluster receives abnormal query requests.
- The cluster receives abnormal write requests.
- Errors occur when the cluster runs tasks.
- Garbage collection operations have been executed.
Troubleshooting
You can go to the Query logs page in the Alibaba Cloud Elasticsearch console, and click Cluster Logs. On the Cluster Logs page, you can view detailed exception information based on the time point and analyze the cause of the exception.
If there are GC records in the Cluster Logs, they will also be counted and displayed in the NodeStatsExceptionLogCount monitoring metric.
Deprecated metric
SearchThreadpoolRejected
Indicates the number of rejected requests in the query thread pool within the cluster. This metric is deprecated. Use SearchThreadpoolRejectedV2 instead.