About abnormal cluster loads or status

Dive into expert troubleshooting tips for managing your Alibaba Cloud Elasticsearch cluster, addressing common issues like unbalanced loads, cluster s...

The CPU utilization and loads of some nodes in an Elasticsearch cluster are normal, whereas other nodes are in the idle state. What do I do?
This issue is caused by unbalanced loads on the cluster. Possible causes include inappropriate shard settings, uneven segment sizes, hot and cold data that are not separated, and persistent connections that are used with Server Load Balancer (SLB) instances or a multi-zone architecture. Resolve the issue based on your actual scenario. For more information, see Unbalanced loads on a cluster.
What do I do if an Elasticsearch cluster is in the yellow state?
● Cause
If the number of replica shards that you specify for an index is greater than the number of nodes minus 1, the cluster enters the yellow state. A replica shard is never allocated to the node that holds its primary shard, so the excess replica shards remain unassigned.
● Solution
Run the GET _cat/indices?v command to query the status of indexes and identify the index that is in the yellow state. Then, change the number of replica shards for the index to 0. After the cluster recovers to a normal state, change the number of replica shards for the index from 0 back to the original setting.
PUT test/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}
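The cause described above comes down to a simple comparison. The following is a minimal sketch of that check, not part of any Elasticsearch client; the function names and the sample values are assumptions made for illustration.

```python
def max_replicas(data_nodes: int) -> int:
    """Largest number_of_replicas whose shards can all be assigned.

    A replica is never placed on the node that holds its primary,
    so every copy of a shard needs a node of its own.
    """
    return max(data_nodes - 1, 0)


def would_turn_yellow(number_of_replicas: int, data_nodes: int) -> bool:
    """True if some replica shards could never be assigned."""
    return number_of_replicas > max_replicas(data_nodes)


# Illustrative values: a three-node cluster.
print(would_turn_yellow(3, 3))  # True: 3 replicas exceed 3 - 1
print(would_turn_yellow(2, 3))  # False: 2 replicas fit on the other 2 nodes
```

Running a check like this before you change number_of_replicas may help you avoid putting the cluster into the yellow state in the first place.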
What do I do if an Elasticsearch cluster is in the red state due to heavy loads?
A cluster enters the red state when at least one primary shard is unavailable, for example because an error occurs on the node that hosts the primary shard. Run the GET /_cat/indices?v command to query the status of indexes and identify the index that is in the red state. Then, troubleshoot the issue based on the following causes and solutions.
● Cause: The resources of the cluster are insufficient due to unbalanced loads on nodes.
  Solution: Change the total number of primary and replica shards to an integral multiple of the number of data nodes in the cluster to balance loads on nodes. For more information, see What do I do if shards are not evenly distributed on nodes in an Elasticsearch cluster?
● Cause: The cluster stores invalid indexes.
  Solution: Clear invalid indexes on a regular basis, such as monitoring indexes whose names start with .monitoring. For more information about how to configure monitoring indexes, see Configure monitoring indexes.
● Cause: Shards are not allocated to nodes.
  Solution: Run the GET /_cluster/allocation/explain?pretty command to query the reason why shards are not allocated to nodes and resolve the issue based on the actual situation. After the issue is resolved, run the POST /_cluster/reroute?retry_failed=true command to reallocate shards to nodes.
● Cause: The cache occupies a large amount of resources.
  Solution: Run the POST /Index name/_cache/clear?fielddata=true command to clear the cache.
● Cause: A cluster update operation such as a configuration upgrade is being performed on the cluster.
  Solution: Pause the update operation and select Forced Update on the Upgrade/Downgrade page to forcefully update the cluster. For more information, see Upgrade the configuration of a cluster.
● Cause: The resources of the cluster are insufficient because the cluster uses low specifications, such as 1 vCPU and 2 GiB of memory or 2 vCPUs and 4 GiB of memory.
  Solution: Upgrade the configuration of the cluster. For more information, see Upgrade the configuration of a cluster.
● Cause: The disk usage exceeds 85%.
  Solution: We recommend that you delete the historical data that you no longer require or expand the capacity of disks. For more information, see High disk usage and read-only indexes.
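If you request the index list as JSON (for example, with GET /_cat/indices?format=json), the index that is in the red or yellow state can be picked out programmatically. The following is a minimal sketch under that assumption; the function name and the sample rows are made up for illustration, and only the health and index fields of each row are used.

```python
from typing import Dict, List


def indexes_with_health(rows: List[Dict[str, str]], health: str) -> List[str]:
    """Return the names of indexes whose health matches, sorted by name."""
    return sorted(row["index"] for row in rows if row.get("health") == health)


# Sample rows shaped like the JSON output of the cat indices API.
# The index names and values are illustrative, not real cluster data.
sample_rows = [
    {"health": "green", "status": "open", "index": "logs-a", "pri": "3", "rep": "1"},
    {"health": "red", "status": "open", "index": "test", "pri": "5", "rep": "1"},
    {"health": "yellow", "status": "open", "index": ".monitoring-es-7", "pri": "1", "rep": "1"},
]

print(indexes_with_health(sample_rows, "red"))     # ['test']
print(indexes_with_health(sample_rows, "yellow"))  # ['.monitoring-es-7']
```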
Monitoring data or an alert shows that the CPU utilization of my Elasticsearch cluster is excessively high. What do I do?
Troubleshoot the issue based on the following causes and solutions.
● Cause: The number of queries or write requests per second spikes.
  Solution: Reduce the number of queries or write requests per second for the cluster, reduce the amount of data to write to the cluster in parallel, or scale out or up the cluster. We recommend that you perform stress testing in the production environment and select appropriate specifications.
● Cause: The cache for indexes occupies a large amount of resources.
  Solution: Run the POST /Index name/_cache/clear?fielddata=true command to clear the cache.
● Cause: The cluster uses low specifications.
  Solution: Upgrade the configuration of the cluster. For more information, see Upgrade the configuration of a cluster.
● Cause: Loads on nodes in the cluster are unbalanced.
  Solution: Change the total number of primary and replica shards to an integral multiple of the number of data nodes in the cluster to balance loads on nodes. For more information, see What do I do if shards are not evenly distributed on nodes in an Elasticsearch cluster?
What do I do if the disk usage of my Elasticsearch cluster is excessively high?
Run the DELETE /Index name command to delete invalid indexes. After the disk usage is lower than 75%, forcefully upgrade the configuration of disks in the Elasticsearch console. For more information, see Upgrade the configuration of a cluster. If the disk usage of a node is excessively high, you must optimize the configuration of shards. For more information, see What do I do if shards are not evenly distributed on nodes in an Elasticsearch cluster?
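The disk guidance in this article can be summarized as a small decision helper. The following is a sketch only: the 85% and 75% figures come from the text above, whereas the function name, the returned messages, and the way the two thresholds are combined are assumptions.

```python
def disk_usage_advice(used_percent: float) -> str:
    """Map a disk usage percentage to the action suggested in this article."""
    if not 0 <= used_percent <= 100:
        raise ValueError("used_percent must be between 0 and 100")
    if used_percent > 85:
        # Above 85%: the level this article lists as a cause of the red state.
        return "delete data you no longer require or expand the disks"
    if used_percent >= 75:
        # Still at or above 75%: keep freeing space first.
        return "keep deleting invalid indexes"
    # Below 75%: the point at which the article suggests upgrading the disks.
    return "upgrade the disk configuration in the console"


for sample in (91.0, 80.0, 60.0):  # illustrative readings
    print(sample, "->", disk_usage_advice(sample))
```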
Monitoring data or an alert shows that the memory usage of my Elasticsearch cluster is excessively high. What do I do?
Troubleshoot the issue based on the following causes and solutions.
● Cause: The cache for the cluster occupies a large amount of memory.
  Solution: If the cache occupies a large amount of memory for a short period of time, run the POST /Index name/_cache/clear?fielddata=true command to clear the cache. If the cache occupies a large amount of memory for a long period of time, upgrade the configuration of the cluster. For more information, see Upgrade the configuration of a cluster. The memory usage of the cluster may periodically increase without triggering an alert. This may be caused by business fluctuations or memory reclamation and is normal.
● Cause: The read or write throughput of the cluster is high.
  Solution: Stop the read or write operation, install a throttling plug-in, and then enable the throttling feature of the plug-in. For more information, see Use the aliyun-qos plug-in.
● Cause: Invalid indexes occupy a large amount of memory.
  Solution: Delete invalid indexes, such as monitoring indexes whose names start with .monitoring, to release resources. You can specify a retention duration for such indexes. For more information, see Configure monitoring indexes.
● Cause: Shards are not evenly distributed on nodes, and loads on nodes are unbalanced.
  Solution: Change the total number of primary and replica shards to an integral multiple of the number of data nodes in the cluster and make sure that shards are evenly distributed on nodes to balance loads on the nodes. For more information, see What do I do if shards are not evenly distributed on nodes in an Elasticsearch cluster?
● Cause: Abnormal queries exist. For example, a user on the business side sends a query request that contains a string with numerous special characters.
  Solution: Run the GET _cat/tasks?v command to obtain the ID of the time-consuming query task, and then run the GET _tasks?detailed=true&actions=*read/search command to obtain the detailed query statement. Save and analyze the statement. To quickly cancel the query, you can call the task cancel API, restart the cluster, or restart only the heavily loaded nodes in the cluster.
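To find the time-consuming query in the detailed task list, you can filter the JSON response by running time. The following is a minimal sketch that assumes the response has already been loaded into a Python dict; the function name, the threshold, and the sample data are illustrative, and only the action, running_time_in_nanos, and description fields of each task are read.

```python
from typing import Any, Dict, List, Tuple


def slow_search_tasks(
    tasks_response: Dict[str, Any], min_seconds: float
) -> List[Tuple[str, float, str]]:
    """Return (task_id, seconds, description) for long-running search tasks,
    slowest first."""
    found = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            if not task.get("action", "").endswith("read/search"):
                continue  # skip bulk writes and other non-search actions
            seconds = task.get("running_time_in_nanos", 0) / 1e9
            if seconds >= min_seconds:
                found.append((task_id, seconds, task.get("description", "")))
    return sorted(found, key=lambda item: item[1], reverse=True)


# Illustrative response fragment, not real cluster output.
sample_response = {
    "nodes": {
        "node-a": {
            "tasks": {
                "node-a:101": {
                    "action": "indices:data/read/search",
                    "running_time_in_nanos": 42_000_000_000,
                    "description": "indices[test], search_type[QUERY_THEN_FETCH]",
                },
                "node-a:102": {
                    "action": "indices:data/read/search",
                    "running_time_in_nanos": 150_000_000,
                    "description": "indices[logs-a]",
                },
                "node-a:103": {
                    "action": "indices:data/write/bulk",
                    "running_time_in_nanos": 90_000_000_000,
                    "description": "requests[500]",
                },
            }
        }
    }
}

for task_id, seconds, description in slow_search_tasks(sample_response, 10):
    print(task_id, seconds, description)
```

A task ID found this way can then be passed to the task cancel API, for example POST _tasks/node-a:101/_cancel for the sample task above.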
What do I do if shards are not evenly distributed on nodes in an Elasticsearch cluster?
Appropriately plan shards and reallocate shards for nodes. Make sure that the total number of primary and replica shards is an integral multiple of the number of data nodes in the cluster. This ensures that data is evenly distributed on each data node and prevents heavy loads on a node due to uneven shard distribution. The following descriptions provide examples on how to allocate primary and replica shards for nodes:
● If the cluster has three data nodes, you can configure three primary shards and one replica shard for each primary shard. This gives a total of six primary and replica shards, or two shards per node.
● If the cluster has eight data nodes, you can configure four primary shards and one replica shard for each primary shard. This gives a total of eight primary and replica shards, or one shard per node. Alternatively, you can configure eight primary shards and one replica shard for each primary shard. In this case, the total number of primary and replica shards is 16, or two shards per node.
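The rule in the examples above is plain arithmetic, so it can be checked before an index is created. The following is a minimal sketch; the function names are assumptions made for illustration.

```python
def total_shards(primaries: int, replicas_per_primary: int) -> int:
    """Total number of primary and replica shards for one index."""
    return primaries * (1 + replicas_per_primary)


def is_balanced_plan(primaries: int, replicas_per_primary: int, data_nodes: int) -> bool:
    """True if the total shard count is an integral multiple of the data node count."""
    if data_nodes <= 0:
        raise ValueError("data_nodes must be positive")
    return total_shards(primaries, replicas_per_primary) % data_nodes == 0


# The three plans from the examples above, plus one unbalanced plan.
print(total_shards(3, 1), is_balanced_plan(3, 1, 3))  # 6 True
print(total_shards(4, 1), is_balanced_plan(4, 1, 8))  # 8 True
print(total_shards(8, 1), is_balanced_plan(8, 1, 8))  # 16 True
print(total_shards(5, 1), is_balanced_plan(5, 1, 3))  # 10 False
```

Note that this only checks the multiple-of-nodes rule for a single index. Whether the shards actually end up evenly distributed also depends on the other indexes in the cluster.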
My Elasticsearch cluster is heavily loaded, and the cluster logs contain the java.lang.StackOverflowError error message. What do I do?
The error message indicates that a stack overflow occurs because the stack space used by Lucene exceeds the upper limit. This issue is related to regular expression-based queries and fuzzy matches and is fixed in Elasticsearch V6.0 and later. We recommend that you upgrade the cluster to a version in which the issue is fixed at the earliest opportunity, or optimize the query statement that you use. For more information, see java.lang.StackOverflowError for the entire cluster.
How do I query the size of the JVM heap memory that is allocated to an Elasticsearch cluster?
Run the GET _nodes/stats/jvm?pretty command. By default, the Java Virtual Machine (JVM) heap memory of an Elasticsearch cluster is half of the memory of the cluster. You cannot change the size of the JVM heap memory of an Elasticsearch cluster.
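The response of this command is a JSON document with one entry per node. The following is a minimal sketch of how the heap figures could be read from it, assuming the response has already been parsed into a Python dict; the function name and the sample values are illustrative, and only the name, heap_max_in_bytes, and heap_used_percent fields are used.

```python
from typing import Any, Dict


def heap_summary(stats_response: Dict[str, Any]) -> Dict[str, Dict[str, float]]:
    """Return the maximum heap size in GiB and the heap usage per node."""
    summary = {}
    for node_id, node in stats_response.get("nodes", {}).items():
        mem = node["jvm"]["mem"]
        summary[node.get("name", node_id)] = {
            "heap_max_gib": round(mem["heap_max_in_bytes"] / 2**30, 2),
            "heap_used_percent": mem["heap_used_percent"],
        }
    return summary


# Illustrative response fragment for a node with a 2 GiB heap.
sample_stats = {
    "nodes": {
        "abc123": {
            "name": "es-node-1",
            "jvm": {
                "mem": {
                    "heap_used_in_bytes": 1352914698,
                    "heap_used_percent": 63,
                    "heap_max_in_bytes": 2147483648,
                }
            },
        }
    }
}

print(heap_summary(sample_stats))
# {'es-node-1': {'heap_max_gib': 2.0, 'heap_used_percent': 63}}
```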
Alibaba Cloud Elasticsearch provides robust features for managing complex data workloads, a user-friendly interface, and seamless scalability. With the 30-day free trial, you can explore these capabilities firsthand:
Embark on Your 30-Day Free Trial
Experience Alibaba Cloud Elasticsearch today and transform your data management journey with precision, efficiency, and peace of mind.

Data Geek

96 posts | 4 followers