By Wei Bin, CTO of Puxiang Communication Technology
If you have looked at Elasticsearch logs, you have probably seen entries similar to the examples below:
[2018-06-30T17:57:23,848][WARN ][o.e.m.j.JvmGcMonitorService] [qoo--eS] [gc][228384] overhead, spent [2.2s] collecting in the last [2.3s]
[2018-06-30T17:57:29,020][INFO ][o.e.m.j.JvmGcMonitorService] [qoo--eS] [gc][old][228385][160772] duration [5s], collections [1]/[5.1s], total [5s]/[4.4d], memory [945.4mb]->[958.5mb]/[1007.3mb], all_pools {[young] [87.8mb]->[100.9mb]/[133.1mb]}{[survivor] [0b]->[0b]/[16.6mb]}{[old] [857.6mb]->[857.6mb]/[857.6mb]}
Based on the keyword [gc], you may have guessed that the logs are related to garbage collection (GC). However, do you understand the meaning of each part? This article will help you better understand the specifics.
First, let's quickly go over some GC basics.
Note: If you are already familiar with GC, you can skip the following section.
GC stands for garbage collection or garbage collector.
If you write a program in C, you have to manually call malloc and free to request and release the memory that stores your data. If you forget to release the memory, a memory leak occurs: useless data keeps occupying precious memory. If you write a program in Java, however, you do not have to explicitly request or release memory, because the Java virtual machine (JVM) manages memory automatically. Most importantly, the JVM provides GC, which independently releases the memory occupied by useless data, also known as garbage.
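For example, the following minimal Java sketch (the class name GarbageDemo is made up for illustration) shows the difference in practice: the program allocates a buffer and then simply drops the reference, leaving it to the JVM to reclaim the memory; there is no free() call anywhere.
public class GarbageDemo {
    public static void main(String[] args) {
        // Allocate 10 MB on the heap; no explicit free is ever needed.
        byte[] buffer = new byte[10 * 1024 * 1024];
        System.out.println("Allocated " + buffer.length + " bytes");

        // Drop the only reference. The array is now unreachable (garbage)
        // and will be reclaimed whenever the JVM decides to collect.
        buffer = null;

        // Only a hint; the JVM is free to ignore it.
        System.gc();
    }
}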
The main reason we need to study GC is that stop-the-world (STW) pauses occur during collection. During an STW pause, all user threads stop working. If these pauses last too long, the availability and real-time performance of the application suffer greatly.
GC mainly has to resolve the following problems:
How do we find garbage?
How do we recycle garbage?
When do we recycle garbage?
Let's answer each question one by one.
Garbage refers to objects that are no longer used or referenced. In Java, objects are created on the heap (in this article, we assume that all objects live on the heap). To find garbage, you need to determine whether an object is still referenced. You can use one of the following methods:
Method 1: reference counting. Each object keeps a count of how many references point to it; an object whose count drops to zero is garbage.
Method 2: reachability analysis. Starting from a set of GC roots, the collector traverses the object graph; any object it cannot reach is garbage.
Method 1 is simple, direct, and efficient, but inaccurate. In particular, it cannot reclaim garbage objects that reference each other.
Method 2 is the approach widely used today. As the starting points of the traversal, GC roots are critical in this method, but a detailed description of them is beyond the scope of this article.
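The following minimal sketch (class names invented for illustration) shows why method 1 breaks down on mutually referenced garbage: after main drops its references, the two nodes still point at each other, so a reference counter never reaches zero, yet neither node is reachable from any GC root, so reachability analysis correctly treats both as garbage.
public class CircularReferenceDemo {
    static class Node {
        Node other;                       // reference to the partner object
        byte[] payload = new byte[1024];  // some data so the object is not trivial
    }

    public static void main(String[] args) {
        Node a = new Node();
        Node b = new Node();
        a.other = b;   // a -> b
        b.other = a;   // b -> a

        // Drop the root references. a and b still reference each other, so a
        // reference count would never drop to zero, but neither object is
        // reachable from a GC root anymore, so both are garbage under
        // reachability analysis.
        a = null;
        b = null;

        System.gc();   // only a hint that a collection may run
    }
}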
After we find garbage, how do we recycle it? This seems like a silly question: can't we just collect the garbage and throw it away? In a program, doesn't that simply mean marking the space occupied by these objects as free? That is essentially what the most basic recycling algorithm, the mark-sweep algorithm, does.
The algorithm is very simple: the collector uses reachability analysis to mark the garbage and then reclaims the space it occupies. However, a significant problem appears after the program runs for a while: the memory fills up with small fragments, so a request for a large object may fail even though plenty of total memory is free, resulting in an out-of-memory (OOM) error. In addition, the many free fragments must be tracked with a data structure such as a free list, which adds bookkeeping overhead and reduces efficiency. The following figure shows the result of this algorithm.
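To make the two phases concrete, here is a toy sketch of mark-sweep over a tiny invented object graph (this is illustrative Java, not JVM internals): the mark phase traverses everything reachable from the roots, and the sweep phase discards whatever was not marked.
import java.util.*;

public class MarkSweepSketch {
    static class Obj {
        final String name;
        final List<Obj> refs = new ArrayList<>();
        Obj(String name) { this.name = name; }
    }

    public static void main(String[] args) {
        Obj a = new Obj("a"), b = new Obj("b"), c = new Obj("c"), d = new Obj("d");
        a.refs.add(b);                      // a -> b, reachable from the root
        c.refs.add(d); d.refs.add(c);       // c <-> d, unreachable from any root
        List<Obj> roots = List.of(a);
        List<Obj> heap = new ArrayList<>(List.of(a, b, c, d));

        // Mark phase: traverse the object graph starting from the GC roots.
        Set<Obj> marked = new HashSet<>();
        Deque<Obj> stack = new ArrayDeque<>(roots);
        while (!stack.isEmpty()) {
            Obj o = stack.pop();
            if (marked.add(o)) stack.addAll(o.refs);
        }

        // Sweep phase: everything that was not marked is garbage.
        heap.removeIf(o -> !marked.contains(o));
        heap.forEach(o -> System.out.println("survived: " + o.name));  // prints a and b
    }
}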
To resolve the efficiency problem of the preceding algorithm, the copying algorithm was proposed. It splits the memory into two halves and uses only one half at a time. When that half runs out of space, GC is triggered and the live objects are copied into the other half; the process then repeats. Allocation is efficient because the system does not need to maintain a free list; it only needs to bump the allocation pointer forward. However, the biggest problem with this algorithm is that a lot of memory is wasted: after all, at most 50% of the memory is in use at any time.
Despite its low memory utilization, this algorithm is very efficient when Java objects are short-lived. According to IBM research, up to 98% of Java objects are extremely short-lived, which means that most of the memory is reclaimed during each GC and only a small amount of live data needs to be copied. Therefore, the execution efficiency is very high.
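The following toy simulation (invented for illustration; it tracks only object sizes, not real objects) shows the two properties discussed above: allocation is a simple pointer bump, and a collection copies the few surviving objects into the other half so that the active half starts out compact again.
import java.util.*;

public class CopyingSketch {
    static final int HALF = 1024;                        // size of each semispace, in bytes
    int top = 0;                                         // bump pointer into the active semispace
    final List<Integer> liveSizes = new ArrayList<>();   // sizes of the objects that stay live

    // Allocation is just a pointer bump; no free list has to be searched.
    int allocate(int size, boolean staysLive) {
        if (top + size > HALF) collect();                // active half is full: copy and flip
        if (top + size > HALF) throw new OutOfMemoryError("semispace exhausted");
        int address = top;
        top += size;
        if (staysLive) liveSizes.add(size);              // pretend this object remains reachable
        return address;
    }

    // "Copy" every live object into the other semispace and flip; dead objects are left behind.
    void collect() {
        int newTop = 0;
        for (int size : liveSizes) newTop += size;       // survivors end up packed together
        top = newTop;
        System.out.println("after copy, " + newTop + " of " + HALF + " bytes are in use");
    }

    public static void main(String[] args) {
        CopyingSketch heap = new CopyingSketch();
        for (int i = 0; i < 100; i++) heap.allocate(64, i % 10 == 0);  // most objects die young
    }
}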
The mark-compact algorithm resolves the memory fragmentation problem of mark-sweep. As shown in the following figure, it slides the surviving objects together so that the free memory becomes one contiguous block during the recycling phase. However, the additional compacting phase increases the GC time.
Since most Java objects are short-lived, we can divide the heap into a young generation and an old generation. The former stores short-lived objects, while the latter stores long-lived objects; of course, long-lived objects are promoted there from the young generation. We can then use different recycling algorithms for the two generations: for example, the copying algorithm works well for the young generation, while the mark-sweep or mark-compact algorithm is better suited to the old generation. This generational collection algorithm is the one most commonly used today. The following figure shows how the JVM heap is divided by generation. The young generation is further divided into the eden, survivor 0 (S0), and survivor 1 (S1) spaces so that the copying algorithm can be used efficiently.
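You can observe this layout on a running JVM with the standard java.lang.management API, as in the sketch below (the class name HeapPoolsDemo is invented). The exact pool names depend on the collector in use; with ParNew plus CMS, for example, you will see names such as Par Eden Space, Par Survivor Space, and CMS Old Gen.
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;

public class HeapPoolsDemo {
    public static void main(String[] args) {
        // Each heap pool corresponds to a generation or space: eden, survivor, old.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                System.out.printf("%-20s used=%d max=%d%n",
                        pool.getName(),
                        pool.getUsage().getUsed(),
                        pool.getUsage().getMax());
            }
        }
    }
}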
The following figure shows an example of the GC process after the memory is divided by generation.
The GC algorithms introduced above are implemented by garbage collectors, which are what we actually use. Modern collectors usually follow the generational approach. The following figure briefly summarizes the collectors used in the young and old generations.
As shown in the preceding figure, different collectors are used in the young and old generations, and the collectors connected by a line can be used in combination. Based on how they work, these collectors can be classified as serial, parallel, or concurrent collectors, along with the newer region-based collectors such as G1 and ZGC.
Let's take a brief look at how different GCs operate.
The Serial collector runs in the young generation and the Serial Old collector runs in the old generation; both use a single GC thread while all user threads are paused. The following figure shows the general execution process.
The ParNew or Parallel Scavenge collector runs in the young generation while the Parallel Old collector runs in the old generation. The following figure shows the general execution process. In this mode, multiple GC threads work in parallel, so efficiency is higher than in the serial mode.
Currently, concurrent GC threads only run in the old generation. Concurrent mark-sweep (CMS) GCs are the most popular at the moment. The execution of CMS GCs is divided into multiple stages. User threads must be stopped only in some stages. We do not have space to go into detail here. If you are interested, you can read some relevant articles on your own. The following figure shows the general execution process of CMS GCs in the old generation.
Currently, the Garbage-First collector (G1) and the Z Garbage Collector (ZGC) are the most advanced collectors. Their operating mechanisms differ from all of the preceding collectors: instead of a few large contiguous generational spaces, they divide the heap into many small regions (G1 still distinguishes young and old regions logically). Details are not described here; if you are interested, you can read up on them separately.
CMS GC is configured for Elasticsearch by default. ParNew GCs are used in the young generation and CMS GCs are used in the old generation. You can see the following configurations in config/jvm.options:
## GC configuration
## Use the CMS collector in the old generation (ParNew is then used in the young generation).
-XX:+UseConcMarkSweepGC
## Start a CMS collection once old-generation occupancy reaches 75%.
-XX:CMSInitiatingOccupancyFraction=75
## Rely only on the threshold above instead of the JVM's own heuristics.
-XX:+UseCMSInitiatingOccupancyOnly
Now we know how to find and recycle garbage, but when should it be recycled? Simply put, a young-generation (minor) GC is triggered when the eden space fills up, and an old-generation GC is triggered when the old generation is running out of space; with CMS, a collection starts once old-generation occupancy crosses the configured threshold (75% in the Elasticsearch configuration above).
Now that we have a better understanding of GC, we can return to the questions posed at the beginning of this article.
[2018-06-30T17:57:23,848][WARN ][o.e.m.j.JvmGcMonitorService] [qoo--eS] [gc][228384] overhead, spent [2.2s] collecting in the last [2.3s]
The JVM spent 2.2s collecting garbage during the last 2.3s. That is an excessively high GC overhead, so pay close attention to it.
[2018-06-30T17:57:29,020][INFO ][o.e.m.j.JvmGcMonitorService] [qoo--eS] [gc][old][228385][160772] duration [5s], collections [1]/[5.1s], total [5s]/[4.4d], memory [945.4mb]->[958.5mb]/[1007.3mb], all_pools {[young] [87.8mb]->[100.9mb]/[133.1mb]}{[survivor] [0b]->[0b]/[16.6mb]}{[old] [857.6mb]->[857.6mb]/[857.6mb]}
Let's go through each field. With the basic GC knowledge above, the meaning is easy to work out:
[gc][old][228385][160772]: an old-generation GC; 228385 is the sequence number of the monitoring round and 160772 is the total number of old-generation collections since the JVM started.
duration [5s], collections [1]/[5.1s]: one collection ran during the last 5.1-second observation interval and took 5 seconds.
total [5s]/[4.4d]: 5 seconds of GC time in this interval, out of 4.4 days of accumulated old-generation GC time since the JVM started.
memory [945.4mb]->[958.5mb]/[1007.3mb]: heap usage before the collection -> after the collection / maximum heap size.
all_pools {...}: the same breakdown for each memory pool. Here the old pool stays at 857.6mb out of a maximum of 857.6mb, which means the old generation is full and this collection freed almost nothing.
As you can see, these logs are output by the JvmGcMonitorService class, which you can quickly find in the Elasticsearch source code at core/src/main/java/org/elasticsearch/monitor/jvm/JvmGcMonitorService.java. We will not discuss the source code in detail here. The following figure shows the general execution process of the class.
The source code also defines the log format, as shown in the following sample code.
private static final String SLOW_GC_LOG_MESSAGE =
"[gc][{}][{}][{}] duration [{}], collections [{}]/[{}], total [{}]/[{}], memory [{}]->[{}]/[{}], all_pools {}";
private static final String OVERHEAD_LOG_MESSAGE = "[gc][{}] overhead, spent [{}] collecting in the last [{}]";
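The numbers behind these messages ultimately come from the JVM's garbage collector MXBeans. The following simplified sketch (not the actual Elasticsearch implementation; the class name GcSampler is invented) shows how a monitor can periodically sample per-collector collection counts and times, which is the kind of data that fields such as collections [1]/[5.1s] and total [5s]/[4.4d] are built from.
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcSampler {
    public static void main(String[] args) throws InterruptedException {
        // Under ParNew + CMS, the bean names are "ParNew" and "ConcurrentMarkSweep",
        // which Elasticsearch maps to "young" and "old" (see GCNames below).
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long countBefore = gc.getCollectionCount();  // total collections since JVM start
            long timeBefore = gc.getCollectionTime();    // total collection time in milliseconds
            Thread.sleep(1000);                          // sampling interval
            long deltaCount = gc.getCollectionCount() - countBefore;
            long deltaTime = gc.getCollectionTime() - timeBefore;
            System.out.printf("[gc][%s] %d collections, %d ms spent in the last second%n",
                    gc.getName(), deltaCount, deltaTime);
        }
    }
}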
You may have noticed that the GC names in the log are only "young" and "old". This is because Elasticsearch maps the JVM's memory pool and collector names onto these labels in the org.elasticsearch.monitor.jvm.GCNames class. The following code performs the mapping.
public static String getByMemoryPoolName(String poolName, String defaultName) {
    if ("Eden Space".equals(poolName) || "PS Eden Space".equals(poolName) || "Par Eden Space".equals(poolName) || "G1 Eden Space".equals(poolName)) {
        return YOUNG;
    }
    if ("Survivor Space".equals(poolName) || "PS Survivor Space".equals(poolName) || "Par Survivor Space".equals(poolName) || "G1 Survivor Space".equals(poolName)) {
        return SURVIVOR;
    }
    if ("Tenured Gen".equals(poolName) || "PS Old Gen".equals(poolName) || "CMS Old Gen".equals(poolName) || "G1 Old Gen".equals(poolName)) {
        return OLD;
    }
    return defaultName;
}

public static String getByGcName(String gcName, String defaultName) {
    if ("Copy".equals(gcName) || "PS Scavenge".equals(gcName) || "ParNew".equals(gcName) || "G1 Young Generation".equals(gcName)) {
        return YOUNG;
    }
    if ("MarkSweepCompact".equals(gcName) || "PS MarkSweep".equals(gcName) || "ConcurrentMarkSweep".equals(gcName) || "G1 Old Generation".equals(gcName)) {
        return OLD;
    }
    return defaultName;
}
In the preceding code, you can see the names of the different GC algorithms that we mentioned in the first section.
Now we have finished describing the source code. To learn more, you can search for more relevant information online.
You can find many articles about GC on the Internet. This article starts with basic knowledge about GC so that beginners can easily get started and understand the GC output. I hope this article helps you better understand GC logs in Elasticsearch.
Wei Bin, CTO of Puxiang Communication Technology, is an open-source software advocate. He was the first Elastic certified engineer in China and the initiator of the community projects Elastic Daily and Elastic Talk. Wei Bin was awarded the 2019 Annual Partner Architect - Special Contribution Award by Elastic China. Wei Bin has rich practical experience in open-source software such as Elasticsearch, Kibana, Beats, Logstash, and Grafana. He provides consulting and training services for customers in many industries, including retail, finance, insurance, securities, and technology. He helps customers identify the roles of open-source software in their business, implement open-source software from scratch, and scale it out to produce business value.
Declaration: This article is adapted from "Do You Understand the GC Logs in Elasticsearch Logs?" under the authorization of the author Wei Bin. We reserve the right to investigate unauthorized use.