By Lu Yanbo
This is the second article of the Java in Container series. You are welcome to follow the series. 😊
Java agents are enabled in multiple Alibaba Cloud products for users. Thus, the overall Java Agent duration increases, and each agent stores each other in scenarios where multiple Java agents are used together, resulting in increased memory usage and resource consumption.
Microservices Engine (MSE) initiated a one-java-agent project that can collaborate with various Java agents. It supports more efficient and convenient bytecode injection.
Among them, each Java agent, as the plugin of the one-java-agent, is loaded through multi-thread startup in the premain phase, thus reducing the startup speed from O(n) to O(1) and the overall loading time of Java Agent.
However, the following errors were found during the one-java-agent premain phase of the new version of Agent verification:
2022-06-15 06:22:47 [oneagent plugin arms-agent start] ERROR c.a.o.plugin.PluginManagerImpl -start plugin error, name: arms-agent
com.alibaba.oneagent.plugin.PluginException: start error, agent jar::/home/admin/.opt/ArmsAgent/plugins/ArmsAgent/arms-bootstrap-1.7.0-SNAPSHOT.jar
at com.alibaba.oneagent.plugin.TraditionalPlugin.start(TraditionalPlugin.java:113)
at com.alibaba.oneagent.plugin.PluginManagerImpl.startOnePlugin(PluginManagerImpl.java:294)
at com.alibaba.oneagent.plugin.PluginManagerImpl.access$200(PluginManagerImpl.java:22)
at com.alibaba.oneagent.plugin.PluginManagerImpl$2.run(PluginManagerImpl.java:325)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.InternalError: null
at sun.instrument.InstrumentationImpl.appendToClassLoaderSearch0(Native Method)
at sun.instrument.InstrumentationImpl.appendToSystemClassLoaderSearch(InstrumentationImpl.java:200)
at com.alibaba.oneagent.plugin.TraditionalPlugin.start(TraditionalPlugin.java:100)
... 4 common frames omitted
2022-06-16 09:51:09 [oneagent plugin ahas-java-agent start] ERROR c.a.o.plugin.PluginManagerImpl -start plugin error, name: ahas-java-agent
com.alibaba.oneagent.plugin.PluginException: start error, agent jar::/home/admin/.opt/ArmsAgent/plugins/ahas-java-agent/ahas-java-agent.jar
at com.alibaba.oneagent.plugin.TraditionalPlugin.start(TraditionalPlugin.java:113)
at com.alibaba.oneagent.plugin.PluginManagerImpl.startOnePlugin(PluginManagerImpl.java:294)
at com.alibaba.oneagent.plugin.PluginManagerImpl.access$200(PluginManagerImpl.java:22)
at com.alibaba.oneagent.plugin.PluginManagerImpl$2.run(PluginManagerImpl.java:325)
at java.lang.Thread.run(Thread.java:855)
Caused by: java.lang.IllegalArgumentException: null
at sun.instrument.InstrumentationImpl.appendToClassLoaderSearch0(Native Method)
at sun.instrument.InstrumentationImpl.appendToSystemClassLoaderSearch(InstrumentationImpl.java:200)
at com.alibaba.oneagent.plugin.TraditionalPlugin.start(TraditionalPlugin.java:100)
... 4 common frames omitted
Those familiar with Java Agent may notice that this is an Instrumentation.appendToSystemClassLoaderSearch
call that reported an error.
First of all, the appendToSystemClassLoaderSearch
path does exist. Secondly, the real reason for this error is in the C++ part, which is difficult to troubleshoot.
Anyway, we still need to investigate why this mistake occurred.
First of all, let's sort out the specific call process. The following analysis is based on this:
- Instrumentation.appendToSystemClassLoaderSearch (java)
- appendToClassLoaderSearch0 (JNI)
`- appendToClassLoaderSearch
|- AddToSystemClassLoaderSearch
| `-create_class_path_zip_entry
| `-stat
`-convertUft8ToPlatformString
`- iconv
Since this problem has a 10% probability of occurring in the container environment and is easy to reproduce, the latest code of dragonwell8 is used to add logs to confirm the scene.
First, add the log at the actual entrance of JNI, which is the appendToClassLoaderSearch
method entrance.
After adding the log, I found that the problem was even more confusing.
appendToClassLoaderSearch
entry will output.This is not the same as the error log. Is the stacktrace information fooling us?
After a tough night, I consulted colleagues from the Dragonwell team the next day. Here was what they did:
After that, our logs can be output.
This was the first difficulty we met. The printf needs to add fflush to ensure successful output.
After that, logs are added continuously. Finally, we found that create_class_path_zip_entry returned NULL.
Continue the troubleshooting and find that the stat error is reported. Return No such file or directory. As we mentioned earlier, the path of jarFile exists. Isn't stat thread-safe?
After checking the document [1], we found that stat is thread-safe.
Looking back, I noticed that the path of stat was abnormal. Sometimes, the path was empty, and other times, the path was /home/admin/.opt/ArmsAgent/plugins/ahas-java-agent/ahas-java-agent.jarSHOT.jar. As seen at the end of the character, it was caused by two characters written into the same memory. Moreover, the length of the corresponding string has become an irregular number.
Then, the problem is clear to see. Let's start looking for the generation of this string. This character is generated by the convertUft8ToPlatformString.
Therefore, the logic of utf8ToPlatform is debugged. gdb is directly run on ECS to debug jvm and avoid frequent log addition and container restart.
The results show that under Linux, utf8ToPlatform is the direct memcpy, and the target address of memcpy is on the stack.
It seems the thread safety problem doesn't exist.
After careful investigation, it was found it was related to the environment variables. The environment variable related to coding on ECS is LANG=en_US.UTF-8. The centos:7 does not have this environment variable by default on the container. As such, jvm reads ANSI_X3.4-1968.
This is the second difficulty. Environmental variables will affect the local encoding conversion.
With the preceding phenomena and codes, it is found that it is still necessary to go through iconv to switch from UTF-8 to ANSI_X3.4-1968 coding in the container environment.
It can also be inferred here that if LANG=en_US.UTF-8 is manually set in the container. This problem will not occur again. Additional verification confirms this.
Then, add the log. When finally confirm that it is iconv, the target string breaks down.
Does that mean the iconv is not thread-safe?
We checked iconv's documentation and found that it is not completely thread-safe.
Generally speaking, before iconv, you need to use iconv_open to open an iconv_t, and this iconv_t does not support the simultaneous use of multiple threads.
At this point, the problem has been clearly located since jvm has written iconv_t as a global variable. When multiple threads append, it is possible to call iconv at the same time, resulting in competition problems.
Here is the third difficulty. Iconv is not thread-safe.
It is easy to modify the Java code. You only need to add a lock.
However, there is a design problem here. The instrument objects have been scattered everywhere in the code. Now, it is suddenly necessary to add a lock. Almost all the places used need to be changed, and the cost of code transformation is high.
Finally, we solve it through the proxy class.
As such, other places only need to use InstrumentationWrapper, and such a problem will not be triggered.
Then, we analyzed the code on the jvm side and found that the appendToClassLoaderSearch 0 method was not threaded-safe because iconv_t was not thread-safe. Can it be solved gracefully?
If it is a Java program, it can be solved by directly using ThreadLoal to store iconv_t.
However, on the cpp side, although C++ 11 supports thread_local, jdk8 does not use C++ 11 (you can refer to JEP). Besides, C++ 11 only supports thread_local set and get. Lifecycle management (such as initialization and destruction of thread_local) is not supported yet. For example, it is impossible to automatically recycle iconv_t resources at the end of the thread.
Let's fall back to pthread. Since pthread provides thread-specific data, you can do similar things.
In short, pthread provides the full lifecycle management of thread_local. Therefore, the final code is listed below. The thread-local storage is initialized with make_key.
Therefore, after compiling the JDK, inputting mirroring and restarting pods several times in batches did not cause the problems mentioned at the beginning of the article.
There are many problems we have come across throughout this process, including Java, JNI/JVMTi, glibc, and pthread.
From this case, we restored the problem and proposed a solution according to the call stack and code. I hope this article help you learn more about Java/JVM.
[1] Document
https://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_09.html
[2] Links that one-java-agent fixes:
https://github.com/alibaba/one-java-agent/issues/31
[3] Links that Dragonwell fixes:
https://github.com/alibaba/dragonwell8/pull/346
[4] The one-java-agent brings us a convenient and non-intrusive microservice governance method:
https://www.alibabacloud.com/product/microservices-engine
How Do We Implement Full-Process Grayscale Capability on the Database?
Observability | A Recapture of Time Series Data Downsampling in Prometheus
506 posts | 48 followers
FollowAlibaba Cloud Native Community - May 18, 2023
Alibaba Cloud Community - February 24, 2023
Alibaba Cloud Native Community - May 22, 2023
Alibaba Cloud Native Community - March 29, 2023
Alibaba Cloud Native Community - May 22, 2023
Alibaba Cloud Native Community - April 29, 2024
506 posts | 48 followers
FollowExplore Web Hosting solutions that can power your personal website or empower your online business.
Learn MoreExplore how our Web Hosting solutions help small and medium sized companies power their websites and online businesses.
Learn MoreBuild superapps and corresponding ecosystems on a full-stack platform
Learn MoreWeb App Service allows you to deploy, scale, adjust, and monitor applications in an easy, efficient, secure, and flexible manner.
Learn MoreMore Posts by Alibaba Cloud Native Community