Implement Auto Instrumentation under GraalVM Static Compilation on OTel Java Agent

This article summarizes a presentation given by Alibaba Cloud experts at the 2024 OpenTelemetry Community Day and their explorations in the field.

By Wangtao and Chengpu

[Figure 1]

As the influence of OpenTelemetry in the observability field continues to grow, the project is evolving rapidly. As one of the most extensive users of Java in China, Alibaba Cloud is deeply involved in the development of OTel Java Instrumentation and in community activities. We have contributed to and reviewed over 100 pull requests (PRs) and participated in 58 issue discussions, ranking first in the Asia Pacific region for contributions to the OpenTelemetry project.

At the 2024 OpenTelemetry Community Day, Alibaba Cloud Observability Engineers Huxing Zhang (Wangtao) and Zihao Rao (Chengpu) gave a talk titled "Implement Auto Instrumentation under GraalVM Static Compilation on OTel Java Agent" and introduced Alibaba Cloud's explorations in this field. This article provides a summary of the presentation.

1. Background

[Figure 2: Lifecycle of Java apps]

With the continuous evolution of applications in cloud-native serverless scenarios, modern Java applications face two pressing challenges: slow startup and high memory overhead. Why do Java applications take so long to start? As can be seen from the figure Lifecycle of Java apps, a typical Java application goes through virtual machine initialization, class loading, and JIT compilation during startup, and each of these phases adds to the startup time. In addition, these extra steps consume memory, for example to store the loaded classes, and the lack of preloading optimization further contributes to high memory usage.

[Figure 3]

About five years ago, GraalVM released its first stable version. Compared to traditional JVM-based environments, GraalVM Native Image offers several significant advantages:

  1. It eliminates JVM initialization, JIT compilation, and interpretation execution, thereby significantly reducing the startup time.
  2. Since the application is optimized and compiled before runtime, the memory usage during VM startup is greatly reduced. As shown in the figure, microservices written with typical frameworks use roughly 5x less memory in Native Image mode than in JVM-based environments.

[Figure 4]

So how does GraalVM Native Image work? Simply put, it uses the compiler provided by GraalVM to compile Java bytecode into an executable binary file for a specific operating system and architecture. As shown on the right of the above figure, the process involves the following steps:

1. Reachability analysis: The GraalVM compiler performs reachability analysis on the Java applications starting from the entry point (typically the main function of Java applications). Only the reachable code segments will be compiled into the final binary file, which reduces the size of the final compiled file.

2. Class initialization: Classes can be initialized during compilation and their initialized state saved. This allows the Native Image to read that state directly into memory upon startup, significantly reducing initialization time.

3. Object snapshots: Objects created during this initialization are saved along with the other data during compilation, which further reduces startup time.

Finally, all this information is packaged, together with the necessary runtime support components, into a single executable binary file.
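To make the reachability analysis step more concrete, the following is a minimal sketch of the idea as a worklist traversal over a call graph. It is not GraalVM's actual analysis, which works on bytecode and also tracks types and fields. The class name `ReachabilitySketch`, the method `reachableFrom`, and the string-based call graph are hypothetical and used only for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration: methods are plain strings, and the call graph
// maps each method to the methods it may call.
final class ReachabilitySketch {

    // Returns every method reachable from the entry point.
    static Set<String> reachableFrom(String entry, Map<String, List<String>> calls) {
        Set<String> reachable = new LinkedHashSet<>();
        Deque<String> worklist = new ArrayDeque<>();
        worklist.add(entry);
        while (!worklist.isEmpty()) {
            String method = worklist.poll();
            if (!reachable.add(method)) {
                continue; // already visited
            }
            // Enqueue callees; methods without an entry in the map call nothing.
            worklist.addAll(calls.getOrDefault(method, List.of()));
        }
        return reachable;
    }
}
```

In this sketch, a method that is never called from the entry point, directly or indirectly, does not appear in the result, which mirrors why unreachable code is left out of the final binary.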

[Figure 5]

This sounds great, but switching to GraalVM Native Image also introduces some problems. For example:

  1. Many dynamic features of Java applications, such as dynamic class loading, reflection, and dynamic proxies, no longer work out of the box. GraalVM requires additional configuration to support them.
  2. The platform independence that is a hallmark of the Java platform is no longer available.
  3. Most importantly, Java Agents based on bytecode rewriting become ineffective, because no bytecode exists at runtime. As a result, the observability capabilities provided by Java Agents, such as collecting traces and metrics, are no longer available.
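As a small illustration of the first problem, the sketch below instantiates a class whose name is only known at runtime. On a JVM this works for any class on the class path. Under Native Image, a class that is named only by a runtime string generally has to be declared in the reflection configuration (for example, a `reflect-config.json` file), because static analysis cannot see it. The class `ReflectionSketch` and its method are hypothetical names for illustration.

```java
// Hypothetical illustration of a dynamic feature that static analysis
// cannot resolve on its own: the class name arrives as a runtime string.
final class ReflectionSketch {

    // Creates an instance of the named class through its no-arg constructor.
    static Object instantiate(String className) {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            // In a native executable, an unregistered class typically ends up here.
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }
}
```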

Therefore, while reducing startup time and memory consumption, we want to ensure that applications possess out-of-the-box observability, meaning that all enhancements made by Java Agents continue to function. So how can we solve this problem?

2. Solution

[Figure 6]

To address this common pain point, the Alibaba Cloud Observability team, in collaboration with the programming language and compiler team, designed and implemented a pioneering approach: static Java Agent instrumentation.

Before introducing the solution itself, it is worth reviewing how a Java Agent works. The key stages are premain execution, main function execution, and class loading. When an application is started with a Java Agent, the agent's premain method runs first and registers a transformer for specific classes, such as class C in the figure. After premain finishes, the main function is executed, and various classes are loaded along the way. When the class loader encounters class C, it triggers the callback registered by the Java Agent, which runs the transformation logic and turns class C into class C'. Finally, the class loader loads the transformed class C'. In this way, the Java Agent rewrites the bytecode of specific classes in the original application and adds extra logic to them.
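The agent workflow described above can be sketched with the standard `java.lang.instrument` API. This is a minimal illustration, not the OTel Java Agent's implementation: the class names `SketchAgent` and `TargetOnlyTransformer` and the target `com/example/C` are hypothetical, and the "transformation" is an identity copy where a real agent would rewrite the bytecode with a library such as Byte Buddy or ASM.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// Hypothetical agent; a real one is packaged in a JAR whose manifest
// names this class in the Premain-Class attribute.
final class SketchAgent {

    // Internal (slash-separated) name of the class to transform. Assumption for illustration.
    static final String TARGET = "com/example/C";

    static final class TargetOnlyTransformer implements ClassFileTransformer {
        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class<?> classBeingRedefined,
                                ProtectionDomain protectionDomain,
                                byte[] classfileBuffer) {
            if (!TARGET.equals(className)) {
                return null; // null tells the JVM to keep the class unchanged
            }
            // Placeholder for C -> C': a real agent returns rewritten bytecode here.
            return classfileBuffer.clone();
        }
    }

    // Called by the JVM before the application's main method.
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new TargetOnlyTransformer());
    }
}
```

Such an agent is attached with the `-javaagent:` JVM option, after which the transformer is consulted for every class that gets loaded.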

However, in GraalVM, bytecode does not exist at runtime, so the application cannot be enhanced at runtime in this way. To achieve a similar capability, the enhancement must happen before runtime. The problem therefore breaks down into two questions:

A. How to transform target classes before runtime to obtain the transformed classes?

B. How to replace the original classes with the transformed ones before runtime?

[Figure 7]

To address these two issues, we came up with the overall design shown in the above figure. It consists of two phases: pre-running and static compilation. In the pre-running phase, the application is run with both the OTel Java Agent and the Native Image Agent attached. The OTel Java Agent transforms class C into C' during the pre-run, while the Native Image Agent collects the transformed classes, such as class C' in the above figure. This solves Problem A: How to transform target classes before runtime to obtain the transformed classes?

Next, in the static compilation phase, we use the original application, the OTel Agent, the transformed classes, and configurations as inputs and compile them. During the compilation process, we replace class C in the application with C' and generate an executable application that only contains C' for running. This addresses Problem B: How to replace the original classes with the transformed ones before runtime?

[Figure 8]

After understanding the overall scheme, you may be curious about what the Native Image Agent is and how it can be used to collect transformed classes.

The Native Image Agent is a tool provided by GraalVM. It observes the application while it runs on a JVM and collects the dynamic configurations required for static compilation. This helps work around the limitations of the GraalVM compiler and allows developers to continue using some of Java's dynamic features in GraalVM, such as reflection and dynamic proxies.

However, it does not directly help us collect transformed classes. To solve this problem, we implement an interceptor in the Native Image Agent. This interceptor checks the bytecode of classes before and after transformation. If changes are detected, it records and saves them. Otherwise, it ignores the class.
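The before/after comparison performed by the interceptor can be sketched as follows. This is only an illustration of the idea under simplifying assumptions, not the code added to the Native Image Agent: the class `TransformedClassCollector` and its methods are hypothetical, and the collected bytecode is kept in memory rather than written to a configuration folder.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical collector: records a class only if its bytecode was changed
// by a transformer.
final class TransformedClassCollector {

    private final Map<String, byte[]> transformed = new TreeMap<>();

    // Compares bytecode before and after transformation.
    // Returns true if the class was changed and has been recorded.
    boolean onClassLoaded(String className, byte[] before, byte[] after) {
        if (after == null || Arrays.equals(before, after)) {
            return false; // unchanged: ignore the class
        }
        transformed.put(className, after.clone());
        return true;
    }

    // Read-only view of the recorded classes, keyed by class name.
    Map<String, byte[]> collected() {
        return Collections.unmodifiableMap(transformed);
    }
}
```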

In fact, we found that simply recording transformed classes was not enough. Some classes, such as dynamically generated classes, are not part of the original application, so we also use the Native Image Agent to collect them. In addition, premain is a concept of the JVM and the Java Agent that GraalVM does not natively support, so we use the Native Image Agent to generate the necessary premain configuration, which tells GraalVM the entry point of the OTel Java Agent.

Apart from the above, we made some additional adaptations for special cases. For example, because the GraalVM compiler is itself a Java application, transformed JDK classes cannot be collected by the Native Image Agent and replaced during compilation the way non-JDK classes are, as this could affect the behavior of the GraalVM compilation itself. Therefore, we implemented some special APIs within GraalVM and use them in the OTel Java Agent to re-instrument JDK classes. In this way, the GraalVM static compilation process recognizes the relevant content, and the transformation logic for JDK classes is included in the final Native Image executable file without modifying the JDK that the compiler depends on. Lastly, the OTel Java Agent uses multiple class loaders, whereas GraalVM has only one, so we shade the agent's classes to achieve a similar effect.

[Figure 9]

With the transformed classes and the necessary configurations in hand, how do we replace C with C' during static compilation? We provide two methods. The first uses the -classpath option of the JDK. If two classes with the same name appear on the class path, the one found first takes effect. Therefore, declaring the path of the transformed classes at the beginning of the class path is enough to solve the replacement problem. The second uses the --module-path option, available in JDK 9 and later, to achieve a similar effect: the classes to be replaced are packaged into a JAR file, which is then used to patch the original module.
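The "first entry wins" behavior behind the first method can be demonstrated with a small sketch. It does not invoke the native-image build. Instead, it uses a `URLClassLoader` over two temporary directories and a same-named text resource as a stand-in for the class file of C, on the assumption that the search order works the same way as for the class path. The class `ClassPathOrderSketch` and the file name `C.txt` are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo: two directories contain an entry with the same name,
// and the loader returns the one from the directory listed first.
final class ClassPathOrderSketch {

    // Looks up a resource across two directories searched in the given order.
    static String firstMatch(Path first, Path second, String name) throws IOException {
        URL[] searchPath = { first.toUri().toURL(), second.toUri().toURL() };
        try (URLClassLoader loader = new URLClassLoader(searchPath, null);
             InputStream in = loader.getResourceAsStream(name)) {
            if (in == null) {
                throw new IOException("Not found on search path: " + name);
            }
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    // Puts the "transformed" directory ahead of the "original" one.
    static String demo() {
        try {
            Path transformed = Files.createTempDirectory("transformed");
            Path original = Files.createTempDirectory("original");
            Files.writeString(transformed.resolve("C.txt"), "C'");
            Files.writeString(original.resolve("C.txt"), "C");
            return firstMatch(transformed, original, "C.txt");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Swapping the two arguments of `firstMatch` would return the original content instead, which is why the transformed classes need to come first.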

3. Demonstration

0~43s: In the first part of the video, we demonstrate the pre-running process. During this process, the OTel Java Agent and the Native Image Agent are attached to a Spring Boot + MySQL application to generate and collect the enhanced classes. Note that in the video, the pre-running process takes 18.517 seconds when the application runs on the JVM. This includes the time for JVM startup and agent class enhancement. The startup of the simple Spring Boot application itself takes 6.152 seconds. After pre-running, a native-configs folder is generated, which contains the dynamic configuration files collected by the Native Image Agent and the recorded enhanced class files.

44s~1min 44s: In the second part of the video, we demonstrate the static compilation process. Here, the original application, the OTel Agent, the transformed classes, and the configurations are used as inputs and compiled. The compilation can be time-consuming because it includes multiple steps such as static analysis and static compilation. The demonstration uses a machine with 32 cores and 64 GB of memory for compilation. Static compilation produces an executable binary file that is specific to the host platform, namely the demo file in the video, which can be executed directly.

1min 45s~2min 6s: In the third part of the video, we demonstrate the execution of the statically compiled application. Both the application startup time of 0.069 seconds and the overall startup time of 0.088 seconds are a significant improvement over the JVM scenario. Additionally, with this solution applied, the statically compiled application collects metrics, traces, and other data as expected.

[Figure 10]

Finally, we have also used this solution to perform static compilation for commonly used frameworks (such as Spring Boot, Kafka, Redis, and MySQL) and then compared the observability data collection performance with that under the original JVM-based environments. Under GraalVM, this solution provides observability capabilities for applications, and there is a noticeable improvement in both startup speed and runtime memory overhead compared with the original JVM solution.

4. Future Work

[Figure 11]

In the future, we plan to focus on the following two aspects for further optimization:

  1. We will add more comprehensive test cases covering multiple signals such as metrics, traces, and logs.
  2. We will consider consolidating the pre-running phase and the native compilation phase into a single phase to ensure that transformed classes are collected completely. The current limitation is that only the bytecode of classes executed during the pre-running phase is collected, so classes may be missed if the pre-run (for example, the unit tests that drive it) does not cover enough code paths.

5. Future Open Source Plan

Alibaba Cloud has long followed and contributed to open-source projects in the observability and static compilation fields:

In the observability field, Alibaba Cloud is deeply involved in the development of the OpenTelemetry Java Instrumentation project. We have contributed to and reviewed over 100 PRs and participated in 58 issue discussions, ranking first in the Asia Pacific region for contributions to the OpenTelemetry project. Moreover, @steverao has become a Triager for OpenTelemetry Java Instrumentation. We will continue to actively engage with the OpenTelemetry community. Our self-built Go agent will be fully open-sourced, and we are considering contributing it to the OpenTelemetry Go Instrumentation community. A discussion about this contribution is ongoing. More agents will be open-sourced in the future.

Currently, we have contributed PRs for the implementation of the corresponding solutions to the above two communities. In the future, we plan to collaborate with them to advance and implement the related solutions.

PRs:
