
Interpretation of Gemini: an Enterprise-level State Storage Engine of Alibaba Cloud Realtime Compute for Apache Flink

This article takes a deep dive into Gemini, the enterprise-level state storage engine of Alibaba Cloud Realtime Compute for Apache Flink.

This article is compiled from research on Gemini, the enterprise-level state storage engine of Alibaba Cloud Realtime Compute for Apache Flink, conducted by Jinzhong Li, Zhaoqian Lan, and Yuan Mei of the Alibaba Cloud Realtime Compute for Apache Flink storage engine team. The content is divided into the following five parts:

  1. The Pain Points of Stream Computing State Access
  2. Enterprise-level State Storage Engine
  3. Gemini Performance Evaluation & Online Performance
  4. Summary
  5. References

1. The Pain Points of Stream Computing State Access

Flink, as a stateful stream computing system, heavily relies on its state storage engine. In Flink, state serves the purpose of storing either intermediate computation results or sequences of historical events, as illustrated in Figure 1-1. Let's examine two typical scenarios:

  • In the case of Aggregate (Agg) operators, the result of computation on incoming data is stored in the state. When new data arrives, incremental computation can be performed using the previously stored result (a minimal code sketch follows Figure 1-1 below).
  • For dual-stream or multi-stream Join operators, each stream's data is matched against the historical data of the other streams based on the Join criteria. Hence, each stream must use the state to preserve the sequence of events received over a period of time.

Figure 1-1 The Flink state is used to store intermediate computing results or the sequence of historical events
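To make the Agg case concrete, here is a minimal sketch of a Flink KeyedProcessFunction that keeps a running count per key in keyed ValueState. The class and field names are illustrative; whichever state backend is configured (RocksDB or Gemini) is what actually stores and serves this state.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/** Counts events per key; the running count lives in keyed state managed by the state backend. */
public class CountPerKey extends KeyedProcessFunction<String, Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> countState;

    @Override
    public void open(Configuration parameters) {
        // Registers this piece of state with the configured state backend (e.g. RocksDB or Gemini);
        // the backend decides whether the data lives in memory or on disk.
        countState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void processElement(Tuple2<String, Long> event, Context ctx, Collector<Tuple2<String, Long>> out)
            throws Exception {
        Long current = countState.value();           // point read against the state backend
        long updated = (current == null ? 0L : current) + event.f1;
        countState.update(updated);                  // point write against the state backend
        out.collect(Tuple2.of(ctx.getCurrentKey(), updated));
    }
}
```

Every processElement call performs a point read and a point write against the state backend, which is why state access performance directly bounds job throughput.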

When a Flink job has a large state, it is difficult for the state storage engine to keep the full state data in memory, so some cold data is often stored on disk. The difference in access performance and latency between memory and disk is huge, and I/O access can easily become a bottleneck in data processing. During Flink computation, an operator that frequently loads state data from disk is prone to becoming the throughput bottleneck of the entire job. Therefore, the state storage engine is often a key factor in determining the performance of Flink jobs.

1.1 RocksDB State Back-end Issues

Currently, the production-ready state storage engine available in the community is based on RocksDB. As a general-purpose KV storage engine, RocksDB is not fully suited to stream computing scenarios. From actual production use and user feedback, we found the following pain points:

  • Periodic checkpoints in Flink degrade the performance of RocksDB and are prone to causing CPU spikes, affecting cluster stability. In Flink's fault-tolerance mechanism, jobs periodically trigger checkpoints to generate global state snapshots for fault recovery. Each time Flink triggers a checkpoint, RocksDB flushes the data in memory to generate a new file, which brings several negative effects:

    • Unnecessary cache misses are caused, leading to more disk reads and lower performance.
    • Internal data reorganization (compaction) is triggered more frequently, increasing the overall CPU and I/O overhead of the system.
    • CPU spikes tend to occur during checkpoints (as shown in Figure 1-2), resulting in sudden resource contention in the cluster and making it difficult for users to estimate reasonable cluster resources in advance.

Figure 1-2 Periodic checkpoints of Flink jobs cause periodic CPU spikes

  • State recovery when scaling concurrency is slow, especially when scaling down. Take Flink 1.18 with RocksDB as an example, where a concurrency of two is scaled down to one: the valid key-value data must be traversed from the original DB instances and inserted into the new DB instance. The whole process involves a large amount of read and write I/O and is relatively slow (note: file-level clipping is expected to be supported after Flink 1.19, which can accelerate this process). For large-state scenarios in particular, the state recovery process may take tens of minutes.
  • Strong dependence on the local disk: if the local disk is used up, the job cannot run normally. The state data in RocksDB must be stored on local disks, but local disk capacity is generally limited. RocksDB therefore often runs into situations where the local disk fills up and the job cannot run stably, which can only be resolved by expanding the disk or increasing the concurrency, so the overall scalability of the system is poor.

2. Enterprise-level State Storage Engine Gemini

Realtime Compute for Apache Flink provides an enterprise-level state storage engine, Gemini, which is designed for state access in stream computing. This engine resolves the pain points of open source state storage engines in terms of performance, checkpointing, and job recovery. With this year's comprehensive upgrade of Alibaba Cloud Realtime Compute for Apache Flink, Gemini has also released a new version that makes breakthroughs in performance and stability. The new version of Gemini has undergone extensive production testing within Alibaba Group and by Alibaba Cloud customers. Across various scenarios, its performance, ease of use, and stability are significantly better than those of the open source state storage engine.

2.1 Core Architecture

The overall architecture of Gemini still uses disks as the primary persistent storage and memory as the cache (as shown in Figure 2-1). The write buffer uses a hash index with a compact memory layout, providing a significant performance advantage over sorted indexes for small and medium-sized states (a simplified sketch follows Figure 2-1 below). The new version of Gemini improves the on-disk data layout and focuses on optimizing engine performance in large-state scenarios: it redesigns the file format around the characteristics of stream computing, supports different state expiration and cleanup methods for common business scenarios, greatly improves the compression and encoding efficiency of state data, reduces the state size, and effectively improves state access performance.

Figure 2-1 Gemini core architecture
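The performance difference between a hash-indexed and a sorted write buffer can be illustrated with a minimal, simplified sketch (not Gemini's actual implementation; all names are illustrative): point reads and writes hit a hash table directly, and sorting is deferred until the buffer is flushed to a file.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Simplified write buffer: O(1) point access via a hash index, sorting deferred to flush time. */
public class HashIndexedWriteBuffer {

    private final Map<String, byte[]> index = new HashMap<>();

    public void put(String key, byte[] value) {
        index.put(key, value);   // O(1) insert; no ordering work on the hot write path
    }

    public byte[] get(String key) {
        return index.get(key);   // O(1) point lookup, the dominant access pattern in Flink state
    }

    public TreeMap<String, byte[]> snapshotSorted() {
        // Entries are sorted only once, when the buffer is flushed into an ordered on-disk file.
        return new TreeMap<>(index);
    }
}
```

Deferring the sort to flush time trades a one-off sorting cost for cheap point reads and writes on the hot path, which matches the point-lookup-dominated access pattern of Flink state.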

2.2 Storage-Compute Separation and Hot/Cold Data Separation - Enhanced Disk Capacity Scalability

In cloud-native deployment environments, local disk capacity is generally limited. RocksDB is designed to store the full state data on local disks and scales poorly. Gemini supports storing and accessing state data files remotely: when local disk capacity is insufficient, some cold data can be placed in a remote distributed file system, removing the capacity limit of the local disk. You no longer need to increase concurrency just to work around insufficient storage space, which saves considerable cost.

Remote access is low-cost but slower. Gemini uses a hot/cold tiering approach to resolve this trade-off: based on historical access information and stream computing characteristics, frequently accessed data is kept on local disks, while infrequently accessed data is stored remotely. This achieves optimal performance at the existing cost.
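The tiered read path can be sketched roughly as follows. This is a conceptual illustration only, assuming hypothetical LocalDiskStore and RemoteFileStore components rather than Gemini's actual code: hot blocks are served from the local disk, and a miss falls back to the remote distributed file system, optionally promoting the block back to the local tier.

```java
import java.io.IOException;
import java.util.Optional;

/** Conceptual hot/cold tiered read path: local disk first, remote DFS as the cold tier. */
public class TieredBlockReader {

    /** Hypothetical local tier, e.g. files on the TaskManager's local SSD. */
    interface LocalDiskStore {
        Optional<byte[]> read(String blockId) throws IOException;
        void promote(String blockId, byte[] block) throws IOException; // cache a hot block locally
    }

    /** Hypothetical cold tier, e.g. a remote distributed file system. */
    interface RemoteFileStore {
        byte[] read(String blockId) throws IOException;
    }

    private final LocalDiskStore local;
    private final RemoteFileStore remote;

    public TieredBlockReader(LocalDiskStore local, RemoteFileStore remote) {
        this.local = local;
        this.remote = remote;
    }

    public byte[] readBlock(String blockId, boolean consideredHot) throws IOException {
        Optional<byte[]> cached = local.read(blockId);
        if (cached.isPresent()) {
            return cached.get();                 // hot path: served from the local disk
        }
        byte[] block = remote.read(blockId);     // cold path: fetched from remote storage
        if (consideredHot) {
            local.promote(blockId, block);       // keep frequently accessed data local
        }
        return block;
    }
}
```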

2.3 State Lazy Loading and Delayed Tailoring - Greatly Improving the Startup and Scaling Speed

The new version of Gemini provides a state lazy loading feature (LazyRestore) to address long job recovery times and long service interruptions in large-state scenarios. As shown in Figure 2-2, in the traditional state recovery mode, the job can run normally and process business data only after the remote checkpoint files have been synchronously downloaded to the local disk. In the state lazy loading mode, only a small amount of metadata needs to be downloaded before the job starts processing user data, while the remote checkpoint files are downloaded to the local disk asynchronously. During the download, the operator can read the remote state data directly to complete its computation (a rough sketch follows Figure 2-2 below).

Figure 2-2 Normal state recovery mode (EagerRestore) vs. state lazy loading mode (LazyRestore)
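The difference between the two modes can be summarized with a rough sketch. This is purely illustrative; the class and method names are assumptions, not Gemini's API. Under lazy loading, only metadata is read synchronously, file bodies are fetched in the background, and a read that arrives before its file is local is served directly from remote storage.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Conceptual LazyRestore flow: metadata first, file bodies downloaded asynchronously. */
public class LazyRestoreSketch {

    /** Hypothetical handle to a state file in the checkpoint's remote storage. */
    interface RemoteStateFile {
        String id();
        byte[] read(String key);        // remote point read, slower but immediately available
        byte[] downloadAll();           // full file body
    }

    private final Map<String, byte[]> localFiles = new ConcurrentHashMap<>();
    private final Map<String, RemoteStateFile> remoteFiles = new ConcurrentHashMap<>();

    /** Synchronous part of recovery: only register metadata, then the job can start processing. */
    public void restoreMetadata(List<RemoteStateFile> checkpointFiles) {
        for (RemoteStateFile f : checkpointFiles) {
            remoteFiles.put(f.id(), f);
        }
        // Asynchronous part: pull file bodies to local disk while the job is already running.
        remoteFiles.values().forEach(f ->
                CompletableFuture.runAsync(() -> localFiles.put(f.id(), f.downloadAll())));
    }

    /** A read that arrives before its file has been downloaded is served from remote storage. */
    public byte[] read(String fileId, String key) {
        byte[] local = localFiles.get(fileId);
        if (local != null) {
            return lookUpInLocalFile(local, key);   // normal local read once the download completes
        }
        return remoteFiles.get(fileId).read(key);   // direct remote read during the download window
    }

    private byte[] lookUpInLocalFile(byte[] fileBytes, String key) {
        // Placeholder for the real file-format lookup; irrelevant to the restore flow shown here.
        return fileBytes;
    }
}
```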

Scaling concurrency is also a common operation for users. Unlike simple job recovery, scaling concurrency involves state tailoring, that is, removing redundant data. Whereas RocksDB needs to traverse the required key-value data before it can resume a job during scaling, Gemini can directly reuse the original files, splice their metadata, quickly rebuild the DB instance, and start processing user data. The redundant data in the files is then cleaned up asynchronously, and the performance of the state read and write threads is hardly affected during cleanup. This feature is called delayed state tailoring.

Combining state lazy loading and delayed tailoring, Gemini achieves major gains in job recovery speed. Let's compare three recovery methods (see Figure 2-3):

  • RocksDB: In the state recovery phase, the state files and metadata files must be downloaded and the redundant data processed before the job can start, so the overall interruption time is long.
  • Gemini + delayed tailoring: Only the state files and metadata files need to be downloaded before the job starts; the processing of redundant data is made asynchronous and hardly affects read and write threads while it runs. This allows the job to start quickly and shortens the interruption time.
  • Gemini + state lazy loading + delayed tailoring: Downloading the state files is also moved to the asynchronous phase, so the job only needs to download a small amount of metadata before it starts processing user data, which reduces the interruption time dramatically. While the state files are downloaded asynchronously, job performance starts low and gradually recovers to the normal level as the remote files arrive on the local disk. Compared with a fully blocking download, the lazy loading mode achieves higher overall throughput because data can be processed normally while the state files are still being downloaded.

Figure 2-3 Comparison of interruption time of RocksDB/Gemini/Gemini state lazy loading

Realtime Compute for Apache Flink now provides the ability to dynamically update job parameters (hot update), so you can change job parameters without stopping and restarting the job. The state lazy loading feature has been launched in combination with this dynamic parameter update feature, reducing business interruption time by more than 90% in parameter update scenarios.

2.4 Key-value Separation Feature - Optimization of Dual-stream/Multi-stream Join Performance

2.4.1 Key-value Separation Core Advantages

Many dual-stream Join scenarios in Flink have a low Join success rate or long state Values, and key-value separation benefits the performance of such jobs. For example, in risk control scenarios, only abnormal data joins successfully; in real-time recommendation scenarios (as shown in Figure 2-4), a Join succeeds only when the recommendation actually takes effect. The characteristics of such scenarios mean that the Join success rate of the corresponding Flink job is very low while the business data fields stored in the Value are very long, so key-value separation yields a large performance advantage.

The advantages of key-value separation in Join scenarios come from two aspects:

  • The Join operator only needs the Keys to determine whether a Join succeeds; the Value participates in the computation only when the Join succeeds. If the Join success rate is low, key-value separation allows far more Keys to fit in the cache, improving state access performance.
  • Large Values are stored separately from the main storage structure, which keeps that structure small and greatly reduces the CPU and I/O overhead of reorganizing redundant data inside the engine.

Figure 2-4 Flink dual-stream Join model

The downsides of the key-value separation mechanism are that it is unfriendly to range queries and that it introduces a degree of space amplification. In Flink scenarios, state access is dominated by point queries and range queries are rare, making this a natural fit for key-value separation. As for space amplification, Gemini combines key-value separation with storage-compute separation, which largely avoids the storage-space downside. A minimal conceptual sketch of key-value separation follows.
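This sketch is conceptual only and the names are illustrative: large values go into a separate value log, and the main index only stores the key plus a pointer, so far more keys fit in the cache and reorganizing the key index moves far less data.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Conceptual key-value separation: the key index stores pointers, large values live in a value log. */
public class KvSeparatedStore {

    /** Pointer into the value log: which segment and which slot the value was appended at. */
    private record ValuePointer(int segment, int slot) {}

    private final Map<String, ValuePointer> keyIndex = new HashMap<>(); // small, cache-friendly
    private final List<byte[]> valueLog = new ArrayList<>();            // append-only large values

    public void put(String key, byte[] value) {
        // The value is appended once; a real engine would append to rolling log segments.
        valueLog.add(value);
        keyIndex.put(key, new ValuePointer(0, valueLog.size() - 1));
    }

    /** A Join probe first checks only the key index; the value is fetched only on a match. */
    public byte[] getIfJoined(String key) {
        ValuePointer ptr = keyIndex.get(key);
        if (ptr == null) {
            return null;            // low Join success rate: most probes stop here, touching only keys
        }
        return valueLog.get(ptr.slot());
    }
}
```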

2.4.2 Key-value Separation Supports Storage-Compute Separation

The Gemini key-value separation feature can be closely combined with the storage-compute separation and hot/cold data separation features described above. When local space is insufficient, the separated Value data (cold data) is preferentially stored remotely, ensuring that Key read performance is not affected. When Values are accessed with low probability, this scheme delivers performance close to that of a pure local-disk scheme at a lower cost.

2.4.3 Adaptive Key-value Separation

In stream computing scenarios, different jobs have different data characteristics (such as Value length and the access frequencies of Keys and Values), so a fixed key-value separation parameter cannot optimize the performance of every job. To maximize the benefit of key-value separation, Gemini supports adaptive key-value separation: the storage engine identifies hot and cold data based on the characteristics of the state data and dynamically adjusts the proportion of separated key-value data to achieve better overall performance. The tuning process is shown in Figure 2-5, and a rough sketch follows the figure below. The Gemini adaptive key-value separation feature is enabled by default in SQL Join scenarios and improves job performance without requiring any configuration.

Figure 2-5 Gemini adaptive KV separation parameter tuning process
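A very rough illustration of such adaptive tuning is shown below. This is an assumption-laden sketch, not Gemini's algorithm: it keeps running statistics of value sizes and key/value access frequencies and periodically adjusts the value-size threshold above which values are separated.

```java
/** Conceptual adaptive tuning of the value-size threshold used for key-value separation. */
public class AdaptiveKvSeparationTuner {

    private long sampledValues;
    private long sampledValueBytes;
    private long keyAccesses;
    private long valueAccesses;

    private long separationThresholdBytes = 1024;   // initial guess; larger values are separated

    public void onWrite(int valueSizeBytes) {
        sampledValues++;
        sampledValueBytes += valueSizeBytes;
    }

    public void onKeyAccess()   { keyAccesses++; }
    public void onValueAccess() { valueAccesses++; }

    /** Periodically re-evaluate: long values plus key-dominated access favor more aggressive separation. */
    public void retune() {
        if (sampledValues == 0 || keyAccesses == 0) {
            return;
        }
        long avgValueSize = sampledValueBytes / sampledValues;
        double valueToKeyAccessRatio = (double) valueAccesses / keyAccesses;

        if (avgValueSize > separationThresholdBytes && valueToKeyAccessRatio < 0.5) {
            separationThresholdBytes = Math.max(256L, separationThresholdBytes / 2);  // separate more values
        } else if (valueToKeyAccessRatio > 0.9) {
            separationThresholdBytes = separationThresholdBytes * 2;                  // keep more values inline
        }
    }

    public long currentThresholdBytes() {
        return separationThresholdBytes;
    }
}
```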

3. Gemini Performance Evaluation & Online Performance

3.1 Flink State Benchmark

Test environment: An Alibaba Cloud ECS i2.2xlarge instance with 8 vCPUs, 64 GB of memory, and NVMe SSD;

Test settings: Use the Flink State Benchmark to compare the pure state operation performance of RocksDB and Gemini. RocksDB settings: write buffer 64 MB (two by default), block cache 512 MB. Gemini settings: total memory of 64 MB × 2 + 512 MB.

The test results are shown in Figure 3-1. Gemini's throughput is two to five times that of RocksDB for most point queries (ValueGet, ListGet, and MapGet) and write operations (ValueUpdate, ListUpdate, and MapUpdate), which account for a large proportion of Flink stream computing workloads.

Figure 3-1 Gemini/RocksDB Flink-state-benchmark performance comparison

3.2 Nexmark

Test environment: 5 Alibaba Cloud ecs.c7.16xlarge instances (1 JM, 4 TM), each with 64 vCPUs, 128 GB of memory, and a PL1 ESSD;

Test settings: Select the Nexmark stateful queries with the default data scale (EventsNum = 100 million), and use the standard Nexmark configuration (8 concurrencies, 8 TaskManagers, 8 GB of memory per TaskManager) to compare the performance of RocksDB and Gemini.

The test results are shown in Table 3-1. Gemini significantly improves job performance (single-core throughput): every query performs better on Gemini than on RocksDB, and about half of the queries outperform RocksDB by more than 70%.

Table 3-1 Gemini/RocksDB Nexmark performance comparison

3.3 State Recovery Speed Test

Test environment: a fully managed, pay-as-you-go workspace of Realtime Compute for Apache Flink.

Test settings: Use a WordCount benchmark to test the job recovery performance of RocksDB, Gemini, and Gemini with state lazy loading. The total state size of the job is about 4 GB, the source data follows a normal distribution, and each TaskManager is allocated 1 CPU and 4 GB of memory.

The test results are shown in Figure 3-2. In concurrency scaling scenarios, the interruption time of Gemini with default settings is shorter than that of RocksDB (47% shorter when scaling up and 78% shorter when scaling down). With lazy loading enabled, Gemini reduces the interruption time even further relative to RocksDB (94% shorter when scaling up and 96% shorter when scaling down). Compared with RocksDB, the time required for a Gemini job to return to normal performance is also significantly reduced, especially in the concurrency scaling-down scenario, where the reduction exceeds 70%.

Figure 3-2 RocksDB vs. Gemini vs. Gemini lazy loading in concurrency scaling speed

At the same time, the state lazy loading feature has been tested together with the dynamic job parameter update (hot update) feature. In a scenario with a job concurrency of 128 and a state size of 5 GB per concurrency, enabling state lazy loading together with hot update reduces the interruption time of concurrency scaling by more than 90% (scaling up: 579 s -> 13 s; scaling down: 420 s -> 11 s).

3.4 Key-value Separation Effect Test

Test settings: Select the Nexmark Q20 Join job as the benchmark for key-value separation performance, and enlarge the data scale (EventsNum = 400/800 million) to better match the large-state dual-stream Join scenario. The rest of the test environment and settings are the same as in 3.2 Nexmark. Gemini is tested with key-value separation disabled and enabled respectively.

The test results are shown in Figure 3-3. In the Q20 dual-stream Join scenario, enabling key-value separation brings Gemini a significant performance improvement: job throughput increases by 50% to 70%.

Figure 3-3 Comparison of Nexmark Q20 throughput of Gemini with key-value separation disabled and enabled

3.5 Online Performance

  • Gemini, as the default state backend of Realtime Compute for Apache Flink, has been continuously optimized and improved for more than three years, enhancing its performance, stability, and ease of use. Currently, on Alibaba Group's internal real-time computing platform and in the public cloud Realtime Compute for Apache Flink service, stateful jobs consuming more than 500,000 CUs use the Gemini storage engine, helping users achieve their business goals efficiently.
  • Since version VVR 8.x of the enterprise-level engine of Realtime Compute for Apache Flink, we have upgraded the Gemini architecture. To date, more than 53% of stateful Flink jobs on Alibaba Group's internal real-time computing platform use the new Gemini engine, with excellent performance and stability, and overall job resources are estimated to be reduced by about 27% compared with the old engine. In the public cloud Realtime Compute for Apache Flink service, more than 24% of stateful jobs also use the new Gemini engine.

4. Summary

Gemini, an enterprise-level state storage engine designed around the characteristics of stream computing scenarios, has seen its performance, stability, and ease of use steadily improve through more than three years of continuous optimization and refinement. Since version VVR 8.x of the enterprise-level engine of Realtime Compute for Apache Flink, the new version of Gemini has upgraded the core architecture and features on top of the previous version. Compared with RocksDB, the new version of Gemini provides better state access performance and faster scaling, and supports key-value separation, storage-compute separation, and state lazy loading.

In the future, we will continue to optimize and improve the Gemini engine to enhance the performance, ease of use, and stability of stream computing products, making it the most suitable state storage engine for stream computing scenarios.

5. References

[1] https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/concepts/stateful-stream-processing
[2] https://www.alibabacloud.com/help/en/flink/user-guide/dynamically-update-deployment-parameters
[3] https://github.com/apache/flink-benchmarks/tree/master/src/main/java/org/apache/flink/state/benchmark
[4] https://github.com/nexmark/nexmark
[5] https://www.alibabacloud.com/en/product/realtime-compute
[6] https://www.alibabacloud.com/help/en/flink/product-overview/engine-version
