Flink's unified batch and stream processing is a well-established concept in the stream computing field, and Flink has become the de facto standard for stream compute engines. Most people are familiar with Flink in stream computing scenarios. This article describes the technologies and challenges of Flink's unified batch and stream processing. In the next article, we will introduce its use cases and how some enterprises have implemented it, based on publicly available information.
This section introduces the technologies of unified batch and stream processing to help you understand the overall architecture of unified batch and stream processing and the technical selection of each component in the architecture.
This section describes how the architecture of unified batch and stream processing evolves.
The Lambda architecture is a data processing architecture that combines a traditional batch data pipeline with a fast streaming pipeline for handling real-time data. It is composed of three layers: the batch layer, the speed (or real-time) layer, and the serving layer, which responds to queries by merging the views produced by the batch and speed layers. Although the Lambda architecture takes advantage of both batch and stream processing without changing the original batch pipeline, it faces the following issues:
(Image source: Questioning the Lambda Architecture [1])
Unlike the Lambda architecture, the Kappa architecture removes the batch layer and retains only the speed layer for stream processing. It achieves upstream replay (backtracking) by relying on the data retention feature of message queues. The Kappa architecture resolves the preceding issues of the Lambda architecture but forgoes the benefits of batch processing. For example, the backtracking performance of stream processing is inferior to that of batch processing, and for jobs that do not require real-time processing, the Kappa architecture can waste resources.
Flink uses a unified compute engine and a unified storage format for both batch and stream processing, thereby resolving the issues present in the Lambda and Kappa architectures.
The following section describes the most important components in the architecture of unified batch and stream processing: compute engine and storage.
Apache Flink proposed treating batch processing as a special case of stream processing. You can use the DataStream API and Flink SQL to define both streaming and batch jobs. Unified batch and stream processing has significantly improved user experience, job stability, and processing performance, and many companies run Flink batch jobs in their production environments. The Flink community remains committed to further developing unified batch and stream processing.
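As a minimal sketch of this unification (table and column names are hypothetical), the same Flink SQL query can run as a batch job or a streaming job by switching a single configuration:

```sql
-- Switch between batch and streaming execution for the same query.
SET 'execution.runtime-mode' = 'batch';   -- or 'streaming'

-- This aggregation produces a final result in batch mode,
-- and continuously updated results in streaming mode.
SELECT user_id, COUNT(*) AS order_cnt
FROM orders
GROUP BY user_id;
```

Nothing in the query itself is mode-specific; the runtime mode alone decides whether the job terminates with a final result or keeps emitting updates.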
Apache Spark, one of the earliest compute engines to propose unified batch and stream processing, can also serve as the compute engine for this architecture. Unlike Flink, which offers native streaming, Apache Spark uses micro-batches to emulate streaming. As a result, Spark provides weaker semantic guarantees and can only deliver near-real-time processing, so it faces challenges in complex, large-scale real-time computing scenarios. Apache Spark has explored continuous processing to implement stream computing and reduce latency, but it is still in the experimental stage[3]. According to the author's observation, investment in continuous processing research was small and almost stopped after 2021.
The storage format is another important component in the architecture of unified batch and stream processing. Flink has made significant efforts to explore storage types for unified batch and stream processing and now supports several storage types. In the open-source community, the following mainstream storage formats apply to Flink's unified batch and stream processing:
Apache Paimon: a lake format designed for unified batch and stream processing. With Flink CDC, you can import data into Apache Paimon with one click. Additionally, you can use Flink SQL or Spark SQL to write data to Apache Paimon in batch or streaming mode. As a streaming data lake, Apache Paimon also allows Flink or Spark to read data in streaming mode. Apache Paimon has strong streaming read and write capabilities, achieving a latency of 1 to 5 minutes for streaming lake storage[4].
Apache Hudi: natively supports multiple engines. You can use Apache Hudi to read and write batch and streaming data, and use Presto to perform interactive analytics on data in Apache Hudi[5]. When integrated with Flink, Apache Hudi can achieve end-to-end latencies of around ten minutes[4].
Apache Iceberg: As early as 2020, Alibaba Cloud began efforts to integrate Flink with Apache Iceberg. After the integration, Apache Iceberg supports batch and streaming writes through Apache Flink's APIs. However, Flink writes data to Apache Iceberg with a certain latency, so hourly data updates are recommended[4] for Apache Iceberg.
In addition to lake tables, data warehouses can be used as storage for unified batch and stream processing, such as StarRocks, ClickHouse, Apache Druid, and Alibaba Cloud Hologres. However, data warehouses have higher storage costs than data lakes. In some data lakehouse solutions, such as the integration of StarRocks and Apache Paimon[6] and the integration of Hologres and Apache Paimon[7], data is written to data lakes by using lake table formats, while a data warehouse is used to analyze the data in the data lakes for online analytical processing (OLAP).
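To make the storage discussion concrete, the following hedged Flink SQL sketch (catalog name, warehouse path, and table schema are all hypothetical) shows how data might be continuously written to an Apache Paimon table:

```sql
-- Register a Paimon catalog backed by a (hypothetical) warehouse path.
CREATE CATALOG paimon_catalog WITH (
  'type' = 'paimon',
  'warehouse' = 'hdfs:///warehouse/paimon'
);
USE CATALOG paimon_catalog;

-- A primary-keyed lake table that supports streaming reads and writes.
CREATE TABLE IF NOT EXISTS orders_sink (
  order_id BIGINT,
  amount   DECIMAL(10, 2),
  PRIMARY KEY (order_id) NOT ENFORCED
);

-- In streaming runtime mode, this statement keeps syncing data into the lake.
INSERT INTO orders_sink
SELECT order_id, amount
FROM default_catalog.default_database.orders_source;
```

The same `INSERT INTO` also works in batch runtime mode, which is what makes a lake format like Paimon a natural storage layer for unified batch and stream processing.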
Users may encounter various challenges when using unified batch and stream processing. The Flink community is actively resolving these challenges and optimizing the batch job capability of Flink to gradually improve its batch and streaming engines. The following section outlines these challenges and the solutions provided by the Flink community.
In most cases, the shuffle used for Flink streaming jobs differs from that used for Flink batch jobs. Streaming jobs use the pipeline shuffle: data is not stored on disks, but all operators must be started when the job starts, which does not match the common scheduling requirements of batch jobs. Therefore, batch jobs commonly use the blocking shuffle: the upstream task writes shuffle data to a file, and the downstream task consumes that data after it starts.
By default, Flink uses its internal shuffle: data from an upstream compute node is written to the local disks of TaskManagers, and the downstream node connects to the upstream TaskManagers to read the shuffle files. As a result, a TaskManager cannot exit immediately after completing its computation; it cannot release the shuffle files until they have been consumed by the downstream node. This results in resource wastage and increases the cost of failover.
Therefore, the Flink community designed the shuffle service as a pluggable[8] service to allow users to easily extend it to implement Remote Shuffle Service (RSS). RSS uses a separate cluster to provide services. This can help increase resource utilization and reduce the fault tolerance overheads of the shuffle of TaskManagers. Apache Celeborn[9] can be used as a remote shuffle of Flink.
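As an illustration, enabling a remote shuffle such as Apache Celeborn comes down to pointing the pluggable shuffle service at the Celeborn plugin in flink-conf.yaml. The factory class and endpoint below follow the Celeborn documentation but should be verified against your Celeborn version; the host name is hypothetical:

```yaml
# Use Celeborn's implementation of Flink's pluggable shuffle service (FLIP-31).
shuffle-service-factory.class: org.apache.celeborn.plugin.flink.RemoteShuffleServiceFactory
# Address of the Celeborn master cluster (hypothetical host and port).
celeborn.master.endpoints: celeborn-master:9097
```

With this setup, shuffle data lives in the Celeborn cluster, so TaskManagers can exit as soon as their computation finishes.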
The Flink community also proposed hybrid shuffle in Shuffle 3.0. Hybrid shuffle combines the benefits of pipeline shuffle and blocking shuffle, allowing data to be written to the memory for direct consumption, or to disks for later consumption when the memory is insufficient and the downstream node does not consume data in a timely manner. Hybrid shuffle allows the downstream node to consume data at any time, either during or after data production by the upstream node. This completely eliminates resource fragmentation[10].
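Hybrid shuffle can be selected through the batch shuffle mode configuration introduced in Flink 1.16; a minimal flink-conf.yaml sketch:

```yaml
# ALL_EXCHANGES_BLOCKING is the default for batch jobs. The hybrid modes
# let downstream tasks consume from memory when possible and from disk otherwise.
execution.batch-shuffle-mode: ALL_EXCHANGES_HYBRID_FULL
# Alternatively, ALL_EXCHANGES_HYBRID_SELECTIVE spills data more selectively
# to reduce disk usage.
```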
The performance of Flink batch jobs is crucial to users of Flink Batch. The Flink community has made many significant optimizations, as evidenced by consistent performance gains on the TPC-DS benchmark in each release. For example, Operator Fusion Codegen[11] optimizes the code generated by the SQL planner; adaptive local hash aggregation[12] dynamically determines whether to use local aggregation; runtime filters and dynamic data pruning improve data processing efficiency; and adaptive query execution (AQE) determines parallelism automatically and performs dynamic load balancing[13].
In a distributed system, the performance of a single parallel subtask may degrade due to machine failures, insufficient resources, or network issues, and slow nodes may become a bottleneck for the whole job. Similar to traditional MapReduce and Spark, Flink 1.17 introduced speculative execution to resolve the slow-node issue[14]. After Flink detects a long-tail task, it deploys a mirror instance of the task on a non-hotspot machine, uses the result of whichever instance completes first, and cancels the remaining instances.
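Speculative execution is off by default and is enabled through configuration. The option names below follow FLIP-168 and should be checked against the documentation for your Flink version:

```yaml
# Enable speculative execution for batch jobs (Flink 1.17+).
execution.batch.speculative.enabled: true
# Cap the number of concurrent attempts (original + mirrors) per task.
execution.batch.speculative.max-concurrent-executions: 2
```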
Configuring an appropriate degree of parallelism for operators in a Flink batch job is challenging. Setting it too low may result in prolonged job runtimes and increased rollbacks during failovers. Conversely, setting it too high may lead to wasted resources and additional costs for task deployment and network shuffles. To resolve this issue, Flink 1.15 introduced Adaptive Batch Scheduler[15], which allows Flink to automatically determine the parallelism of operators based on the amount of consumed data. This eliminates the need to adjust the parallelism for jobs manually.
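In Flink 1.15, the Adaptive Batch Scheduler is opted into via configuration, and operators whose parallelism is left unset are sized automatically. A hedged flink-conf.yaml sketch follows; option names may differ in later releases, where this scheduler became the default:

```yaml
# Select the Adaptive Batch Scheduler (Flink 1.15/1.16).
jobmanager.scheduler: AdaptiveBatch
# Leave operator parallelism undecided so the scheduler can choose it.
parallelism.default: -1
# Bound the automatically derived parallelism.
jobmanager.adaptive-batch-scheduler.min-parallelism: 1
jobmanager.adaptive-batch-scheduler.max-parallelism: 128
```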
Flink SQL uses standard ANSI SQL, which differs in syntax from Hive SQL. As a result, many users encounter issues during migration from Hive SQL to Flink SQL. Although Flink 1.15 introduced Hive Dialect[15], there are still notable compatibility issues with Hive SQL. For example, after selecting a batch of jobs for migration, Kuaishou found many unsupported syntax elements. The community quickly responded to Kuaishou's feedback. Flink SQL prioritized compatibility with some important and commonly used syntaxes, such as CREATE TABLE AS, ADD JAR, USING JAR, macro commands, and Transform.
Flink 1.16 made many improvements to support Hive syntax, including CREATE TABLE AS[16], ADD JAR, and USING JAR[17]. Verified with Hive's qtest suite, the overall compatibility between Flink SQL and Hive SQL has reached 95%, which ensures that existing queries can be migrated to Flink.
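For example, after switching the dialect, Hive-style statements such as CTAS run directly in Flink SQL (table names and the jar path below are hypothetical):

```sql
-- Switch the SQL dialect from the default to Hive.
SET 'table.sql-dialect' = 'hive';

-- Hive-style CREATE TABLE AS, supported by Flink's Hive dialect work.
CREATE TABLE orders_copy AS
SELECT order_id, amount FROM orders WHERE amount > 0;

-- Register an auxiliary jar, e.g. for Hive UDFs (hypothetical path).
ADD JAR '/path/to/udf.jar';
```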
Unified batch and stream processing is a key development focus for Flink. Many users provide feedback to the Flink community when they use this architecture, and many developers contribute to enhancing Flink Batch. As a result, many users have been able to smoothly deploy the architecture in their production environments. In the next article, we will introduce the main application scenarios and enterprise use cases of unified batch and stream processing, gathered from public channels.
Although Flink's unified batch and stream processing is now commonly used in production environments, the community continues to make improvements. For example, the batch capability of the DataStream API needs to be fully aligned with that of the DataSet API. Flink will be further integrated with Apache Celeborn to achieve dynamic shuffle switching, introduce memory as a tier in multi-level storage, and support hybrid shuffle. Likewise, Flink will be further integrated with Apache Paimon to provide both the compute engine and the storage for unified batch and stream processing, making it easier for users to build a unified streaming and batch lakehouse with Apache Flink and Apache Paimon.
Flink Forward Asia is an event organized by Alibaba Cloud to promote Apache Flink education, adoption, usage, and community contributions. The event is created for Flink users to exchange technologies and information. You can obtain cutting-edge information about Flink streaming and batch processing and communicate directly with Flink developers and committers.
[1] https://www.oreilly.com/radar/questioning-the-lambda-architecture/
[2] https://www.bilibili.com/video/BV1164y1o7yc/
[4] https://flink-learning.org.cn/article/detail/d17cc1d2a06946b40c51d4301df6e540
[6] https://flink-learning.org.cn/article/detail/02a574303b7e65fd53e13a82b40a8d8f
[7] https://flink-learning.org.cn/article/detail/84f501725034542a7f41e0670645c714
[8] https://cwiki.apache.org/confluence/display/FLINK/FLIP-31%3A+Pluggable+Shuffle+Service
[9] https://celeborn.apache.org/
[10] https://flink-learning.org.cn/article/detail/f6449048654123b163e29917e8ad5a79
[12] https://issues.apache.org/jira/browse/FLINK-30542
[13] https://developer.aliyun.com/ebook/8229/115382
[14] https://cwiki.apache.org/confluence/display/FLINK/FLIP-168%3A+Speculative+Execution+for+Batch+Job
[15] https://cwiki.apache.org/confluence/display/FLINK/FLIP-152
[16] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=199541185
[17] https://cwiki.apache.org/confluence/display/FLINK/FLIP-214