This article is compiled from a talk by Xiaolin He, Senior Technical Expert at Alibaba and Apache Flink PMC Member and Committer, shared at the 2022 Flink Forward Asia (FFA) conference.
As a unified stream and batch computing engine, Flink provides a unified API, unified operator descriptions, and unified scheduling. However, there are still some subtle differences in the underlying implementations of Flink operators.
For a batch operator, its input is a finite data set. The batch operator performs calculations based on the complete dataset. If the memory is insufficient during the calculation, the data is spilled to the disk.
For a stream operator, its input is an infinite data set. Unlike a batch operator, a stream operator cannot wait until all input data has been collected before it starts processing, nor can it persist all of this data to disk. Therefore, a stream operator processes data record by record (or, in some cases, in small batches).
After a stream operator receives a record from upstream, a stateful operator reads the previous computation result from the state, computes with the current record, and stores the new result back into the state. Therefore, state optimization is a very important part of stream computing.
Batch processing only contains Append messages, but stream processing contains Append messages as well as UpdateBefore, UpdateAfter, and Delete messages, which are revisions of historical data. If the operator's computational logic does not handle these revision messages well, the output data will expand and cause new problems.
How can we make a Flink job run fast? There are six methods: reduce duplicate computing, reduce invalid data, solve data-skew issues, improve computing throughput, reduce state access, and reduce state size.
The last two are specific to stream processing; the rest are common to all computing engines.
Next, by walking through how a Flink SQL statement becomes a Flink job, we introduce how jobs are optimized.
After the Flink engine receives an SQL text, it parses it into a SqlNode using SqlParser. The Flink engine then looks up the metadata in the catalog to validate the tables, fields, types, and UDFs in the SqlNode. Once validation passes, the SqlNode is converted to a LogicalPlan, and the Optimizer optimizes the LogicalPlan into an ExecPlan, which is the final execution plan at the SQL layer. CodeGen technology then translates the execution plan into executable operators, which are described by Transformations, converted into a JobGraph, and submitted to the Flink cluster for execution.
How can we make a SQL job execute fast? The Optimizer plays an important role here: it converts the LogicalPlan, which is semantically equivalent to the SQL, into an ExecPlan that can be executed efficiently, while the conversion from ExecPlan to JobGraph is deterministic.
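You can inspect the result of this optimization process yourself with Flink's EXPLAIN statement, which prints the syntax tree and the optimized plans for a query. A minimal sketch (table and field names are hypothetical):

```sql
-- Print the abstract syntax tree, the optimized physical plan, and the
-- optimized execution plan for a query.
EXPLAIN PLAN FOR
SELECT a, SUM(b)
FROM my_table
GROUP BY a;
```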
The following figure shows the internal details of the Optimizer. A Flink job usually contains multiple INSERT statements; when these are translated into LogicalPlans, a DAG is formed, so this optimizer is called the DAG Optimizer.
The input of the Optimizer includes five parts: LogicalPlan, Flink Conf, Constraints (such as PK), Statistics, and Hints from SQL.
The optimization process includes several steps. First, the DAG is broken up into independent trees based on Views, which ensures that execution plans can be reused as much as possible. Each tree is then optimized by the Calcite Optimizer from the leaf nodes to the root node. Next, the optimized results are reassembled into a DAG of PhysicalPlans, and the PhysicalPlan is rewritten using sub-plan reuse technology to further eliminate duplicate computation. Finally, the PhysicalPlan is converted to an ExecPlan. Based on the Exec DAG, more optimizations can be applied, such as MultipleInput Rewrite, which reduces the overhead caused by network shuffling, and DynamicFiltering Rewrite, which reduces the reading and processing of invalid data.
Calcite Optimizer is a classic relational algebra optimization in databases. It includes rule-based optimization (RBO) and cost-based optimization (CBO). It also defines a large number of optimization rules and deduces a large amount of meta information (such as PrimaryKey and RowCount) to help the Optimizer obtain the optimal execution plan.
The Calcite Optimizer combines classic optimization ideas. First, some deterministic rewrites are performed on the LogicalPlan (such as subquery rewriting, decorrelation, and constant folding). Then some general rewrites are applied (such as various push-downs) to form the FlinkLogicalPlan. This part of the optimization is common to both stream and batch processing. After that, Stream PhysicalPlan and Batch PhysicalPlan apply optimizations specific to their respective underlying implementations. For example, Batch PhysicalPlan selects different Join algorithms based on the Cost Model, while Stream PhysicalPlan decides, based on configuration, whether to convert Retract processing to Upsert processing and whether to enable Local/Global processing to resolve data hotspot issues.
The DAG Optimizer addresses many of the problems mentioned earlier and plays a critical role in determining whether Flink SQL runs fast enough.
Next, we will introduce some best practices of optimization based on some high-frequency usage scenarios and SQL in production.
The first is sub-plan reuse optimization. The two SQL statements in the following figure are optimized into two independent pipelines. In the execution plan in the lower-left corner of the figure, a > 10 is calculated twice: once in the Scan operator and once in the Filter operator.
By enabling sub-plan reuse optimization, the Optimizer automatically discovers and merges subgraphs that have identical computational logic in the execution plan to avoid duplicate calculations. In this example, since the two Scan operators are the same, after optimization, they become one Scan operator to reduce data reads. The optimized DAG is shown in the lower-right corner of the figure above.
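Sub-plan reuse is controlled by the optimizer options below. A minimal sketch (these are the standard Flink options; both are enabled by default in recent versions):

```sql
-- Allow the optimizer to merge sub-plans with identical computational logic.
SET 'table.optimizer.reuse-sub-plan-enabled' = 'true';
-- Allow duplicate table sources (Scan operators) to be merged.
SET 'table.optimizer.reuse-source-enabled' = 'true';
```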
Logically, the Optimizer could also reuse the computation of a > 10 in the Filter operator. However, when a query is very complex and combined with various push-down optimizations, finding an ideal reusable execution plan is very difficult. View-based sub-plan reuse can be used to maximize reuse.
Here, a View can be understood as a sub-plan reuse unit. Based on user-defined Views, the optimizer can automatically identify which Views can be reused. The SQL statements in the following figure are translated into the execution plan in the lower-left corner. After the plan is optimized with sub-plan reuse, Scan and Filter a > 10 are reused (as shown in the execution plan in the lower-right corner of the figure).
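As a minimal sketch of this pattern (table, view, and sink names are hypothetical), the shared Scan and Filter are factored into a View that both INSERT statements consume within one statement set:

```sql
-- The common Scan + Filter a > 10 becomes one reusable sub-plan.
CREATE TEMPORARY VIEW filtered AS
SELECT a, b, c
FROM my_table
WHERE a > 10;

-- Both INSERTs must run in the same job for the sub-plan to be shared.
EXECUTE STATEMENT SET
BEGIN
  INSERT INTO sink1 SELECT a, b FROM filtered WHERE b < 100;
  INSERT INTO sink2 SELECT a, c FROM filtered WHERE c > 0;
END;
```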
Aggregation queries are frequently used in production. Here, select a,sum(b) from my_table group by a; is used as an example to describe aggregation optimization. The SQL statement is translated into the logical execution plan on the left of the following figure, and the right side shows the execution topology, where the parallelism of the Scan operator is 3 and the parallelism of the aggregate operator is 2.
Each small box in the figure represents a record; its color represents the value of field a, and its number represents the value of field b. All records from the Scan operator are shuffled by field a and sent to the aggregate operator. In stream computing, when the aggregate operator receives a record, it extracts the record's group key, fetches the previous aggregation result for that key from the state, aggregates it with the current record, updates the result to the state, and emits it downstream. (The logic is shown in the pseudo-code in the figure above.)
In the preceding figure, the input of the aggregate operator contains 15 records, so the aggregation triggers 15 state accesses (15 get and 15 put operations). Such frequent operations may affect performance. At the same time, the first aggregate instance receives eight red records, which form a hot key. Frequent state access and hot keys are common problems in production.
You can enable MiniBatch to reduce state access and hot-key access. The corresponding configuration is shown below.
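A sketch of the standard MiniBatch options (these are real Flink configuration keys; the five-second latency follows the example discussed next):

```sql
-- Enable MiniBatch: buffer input records and process them in small batches.
SET 'table.exec.mini-batch.enabled' = 'true';
-- Maximum time to buffer records before a batch is triggered.
SET 'table.exec.mini-batch.allow-latency' = '5 s';
-- Maximum number of buffered records per aggregate operator.
SET 'table.exec.mini-batch.size' = '5000';
```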
allow-latency indicates that the span of a MiniBatch is five seconds: the data arriving within five seconds forms one MiniBatch. After MiniBatch is enabled, the computational logic of aggregation changes. When the aggregate operator receives a record, it does not trigger computation directly but puts the record into a buffer. When the buffer is full (or the allowed latency is reached), computation starts: the buffered data is grouped by key and processed group by group. For each key, the previous aggregation result is fetched from the state once, all records for that key are aggregated one by one, and the final result is updated to the state and emitted downstream.
Here, the number of state accesses equals the number of distinct keys in the buffer. If all the data shown in the preceding figure is put into one buffer, there are four keys, so there are only four state accesses. Compared with the original 15 accesses, the overhead of state access is significantly reduced after MiniBatch is enabled.
MiniBatch is suited to scenarios where second-level latency is acceptable and aggregation state access is the bottleneck, trading a small increase in latency for higher throughput.
In the preceding figure, we can see that the first aggregate instance accesses the state less frequently, but the amount of data it must process remains unchanged. Its total data volume and per-key data volume are much larger than those of the other aggregate instance, which indicates data skew. We can use two-phase aggregation (Local/Global) to solve data skew and avoid data hotspot issues.
In order to enable two-phase aggregation, you must enable MiniBatch and set table.optimizer.agg-phase-strategy to TWO_PHASE or AUTO. The following figure shows the plan after the two-phase aggregation feature is enabled.
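A minimal configuration sketch:

```sql
-- Two-phase aggregation requires MiniBatch to be enabled.
SET 'table.exec.mini-batch.enabled' = 'true';
SET 'table.exec.mini-batch.allow-latency' = '5 s';
-- Let the optimizer split aggregations into Local and Global phases.
SET 'table.optimizer.agg-phase-strategy' = 'TWO_PHASE';
```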
LocalAgg is chained with the upstream operator, which means Scan and LocalAgg run with the same parallelism and there is no data shuffle between them. In LocalAgg, records are aggregated within each MiniBatch, and the aggregation results are output to GlobalAgg.
In the preceding figure, after LocalAgg pre-aggregates the data, the number of red records received by the first GlobalAgg instance is reduced from eight to three, and there is no longer data skew between GlobalAgg instances.
Local/Global is suitable for scenarios where MiniBatch is enabled and all aggregate functions support merging partial results (that is, they implement the merge method).
If a query contains distinct aggregation functions, Local/Global cannot resolve the data hotspot issue. As shown in the following figure, LocalAgg aggregates by field a and field b, so the local aggregation has little effect, and the first GlobalAgg instance still receives a large amount of hot data: seven red records in total. Because the aggregation keys (a and b) of LocalAgg are inconsistent with the downstream shuffle key (a), Local/Global cannot resolve the data hotspot issue of DistinctAgg.
We can use Partial/Final aggregation to resolve the hotspot issue of DistinctAgg. PartialAgg shuffles by the group key plus the distinct key, which ensures that records with the same group key and distinct key are shuffled to the same PartialAgg instance to complete the first layer of aggregation, reducing the amount of input to FinalAgg. You can use the following configuration to enable Partial/Final optimization.
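A sketch of the options that enable this rewrite (real Flink options; the bucket number shown is the default):

```sql
-- Rewrite distinct aggregations into Partial/Final (split) aggregations.
SET 'table.optimizer.distinct-agg.split.enabled' = 'true';
-- Number of buckets used to split hot distinct keys (default 1024).
SET 'table.optimizer.distinct-agg.split.bucket-num' = '1024';
```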
After the configuration is enabled, the optimizer translates the original plan into the plan shown in the following figure:
In the preceding example, after the PartialAgg aggregation, the number of hot-key records received by the first FinalAgg instance is reduced from seven to three. Partial/Final is suitable for queries that contain distinct aggregate functions and suffer from data skew; note that it does not take effect for queries containing user-defined aggregate functions.
Unlike LocalAgg, PartialAgg stores its results in the state, which doubles the state overhead. To solve this, Incremental optimization was introduced on top of Partial/Final. It combines the Local/Global and Partial/Final optimizations and merges the Partial GlobalAgg and the Final LocalAgg into a single IncrementalAgg operator, which stores only the values related to the distinct keys, so the state size is reduced. The following figure shows the execution plan after IncrementalAgg is enabled.
Some query rewrites can also reduce the state size of the aggregate operator. Many users use COUNT DISTINCT + CASE WHEN to count results under different conditions in the same dimension. In the aggregate operator, each COUNT DISTINCT function uses an independent state to store its distinct-related data, which is redundant. You can reduce the state size by changing CASE WHEN to FILTER, and the optimizer will then share the state data of the same dimension.
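A minimal sketch of the rewrite (table and field names are hypothetical):

```sql
-- Before: each COUNT DISTINCT keeps its own copy of the distinct state.
SELECT
  dt,
  COUNT(DISTINCT user_id) AS total_uv,
  COUNT(DISTINCT CASE WHEN flag = 'app' THEN user_id ELSE NULL END) AS app_uv
FROM t
GROUP BY dt;

-- After: with FILTER, the optimizer can share the distinct state for user_id.
SELECT
  dt,
  COUNT(DISTINCT user_id) AS total_uv,
  COUNT(DISTINCT user_id) FILTER (WHERE flag = 'app') AS app_uv
FROM t
GROUP BY dt;
```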
Join is very common in production queries, and Join optimizations bring very noticeable performance improvements. In the following figure, a simple Join query is used as an example. The query is translated into a Regular Join (also known as a dual-stream Join), whose state retains all data from both the left and right inputs. When the datasets are large, the Join state is also very large, which can cause serious performance problems in production. Optimizing Join state is very important in stream computing.
You can use the following two methods to optimize Regular Join: set a reasonable state TTL (table.exec.state.ttl) so that stale records are cleaned up from the Join state, and declare primary keys or unique keys on the inputs so that the Join state can use a more compact layout (see the sketch below).
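A sketch of the TTL configuration (a real Flink option; the retention value is only an example and must be chosen according to business correctness requirements):

```sql
-- Evict Join state that has not been accessed for one day.
SET 'table.exec.state.ttl' = '1 d';
```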
In addition to Regular Join, Flink provides other Joins: Lookup Join, Interval Join, Temporal Join, and Window Join. When business requirements allow, you can rewrite a Regular Join as one of these Joins to save state. The following figure shows the conversion between Regular Join and the other Joins:
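For example, if the business only needs to join records that are close in time, a Regular Join can be rewritten as an Interval Join, whose state is bounded by the time range. A minimal sketch (table and field names are hypothetical; the time fields are assumed to be event-time attributes):

```sql
-- Only payments within one hour of the order are joined, so state
-- outside the interval can be cleaned up automatically.
SELECT o.order_id, o.order_time, p.pay_time
FROM orders AS o
JOIN payments AS p
  ON o.order_id = p.order_id
 AND p.pay_time BETWEEN o.order_time AND o.order_time + INTERVAL '1' HOUR;
```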
Lookup Join is widely used in production, so we will introduce its optimization separately here. Currently, Flink provides multiple optimization methods for Lookup Join:
The first optimization method is the choice between the synchronous and asynchronous query mechanisms. As shown in the following figure, in a synchronous query, after a record is sent to the dimension table connector, the operator must wait for the result to return before processing the next record, so there is a lot of waiting time. In an asynchronous query, a batch of records is sent to the dimension table connector, and the results are returned to the downstream once the queries complete. Using the asynchronous mode significantly improves query performance. Flink provides query hints to enable the synchronous and asynchronous modes, as shown in the following figure.
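A minimal sketch of the hint (the LOOKUP hint exists in Flink 1.16+; table and field names are hypothetical):

```sql
-- Enable asynchronous lookup for the dimension table 'dim'.
SELECT /*+ LOOKUP('table' = 'dim', 'async' = 'true') */
       s.id, d.name
FROM src AS s
JOIN dim FOR SYSTEM_TIME AS OF s.proc_time AS d
  ON s.id = d.id;
```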
The second optimization method is the choice between the ordered and unordered mechanisms in asynchronous mode. In ordered mode, the operator must wait for earlier queries to return and sort the results so that the output order matches the input order; only then can data be sent downstream. In unordered mode, no output order is required: as soon as a query result returns, it can be sent directly downstream. Therefore, unordered mode can greatly improve performance compared to ordered mode. Flink provides query hints to enable the ordered and unordered modes, as shown in the following figure. (Flink also provides a job-level configuration that applies the same output mode to all dimension table queries.)
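A sketch of the output-mode hint and the job-level option (both exist in Flink 1.16+; only use allow_unordered when downstream correctness does not depend on record order):

```sql
-- Per-query: emit async lookup results as soon as they arrive.
SELECT /*+ LOOKUP('table' = 'dim', 'async' = 'true',
                  'output-mode' = 'allow_unordered') */
       s.id, d.name
FROM src AS s
JOIN dim FOR SYSTEM_TIME AS OF s.proc_time AS d
  ON s.id = d.id;

-- Job-level: apply the same output mode to all async dimension table queries.
SET 'table.exec.async-lookup.output-mode' = 'ALLOW_UNORDERED';
```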
The third optimization method is the cache mechanism. This is a very common optimization that uses a local in-memory cache to improve query efficiency. Currently, Flink provides three cache mechanisms: no cache, partial cache (caching recently accessed keys, for example with LRU eviction), and full cache (loading the entire dimension table into memory and refreshing it periodically).
Duplicate data often occurs in stream computing, so deduplication is performed very frequently. In earlier versions, Flink expressed "keep the first row" using group by with first_value and "keep the last row" using group by with last_value. The following figure shows an example:
The preceding query is converted into an aggregate operator. Its state is very large, and it cannot guarantee complete semantics: the fields of a downstream output record may come from different input rows, which is inconsistent with the semantics of row-level deduplication.
In recent versions, Flink provides a more efficient deduplication implementation. Flink does not provide a dedicated deduplication syntax; instead, the Over clause is used to express deduplication semantics, as shown in the following figure. To keep the first row, compute row_number in ascending time order and filter on row_number; in this case, the state only needs to store the keys that have appeared. To keep the last row, compute row_number in descending time order; the state then only stores the last row of each key.
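A minimal sketch of the pattern (table and field names are hypothetical; ts is assumed to be a time attribute):

```sql
-- Keep the first row per key: the state only needs to remember seen keys.
SELECT id, item, ts
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts ASC) AS rn
  FROM my_table
)
WHERE rn = 1;
```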
TopN (also known as Rank) is also frequently used in production. Flink does not provide a dedicated TopN syntax either; it is likewise expressed through the Over clause. Currently, TopN has three implementations, listed below in order of decreasing performance.
The first method, AppendRank, requires the input to contain only Append messages. In this case, the state of the TopN operator only needs to store, for each key, the N records that currently satisfy the TopN condition.
The second method, UpdateFastRank, requires the input to be an update stream whose Upsert Key contains the Partition Key of the Over clause, and the order-by field must be monotonic in the direction opposite to the sort order. Take the query on the right of the preceding figure as an example. GROUP BY a, b produces an update stream keyed by (a, b). In the TopN operator, the upstream Upsert Key (a, b) contains the Partition Key (a, b) of the Over clause. Because the inputs of the Sum function are all non-negative (where c >= 0), the Sum result is monotonically increasing, which is opposite to the descending sort order. Therefore, this query can be translated into UpdateFastRank.
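A sketch of such a query (the field names follow the description above; the table name and TopN size are hypothetical):

```sql
-- GROUP BY a, b yields an upsert stream keyed by (a, b); SUM(c) with c >= 0
-- is monotonically increasing, opposite to the descending sort order.
SELECT a, b, sum_c
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY a, b ORDER BY sum_c DESC) AS rn
  FROM (
    SELECT a, b, SUM(c) AS sum_c
    FROM t
    WHERE c >= 0
    GROUP BY a, b
  )
)
WHERE rn <= 10;
```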
The third method, RetractRank, places no requirements on the input. All input data must be stored in the state, and its performance is the worst of the three.
In addition to modifying queries to meet business requirements so the optimizer can select better operators, TopN has some other optimization methods:
cache_hit = cache_size * parallelism / top_n_num / partition_key_num

The TopN operator uses a cache to reduce state access, and the formula above estimates its cache hit ratio. It shows that increasing the cache size increases the hit ratio (the cache size can be modified through table.exec.rank.topn-cache-size, whose default value is 10,000). Note that when the cache size is increased, the TaskManager memory needs to be increased accordingly.

Flink also provides a variety of connector interfaces to help the optimizer generate better execution plans. You can implement these interfaces based on the capabilities of your connector to improve execution performance.
No optimizer is perfect. Flink provides a Hints mechanism to influence the optimizer in order to obtain a better execution plan. There are two types of hints: table hints (dynamic table options), which override table options for a single query, and query hints (such as the LOOKUP hint shown earlier), which suggest a specific optimization strategy to the optimizer.
In the future, we will provide deeper and more intelligent optimizations in additional scenarios to improve the execution efficiency of the Flink engine.
First, we will continue to explore deeper optimizations. For example, when multiple consecutive Joins cause state duplication, we can optimize them into a single Join to avoid duplicated state.
Second, we hope to expand the breadth of optimization with targeted optimizations for different business scenarios.
Lastly, we hope the optimization will be more intelligent, and we will explore doing some dynamic optimization work (such as automatically optimizing the execution plan online according to traffic changes).