By Ren Qingsheng
The preceding figure shows a typical example.
First, records are read from Kafka through the Source. Then, these records are sent to the operators in Flink for computation. Finally, the results are written to Elasticsearch through the Sink. The Source and the Sink serve as the interfaces at the two ends of a Flink job.
As shown on the left side of the figure, before version 1.10 Flink offered two source interfaces: the SourceFunction API for processing streaming data and the InputFormat API for processing batch data. After version 1.10, the community introduced a new Source API to refactor the Source end entirely. Let's look at why this was done.
To address the inconsistency between batch and stream implementations.
As the ecosystem continued to grow, the old APIs exposed several problems. One of the most obvious is the inconsistency between the batch and stream implementations.
To address the complexity hidden behind a seemingly simple interface.
The previous APIs look relatively simple, but the logic required to implement them correctly is complicated, which is not friendly to connector developers.
Therefore, to solve these problems, a new Source API was designed in FLIP-27, which has the following features:
The split enumerator runs on the job master, and the source reader runs on the task executor. The enumerator therefore plays the role of leader and coordinator, while the reader is the executor.
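This division of responsibilities is reflected directly in the new Source interface. The sketch below is a simplified illustration: MySplit, MySplitEnumerator, MySourceReader, and the two serializer classes are hypothetical names, and the signatures follow FLIP-27 but may differ slightly between Flink versions.

```java
import java.util.List;

import org.apache.flink.api.connector.source.Boundedness;
import org.apache.flink.api.connector.source.Source;
import org.apache.flink.api.connector.source.SourceReader;
import org.apache.flink.api.connector.source.SourceReaderContext;
import org.apache.flink.api.connector.source.SourceSplit;
import org.apache.flink.api.connector.source.SplitEnumerator;
import org.apache.flink.api.connector.source.SplitEnumeratorContext;
import org.apache.flink.core.io.SimpleVersionedSerializer;

/** A hypothetical split: one unit of work, e.g. a partition of the external system. */
class MySplit implements SourceSplit {
    private final String id;
    MySplit(String id) { this.id = id; }
    @Override public String splitId() { return id; }
}

/** Sketch of a FLIP-27-style source; the enumerator, reader, and serializer classes are elided. */
public class MySource implements Source<String, MySplit, List<MySplit>> {

    @Override
    public Boundedness getBoundedness() {
        return Boundedness.CONTINUOUS_UNBOUNDED;  // a streaming source never ends
    }

    @Override
    public SplitEnumerator<MySplit, List<MySplit>> createEnumerator(
            SplitEnumeratorContext<MySplit> context) {
        // Runs on the job master: discovers splits and assigns them to readers.
        return new MySplitEnumerator(context);  // hypothetical class, not shown
    }

    @Override
    public SplitEnumerator<MySplit, List<MySplit>> restoreEnumerator(
            SplitEnumeratorContext<MySplit> context, List<MySplit> checkpoint) {
        // Rebuilds the enumerator from the splits stored in the last checkpoint.
        return new MySplitEnumerator(context, checkpoint);  // hypothetical
    }

    @Override
    public SourceReader<String, MySplit> createReader(SourceReaderContext context) {
        // Runs on each task executor: reads the splits assigned by the enumerator.
        return new MySourceReader(context);  // hypothetical class, not shown
    }

    @Override
    public SimpleVersionedSerializer<MySplit> getSplitSerializer() {
        return new MySplitSerializer();  // hypothetical
    }

    @Override
    public SimpleVersionedSerializer<List<MySplit>> getEnumeratorCheckpointSerializer() {
        return new MyCheckpointSerializer();  // hypothetical
    }
}
```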
The enumerator's and the reader's checkpoint states are stored separately, but there is some communication between the two. For example, the enumerator needs to assign splits to a reader and to notify the reader when there are no more splits to process. Since they run in different environments, some network communication between them is inevitable, so the following communication stack is defined.
The communication stack exposes several kinds of events that developers can work with.
First, the top layer is the source event, which developers can use to define custom operations. For example, if a source reader must stop reading data under certain conditions, the split enumerator can inform the source reader of this through a source event.
The next layer is the operator coordinator, which communicates with the operator that executes the tasks through operator events. Some operator events are predefined, such as adding splits and notifying the reader that there are no more splits. These events, which apply to all sources, are abstracted at the operator-event layer.
Address lookup determines which operator a message is sent to. When a Flink job runs, its operators form a directed acyclic graph, and different operators may run on different task managers. It is up to address lookup to find the task and operator a given event should be delivered to.
Since the job master and the task executor communicate over the network, there is a Remote Procedure Call (RPC) Gateway between them. All events are transmitted via RPC through this gateway.
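As a small illustration of the top layer, a custom source event is simply a serializable class implementing the SourceEvent marker interface; the event type and its field below are hypothetical.

```java
import org.apache.flink.api.connector.source.SourceEvent;

/** A hypothetical event asking a reader to stop consuming from the external system. */
public class StopReadingEvent implements SourceEvent {
    private final String reason;

    public StopReadingEvent(String reason) {
        this.reason = reason;
    }

    public String getReason() {
        return reason;
    }
}
```

The enumerator can push such an event to a particular reader with SplitEnumeratorContext.sendEventToSourceReader(subtaskId, new StopReadingEvent(...)), and the reader reacts to it by overriding handleSourceEvents(SourceEvent) on the SourceReader side; the event then travels down the stack as an operator event and through the RPC gateway like any other.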
The community provides SourceReaderBase to simplify the implementation of source readers and reduce the work of developers. Users can directly extend the SourceReaderBase class, which significantly simplifies development. Let's take a look at SourceReaderBase. As shown in the following figure, the source reader seems to contain many components, but it can be split into two parts for easier understanding.
With the elementQueue in the middle acting as a boundary, the part marked in blue on the left deals with the external system, while the part marked in orange on the right deals with the Flink engine.
First, the left part contains one or more SplitReaders. Each reader is driven by a fetcher, and a fetcher manager manages multiple fetchers. The threading model can vary: for example, a single SplitReader on one thread can consume multiple partitions, or multiple threads can be started, each running one fetcher and one reader, with each reader consuming one partition in parallel. It is up to the user to choose the implementation.
For performance, the SplitReader pulls a batch of records from the external system each time and puts it into the elementQueue. As shown in the figure, each blue box is one fetched batch, and each orange box is a single record within that batch.
On the right side of the elementQueue are the RecordEmitter and SourceOutput. Each time, the RecordEmitter pulls a batch of records from the intermediate elementQueue and sends them to the downstream SourceOutput one by one. Note that the RecordEmitter must be non-blocking, because it is driven by the task's main thread, which currently uses a lock-free mailbox model: the work to be executed is divided into mails, and one mail at a time is taken from the mailbox and processed.
Each time the RecordEmitter sends a record downstream, it also reports whether there is more data to be processed. Meanwhile, the processing progress of the current split is recorded in SplitStates, including its current state and how far reading has progressed.
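To make the right-hand side more concrete, here is a minimal sketch of a RecordEmitter; the RawRecord and MySplitState types are hypothetical and stand for whatever the SplitReader fetches and whatever per-split progress needs to be tracked.

```java
import org.apache.flink.api.connector.source.SourceOutput;
import org.apache.flink.connector.base.source.reader.RecordEmitter;

/** Hypothetical raw record type as fetched from the external system. */
class RawRecord {
    final String value;
    final long offset;
    RawRecord(String value, long offset) { this.value = value; this.offset = offset; }
}

/** Hypothetical mutable state tracking how far a split has been read. */
class MySplitState {
    long currentOffset;
}

/**
 * Converts one fetched element, hands it to SourceOutput, and advances the
 * progress stored for the split. It is driven by the task's mailbox thread,
 * so it must return quickly and never block.
 */
public class MyRecordEmitter implements RecordEmitter<RawRecord, String, MySplitState> {

    @Override
    public void emitRecord(RawRecord element, SourceOutput<String> output, MySplitState splitState) {
        output.collect(element.value);              // forward the record downstream
        splitState.currentOffset = element.offset;  // progress later captured in SplitStates
    }
}
```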
When the SplitEnumerator discovers a new split in the external system, it calls the addSplits method through RPC to add the new split to the reader. On the reader side, the SplitFetcherManager allocates the new split according to the threading model the user selected earlier.
If there is only one thread, the existing fetcher is given a new task, and its reader is assigned to read the new split. If there are multiple threads, a new thread is created, along with a new reader, to process the split separately. In either case, the processing progress is recorded in SplitStates.
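On the enumerator side, this assignment boils down to a call on the SplitEnumeratorContext. The fragment below comes from a hypothetical enumerator (the enumeratorContext field and the pickSubtaskFor strategy are assumptions); it only illustrates the call path that ends in the SplitFetcherManager.

```java
// Fragment of a hypothetical SplitEnumerator. The assignSplit call is delivered
// to the chosen reader as an operator event over RPC; on the reader side,
// SourceReaderBase hands the split to the SplitFetcherManager, which applies
// the threading model selected by the user.
void onNewSplitDiscovered(MySplit split) {
    int subtask = pickSubtaskFor(split);            // hypothetical placement strategy
    enumeratorContext.assignSplit(split, subtask);  // triggers addSplits on that reader
}

// For bounded sources, the enumerator can also tell a reader that nothing more
// will arrive, so the reader can finish once its assigned splits are done.
void noMoreWorkFor(int subtask) {
    enumeratorContext.signalNoMoreSplits(subtask);
}
```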
Now, let's see how checkpoints are handled in the new Source API.
On the left is the coordinator, namely the SplitEnumerator. As shown in the figure, it still has one unassigned split (Split_5). The arrows in the middle carry splits that are in transit, and the dotted lines indicate the checkpoint barrier. It can be seen that Split_2 is already ahead of the barrier, while Split_4 is behind it.
The reader at the bottom is requesting a new split from the SplitEnumerator. The three readers have each been assigned some splits and have processed part of them, so reading positions have been generated. Let's see what the enumerator and the readers need to store at a checkpoint.
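In code, both sides expose a snapshot hook. The fragments below are simplified and hypothetical (the unassignedSplits field and the helper are assumed, and older Flink versions do not pass a checkpointId to the enumerator), but they show what each side has to persist.

```java
// Enumerator side (job master): persist the splits that have been discovered
// but not yet assigned, such as Split_5 in the figure.
@Override
public List<MySplit> snapshotState(long checkpointId) {
    return new ArrayList<>(unassignedSplits);  // hypothetical field
}

// Reader side (task executor): persist the assigned splits together with the
// positions recorded in SplitStates, so reading can resume exactly where it
// stopped. SourceReaderBase already implements this by converting each split
// state back into a split; a hand-written reader would do roughly:
@Override
public List<MySplit> snapshotState(long checkpointId) {
    return assignedSplitsWithCurrentPositions();  // hypothetical helper
}
```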
The steps described above mostly deal with the external system, which means the Flink engine itself is rarely involved. Users no longer need to worry about checkpoint locks, multithreading, and similar issues; instead, they can concentrate on the interaction with the external system. The new Source API therefore greatly simplifies development through these abstractions.
Note that Flink provides exactly-once semantics: no data is duplicated and no data is lost. Flink has done a lot of work to achieve this, and the two-phase commit at the Sink is a vital part of it.
In the pre-commit phase, since the distributed system follows a "one coordinator, multiple executors" model, the coordinator initiates the commit: it sends a commit request message to all executors to start the two-phase commit.
When an executor receives the message, it makes its preparations for the commit. After that, every executor informs the coordinator that it is ready for the next step. Once the coordinator has received the "ready" response from all executors, the pre-commit phase ends and the second phase, the commit phase, begins.
When the coordinator decides to commit, it sends a commit message to the executors. The executors then commit the data they prepared earlier and reply to the coordinator whether the commit succeeded.
Once the coordinator decides to enter the commit phase, all executors must carry out the order to the letter. If the coordinator fails at this point, the commit must still be executed after the coordinator recovers: once the commit has been decided, the executors must carry it through.
If an executor fails in the pre-commit phase, it may report an error to the coordinator, for example due to a network disconnection. Alternatively, if the coordinator times out without receiving a response from an executor (Executor 3 in the figure), it triggers a rollback as the second phase instead: it tells all executors that the commit has failed and that they need to roll back to the previous state, and the executors then undo their earlier operations.
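The protocol itself is independent of Flink. The following self-contained sketch shows the shape of a coordinator-driven two-phase commit; it is illustrative only and is not Flink's implementation.

```java
import java.util.List;

/** A minimal, illustrative two-phase commit coordinator; not Flink's implementation. */
public class TwoPhaseCommitCoordinator {

    /** What each participant (executor) must support. */
    public interface Participant {
        boolean prepare(long txnId);   // phase 1: stage the data and report readiness
        void commit(long txnId);       // phase 2: make the staged data visible
        void rollback(long txnId);     // abort: discard the staged data
    }

    public boolean runTransaction(long txnId, List<Participant> participants) {
        // Phase 1: ask everyone to prepare; any failure (or timeout) aborts the transaction.
        for (Participant p : participants) {
            boolean ready;
            try {
                ready = p.prepare(txnId);
            } catch (RuntimeException e) {   // e.g. network disconnection
                ready = false;
            }
            if (!ready) {
                for (Participant q : participants) {
                    q.rollback(txnId);       // undo whatever was staged
                }
                return false;
            }
        }
        // Phase 2: once the decision to commit is made, every participant must
        // eventually commit, even if the coordinator crashes and later recovers.
        for (Participant p : participants) {
            p.commit(txnId);
        }
        return true;
    }
}
```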
Take the file system Sink as an example.
After receiving the checkpoint barrier, the file system Sink performs the pre-commit operation: it writes the current data to a temporary file on disk. When the pre-commit phase is over, all operators return a "ready for commit" message to the coordinator.
The commit is executed in the second phase. The JobManager sends a commit command to all operators. After receiving this command, the Sink performs the commit operation.
Take the file system as an example again. As mentioned above, the data was written to a temporary file in the pre-commit phase. In the commit phase, the temporary file is renamed according to the predefined naming scheme, which completes the commit operation.
Note that the temporary file is not useless: it also plays a role in situations such as rollback. The two-phase commit mechanism is used in this subtle way to guarantee exactly-once semantics.
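For illustration, the staging-and-rename cycle can be sketched with plain java.nio.file calls; the naming scheme below is made up, and a real sink would also track the temporary files in state so they can be committed or cleaned up after a failure.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.List;

/** Illustrative only: how a file sink can stage data and publish it atomically. */
public class FileTwoPhaseSketch {

    /** Pre-commit: write the records received since the last checkpoint to a temporary file. */
    static Path preCommit(Path dir, List<String> records, long checkpointId) throws IOException {
        Path tempFile = dir.resolve(".part-" + checkpointId + ".inprogress");
        Files.write(tempFile, records, StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
        return tempFile;   // the temporary file is the "committable"
    }

    /** Commit: rename the temporary file to its final name, making the data visible. */
    static void commit(Path tempFile, long checkpointId) throws IOException {
        Path finalFile = tempFile.resolveSibling("part-" + checkpointId);
        Files.move(tempFile, finalFile, StandardCopyOption.ATOMIC_MOVE);
    }

    /** Rollback: simply delete the temporary file; no visible data was produced. */
    static void rollback(Path tempFile) throws IOException {
        Files.deleteIfExists(tempFile);
    }
}
```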
1) Writer: continuously writes data from the upstream into some intermediate state during the write (pre-commit) phase.
2) Committable: the "intermediate state" mentioned above, on which the commit operation can be performed.
3) Committer: commits the Committable.
4) Global Committer: an optional component whose necessity depends on the external system; Iceberg, for example, requires a global commit.
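To make this division of labor concrete, the sketch below mirrors the four roles for the file-sink example above. The interfaces are deliberately simplified and are not the exact Flink Sink API; for the file sink, the committable would simply be the path of the staged temporary file.

```java
import java.io.IOException;
import java.util.List;

/** Illustrative interfaces mirroring the four roles above; not the exact Flink Sink API. */
public interface SinkRoles {

    /** 1) Writer: turns incoming records into an intermediate, not-yet-visible state. */
    interface Writer<IN, COMMITTABLE> {
        void write(IN element) throws IOException;

        /** Called in the pre-commit phase: hands over what was staged since the last checkpoint. */
        List<COMMITTABLE> prepareCommit() throws IOException;
    }

    // 2) Committable: represented here by the COMMITTABLE type parameter
    //    (for the file sink, the path of a staged temporary file).

    /** 3) Committer: makes the staged committables visible, e.g. by renaming temporary files. */
    interface Committer<COMMITTABLE> {
        void commit(List<COMMITTABLE> committables) throws IOException;
    }

    /**
     * 4) GlobalCommitter: optional; aggregates committables from all subtasks into one
     * global action, e.g. adding all newly written files to an Iceberg table in a single snapshot.
     */
    interface GlobalCommitter<COMMITTABLE, GLOBAL> {
        GLOBAL combine(List<COMMITTABLE> committables);

        void commit(List<GLOBAL> globalCommittables) throws IOException;
    }
}
```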