
Accelerated Integration: Unveiling Flink Connector's API Design and Latest Advances

Abstract: This article is compiled from the presentation by Ren Qingsheng, committer and PMC member of Apache Flink, at the Flink Forward Asia 2023 core technology session (Part 2). The content mainly consists of the following four parts:

  1. Source API
  2. Sink API
  3. Integrating Connectors into Table/SQL API
  4. Catalog API

Apache Flink provides stream processing APIs at different levels of abstraction. The low-level DataStream API provides the building blocks for application development based on programming languages such as Java. The high-level Table and SQL APIs enable development based on declarative languages. This layered structure is adopted in the design of Apache Flink's connector APIs. The foundational Source and Sink APIs are the counterpart to the DataStream API, and the highest-level Catalog API is the counterpart to the Table and SQL APIs. With this bottom-up structure in mind, let's dive into each API's design details and implementation considerations.

[Slide 1]

Apache Flink Source API

[Slide 2]

The Source API has undergone several iterations, with the first and public versions introduced in Flink 1.12 and Flink 1.14, respectively. The Source API's predecessors, namely, the InputFormat and SourceFunction interfaces, will be phased out in the upcoming 2.0 release. We recommend that you use the latest Source API to build a Flink connector.

Similar to a Flink cluster, the Source API adopts a master-slave architecture. The "brain" of a source is SplitEnumerator. By definition, it is responsible for generating splits.

A split represents a portion of data in an external system, such as a topic partition in Apache Kafka or a file or directory in a file system. The SplitEnumerator discovers splits and assigns them to SourceReaders, which do the actual reading. Each source has a single SplitEnumerator instance, which runs on the JobManager, while SourceReaders run in parallel on the TaskManagers; the number of SourceReaders equals the source parallelism. Communication between the SplitEnumerator and SourceReaders relies on the Remote Procedure Calls (RPCs) between the JobManager and TaskManagers. To facilitate their coordination in split assignment and overall management, Flink also provides SourceEvents, which are custom events transmitted directly between the SplitEnumerator and SourceReaders.
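To make the shape of the interface concrete, here is a minimal sketch of a custom source built on the org.apache.flink.api.connector.source package. MySplit, MySourceReader, MySplitEnumerator, MyEnumeratorState, and the two serializers are hypothetical placeholders for your own implementations, not classes shipped with Flink.

```java
import org.apache.flink.api.connector.source.Boundedness;
import org.apache.flink.api.connector.source.Source;
import org.apache.flink.api.connector.source.SourceReader;
import org.apache.flink.api.connector.source.SourceReaderContext;
import org.apache.flink.api.connector.source.SplitEnumerator;
import org.apache.flink.api.connector.source.SplitEnumeratorContext;
import org.apache.flink.core.io.SimpleVersionedSerializer;

public class MySource implements Source<String, MySplit, MyEnumeratorState> {

    @Override
    public Boundedness getBoundedness() {
        return Boundedness.CONTINUOUS_UNBOUNDED; // an unbounded, streaming source
    }

    @Override
    public SourceReader<String, MySplit> createReader(SourceReaderContext readerContext) {
        // One reader per parallel source subtask, running on the TaskManagers.
        return new MySourceReader(readerContext);
    }

    @Override
    public SplitEnumerator<MySplit, MyEnumeratorState> createEnumerator(
            SplitEnumeratorContext<MySplit> enumContext) {
        // A single enumerator instance per source, running on the JobManager.
        return new MySplitEnumerator(enumContext);
    }

    @Override
    public SplitEnumerator<MySplit, MyEnumeratorState> restoreEnumerator(
            SplitEnumeratorContext<MySplit> enumContext, MyEnumeratorState checkpoint) {
        // Rebuild the enumerator from its checkpointed state after a failover.
        return new MySplitEnumerator(enumContext, checkpoint);
    }

    @Override
    public SimpleVersionedSerializer<MySplit> getSplitSerializer() {
        return new MySplitSerializer();
    }

    @Override
    public SimpleVersionedSerializer<MyEnumeratorState> getEnumeratorCheckpointSerializer() {
        return new MyEnumeratorStateSerializer();
    }
}
```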

[Slide 3]

Considering that SourceReader is a low-level API, we've introduced the SourceReaderBase class to simplify the implementation of a source, as shown in the preceding figure. SourceReaderBase decouples the source's communication with external systems from its coordination with Flink, so you can focus on the interaction with the external system and spend less effort on Flink-specific concerns such as checkpointing and main-thread management. On top of SourceReaderBase, the SplitReader API implements the logic of fetching and reading data from the assigned splits. On one side, SplitReaders put the fetched data into an element queue maintained by SourceReaderBase, which acts as a buffer between the external system and Flink. On the other side, Flink's mailbox-based main task thread pulls data from the queue and passes it to the RecordEmitter, which handles deserialization, converting each record from the external system's format into the type consumed by downstream operators, and forwards it downstream.
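As a small illustration of the last step in that pipeline, the sketch below shows a RecordEmitter that deserializes raw bytes fetched by a SplitReader into Strings. MySplitState is a hypothetical per-split bookkeeping class (for example, tracking the last emitted offset); the interface and signature follow org.apache.flink.connector.base.source.reader.RecordEmitter.

```java
import java.nio.charset.StandardCharsets;

import org.apache.flink.api.connector.source.SourceOutput;
import org.apache.flink.connector.base.source.reader.RecordEmitter;

public class MyRecordEmitter implements RecordEmitter<byte[], String, MySplitState> {

    @Override
    public void emitRecord(byte[] rawRecord, SourceOutput<String> output, MySplitState splitState) {
        // Convert the external representation into the type expected downstream
        // and forward it to the next operator.
        output.collect(new String(rawRecord, StandardCharsets.UTF_8));
        // Advance the split's bookkeeping so a restore resumes from the right position.
        splitState.advanceOffset();
    }
}
```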

[Slide 4]

Based on personal experience, I'd like to share a few tips for developing a source. First, it's essential to create a buffer that separates exchanges with the external system from coordination with Flink, such as the element queue used by SourceReaderBase mentioned earlier. This is because Flink's main task thread follows the mailbox model to handle checkpoint and control messages; tying it up with interactions with the external system can degrade the performance of downstream operators, the entire task, and even checkpointing. Therefore, make sure that you use a separate thread for I/O operations.
Second, use the existing APIs to streamline development. For example, many developers create a thread or thread pool inside their SplitEnumerator implementation, which makes thread maintenance troublesome. This heavy lifting is already handled by the callAsync method of the SplitEnumeratorContext interface: when you call it, a worker thread is created to perform the I/O operation, and Flink takes care of maintaining that thread. Similarly, take advantage of the SourceReaderBase and SplitReader APIs rather than implementing the logic from scratch; this significantly simplifies development.
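A hedged sketch of that pattern is shown below, reusing the hypothetical MySplit and MyEnumeratorState types from earlier: the blocking discovery call runs on a worker thread that callAsync creates and Flink maintains, while the handler runs back on the coordinator thread.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.List;

import org.apache.flink.api.connector.source.SplitEnumerator;
import org.apache.flink.api.connector.source.SplitEnumeratorContext;
import org.apache.flink.util.FlinkRuntimeException;

public class MySplitEnumerator implements SplitEnumerator<MySplit, MyEnumeratorState> {

    private final SplitEnumeratorContext<MySplit> context;

    public MySplitEnumerator(SplitEnumeratorContext<MySplit> context) {
        this.context = context;
    }

    @Override
    public void start() {
        // Discover new splits every 30 seconds on a Flink-managed worker thread;
        // no hand-rolled thread pool is needed.
        context.callAsync(this::discoverSplits, this::handleDiscoveredSplits, 0L, 30_000L);
    }

    private List<MySplit> discoverSplits() {
        // Hypothetical blocking I/O: list partitions or files in the external system.
        return Collections.emptyList();
    }

    private void handleDiscoveredSplits(List<MySplit> splits, Throwable error) {
        if (error != null) {
            throw new FlinkRuntimeException("Split discovery failed", error);
        }
        // Hypothetical assignment, e.g. context.assignSplit(split, subtaskId) for each split.
    }

    @Override
    public void handleSplitRequest(int subtaskId, String requesterHostname) {
        // Assign a pending split to the requesting reader, if any.
    }

    @Override
    public void addSplitsBack(List<MySplit> splits, int subtaskId) {
        // Return splits from a failed reader to the pending pool.
    }

    @Override
    public void addReader(int subtaskId) {
        // A new reader registered; optionally assign splits eagerly.
    }

    @Override
    public MyEnumeratorState snapshotState(long checkpointId) {
        return new MyEnumeratorState();
    }

    @Override
    public void close() throws IOException {
        // Release connections to the external system.
    }
}
```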

[Slide 5]

We've also enhanced the Source API's capability in recent releases, starting with the introduction of Hybrid Source. In a typical use case, Flink needs to read historical data from storage systems, such as Hadoop Distributed File System (HDFS), and then read real-time data from Kafka or other queue services. Hybrid Source features the capability of reading data from heterogeneous sources in sequence and ensures smooth switching between sources. For example, the KafkaEnumerator is initialized based on the position or time information in the SwitchContext passed from the FileEnumerator, enabling a seamless transition from a historical data source to a real-time data source. As shown in this figure, Hybrid Source is built on top of the Source API, encapsulating the SplitEnumerator and SourceReader interfaces and adding a utility class to facilitate switching between sources.
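The sketch below shows the typical usage, assuming the file and Kafka connector artifacts are on the classpath; the path, broker address, topic, and switchover timestamp are illustrative values, not taken from the presentation.

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Illustrative boundary between historical and real-time data.
long switchTimestamp = 1704067200000L;

// Bounded file source for the historical data in HDFS.
FileSource<String> historical =
        FileSource.forRecordStreamFormat(
                        new TextLineInputFormat(), new Path("hdfs:///data/history"))
                .build();

// Unbounded Kafka source that starts reading from the switchover timestamp.
KafkaSource<String> realtime =
        KafkaSource.<String>builder()
                .setBootstrapServers("broker:9092")
                .setTopics("events")
                .setGroupId("hybrid-demo")
                .setStartingOffsets(OffsetsInitializer.timestamp(switchTimestamp))
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

// Read the bounded file source first, then switch to Kafka.
HybridSource<String> hybridSource =
        HybridSource.builder(historical)
                .addSource(realtime)
                .build();

env.fromSource(hybridSource, WatermarkStrategy.noWatermarks(), "hybrid-source");
```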

[Slide 6]

Another important enhancement is the support for watermark alignment.
Regardless of how many sources a job has, it is common for individual source instances to read data at different speeds. If the job reads from a single source such as Kafka, one partition might significantly lag behind the others due to network issues. If the job reads from two independent sources, one source is likely to process data faster than the other because of system differences. This imbalance can impede downstream operations such as joins and aggregations, because they are triggered only when the watermarks are aligned. If one source instance lags behind, all records produced by the other source instances must be buffered in the state of the downstream operator, leading to uncontrollable growth of that operator's state.

To address this issue, we've introduced the watermark alignment feature. Within a single source, the feature is implemented using the SourceCoordinator class; if different sources are involved, the CoordinatorStore interface is used to exchange watermark information among SourceCoordinators. Each SourceOperator periodically reports its current watermark to its SourceCoordinator, which calculates the minimum watermark across all instances. If the watermark of a SourceOperator runs too far ahead of that minimum, the SourceOperator is paused until the others catch up, keeping the watermarks aligned.
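A hedged usage sketch: kafkaSource and fileSource are placeholders for any two independently defined sources (for example, the Kafka and file sources built earlier), and env is the stream execution environment. Both join the same alignment group, and a source instance whose watermark drifts more than 20 seconds ahead of the group's minimum is paused until the others catch up.

```java
WatermarkStrategy<String> alignedStrategy =
        WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                // alignment group name, max allowed drift, update interval
                .withWatermarkAlignment("alignment-group", Duration.ofSeconds(20), Duration.ofSeconds(1));

DataStream<String> kafkaStream = env.fromSource(kafkaSource, alignedStrategy, "kafka-source");
DataStream<String> fileStream = env.fromSource(fileSource, alignedStrategy, "file-source");
```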

Apache Flink Sink API

Similar to the Source API, the Sink API has undergone several iterations. Its predecessors, the OutputFormat and SinkFunction interfaces, will be deprecated in the 2.0 release. To cater to different business requirements, we've released two versions of the Sink API; today I'm going to focus on version 2, the SinkV2 API, and for simplicity I'll refer to it as the Sink API throughout this presentation. In essence, the Sink API is a factory interface that helps you build a sink topology; unlike the Source API, it does not prescribe a master-slave architecture. The core component of a sink topology is the SinkWriter, which serializes upstream data and writes it to the external system. If you want exactly-once semantics via the two-phase commit mechanism, you can pair the SinkWriter with a SinkCommitter: the SinkWriter produces a committable each time a checkpoint is triggered, and the SinkCommitter then commits it.

In a distributed job with concurrent sink tasks, the SinkCommitter buffers the committables and commits them to the external system only after the checkpoint has completed across all parallel tasks, which realizes the second phase of the two-phase commit. The combination of SinkWriter and SinkCommitter is versatile but cannot cover every use case. For example, an Iceberg or Hive sink might involve post-checkpoint operations, such as file merging. To increase adaptability, we've added three extension points: PreWrite, PreCommit, and PostCommit. You can use them to insert custom logic into the sink topology, such as wiring in additional operators with their own parallelism. If the standard combination of SinkWriter and SinkCommitter cannot meet your business requirements, we recommend that you use these custom topology components.
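As a minimal sketch of the SinkV2 building blocks (signatures as in recent 1.x releases), the at-least-once sink below implements only a SinkWriter; MyClient is a hypothetical client for the external system. For exactly-once delivery you would implement the TwoPhaseCommittingSink variant instead, whose writer returns committables from prepareCommit() and whose Committer commits them once the checkpoint completes.

```java
import java.io.IOException;

import org.apache.flink.api.connector.sink2.Sink;
import org.apache.flink.api.connector.sink2.SinkWriter;

public class MySink implements Sink<String> {

    @Override
    public SinkWriter<String> createWriter(InitContext context) {
        return new MySinkWriter();
    }

    private static class MySinkWriter implements SinkWriter<String> {

        private final MyClient client = new MyClient(); // hypothetical connection

        @Override
        public void write(String element, Context context) throws IOException {
            client.append(element); // serialize and buffer/send the record
        }

        @Override
        public void flush(boolean endOfInput) throws IOException {
            client.flush(); // called on every checkpoint and at end of input
        }

        @Override
        public void close() throws Exception {
            client.close();
        }
    }
}
```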

[Slide 8]

Sinks for different external systems often share the same basic functionality: buffering data based on user-defined parameters, sending asynchronous requests to the destination system, and retrying unsuccessful requests. To simplify development, Flink encapsulates this generic functionality in a base class called AsyncSinkBase. The class lets you configure buffering parameters, such as a size limit for buffered data and a timeout, and takes an ElementConverter that converts each record into a request entry to be sent to the destination system. You might have noticed that this base class offers at-least-once rather than exactly-once semantics; in practice, at-least-once is more cost-efficient and suits most use cases. In short, the AsyncSinkBase class makes it easier and faster to build sinks with at-least-once semantics.
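A minimal sketch of the pluggable part such a sink needs: an ElementConverter that turns each upstream record into the request entry the destination understands. PutRecordRequest is a hypothetical request type standing in for, say, a Kinesis or DynamoDB request entry, and the Context parameter type follows the SinkV2-based signature used in recent releases.

```java
import java.nio.charset.StandardCharsets;

import org.apache.flink.api.connector.sink2.SinkWriter;
import org.apache.flink.connector.base.sink.writer.ElementConverter;

public class MyElementConverter implements ElementConverter<String, PutRecordRequest> {

    @Override
    public PutRecordRequest apply(String element, SinkWriter.Context context) {
        // Serialize the record and wrap it in the destination's request format;
        // AsyncSinkBase batches these entries, sends them asynchronously, and retries failures.
        return new PutRecordRequest(element.getBytes(StandardCharsets.UTF_8));
    }
}
```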

Integration into Table/SQL API

Now that we've covered the bottom layer, let's look at how to integrate a source or sink into the Table/SQL API.

[Slide 9]

DynamicTableSource, the next layer above the Source API, is extended by the ScanTableSource and LookupTableSource interfaces. By definition, a ScanTableSource scans data in a source table and a LookupTableSource looks up data in a dimension table. As shown in the sample workflow, a ScanTableSource reads data from Kafka, a LookupTableSource performs a point query in a Redis database, and a DynamicTableSink writes the results to Hive. Essentially, DynamicTableSource, ScanTableSource, and LookupTableSource are factory interfaces or constructors. The actual heavy lifting is handled by the lower-level DataStream, Source, and Sink APIs.

[Slide 10]

Let's dive into the details of these interfaces, starting with the getChangelogMode method of the ScanTableSource interface. This method returns the set of change types the source produces, such as INSERT, UPDATE, and DELETE changes. For example, a MySQL Change Data Capture (CDC) source should return all three, whereas a Kafka source should return only INSERT. The planner uses this return value to check compatibility with downstream operators and detect potential logic errors.

Another method provided by the ScanTableSource interface is getScanRuntimeProvider. This method returns a provider based on the given context containing user-defined configurations in the CREATE TABLE statement. The provider is then used to construct the corresponding source.
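Putting those two methods together, here is a hedged sketch of a ScanTableSource that bridges to a DataStream-level source. MyRowDataSource is a hypothetical Source<RowData, ?, ?> constructed from the options the user supplied in the WITH clause; everything else uses the standard interfaces in org.apache.flink.table.connector.source.

```java
import org.apache.flink.table.connector.ChangelogMode;
import org.apache.flink.table.connector.source.DynamicTableSource;
import org.apache.flink.table.connector.source.ScanTableSource;
import org.apache.flink.table.connector.source.SourceProvider;

public class MyScanTableSource implements ScanTableSource {

    private final MyRowDataSource source; // hypothetical Source<RowData, ?, ?>

    public MyScanTableSource(MyRowDataSource source) {
        this.source = source;
    }

    @Override
    public ChangelogMode getChangelogMode() {
        // An append-only source produces INSERT changes only; a CDC source
        // would also declare update and delete changes.
        return ChangelogMode.insertOnly();
    }

    @Override
    public ScanRuntimeProvider getScanRuntimeProvider(ScanContext runtimeProviderContext) {
        // Hand the planner a provider that builds the underlying Source API object.
        return SourceProvider.of(source);
    }

    @Override
    public DynamicTableSource copy() {
        return new MyScanTableSource(source);
    }

    @Override
    public String asSummaryString() {
        return "MyScanTableSource";
    }
}
```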

[Slide 11]

Considering that the DataStream API does not support point queries, the LookupTableSource interface is introduced to construct LookupFunctions that fetch individual values from an external table.

LookupFunction is used for synchronous lookups, whereas its variant, AsyncLookupFunction, is used for asynchronous lookups. Flink 1.16 enhanced the lookup capability for dimension tables; for example, a cache was added to speed up lookups at runtime. Because its sole focus is querying specific values in an external system, the LookupTableSource interface provides only the getLookupRuntimeProvider method, which creates LookupFunctions for runtime based on user configurations.
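A hedged sketch under the same assumptions: MyRedisLookupFunction is a hypothetical subclass of LookupFunction (available since Flink 1.16) that implements lookup(RowData keyRow) against the external store, and LookupFunctionProvider wraps it for the runtime.

```java
import org.apache.flink.table.connector.source.DynamicTableSource;
import org.apache.flink.table.connector.source.LookupTableSource;
import org.apache.flink.table.connector.source.lookup.LookupFunctionProvider;

public class MyLookupTableSource implements LookupTableSource {

    @Override
    public LookupRuntimeProvider getLookupRuntimeProvider(LookupContext context) {
        // context.getKeys() tells the connector which columns are used as lookup keys.
        return LookupFunctionProvider.of(new MyRedisLookupFunction());
    }

    @Override
    public DynamicTableSource copy() {
        return new MyLookupTableSource();
    }

    @Override
    public String asSummaryString() {
        return "MyLookupTableSource";
    }
}
```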

[Slide 12]

The design of the DynamicTableSink interface is straightforward. This interface provides a method that checks the support for INSERT, UPDATE, and DELETE changes to help the planner detect logic errors, and a method to construct sinks based on user configurations.

[Slide 13]

In addition, you can easily build advanced sources by implementing the group of interfaces whose names are prefixed with Supports. For example, the SupportsFilterPushDown and SupportsProjectionPushDown interfaces enable the planner to push a filter and a projection down into a DynamicTableSource object, respectively. You can view the details of these interfaces in the source code and implement them as needed.
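For instance, here is a hedged sketch of filter pushdown layered on top of the ScanTableSource sketch above. Returning every expression as a remaining filter is always safe, because the planner re-applies remaining filters after the source; a real implementation would translate the expressions it understands into the external system's query language and report those as accepted.

```java
import java.util.Collections;
import java.util.List;

import org.apache.flink.table.connector.source.abilities.SupportsFilterPushDown;
import org.apache.flink.table.expressions.ResolvedExpression;

public class MyFilterableTableSource extends MyScanTableSource implements SupportsFilterPushDown {

    public MyFilterableTableSource(MyRowDataSource source) {
        super(source);
    }

    @Override
    public Result applyFilters(List<ResolvedExpression> filters) {
        // Accept nothing, keep everything: the planner evaluates all filters itself.
        return Result.of(Collections.emptyList(), filters);
    }
}
```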

Apache Flink Catalog API

Now, we've reached the highest level of abstraction, the Catalog API.

[Slide 14]

SQL developers often need to write tedious CREATE TABLE statements and manually define each field's name and type. When the upstream table contains hundreds of fields, developers have to map every field to a Flink data type and list them all in the column definitions. And if the table uses a complex connector, such as one for an SSL-enabled Kafka cluster, the statement grows even longer because of the many connector options in the WITH clause.

Another prevalent issue is the difficulty of reusing configurations: each table is created and configured independently, even when tables point to the same external system, and the manual mapping of data types adds further redundancy during SQL development.

The Catalog API is designed to address cumbersome table creation and configuration reuse. Essentially, a Flink catalog provides the metadata of an external system. The Catalog API sits at a higher level of abstraction than the Source and Sink APIs and deals with concepts such as databases, tables, partitions, views, and functions. Implementing a catalog is largely a matter of mapping these concepts between the external system and the catalog, and the mapping varies by system: if the external system has no partitions or functions, for example, those can simply be omitted. The Catalog API also supports persistent storage of metadata; for instance, you can use a Hive catalog to store table metadata permanently for later reuse.
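A hedged sketch of that reuse, using the Hive catalog as the persistent-metadata example: the catalog name, default database, configuration directory, and table definition are illustrative, and the Hive and Kafka connector dependencies are assumed to be on the classpath.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class CatalogExample {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Register a Hive catalog backed by the Hive Metastore and make it the current catalog.
        tEnv.registerCatalog("my_hive", new HiveCatalog("my_hive", "default", "/opt/hive-conf"));
        tEnv.useCatalog("my_hive");

        // Create the table once; its schema and connector options are persisted in the
        // metastore, so later jobs and sessions can query it without repeating them.
        tEnv.executeSql(
                "CREATE TABLE IF NOT EXISTS clicks (user_id STRING, url STRING, ts TIMESTAMP(3)) "
                        + "WITH ('connector' = 'kafka', 'topic' = 'clicks', "
                        + "'properties.bootstrap.servers' = 'broker:9092', 'format' = 'json')");

        tEnv.executeSql("SELECT * FROM clicks").print();
    }
}
```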

[Slide 15]

Another benefit of the Catalog API is the centralized management of various external systems, which significantly reduces the configuration burden of SQL users. For example, some systems use a database schema and others may not, and some define data containers as databases and others as namespaces. Despite these differences, the Catalog API can map the concepts to Flink's abstractions. This enables users to focus on the Flink side. They only need to write a simple SELECT statement to access the converted table data.

[Slide 16]

Mapping a MySQL database or a similar system to Flink is straightforward because they share the same set of concepts. The question is how to deal with a storage system for unstructured data. To illustrate, let's use Kafka as an example, where data is stored in a topic in JSON format and there is no upper layer such as a "database." The Kafka cluster is mapped to a catalog, which exposes a default database, infers each field's type from the JSON payload, and presents the topic's records as table rows. This way, users can query the data in a Kafka topic with a simple SELECT statement, without repeatedly configuring the Kafka cluster's connection information, which significantly reduces complexity.

The Catalog API also unlocks opportunities for advanced features, such as data lineage management. To support this feature, Flink has added catalog modification listeners, as proposed in FLIP-294. These listeners detect catalog modifications, such as table creation and deletion, and report them to a metadata platform, such as Atlas and Datahub. When a table is created in the catalog, a corresponding data node is created on the metadata platform; when the table is deleted, the node is also deleted.
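A heavily hedged sketch of such a listener, assuming the FLIP-294 interface CatalogModificationListener with a single onEvent callback: MetadataPlatformClient is a hypothetical client for a platform such as Atlas or DataHub, and a real listener would also ship a matching factory registered through configuration.

```java
import org.apache.flink.table.catalog.listener.CatalogModificationEvent;
import org.apache.flink.table.catalog.listener.CatalogModificationListener;

public class MetadataReportingListener implements CatalogModificationListener {

    private final MetadataPlatformClient client = new MetadataPlatformClient(); // hypothetical

    @Override
    public void onEvent(CatalogModificationEvent event) {
        // Forward every catalog change; a real listener would inspect the concrete
        // event type (table/database created, altered, or dropped) and create or
        // delete the corresponding data node on the metadata platform.
        client.report(event.getClass().getSimpleName(), event.toString());
    }
}
```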

[Slide 17]

However, data nodes alone are not sufficient for lineage management. As proposed in FLIP-314, Flink will add job lineage listeners in the upcoming 1.19 release to facilitate the monitoring of job status changes. For example, when a job starts, its listener can obtain the job's sources and sinks, so that connections can be established between the corresponding data nodes. This gives users a full picture of the data lineage and data flow in a Flink cluster.

[Slide 18]

Turning back to this diagram, let's recap the bottom-up approach to developing a connector. The starting point is the Source and Sink APIs at the DataStream layer. Together, they form the cornerstone of a connector. To make a connector accessible from the Table API and SQL queries, implement the DynamicTableSource and DynamicTableSink APIs, which are the respective constructors of sources and sinks. If you want to make it simpler and easier for SQL users to create tables with connectors, implement the Catalog API. By enabling direct data selection based on concept mapping, the Catalog API eliminates repetitive definitions and configurations in SQL statements and inherently provides lineage and metadata management capabilities.
