
Flink CDC 3.0 | A Next-generation Real-time Data Integration Framework

Flink CDC 3.0 is a cutting-edge framework for real-time data integration, offering an efficient, scalable CDC solution with Apache Flink.

By Qingsheng Ren and Leonard Xu

Flink CDC Overview

Flink CDC is a real-time data integration framework based on Change Data Capture (CDC) over database changelogs. It provides multiple advanced features, such as full and incremental data synchronization, lock-free reading, parallel reading, automatic synchronization of schema changes, and a distributed architecture, on top of Apache Flink's excellent pipeline capability and robust ecosystem. Moreover, the Flink CDC community has grown rapidly over the past three years, boasting 111 contributors, 8 maintainers, and a DingTalk group of over 9,800 users.
Thanks to the joint efforts of community users and developers, Flink CDC 3.0 was launched on December 7, 2023, marking the shift of Flink CDC from a set of source connectors that capture database changes for Flink to an end-to-end streaming ELT framework based on Flink. Flink CDC 3.0 supports real-time data synchronization from MySQL databases to Apache Doris and StarRocks.

Design Principles of Flink CDC 3.0

Challenges in Data Synchronization

Despite the technical advantages and thriving community of Flink CDC, users experienced the following pain points:

  • User experience: Flink CDC provides only source connectors and no end-to-end data integration; creating jobs with SQL syntax or the DataStream API is difficult.
  • Frequent maintenance: Schemas in source databases change frequently, forcing users to repeatedly create and delete tables.
  • Scalability: Large amounts of resources are required to synchronize data from thousands of tables and ingest tens of thousands of tables into data lakes or data warehouses. In addition, scaling cannot be automatically performed to handle different resource requirements for the full synchronization and incremental synchronization stages.
  • Neutrality: Flink CDC is licensed under Apache License 2.0, but its copyright belongs to Alibaba (Ververica) rather than to Apache Flink.

In response, the maintainers of the Flink CDC community reviewed the deficiencies of Flink CDC and identified the technical challenges in improving Flink CDC and the data integration technology in general.

  • Large data volume: Users' legacy databases are large in size, commonly containing over 100 TB of data.
  • Real-time processing of incremental data: The business value of incremental data is higher than that of historical data but decreases over time, which leads to high requirements on timeliness.
  • Data ordering: CDC tools must preserve the global order of change events to ensure the consistency of processed data.
  • Dynamic changes in schemas: Incremental data grows over time, leading to frequent changes in schemas.

To address these challenges, we aim to turn Flink CDC into an open-source CDC tool that can handle massive data integration scenarios.

Purpose of Flink CDC 3.0

In partnership with the maintainers of the Flink CDC community, we launched Flink CDC 3.0 based on the following design principles:

  • End-to-end experience: As an end-to-end data integration framework, Flink CDC 3.0 provides API operations for data integration to help users easily build jobs.
  • Automation: Flink CDC 3.0 can automatically synchronize schema changes from upstream to downstream, allowing users to add tables to existing jobs at any time.
  • Ultimate scalability: Idle resources can be automatically reclaimed, and a single sink instance can write to multiple tables simultaneously.
  • Donation: To attract more users, Ververica has finished the donation of Flink CDC to the Apache Software Foundation, making it an official sub-project of Apache Flink.

Design of Flink CDC 3.0

Architecture

The architecture of Flink CDC 3.0 is divided into four layers:

  • Flink CDC API: A YAML-formatted API is provided for end users to define data synchronization pipelines. Users submit these definitions through the Flink CDC CLI.
  • Flink CDC Connect: Source and sink connectors are provided to interact with external systems. Flink CDC 3.0 encapsulates the connectors of Apache Flink and Flink CDC to read data from and write data to external systems.
  • Flink CDC Composer: This layer translates data synchronization tasks into Flink DataStream jobs.
  • Flink CDC Runtime: Custom Flink operators are provided for different data synchronization scenarios to implement advanced features, such as schema changes, routing, and transformation.

[Figure: overall-structure.png]

API Design for Data Integration Scenarios

The API of Flink CDC 3.0 is tailored to data integration scenarios. Users do not need to concern themselves with the implementation details of the framework; they can create a data synchronization job simply by writing a YAML file that configures the data sources and sinks. The following figure shows a sample configuration for synchronizing data from a MySQL database to Apache Doris.
[Figure: yaml.png]
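As an illustration, a minimal pipeline definition might look like the sketch below. Hostnames, credentials, and table names are placeholders, and the exact option names should be checked against the Flink CDC documentation:

```yaml
# Hypothetical pipeline definition: sync the tables of app_db from MySQL to Doris.
source:
  type: mysql
  hostname: localhost
  port: 3306
  username: root
  password: "123456"
  tables: app_db.\.*          # regular expression matching the tables to capture

sink:
  type: doris
  fenodes: 127.0.0.1:8030
  username: root
  password: ""

pipeline:
  name: Sync MySQL Database to Doris
  parallelism: 2
```

A file like this could then be submitted with the Flink CDC CLI, for example `bash bin/flink-cdc.sh mysql-to-doris.yaml`.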

Pipeline Connector API

To facilitate the integration of external systems into a data synchronization pipeline, Flink CDC 3.0 introduced the Pipeline Connector API.
[Figure: pipeline-api.png]

  • DataSource: DataSource is used to collect change events from external systems and pass them to downstream operators. It is composed of EventSourceProvider and MetadataAccessor. EventSourceProvider builds Flink sources, whereas MetadataAccessor accesses metadata.
  • DataSink: DataSink is used to apply schema changes received from upstream operators and write the changed data to external systems. It is composed of EventSinkProvider and MetadataApplier. EventSinkProvider builds Flink sinks, whereas MetadataApplier applies metadata changes (such as table schema changes) to the destination system.

To ensure compatibility with the Flink ecosystem, the design of DataSource and DataSink uses the same logic as Apache Flink. Developers can easily integrate external systems with Flink CDC 3.0 by using Flink connectors. For Flink CDC 3.1, the community plans to integrate more external systems, such as Apache Paimon, Apache Iceberg, Apache Kafka, and MongoDB, to expand the ecosystem and use cases of Flink CDC.
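The shape of the two interfaces described above can be sketched as follows. This is a simplified, hypothetical rendering for illustration only; the real interfaces in the Flink CDC codebase carry additional methods and type parameters:

```java
// Simplified sketch of the Pipeline Connector API shape (illustrative, not the
// actual Flink CDC interface definitions).

// Builds the underlying Flink source that emits change events.
interface EventSourceProvider {}

// Reads schemas and other metadata from the external system.
interface MetadataAccessor {}

// Builds the underlying Flink sink that writes change events.
interface EventSinkProvider {}

// Applies metadata changes (e.g., table schema changes) to the destination system.
interface MetadataApplier {}

// A DataSource collects change events and exposes source-side metadata access.
interface DataSource {
    EventSourceProvider getEventSourceProvider();
    MetadataAccessor getMetadataAccessor();
}

// A DataSink writes changed data and applies schema changes downstream.
interface DataSink {
    EventSinkProvider getEventSinkProvider();
    MetadataApplier getMetadataApplier();
}
```

Splitting data flow (EventSourceProvider/EventSinkProvider) from metadata handling (MetadataAccessor/MetadataApplier) is what lets the framework propagate schema changes independently of the record stream.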

Core Features of Flink CDC 3.0

To achieve high performance in scenarios such as schema changes, full database synchronization, and table merging, Flink CDC 3.0 integrates the capabilities of Apache Flink and provides multiple custom Flink operators to support various synchronization modes.

Schema Evolution

Schema evolution is a common but challenging feature of data synchronization frameworks. Flink CDC 3.0 introduces SchemaRegistry, which coordinates the job, and SchemaOperator, which is embedded in the job topology, to manage schema changes.
Here's how Flink CDC 3.0 handles schema changes:

  • When a schema change is detected in a data source, SchemaRegistry issues a pause request to SchemaOperator. After receiving the request, SchemaOperator pauses stream ingestion and flushes the data pipeline to maintain schema consistency.
  • Once the schema change is synchronized to the external system, SchemaRegistry issues a resume request to SchemaOperator. After receiving the request, SchemaOperator resumes stream ingestion.

[Figure: schema-evolution.png]

Full Database Synchronization

Users can specify a multi-table or full-database synchronization task by configuring DataSource in the Flink CDC 3.0 configuration file. The schema evolution feature enables automatic synchronization for the entire database: when new tables are detected, SchemaRegistry automatically creates the corresponding tables in the destination system.
[Figure: full-db-sync.png]
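In a pipeline definition, full-database capture can be expressed by matching all tables with a regular expression in the source configuration. The fragment below is a sketch with placeholder values:

```yaml
# Hypothetical source fragment: capture every table in app_db.
source:
  type: mysql
  hostname: localhost
  port: 3306
  username: root
  password: "123456"
  tables: app_db.\.*   # regular expression matching all tables in the database
```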

Table Merging

Another common use case of Flink CDC 3.0 is merging multiple source tables into a single sink table. Flink CDC 3.0 employs a Route mechanism to implement table merging and synchronization. Users can define routing rules in the configuration file of Flink CDC 3.0 by using regular expressions to specify the source tables and the sink table.
[Figure: table-merging.png]
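A routing rule for table merging might be sketched as follows, assuming `source-table` and `sink-table` option names; table names are placeholders:

```yaml
# Hypothetical route fragment: merge the sharded tables app_db.user_01,
# app_db.user_02, ... into the single sink table ods_db.ods_users.
route:
  - source-table: app_db.user_\.*
    sink-table: ods_db.ods_users
```

Because the source side of a rule is a regular expression, one rule can fan many sharded source tables into a single destination table.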

High-performance Data Structure

To reduce serialization overhead during data transmission, Flink CDC 3.0 adopts a high-performance data structure.

  • Schemaless deserialization: Schemaless deserialization decouples schema information from changed data. Before sending changed data, DataSource sends the schema description, which is tracked by the framework. This way, schema information does not need to be bound to each changed record, and the serialization cost for wide tables is significantly reduced.
  • Binary storage format: Data is stored in a binary format during synchronization. Deserialization is performed only when the detailed value of a field must be read (for example, when records are partitioned by primary key), which reduces serialization costs.

[Figure: high-performance-ds.png]

In addition to fundamental data synchronization capabilities, Flink CDC 3.0 provides multiple advanced features, such as automatic synchronization of schema changes, full database synchronization, and table merging and synchronization, to cater to complex data integration scenarios. The automatic synchronization of schema changes frees users from manual intervention when schema changes occur in a data source, greatly reducing operational costs. Moreover, only a few operations are needed to configure a multi-table or multi-database synchronization task, simplifying development.

Quick Start for Flink CDC 3.0

The Flink CDC community provides detailed product documentation that describes the architecture and terms of Flink CDC 3.0, quick-start tutorials on how to sync data from MySQL databases to Apache Doris or StarRocks, and Flink CDC 3.0 demos.

Acknowledgement

We are grateful to Flink CDC users for their trust and feedback, as well as to the Apache Doris and StarRocks communities and developers whose generous support helped materialize Flink CDC 3.0. Contributors are listed below in alphabetical order.
BIN, Dian Qi, EchoLee5, FlechazoW, FocusComputing, Hang Ruan, He Wang, Hongshun Wang, Jiabao Sun, Josh Mahonin, Kunni, Leonard Xu, Maciej Bryński, Malcolmjian, North.Lin, Paddy Gu, PengFei Li, Qingsheng Ren, Shawn Huang, Simonas Gelazevicius, Sting, Tyrantlucifer, TJX2014, Xin Gong, baxinyu, chenlei677, e-mhui, empcl, gongzhongqiang, gaotingkai, ice, joyCurry30, l568288g, lvyanquan, pgm-rookie, rookiegao, skylines, syyfffy, van, wudi, wuzhenhua, yunqingmo, yuxiqian, zhaomin, zyi728, and zhangjian
