By Qingsheng Ren and Leonard Xu
Flink CDC is a real-time data integration framework based on the Change Data Capture (CDC) technology of database changelogs. It provides multiple advanced features, such as full and incremental data synchronization, lock-free reading, parallel reading, automatic synchronization of schema changes, and distributed architecture, on top of Apache Flink's excellent pipeline capability and robust ecosystem. Moreover, the Flink CDC community has grown rapidly over the past three years, boasting 111 contributors, 8 maintainers, and a DingTalk group of over 9,800 users.
Thanks to the joint efforts of community users and developers, Flink CDC 3.0 was launched on December 7, 2023, marking the shift of Flink CDC from a connector that captures changes in Flink data sources to an end-to-end streaming ELT framework based on Flink. Flink CDC 3.0 supports real-time data synchronization from MySQL databases to Apache Doris and StarRocks.
Despite the technical advantages and thriving community of Flink CDC, users still experienced a number of pain points in practice.
In response, the maintainers of the Flink CDC community reviewed the deficiencies of Flink CDC and identified the technical challenges in improving Flink CDC and the data integration technology in general.
To address these challenges, we aim to turn Flink CDC into an open-source CDC tool that can handle massive data integration scenarios.
In partnership with the maintainers of the Flink CDC community, we launched Flink CDC 3.0 based on a clear set of design principles.
The architecture of Flink CDC 3.0 is divided into four layers.
The API operations of Flink CDC 3.0 are tailored to data integration scenarios. Users do not need to concern themselves with the implementation details of the framework. They can easily create a data synchronization job by writing a YAML-formatted API file where data sources and sinks are configured. The following figure shows a sample code for synchronizing data from a MySQL database to Apache Doris.
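The original figure is not reproduced here, but a minimal pipeline definition of this kind might look as follows. This is a sketch only: the hostnames, credentials, and table names are placeholder assumptions, not values taken from the sample in the figure.

```yaml
################################################################################
# Sketch of a Flink CDC 3.0 YAML pipeline definition that synchronizes a
# MySQL table to Apache Doris. All connection details are placeholders.
################################################################################
source:
  type: mysql                # capture changes with the MySQL CDC source
  hostname: localhost
  port: 3306
  username: root
  password: "123456"
  tables: app_db.orders      # source table(s) to capture

sink:
  type: doris                # write change events to Apache Doris
  fenodes: 127.0.0.1:8030
  username: root
  password: ""

pipeline:
  name: Sync MySQL to Doris
  parallelism: 2             # parallelism of the synchronization job
```

With a definition like this, the user never touches Flink's DataStream or Table API directly; the framework translates the YAML file into a Flink job.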
To facilitate the integration of external systems into a data synchronization pipeline, Flink CDC 3.0 introduced the Pipeline Connector API.
To ensure compatibility with the Flink ecosystem, the design of DataSource and DataSink uses the same logic as Apache Flink. Developers can easily integrate external systems with Flink CDC 3.0 by using Flink connectors. For Flink CDC 3.1, the community plans to integrate more external systems, such as Apache Paimon, Apache Iceberg, Apache Kafka, and MongoDB, to expand the ecosystem and use cases of Flink CDC.
To achieve high performance in scenarios such as schema changes, full database synchronization, and table merging, Flink CDC 3.0 integrates the capabilities of Apache Flink and provides multiple custom Flink operators to support various synchronization modes.
Schema evolution is a common but challenging feature of data synchronization frameworks. Flink CDC 3.0 introduces a SchemaRegistry in the job coordinator to manage table schemas, together with a SchemaOperator in the job topology that coordinates schema changes with the registry.
Here's how Flink CDC 3.0 handles schema changes: when SchemaOperator encounters a schema change event in the stream, it pauses the data flow and reports the change to SchemaRegistry. SchemaRegistry waits for in-flight data to be flushed to the sink, applies the schema change to the destination system, and then allows the stream to resume.
Full Database Synchronization
Users can specify a multi-table or full database synchronization task by configuring DataSource in the configuration file of Flink CDC 3.0. The schema evolution feature enables automatic synchronization for the entire database. When new tables are detected, SchemaRegistry automatically creates the corresponding tables in the destination system.
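As a sketch (the database name and connection details are illustrative assumptions), capturing an entire database only requires widening the `tables` pattern in the source definition to a regular expression:

```yaml
source:
  type: mysql
  hostname: localhost
  port: 3306
  username: root
  password: "123456"
  # Match every table in app_db. Tables created later that match this
  # pattern are detected and created in the sink automatically.
  tables: app_db.\.*
```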
Another common use case of Flink CDC 3.0 is merging multiple source tables into a single sink table. Flink CDC 3.0 employs a Route mechanism to implement table merging and synchronization. Users can define routing rules in the configuration file of Flink CDC 3.0 by using regular expressions to specify the source tables and the sink table.
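A routing rule of this kind might be sketched as follows; the sharded source tables and the sink table name are hypothetical:

```yaml
route:
  # Merge all sharded order tables (e.g. orders_01, orders_02, ...) that
  # match the regular expression into a single sink table.
  - source-table: app_db.orders_\.*
    sink-table: ods_db.ods_orders
    description: merge sharded order tables into one sink table
```

Because the rule is declarative, adding a new shard on the source side requires no change to the job as long as the new table name matches the expression.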
To reduce serialization overhead during data transmission, Flink CDC 3.0 adopts a compact binary data structure: a table's schema is sent downstream only once, when the table is first encountered or when its schema changes, so individual data change events do not carry schema information and can be serialized and deserialized cheaply.
In addition to fundamental data synchronization capabilities, Flink CDC 3.0 provides multiple advanced features, such as automatic synchronization of schema changes, full database synchronization, and table merging and synchronization, to cater to complex data integration scenarios. The automatic synchronization of schema changes frees users from manual intervention when schema changes occur in a data source, greatly reducing operational costs. Moreover, only a few configuration steps are needed to set up a multi-table or multi-database synchronization task, simplifying development.
The Flink CDC community provides detailed product documentation that describes the architecture and terms of Flink CDC 3.0, quick-start tutorials on how to sync data from MySQL databases to Apache Doris or StarRocks, and Flink CDC 3.0 demos.
We are grateful to Flink CDC users for their trust and feedback, as well as to the Apache Doris and StarRocks communities and developers whose generous support helped materialize Flink CDC 3.0. Contributors are listed below in alphabetical order.
BIN, Dian Qi, EchoLee5, FlechazoW, FocusComputing, Hang Ruan, He Wang, Hongshun Wang, Jiabao Sun, Josh Mahonin, Kunni, Leonard Xu, Maciej Bryński, Malcolmjian, North.Lin, Paddy Gu, PengFei Li, Qingsheng Ren, Shawn Huang, Simonas Gelazevicius, Sting, Tyrantlucifer, TJX2014, Xin Gong, baxinyu, chenlei677, e-mhui, empcl, gongzhongqiang, gaotingkai, ice, joyCurry30, l568288g, lvyanquan, pgm-rookie, rookiegao, skylines, syyfffy, van, wudi, wuzhenhua, yunqingmo, yuxiqian, zhaomin, zyi728, and zhangjian