Abstract: This article is based on a keynote speech given by Jark Wu, head of Flink SQL and Flink CDC at Alibaba Cloud, at Flink Forward Asia 2023. In the speech, Wu reviewed the development of the Flink CDC community over 2023, discussed the new positioning and architectural design of Flink CDC 3.0, and unveiled the next step in the open source journey of Flink CDC.
Flink CDC is a series of database connectors built on Apache Flink, a stream processing engine that traditionally handles event data such as click and impression logs. In real business scenarios, however, click logs are not everything: massive value lies hidden in users' business databases waiting to be discovered. Flink CDC addresses this gap by allowing users to read data change logs from databases as streams.
Flink CDC supports more than a dozen data sources, such as MySQL, PostgreSQL, PolarDB, OceanBase, TiDB, and MongoDB. It can also work seamlessly with Flink SQL. This combination lets users perform various tasks simply with standard SQL syntax: developing sophisticated applications, synchronizing data to data warehouses and data lakes in real time, building materialized views on databases for data analytics, and implementing complex mechanisms to process change logs, such as auditing.
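As a sketch of what this combination looks like in practice, the following Flink SQL declares a CDC source table on MySQL and streams it into a downstream table with plain SQL. The table names, columns, and connection details here are illustrative assumptions, not taken from the article:

```sql
-- Declare a change-log source over a MySQL table (details illustrative).
CREATE TABLE orders (
  order_id INT,
  amount DECIMAL(10, 2),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'flink',
  'password' = '******',
  'database-name' = 'shop',
  'table-name' = 'orders'
);

-- Continuously replicate the change log into a sink table
-- (dwh_orders is an assumed, previously declared sink).
INSERT INTO dwh_orders SELECT order_id, amount FROM orders;
```

Once the `INSERT INTO` job is running, inserts, updates, and deletes on the source table flow to the sink as a change-log stream, with no custom code.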
Flink CDC made remarkable progress in 2023.
1. Ecosystem
CDC connectors for IBM Db2 and Vitess were released. Moreover, incremental snapshot reading was added to the Oracle, SQL Server, MongoDB, and PostgreSQL connectors, allowing users to read data from these databases in a lock-free, parallel manner, which yields better performance and higher stability.
2. Engine capabilities
Various advanced features were added to the engine.
For one, Flink CDC now supports dynamic addition of tables, which lets users add tables to an existing data synchronization job. The added tables are synchronized in parallel without interrupting the running job.
For another, Flink CDC now supports automatic scale-in. Flink CDC performs a synchronization job in two stages: full synchronization of historical data and incremental synchronization of change data. During the full stage, the historical data is large in volume, so high parallelism is needed to sustain throughput. In the incremental stage, however, a single task often suffices. Previously, users who wanted to save resources had to adjust the parallelism manually by restarting the job. Now, automatic scale-in is an out-of-the-box feature: the system automatically reduces the parallelism and reclaims idle resources to help users cut costs.
What's more, Flink CDC now splits a table into chunks asynchronously, pipelining chunk splitting with chunk reading. This dramatically improves read performance when very large tables are involved.
Also, starting from a specified offset is now supported. Users can specify a binary log offset or a timestamp from which the system starts reading, eliminating the need for a full initial read.
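As an illustration, the MySQL connector exposes such startup options in the `WITH` clause of the table declaration. The table definition and binlog position below are an assumed sketch; consult the connector documentation of your Flink CDC version for the exact option names:

```sql
-- Resume reading from a known binlog position rather than a full snapshot
-- (connection details and binlog coordinates are illustrative).
CREATE TABLE orders_cdc (
  order_id INT,
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',
  'port' = '3306',
  'username' = 'flink',
  'password' = '******',
  'database-name' = 'shop',
  'table-name' = 'orders',
  -- skip the full read and start from a specific binlog offset
  'scan.startup.mode' = 'specific-offset',
  'scan.startup.specific-offset.file' = 'mysql-bin.000003',
  'scan.startup.specific-offset.pos' = '4'
);
```

Setting `'scan.startup.mode' = 'timestamp'` together with a timestamp option achieves the timestamp-based variant described above.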
An at-least-once reading mode is also supported, which delivers higher efficiency than exactly-once in scenarios where duplicates are acceptable, such as idempotent downstream writes.
Last but not least, Flink CDC now supports Apache Flink versions from 1.14 to the latest 1.18.
The fast development of Flink CDC's ecosystem and engine capabilities is largely owed to its vibrant community. Over the past year, the Flink CDC community attracted 70% more contributors, 58% more commits, and 44% more stars, bringing the total star count past 4,600.
The community wouldn't have been what it is without the 100+ open source contributors and all our users that leverage Flink CDC in their projects.
Among the contributions in Flink CDC, nearly half were made by people outside Alibaba, such as members of GigaCloud, XTransfer, OceanBase, Tencent, and DiDi. We are glad to have such a diverse ecosystem.
Flink CDC has been adopted across a wide range of industries, including finance, logistics, gaming, the Internet, and cloud services.
It has been more than three years since the first release of Flink CDC, which was initially designed as a set of connectors for different data sources. As it gained wide adoption, however, the capabilities centered around this definition were no longer enough for our customers' scenarios, so we needed to rethink the positioning of Flink CDC.
Guest speakers from our customers shared how they use Flink CDC for data integration, which is indeed the most common application of the tool. However, because Flink CDC is merely a series of connectors, customers still face heavy work to fit it into their business, and its functional limitations hamper the user experience.
While we work to let Flink CDC connect more data sources, such as databases, message queues, data lakes, files, and SaaS offerings, we want it to do more than just connect. We expect it to be an end-to-end solution for real-time data integration, covering the entire path from data sources through pipelines to destinations.
Thanks to the community's hard work, I proudly announce Flink CDC 3.0, a real-time data integration framework and another milestone in the development of Flink CDC. Built upon Apache Flink and fully open source, Flink CDC 3.0 offers features such as real-time synchronization, whole-database synchronization, shard merging and synchronization, schema change synchronization, automatic schema inference, dynamic addition of tables, and automatic scale-in. It already supports synchronization from MySQL to StarRocks and Apache Doris, and future versions will add more data sources, including Paimon, Kafka, and MongoDB. Flink CDC 3.0 also provides an API that combines YAML files with a CLI tool, making it much easier for users to develop real-time data integration jobs.
To create a synchronization job from MySQL to Doris, a user simply specifies in a YAML file the connection information of the source MySQL database and the destination Doris database, along with the tables to be synchronized. For example, to synchronize all the tables in a MySQL database, I submit the YAML file through a shell script shipped with Flink CDC to create the database synchronization job. The job then automatically creates the table schemas in Doris and transmits both the existing data and incremental changes. Schema changes in the MySQL database are also synchronized in real time.
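A minimal sketch of such a pipeline definition is shown below. The host names, credentials, and database name are illustrative, and the exact option names may differ across Flink CDC 3.x releases:

```yaml
# mysql-to-doris.yaml -- synchronize every table in app_db to Doris
source:
  type: mysql
  hostname: localhost
  port: 3306
  username: flink
  password: "******"
  tables: app_db.\.*      # regular expression: all tables in app_db

sink:
  type: doris
  fenodes: 127.0.0.1:8030
  username: root
  password: ""

pipeline:
  name: Sync app_db to Doris
  parallelism: 2
```

The job would then be submitted with the bundled CLI script, for example `bash bin/flink-cdc.sh mysql-to-doris.yaml` (the script name here is assumed from the article's description of the CLI tool).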
In this process, all the underlying complexity is handled by the engine, leaving users with only the simplest configuration. You can start using Flink CDC 3.0 via the download link; your feedback is welcome.
Flink CDC started as a personal side project in 2020 and grew into an open-source project in the Ververica repository. Flink CDC 2.0 introduced the incremental snapshot feature, enabling large-scale production use, and the community expanded significantly. Today, the release of version 3.0 marks another milestone in its technical journey.
After reviewing the technological development of Flink CDC, we also want to share our plan to further open-source it. The progress of Flink CDC wouldn't have been possible without our community, so we are more than ready to give back. Here, I'm glad to announce that Flink CDC has been donated to the Apache Software Foundation as a sub-project of Apache Flink. You can learn more from the discussion in the community.
*Change Data Capture (CDC): a technology commonly used to capture the data changes in databases.