StarRocks x Flink CDC for End-to-End Real-Time Links

This article discusses real-time data warehouse construction and offers examples of using Flink CDC and StarRocks for real-time links and data updates.

By Tianyi Wang (Architect of StarRocks Solutions) and Kang Zhou (Team OLAP of Alibaba Cloud Open-Source Big Data)

Background of Real-Time Data Warehouse Construction

Requirements of Real-Time Data Warehouse

The rapid development of the Internet industry is causing the types of businesses and enterprises to increase and the amount of data to grow. The data kanban service with the Apache Hadoop ecosystem as the core can only implement offline services. In some areas, the ability to process data in real-time has become one of the bottlenecks limiting enterprise data realization. Building a fast-paced data kanban for data analysis has become an inevitable choice.

Development of Real-Time Data Warehouse

Real-time data warehouses have three well-known watersheds. The first one is building from scratch. The emergence of Apache Storm breaks the single computing method of MapReduce, enabling businesses to process T+0 data. The second watershed is from emergence to completeness. The emergence of Lambda and Kappa architecture has taken offline data warehouses one step towards real-time data warehouses. The evolution of Lambda architecture to Kappa architecture closely integrates the offline data warehouse model and the real-time data warehouse model. The third watershed is from complex to simple. Implementing the Flink technology stack has simplified the real-time data warehouse architecture, which is recognized as the best solution for stream-batch integration.

The real-time data warehouse implemented by using Flink as a real-time computing engine transfers some complex computing to the OLAP analysis engine, making the analysis requirements at the application layer flexible. However, the exclusion of data warehouse changing its data cannot be altered. The next-generation real-time data warehouse platform needs to provide better performance and needs complete functions to match different services.

As a platform-wide MPP architecture, StarRocks provides various methods of performance optimization and flexible modeling. The ultimate performance experience can be obtained on a variety of models (such as pre-aggregation, wide tables, and star/snowflake). Integrating StarRocks and Flink to build an open-source real-time data warehouse allows users to synchronize data in seconds and analyze and query quickly. Scenarios (such as real-time and frequent updates) can be supported better with the StarRocks primary key model.

Pain Points of Flink-Based Open-Source Real-Time Data Warehouses

In the original solution of building and implementing data ware houses based on Flink, different collection tools (such as Flume, Canal, and Logstash) are required due to the diversity of data sources. We use different analysis engines for different businesses. For example, for a fixed report business, fact and dimension tables can be flattened into wide tables in advance with known query statements, making full use of the powerful single-table query capability of ClickHouse. Apache Druid can be used for highly concurrent query requests to withstand the concurrency pressure caused by the centralized use of a large number of users during peak hours. The stacking of technology stacks can meet business requirements but can also make the analysis layer bloated, increasing the cost of development and O&M.

In general, the StarRocks x Flink-based open-source real-time data warehouse ecosystem is built with five layers:

Data Sources: Data sources can be various (such as MySQL Binlog, crawler data, or flat files).
Data Collection: Users pull incremental data from the upstream with multiple CDC tools (such as Canal and Debezium), write the data to Kafka, and consume the data in Kafka through Flink.
Real-Time Computing: Lightweight ETL tasks (such as widening tables or data cleansing) can be completed with the real-time computing capabilities of Flink.
Data Storage: Flink relies more on OLAP engines than other real-time technology stacks.
Running the Backend Application: It can be a real-time monitoring system, reporting system, recommendation system, and data operation service.

Fast data ingestion and short latency matter the most for the real-time data warehouse system built with Flink as the computing engine. In such an architecture, data is collected from the data sources to the OLAP analysis system through the tool, message queue, and real-time computing layer. Long links bring great risks to development and O&M. Blocking any module will affect real-time performance. Meanwhile, we choose different storage engines to adapt to different businesses at the data storage layer. We are facing many challenges from the data links above. We need to do more exploration in terms of timeliness, functionality, and maintainability. Thus, we can summarize that many aspects need to be optimized.

CDC components are not unified, and links are too long. Bottlenecks in any component will affect timelines. Too many components require multi-department collaborative maintenance, with double learning and maintenance costs.
Some synchronization components (such as Debezium) need to lock the table being read when ensuring data consistency, which may affect business updates.
The analysis layer adapts to different business types with many data storage products since no one product can adapt to most businesses.
The corresponding logic for deduplication operations is complex. MapStat logic needs to be added to Flink.

Flink CDC for End-to-End Real-Time Links

Flink CDC is a component developed by the Flink community that integrates data collection, conversion, and loading. It can directly read full or incremental data from data sources (such as MySQL, PostgreSQL, and Oracle) and write them to downstream OLAP data storage systems. With Flink CDC, upstream data changes can be simply and efficiently captured and synchronized to downstream OLAP data warehouses.

Construction of an Integrated Data Transmission Link

Data collection tools are indispensable in the construction of traditional real-time data warehouses. Due to the inconsistency of upstream data sources, we may access different synchronization and collection tools at the data collection layer. For example, when collecting data from Oracle, we usually choose GoldenGate, but we may choose Canal or Debezium for MySQL. Some collection tools support full data synchronization, and some support incremental data synchronization. After the data passes through the collection layer, it is transmitted to message queues (such as Kafka) where the incremental data is consumed by Flink. Then, the data is written to the downstream OLAP data warehouse or data lake.

In business development, upstream data sources, message middleware, Flink, and downstream analytical data warehouses are maintained in different departments. When business changes or faults are debugging, multiple departments may be required to work together, increasing the difficulty of business development and testing. Replace the data collection components and message queues in the preceding figure with Flink CDC. The collection components and message queues in the dashed box are merged into the computing layer Flink, which simplifies analysis links and reduces maintenance costs. At the same time, fewer components mean fewer failures and transmission bottlenecks. Thus, data effectiveness will be improved.

With Flink CDC, components in the data trace become fewer, the architecture becomes clear and simple, and maintenance becomes convenient. As in the preceding example, Flink CDC is used to pull incremental data from MySQL. Flink SQL is used to create a MySQL CDC table of facts and dimensions, and a widening operation is performed in Flink to write the results to downstream StarRocks. The whole process of grabbing, converting, and loading can be completed through a Flink CDC job.

Synchronization of Full + Incremental Data

We divide synchronization into two stages in the traditional data synchronization framework:

Full Data Synchronization Stage: A full data synchronization tool (such as DataX or sqoop) is used to synchronize tables at the snapshot level.
Incremental Data Synchronization: An incremental synchronization tool (such as Canal or GoldenGate) is used to synchronize incremental data after a snapshot is pulled in real-time.

We can choose the multi-thread import mode during full data synchronization to speed up the import. When full data synchronization is performed under the multi-thread model, after data is split, data synchronization is completed by starting multiple concurrent tasks. Multiple concurrent services may not belong to the same read transaction and have a certain time interval. Therefore, data consistency cannot be strictly guaranteed. We have two solutions to ensure the consistency of data and balance from the perspective of engineering and technical implementation.

Stop writing data to ensure the static nature of snapshot data by locking tables. However, this will affect online business.
Data is no longer sliced in single-thread synchronization mode. However, import performance cannot be guaranteed.

With Flink CDC, full and incremental data can be synchronized. Debezium is used as an underlying collection tool in Flink CDC version 1.x. Databases or tables need to be locked to ensure data consistency during full data reading. Flink 2.0 introduces the Chunk splitting algorithm to solve this problem by ensuring lock-free data reading. The splitting algorithm of Chunk is similar to the principle of database and table sharding. Data sharding is performed using the primary key of a table.

After chunk data is sharded, each chunk is only responsible for the data within its primary key range with the condition of ensuring read consistency of each chunk. This is the basic principle of the lock-free algorithm.

StarRocks: A New Solution for Real-Time Data Updates

StarRocks is a fast and full-scenario MPP enterprise-class data warehouse product with the capability of horizontal online expansion and high availability at the financial level. It is compatible with the MySQL protocol and MySQL ecosystem and provides important features (such as a comprehensive vectorization engine and federated query of various data sources). As an analytical data warehouse with MPP architecture, StarRocks can support PB-level data volume and has flexible modeling methods. It can build a fast and unified analysis layer data storage system through optimization methods (such as materialized views, bitmap indexes, and sparse indexes).

StarRocks launched its primary key model in version 1.19. Compared with the updated model, the primary key model can support real-time and frequent update scenarios better. The primary key model requires a table to have a unique primary key (the primary key in a traditional data warehouse). It supports updating and deleting rows in a table by primary key.

Optimization of the Primary Key Model for Real-Time Data Changes

Variable data is often undesirable in OLAP data warehouses. We use batch updates to handle large amounts of data changes in traditional data warehouses. We have two ways to deal with data changes.

Insert the modified data into the new partition and complete the data change through partition exchange.
Some OLAP data warehouse products provide update functions based on the Merge on Read model to complete data changes.

Partition Exchange Data Update Mode

We can delete the original partition of most OLAP data warehouse products by manipulating partitions and replacing them with new partitions, so the change operations on a large amount of data are realized. Generally speaking, there are three steps:

Create a new partition table and store the new data in a new table according to business changes
Uninstall and delete the original partition
Mount the partitions in the new table into the target table

Swapping partitions to achieve large-scale data changes is a heavy operation and is suitable for low-frequency batch data updates. Due to changes in table definitions, developers cannot independently complete data changes through this solution.

Merge on Read Data Update Mode

Some OLAP data warehouses provide Merge on Read-based data change models. For example, ClickHouse provides the MergeTree engine, which can complete asynchronous updates but fails to synchronize data in real-time. After we specify the FINAL keyword, ClickHouse merges the results before they are returned, which allows us to synchronize data updates in quasi-real-time. However, due to the high cost of the FINAL operation, it is not enough to support the frequent dimension update requirements brought by real-time data warehouses. Even in low-frequency update scenarios, the ClickHouse Merge Tree solution cannot be copied to other storage systems.

StarRocks provides an updated model (Unique Key model) similar to ClickHouse Merge Tree, which updates data by the Merge on Read model. In the updated model, StarRocks assigns a version number to each batch of imported data. One primary key may have multiple versions that merge during the query and return the record of the latest version.

Merge on Read mode is simple and efficient in writing but consumes a large number of resources in version merging during reading. Due to the merge operator, predicates cannot be pushed down, and indexes cannot be used, seriously affecting the performance of queries. StarRocks provides a primary key model based on the Delete and Insert mode, which avoids the problem that operators cannot be pushed down due to version merging. The primary key model is suitable for scenarios where data needs to be updated in real-time. It can solve row-level update operations and support the TPS of million traffic. It is suitable for scenarios where MySQL or other business libraries are synchronized to StarRocks.

We selected some queries for comparison in the TPCH standard test set. Compared with the Unique Key model based on the Merge on Read, the performance of the Delete and Insert mode-based primary key model is improved.

Support of Primary Key Model for Deduplication Operations

Eliminating duplicate data is a common challenge in real business. The deletion of deduplication helps reduce the capacity consumed by storage in a data warehouse. In some specific scenarios, duplicate data is unacceptable. For example, duplicate operations are performed according to user IDs in business scenarios (such as customer group selection and precision marketing) to avoid repeatedly pushing marketing information. In traditional offline computing, the distinct function can be used to complete deduplication. In real-time computing, deduplication is an incremental and long-term process, and we can deduplicate by adding MapState logic in Flink. However, in most cases, MapStat can only ensure data deduplication within a certain time window. It is difficult to deduplicate incremental data with historical data in the OLAP database. As the time window increases, deduplication in Flink occupies a large number of memory resources and makes the computing layer bloated.

The primary key model requires a table to have a unique primary key and rows in the table can be updated and deleted based on the primary key. The uniqueness of the primary key is highly matched with the requirements of deduplication. The primary key model has completed deduplication during data import, avoiding the resource consumption caused by manual deduplication. We can select an appropriate deduplication column as the primary key by dismantling the business logic and completing the requirement of data deduplication according to the unique primary key in the Delete and Insert way during data import. Compared with the implementation of deduplication in Flink, the StarRocks primary key model can save a large number of hardware resources, which is simpler to operate. Also, the deduplication of incremental and historical data can be performed.

Optimization of Primary Key Model for Data Changes in Wide Tables

In fixed report business scenarios, data is typically cleaned in Flink based on fixed queries and then flattened into wide tables. The excellent multi-dimensional analysis performance of wide tables is used to speed up the queries and simplify the data models used by analysts. However, due to the attribute that a wide table needs to be pre-aggregated, when dimension data changes, the wide table needs to be rerun to update the data. The primary key model of StarRocks can be applied to data change scenarios. The function of updating some columns meets the requirements of various businesses to partially update different fields in a wide table.

There are dozens, hundreds, or thousands of columns in the wide table model. This poses a challenge to the primary key model that updates data by UPSERT. We need to obtain all information about the changing rows before we can complete the data update of the wide table. This leads to the fact that the change will be accompanied by the previous table read. The changed data row needs to be pulled from StarRocks, and the inserted statements need to be spelled out to complete the data update. This puts query pressure on StarRocks. The partial update simplifies the UPSERT. After enabling the partial_update parameter, we can only modify some of the specified columns based on the primary key, and the original value column remains unchanged.

In the following example, we can consume data in Kafka by the Routine Load import. In the properties, we need to set "partial_update" = "true", specify that the partial column update mode is enabled, and specify the column names to be updated, namely, COLUMN(id, name):

CREATE ROUTINE LOAD routine_load_patical_update_demo on example_table  COLUMNS (id, name), COLUMNS TERMINATED BY ',' PROPERTIES (    "partial_update" = "true" ) FROM KAFKA (    "kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092",    "kafka_topic" = "my_topic",    "kafka_partitions" = "0,1,2,3",    "kafka_offsets" = "101.0.0.200" );

StarRocks x Flink CDC: Build a Fast and Unified Open-Source Real-Time Data Warehouse Platform

Flink CDC solves the problem of lengthy data links, while StarRocks provides a solution of extreme performance and integrated data storage at the OLAP analysis layer to match different business scenarios. The real-time data warehouse platform built by the integration of StarRocks and Flink CDC can reduce development and O&M costs.

StarRocks x Flink CDC: Real-Time Data Warehouse Architecture in Wide Table

With the joint solution of StarRocks and Flink CDC, we can plan real-time data warehouses into a four-tier structure.

Data source layer and real-time application layer are the same as the original architecture without adjustment.
Data Transmission and Computing Layer: Data collection layer, message queues, and fact computing layer are placed in Flink CDC, which simplifies the data link and reduces development and O&M costs.
Data Analysis and Storage Layer: As the data storage engine of the analysis layer, StarRocks can provide different data models to support different types of businesses, which simplifies the complex technology stack of data storage at the analysis layer.

In scenarios where ETL is not complex, we can implement most ETL operations in Flink. The business model is simple in some scenarios. Data of facts and dimensions are flattened into a wide table with the multi-stream join capability of Flink. DWD, DWS, and ADS layer models are divided in Flink. Unstructured data can also be incrementally written to Iceberg, Hudi, or Hive. Complete the integrated architecture of the lake house using the appearance function of StarRocks.

When complex business logic is introduced in the ETL process, it may occupy a large number of memory resources at the Flink computing layer. The mode of the wide table cannot cope with multi-dimensional analysis scenarios where queries are not fixed. We can choose to replace the wide table model with the star model. Then, the cleaning and modeling of data are completed in StarRocks.

StarRocks x Flink CDC: Real-Time Data Change Architecture

Analysts often explore data from different dimensions in some complex business scenarios (such as in self-service BI reports and operation analysis scenarios). The randomness and flexibility of queries require the OLAP analysis engine to have good support for performance and many modeling methods to satisfy users to pull metrics and dimensions, drill down, roll up, and associate queries on the page at near random.

StarRocks can use a more flexible star model instead of a wide table. StarRocks provides different join methods to enhance the real-time association of multiple tables, such as Boardcast Join, Shuffle Join, Bucket Join, Replica Shuffle Join, and Colocation Join. CBO selects the type of join reorder and join based on the table statistics. It provides multiple optimization methods (such as predicate pushdown, limit pushdown, and materialization delay) to accelerate queries associated with multiple tables.

Based on the real-time join capability of StarRocks, we can post ETL operations to StarRocks and complete data modeling in StarRocks through the real-time join. We can also support data changes with the Primary Key model and create slowly changing dimensions in StarRocks to implement dimension data changes.

The real-time data warehouse built by the star/snowflake model allows us to post the modeling operation of Flink at the computing layer into the StarRocks engine. In Flink, we only need to clean data at the ODS layer. Dimension and fact tables are synchronized to StarRocks through Flink CDC. In StarRocks, fact and dimension data are implemented at the ODS layer and perform aggregation by the aggregation model or materialized views. We can quickly calculate query results and ensure the business data is highly homologous and consistent at different model layers with the real-time multi-table association capability of StarRocks and multiple optimization methods (such as intelligent CBO optimizer, sparse index, and vectorization engine).

In real life, the attributes of dimensions are not static and will change slowly over time. The star model can store fact and dimension tables independently and decouple dimension data from wide tables to deal with the problem of slowly changing dimensions through the primary key model of StarRocks. Generally speaking, we have three solutions to deal with the problem of slowly changing dimensions:

Using the primary key model to directly overwrite the original dimension values based on the primary key. This approach is easy to implement but does not retain historical data, which doesn't allow us to analyze historical dimension change information.
With the detailed model, directly add dimension rows and manage different dimension attribute versions through the version column. The reason for modifying the solution before querying is to filter an appropriate dimension version based on business conditions.
With the primary key model, the version information is introduced into the primary key, and the method of directly modifying and adding dimension rows is mixed. This method is complicated, but it can deal with complex dimension changes comprehensively.

User Cases of StarRocks x Flink CDC

StarRocks and Flink CDC has significantly simplified the complexity of data links on a well-known e-commerce platform. Users can build a real-time data kanban platform through StarRocks to implement multi-dimensional data filtering, flexible funnel analysis, and flexible analysis of different dimension movements.

Difficulties and Challenges

ClickHouse was initially selected as the storage engine for the data analysis layer in the e-commerce data kanban platform. However, as the business grows, ClickHouse cannot effectively support it in some scenarios, which are mainly reflected in the following aspects:

The status of some orders changes depending on the user's order. However, generally speaking, the status of orders for more than two weeks will not change.
Some of the changed data is not suitable for storage in the form of wide tables, and some of the business requirements iterate frequently. The modeling of wide table + star can serve the business changes better.
The scaling operations of ClickHouse are complex and cannot automatically rebalance tables. A long business maintenance window is required.

The e-commerce platform has re-made technical selections to solve the problems above. After continuous comparison and stress testing, StarRocks was finally selected as the data storage engine for the analysis layer.

System Architecture

The real-time kanban business can be divided into five parts:

Data Source Layer: There are two types of data sources: event tracking logs from the web and client and order data in the business database.
Flink CDC: Flink CDC captures upstream event tracking logs and business data, cleans and converts the data in Flink CDC, and writes the data to StarRocks.
Data Storage Layer: According to business requirements, the fact data in the DWD layer is combined with the dimension data to form a wide table, which is written to the DWS layer in the form of a view. Different topic domains are divided in the ADS layer.
Data Service Layer: It consists of a data metric platform and a funnel analysis platform. It performs logical calculations based on internal metrics and funnel definitions and finally generates reports for analysts to view.
Data Middle Platform: It provides stability assurance, data asset management, and data service system around the big data analysis platform.

Selection Benefits

Data Transmission Layer: The upstream event tracking data and the incremental data from the MySQL order database can be directly pulled by Flink CDC. Compared with the link of MySQL → Canal → Kafka → Flink, the architecture is clearer and simpler. In particular, for the upstream MySQL sharding order transaction database, we can merge tables in different databases through mapping in Flink CDC and write to the downstream StarRocks after cleaning. The Canal and Kafka components are omitted to reduce the costs of hardware resources and human maintenance.
Data Storage Layer: We can replace ClickHouse with StarRocks to expand complex businesses through a flexible star model during business modeling without being limited by the business model of wide tables. The primary key model can adapt to order data changes in the MySQL business database and modify the historical data in StarRocks in real-time based on the order ID. When nodes are expanded, StarRocks is simpler and non-intrusive to the business and can complete automated data redistribution.
Performance: The 40 billion of a single table is associated with four million-dimension tables. The average query time is 400ms, and the TP99 is about 800ms. Compared with the original architecture, the performance is significantly improved. After StarRocks is replaced, CPU usage drops from 70% to 40% during peak hours. The hardware cost is saved.

Outlook

Recently, Alibaba Cloud E-MapReduce (EMR) cooperated with the StarRocks community to launch the first StarRocks cloud product. We can select Flink and StarRocks with corresponding specifications on EMR.

An excellent product should not be separated from community ecology. You are welcome to participate in the joint construction of StarRocks and Flink community and test the end-to-end real-time data warehouse link scheme StarRocks Primary Key x Flink CDC.

Community