LI (nickname: zhixin) is the head of Table Store at Alibaba Cloud Intelligence, the founder of Apache Paimon, and an Apache Flink PMC member. WU is a big data expert at Tongcheng Travel and a contributor to Apache Hudi and Apache Paimon. DI is the head of the big data computing platform of Autohome Inc. WANG is a senior technical expert at China Unicom Digital Tech and a contributor to Apache Paimon.
Nowadays, leading global enterprises are shifting from traditional data analysis architectures to the lakehouse architecture built on top of Spark, Flink, or Presto. Traditional data analysis architectures are built on the Hive stack: Hadoop, Hadoop Distributed File System (HDFS), MapReduce, HiveSQL, and Hive storage. The lakehouse architecture supports underlying lake formats such as Apache Iceberg, Delta Lake, and Apache Hudi, with data stored in HDFS, Object Storage Service (OSS), or Amazon Simple Storage Service (Amazon S3).
The lakehouse architecture offers several benefits for data analysis and storage.
Timeliness is the core driving force behind this migration. Apache Flink is the core compute engine for improving timeliness, as shown in the following figure.
Compared with Hive, the lakehouse architecture has several advantages that make computing more convenient and improve timeliness to a certain extent.
The rightmost architecture in the preceding figure consists of Apache Flink, Kafka, and an online analytical processing (OLAP) system. This architecture can deliver timeliness in seconds. We plan to leverage the timeliness of Apache Flink for offline data warehouses.
However, the lakehouse architecture and Apache Flink are two different concepts and work in different ways.
An intermediate architecture is required to integrate Apache Flink into the lakehouse architecture and combine streaming with the lakehouse technology to create a brand-new streaming lakehouse architecture. This way, end-to-end stream computing is implemented in minutes. All data can be stored and queried in lakehouses.
The streaming lakehouse architecture implements unified batch and stream processing by combining stream computing of Apache Flink with the lakehouse technology.
In summary, the lakehouse architecture is attractive for enterprises but is difficult to run well. The biggest challenge lies in the lake format, which must handle streaming writes and the large number of updates that streams generate.
This section describes the exploration by the Apache Flink community and Alibaba Cloud of integrating Apache Flink with the lakehouse architecture.
Many improvements were made to the lakehouse architecture in 2023, but the exploration started earlier. In 2020, Alibaba Cloud tried to integrate Apache Flink with Apache Iceberg. Apache Iceberg is an outstanding lake format with a simple design and an open ecosystem. After the integration, Apache Iceberg was able to serve streaming reads and writes from Apache Flink. We also promoted stream updates to the Apache Iceberg community.
Such integration also encountered some issues. Apache Iceberg is designed for batch processing, and its simple architecture must be retained so that it can adapt to different compute engines. This hinders kernel improvements for stream updates. As a result, data in Apache Flink cannot be written to Apache Iceberg in real time. We recommend a service-level agreement (SLA) of 1-hour data updates for this combination.
Like other enterprises, Alibaba Cloud also attempted to integrate Apache Flink with Apache Hudi. Apache Hudi was introduced for Spark as a format that enhances the upsert capabilities of Spark. After Apache Flink was integrated with Apache Hudi, the time to update data in Apache Hudi decreased from the hours that Spark required to about 10 minutes. However, Apache Hudi is also designed for batch processing and is not suitable for stream computing and stream updates, which poses architectural challenges for future design. We recommend an SLA of 10-minute data updates for this combination.
To address the preceding issues, we designed a new streaming data lake format, Apache Paimon, formerly known as Flink Table Store. Apache Paimon innovatively combines the lake format with the log-structured merge-tree (LSM tree), bringing real-time stream updates into the lake architecture. Apache Paimon can be better integrated with both Apache Flink and Spark.
Thanks to this deep integration, Apache Paimon natively supports stream processing with Apache Flink and batch processing with Spark. Apache Paimon provides powerful streaming read and write capabilities and delivers streaming lake storage with a latency of 1 to 5 minutes.
The following figure shows the architecture of Apache Paimon:
Apache Paimon is a lake format designed for unified batch and stream processing. It stores data as files in OSS or HDFS.
You can use Flink change data capture (CDC) to ingest data into Apache Paimon with a few clicks. You can also use Flink SQL or Spark SQL to write data to Apache Paimon in batch or streaming mode. Apache Paimon supports nearly all existing open source engines. Apache Paimon data can also be read by Apache Flink or Spark in streaming mode, which is one of the unique capabilities of a streaming data lake.
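The following Flink SQL sketch illustrates this workflow. It is a minimal example under stated assumptions: the warehouse path, the table schema, and the kafka_source table are hypothetical, while the catalog type, primary key declaration, and scan option follow Apache Paimon's documented Flink integration.

```sql
-- Create a Paimon catalog backed by a warehouse path (path is illustrative).
CREATE CATALOG paimon_catalog WITH (
    'type' = 'paimon',
    'warehouse' = 'oss://my-bucket/paimon'  -- an HDFS path works as well
);

USE CATALOG paimon_catalog;

-- A primary key table; Paimon organizes updates in an LSM tree per bucket.
CREATE TABLE orders (
    order_id BIGINT,
    user_id  BIGINT,
    amount   DECIMAL(10, 2),
    PRIMARY KEY (order_id) NOT ENFORCED
);

-- Streaming write: results become visible at each Flink checkpoint.
INSERT INTO orders
SELECT order_id, user_id, amount FROM kafka_source;  -- kafka_source is hypothetical

-- Streaming read: a downstream job continuously consumes incremental changes.
SELECT * FROM orders /*+ OPTIONS('scan.mode' = 'latest') */;
```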
Flink Table Store was launched in January 2022, growing out of an idea proposed at Flink Forward Asia 2020 to build a new type of data store. After a year of research and development, Flink Table Store 0.3 was released in January 2023. We then decided to decouple Flink Table Store from Apache Flink, move it to the Apache community, and rename it Apache Paimon to broaden its ecosystem: Apache Paimon is not a data store specific to Apache Flink but is suitable for all compute engines in the Apache community. Two more versions were released in September and December 2023 to drive the development of Apache Paimon. In Apache Paimon 0.6, released in December 2023, primary key tables and append tables became fully production-ready.
Currently, the Apache Paimon community has more than 120 contributors from different industries, and the project has received more than 1,500 stars.
At Flink Forward Asia 2023, various enterprises such as Alibaba Cloud, Tongcheng Travel, Autohome Inc., China Unicom, and Ping An, shared numerous insights and practices related to Apache Paimon.
This section presents the dialogues with experts from Tongcheng Travel, Autohome Inc., and China Unicom Digital Tech. The dialogues can help you understand the practices of the streaming lakehouse architecture based on Apache Flink and Apache Paimon and the business benefits it offers.
Speakers
LI Jinsong | Head of Table Store of Alibaba Cloud Intelligence, founder of Apache Paimon, and Apache Flink PMC member
WU Xiangping | Big data expert at Tongcheng Travel, and an Apache Hudi and Apache Paimon contributor
WU works on big data computing and data lakehouses at Tongcheng Travel.
LI: Tongcheng Travel was already using Flink Table Store. How did you discover Flink Table Store, and why did you finally adopt Apache Paimon for your streaming lakehouse architecture?
WU: Our streaming lakehouse architecture has evolved in three stages:
(1) Spark and Apache Kudu
Similar to other enterprises, Tongcheng Travel used Hive as an offline data warehouse in the early stage. Later, we adopted Apache Kudu to meet growing requirements for real-time data processing. With Apache Kudu in place, data at the operational data store (ODS) layer was written to both Hive and Apache Kudu, so that applications with few dependencies could be developed directly on Apache Kudu. Offline Spark jobs were then scheduled to generate intermediate data from the original Kudu tables at intervals of 10 minutes to 1 hour.
This approach met specific business requirements. However, because Apache Kudu relied on offline scheduling in Spark, the data processing latency was about 10 minutes to 1 hour, which could not meet our timeliness requirements. In addition, ODS data was stored in both Hive and Apache Kudu, which duplicated storage and hindered data reuse. Apache Kudu was also difficult for data warehouse personnel to maintain, and the cost of the architecture was high because Apache Kudu ran on SSDs.
(2) Apache Flink and Apache Hudi
To address the preceding issues, we adopted Apache Flink and Apache Hudi for our streaming lakehouse architecture in 2022. This architecture resolved some issues of the Kudu-based architecture. For data reuse purposes, we used Apache Hudi to update tables at the ODS layer in real time, and these tables could also be used by downstream engines. We also implemented streaming reads based on Apache Hudi.
However, we also found issues in this architecture. It took more than 10 hours to synchronize data from an Apache Hudi table that contained more than 10 GB of data, and the more data an Apache Hudi table stored, the more resources synchronization consumed. In addition, data consistency could not be ensured, and query efficiency was lower than we expected.
(3) Apache Flink and Apache Paimon
To address the preceding issues, we started exploring Apache Paimon in June 2023 or even earlier. Using Apache Flink and Apache Paimon, we developed solutions, such as end-to-end real-time data processing, to the issues we had with Apache Hudi and improved the efficiency of streaming reads and writes. Thanks to the state-of-the-art design, the lakehouse technology, and the simple state management of Apache Paimon, this architecture ensures the consistency of final data, is compatible with a wide range of table formats, and supports data updates based on primary keys.
LI: At present, how many jobs of Tongcheng Travel are processed in Apache Paimon?
WU: Tongcheng Travel has migrated about 80% of jobs, equating to over 500 jobs, from Apache Hudi to Apache Paimon. We have built more than 10 scenarios based on end-to-end real-time computing of Apache Paimon. We have also built more than 10 scenarios for batch processing based on the lookup engine of Apache Paimon. Apache Paimon stores roughly 100 TB of data, which includes about 100 billion data records in total.
LI: Can you elaborate on the overall architecture and its benefits?
WU: Tongcheng Travel uses federated HDFS as the storage base, adopts a new lakehouse architecture, and provides a hybrid lakehouse platform that supports streaming and batch processing based on Apache Paimon. This architecture involves various table engines of Apache Paimon and uses query engines such as Flink, Spark, Trino, and StarRocks at the upper layer. With this architecture in place, we have achieved a 30% reduction in the synchronization resources required at the ODS layer and substantial improvements in read and write performance: write speeds have tripled, and some queries run seven times faster. Furthermore, the tag feature enables extensive reuse of otherwise duplicated data, reducing the storage space for data export by approximately 40%. Additionally, metric developers have seen their development efficiency increase by nearly 50% thanks to the reusability of intermediate data.
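To illustrate the tag feature mentioned above, here is a hedged Flink SQL sketch that snapshots a table as a tag and reads from it. The table name, tag name, and snapshot ID are hypothetical; sys.create_tag and the scan.tag-name option are the forms documented by Apache Paimon for Flink.

```sql
-- Snapshot the table's current state as a named tag (arguments are illustrative),
-- similar to how a Hive T+1 partition freezes a day's data.
CALL sys.create_tag('dw.orders', 'tag-2023-12-01', 5);

-- Batch jobs can then reuse the tagged data instead of exporting a full copy.
SELECT * FROM dw.orders /*+ OPTIONS('scan.tag-name' = 'tag-2023-12-01') */;
```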
LI: Apache Paimon brings great business value to production. Can you share Tongcheng Travel's future plan with us?
WU: Apache Paimon differs greatly from Hive in many aspects. For example, Apache Paimon offers additional management, cleansing, and merging services. We plan to improve the integration of these management services so that users can interact with lakehouse tables as seamlessly as they do with Hive, and we will gradually migrate our business from Hive to Apache Paimon. We will also follow the Apache Paimon community's work on streaming lineage and data repair, and build pipelines based on Apache Paimon more efficiently.
Speakers
LI Jinsong | Head of Table Store of Alibaba Cloud Intelligence, founder of Apache Paimon, and Apache Flink PMC member
DI Xingxing | Head of the big data computing platform of Autohome Inc.
DI Xingxing serves as the head of the big data computing platform of Autohome Inc. He has worked in the big data industry for nine years, is an early and deep user of Apache Flink, and interacts frequently with users in the Apache Paimon community. Autohome Inc. provides high-quality, comprehensive services throughout the whole lifecycle of automobiles, from selection and purchase to use and trade-in, with a focus on saving users' time and money.
LI: Can you describe the big data architecture of Autohome Inc. and the main roles of Flink and Apache Paimon in the architecture?
DI: As outlined by previous speakers, Autohome Inc.'s big data architecture follows a multi-layered structure, illustrated in the figure below.
LI: Why did Autohome Inc. choose Apache Paimon? What issues does Apache Paimon resolve for Autohome Inc.? What issues does the integration of Flink, Apache Paimon, and StarRocks resolve?
DI: In terms of big data architecture, Autohome Inc. has had experiences similar to those of the preceding enterprises, such as Tongcheng Travel. However, Autohome Inc. migrated its business from Apache Iceberg to Apache Paimon. In 2021, Autohome Inc. collaborated with the Apache Iceberg community to explore the implementation of lake formats, but we concluded that Apache Iceberg fell short of our requirements for stream processing. Despite offering high write throughput in data update scenarios, Apache Iceberg suffered from low query efficiency and required additional data merging tasks for acceptable performance. Therefore, we opted for Apache Paimon, given its superior stream processing capabilities and deep integration with Flink. In terms of streaming reads, downstream services can use Flink to consume the incremental data of Apache Paimon after the data is written, which ensures data ordering. Additionally, we found that interacting with the Apache Paimon community, which has roots in the Flink community, was more efficient than engaging with the Apache Iceberg community, predominantly because the latter is primarily composed of international members.
DI: Regarding the integration of Flink, Apache Paimon, and StarRocks, Flink is the base of real-time computing, while Apache Paimon is the base of lake formats. Both Flink and Apache Paimon help resolve the data timeliness issue, which is the biggest pain point of Autohome Inc. As mentioned by the speaker from Caocao Mobility, second-level data timeliness is not required in many scenarios, such as high-level management decision analysis and frontline operations; 5-minute data timeliness can fully meet analysis requirements. A pure real-time solution is not required in most cases. Built on Flink and Apache Paimon, the streaming lakehouse can provide the required data timeliness and achieve unified batch and stream processing.
Regarding efficiency in data application, StarRocks enhances query speeds once data is written into Apache Paimon in near-real-time mode. Apache Paimon has been integrated with StarRocks in our internal systems, such as our integrated development environment (IDE) platforms, report systems, and multi-dimensional analysis systems.
In conclusion, the integration of Flink and Apache Paimon helps resolve the data timeliness issue, while the integration of Apache Paimon and StarRocks helps improve data analysis efficiency.
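As a minimal sketch of how this timeliness trade-off is typically tuned: with a Paimon sink, data becomes visible when a Flink checkpoint commits, so the checkpoint interval is the main knob for end-to-end latency. The values below are illustrative.

```sql
-- Minute-level freshness: a Paimon snapshot is committed at each checkpoint.
SET 'execution.checkpointing.interval' = '1 min';

-- A longer interval (e.g., '5 min') trades freshness for fewer snapshots
-- and fewer small files.
```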
LI: In which scenarios does Autohome Inc. use Apache Paimon?
DI: Apache Paimon provides comprehensive lake format features, and Autohome Inc. leverages it in ways similar to other enterprises. For example, we have improved the timeliness of our recommendation model training by updating sample data in mere minutes instead of next-day (T+1) or next-hour (H+1), thanks to Apache Paimon's partial update capability. Tag splicing is essentially a JOIN operation, and traditional JOIN operations save a large amount of state data: JOIN operations across multiple streams are complex and occupy many resources, which results in high development costs. The partial update capability of Apache Paimon allows users to define a primary key for a table in advance so that each stream can write its own columns independently, which greatly reduces development costs and simplifies the business logic, as shown in the sketch below. We also apply this capability to analysis and report scenarios that have high requirements for data timeliness.
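A minimal sketch of this partial-update pattern, assuming hypothetical table and stream names: two streams write disjoint columns of one primary key table, and Paimon merges the rows by key instead of relying on a stateful multi-stream JOIN in Flink. The merge-engine option is the one documented by Apache Paimon.

```sql
-- Rows are merged by primary key; NULL columns do not overwrite
-- values already written by the other stream.
CREATE TABLE training_samples (
    sample_id BIGINT,
    features  STRING,
    label     STRING,
    PRIMARY KEY (sample_id) NOT ENFORCED
) WITH (
    'merge-engine' = 'partial-update'
);

-- One streaming job fills only the features column ...
INSERT INTO training_samples
SELECT sample_id, features, CAST(NULL AS STRING) FROM feature_stream;

-- ... and another fills only the label column.
INSERT INTO training_samples
SELECT sample_id, CAST(NULL AS STRING), label FROM label_stream;
```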
Speakers
LI Jinsong | Head of Table Store of Alibaba Cloud Intelligence, founder of Apache Paimon, and Apache Flink PMC member
WANG Yunpeng | Senior technical expert of China Unicom Digital Tech and Apache Paimon contributor
WANG Yunpeng is from China Unicom Digital Tech and will share insights into the implementation and application of Apache Paimon in China Unicom. In early 2021, to build a robust digital product and operations system and to better support a myriad of industries, China Unicom consolidated its five specialized subsidiaries to form China Unicom Digital Tech. This new entity unifies core competencies in cloud computing, big data, IoT, AI, blockchain, and security. With years of experience in data research and development, WANG Yunpeng leads the data computing R&D team within the Data Intelligence Division at China Unicom Digital Tech. He is mainly in charge of building the trillion-level real-time computing platform and streaming lakehouse for the company.
LI: What prompted China Unicom to adopt Apache Paimon, and how has its architecture evolved since then?
WANG: China Unicom turned to Apache Paimon to meet the data join and integration requirements that arose as our real-time computing platform evolved. Initially, the platform relied on Spark Streaming for micro-batch processing, but issues such as high latency and limited state management led us to adopt Flink, which improved data timeliness and enabled stateful stream computation. The initial business mainly involved event-based rule matching, which had minimal data join requirements. As our business grew, however, we encountered new challenges, such as joining streaming data with other streams and with batch data, and the need to integrate both types of data effectively. At that time, we implemented data joins and integration based on Flink state, and we also used external databases, such as HBase and Redis, as a solution. Yet, given China Unicom's vast data scale and the enormous size of individual tables, these databases struggled to cope with our requirements. A key issue we faced was the separation of streaming and batch data at the storage level, which led to data redundancy and inconsistencies, so it became necessary to unify streaming and batch data. As a solution, we saw the potential of a streaming lakehouse built on Flink and Apache Paimon, based on unified reads and writes of streaming and batch data and the efficient real-time update capabilities of Apache Paimon. Our goal was to unify the storage and computing of streaming and batch data and simplify the data processing architecture.
LI: What is the volume of data managed by China Unicom?
WANG: China Unicom manages an immense volume of data. Currently, our streaming platform supports around 700 streaming tasks, accumulating approximately 1,000 terabytes of data and processing trillions of data records every day. More than 100 Apache Paimon tables are in use, including primary key tables: a single table can hold nearly 0.8 billion primary keys, and the largest table holds 17 billion. Moreover, nearly 100 billion data records from Internet service providers (ISPs) are written every day to the table with the largest write volume.
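At this primary key scale, a table is typically sharded into a fixed number of buckets, each holding its own LSM tree. The following sketch shows how such a table might be declared; the schema and bucket count are illustrative, while the bucket option is the one documented by Apache Paimon.

```sql
-- A fixed-bucket primary key table for a very large key space.
CREATE TABLE isp_records (
    record_key STRING,
    payload    STRING,
    PRIMARY KEY (record_key) NOT ENFORCED
) WITH (
    'bucket' = '256'  -- bucket count sized to the key volume (illustrative)
);
```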
LI: Can you outline the current big data architecture and application scenarios of China Unicom?
WANG: In China Unicom, we use Apache Paimon for services that require minute-level latency, as well as for services that involve joins and integration of streaming and batch data. The following figure shows the typical use case for Apache Paimon, which is used to build a user-centered panoramic view.
In this architecture, users' basic information, usage information, and behavior information need to be joined, integrated, and updated in real time to support upper-layer services such as real-time subscription, real-time feature engineering, and real-time security. The architecture uses Apache Paimon tables for storage and Flink for data processing. The data sources mainly include database changelogs, user behavior data, latest location data, and Internet preferences. After a data source is connected, data at the ODS layer is ingested into the lake, layered, and joined through lakehouse tables. Simple JOIN operations are performed at the storage layer based on Apache Paimon and Flink, which reduces the amount of state data, implements unified storage of streaming and batch data, ensures data consistency, and simplifies the data processing architecture.
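A hedged sketch of such in-lake layering, with hypothetical database and table names: a streaming job incrementally reads the ODS-layer Paimon table and continuously maintains a cleansed DWD-layer table, with the scan.mode option following Apache Paimon's documentation.

```sql
-- Read the latest full snapshot of the ODS table, then follow its
-- incremental changes to keep the DWD table continuously up to date.
INSERT INTO dwd.user_behavior
SELECT user_id, event_type, event_time
FROM ods.user_behavior /*+ OPTIONS('scan.mode' = 'latest-full') */
WHERE event_type IS NOT NULL;
```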
Based on our practices, the required host resources are reduced by about 50%, and the efficiency of data revision, batch analysis, and real-time queries of intermediate results is also improved. In addition, the Apache Paimon community is active, which helps us resolve issues quickly.
In summary, Apache Paimon facilitates unified batch and stream processing. Integrated with Flink, it unifies both the computing and the storage of streaming and batch data, completing the unified batch and stream processing architecture.