Apache Paimon is an advanced lake format that supports building a Realtime Lakehouse Architecture, effectively integrating with Apache Flink and Apache Spark for both streaming and batch processes. It utilizes a combination of lake format and LSM (log-structured merge-tree) to facilitate real-time streaming updates within lake architectures. Key features include real-time updates with high performance, large-scale append data processing, and comprehensive data lake capabilities such as ACID transactions and schema evolution. For developers interested in exploring Apache Paimon, there are quick-start guides available for both Apache Flink and Apache Spark.
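To make the LSM idea concrete, here is a minimal toy sketch in Python. It is not Paimon's implementation: the class name ToyLSM, the memtable_limit parameter, and the in-memory "runs" are all assumptions made for illustration. It only shows the general pattern an LSM tree relies on: writes land in a small buffer, the buffer is flushed as an immutable sorted run, reads prefer the newest data, and compaction merges runs so that the latest value per key wins.

```python
from typing import Dict, List, Optional, Tuple

_TOMBSTONE = object()  # hypothetical marker recording that a key was deleted


class ToyLSM:
    """Toy LSM tree: an in-memory buffer plus immutable sorted runs."""

    def __init__(self, memtable_limit: int = 2) -> None:
        self.memtable_limit = memtable_limit
        self.memtable: Dict[str, object] = {}
        self.runs: List[List[Tuple[str, object]]] = []  # oldest first, newest last

    def put(self, key: str, value: object) -> None:
        # Writes only touch the buffer, which keeps updates cheap.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key: str) -> None:
        # A delete is just another write that carries a tombstone.
        self.put(key, _TOMBSTONE)

    def flush(self) -> None:
        # Persist the buffer as one immutable run sorted by key.
        if self.memtable:
            self.runs.append(sorted(self.memtable.items(), key=lambda kv: kv[0]))
            self.memtable = {}

    def get(self, key: str) -> Optional[object]:
        # Newest data wins: check the buffer, then runs from newest to oldest.
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is _TOMBSTONE else value
        for run in reversed(self.runs):
            for k, v in run:  # linear scan for brevity; a real store would search
                if k == key:
                    return None if v is _TOMBSTONE else v
        return None

    def compact(self) -> None:
        # Merge all runs, letting newer runs overwrite older ones,
        # and drop keys whose latest entry is a tombstone.
        merged: Dict[str, object] = {}
        for run in self.runs:
            merged.update(run)
        live = sorted(
            ((k, v) for k, v in merged.items() if v is not _TOMBSTONE),
            key=lambda kv: kv[0],
        )
        self.runs = [live] if live else []


if __name__ == "__main__":
    store = ToyLSM(memtable_limit=2)
    store.put("order-1", "created")
    store.put("order-2", "created")  # buffer is full, so it is flushed as run 1
    store.put("order-1", "paid")
    store.delete("order-2")          # buffer is full again, flushed as run 2
    print(store.get("order-1"))      # paid
    print(store.get("order-2"))      # None
    store.compact()
    print(store.runs)                # [[('order-1', 'paid')]]
```

The design point is that an update never rewrites existing data in place; the cost of reconciling versions is deferred to reads and to background compaction.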
Building on this foundation, the challenges in streaming computing highlight the necessity for Paimon's robust capabilities:
These challenges called for a new solution, which led Jinsong Li to develop Apache Paimon, a format tailored specifically to these needs. His work across various open-source projects, together with the evolution of the stream computing landscape, shaped the inception of Apache Paimon.
In the pursuit of improving real-time data processing capabilities, one approach involved enhancing Apache Hive to handle real-time data streams. This method sought to transform Apache Hive from a batch-oriented system into one that could support streaming inputs. By leveraging the batch capabilities of Apache Hive alongside streaming data, the solution aimed to reduce storage costs and increase query flexibility.
However, the main challenge was achieving low-latency processing while maintaining the data consistency guarantees typically associated with batch processing. The real-time solution using Apache Hive, in conjunction with Flink Hive Sink, addresses several crucial aspects of data handling:
Another significant attempt was the real-time enhancement of Apache Iceberg. This approach aimed to extend Apache Iceberg's capabilities to handle real-time data flows more effectively, providing stronger ACID guarantees and improving metadata management. The real-time enhancement of Apache Iceberg focused on integrating it with Apache Flink to enhance its data processing capabilities. This collaboration aimed to allow real-time data ingestion into the data lake and facilitate streaming reads directly from Apache Iceberg, enhancing the usability of traditional Apache Hive-based data warehouses. However, this solution faced significant challenges in handling upsert scenarios, crucial for Change Data Capture (CDC) processes. It struggled with the high storage and computational costs associated with maintaining full and incremental tables and was cumbersome to manage. Moreover, the architecture had difficulties efficiently processing CDC data generated during stream computing.
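The cost of maintaining full and incremental tables can be illustrated with a small Python sketch. Everything here is an assumption for illustration: the function name merge_increment, the row layout, and the use of Flink-style change markers (+I, -U, +U, -D) as operation codes. It is not how Apache Iceberg or any specific pipeline implements the merge; it only shows that producing an up-to-date full table means combining the previous full snapshot with the changes by primary key.

```python
from typing import Dict, Iterable, Tuple

Row = Dict[str, object]
# (operation, primary key, row). The operation codes are assumed for this sketch.
Change = Tuple[str, str, Row]


def merge_increment(full: Dict[str, Row], changes: Iterable[Change]) -> Dict[str, Row]:
    """Return a new full snapshot with the incremental changes applied in order."""
    merged = dict(full)  # copies the whole key space of the previous snapshot
    for op, key, row in changes:
        if op in ("+I", "+U"):
            merged[key] = row       # insert, or the new image of an update
        elif op == "-U":
            continue                # old image of an update; the +U that follows replaces it
        elif op == "-D":
            merged.pop(key, None)   # delete
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


if __name__ == "__main__":
    full_table = {
        "1": {"status": "created"},
        "2": {"status": "created"},
    }
    incremental = [
        ("+U", "1", {"status": "paid"}),
        ("-D", "2", {}),
        ("+I", "3", {"status": "created"}),
    ]
    print(merge_increment(full_table, incremental))
    # {'1': {'status': 'paid'}, '3': {'status': 'created'}}
```

Note that even a handful of changes forces a pass over the entire previous snapshot, and both the old and new full tables exist at once. At data lake scale this is the storage and compute overhead described above.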
The exploration of upsert capabilities with Apache Hudi represented a pivotal development in streaming computing and storage. This initiative focused on integrating Apache Hudi to support upserts (inserts and updates), which are critical for real-time data processing where changes to data must be captured instantly. Apache Hudi's approach offered a way to manage streaming data that changes frequently, providing efficient mechanisms for scenarios where data mutability and state consistency are necessary.
Advantages: Apache Hudi handles upserts by using Apache Flink State to map keys to file groups, which automates the scaling process. Apache Hudi's Bucket Index further improves performance: it addresses significant issues with the Apache Flink State index by dividing data into multiple buckets determined by a hash function. This simplifies indexing and alleviates many of the performance problems previously encountered.
Drawbacks: The primary drawback of Apache Hudi's approach involves its complex system design, which often leads to performance degradation, especially when handling large datasets exceeding 500 million entries. This complexity also results in high storage costs as all indexes are stored in RocksDB State, making the system less efficient. Additionally, data consistency can be compromised if other engines attempt to read or write, disrupting the state’s index. Moreover, selecting an appropriate bucket number for the Bucket Index can be challenging, impacting performance and leading to issues with small file management.
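The bucket index idea, and the difficulty of picking a bucket number, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Apache Hudi's code: the function names bucket_for and bucket_sizes are hypothetical, and zlib.crc32 is used only as a deterministic stand-in for whatever hash function the real index applies.

```python
import zlib
from collections import Counter
from typing import Iterable


def bucket_for(key: str, num_buckets: int) -> int:
    """Map a record key to a bucket using a deterministic hash (stand-in: CRC32)."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bucket_sizes(keys: Iterable[str], num_buckets: int) -> Counter:
    """Count how many keys land in each bucket."""
    return Counter(bucket_for(key, num_buckets) for key in keys)


if __name__ == "__main__":
    keys = [f"user-{i}" for i in range(1000)]
    for num_buckets in (4, 64, 1024):
        sizes = bucket_sizes(keys, num_buckets)
        # Few buckets: each one holds many records.
        # Many buckets: each one holds very few records, which in a real
        # table tends to translate into many small files.
        print(num_buckets, "buckets ->", max(sizes.values()), "records in the fullest bucket")
```

Because the bucket is a pure function of the key, no external state is needed to find where a record lives. The flip side is that the bucket number becomes part of the data layout: the same key generally maps to a different bucket when num_buckets changes, which is one reason the choice is hard to revise later.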
Apache Hudi's system, originally designed for batch processing with Apache Spark, struggles to fully adapt to stream processing scenarios, leading to increased system complexity and maintenance difficulties. Despite improvements in stability in recent versions, Apache Hudi's adaptability issues and intricate settings make it less user-friendly, particularly for new users navigating its multiple operational modes.
Driven by the need for a system that could handle streaming data more effectively than existing technologies, the project aims to integrate streaming computation with data lake storage, providing real-time updates and data management tailored to modern data architectures.
The ideal solution for a streaming lake format would encompass a robust architectural foundation similar to Apache Iceberg, meeting all basic demands of lake storage:
Building on the ideal solution for a streaming lake format, the inception of Apache Paimon marks a significant stride towards realizing these aspirations. Emerging from discussions within the Apache Flink community, Apache Paimon integrates robust data management strategies to cater effectively to dynamic streaming environments.
Key Features of Apache Paimon:
The benefits of Apache Paimon include:
Apache Paimon serves a range of use cases, particularly enhancing the functionality of streaming data architectures. One notable application is its integration in building streaming data warehouses, where it supports seamless real-time data processing and analytics. This is particularly beneficial for organizations looking to streamline their data processing frameworks and reduce the complexity typically associated with large-scale data operations.
Below are some of the industry case studies of Apache Paimon:
Tongcheng Travel, a significant player in the travel industry in China, embarked on a transformative journey with Apache Paimon to enhance their data management and processing capabilities. Initially, the company utilized Apache Hive and later Apache Kudu to manage its data warehouse needs, aiming to meet the increasing demands for real-time data processing. However, the latency and complexity in managing data storage posed challenges. In pursuit of a more efficient solution, Tongcheng Travel shifted to Apache Flink and Apache Hudi, which improved data reuse and streaming capabilities but still struggled with data synchronization and consistency.
Recognizing these challenges, Tongcheng Travel transitioned to Apache Paimon in 2023, leveraging its advanced features for real-time data processing and efficient state management. This shift enabled the company to run about 80% of its jobs on Apache Paimon, enhancing the performance of over 500 jobs and managing roughly 100 TB of data across various real-time and batch processing scenarios. The use of Apache Paimon led to significant improvements, including a 30% reduction in synchronization resources, a threefold increase in write speeds, and substantial query efficiency gains. This case study shows how Apache Paimon can optimize data lakehouse architectures, improving both data handling and operational efficiency in large-scale business environments.
Autohome Inc., a leader in automotive services, has significantly advanced its big data architecture by integrating Apache Paimon into its systems. Spearheaded by Di Xingxing, the head of the big data computing platform at Autohome, the company transitioned from Apache Iceberg to Apache Paimon due to the latter's superior stream processing capabilities and efficient community interaction. This shift was motivated by the need for improved data timeliness and the ability to handle real-time data updates more effectively.
Apache Paimon's integration with Apache Flink and StarRocks at Autohome has created a robust streaming lakehouse architecture that enhances real-time computing and data analysis efficiency. This system enables Autohome to update its recommendation models and other data-driven processes from daily or hourly updates to updates within minutes, significantly reducing data latency and supporting dynamic decision-making. The use of Apache Paimon at Autohome exemplifies its utility in large-scale enterprises where data timeliness and processing efficiency are crucial.
China Unicom has successfully integrated Apache Paimon into its streaming lakehouse architecture, spearheaded by WANG Yunpeng of China Unicom Digital Tech. The organization initially utilized Apache Spark Streaming and later transitioned to Apache Flink to overcome challenges related to high latency and state data management in real-time processing. As the data complexities and volume escalated, the existing systems struggled to meet the dynamic needs of data integration and management. This led China Unicom to adopt Apache Paimon, which facilitated a unified approach to handling streaming and batch data, significantly reducing data redundancy and inconsistencies.
The implementation of Apache Paimon allowed China Unicom to manage a vast volume of data, supporting 700 streaming tasks, and processing trillions of data points across more than 100 tables. This strategic move not only simplified their data architecture by enabling efficient real-time updates and integrations but also significantly enhanced the performance of their data operations. The architecture leverages Apache Paimon for minute-level latency requirements and complex data integration tasks, proving essential for real-time applications crucial to China Unicom's operations.
Apache Paimon on Alibaba Cloud offers unique features that enhance real-time data ingestion into data lakes. This integration allows for high-throughput data writing and low-latency queries, supporting both streaming and batch data processing. Apache Paimon can be easily integrated with Alibaba Cloud services such as Realtime Compute for Apache Flink, as well as with big data frameworks like Apache Spark and Apache Hive. This setup facilitates the construction of data lakes on Hadoop Distributed File System (HDFS) or Object Storage Service (OSS), enhancing data lake analytics capabilities. To experience the advanced features of Apache Paimon on Alibaba Cloud, visit Realtime Compute for Apache Flink and start your free trial.
Apache Paimon is an advanced lake format designed for Realtime Lakehouse Architecture, seamlessly integrating with Apache Flink and Apache Spark for streaming and batch processes. It addresses challenges in streaming computing by offering real-time updates, high-performance data processing, and comprehensive data lake capabilities. Leveraging earlier enhancements in technologies like Apache Hive, Apache Iceberg, and Apache Hudi, Apache Paimon emerges as a production-ready solution with dynamic table storage and deep Apache Flink integration.
Through industry case studies, including Tongcheng Travel, Autohome Inc., and China Unicom, Apache Paimon demonstrates its transformative impact on optimizing data lakehouse architectures and enhancing operational efficiency. With integration on Alibaba Cloud, Apache Paimon facilitates seamless real-time data ingestion into data lakes, empowering organizations to leverage streaming data effectively.