Realtime Compute for Apache Flink: Use Apache Paimon to build a streaming lakehouse solution

Last Updated: Jul 01, 2024

Apache Paimon (Paimon) provides a unified storage format for different data types. Paimon can work with Apache Flink and Apache Spark to implement a real-time lakehouse architecture that supports both streaming and batch operations. Paimon combines the lake format with the log-structured merge-tree (LSM) structure to support real-time streaming updates in the lake architecture. You can use Paimon tables in Realtime Compute for Apache Flink to quickly build a data lake on cloud storage services, such as Object Storage Service (OSS).

Paimon provides the following capabilities:

  • Enhanced real-time data ingestion: Paimon can work with Realtime Compute for Apache Flink to ingest data from various database systems, such as MySQL, into a data lake, with automatic synchronization of schema changes and real-time updates. Tens of millions of data records can be ingested efficiently with low latency.

  • Unified stream and batch processing: Paimon can work with Apache Flink for stream processing and with Apache Spark for batch processing. Paimon provides a unified format for data lake storage, which improves ease of use and reduces costs.

  • Extensive ecosystem integration: Paimon can seamlessly integrate with a variety of Alibaba Cloud compute services, such as Realtime Compute for Apache Flink, E-MapReduce (Spark, StarRocks, Hive, and Trino), and MaxCompute.

  • Innovative lakehouse storage: Paimon uses deletion vectors and indexes to ensure minute-level latency for streaming, batch, and online analytical processing (OLAP) queries.

For more information, see Apache Paimon.

Usage

Familiarize yourself with Paimon

Create a Paimon catalog

A Paimon catalog provides access to Paimon tables stored in external systems. It allows you to manage Paimon tables in a centralized manner and can be accessed by other Alibaba Cloud services. You can use Paimon catalogs in the following ways:
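As an illustration of what registering a catalog involves, the following minimal sketch renders a Flink SQL `CREATE CATALOG` statement from a name and a map of options. The helper `render_create_catalog`, the catalog name, and the OSS path placeholder are hypothetical; the option keys shown (`type`, `metastore`, `warehouse`) are assumed to match a filesystem-backed Paimon catalog, and a real deployment typically needs additional settings, such as the OSS endpoint and credentials, that are not shown here.

```python
def render_create_catalog(name: str, options: dict[str, str]) -> str:
    """Render a Flink SQL CREATE CATALOG statement as text (hypothetical helper)."""
    # Escape single quotes so option values stay valid SQL string literals.
    rows = [
        "  '{}' = '{}'".format(key, value.replace("'", "''"))
        for key, value in options.items()
    ]
    return "CREATE CATALOG `{}` WITH (\n{}\n);".format(name, ",\n".join(rows))


# Assumed option keys for a filesystem-backed Paimon catalog on OSS.
# The catalog name and warehouse path are placeholders, not real values.
catalog_sql = render_create_catalog(
    "my_paimon_catalog",
    {
        "type": "paimon",
        "metastore": "filesystem",
        "warehouse": "oss://<bucket>/<path>",
    },
)
print(catalog_sql)
```

The rendered statement can be pasted into a SQL editor and adjusted; the script itself does not connect to any service.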

Create a Paimon table

Write data to a Paimon table

Consume data from a Paimon table

  • Query or consume data from a Paimon table. For more information, see Consume data from a Paimon table. If you want to consume data from a primary key table in streaming mode, make sure that you complete the changelog producer configuration.

  • Configure the consumer offset of a Paimon table. For more information, see Configure a consumer offset.

  • Save the consumer offset of a Paimon table or retain expired snapshot files that are still in use. For more information, see Specify a consumer ID.

  • Run a batch deployment to read the historical states of a Paimon table. For more information, see Batch Time Travel.
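The consumption options listed above can be pictured as dynamic table options attached to a query. The following sketch renders Flink SQL `SELECT` statements with an `OPTIONS` hint. The helper `select_with_options`, the table name `orders`, and the consumer ID `orders-reader` are hypothetical; the option keys `scan.mode`, `consumer-id`, and `scan.snapshot-id` are assumed to be the relevant Paimon table options, and the linked topics remain the authority on their exact behavior.

```python
def select_with_options(table: str, options: dict[str, str]) -> str:
    """Render a SELECT with a Flink dynamic table options hint (hypothetical helper)."""
    hint = ", ".join(
        "'{}' = '{}'".format(key, value) for key, value in options.items()
    )
    return "SELECT * FROM `{}` /*+ OPTIONS({}) */;".format(table, hint)


# Consumer offset: assumed to start a streaming read from the latest changes.
from_latest_sql = select_with_options("orders", {"scan.mode": "latest"})

# Consumer ID: assumed to save progress under a name so that the snapshots
# this reader still needs are retained.
with_consumer_sql = select_with_options("orders", {"consumer-id": "orders-reader"})

# Batch Time Travel: assumed to read the table as of a given snapshot.
time_travel_sql = select_with_options("orders", {"scan.snapshot-id": "1"})

for statement in (from_latest_sql, with_consumer_sql, time_travel_sql):
    print(statement)
```

For a primary key table consumed in streaming mode, the changelog producer still has to be configured on the table itself; a query hint like the ones sketched here is not a substitute for that table-level setting.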

Maintain a Paimon table