Apache Paimon (Paimon) provides a unified storage format for different data types. Paimon can work with Apache Flink and Apache Spark to implement a real-time lakehouse architecture that supports streaming and batch operations. Paimon combines a lake format with the log-structured merge-tree (LSM) structure to support real-time streaming updates in a lake architecture. You can use Paimon tables in Realtime Compute for Apache Flink to quickly build a data lake on cloud storage services, such as Object Storage Service (OSS).
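To make the LSM idea above more concrete, the following is a minimal conceptual sketch of how sorted runs of keyed records can be merged so that the newest write for each primary key wins. It is a toy model, not Paimon's implementation: the `Run` type, the `merge_runs` function, and the use of `None` as a delete marker are all assumptions made for illustration.

```python
import heapq
from typing import List, Optional, Tuple

# A "sorted run" is modeled as a list of (primary_key, value) pairs sorted by
# key, with each key appearing at most once per run. A value of None marks a
# delete. These names and conventions are illustrative assumptions only.
Run = List[Tuple[int, Optional[str]]]


def merge_runs(runs: List[Run]) -> List[Tuple[int, str]]:
    """Merge sorted runs (ordered oldest to newest); the newest entry per key wins."""
    # Tag each entry with the negated run index so that, for equal keys,
    # entries from newer runs come first in the merged stream.
    tagged = [
        [(key, -index, value) for key, value in run]
        for index, run in enumerate(runs)
    ]
    merged: List[Tuple[int, str]] = []
    previous_key = None
    for key, _, value in heapq.merge(*tagged):
        if key == previous_key:
            continue  # a newer entry for this key was already handled
        previous_key = key
        if value is not None:  # skip keys whose newest entry is a delete
            merged.append((key, value))
    return merged


old_run: Run = [(1, "alice"), (2, "bob"), (3, "carol")]
new_run: Run = [(2, "bob-updated"), (3, None), (4, "dave")]
merged = merge_runs([old_run, new_run])
print(merged)
```

In this sketch, an update is only an append of a new sorted run. The older value is superseded when runs are merged, which is the property that makes frequent streaming updates cheap to write.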
Paimon provides the following capabilities:
Enhanced real-time data ingestion: Paimon can work with Realtime Compute for Apache Flink to ingest data from various database systems, such as MySQL, into a data lake in real time, with automatic synchronization of schema changes. Tens of millions of data records can be efficiently ingested with low latency.
Unified stream and batch processing: Paimon can work with Apache Flink for stream processing and with Apache Spark for batch processing. Paimon provides a unified format for data lake storage to improve ease of use and reduce costs.
Extensive ecosystem integration: Paimon can seamlessly integrate with a variety of Alibaba Cloud compute services, such as Realtime Compute for Apache Flink, E-MapReduce (Spark, StarRocks, Hive, and Trino), and MaxCompute.
Innovative lakehouse storage: Paimon uses deletion vectors and indexes to ensure minute-level latency for streaming, batch, and online analytical processing (OLAP) queries.
For more information, see Apache Paimon.
Usage
Familiarize yourself with Paimon
The first time you use Paimon, we recommend that you start with the basic features. For more information, see Getting started with basic features of Apache Paimon.
Learn the features of Paimon tables. If your data requires streaming updates, use primary key tables. Otherwise, use append-only tables (without primary keys).
For information about how Paimon ensures data freshness and consistency, see Data latency and consistency.
For a step-by-step guide to building a streaming lakehouse, see Build a streaming data lakehouse by using Realtime Compute for Apache Flink and Apache Paimon.
Create a Paimon catalog
A Paimon catalog provides access to Paimon tables stored in external systems. It allows you to manage Paimon tables in a centralized manner and can be accessed by other Alibaba Cloud services. You can use Paimon catalogs in the following ways:
Create and use a Paimon catalog. For more information, see Manage Apache Paimon catalogs.
Synchronize the metadata of a Paimon table to Data Lake Formation (DLF). For more information, see Create a Paimon DLF catalog.
Create a Paimon external table in MaxCompute to access the associated Paimon table. For more information, see Create a Paimon MaxCompute catalog.
Synchronize the metadata of a Paimon table to DLF and create a Paimon external table in MaxCompute. For more information, see Create a Paimon sync catalog.
Create a Paimon table
Directly create a Paimon table in a Paimon catalog. For more information, see Manage Paimon tables.
Synchronize data from external sources, such as MySQL and Apache Kafka, to create a Paimon table by using the CREATE TABLE AS (CTAS) statement or CREATE DATABASE AS (CDAS) statement. For more information, see Use the CREATE TABLE AS (CTAS) or CREATE DATABASE AS (CDAS) statement to create a table.
Write data to a Paimon table
Insert new data into or update data in a Paimon table. For more information, see Write data to a Paimon table.
Join a Paimon table with other tables and apply aggregate functions. For more information, see Merge engine.
Partially or completely overwrite a Paimon table. For more information, see Use the INSERT OVERWRITE statement to overwrite data.
Delete data from a Paimon table. For more information, see Use the DELETE statement to delete data.
Delete partitions from a Paimon table. For more information, see Modify the schema of an Apache Paimon table.
Consume data from a Paimon table
Query or consume data from a Paimon table. For more information, see Consume data from a Paimon table. If you want to consume data from a primary key table in streaming mode, make sure that you complete the changelog producer configuration.
Configure the consumer offset of a Paimon table. For more information, see Configure a consumer offset.
Save the consumer offset of a Paimon table or retain expired snapshot files that are still in use. For more information, see Specify a consumer ID.
Run a batch deployment to read the historical states of a Paimon table. For more information, see Batch Time Travel.
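The relationship between a consumer ID, a saved consumer offset, and snapshot expiration can be pictured with a small model. The sketch below is a simplified illustration, assuming that a consumer's saved offset is the ID of the next snapshot it needs to read and that expiration skips any snapshot a registered consumer has not read yet. The `SnapshotStore` class and its methods are hypothetical and are not part of any Paimon or Flink API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SnapshotStore:
    """Toy model of a table's snapshots plus the saved progress of its consumers."""

    snapshots: List[int] = field(default_factory=list)       # snapshot IDs, ascending
    consumers: Dict[str, int] = field(default_factory=dict)  # consumer ID -> next snapshot to read

    def commit(self) -> int:
        """Create a new snapshot with the next sequential ID."""
        next_id = self.snapshots[-1] + 1 if self.snapshots else 1
        self.snapshots.append(next_id)
        return next_id

    def read_next(self, consumer_id: str) -> List[int]:
        """Return the snapshots this consumer has not read yet and save its new offset."""
        default_start = self.snapshots[0] if self.snapshots else 1
        start = self.consumers.get(consumer_id, default_start)
        unread = [s for s in self.snapshots if s >= start]
        if unread:
            self.consumers[consumer_id] = unread[-1] + 1
        return unread

    def expire(self, retain_latest: int) -> None:
        """Drop old snapshots, but keep any that a consumer still has to read."""
        if not self.snapshots:
            return
        cutoff = self.snapshots[-1] - retain_latest + 1  # oldest ID kept by count
        if self.consumers:
            cutoff = min(cutoff, min(self.consumers.values()))
        self.snapshots = [s for s in self.snapshots if s >= cutoff]


store = SnapshotStore()
for _ in range(3):
    store.commit()
first_read = store.read_next("etl-job")   # reads everything committed so far
for _ in range(3):
    store.commit()
store.expire(retain_latest=1)             # the consumer's offset protects unread snapshots
second_read = store.read_next("etl-job")  # resumes where the first read stopped
print(first_read, store.snapshots, second_read)
```

Without the saved offset, the `expire` call in this model would keep only the latest snapshot, and a restarted consumer would miss the intermediate changes.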
Maintain a Paimon table
Learn how to address common issues related to Paimon. For more information, see FAQ about upstream and downstream storage.
Optimize the read and write performance of Paimon tables. For more information, see Performance optimization.
Query the metadata of a Paimon table, such as the partitions and the total size of files in each partition. For more information, see System tables.
Modify the schema of a table in a Paimon catalog. For more information, see Modify the schema of an Apache Paimon table.
Delete a table from a Paimon catalog. For more information, see Drop an Apache Paimon table.
Change the number of buckets for a Paimon table that uses fixed bucket mode. For more information, see Change the number of buckets in fixed bucket mode.
Clean up obsolete files in the directory of a Paimon table. For more information, see Clean up expired data.
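As background for why changing the number of buckets in fixed bucket mode is a dedicated maintenance operation, the sketch below models bucket assignment as a hash of the key modulo the bucket count. This is a simplified assumption for illustration: CRC32 stands in for whatever hash function the engine actually uses, and `assign_bucket` and `moved_keys` are hypothetical helper names.

```python
import zlib
from typing import Iterable, List


def assign_bucket(key: str, num_buckets: int) -> int:
    """Map a key to a bucket with a stable hash (CRC32 is a stand-in, not Paimon's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def moved_keys(keys: Iterable[str], old_buckets: int, new_buckets: int) -> List[str]:
    """Return the keys whose bucket changes when the bucket count changes."""
    return [
        key
        for key in keys
        if assign_bucket(key, old_buckets) != assign_bucket(key, new_buckets)
    ]


keys = [f"user-{i}" for i in range(1000)]
moved = moved_keys(keys, old_buckets=4, new_buckets=8)
print(f"{len(moved)} of {len(keys)} keys map to a different bucket")
```

Under this model, a key's bucket depends on the bucket count, so existing records generally cannot stay where they are after the count changes. That is why such a change typically involves reorganizing existing data rather than only editing a table option.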