Apache Hudi is a data lake framework that allows you to update and delete data in Hadoop-compatible file systems. Hudi also allows you to consume changed data incrementally.
Table types
Hudi supports the following table types:
Copy on Write
Data is stored in the Parquet format. Each update creates a new version of files during a write operation.
Merge on Read
Data is stored in a combination of a columnar storage format, such as Parquet, and a row-based storage format, such as Avro. Base data is stored in columnar files, and incremental data is logged to row-based files that are compacted as required to create new versions of the columnar files.
The following table describes the differences between the table types.
Point of difference | Copy on Write | Merge on Read |
Data latency | High | Low |
Query latency | Low | High |
Update cost (I/O) | High (Each update creates a new version of files during a write operation.) | Low (Updated data is appended to a Delta log.) |
Parquet file size | Small | Large |
Write amplification | High | Low (depending on the merge policy) |
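The table type is selected when the table is first written. The following Spark (Scala) sketch shows one way this might look with the Spark DataSource API; the table name example_table, the id record key, the ts precombine field, the dt partition field, and the base path are all hypothetical and not taken from the text above. Switch hoodie.datasource.write.table.type to MERGE_ON_READ to create a Merge on Read table.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-table-type-example")
  // Kryo serialization is recommended for Hudi workloads.
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
import spark.implicits._

// A tiny DataFrame that stands in for real input data.
val df = Seq(
  (1, "a", "2024-01-01", 1000L),
  (2, "b", "2024-01-01", 1001L)
).toDF("id", "name", "dt", "ts")

df.write.format("hudi")
  .option("hoodie.table.name", "example_table")                   // hypothetical table name
  .option("hoodie.datasource.write.recordkey.field", "id")        // record key field
  .option("hoodie.datasource.write.partitionpath.field", "dt")    // partition field
  .option("hoodie.datasource.write.precombine.field", "ts")       // picks the latest record per key
  .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")  // or "MERGE_ON_READ"
  .mode(SaveMode.Overwrite)                                       // initial write; use Append afterwards
  .save("/tmp/hudi/example_table")                                // hypothetical base path
```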
Query types
Hudi supports the following three query types:
Snapshot queries
Queries the latest snapshot of the table as of a specific commit. For Merge on Read tables, Hudi merges the columnar base data with the real-time log data online during snapshot queries. For Copy on Write tables, Hudi queries the latest version of the data in the Parquet format.
Both Copy on Write and Merge on Read tables support snapshot queries.
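As a rough sketch, a snapshot query with the Spark DataSource API might look as follows, reusing the SparkSession and hypothetical base path from the write example above. Snapshot is the default query type, so the option is shown only for clarity.

```scala
// Snapshot query: read the latest committed view of the table.
val snapshotDF = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "snapshot") // default query type
  .load("/tmp/hudi/example_table")                    // hypothetical base path
snapshotDF.show()
```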
Incremental queries
Queries the latest data that is written after a specific commit.
Both Copy on Write and Merge on Read tables support incremental queries.
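A minimal incremental-query sketch follows, assuming the same table as above and a hypothetical commit timestamp; only records committed after hoodie.datasource.read.begin.instanttime are returned.

```scala
// Incremental query: read only the records written after the given commit time.
val beginTime = "20240101000000" // hypothetical commit timestamp (yyyyMMddHHmmss)
val incrementalDF = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", beginTime)
  .load("/tmp/hudi/example_table")
incrementalDF.show()
```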
Read-optimized queries
Queries only the latest compacted data before a specific commit. Read-optimized queries are optimized snapshot queries on Merge on Read tables: they avoid the query latency caused by merging log data online, at the cost of data freshness.
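For a Merge on Read table, a read-optimized query can be issued by switching the query type, as in the sketch below (same assumptions as the previous examples); only the compacted columnar base files are read.

```scala
// Read-optimized query: skip the online merge of delta log files and
// read only the columnar base files of a Merge on Read table.
val readOptimizedDF = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "read_optimized")
  .load("/tmp/hudi/example_table")
readOptimizedDF.show()
```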
The following table describes the differences between snapshot queries and read-optimized queries.
Point of difference | Snapshot queries | Read-optimized queries |
Data latency | Low | High |
Query latency | High for Merge on Read tables | Low |
Scenarios
Near real-time ingestion
Hudi can be used to insert, update, and delete data. You can ingest log data from Kafka and Log Service into Hudi in real time. You can also synchronize data updates recorded in the binary logs of databases to Hudi in real time.
Hudi optimizes the size and layout of the files that it generates during writes to mitigate the small-file problem in Hadoop Distributed File System (HDFS). In this respect, Hudi is friendlier to HDFS than traditional data lake solutions.
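As an illustration of updates and deletes, the following sketch upserts one changed record and deletes another, using the same hypothetical table and fields as the earlier write example; the write behavior is controlled by hoodie.datasource.write.operation.

```scala
// Reuses the SparkSession and implicits from the earlier write sketch.
import org.apache.spark.sql.SaveMode

// Upsert: rows whose key already exists update the stored records; new keys are inserted.
val updates = Seq((1, "a-updated", "2024-01-01", 2000L)).toDF("id", "name", "dt", "ts")
updates.write.format("hudi")
  .option("hoodie.table.name", "example_table")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "upsert")
  .mode(SaveMode.Append)
  .save("/tmp/hudi/example_table")

// Delete: remove the records that match the given keys.
val deletes = Seq((2, "b", "2024-01-01", 1001L)).toDF("id", "name", "dt", "ts")
deletes.write.format("hudi")
  .option("hoodie.table.name", "example_table")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "delete")
  .mode(SaveMode.Append)
  .save("/tmp/hudi/example_table")
```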
Near real-time analytics
Hudi supports various data analytics engines, such as Hive, Spark, Presto, and Impala. Hudi is lightweight and does not rely on additional long-running service processes.
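As one possible illustration, the Hudi table from the earlier sketches can be registered as a temporary view and queried with Spark SQL; the view name and filter are hypothetical.

```scala
// Register the Hudi table as a temporary view and query it with Spark SQL.
spark.read.format("hudi")
  .load("/tmp/hudi/example_table")          // hypothetical base path
  .createOrReplaceTempView("example_table")

spark.sql("SELECT id, name, ts FROM example_table WHERE ts > 1000").show()
```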
Incremental data processing
Hudi supports incremental queries. You can run Spark Streaming jobs to query only the data that is written after a specific commit. Hudi allows you to consume changed data in HDFS, and you can use this capability to optimize existing system architectures.
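A rough sketch of continuous incremental consumption with Spark Structured Streaming is shown below; it assumes a Hudi version that exposes the Hudi streaming source and reuses the hypothetical base path from the earlier examples. The checkpoint path is also hypothetical.

```scala
// Continuously consume newly committed records from the Hudi table as a stream.
val streamDF = spark.readStream.format("hudi")
  .load("/tmp/hudi/example_table")

val query = streamDF.writeStream
  .format("console")                                                    // print new records for demonstration
  .option("checkpointLocation", "/tmp/hudi/checkpoints/example_table")  // hypothetical checkpoint path
  .start()
query.awaitTermination()
```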