
E-MapReduce: Hudi

Last Updated: Dec 12, 2024

Apache Hudi is a data lake framework that allows you to update and delete data in Hadoop-compatible file systems. Hudi also allows you to consume changed data.

Table types

Hudi supports the following table types:

  • Copy on Write

    Data is stored in the Parquet format. Each update creates a new version of files during a write operation.

  • Merge on Read

    Data is stored in a combination of a columnar storage format, such as Parquet, and a row-based storage format, such as Avro. Base data is stored in columnar files, and incremental data is logged to row-based files. The row-based files are compacted as required to create new versions of the columnar files.

The following table describes the differences between the table types.

| Point of difference | Copy on Write | Merge on Read |
| ------------------- | ------------- | ------------- |
| Data latency | High | Low |
| Query latency | Low | High |
| Update cost (I/O) | High (each update creates a new version of files during a write operation) | Low (updated data is appended to a delta log) |
| Parquet file size | Small | Large |
| Write amplification | High | Low (depending on the merge policy) |
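
The following Spark (Scala) sketch shows how the table type is selected when data is written to a Hudi table. It assumes the Hudi Spark bundle is on the classpath; the application name, table name, field names, and paths are illustrative placeholders, while the hoodie.* option keys are standard Hudi write options.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Minimal sketch, assuming the Hudi Spark bundle is available.
// Table name, field names, and path are placeholders.
val spark = SparkSession.builder()
  .appName("hudi-table-type-example")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

import spark.implicits._

val df = Seq(
  ("o-001", "2024-12-01 10:00:00", "2024-12-01", 42.0)
).toDF("order_id", "update_ts", "dt", "amount")

df.write.format("hudi")
  // Choose the table type according to the trade-offs in the table above.
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ") // or "COPY_ON_WRITE"
  .option("hoodie.table.name", "orders_hudi")
  .option("hoodie.datasource.write.recordkey.field", "order_id")
  .option("hoodie.datasource.write.precombine.field", "update_ts")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .option("hoodie.datasource.write.operation", "upsert")
  .mode(SaveMode.Append)
  .save("hdfs:///tmp/hudi/orders_hudi")
```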

Query types

Hudi supports the following three query types:

  • Snapshot queries

    Queries the latest snapshot of the table as of a specific commit. For Merge on Read tables, Hudi merges the columnar base data and the real-time log data online during a snapshot query. For Copy on Write tables, a snapshot query returns the latest version of the data in the Parquet format.

    Both Copy on Write and Merge on Read tables support snapshot queries.

  • Incremental queries

    Queries the latest data that is written after a specific commit.

    Both Copy on Write and Merge on Read tables support incremental queries.

  • Read-optimized queries

    Queries only the latest data in the columnar base files as of a specific commit or compaction action. Read-optimized queries are optimized snapshot queries on Merge on Read tables. This query type reduces the query latency caused by the online merging of log data, at the cost of data freshness.

The following table describes the differences between snapshot queries and read-optimized queries.

| Point of difference | Snapshot queries | Read-optimized queries |
| ------------------- | ---------------- | ---------------------- |
| Data latency | Low | High |
| Query latency | High for Merge on Read tables | Low |
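
The following Spark (Scala) sketch illustrates the three query types as Spark reads against a Hudi table. The table path and commit time are illustrative placeholders; the hoodie.datasource.query.type and hoodie.datasource.read.begin.instanttime keys are standard Hudi read options.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch of the three query types, assuming a Hudi table already
// exists at the illustrative path below.
val spark = SparkSession.builder().appName("hudi-query-types").getOrCreate()
val basePath = "hdfs:///tmp/hudi/orders_hudi"

// Snapshot query (default): the latest committed view of the table.
val snapshotDF = spark.read.format("hudi")
  .load(basePath)

// Read-optimized query: reads only the compacted columnar base files and
// skips the online merging of log data on Merge on Read tables.
val readOptimizedDF = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "read_optimized")
  .load(basePath)

// Incremental query: returns records written after the given commit time.
// The commit time below is an illustrative placeholder.
val incrementalDF = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20241201000000")
  .load(basePath)

snapshotDF.show()
```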

Scenarios

  • Near real-time ingestion

    Hudi can be used to insert, update, and delete data. You can ingest log data from Kafka and Log Service into Hudi in real time, and you can synchronize data updates recorded in the binary logs of databases to Hudi in real time. A streaming ingestion sketch follows this list.

    Hudi manages the size of the files that it generates during writes to mitigate the small-file problem in Hadoop Distributed File System (HDFS). In this respect, Hudi is friendlier to HDFS than traditional data lake solutions.

  • Near real-time analytics

    Hudi supports various data analytics engines, such as Hive, Spark, Presto, and Impala. Hudi is lightweight and does not rely on additional service processes.

  • Incremental data processing

    Hudi supports incremental queries. You can run Spark Streaming jobs to query the latest data that is written after a specific commit. Hudi allows you to consume changed data in HDFS, and you can use this feature to optimize your existing system architecture.
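
As referenced in the near real-time ingestion scenario above, the following Spark (Scala) sketch shows one way to stream data from Kafka into a Hudi table with Spark Structured Streaming. The broker address, topic, JSON fields, and paths are illustrative placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Minimal sketch of near real-time ingestion from Kafka into a Hudi table
// with Spark Structured Streaming. Broker, topic, fields, and paths are
// illustrative placeholders.
val spark = SparkSession.builder().appName("kafka-to-hudi").getOrCreate()

// Read JSON records from a Kafka topic and extract the fields of interest.
val source = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka-broker:9092")
  .option("subscribe", "orders")
  .load()
  .selectExpr("CAST(value AS STRING) AS json")
  .select(
    get_json_object(col("json"), "$.order_id").as("order_id"),
    get_json_object(col("json"), "$.update_ts").as("update_ts"),
    get_json_object(col("json"), "$.dt").as("dt")
  )

// Continuously upsert the stream into a Hudi table.
source.writeStream
  .format("hudi")
  .option("hoodie.table.name", "orders_hudi")
  .option("hoodie.datasource.write.recordkey.field", "order_id")
  .option("hoodie.datasource.write.precombine.field", "update_ts")
  .option("hoodie.datasource.write.partitionpath.field", "dt")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("checkpointLocation", "hdfs:///tmp/hudi/checkpoints/orders")
  .outputMode("append")
  .start("hdfs:///tmp/hudi/orders_hudi")
  .awaitTermination()
```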