Iceberg is an open table format for data lakes. You can use Iceberg to quickly build a data lake storage service on Hadoop Distributed File System (HDFS) or Alibaba Cloud Object Storage Service (OSS). You can then use a compute engine from the open source big data ecosystem, such as Apache Flink, Apache Spark, Apache Hive, or Presto, to analyze the data in your data lake.
Features
Apache Iceberg was originally designed to migrate Hive data to the cloud. Over multiple releases, it has become a standard table format for data lakes that are deployed on the cloud. For more information, visit the Apache Iceberg official website.
Apache Iceberg provides the following features:
Builds a low-cost lightweight data lake storage service based on HDFS or an object storage system.
Connects to mainstream open source compute engines for data ingestion and analysis.
Provides comprehensive ACID (atomicity, consistency, isolation, and durability) semantics.
Supports row-level data changes.
Supports historical version backtracking.
Supports efficient data filtering.
Supports schema changes.
Supports partition changes.
Supports hidden partitioning.
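Hidden partitioning means that Iceberg derives partition values from column values through a transform (such as `days` or `bucket`), so queries filter on the original column and Iceberg prunes partitions automatically, without users knowing how the table is partitioned. The following pure-Python sketch models the idea; `days_transform` and `prune` are illustrative helpers, not Iceberg APIs:

```python
from datetime import datetime, timezone

def days_transform(ts: datetime) -> int:
    """Iceberg-style `days` transform: number of days since the Unix epoch."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return (ts - epoch).days

# Each "data file" records the range of partition values it contains,
# similar to the metadata Iceberg keeps in its manifests.
data_files = [
    {"path": "f1.parquet",
     "min_day": days_transform(datetime(2021, 9, 1, tzinfo=timezone.utc)),
     "max_day": days_transform(datetime(2021, 9, 1, tzinfo=timezone.utc))},
    {"path": "f2.parquet",
     "min_day": days_transform(datetime(2021, 9, 2, tzinfo=timezone.utc)),
     "max_day": days_transform(datetime(2021, 9, 3, tzinfo=timezone.utc))},
]

def prune(files, ts: datetime):
    """Keep only the files whose day range can contain the queried timestamp.

    The caller filters on the timestamp column; the partition value
    stays hidden behind the transform.
    """
    day = days_transform(ts)
    return [f["path"] for f in files if f["min_day"] <= day <= f["max_day"]]

print(prune(data_files, datetime(2021, 9, 3, tzinfo=timezone.utc)))  # ['f2.parquet']
```

Because the transform is stored in table metadata, changing it later (partition evolution) does not require rewriting existing data files.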
The following table compares open source ClickHouse (real-time data warehouse), open source Hive (offline data warehouse), and Alibaba Cloud E-MapReduce (EMR) Iceberg (data lake) from the dimensions of system architecture, business value, and maintenance costs.
| Item | Subitem | Open source ClickHouse | Open source Hive | Alibaba Cloud EMR Iceberg |
| --- | --- | --- | --- | --- |
| System architecture | Architecture | Integrated computing and storage | Decoupled computing and storage | Decoupled computing and storage |
| | Multiple compute engines | Not supported | Supported | Supported |
| | Data storage in an object storage system | Not supported | Not fully supported | Supported |
| | Data storage in HDFS | Not supported | Supported | Supported |
| | Storage format openness | No | Yes | Yes |
| Business value | Timeliness | Accurate to the second | Accurate to the hour or day | Accurate to the minute |
| | Computing flexibility | Low | High | High |
| | Transactions | Not supported | Not fully supported | Supported |
| | Table-level semantic generality | Poor | Poor | Excellent |
| | Row-level data change | Not supported | Limited support | Supported |
| | Data quality | Excellent | Good | Good |
| Maintenance costs | Query performance | High | Very high | Very high |
| | Storage costs | High | Medium | Low |
| | Self-service | Not supported | Not supported | Supported |
| | Resource scalability | Medium | Medium | Excellent |
Comparison between Alibaba Cloud EMR Iceberg and Apache Iceberg
The following table compares Alibaba Cloud EMR Iceberg and Apache Iceberg from the dimensions of basic features, data changes, and compute engines.
The check mark (✓) indicates that the related item is supported, and the cross mark (x) indicates that the related item is not supported.
| Category | Item | Subitem | Apache Iceberg | EMR Iceberg |
| --- | --- | --- | --- | --- |
| Basic features | ACID | None | ✓ | ✓ |
| | Historical version backtracking | None | ✓ | ✓ |
| | Source and sink integration | Batch | ✓ | ✓ |
| | | Streaming | ✓ | ✓ |
| | Efficient data filtering | None | ✓ | ✓ |
| Data changes | Schema evolution | None | ✓ | ✓ |
| | Partition evolution | None | ✓ | ✓ |
| | Copy-on-write update | None | ✓ | ✓ |
| | Merge-on-read update | Read | ✓ | ✓ |
| | | Write | ✓ | ✓ |
| | | Compaction | x | x |
| Compute engines | Apache Spark | Read | ✓ | ✓ |
| | | Write | ✓ | ✓ |
| | Apache Hive | Read | ✓ | ✓ |
| | | Write | ✓ | ✓ |
| | Apache Flink | Read | ✓ | ✓ |
| | | Write | ✓ | ✓ |
| | PrestoDB or Trino | Read | ✓ | ✓ |
| | | Write | ✓ | ✓ |
| Programming languages | Java | None | ✓ | ✓ |
| | Python | None | ✓ | ✓ |
| Advanced features | Native connection to Alibaba Cloud OSS | None | x | ✓ |
| | Native connection to Alibaba Cloud Data Lake Formation (DLF) | None | x | ✓ |
| | Data access acceleration based on data caching in local disks | None | x | ✓ |
| | Automatic merging of small files | None | x | ✓ |
The information in this table reflects the status of Apache Iceberg and Alibaba Cloud EMR Iceberg as of the end of September 2021 and may change as both projects are updated.
Scenarios
Iceberg is one of the core components of a general-purpose data lake service. The following table describes the scenarios in which you can use Iceberg.
| Scenario | Description |
| --- | --- |
| Write and read data in real time | Upstream data is ingested into an Iceberg-based data lake in real time and can be queried immediately. For example, you can run a Flink or Spark streaming job to write log data to an Iceberg table in real time, and then use a compute engine such as Hive, Spark, Flink, or Presto to read the data. For more information, see Apache Iceberg connector, Run a Spark streaming job to write data to an Iceberg table, and Use Spark to read data. Iceberg supports ACID transactions, which isolate write operations from read operations and prevent readers from seeing dirty data. |
| Delete or update data | Most data warehouses do not support row-level deletion or updates. In most cases, you must run an offline job that reads all data from a source table, changes the data, and writes the changed data back to the table. With Iceberg, changes are applied to the affected data files instead of the entire table, so the scope of a change operation is much smaller. In an Iceberg-based data lake, you can change row-level data by running a statement that is similar to the `MERGE INTO` statement of a database. |
| Control data quality | You can use the schema of an Iceberg table to check data as it is being written, and delete or further process abnormal data. |
| Change the schema of a table | You can use the DDL statements supported by Spark SQL to change the schema of an Iceberg table. A schema change does not require exporting all historical data in the table based on the new schema, so the change completes quickly. Iceberg supports ACID transactions, which prevent schema changes from affecting in-flight read operations. This way, the data that you read stays consistent with the data that you write. |
| Real-time machine learning | In machine learning scenarios, preparing data, such as cleansing, converting, and characterizing it, can take a long time, and you often need to process both historical and real-time data. Iceberg simplifies these workflows by providing a complete and reliable real-time stream for cleansing, converting, and characterizing data, so you no longer need to maintain separate pipelines for historical and real-time data. Iceberg also provides a native SDK for Python to meet the requirements of developers who use machine learning algorithms. |
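The "Delete or update data" scenario above relies on copy-on-write updates: only the data files that contain matching rows are rewritten, and a new snapshot swaps them in while untouched files are reused. The following is a minimal Python model of that idea, not the Iceberg API; the file names and `delete_rows` helper are illustrative:

```python
# A "table" is a snapshot: an ordered list of immutable data files.
files = {
    "f1": [{"id": 1, "city": "SH"}, {"id": 2, "city": "BJ"}],
    "f2": [{"id": 3, "city": "HZ"}, {"id": 4, "city": "BJ"}],
    "f3": [{"id": 5, "city": "SZ"}],
}
snapshot = ["f1", "f2", "f3"]

def delete_rows(files, snapshot, predicate):
    """Copy-on-write delete: rewrite only files that contain matching rows."""
    new_snapshot = []
    rewritten = 0
    for name in snapshot:
        rows = files[name]
        if any(predicate(r) for r in rows):
            # The file holds matching rows: write a new version without them.
            new_name = name + "_v2"
            files[new_name] = [r for r in rows if not predicate(r)]
            new_snapshot.append(new_name)
            rewritten += 1
        else:
            # Untouched files are reused as-is; no data is copied.
            new_snapshot.append(name)
    return new_snapshot, rewritten

new_snapshot, rewritten = delete_rows(files, snapshot, lambda r: r["city"] == "BJ")
print(new_snapshot, rewritten)  # ['f1_v2', 'f2_v2', 'f3'] 2
```

Because the old files and the old snapshot remain in place, readers that started before the change keep a consistent view, and historical version backtracking stays possible.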