Iceberg is an open table format for data lakes. You can use Iceberg to quickly build a data lake storage service on Hadoop Distributed File System (HDFS) or Alibaba Cloud Object Storage Service (OSS). You can then use a compute engine from the open source big data ecosystem, such as Apache Flink, Apache Spark, Apache Hive, or Presto, to analyze the data in your data lake.
Features
Apache Iceberg was originally designed to migrate Hive data to the cloud. Over multiple releases, it has become a standard table format for data lakes that are deployed on the cloud. For more information, visit the Apache Iceberg official website.
Apache Iceberg provides the following features:
Builds a low-cost lightweight data lake storage service based on HDFS or an object storage system.
Connects to mainstream open source compute engines for data ingestion and analysis.
Provides comprehensive ACID (atomicity, consistency, isolation, and durability) semantics.
Supports row-level data changes.
Supports historical version backtracking.
Supports efficient data filtering.
Supports schema changes.
Supports partition changes.
Supports hidden partitioning.
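Hidden partitioning means that Iceberg derives partition values from column values through a transform (such as `days` or `bucket`), so queries filter on the original column and Iceberg prunes partitions automatically, without users knowing how the table is partitioned. The following pure-Python sketch models the idea; `days_transform` and `prune` are illustrative helpers, not Iceberg APIs:

```python
from datetime import datetime, timezone

def days_transform(ts: datetime) -> int:
    """Iceberg-style `days` transform: number of days since the Unix epoch."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return (ts - epoch).days

# Each "data file" records the range of partition values it contains,
# similar to the metadata Iceberg keeps in its manifests.
data_files = [
    {"path": "f1.parquet",
     "min_day": days_transform(datetime(2021, 9, 1, tzinfo=timezone.utc)),
     "max_day": days_transform(datetime(2021, 9, 1, tzinfo=timezone.utc))},
    {"path": "f2.parquet",
     "min_day": days_transform(datetime(2021, 9, 2, tzinfo=timezone.utc)),
     "max_day": days_transform(datetime(2021, 9, 3, tzinfo=timezone.utc))},
]

def prune(files, ts: datetime):
    """Keep only the files whose day range can contain the queried timestamp.

    The caller filters on the timestamp column; the partition value
    stays hidden behind the transform.
    """
    day = days_transform(ts)
    return [f["path"] for f in files if f["min_day"] <= day <= f["max_day"]]

print(prune(data_files, datetime(2021, 9, 3, tzinfo=timezone.utc)))  # ['f2.parquet']
```

Because the transform is stored in table metadata, changing it later (partition evolution) does not require rewriting existing data files.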
The following table compares open source ClickHouse (real-time data warehouse), open source Hive (offline data warehouse), and Alibaba Cloud E-MapReduce (EMR) Iceberg (data lake) from the dimensions of system architecture, business value, and maintenance costs.
| Item | Subitem | Open source ClickHouse | Open source Hive | Alibaba Cloud EMR Iceberg |
| --- | --- | --- | --- | --- |
| System architecture | Architecture | Integrated computing and storage | Decoupled computing and storage | Decoupled computing and storage |
| | Multiple compute engines | Not supported | Supported | Supported |
| | Data storage in an object storage system | Not supported | Not fully supported | Supported |
| | Data storage in HDFS | Not supported | Supported | Supported |
| | Storage format openness | No | Yes | Yes |
| Business value | Timeliness | Accurate to the second | Accurate to the hour or day | Accurate to the minute |
| | Computing flexibility | Low | High | High |
| | Transactions | Not supported | Not fully supported | Supported |
| | Table-level semantic generality | Poor | Poor | Excellent |
| | Row-level data change | Not supported | Limited support | Supported |
| | Data quality | Excellent | Good | Good |
| Maintenance costs | Query performance | High | Very high | Very high |
| | Storage costs | High | Medium | Low |
| | Self-service | Not supported | Not supported | Supported |
| | Resource scalability | Medium | Medium | Excellent |
Comparison between Alibaba Cloud EMR Iceberg and Apache Iceberg
The following table compares Alibaba Cloud EMR Iceberg and Apache Iceberg from the dimensions of basic features, data changes, and compute engines.
The check mark (✓) indicates that the related item is supported, and the cross mark (x) indicates that the related item is not supported.
| Category | Item | Subitem | Apache Iceberg | EMR Iceberg |
| --- | --- | --- | --- | --- |
| Basic features | ACID | None | ✓ | ✓ |
| | Historical version backtracking | None | ✓ | ✓ |
| | Source and sink integration | Batch | ✓ | ✓ |
| | | Streaming | ✓ | ✓ |
| | Efficient data filtering | None | ✓ | ✓ |
| Data changes | Schema evolution | None | ✓ | ✓ |
| | Partition evolution | None | ✓ | ✓ |
| | Copy-on-write update | None | ✓ | ✓ |
| | Merge-on-read update | Read | ✓ | ✓ |
| | | Write | ✓ | ✓ |
| | | Compaction | x | x |
| Compute engines | Apache Spark | Read | ✓ | ✓ |
| | | Write | ✓ | ✓ |
| | Apache Hive | Read | ✓ | ✓ |
| | | Write | ✓ | ✓ |
| | Apache Flink | Read | ✓ | ✓ |
| | | Write | ✓ | ✓ |
| | PrestoDB or Trino | Read | ✓ | ✓ |
| | | Write | ✓ | ✓ |
| Programming languages | Java | None | ✓ | ✓ |
| | Python | None | ✓ | ✓ |
| Advanced features | Native connection to Alibaba Cloud OSS | None | x | ✓ |
| | Native connection to Alibaba Cloud Data Lake Formation (DLF) | None | x | ✓ |
| | Data access acceleration based on data caching in local disks | None | x | ✓ |
| | Automatic merging of small files | None | x | ✓ |
The information in this table reflects the status of Apache Iceberg and Alibaba Cloud EMR Iceberg as of the end of September 2021 and may change as both projects are updated.
Scenarios
Iceberg is one of the core components of a general-purpose data lake service. The following table describes the scenarios in which you can use Iceberg.
| Scenario | Description |
| --- | --- |
| Write and read data in real time | Upstream data is ingested into an Iceberg-based data lake in real time and can be queried immediately. For example, you can run a Flink or Spark streaming job to write log data to an Iceberg table in real time, and then use a compute engine such as Hive, Spark, Flink, or Presto to read the data. For more information, see Apache Iceberg connector, Run a Spark streaming job to write data to an Iceberg table, and Use Spark to read data. Iceberg supports ACID transactions, which isolate write operations from read operations and prevent readers from seeing dirty data. |
| Delete or update data | Most data warehouses do not support row-level deletion or updates. In most cases, you must run an offline job that reads all data from a source table, changes the data, and writes the changed data back to the table. With Iceberg, changes are applied to the affected data files instead of the entire table, so the scope of a change operation is much smaller. In an Iceberg-based data lake, you can change row-level data by running a statement that is similar to the `MERGE INTO` statement of a database. |
| Control data quality | You can use the schema of an Iceberg table to check data as it is being written, and delete or further process abnormal data. |
| Change the schema of a table | You can use the DDL statements supported by Spark SQL to change the schema of an Iceberg table. A schema change does not require exporting all historical data in the table based on the new schema, so the change completes quickly. Iceberg supports ACID transactions, which prevent schema changes from affecting in-flight read operations. This way, the data that you read stays consistent with the data that you write. |
| Real-time machine learning | In machine learning scenarios, preparing data, such as cleansing, converting, and characterizing it, can take a long time, and you often need to process both historical and real-time data. Iceberg simplifies these workflows by providing a complete and reliable real-time stream for cleansing, converting, and characterizing data, so you no longer need to maintain separate pipelines for historical and real-time data. Iceberg also provides a native SDK for Python to meet the requirements of developers who use machine learning algorithms. |
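The "Delete or update data" scenario above relies on copy-on-write updates: only the data files that contain matching rows are rewritten, and a new snapshot swaps them in while untouched files are reused. The following is a minimal Python model of that idea, not the Iceberg API; the file names and `delete_rows` helper are illustrative:

```python
# A "table" is a snapshot: an ordered list of immutable data files.
files = {
    "f1": [{"id": 1, "city": "SH"}, {"id": 2, "city": "BJ"}],
    "f2": [{"id": 3, "city": "HZ"}, {"id": 4, "city": "BJ"}],
    "f3": [{"id": 5, "city": "SZ"}],
}
snapshot = ["f1", "f2", "f3"]

def delete_rows(files, snapshot, predicate):
    """Copy-on-write delete: rewrite only files that contain matching rows."""
    new_snapshot = []
    rewritten = 0
    for name in snapshot:
        rows = files[name]
        if any(predicate(r) for r in rows):
            # The file holds matching rows: write a new version without them.
            new_name = name + "_v2"
            files[new_name] = [r for r in rows if not predicate(r)]
            new_snapshot.append(new_name)
            rewritten += 1
        else:
            # Untouched files are reused as-is; no data is copied.
            new_snapshot.append(name)
    return new_snapshot, rewritten

new_snapshot, rewritten = delete_rows(files, snapshot, lambda r: r["city"] == "BJ")
print(new_snapshot, rewritten)  # ['f1_v2', 'f2_v2', 'f3'] 2
```

Because the old files and the old snapshot remain in place, readers that started before the change keep a consistent view, and historical version backtracking stays possible.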