
Data Lake Exploration – Delta Lake

This article explores Delta Lake and discusses two solutions (G-SCD and CDC) that address shortcomings of traditional data warehouses built on Hive tables.

An Introduction to Delta Lake

Delta Lake is an open-source storage framework developed by Databricks for building the lakehouse architecture. It supports query and computing engines such as Spark, Flink, Hive, PrestoDB, and Trino. As an open-format storage layer, it provides reliability, security, and high performance for the lakehouse architecture and unifies batch processing and stream processing.

Key Features of Delta Lake:

  1. Atomicity, Consistency, Isolation, and Durability (ACID) Transactions: Delta Lake supports concurrent reads and writes from multiple pipelines with different isolation levels.
  2. Data Version Management: Delta Lake uses snapshots to manage and audit versions of data and metadata and uses time travel to query historical versions or roll back to them.
  3. Open-Source File Format: Delta Lake stores data in the Parquet format to benefit from its high-performance compression and encoding.
  4. Integration of Batch Processing and Stream Processing: Delta Lake supports both batch and streaming reads and writes of data.
  5. Schema Evolution: Delta Lake allows users to merge or overwrite schemas to adapt to changes in the data structure over time.
  6. Rich DML: Delta Lake supports update, insert, delete, and merge operations to meet the needs of different scenarios, such as CDC (a short sketch follows this list).
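
The following minimal PySpark sketch illustrates several of these features: a batch write in the Parquet-based format, the MERGE DML, and a time-travel read. The table path, column names, and data are illustrative assumptions, and the sketch presumes a Spark session with the Delta Lake extension enabled.

    from pyspark.sql import SparkSession

    # Assumes spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension and
    # spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
    spark = SparkSession.builder.getOrCreate()

    path = "/tmp/delta/users"  # illustrative table path

    # Batch write in the Parquet-based Delta format
    spark.range(0, 100).withColumnRenamed("id", "user_id") \
        .write.format("delta").mode("overwrite").save(path)

    # Rich DML: merge a change set into the table
    spark.sql(f"""
        MERGE INTO delta.`{path}` AS t
        USING (SELECT 1 AS user_id) AS s
        ON t.user_id = s.user_id
        WHEN MATCHED THEN DELETE
    """)

    # Time travel: read the table as of the earlier version
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
    print(v0.count())  # 100 rows in version 0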

File Structure

A big difference between lake tables and common Hive tables is that the metadata of lake tables is self-managed and stored in the file system. The following figure shows the file structure of a Delta Lake table:

[Figure: File structure of a Delta Lake table]

The file structure of Delta Lake consists of two parts:

  • The _delta_log Directory: It stores all the metadata of a Delta Lake table.
    • Each operation on the table is called a commit, including data operations (insert/update/delete/merge) and metadata operations (such as adding new columns or modifying table configurations).
    • Every ten commits are automatically merged into a checkpoint file in the Parquet format by default. Checkpoints accelerate metadata parsing and allow historical metadata files to be cleaned up regularly.
  • Data Directories and Files: All files outside the _delta_log directory. Note that:
    • Delta Lake organizes a partitioned table in the same way as an ordinary Hive table: the partition fields and their corresponding values form part of the actual data path.
    • Not all visible data files are valid. Delta Lake organizes tables as snapshots, and the valid data files of the latest snapshot are tracked by the _delta_log metadata (see the layout sketch after this list).
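
For reference, the on-disk layout of a partitioned Delta Lake table typically looks like the following sketch (the table name, partition values, and file names are illustrative):

    my_table/
      _delta_log/
        00000000000000000000.json                  (one JSON commit per table operation)
        00000000000000000001.json
        ...
        00000000000000000010.checkpoint.parquet    (checkpoint covering commits up to version 10)
        _last_checkpoint                           (records the latest checkpoint version)
      dt=2022-01-01/
        part-00000-....snappy.parquet              (data file)
      dt=2022-01-02/
        part-00002-....snappy.parquet              (data file)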

Metadata Mechanism

Delta Lake uses snapshots to manage multiple versions of a table and supports time-travel queries of historical versions. Whether you query the latest snapshot or the snapshot of a specific historical version, you must first parse the metadata of that snapshot, which mainly includes:

  • The read/write protocol of the current Delta Lake (Protocol)
  • Field information and configuration information of the table (Metadata)
  • A list of valid data files. This is described by a set of added files (AddFile) and removed files (RemoveFile).

When a specific snapshot is loaded, Delta Lake looks for the latest checkpoint file whose version is not greater than the target version and then parses that checkpoint together with the commit logs generated after it. This keeps the number of files to parse small and speeds up the loading process.
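
As a concrete example, the commit files and the snapshot history can be inspected with standard Delta Lake APIs, and a historical snapshot can be loaded by version or by timestamp. The sketch below reuses the session and table path from the earlier sketch; the timestamp is illustrative.

    # Each commit file is a set of JSON actions (protocol, metaData, add, remove, commitInfo, ...)
    commit = spark.read.json(f"{path}/_delta_log/00000000000000000001.json")
    commit.printSchema()

    # List the committed versions of the table
    spark.sql(f"DESCRIBE HISTORY delta.`{path}`") \
        .select("version", "timestamp", "operation").show(truncate=False)

    # Time travel: load a historical snapshot by version or by timestamp
    spark.read.format("delta").option("versionAsOf", 1).load(path)
    spark.read.format("delta").option("timestampAsOf", "2022-01-01 00:00:00").load(path)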

EMR Delta Lake

The Alibaba Cloud EMR team has been following the Delta Lake community since 2019 and has shipped it in commercial EMR products. During this period, the team has continuously polished Delta Lake by iterating on features, optimizing performance, integrating it with the surrounding ecosystem, improving usability, and implementing it in real scenarios, making it better integrated into EMR products and more user-friendly.

The following table summarizes the main self-developed features of EMR Delta Lake compared with open-source Delta Lake (Community 1.1.0):

[Table: Main self-developed features of EMR Delta Lake compared with open-source Delta Lake 1.1.0]

Note:

Open-source Delta Lake 1.x only supports Spark 3 and is bound to specific Spark versions, so some new features and optimizations are not available on older Spark versions or on Spark 2. In contrast, EMR Delta Lake keeps its features synchronized between Delta Lake 0.6 (for Spark 2) and Delta Lake 1.x (for Spark 3).

In-Depth Integration with DLF

Data Lake Formation (DLF) is a fully managed service that helps users quickly build a cloud data lake and lakehouse. It provides unified metadata management, unified permission and security management, convenient data ingestion into the lake, and one-click data exploration. It seamlessly connects to multiple computing engines, breaks down data silos, and helps users gain insights into business value.

EMR Delta Lake is deeply integrated with DLF, so Delta Lake tables automatically synchronize their metadata to the DLF metastore after they are created and written. As a result, you do not need to create Hive external tables to associate with Delta Lake tables, as you would with the open-source version. After synchronization, you can query the tables directly through Hive, Presto, Impala, and even Alibaba Cloud MaxCompute and Hologres without additional operations.

Similarly, DLF provides mature data ingestion capabilities. You can synchronize data from MySQL, RDS, and Kafka into Delta Lake tables directly through product configuration.

On the DLF product side, Delta Lake is a first-class-citizen lake format. In its next iteration, DLF will also provide easy-to-use visualization and lake table management capabilities to help you maintain lake tables better.

G-SCD Solution

Slowly Changing Dimension (SCD) is considered one of the key ETL tasks for tracking dimensional changes. Star schemas are usually used to associate fact tables with dimension tables in data warehouse scenarios. If some dimensions in a dimension table are updated over time, how do you store and manage the current and historical dimension values? Are they simply ignored, directly overwritten, or handled in some other way, such as permanently preserving all historical dimension values? SCD defines several types according to the processing method, among which SCD Type 2 keeps all historical values by adding new records.

In an actual production environment, we may not need to keep every historical dimension value, but only the value of a dimension at a fixed time granularity, such as per day or per hour. The actual scenario can therefore be transformed into a slowly changing dimension based on a fixed granularity or business snapshot (Based-Granularity Slowly Changing Dimension, G-SCD).

There are several ways to implement this in a traditional data warehouse based on Hive tables. This article introduces the following two solutions:

  • Solution 1: Build an incremental data table for the T+1 point in time in streaming mode and merge it with the T partition of an offline table to generate the T+1 partition of the offline table, where T indicates the granularity or business snapshot. Because each partition stores a full copy of the data, this solution wastes a large amount of storage (see the sketch after this list).
  • Solution 2: Store an offline base table, store the incremental data of each business period separately, and merge the base table with the incremental tables at query time. This solution reduces query efficiency.
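
For context, Solution 1 is typically expressed as a full outer join between the previous full partition and the new increment, along the lines of the following Spark SQL sketch (the table names, columns, and dates are illustrative):

    # Solution 1 (sketch): generate the T+1 full partition of the offline dimension table
    # by merging the T partition with the T+1 incremental table. All names are illustrative.
    spark.sql("""
        INSERT OVERWRITE TABLE dim_user PARTITION (dt = '2022-01-02')
        SELECT COALESCE(i.user_id, b.user_id) AS user_id,
               COALESCE(i.city, b.city)       AS city
        FROM (SELECT * FROM dim_user WHERE dt = '2022-01-01') b
        FULL OUTER JOIN dim_user_incr_20220102 i
          ON b.user_id = i.user_id
    """)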

EMR implements the SCD Type 2 scenario by extending Delta Lake and adapting Spark SQL and Spark Streaming. The following figure shows the architecture:

[Figure: G-SCD on Delta Lake architecture]

The solution still consumes data from upstream Kafka. On the Spark Streaming side, each batch of data is split according to the configured business snapshot granularity, committed separately, and tagged with the value of the business snapshot. After receiving the data, Delta Lake records the mapping between the current Delta snapshot and the business snapshot. When the next business snapshot arrives, Delta Lake performs a savepoint on the previous snapshot to retain that version permanently. At query time, the specified business snapshot value identifies the corresponding Delta snapshot, and the query is then executed through time travel.
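
EMR G-SCD exposes this through its own SQL and Spark Streaming configuration, and the exact syntax is product-specific. The following PySpark sketch is only a rough approximation of the idea using open-source primitives; the table path, the per-snapshot version mapping, and the upsert logic are illustrative assumptions.

    # Streaming side (sketch): each micro-batch carries rows belonging to one business
    # snapshot and is merged into the dimension table as a separate commit.
    def upsert_batch(batch_df, batch_id):
        batch_df.createOrReplaceTempView("updates")
        spark.sql("""
            MERGE INTO delta.`/tmp/delta/dim_user` AS t
            USING updates AS s
            ON t.user_id = s.user_id
            WHEN MATCHED THEN UPDATE SET *
            WHEN NOT MATCHED THEN INSERT *
        """)

    # stream.writeStream.foreachBatch(upsert_batch).start()  # Kafka wiring omitted

    # Query side (sketch): resolve the business snapshot ('2022-01-01') to the Delta
    # version that was savepointed for it, then read that version with time travel.
    snapshot_version = 42  # looked up from an illustrative snapshot -> version mapping
    df = spark.read.format("delta") \
        .option("versionAsOf", snapshot_version).load("/tmp/delta/dim_user")

In EMR G-SCD, the savepoint and the snapshot-to-version mapping are handled by the product itself, and queries are written with the business snapshot value rather than an explicit version number.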

Advantages of G-SCD on Delta Lake:

  • Integration of Stream Processing and Batch Processing: You do not need to manage incremental data tables and base tables separately.
  • Storage Resources: Incremental dimension changes are managed with the data versioning capability of Delta Lake, so you do not need to retain full historical data for every time granularity.
  • Query Performance: Query efficiency is improved by the metadata checkpoints, data optimization (OPTIMIZE), Z-Order, and data skipping capabilities of Delta Lake.
  • Retaining the SQL Statements of the Original Implementation: You can still query the snapshot data of a business time granularity through a partition-like field, just as you would if the snapshots were implemented with partitions.

CDC Solution

Under the current data warehouse architecture, we tend to layer data into ODS, DWD, DWS, and ADS for easier management. If the raw data is stored in MySQL or RDS, we can consume its binlog to update the ODS tables incrementally. However, how do we update data incrementally from the ODS layer to the DWD layer, or from DWD to DWS? Since Hive tables cannot generate binlog-like data, incremental updates cannot be propagated to each downstream layer. The ability of lake tables to generate binlog-like data is therefore the key to building a real-time incremental data warehouse.

Alibaba Cloud EMR implements CDC capabilities based on Delta Lake, allowing a Delta table to be used as a streaming source. When CDC is enabled, ChangeData is generated and persisted for all data operations so that downstream streaming jobs can read it. Queries using Spark SQL syntax are also supported, as shown in the following figure:

[Figure: CDC example from user_city_table (ODS) to city_cnt_table (DWS)]

The Delta table user_city_table at the ODS layer receives source data, performs the merge operation, and persists the changed data. At the DWS layer, the city_cnt_table table, which is aggregated by city, reads the ChangeData of the user_city_table table and updates the cnt aggregation field.
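
The EMR CDC syntax is self-developed, so the exact statements may differ, but the pattern in the figure can be sketched with the open-source Delta Lake Change Data Feed, which exposes the same kind of ChangeData. The paths, columns, and property names below follow the open-source convention and are assumptions with respect to EMR.

    # Assumption: change data generation is enabled on the ODS table, e.g. with the
    # open-source table property delta.enableChangeDataFeed = true.

    # DWS side: stream the change data of user_city_table and keep city_cnt_table updated.
    changes = (spark.readStream
               .format("delta")
               .option("readChangeFeed", "true")
               .load("/tmp/delta/user_city_table"))

    def apply_changes(batch_df, batch_id):
        # _change_type is one of: insert, update_preimage, update_postimage, delete
        batch_df.createOrReplaceTempView("city_changes")
        spark.sql("""
            MERGE INTO delta.`/tmp/delta/city_cnt_table` AS t
            USING (
                SELECT city,
                       SUM(CASE WHEN _change_type IN ('insert', 'update_postimage') THEN 1
                                WHEN _change_type IN ('delete', 'update_preimage') THEN -1
                                ELSE 0 END) AS delta_cnt
                FROM city_changes
                GROUP BY city
            ) AS s
            ON t.city = s.city
            WHEN MATCHED THEN UPDATE SET t.cnt = t.cnt + s.delta_cnt
            WHEN NOT MATCHED THEN INSERT (city, cnt) VALUES (s.city, s.delta_cnt)
        """)

    # changes.writeStream.foreachBatch(apply_changes).start()  # wiring omitted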

Follow-Up Planning

Delta Lake, as the main lake format promoted by EMR, has already been trusted and chosen by many users and implemented in their production environments across various scenarios. Going forward, we will continue to strengthen our investment in Delta Lake, deepen its integration with DLF, enrich lake table O&M and management capabilities, and reduce the cost for users of moving data into the lake. In addition, we will continue to optimize read and write performance, improve integration with the Alibaba Cloud big data ecosystem, and promote the construction of an integrated lakehouse architecture for users.
