
Hologres: Data lake acceleration

Last Updated: Dec 18, 2024

The real-time data lake solution allows you to use foreign tables to accelerate data reads and writes in Object Storage Service (OSS). This helps improve query efficiency and simplify data processing.

Background information

As cloud storage, and object storage in particular, matures, data lake solutions are gradually evolving toward cloud-native technologies. In the Alibaba Cloud lakehouse architecture, OSS serves as the unified storage layer of cloud data lakes and provides a secure, cost-effective, highly reliable, and scalable data lake solution.

The real-time data lake solution marks a significant step in the evolution of data lakes and focuses on the real-time and streaming performance of data in the lakehouse architecture. Hologres supports real-time data writes, real-time data updates, and real-time data analysis. Based on these engine capabilities, Hologres integrates with Data Lake Formation (DLF), Hive Metastore Service (HMS), OSS, and various ecosystem tools to provide a comprehensive real-time data lake solution. The solution uses foreign tables to accelerate reads and writes of various types of data in OSS without the need to migrate data: foreign tables map fields rather than store data. This reduces development and O&M costs, breaks down data silos, and enables business insights.

The following table describes the Alibaba Cloud services involved in the real-time data lake solution.

| Service | Description | Reference |
| --- | --- | --- |
| DLF | Alibaba Cloud DLF is a fully managed service that helps you build data lakes and data lakehouses in the cloud. DLF provides centralized metadata management, centralized permission and security management, and convenient data ingestion and exploration capabilities for data lakes in the cloud. | Overview |
| HMS | HMS is a core component of Apache Hive that serves as a metadata repository for Hive and Spark tables. The metadata includes the storage locations of table data and table schemas, such as table names, column names, data types, and partition information. HMS provides metadata services and supports Hive and Spark data queries. | Hive Metastore Server |
| OSS | DLF uses OSS as the unified storage of cloud data lakes. OSS is a secure, cost-effective, and highly reliable service that can store large amounts of data and all types of files. OSS provides 99.9999999999% data durability and has become the de facto standard for data lake storage. | What is OSS? |
| OSS-HDFS | OSS-HDFS (JindoFS) is a cloud-native data lake storage service. OSS-HDFS is seamlessly integrated with compute engines in the Hadoop ecosystem and delivers better performance than native OSS for offline extract, transform, and load (ETL) of big data based on Hive and Spark. OSS-HDFS is fully compatible with Hadoop Distributed File System (HDFS) APIs and supports Portable Operating System Interface (POSIX). You can use OSS-HDFS to manage data in data lake-based computing scenarios in the big data and AI fields. | What is OSS-HDFS? |

Architecture

The following figure shows the recommended data lake architecture for Hologres. The architecture covers the entire data lifecycle from collection, storage, and management to application. Hologres leverages powerful engine capabilities and flexible auto scaling policies to provide an end-to-end solution that integrates data lakes and warehouses.

(Figure: Hologres data lakehouse acceleration architecture)

Usage notes

  • In Hologres V1.1 and later, you can read data in the ORC, Parquet, CSV, and SequenceFile formats from OSS. In Hologres V1.3 and later, you can write data in the ORC, Parquet, CSV, or SequenceFile format to OSS, and read data from Apache Hudi tables or Delta Lake tables in OSS. For an end-to-end example of mapping and querying OSS data, see the sketch after this list.

    Note

    You can view the version of your Hologres instance on the instance details page in the Hologres console. If the version of your Hologres instance is earlier than V1.1, manually upgrade your Hologres instance in the Hologres console or join the Hologres DingTalk group to apply for an instance upgrade. For more information about how to manually upgrade a Hologres instance, see Instance upgrades. For more information about how to join the Hologres DingTalk group, see Obtain online support for Hologres.

  • In Hologres V1.3.25 and later, you can use the multi-catalog feature of DLF to isolate metadata across test environments, development environments, and instances of different departments. This helps ensure the security of your business. For more information about the multi-catalog feature, see Catalog.

  • In Hologres V1.3.26 and later, you can read data from and write data to OSS-HDFS, which further expands the service capabilities and boundaries of data lake acceleration. Hologres is seamlessly integrated with compute engines in the Hadoop ecosystem. This accelerates reads and writes of data stored in OSS-HDFS, greatly improves the real-time analysis efficiency of data in the Hadoop ecosystem, and meets the requirements of federated queries of data lakes and real-time data analysis in fields such as big data and AI.

  • In Hologres V2.1.0 and later, you can read data from Apache Paimon foreign tables. Apache Paimon is a unified lake storage platform that supports both streaming and batch data processing. Apache Paimon supports high-throughput data writes and low-latency data queries, which allows data to flow through data lakes in real time. You can use Apache Paimon to integrate real-time and offline processing of data in data lakes. For more information, see Apache Paimon.

  • In Hologres V2.2 and later, the new foreign table architecture is used. In this architecture, Hologres Query Engine (HQE) directly reads data from files in the ORC and Parquet formats and uses local SSD-based caching for acceleration, which improves performance by more than five times. You can use HMS to access data in OSS and OSS-HDFS. For more information, see Use HMS to access data in OSS data lakes (beta).

    Note

    If the version of your Hologres instance is V2.1 or earlier, contact Hologres technical support to upgrade your instance.

  • In Hologres V3.0 and later, the following features are added:

    • The external database feature is added to support catalog-level metadata mapping for data sources such as DLF and MaxCompute. This feature improves the metadata and data management capabilities of data lakes. For more information, see CREATE EXTERNAL DATABASE.

    • The external schema and external table capabilities are supported. You can create databases and tables in a specific DLF catalog to facilitate data write-back after aggregation. For more information, see CREATE EXTERNAL SCHEMA and CREATE EXTERNAL TABLE.

    • High-performance data writes to Apache Paimon append-only tables are supported to facilitate data forwarding in data lakes and data warehouses.

    • Queries on Apache Paimon tables that use deletion vectors are optimized. This improves query performance when a large amount of data is deleted but compaction is not performed in a timely manner.

    • Delta Lake readers are reconstructed to significantly improve read performance.

    • Data can be read from Iceberg-based data lakes. This helps expand the data lake ecosystem.

    • Data queries in E-MapReduce (EMR) clusters are accelerated after Hologres connects to HMS for metadata mapping. For more information, see Use HMS to access data in OSS data lakes (beta).

    • Security capabilities are enhanced. By default, the service-linked role is used to access DLF 2.0. You can also use a RAM role to access DLF 2.0.
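
The following statements show a minimal, end-to-end sketch of the workflow described above: mapping OSS data through DLF metadata and querying it in place. The dlf_fdw extension, the server options, and the if_table_exist option follow the patterns used in the Hologres documentation, but all region, endpoint, catalog, database, and table names below are placeholders that you must replace with your own values.

```sql
-- Install the DLF foreign data wrapper (run once per database).
CREATE EXTENSION IF NOT EXISTS dlf_fdw;

-- Create a foreign server that points to DLF metadata and OSS storage.
-- All region, endpoint, and catalog values are placeholders.
CREATE SERVER IF NOT EXISTS dlf_server
    FOREIGN DATA WRAPPER dlf_fdw
    OPTIONS (
        dlf_region 'cn-hangzhou',
        dlf_endpoint 'dlf-share.cn-hangzhou.aliyuncs.com',
        oss_endpoint 'oss-cn-hangzhou-internal.aliyuncs.com',
        dlf_catalog 'my_catalog'  -- multi-catalog isolation, V1.3.25 and later
    );

-- Map all tables in a DLF database to the public schema at a time.
IMPORT FOREIGN SCHEMA my_dlf_database
    FROM SERVER dlf_server
    INTO public
    OPTIONS (if_table_exist 'update');

-- Query a mapped foreign table. The data stays in OSS and is not migrated.
SELECT * FROM my_oss_table LIMIT 10;
```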

Scenarios

The following table describes the methods provided by Hologres to map external data sources.

| Mapping method | Description | Supported data sources | Minimum version | Scenario |
| --- | --- | --- | --- | --- |
| CREATE EXTERNAL DATABASE | Creates an external database on a Hologres instance. You can use an external database to load the metadata of an external data source into Hologres. This allows you to manage internal and external data in Hologres and facilitates centralized metadata management in the integrated lakehouse architecture. For more information, see CREATE EXTERNAL DATABASE. | DLF 1.0, DLF 2.0, and MaxCompute | V3.0 | Use this method if you want to map a database of a catalog in an external data source, together with all tables in the database, to Hologres. |
| IMPORT FOREIGN SCHEMA | Creates multiple foreign tables in a Hologres schema at a time to automatically map specific tables in an external data source. For more information, see IMPORT FOREIGN SCHEMA. | DLF 1.0, DLF 2.0, HMS, MaxCompute, and Hologres | V0.8 | Use this method if you want to map all tables in a database or schema of an external data source to a schema in Hologres. |
| CREATE FOREIGN TABLE | Manually creates a foreign table in Hologres to map a table, or specific fields in a table, of an external data source. For more information, see CREATE FOREIGN TABLE. | DLF 1.0, DLF 2.0, HMS, MaxCompute, and Hologres | V0.8 | Use this method if you want to map specific tables, or specific fields in a table, to Hologres. |
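
As a hedged illustration of how the three methods differ in granularity, the following sketch places them side by side. All server, database, schema, table, and column names are placeholders, and the exact options depend on your data source; see the linked statement references for the authoritative syntax.

```sql
-- 1. Catalog-level mapping (V3.0 and later). The options are
--    source-specific and omitted here; see CREATE EXTERNAL DATABASE.
-- CREATE EXTERNAL DATABASE my_ext_db WITH <source-specific options>;

-- 2. Database/schema-level mapping: create foreign tables for selected
--    tables of an external database in one statement.
IMPORT FOREIGN SCHEMA my_source_db LIMIT TO (orders, customers)
    FROM SERVER dlf_server        -- placeholder server from the earlier sketch
    INTO public
    OPTIONS (if_table_exist 'update');

-- 3. Table-level mapping: hand-write one foreign table, optionally
--    mapping only a subset of the source table's fields.
CREATE FOREIGN TABLE orders_ft (
    order_id bigint,
    amount   numeric
) SERVER dlf_server
  OPTIONS (schema_name 'my_source_db', table_name 'orders');
```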

Table formats and file formats

Table formats

| Table format | Supported version | Supported compression methods |
| --- | --- | --- |
| Hudi | Data reads supported in Hologres V1.3 and later | UNCOMPRESSED, GZIP, SNAPPY, BROTLI, LZ4, ZSTD, LZ4_RAW, None, and ZLIB |
| Delta Lake | Data reads supported in Hologres V1.3 and later | UNCOMPRESSED, GZIP, SNAPPY, BROTLI, LZ4, ZSTD, and LZ4_RAW |
| Apache Paimon | Data reads supported in Hologres V2.1 and later. Data reads from lake tables and data writes to Apache Paimon append-only tables based on DLF 2.0 supported in Hologres V3.0 and later | Parquet: UNCOMPRESSED, SNAPPY, GZIP, LZO, BROTLI, LZ4, and ZSTD. ORC: NONE, ZLIB, SNAPPY, LZO, and LZ4 |
| Iceberg | Data reads from Iceberg V1 and V2 tables based on DLF 1.0 and HMS supported in Hologres V3.0 | Parquet: UNCOMPRESSED, SNAPPY, GZIP, LZO, BROTLI, LZ4, and ZSTD. ORC: NONE, ZLIB, SNAPPY, LZO, and LZ4 |
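
Once a lake table in one of these formats is mapped, you query it with ordinary SQL. The following is a minimal sketch, assuming an Apache Paimon table that was already mapped through IMPORT FOREIGN SCHEMA as in the earlier example; the table, column, and partition names are placeholders.

```sql
-- Query a mapped Apache Paimon foreign table. The files stay in OSS.
SELECT user_id,
       COUNT(*) AS event_cnt
FROM   paimon_events            -- placeholder foreign table name
WHERE  ds = '2024-12-01'        -- placeholder partition column
GROUP  BY user_id
ORDER  BY event_cnt DESC
LIMIT  100;
```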

File formats

| File format | Supported version | Supported compression methods |
| --- | --- | --- |
| CSV | Data reads and writes supported in Hologres V1.3 and later | COMPRESSION_CODEC: BZip2Codec, DefaultCodec, GzipCodec, and SnappyCodec |
| Parquet | Data reads and writes supported in Hologres V1.3 and later | UNCOMPRESSED, GZIP, SNAPPY, BROTLI, LZ4, ZSTD, and LZ4_RAW |
| ORC | Data reads and writes supported in Hologres V1.3 and later | None, ZLIB, and SNAPPY |
| SequenceFile | Data reads and writes supported in Hologres V1.3 and later | COMPRESSION_CODEC: BZip2Codec, DefaultCodec, GzipCodec, and SnappyCodec. COMPRESSION_TYPE: NONE, RECORD, and BLOCK |
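
Because these file formats support writes in Hologres V1.3 and later, you can write query results back to OSS with a plain INSERT into a mapped foreign table. The following is a minimal sketch, assuming a Parquet-backed foreign table named oss_orders_parquet that was mapped through one of the methods above; all table and column names are placeholders.

```sql
-- Write aggregated results back to OSS through a mapped foreign table.
-- The files are written in the target table's format (Parquet here).
INSERT INTO oss_orders_parquet (order_date, total_amount)
SELECT order_date,
       SUM(amount)
FROM   orders                   -- placeholder internal Hologres table
GROUP  BY order_date;
```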

Data type mappings

For more information about the mappings of data types between DLF and Hologres, see Data types.