
Hologres:Apache Paimon-based Hologres data lake solution in serverless mode

Last Updated: Jul 11, 2024

This topic describes the Apache Paimon-based Hologres data lake solution in serverless mode. This solution is suitable for flexible query acceleration of data in Apache Paimon-based data lakes in Object Storage Service (OSS). You can use this solution on demand without reserving resources, and you are charged only for the resources that you actually use. The solution provides a flexible and scalable data lake architecture that helps you better manage and utilize data, which improves data-driven decision-making and business innovation.

Background information

Apache Paimon is a unified lake storage format that supports both streaming and batch data processing. It provides high-throughput data writing and low-latency data queries. Common compute engines of Alibaba Cloud big data platforms, such as Realtime Compute for Apache Flink, Hologres, MaxCompute, and E-MapReduce (EMR) Spark, are comprehensively integrated with Apache Paimon. You can use Apache Paimon to efficiently deploy your data lake storage service on Alibaba Cloud OSS and connect it to the preceding compute engines for data lake analytics. For more information, see Apache Paimon.

Hologres Shared Cluster is a serverless service that accelerates online queries of data in MaxCompute and OSS by using foreign tables. Hologres Shared Cluster uses a cloud-native architecture that separates storage from computing. This accelerates analytics on data in OSS. You can use resources on demand and are charged based on the amount of data that is scanned by SQL statements. For more information, see Overview of Hologres Shared Cluster.

Architecture

[Figure: architecture of the Apache Paimon-based Hologres data lake solution]

Prerequisites

  • A Hologres Shared Cluster instance is purchased. For more information, see Purchase a Hologres instance.

  • Data Lake Formation (DLF) is activated. For more information, see Getting Started.

  • Optional. A DLF catalog is created. This prerequisite applies only if you want to use a custom DLF catalog, which is used when you create a foreign server. For more information, see the "Create a catalog" section in Catalog.

  • OSS is activated for data lake storage. For more information, see Get started with OSS.

  • OSS-HDFS is enabled. For more information, see Enable OSS-HDFS and grant access permissions.

Note

You can also use Hologres exclusive instances to read data from Apache Paimon-based data lakes or other data lakes in the same way as you use Hologres Shared Cluster instances. In this example, Hologres Shared Cluster instances are used.

Usage notes

  • Hologres V2.1.6 and later allow you to query data in Apache Paimon-based data lakes.

  • Hologres Shared Cluster instances only allow you to read data from OSS data lakes. If you want to import data from OSS to Hologres internal tables, you must use Hologres exclusive instances.
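As noted above, importing OSS data into Hologres internal tables requires a Hologres exclusive instance. The following is a minimal sketch of that import path, assuming a foreign table named lineitem_paimon already exists; lineitem_local is a hypothetical internal table with only a few illustrative columns:

```sql
-- Sketch: run on a Hologres exclusive instance (not a Shared Cluster instance).
-- lineitem_local is a hypothetical internal table; adjust columns to your schema.
CREATE TABLE lineitem_local (
    l_orderkey   BIGINT,
    l_linenumber INT,
    l_quantity   DECIMAL(15, 2),
    l_shipdate   DATE
);

-- Copy data from the OSS foreign table into the internal table.
INSERT INTO lineitem_local
SELECT l_orderkey, l_linenumber, l_quantity, l_shipdate
FROM lineitem_paimon;
```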

Procedure

  1. Create a cluster in the new data lake scenario in the EMR console.

    Log on to the EMR console. In the left-side navigation pane, click EMR on ECS. Then, create a cluster. For more information, see Step 1: Create a cluster in the Getting started with E-MapReduce topic. Configure the following key parameters:

    • Business Scenario: Select New Data Lake.

    • Optional Services (Select One At Least): Select Spark, Hive, and Paimon. You can select more services based on your business requirements.

    • Metadata: Select DLF Unified Metadata.

    • DLF Catalog: Select a custom DLF catalog, such as paimon_catalog, or the default DLF catalog. For more information about how to create a custom DLF catalog, see the "Create a catalog" section in the Catalog topic. If you select the default DLF catalog, you do not need to configure the dlf_catalog parameter when you create a foreign server for Hologres Shared Cluster.

    • Root Storage Directory of Cluster: Select a bucket for which the OSS-HDFS service is enabled.

  2. Create a data source.

    Use EMR Hive to generate a 10-GB TPC-H test dataset in the text format. For more information, see Use EMR Spark to process data in the Use Hologres to implement a data lakehouse solution based on Delta Lake topic.

    Important

    In this example, replace the ./dbgen -vf -s 100 command with ./dbgen -vf -s 10 to generate a 10-GB dataset instead of a 100-GB dataset.

  3. Use Spark to create an Apache Paimon table.

    1. Start Spark-SQL.

      spark-sql --conf spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog --conf spark.sql.catalog.paimon.metastore=dlf
    2. Create a database.

      -- Create a database.
      CREATE DATABASE paimon_db LOCATION 'oss://${oss-hdfs-bucket}/tpch_10G/paimon_tpch_10g/';

      ${oss-hdfs-bucket}: the name of the bucket for which OSS-HDFS is enabled.

    3. Create an Apache Paimon table and import data from the text data source that you created in the "Create a data source" step into the table.

      -- Switch to the created database.
      USE paimon_db;

      -- Create tables and import data into the tables.

      CREATE TABLE nation_paimon
      TBLPROPERTIES (
          'primary-key' = 'N_NATIONKEY'
      )
      AS SELECT * FROM ${source}.nation_textfile;

      CREATE TABLE region_paimon
      TBLPROPERTIES (
          'primary-key' = 'R_REGIONKEY'
      )
      AS SELECT * FROM ${source}.region_textfile;

      CREATE TABLE supplier_paimon
      TBLPROPERTIES (
          'primary-key' = 'S_SUPPKEY'
      )
      AS SELECT * FROM ${source}.supplier_textfile;

      CREATE TABLE customer_paimon
      PARTITIONED BY (c_mktsegment)
      TBLPROPERTIES (
          'primary-key' = 'C_CUSTKEY'
      )
      AS SELECT * FROM ${source}.customer_textfile;

      CREATE TABLE part_paimon
      PARTITIONED BY (p_brand)
      TBLPROPERTIES (
          'primary-key' = 'P_PARTKEY'
      )
      AS SELECT * FROM ${source}.part_textfile;

      CREATE TABLE partsupp_paimon
      TBLPROPERTIES (
          'primary-key' = 'PS_PARTKEY,PS_SUPPKEY'
      )
      AS SELECT * FROM ${source}.partsupp_textfile;

      CREATE TABLE orders_paimon
      PARTITIONED BY (o_orderdate)
      TBLPROPERTIES (
          'primary-key' = 'O_ORDERKEY'
      )
      AS SELECT * FROM ${source}.orders_textfile;

      CREATE TABLE lineitem_paimon
      PARTITIONED BY (l_shipdate)
      TBLPROPERTIES (
          'primary-key' = 'L_ORDERKEY,L_LINENUMBER'
      )
      AS SELECT * FROM ${source}.lineitem_textfile;

      ${source}: the name of the database in which the *_textfile table in Hive resides.
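      After the CREATE TABLE ... AS SELECT statements finish, you can optionally run a quick sanity check from the same Spark SQL session. The expected magnitude in the comment assumes the standard TPC-H scale factor 10 dataset and is an approximation, not a value from this topic:

      ```sql
      -- Optional sanity check: for TPC-H scale factor 10,
      -- lineitem contains roughly 60 million rows.
      SELECT COUNT(*) FROM paimon_db.lineitem_paimon;
      ```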

  4. Create a foreign server in a Hologres Shared Cluster instance.

    Note

    Take note of the following items:

    • If you select a custom DLF catalog for DLF Catalog when you create the EMR cluster, set the dlf_catalog parameter to the name of the custom DLF catalog in the following sample code.

    • If you select the default DLF catalog for DLF Catalog when you create the EMR cluster, delete the dlf_catalog parameter from the following sample code.

    -- Create a foreign server.
    
    CREATE SERVER IF NOT EXISTS dlf_server FOREIGN DATA WRAPPER dlf_fdw OPTIONS
    (
        dlf_catalog 'paimon_catalog',
        dlf_endpoint 'dlf-share.cn-shanghai.aliyuncs.com',
        oss_endpoint 'cn-shanghai.oss-dls.aliyuncs.com'
    );
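
    Hologres is compatible with PostgreSQL, so you can optionally confirm that the server was created by querying the standard pg_foreign_server system catalog. This is a generic PostgreSQL check, not a step from the original procedure:

    ```sql
    -- Optional: verify that the foreign server exists.
    SELECT srvname, srvoptions
    FROM pg_foreign_server
    WHERE srvname = 'dlf_server';
    ```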
  5. Create a foreign table for the Apache Paimon table in the Hologres Shared Cluster instance.

    IMPORT FOREIGN SCHEMA paimon_db LIMIT TO
    (
      lineitem_paimon
    )
    FROM SERVER dlf_server INTO public OPTIONS (if_table_exist 'update');
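
    To optionally confirm that the import succeeded, you can query the PostgreSQL-compatible information_schema.foreign_tables view. This check relies on Hologres's PostgreSQL compatibility and is not a step from the original procedure:

    ```sql
    -- Optional: list imported foreign tables in the public schema.
    SELECT foreign_table_schema, foreign_table_name
    FROM information_schema.foreign_tables
    WHERE foreign_table_schema = 'public';
    ```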
  6. Query data.

    In this example, the first TPC-H query statement is used.

    select
            l_returnflag,
            l_linestatus,
            sum(l_quantity) as sum_qty,
            sum(l_extendedprice) as sum_base_price,
            sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
            sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
            avg(l_quantity) as avg_qty,
            avg(l_extendedprice) as avg_price,
            avg(l_discount) as avg_disc,
            count(*) as count_order
    from
            lineitem_paimon
    where
            l_shipdate <= date '1998-12-01' - interval '120' day
    group by
            l_returnflag,
            l_linestatus
    order by
            l_returnflag,
            l_linestatus;
    Note

    For more information about the other 21 query statements, see 22 TPC-H query statements in the Test plan topic.