All Products
Search
Document Center

Hologres:Apache Paimon-based Hologres data lake solution in serverless mode

Last Updated:Oct 24, 2024

This topic describes the Apache Paimon-based Hologres data lake solution in serverless mode. This solution is suitable for flexible query acceleration of data in Apache Paimon-based data lakes in Object Storage Service (OSS). You can use this solution on demand without the need of reserved resources. You are charged based on the amount of resources that you use. This solution provides a flexible and scalable data lake architecture that allows you to better manage and utilize data. This helps improve the capabilities of data-driven decision-making and business innovation.

Background information

Apache Paimon is a unified lake storage that allows you to process data in streaming and batch modes. Apache Paimon supports data writing with high throughput and data queries with low latency. Common compute engines of Alibaba Cloud big data platforms, such as Realtime Compute for Apache Flink, Hologres, MaxCompute, and E-MapReduce (EMR) Spark, are comprehensively integrated with Apache Paimon. You can use Apache Paimon to deploy your data lake storage service on Alibaba Cloud OSS in an efficient manner, and connect to the preceding compute engines to perform data lake analytics. For more information, see Apache Paimon.

Hologres Shared Cluster is a serverless service that accelerates online queries of data in MaxCompute and OSS by using foreign tables. Hologres Shared Cluster uses a cloud-native architecture that separates storage from computing. This accelerates analytics on data in OSS. You can use resources on demand and are charged based on the amount of data that is scanned by SQL statements. For more information, see Overview of Hologres Shared Cluster.

Architecture

image.png

Prerequisites

  • A Hologres Shared Cluster instance is purchased. For more information, see Purchase a Hologres instance.

  • Data Lake Formation (DLF) is activated. For more information, see Getting Started.

  • Optional. A catalog in DLF is created. This prerequisite must be met if you want to use a custom DLF catalog. The catalog is used when you create a foreign server. For more information, see the "Create a catalog" section in Catalog.

  • OSS is activated for data lake storage. For more information, see Get started with OSS.

  • OSS-HDFS is enabled. For more information, see Enable OSS-HDFS.

Note

You can also use Hologres exclusive instances to read data from Apache Paimon-based data lakes or other data lakes in the same way as you use Hologres Shared Cluster instances. In this example, Hologres Shared Cluster instances are used.

Usage notes

  • Hologres V2.1.6 and later allow you to query data in Apache Paimon-based data lakes.

  • Hologres Shared Cluster instances only allow you to read data from OSS data lakes. If you want to import data from OSS to Hologres internal tables, you must use Hologres exclusive instances.

Procedure

  1. Create a cluster in the new data lake scenario in the EMR console.

    Log on to the EMR console. In the left-side navigation pane, click EMR on ECS. Then, create a cluster. For more information, see Step 1: Create a cluster in the Getting started with E-MapReduce topic. The following table describes the parameters.

    Parameter

    Description

    Business Scenario

    Select New Data Lake.

    Optional Services (Select One At Least)

    Select Spark, Hive, and Paimon. You can select more services based on your business requirements.

    Metadata

    Select DLF Unified Metadata.

    DLF Catalog

    • You can select a custom DLF catalog, such as paimon_catalog. For more information about how to create a custom DLF catalog, see Create a catalog in the Catalog topic.

    • You can also select the default DLF catalog. If you select the default DLF catalog, you do not need to configure the dlf_catalog parameter when you create a foreign server for Hologres Shared Cluster.

    Root Storage Directory of Cluster

    Select a bucket for which the OSS-HDFS service is enabled.

  2. Create a data source.

    Use EMR Hive to generate a 10-GB TPC-H test dataset in the text format. For more information, see Use EMR Spark to process data in the Use Hologres to implement a data lakehouse solution based on Delta Lake topic.

    Important

    In this example, you must replace the ./dbgen -vf -s 100 command with ./dbgen -vf -s 10.

  3. Use Spark to create an Apache Paimon table.

    1. Start Spark-SQL.

      spark-sql --conf spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog --conf spark.sql.catalog.paimon.metastore=dlf
    2. Create a database.

      -- Create a database.
      CREATE DATABASE paimon_db location 'oss://${oss-hdfs-bucket}/tpch_10G/paimon_tpch_10g/';

      ${oss-hdfs-bucket}: the name of the bucket for which OSS-HDFS is enabled.

    3. Create an Apache Paimon table and import data from the text data source that you create in the Create a data source step into the table.

      -- Switch to the created database.
      use paimon_db;
      
      -- Create a table and import data into the table.
      
      CREATE TABLE nation_paimon 
      TBLPROPERTIES (
          'primary-key' = 'N_NATIONKEY'
      )
      as select * from ${source}.nation_textfile;
      
      CREATE TABLE region_paimon
      TBLPROPERTIES (
          'primary-key' = 'R_REGIONKEY'
      )
      as select * from ${source}.region_textfile;
      
      CREATE TABLE supplier_paimon
      TBLPROPERTIES (
          'primary-key' = 'S_SUPPKEY'
      )
      as select * from ${source}.supplier_textfile;
      
      CREATE TABLE customer_paimon
      partitioned by (c_mktsegment)
      TBLPROPERTIES (
          'primary-key' = 'C_CUSTKEY'
      )
      as select * from ${source}.customer_textfile;
      
      CREATE TABLE part_paimon
      partitioned by (p_brand)
      TBLPROPERTIES (
          'primary-key' = 'P_PARTKEY'
      )
      as select * from ${source}.part_textfile;
      
      CREATE TABLE partsupp_paimon
      TBLPROPERTIES (
          'primary-key' = 'PS_PARTKEY,PS_SUPPKEY'
      )
      as select * from ${source}.partsupp_textfile;
      
      CREATE TABLE orders_paimon
      partitioned by (o_orderdate)
      TBLPROPERTIES (
          'primary-key' = 'O_ORDERKEY'
      )
      as select * from ${source}.orders_textfile;
      
      CREATE TABLE lineitem_paimon
      partitioned by (l_shipdate)
      TBLPROPERTIES (
          'primary-key' = 'L_ORDERKEY,L_LINENUMBER'
      )
      as select * from ${source}.lineitem_textfile;

      ${source}: the name of the database in which the *_textfile table in Hive resides.

  4. Create a foreign server in a Hologres Shared Cluster instance.

    Note

    Take note of the following items:

    • If you select a custom DLF catalog for DLF Catalog when you create the EMR cluster, set the dlf_catalog parameter to the name of the custom DLF catalog in the following sample code.

    • If you select the default DLF catalog for DLF Catalog when you create the EMR cluster, delete the dlf_catalog parameter from the following sample code.

    -- Create a foreign server.
    
    CREATE SERVER IF NOT EXISTS dlf_server FOREIGN data wrapper dlf_fdw options 
    (
        dlf_catalog 'paimon_catalog',
        dlf_endpoint 'dlf-share.cn-shanghai.aliyuncs.com',
        oss_endpoint 'cn-shanghai.oss-dls.aliyuncs.com'
    );
  5. Create a foreign table for the Apache Paimon table in the Hologres Shared Cluster instance.

    IMPORT FOREIGN SCHEMA paimon_db LIMIT TO
    (
      lineitem_paimon
    )
    FROM SERVER dlf_server INTO public options (if_table_exist 'update');
  6. Query data.

    In this example, the first TPC-H query statement is used.

    select
            l_returnflag,
            l_linestatus,
            sum(l_quantity) as sum_qty,
            sum(l_extendedprice) as sum_base_price,
            sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
            sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
            avg(l_quantity) as avg_qty,
            avg(l_extendedprice) as avg_price,
            avg(l_discount) as avg_disc,
            count(*) as count_order
    from
            lineitem
    where
            l_shipdate <= date '1998-12-01' - interval '120' day
    group by
            l_returnflag,
            l_linestatus
    order by
            l_returnflag,
            l_linestatus;
    Note

    For more information about the other 21 query statements, see 22 TPC-H query statements in the Test plan topic.