E-MapReduce: Integrate Paimon with Hive

Last Updated: Jul 17, 2024

E-MapReduce (EMR) allows you to query Paimon data in Hive. This topic describes how to run such queries.

Prerequisites

A DataLake cluster or a custom cluster that contains the Hive and Paimon services is created. For more information about how to create a cluster, see Create a cluster.

Limits

You can query Paimon data in Hive only in clusters of EMR V3.46.0 or a later minor version, or clusters of EMR V5.12.0 or a later minor version.

Procedure

  1. Query data from tables that are created in a Hive catalog or a Data Lake Formation (DLF) catalog.

    After you use a Hive catalog to synchronize metadata from other services to Hive Metastore, you can query the tables in that catalog from Hive. If you selected DLF Unified Metadata for Metadata when you created the EMR cluster, you can use a DLF catalog to synchronize metadata from other services to DLF and then query the tables in Hive.

    The following example shows how to use Spark to write data to a Hive catalog and query data of Paimon in Hive.

    1. Run the following command to start Spark SQL:

      spark-sql \
        --conf spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog \
        --conf spark.sql.catalog.paimon.metastore=hive \
        --conf spark.sql.catalog.paimon.uri=thrift://master-1-1:9083 \
        --conf spark.sql.catalog.paimon.warehouse=oss://<yourBucketName>/warehouse
      Note
      • spark.sql.catalog.paimon: defines a catalog named paimon.

      • spark.sql.catalog.paimon.metastore: specifies the metadata storage type used by the catalog. If you set this parameter to hive, metadata is synchronized to Hive Metastore.

      • spark.sql.catalog.paimon.uri: specifies the address and port number of Hive Metastore. If you set this parameter to thrift://master-1-1:9083, the Spark SQL client connects to the Hive Metastore that runs on the master-1-1 node and listens on port 9083 to obtain metadata.

      • spark.sql.catalog.paimon.warehouse: specifies the actual location of the data warehouse. Configure this parameter based on your business requirements.
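
      If you do not want to pass these settings on every invocation, the same keys can also be stored as Spark defaults, for example in spark-defaults.conf or in the corresponding Spark service configuration of your cluster. The following is a minimal sketch that reuses the placeholder values from the preceding command; adjust them to your environment.

      # Sketch of spark-defaults.conf entries for the Paimon catalog.
      spark.sql.catalog.paimon             org.apache.paimon.spark.SparkCatalog
      spark.sql.catalog.paimon.metastore   hive
      spark.sql.catalog.paimon.uri         thrift://master-1-1:9083
      spark.sql.catalog.paimon.warehouse   oss://<yourBucketName>/warehouse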

    2. Execute the following Spark SQL statements to create a Paimon table in the created catalog and write data to the table:

      -- Switch to the paimon catalog.
      USE paimon;
      
      -- Create a test database in the created catalog and use the database. 
      CREATE DATABASE test_db;
      USE test_db;
      
      -- Create a Paimon table. 
      CREATE TABLE test_tbl (
          uuid int,
          name string,
          price double
      ) TBLPROPERTIES (
          'primary-key' = 'uuid'
      );
      
      -- Write data to the Paimon table. 
      INSERT INTO test_tbl VALUES (1, 'apple', 3.5), (2, 'banana', 4.0), (3, 'cherry', 20.5);
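
      Because the table defines 'primary-key' = 'uuid', writing another row with an existing key replaces the previous value instead of adding a duplicate row. The following optional Spark SQL sketch reuses the table from the preceding statements to illustrate this upsert behavior; skip it if you only want to write the initial data.

      -- Optional: write a row whose primary key already exists. Paimon keeps the latest value.
      INSERT INTO test_tbl VALUES (1, 'apple', 4.5);
      
      -- Verify from Spark before querying in Hive: uuid 1 now shows price 4.5,
      -- and the table still contains three rows.
      SELECT * FROM test_tbl ORDER BY uuid;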
    3. Run the following command to start the Hive CLI:

      hive
    4. Execute the following Hive SQL statement to query the data that is written to the Paimon table:

      SELECT * FROM test_db.test_tbl;
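
      The query returns the rows that were written from Spark SQL, including the row updated in the optional upsert step if you ran it. Regular Hive SQL works against the Paimon table as well; the following is a minimal sketch that filters and aggregates the sample data, using only the columns defined above.

      -- Filter rows of the Paimon table.
      SELECT name, price FROM test_db.test_tbl WHERE price > 4;
      
      -- Aggregate over the Paimon table.
      SELECT COUNT(*), AVG(price) FROM test_db.test_tbl;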
  2. Create an external table and query data from the external table.

    Hive allows you to create an external table that maps to the Paimon table in a specific path and query data from the external table. Sample code:

    CREATE EXTERNAL TABLE test_ext_tbl
    STORED BY 'org.apache.paimon.hive.PaimonStorageHandler'
    LOCATION 'oss://<yourBucketName>/warehouse/test_db.db/test_tbl';
    
    SELECT * FROM test_ext_tbl;
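
    If you no longer need the mapping, you can drop the external table. Because it is an external table, dropping it removes only the Hive metadata; the Paimon files in the OSS path are not deleted. A minimal sketch:

    -- Drop only the Hive-side mapping. The Paimon data in OSS stays in place.
    DROP TABLE test_ext_tbl;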