All Products
Search
Document Center

E-MapReduce:Integrate Hudi with Spark SQL

Last Updated:Jul 23, 2024

You can use Spark SQL to read data from or write data to E-MapReduce (EMR) Hudi 0.8.0. This way, usage costs of Hudi are significantly reduced. This topic describes how to use Spark SQL to read data from or write data to Hudi.

limits

Only clusters of EMR V3.36.0 or a later minor version and clusters of EMR V5.2.0 or a later minor version allow you to use Spark SQL to read data from or write data to Hudi.

Methods to start Spark SQL

  • Spark 2 or Spark 3 and Hudi of a version earilier than 0.11
    spark-sql \
    --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
    --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
  • Spark 3 and Hudi 0.11 or later
    spark-sql \
    --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
    --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
    --conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'

Examples

  • Create a table
    create table h0 (
      id bigint,
      name string,
      price double,
      ts long
    ) using hudi
    tblproperties (
      primaryKey="id",
      preCombineField="ts"
    );
  • Query a table
    desc formatted h0;
    The following output is returned:
    _hoodie_commit_time     string
    _hoodie_commit_seqno    string
    _hoodie_record_key      string
    _hoodie_partition_path    string
    _hoodie_file_name       string
    id                      bigint
    name                    string
    price                   double
    ts                      bigint
    Note By default, _hoodie_commit_time, _hoodie_commit_seqno, _hoodie_record_key, _hoodie_partition_path, and _hoodie_file_name are auxiliary fields added by Hudi.
  • Perform operations on the table
    - insert
    insert into h0 values (1, 'a1', 10, 1000), (2, 'a2', 11, 1000);
    
    -- update
    update h0 set name = 'a1_new' where id = 1;
    
    -- delete
    delete from h0 where id = 1;
  • Query data in the table
    • Example 1
      select id, name, price, ts from h0;
      The following output is returned:
      2    a2    11.0    1000
    • Example 2
      select * from h0;
      The following output is returned:
      4.820221130150621338    20221130150621338_0_1    id:2        40d6507e-0579-42ce-a10f-c5e07a3981e5-0_0-29-2007_2022113015062****.parquet    2    a2    11.0    1000
      Note The table is a non-partitioned table. Therefore, the _hoodie_partition_path field is left empty. The first four field values in the preceding output are the values of the default fields that are added by Hudi.