You can use Spark SQL to read data from and write data to Hudi 0.8.0 in E-MapReduce (EMR). This significantly reduces the cost of using Hudi. This topic describes how to use Spark SQL to read data from and write data to Hudi.
Limits
You can use Spark SQL to read data from and write data to Hudi only in clusters of EMR V3.36.0 or a later minor version, or clusters of EMR V5.2.0 or a later minor version.
Methods to start Spark SQL
- Spark 2 or Spark 3 and Hudi of a version earlier than 0.11
spark-sql \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
- Spark 3 and Hudi 0.11 or later
spark-sql \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
Examples
- Create a table
create table h0 (
  id bigint,
  name string,
  price double,
  ts long
) using hudi
tblproperties (
  primaryKey = "id",
  preCombineField = "ts"
);
A partitioned table is sketched after these examples.
- Query table information
desc formatted h0;
The following output is returned:
_hoodie_commit_time     string
_hoodie_commit_seqno    string
_hoodie_record_key      string
_hoodie_partition_path  string
_hoodie_file_name       string
id                      bigint
name                    string
price                   double
ts                      bigint
Note _hoodie_commit_time, _hoodie_commit_seqno, _hoodie_record_key, _hoodie_partition_path, and _hoodie_file_name are auxiliary fields that are added by Hudi by default.
- Perform operations on the table
-- insert
insert into h0 values (1, 'a1', 10, 1000), (2, 'a2', 11, 1000);

-- update
update h0 set name = 'a1_new' where id = 1;

-- delete
delete from h0 where id = 1;

A MERGE INTO statement is sketched after these examples.
- Query data in the table
- Example 1
select id, name, price, ts from h0;
The following output is returned:
2  a2  11.0  1000
- Example 2
select * from h0;
The following output is returned:
20221130150621338  20221130150621338_0_1  id:2    40d6507e-0579-42ce-a10f-c5e07a3981e5-0_0-29-2007_2022113015062****.parquet  2  a2  11.0  1000
Note The table is a non-partitioned table. Therefore, the _hoodie_partition_path field is left empty. The first four field values in the preceding output are the values of the auxiliary fields that are added by Hudi by default. A query that filters on these auxiliary fields is sketched after these examples.
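For reference, the following is a minimal sketch of how a partitioned Hudi table could be created with Spark SQL. The table name h1, the partition column dt, and the type property are illustrative assumptions; verify the table properties that are supported by the Hudi version in your cluster.
create table h1 (
  id bigint,
  name string,
  price double,
  ts long,
  dt string
) using hudi
tblproperties (
  type = 'cow',              -- assumption: 'cow' (copy-on-write) or 'mor' (merge-on-read)
  primaryKey = 'id',
  preCombineField = 'ts'
)
partitioned by (dt);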
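The following MERGE INTO statement is a sketch of how an upsert from another data set could be expressed. MERGE INTO is supported by Hudi Spark SQL in open source Hudi 0.9.0 and later; check whether it is available for the Hudi version in your cluster. The inline source subquery and its values are illustrative assumptions.
merge into h0 as target
using (
  select 2 as id, 'a2_new' as name, 12.0 as price, 1001 as ts
) as source
on target.id = source.id
-- update matching records with the source values; insert records that do not match
when matched then update set *
when not matched then insert *;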
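The auxiliary fields can be referenced in queries in the same way as regular columns. The following sketch filters records by commit time; the commit time literal is an illustrative assumption.
-- return only records written by commits after the specified commit time
select _hoodie_commit_time, id, name, price, ts
from h0
where _hoodie_commit_time > '20221130000000000';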