You can use Spark SQL to read data from and write data to Hudi 0.8.0 in E-MapReduce (EMR). This significantly reduces the cost of using Hudi. This topic describes how to use Spark SQL to read data from and write data to Hudi.
Limits
You can use Spark SQL to read data from and write data to Hudi only in clusters of EMR V3.36.0 or a later minor version, or clusters of EMR V5.2.0 or a later minor version.
Methods to start Spark SQL
- Spark 2 or Spark 3 and Hudi of a version earlier than 0.11
spark-sql \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
- Spark 3 and Hudi 0.11 or later
spark-sql \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
Examples
- Create a table
create table h0 (
  id bigint,
  name string,
  price double,
  ts long
) using hudi
tblproperties (
  primaryKey = "id",
  preCombineField = "ts"
);
A partitioned table is sketched after these examples.
- Query table information
desc formatted h0;
The following output is returned:
_hoodie_commit_time     string
_hoodie_commit_seqno    string
_hoodie_record_key      string
_hoodie_partition_path  string
_hoodie_file_name       string
id                      bigint
name                    string
price                   double
ts                      bigint
Note _hoodie_commit_time, _hoodie_commit_seqno, _hoodie_record_key, _hoodie_partition_path, and _hoodie_file_name are auxiliary fields that are added by Hudi by default.
- Perform operations on the table
-- insert
insert into h0 values (1, 'a1', 10, 1000), (2, 'a2', 11, 1000);

-- update
update h0 set name = 'a1_new' where id = 1;

-- delete
delete from h0 where id = 1;

A MERGE INTO statement is sketched after these examples.
- Query data in the table
- Example 1
select id, name, price, ts from h0;
The following output is returned:
2  a2  11.0  1000
- Example 2
select * from h0;
The following output is returned:
20221130150621338  20221130150621338_0_1  id:2    40d6507e-0579-42ce-a10f-c5e07a3981e5-0_0-29-2007_2022113015062****.parquet  2  a2  11.0  1000
Note The table is a non-partitioned table. Therefore, the _hoodie_partition_path field is left empty. The first four field values in the preceding output are the values of the auxiliary fields that are added by Hudi by default. A query that filters on these auxiliary fields is sketched after these examples.
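For reference, the following is a minimal sketch of how a partitioned Hudi table could be created with Spark SQL. The table name h1, the partition column dt, and the type property are illustrative assumptions; verify the table properties that are supported by the Hudi version in your cluster.
create table h1 (
  id bigint,
  name string,
  price double,
  ts long,
  dt string
) using hudi
tblproperties (
  type = 'cow',              -- assumption: 'cow' (copy-on-write) or 'mor' (merge-on-read)
  primaryKey = 'id',
  preCombineField = 'ts'
)
partitioned by (dt);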
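The following MERGE INTO statement is a sketch of how an upsert from another data set could be expressed. MERGE INTO is supported by Hudi Spark SQL in open source Hudi 0.9.0 and later; check whether it is available for the Hudi version in your cluster. The inline source subquery and its values are illustrative assumptions.
merge into h0 as target
using (
  select 2 as id, 'a2_new' as name, 12.0 as price, 1001 as ts
) as source
on target.id = source.id
-- update matching records with the source values; insert records that do not match
when matched then update set *
when not matched then insert *;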
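The auxiliary fields can be referenced in queries in the same way as regular columns. The following sketch filters records by commit time; the commit time literal is an illustrative assumption.
-- return only records written by commits after the specified commit time
select _hoodie_commit_time, id, name, price, ts
from h0
where _hoodie_commit_time > '20221130000000000';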