This topic describes how to use Hudi Metastore in E-MapReduce (EMR).
Background information
Hudi adds an instant each time it performs an operation on data. When you query data, the metadata of each instant is read to obtain the valid partitions or files related to the instant. Listing these partitions and files causes heavy I/O workloads, so the read process can take an extended period of time.
A data lake has unique metadata, such as instants and multiple versions of files, and its schema differs from the schema of a traditional table. Therefore, the EMR team introduced Hudi Metastore in the cloud to host the instant metadata of Hudi tables and designed a lifecycle management system for partitions and files. Hudi Metastore accelerates the listing of partitions and files.
Prerequisites
A cluster of EMR V3.45.0 or a later minor version, or a cluster of EMR V5.11.0 or a later minor version is created in the China (Hangzhou), China (Shanghai), or China (Beijing) region, and DLF Unified Metadata is selected for Metadata.
Parameters
To use Hudi Metastore, perform the following operations: On the Configure tab of the Hudi service page, click the hudi.default.conf tab. Then, configure the parameters that are described in the following table.
Parameter | Description |
---|---|
hoodie.metastore.type | The implementation mode of Hudi metadata. Valid values: LOCAL (do not use Hudi Metastore in EMR) and METASTORE (use Hudi Metastore in EMR). Default value: METASTORE. |
hoodie.metadata.enable | Specifies whether to use the native metadata tables in Hudi. Valid values: true (use the native metadata tables) and false (do not use the native metadata tables). |
You can combine the two parameters in the following ways:
- Do not use metadata tables:
  hoodie.metastore.type=LOCAL
  hoodie.metadata.enable=false
- Use the native metadata tables in Hudi:
  hoodie.metastore.type=LOCAL
  hoodie.metadata.enable=true
- Use the metadata tables of Hudi Metastore in EMR (default):
  hoodie.metastore.type=METASTORE
  hoodie.metadata.enable=false
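The three combinations above can be sketched as a small helper that emits the key-value lines to place in hudi.default.conf. The mode names and the helper itself are illustrative assumptions for this sketch, not part of EMR or Hudi:

```python
# Sketch: map the three documented modes to the hudi.default.conf
# key-value pairs listed above. The mode names ("local", "native",
# "metastore") and this helper are illustrative, not EMR APIs.
MODES = {
    "local":     {"hoodie.metastore.type": "LOCAL",     "hoodie.metadata.enable": "false"},
    "native":    {"hoodie.metastore.type": "LOCAL",     "hoodie.metadata.enable": "true"},
    "metastore": {"hoodie.metastore.type": "METASTORE", "hoodie.metadata.enable": "false"},
}

def hudi_conf_lines(mode: str) -> list:
    """Return the configuration lines for the chosen mode."""
    return ["{}={}".format(key, value) for key, value in MODES[mode].items()]

# The default behavior in EMR corresponds to the "metastore" mode.
print("\n".join(hudi_conf_lines("metastore")))
```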
Example
The following example describes how to use the metadata tables of Hudi Metastore in EMR and enable the acceleration feature of Hudi Metastore by using the Spark SQL CLI.
- Log on to your EMR cluster in SSH mode. For more information, see Log on to a cluster.
- Run the following command to open the Spark SQL CLI. In this example, Spark 3 is used.
spark-sql --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
          --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
If the output contains the following information, the Spark SQL CLI is opened:
spark-sql>
- Run the following command to create a Hudi table:
create table h0 (
  id bigint,
  name string,
  price double
) using hudi
tblproperties (
  primaryKey = 'id',
  preCombineField = 'id'
)
location '/tmp/hudi_cases/h0';
- Run the following command to insert data into the table:
insert into h0 select 1, 'a1', 10;
- Run the following command to exit the Spark SQL CLI:
exit;
- Run the following command to view the hoodie.properties file in the .hoodie directory of the Hudi table:
hdfs dfs -cat /tmp/hudi_cases/h0/.hoodie/hoodie.properties
If the output contains hoodie.metastore.type=METASTORE and hoodie.metastore.table.id, Hudi Metastore is used.
hoodie.metastore.catalog.id=
hoodie.table.precombine.field=id
hoodie.datasource.write.drop.partition.columns=false
hoodie.table.type=COPY_ON_WRITE
hoodie.archivelog.folder=archived
hoodie.timeline.layout.version=1
hoodie.table.version=5
hoodie.metastore.type=METASTORE
hoodie.table.recordkey.fields=id
hoodie.datasource.write.partitionpath.urlencode=false
hoodie.database.name=test_db
hoodie.table.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
hoodie.table.name=h0
hoodie.datasource.write.hive_style_partitioning=true
hoodie.metastore.table.id=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
hoodie.table.checksum=3349919362
hoodie.table.create.schema={"type"\:"record","name"\:"h0_record","namespace"\:"hoodie.h0","fields"\:[{"name"\:"_hoodie_commit_time","type"\:["string","null"]},{"name"\:"_hoodie_commit_seqno","type"\:["string","null"]},{"name"\:"_hoodie_record_key","type"\:["string","null"]},{"name"\:"_hoodie_partition_path","type"\:["string","null"]},{"name"\:"_hoodie_file_name","type"\:["string","null"]},{"name"\:"id","type"\:["long","null"]},{"name"\:"name","type"\:["string","null"]},{"name"\:"price","type"\:["double","null"]}]}
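The check in the last step can also be scripted. The following is a minimal sketch that inspects the text of a hoodie.properties dump; the helper name and logic are assumptions for illustration, not an EMR or Hudi tool:

```python
def uses_hudi_metastore(properties_text):
    """Return True if the hoodie.properties content indicates that Hudi
    Metastore is used: hoodie.metastore.type is METASTORE and
    hoodie.metastore.table.id is present with a non-empty value."""
    props = {}
    for line in properties_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments; split on the first "=".
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            props[key] = value
    return (props.get("hoodie.metastore.type") == "METASTORE"
            and bool(props.get("hoodie.metastore.table.id")))

# Example input, abridged from the hoodie.properties output above.
sample = """hoodie.table.name=h0
hoodie.metastore.type=METASTORE
hoodie.metastore.table.id=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"""
print(uses_hudi_metastore(sample))  # True
```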