Hudi MetaStore Usage Guide - E-MapReduce Open Source Big Data Platform

This topic describes how to use Hudi MetaStore in E-MapReduce (EMR).

Background information

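Every Hudi data operation adds a new instant to the table's timeline. A query must read the metadata of all instants to determine which partitions and files are valid at a given point in time. Of these steps, partition listing and file listing involve a large number of I/O operations and are time-consuming.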
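The cost described above can be illustrated with a toy model. The sketch below is not Hudi's actual data structure or API: the `Instant` class, the `files_as_of` function, and the file names are hypothetical and exist only to show why resolving the valid files at a point in time means replaying every instant up to that point.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Instant:
    """Hypothetical, simplified stand-in for one timeline instant."""
    timestamp: str                      # fixed-width timestamp, e.g. yyyyMMddHHmmss
    added: Set[str] = field(default_factory=set)    # files written by this instant
    removed: Set[str] = field(default_factory=set)  # files replaced by this instant


def files_as_of(timeline: List[Instant], as_of: str) -> Set[str]:
    """Replay instants in time order and return the files valid at `as_of`."""
    valid: Set[str] = set()
    for instant in sorted(timeline, key=lambda i: i.timestamp):
        if instant.timestamp > as_of:
            break  # later instants do not affect the requested point in time
        valid -= instant.removed
        valid |= instant.added
    return valid


timeline = [
    Instant("20230101000000", added={"p1/f1_v1"}),
    Instant("20230102000000", added={"p1/f1_v2"}, removed={"p1/f1_v1"}),
    Instant("20230103000000", added={"p2/f2_v1"}),
]

print(sorted(files_as_of(timeline, "20230102120000")))
```

In this model every lookup walks the whole timeline, which is the kind of repeated metadata read that a hosted metadata service is meant to avoid.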

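Lake formats differ from traditional table structures in that they carry their own metadata, such as timelines and multi-version files. For this reason, E-MapReduce provides a cloud-based Hudi MetaStore that hosts the instant metadata of Hudi tables and manages the lifecycle of partitions and files. Hudi MetaStore currently supports accelerated partition listing and file listing.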

Prerequisites

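A cluster of EMR-3.45.0 or later, or of EMR-5.11.0 or later, has been created in the China (Hangzhou), China (Shanghai), or China (Beijing) region, and DLF Unified Metadata is selected as the metadata type.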

Parameters

To use Hudi MetaStore, set the following parameters on the hudi.default.conf tab of the Hudi service configuration page.


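Parameter	Description
hoodie.metastore.type	The implementation used for Hudi metadata. Valid values: LOCAL (use the native Hudi metadata) and METASTORE (use the metadata of EMR Hudi MetaStore).
hoodie.metadata.enable	Specifies whether to use the Hudi built-in metadata table. Valid values: false (disable the built-in metadata table) and true (use the built-in metadata table). Note: this parameter must be set to false before the EMR Hudi MetaStore metadata table can be used.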

Configure the parameters according to your scenario:

Do not use any metadata table

```
hoodie.metastore.type=LOCAL
hoodie.metadata.enable=false
```

Use the Hudi built-in metadata table

```
hoodie.metastore.type=LOCAL
hoodie.metadata.enable=true
```

Use the EMR Hudi MetaStore metadata table (default)

```
hoodie.metastore.type=METASTORE
hoodie.metadata.enable=false
```
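The three combinations above can be checked programmatically before a configuration is applied. The following is a minimal sketch: the function name `describe_metadata_mode` and the scenario labels are hypothetical, and the fallback values used for missing keys are an assumption based on the combination marked as the default above.

```python
from typing import Dict

# The three documented combinations of
# (hoodie.metastore.type, hoodie.metadata.enable).
SCENARIOS = {
    ("LOCAL", "false"): "no metadata table",
    ("LOCAL", "true"): "Hudi built-in metadata table",
    ("METASTORE", "false"): "EMR Hudi MetaStore metadata table",
}


def describe_metadata_mode(conf: Dict[str, str]) -> str:
    """Return which metadata scenario a configuration mapping selects.

    Missing keys fall back to METASTORE / false (an assumption, taken from
    the combination documented as the default).
    """
    store_type = conf.get("hoodie.metastore.type", "METASTORE").strip().upper()
    enabled = conf.get("hoodie.metadata.enable", "false").strip().lower()
    try:
        return SCENARIOS[(store_type, enabled)]
    except KeyError:
        # For example METASTORE with true: hoodie.metadata.enable must be
        # false to use the EMR Hudi MetaStore metadata table.
        raise ValueError(
            f"undocumented combination: hoodie.metastore.type={store_type}, "
            f"hoodie.metadata.enable={enabled}"
        ) from None


print(describe_metadata_mode({
    "hoodie.metastore.type": "LOCAL",
    "hoodie.metadata.enable": "true",
}))
```

Rejecting combinations that are not listed, instead of guessing a behavior for them, keeps the check aligned with what is documented here.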

Example

The following example uses spark-sql to show how to use the EMR Hudi MetaStore metadata table and enable listing acceleration.

Log on to the cluster over SSH. For more information, see Log on to a cluster.

Run the following command to open the spark-sql CLI. Spark 3 is used in this example.

```
spark-sql --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
          --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
```

If the output contains the following prompt, the spark-sql CLI has started.

```
spark-sql>
```

Run the following statement to create a table.

```
create table h0(
  id bigint,
  name string,
  price double
) using hudi
tblproperties (
   primaryKey = 'id',
   preCombineField = 'id'
) location '/tmp/hudi_cases/h0';
```

Run the following statement to insert data.
```
insert into h0 select 1, 'a1', 10;
```
Run the following command to exit the spark-sql CLI.
```
exit;
```

Run the following command to view the hoodie.properties file in the .hoodie directory of the Hudi table.

```
hdfs dfs -cat /tmp/hudi_cases/h0/.hoodie/hoodie.properties
```

If the output contains hoodie.metastore.type=METASTORE and hoodie.metastore.table.id, the table is using Hudi MetaStore.

```
hoodie.metastore.catalog.id=
hoodie.table.precombine.field=id
hoodie.datasource.write.drop.partition.columns=false
hoodie.table.type=COPY_ON_WRITE
hoodie.archivelog.folder=archived
hoodie.timeline.layout.version=1
hoodie.table.version=5
hoodie.metastore.type=METASTORE
hoodie.table.recordkey.fields=id
hoodie.datasource.write.partitionpath.urlencode=false
hoodie.database.name=test_db
hoodie.table.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
hoodie.table.name=h0
hoodie.datasource.write.hive_style_partitioning=true
hoodie.metastore.table.id=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
hoodie.table.checksum=3349919362
hoodie.table.create.schema={"type"\:"record","name"\:"h0_record","namespace"\:"hoodie.h0","fields"\:[{"name"\:"_hoodie_commit_time","type"\:["string","null"]},{"name"\:"_hoodie_commit_seqno","type"\:["string","null"]},{"name"\:"_hoodie_record_key","type"\:["string","null"]},{"name"\:"_hoodie_partition_path","type"\:["string","null"]},{"name"\:"_hoodie_file_name","type"\:["string","null"]},{"name"\:"id","type"\:["long","null"]},{"name"\:"name","type"\:["string","null"]},{"name"\:"price","type"\:["double","null"]}]}
```
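The check on hoodie.properties can also be scripted. The following is a minimal sketch that parses the text of the file and tests the two conditions: that hoodie.metastore.type equals METASTORE and that hoodie.metastore.table.id is present. The function names `parse_properties` and `uses_hudi_metastore` are hypothetical, and the parser is deliberately simplified (it splits each line at the first `=` and does not implement full Java properties escaping).

```python
from typing import Dict


def parse_properties(text: str) -> Dict[str, str]:
    """Parse key=value lines, skipping blank lines and # or ! comments."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")  # split at the first '=' only
        if sep:
            props[key.strip()] = value.strip()
    return props


def uses_hudi_metastore(props: Dict[str, str]) -> bool:
    """True if the properties show the table is managed by Hudi MetaStore."""
    return (
        props.get("hoodie.metastore.type") == "METASTORE"
        and "hoodie.metastore.table.id" in props
    )


# A few lines copied from the sample output, used as test input.
SAMPLE = """\
hoodie.table.type=COPY_ON_WRITE
hoodie.metastore.type=METASTORE
hoodie.table.name=h0
hoodie.metastore.table.id=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
"""

print(uses_hudi_metastore(parse_properties(SAMPLE)))
```

In practice the text could come from the output of the `hdfs dfs -cat` command, for example by piping it into a script that reads standard input.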