Hudi MetaStore usage guide - E-MapReduce

This topic describes how to use Hudi MetaStore in E-MapReduce (EMR).

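Background information

Each time Hudi operates on data, it adds an instant to the timeline. A query must read the metadata of all instants to determine which partitions or files are valid at a given point in time. Partition listing and file listing involve a large number of I/O operations and are therefore time-consuming.

Lake formats differ from traditional table structures in that they carry their own metadata, such as the timeline and multi-version files. For this reason, E-MapReduce provides a cloud-hosted Hudi MetaStore that manages the instant metadata of Hudi tables and includes lifecycle management for partitions and files. Hudi MetaStore can currently be used to accelerate partition listing and file listing.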

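Prerequisites

A cluster of EMR-3.45.0 or later, or EMR-5.11.0 or later, has been created in the China (Hangzhou), China (Shanghai), or China (Beijing) region, and DLF Unified Metadata is selected as the metadata type.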

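Parameters

To use Hudi MetaStore, set the following parameters on the hudi.default.conf tab of the Hudi service configuration page.

Parameter	Description
hoodie.metastore.type	The implementation used for Hudi metadata. LOCAL: uses Hudi's native metadata. METASTORE: uses the metadata of the EMR Hudi MetaStore.
hoodie.metadata.enable	Valid values: false: disables Hudi's built-in metadata table. Note: the EMR Hudi MetaStore metadata table can be used only when this parameter is set to false. true: uses Hudi's built-in metadata table.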

Configure the parameters according to your scenario:

Do not use any metadata table

hoodie.metastore.type=LOCAL
hoodie.metadata.enable=false

Use Hudi's built-in metadata table

hoodie.metastore.type=LOCAL
hoodie.metadata.enable=true

Use the EMR Hudi MetaStore metadata table (default)

hoodie.metastore.type=METASTORE
hoodie.metadata.enable=false
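
The three combinations above can be summarized as a small lookup. The following Python sketch is for illustration only: the function name `describe_metadata_mode` and the scenario labels are hypothetical and are not part of Hudi or EMR. A combination that this topic does not list (for example, METASTORE together with true) is reported as not described rather than guessed at.

```python
# Scenario labels keyed by (hoodie.metastore.type, hoodie.metadata.enable),
# taken from the three combinations listed in this topic.
SCENARIOS = {
    ("LOCAL", "false"): "no metadata table",
    ("LOCAL", "true"): "Hudi built-in metadata table",
    ("METASTORE", "false"): "EMR Hudi MetaStore metadata table",
}


def describe_metadata_mode(conf):
    """Return the scenario label for a dict of Hudi configuration values.

    Hypothetical helper: it only looks up the combinations documented
    above and does not contact Hudi or EMR.
    """
    key = (
        conf.get("hoodie.metastore.type", "").strip().upper(),
        conf.get("hoodie.metadata.enable", "").strip().lower(),
    )
    return SCENARIOS.get(key, "combination not described in this topic")


print(describe_metadata_mode({
    "hoodie.metastore.type": "METASTORE",
    "hoodie.metadata.enable": "false",
}))
```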

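Example

This example uses spark-sql to show how to use the EMR Hudi MetaStore metadata table and enable the acceleration feature.

Log on to the cluster over SSH. For details, see Log on to a cluster.

Run the following command to start the spark-sql CLI (Spark 3 is used in this example).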

spark-sql --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
          --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

If the output contains the following prompt, you have entered the spark-sql CLI.

spark-sql>

Run the following statement to create a table.

create table h0(
  id bigint,
  name string,
  price double
) using hudi
tblproperties (
   primaryKey = 'id',
   preCombineField = 'id'
) location '/tmp/hudi_cases/h0';

Run the following statement to insert data.
```
insert into h0 select 1, 'a1', 10;
```
Run the following command to exit the spark-sql CLI.
```
exit;
```

Run the following command to view the hoodie.properties file in the .hoodie directory of the Hudi table.

hdfs dfs -cat /tmp/hudi_cases/h0/.hoodie/hoodie.properties

If the output contains hoodie.metastore.type=METASTORE and hoodie.metastore.table.id, Hudi MetaStore is being used successfully.

hoodie.metastore.catalog.id=
hoodie.table.precombine.field=id
hoodie.datasource.write.drop.partition.columns=false
hoodie.table.type=COPY_ON_WRITE
hoodie.archivelog.folder=archived
hoodie.timeline.layout.version=1
hoodie.table.version=5
hoodie.metastore.type=METASTORE
hoodie.table.recordkey.fields=id
hoodie.datasource.write.partitionpath.urlencode=false
hoodie.database.name=test_db
hoodie.table.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
hoodie.table.name=h0
hoodie.datasource.write.hive_style_partitioning=true
hoodie.metastore.table.id=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
hoodie.table.checksum=3349919362
hoodie.table.create.schema={"type"\:"record","name"\:"h0_record","namespace"\:"hoodie.h0","fields"\:[{"name"\:"_hoodie_commit_time","type"\:["string","null"]},{"name"\:"_hoodie_commit_seqno","type"\:["string","null"]},{"name"\:"_hoodie_record_key","type"\:["string","null"]},{"name"\:"_hoodie_partition_path","type"\:["string","null"]},{"name"\:"_hoodie_file_name","type"\:["string","null"]},{"name"\:"id","type"\:["long","null"]},{"name"\:"name","type"\:["string","null"]},{"name"\:"price","type"\:["double","null"]}]}
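
The check described above can also be scripted. The following Python sketch assumes that the content of hoodie.properties has already been captured as text, for example from the output of the hdfs dfs -cat command. The function names `parse_properties` and `uses_hudi_metastore` are hypothetical helpers, not part of Hudi or EMR, and the parser handles only simple key=value lines.

```python
def parse_properties(text):
    """Parse simple key=value lines into a dict, skipping blanks and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def uses_hudi_metastore(props):
    """Apply the two conditions stated in this topic.

    hoodie.metastore.type must be METASTORE and the key
    hoodie.metastore.table.id must be present.
    """
    return (
        props.get("hoodie.metastore.type") == "METASTORE"
        and "hoodie.metastore.table.id" in props
    )


# Shortened sample based on the output shown above; the table ID is a placeholder.
sample = """
hoodie.table.name=h0
hoodie.metastore.type=METASTORE
hoodie.metastore.table.id=xxxxxxxx
"""
print(uses_hudi_metastore(parse_properties(sample)))  # prints True
```

The check tests only that the hoodie.metastore.table.id key is present, which matches the condition stated in this topic. It does not validate the ID itself.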