E-MapReduce: Configuration of DLF metadata

Last Updated: Jul 03, 2024

This topic describes the parameters that you must configure to use Data Lake Formation (DLF) metadata with Iceberg tables.

The following compute engines are supported:

Spark

Alibaba Cloud Object Storage Service (OSS) is used as the file system. The default name of the catalog and the parameters that you must configure vary based on the version of your cluster.

  • EMR V3.40 or a later minor version, and EMR V5.6.0 or later

    Note: The default name of the catalog is iceberg.

    | Parameter | Description | Remarks |
    | --- | --- | --- |
    | spark.sql.extensions | The SQL extension module of Spark. | Set the value to org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions. Note: This parameter was introduced in Apache Iceberg 0.11.0. Only Apache Spark 3.x supports this parameter. |
    | spark.sql.catalog.<catalog-name> | Registers a catalog named <catalog-name>. | Set the value to org.apache.iceberg.spark.SparkCatalog. |
    | spark.sql.catalog.<catalog-name>.catalog-impl | The class name of the catalog implementation. | Set the value to org.apache.iceberg.aliyun.dlf.hive.DlfCatalog. |

  • EMR V3.39.X and EMR V5.5.X

    Note: The default name of the catalog is dlf.

    | Parameter | Description | Remarks |
    | --- | --- | --- |
    | spark.sql.extensions | The SQL extension module of Spark. | Set the value to org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions. Note: This parameter was introduced in Apache Iceberg 0.11.0. Only Apache Spark 3.x supports this parameter. |
    | spark.sql.catalog.<catalog-name> | Registers a catalog named <catalog-name>. | Set the value to org.apache.iceberg.spark.SparkCatalog. |
    | spark.sql.catalog.<catalog-name>.catalog-impl | The class name of the catalog implementation. | Set the value to org.apache.iceberg.aliyun.dlf.hive.DlfCatalog. |

  • EMR V3.38.X, EMR V5.3.X, and EMR V5.4.X

    Note: The default name of the catalog is dlf_catalog.

    | Parameter | Description | Remarks |
    | --- | --- | --- |
    | spark.sql.extensions | The SQL extension module of Spark. | Set the value to org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions. Note: This parameter was introduced in Apache Iceberg 0.11.0. Only Apache Spark 3.x supports this parameter. |
    | spark.sql.catalog.<catalog-name> | Registers a catalog named <catalog-name>. | Set the value to org.apache.iceberg.spark.SparkCatalog. |
    | spark.sql.catalog.<catalog-name>.catalog-impl | The class name of the catalog implementation. | Set the value to org.apache.iceberg.aliyun.dlf.DlfCatalog. |
    | spark.sql.catalog.<catalog-name>.io-impl | The class name of the FileIO implementation that the catalog uses for I/O operations. | Set the value to org.apache.iceberg.hadoop.HadoopFileIO. |
    | spark.sql.catalog.<catalog-name>.oss.endpoint | The endpoint of your OSS bucket. | For more information, see Regions and endpoints. We recommend that you set this parameter to the virtual private cloud (VPC) endpoint of the OSS bucket. For example, if you select the China (Hangzhou) region, set this parameter to oss-cn-hangzhou-internal.aliyuncs.com. Note: To access OSS across VPCs, set this parameter to the public endpoint of the OSS bucket. |
    | spark.sql.catalog.<catalog-name>.warehouse | The OSS path in which table data is stored. | None. |
    | spark.sql.catalog.<catalog-name>.access.key.id | The AccessKey ID of your Alibaba Cloud account. | For more information about how to obtain the AccessKey ID of an Alibaba Cloud account, see Obtain an AccessKey pair. |
    | spark.sql.catalog.<catalog-name>.access.key.secret | The AccessKey secret of your Alibaba Cloud account. | For more information about how to obtain the AccessKey secret of an Alibaba Cloud account, see Obtain an AccessKey pair. |
    | spark.sql.catalog.<catalog-name>.dlf.catalog-id | The ID of your Alibaba Cloud account. | To obtain the ID of your Alibaba Cloud account, go to the Security Settings page. |
    | spark.sql.catalog.<catalog-name>.dlf.endpoint | The endpoint of DLF. | We recommend that you set this parameter to the VPC endpoint of DLF. For example, if you select the China (Hangzhou) region, set this parameter to dlf-vpc.cn-hangzhou.aliyuncs.com. Note: You can also set this parameter to the public endpoint of DLF. For example, if you select the China (Hangzhou) region, set this parameter to dlf.cn-hangzhou.aliyuncs.com. |
    | spark.sql.catalog.<catalog-name>.dlf.region-id | The ID of the region in which DLF is activated. | Make sure that the region that you specify in this parameter matches the endpoint that you specify in the spark.sql.catalog.<catalog-name>.dlf.endpoint parameter. |
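The Spark settings listed above could be assembled programmatically and passed to spark-sql or spark-submit as --conf options. The following Python sketch is illustrative only: the function names (spark_conf_current, spark_conf_legacy, to_conf_args) and the placeholder values are hypothetical, and the region check is a simple heuristic that assumes the region ID appears in the DLF endpoint, as it does in the China (Hangzhou) examples.

```python
EXTENSIONS = "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
SPARK_CATALOG = "org.apache.iceberg.spark.SparkCatalog"


def spark_conf_current(catalog="iceberg"):
    """Settings for EMR V3.39.X / V5.5.X and later (default catalog: dlf or iceberg)."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions": EXTENSIONS,
        prefix: SPARK_CATALOG,
        f"{prefix}.catalog-impl": "org.apache.iceberg.aliyun.dlf.hive.DlfCatalog",
    }


def spark_conf_legacy(warehouse, oss_endpoint, dlf_endpoint, region_id,
                      account_id, access_key_id, access_key_secret,
                      catalog="dlf_catalog"):
    """Settings for EMR V3.38.X, V5.3.X, and V5.4.X."""
    # Heuristic (an assumption based on the example endpoints in the table):
    # the region ID is expected to appear inside the DLF endpoint.
    if region_id not in dlf_endpoint:
        raise ValueError(
            f"region-id {region_id!r} does not match endpoint {dlf_endpoint!r}")
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions": EXTENSIONS,
        prefix: SPARK_CATALOG,
        f"{prefix}.catalog-impl": "org.apache.iceberg.aliyun.dlf.DlfCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.hadoop.HadoopFileIO",
        f"{prefix}.oss.endpoint": oss_endpoint,
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.access.key.id": access_key_id,
        f"{prefix}.access.key.secret": access_key_secret,
        f"{prefix}.dlf.catalog-id": account_id,
        f"{prefix}.dlf.endpoint": dlf_endpoint,
        f"{prefix}.dlf.region-id": region_id,
    }


def to_conf_args(conf):
    """Flatten a settings dict into repeated `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Placeholder values; replace them with the values for your own account.
    legacy = spark_conf_legacy(
        warehouse="oss://<bucket>/<path>",
        oss_endpoint="oss-cn-hangzhou-internal.aliyuncs.com",
        dlf_endpoint="dlf-vpc.cn-hangzhou.aliyuncs.com",
        region_id="cn-hangzhou",
        account_id="<account-id>",
        access_key_id="<access-key-id>",
        access_key_secret="<access-key-secret>",
    )
    print(" ".join(["spark-sql"] + to_conf_args(legacy)))
```

Keeping the AccessKey pair as function arguments rather than literals makes it easier to load the values from a secret store or environment variables instead of committing them to a script.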

Hive

You can configure the parameters described in the following tables based on the version of your cluster.

  • EMR V3.39.0 or a later minor version, and EMR V5.5.0 or later

    Note: The default name of the catalog is dlf.

    | Parameter | Description | Remarks |
    | --- | --- | --- |
    | iceberg.catalog.<catalog-name>.catalog-impl | The class name of the catalog implementation. | Set the value to org.apache.iceberg.aliyun.dlf.hive.DlfCatalog. |

  • EMR V3.38.X, EMR V5.3.X, and EMR V5.4.X

    Note: The default name of the catalog is dlf_catalog.

    | Parameter | Description | Remarks |
    | --- | --- | --- |
    | iceberg.catalog | The name of the catalog. | Set the value to a custom name. |
    | iceberg.catalog.<catalog-name>.type | The type of the catalog. | Set the value to custom. |
    | iceberg.catalog.<catalog-name>.catalog-impl | The class name of the catalog implementation. | Set the value to org.apache.iceberg.aliyun.dlf.DlfCatalog. |
    | iceberg.catalog.<catalog-name>.io-impl | The class name of the FileIO implementation that the catalog uses for I/O operations. | Set the value to org.apache.iceberg.hadoop.HadoopFileIO. |
    | iceberg.catalog.<catalog-name>.warehouse | The warehouse path in which table data is stored. | Table data can be stored in Hadoop Distributed File System (HDFS) or OSS. |
    | iceberg.catalog.<catalog-name>.access.key.id | The AccessKey ID of your Alibaba Cloud account. | For more information about how to obtain the AccessKey ID of an Alibaba Cloud account, see Obtain an AccessKey pair. |
    | iceberg.catalog.<catalog-name>.access.key.secret | The AccessKey secret of your Alibaba Cloud account. | For more information about how to obtain the AccessKey secret of an Alibaba Cloud account, see Obtain an AccessKey pair. |
    | iceberg.catalog.<catalog-name>.dlf.catalog-id | The ID of your Alibaba Cloud account. | To obtain the ID of your Alibaba Cloud account, go to the Security Settings page. |
    | iceberg.catalog.<catalog-name>.dlf.endpoint | The endpoint of DLF. | We recommend that you set this parameter to the VPC endpoint of DLF. For example, if you select the China (Hangzhou) region, set this parameter to dlf-vpc.cn-hangzhou.aliyuncs.com. Note: You can also set this parameter to the public endpoint of DLF. For example, if you select the China (Hangzhou) region, set this parameter to dlf.cn-hangzhou.aliyuncs.com. |
    | iceberg.catalog.<catalog-name>.dlf.region-id | The ID of the region in which DLF is activated. | Make sure that the region that you specify in this parameter matches the endpoint that you specify in the iceberg.catalog.<catalog-name>.dlf.endpoint parameter. |
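The Hive properties listed above could be rendered as session-level SET statements. The following Python sketch assumes that the properties are applied per Hive session; on an EMR cluster they may instead be managed in the cluster configuration. The function names (hive_props_current, hive_props_legacy, to_set_statements) and the placeholder values are hypothetical.

```python
def hive_props_current(catalog="dlf"):
    """Property for EMR V3.39.0 / V5.5.0 and later (default catalog: dlf)."""
    return {
        f"iceberg.catalog.{catalog}.catalog-impl":
            "org.apache.iceberg.aliyun.dlf.hive.DlfCatalog",
    }


def hive_props_legacy(warehouse, dlf_endpoint, region_id, account_id,
                      access_key_id, access_key_secret, catalog="dlf_catalog"):
    """Properties for EMR V3.38.X, V5.3.X, and V5.4.X."""
    # Heuristic (an assumption based on the example endpoints in the table):
    # the region ID is expected to appear inside the DLF endpoint.
    if region_id not in dlf_endpoint:
        raise ValueError(
            f"region-id {region_id!r} does not match endpoint {dlf_endpoint!r}")
    prefix = f"iceberg.catalog.{catalog}"
    return {
        "iceberg.catalog": catalog,
        f"{prefix}.type": "custom",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aliyun.dlf.DlfCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.hadoop.HadoopFileIO",
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.access.key.id": access_key_id,
        f"{prefix}.access.key.secret": access_key_secret,
        f"{prefix}.dlf.catalog-id": account_id,
        f"{prefix}.dlf.endpoint": dlf_endpoint,
        f"{prefix}.dlf.region-id": region_id,
    }


def to_set_statements(props):
    """Render each property as a Hive `SET key=value;` statement."""
    return [f"SET {key}={value};" for key, value in props.items()]


if __name__ == "__main__":
    # Placeholder values; replace them with the values for your own account.
    legacy = hive_props_legacy(
        warehouse="oss://<bucket>/<path>",
        dlf_endpoint="dlf-vpc.cn-hangzhou.aliyuncs.com",
        region_id="cn-hangzhou",
        account_id="<account-id>",
        access_key_id="<access-key-id>",
        access_key_secret="<access-key-secret>",
    )
    print("\n".join(to_set_statements(legacy)))
```

On the later versions, a single statement results, because only the catalog-impl property has to be set.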