E-MapReduce: Configure metadata management for a Spark cluster

Last Updated: May 17, 2023

You can manage the metadata of a Spark cluster by using Data Lake Formation (DLF) or a self-managed Hive metastore. This topic describes how to configure metadata management for a Spark cluster on the EMR on ACK page of the E-MapReduce (EMR) console.

Background information

DLF offers high availability and easy maintenance. This service is suitable for centralized metadata management in the following scenarios:

  • All your EMR clusters are deployed in the production environment. If you use DLF, you do not need to maintain independent metadata databases.

  • Multiple big data compute engines are used, such as MaxCompute, Hologres, and Machine Learning Platform for AI.

  • Multiple EMR clusters are created.

Prerequisites

  • A Spark cluster is created on the EMR on ACK page of the EMR console. For more information, see Step 1: Create a cluster.

  • If you use DLF to manage metadata: DLF is activated. For more information, see Quick start.

  • If you use a self-managed Hive metastore to manage metadata: A self-managed Hive metastore is created and can be accessed by the Container Service for Kubernetes (ACK) clusters that you created.
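
    Connectivity from the ACK cluster to the self-managed Hive metastore is a common source of setup issues. The following Python snippet is a minimal sketch that you can run from a pod in the ACK cluster to check that the metastore is reachable on its Thrift port. The IP address is a placeholder that you must replace; port 9083 is the default Hive metastore Thrift port.

      # Minimal reachability check for a self-managed Hive metastore.
      # Replace METASTORE_HOST with the IP address of your metastore.
      import socket

      METASTORE_HOST = "192.168.0.10"  # placeholder IP address
      METASTORE_PORT = 9083            # default Hive metastore Thrift port

      try:
          with socket.create_connection((METASTORE_HOST, METASTORE_PORT), timeout=5):
              print("Hive metastore is reachable")
      except OSError as exc:
          print(f"Cannot reach Hive metastore: {exc}")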

Method 1: (Recommended) Manage metadata by using DLF

  1. Go to the Cluster Details tab.

    1. Log on to the EMR on ACK console.

    2. On the EMR on ACK page, find the Spark cluster that you want to manage and click its name.

  2. On the Cluster Details tab, click Enable next to Data Lake Formation (DLF).

  3. In the Enable DLF message, click OK.

    After the configurations are complete, the metadata of jobs submitted to the Spark cluster is automatically stored in DLF.
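
To confirm that metadata is flowing to DLF, you can submit a simple Spark job that creates a database and a table, and then look for them on the metadata pages of the DLF console. The following PySpark snippet is a minimal sketch; the database and table names are examples only.

    # Minimal PySpark sketch: with DLF enabled for the cluster, the metadata
    # of the database and table created below should appear in DLF.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("dlf-metadata-check") \
        .enableHiveSupport() \
        .getOrCreate()

    spark.sql("CREATE DATABASE IF NOT EXISTS dlf_demo")
    spark.sql("CREATE TABLE IF NOT EXISTS dlf_demo.users (id BIGINT, name STRING)")
    spark.sql("SHOW TABLES IN dlf_demo").show()

    spark.stop()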

Method 2: Manage metadata by using a self-managed Hive metastore

  1. Go to the Configure tab.

    1. Log on to the EMR on ACK console.

    2. On the EMR on ACK page, find the Spark cluster that you want to manage and click Configure in the Actions column.

  2. On the Configure tab, click the spark-defaults.conf tab.

  3. Add a custom configuration item.

    1. On the spark-defaults.conf tab, click Add Configuration Item.

    2. In the dialog box that appears, set the Key parameter to spark.hadoop.hive.metastore.uris and the Value parameter to thrift://<IP address of the self-managed Hive metastore>:9083.

      This configuration item specifies the uniform resource identifier (URI) that is used to access the Hive metastore over the Thrift protocol. Modify the Value parameter based on your business requirements. An example of how to verify this setting from a Spark job is shown after this procedure.

    3. Click OK.

    4. In the dialog box that appears, enter a reason in the Execution Reason field, and then click Save.

  4. Deploy client configurations.

    1. In the lower part of the Configure tab, click Deploy Client Configuration.

    2. In the dialog box that appears, enter a reason in the Execution Reason field and click OK.

    3. In the Confirm message, click OK.

    After the configurations are complete, the metadata of jobs submitted to the Spark cluster is automatically stored in the self-managed Hive metastore.
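
To verify that the Spark cluster reads metadata from the self-managed Hive metastore, you can run a job that lists the databases Spark sees; the result should match the contents of your metastore. The following PySpark snippet is a minimal sketch. The commented lines show how, as an assumption, the same property can be set programmatically for a single job instead of cluster-wide; the IP address is a placeholder.

    # Minimal PySpark sketch to verify the connection to the self-managed
    # Hive metastore that is configured in spark-defaults.conf.
    from pyspark.sql import SparkSession

    # To set the metastore URI for a single job instead of relying on the
    # cluster-wide default, the same property can be added to the builder
    # before the session is created, for example:
    #   .config("spark.hadoop.hive.metastore.uris", "thrift://192.168.0.10:9083")
    # The IP address above is a placeholder.
    spark = SparkSession.builder \
        .appName("hive-metastore-check") \
        .enableHiveSupport() \
        .getOrCreate()

    # The databases listed here should come from the self-managed Hive metastore.
    spark.sql("SHOW DATABASES").show()

    spark.stop()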