Data Lake Formation: Data catalog

Last Updated: Jul 30, 2024

This topic describes the basic capabilities of a data catalog.

Definition

A data catalog is the top-level entity of Data Lake Formation (DLF) metadata and can contain multiple databases.

Applicable scenarios

Data catalogs are mainly used in scenarios that require metadata isolation. For example, if multiple E-MapReduce (EMR) clusters are bound to different data catalogs, the metadata of each cluster is isolated from that of the other clusters.

Basic operations

Create a data catalog

  1. Log on to the DLF console.

  2. In the left-side navigation pane, choose Metadata > Metadata.

  3. On the page that appears, click the Catalog List tab.

  4. On the Catalog List tab, click New Catalog.

  5. On the New Catalog page, set the following parameters. Then, click OK.

    • Catalog ID: the ID of the data catalog. This parameter is required, and its value must be unique.

    • Description: the description of the data catalog. This parameter is optional.

    • Location: the default storage path of database data. This parameter is optional. Only Object Storage Service (OSS) paths are supported.

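The parameter rules above can be sketched as a small validation helper. This is only an illustration, not a DLF API: the function name `build_catalog_request` and the dictionary keys are hypothetical, and the check assumes that OSS paths start with the `oss://` scheme.

```python
from typing import Optional


def build_catalog_request(catalog_id: str,
                          description: Optional[str] = None,
                          location: Optional[str] = None) -> dict:
    """Collect the New Catalog parameters into a dict (hypothetical helper)."""
    # Catalog ID is the only required parameter.
    if not catalog_id or not catalog_id.strip():
        raise ValueError("Catalog ID is required")
    # Location is optional, but only OSS paths are supported.
    # Assumption: an OSS path starts with the oss:// scheme.
    if location is not None and not location.startswith("oss://"):
        raise ValueError("Location must be an OSS path")
    request = {"catalog_id": catalog_id.strip()}
    if description:
        request["description"] = description
    if location:
        request["location"] = location
    return request
```

For example, `build_catalog_request("sales_catalog", location="oss://my-bucket/warehouse/")` passes the checks, whereas a location such as `hdfs://nn/warehouse` raises `ValueError`. Uniqueness of the catalog ID can only be checked by the service, so it is not covered here.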

Query a data catalog

  1. Log on to the DLF console.

  2. In the left-side navigation pane, choose Metadata > Metadata.

  3. On the page that appears, click the Catalog List tab. On the Catalog List tab, query the desired data catalog.


Modify a data catalog

  1. Log on to the DLF console.

  2. In the left-side navigation pane, choose Metadata > Metadata.

  3. On the page that appears, click the Catalog List tab.

  4. On the Catalog List tab, find the ID of the data catalog that you want to modify and click Edit in the Actions column.

  5. On the Modify Catalog page, modify the following parameters. Then, click OK.

    • Description: the description of the data catalog. This parameter is optional.

    • Location: the default storage path of database data. This parameter is optional. Only OSS paths are supported.


Delete a data catalog

Warning

After a data catalog is deleted, data in the data catalog cannot be restored. Proceed with caution.

  1. Log on to the DLF console.

  2. In the left-side navigation pane, choose Metadata > Metadata.

  3. On the page that appears, click the Catalog List tab.

  4. On the Catalog List tab, find the ID of the data catalog that you want to delete and click Delete in the Actions column.

  5. In the message that appears, click Delete to delete the data catalog.

Operations related to compute engines

Change the ID of a DLF data catalog that is bound to an EMR cluster

Important

After you change the ID of the DLF data catalog that is bound to an EMR cluster, the cluster is bound to the new catalog ID. You can no longer perform operations from that cluster on databases or tables in the original data catalog, and all running jobs that read from or write to those databases or tables fail. Before you change the data catalog ID, assess the impact of the change.

  • Modify Hive engine configurations

    1. Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.

    2. On the EMR on ECS page, click the ID of the desired EMR cluster. On the page that appears, click the Services tab.

    3. On the Services tab, find Hive and click Configure. On the page that appears, click the hive-site.xml tab.

    4. On the hive-site.xml tab, click Add Configuration Item. In the Add Configuration Item dialog box, set the following parameters. Then, click OK.

      • Key: Enter dlf.catalog.id.

      • Value: Enter the new data catalog ID.

    5. Save the configurations, and then deploy the client configurations.

    6. On the Services tab, move the pointer over More and select Restart to restart the Hive service.

    7. After the restart succeeds, the state of the Hive service is Healthy. This indicates that the data catalog ID is changed.
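For reference, the key and value entered in the steps above correspond to one property in hive-site.xml. The sketch below renders that property in the standard Hadoop-style `<property>` layout. It is illustrative only: the EMR console manages the file, the helper name `dlf_catalog_property` is hypothetical, and `my_new_catalog` is a placeholder ID.

```python
import xml.etree.ElementTree as ET


def dlf_catalog_property(new_catalog_id: str) -> str:
    """Render the dlf.catalog.id entry as a Hadoop-style <property> element."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "dlf.catalog.id"
    ET.SubElement(prop, "value").text = new_catalog_id
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(prop, encoding="unicode")


# "my_new_catalog" is a placeholder, not a real catalog ID.
print(dlf_catalog_property("my_new_catalog"))
# prints <property><name>dlf.catalog.id</name><value>my_new_catalog</value></property>
```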

Note
  • To modify Spark engine configurations, follow the preceding steps for the Hive engine. After the modification succeeds, restart the Spark service.

    • In EMR 5.6.0 and earlier 5.x versions, and in EMR 3.40.0 and earlier 3.x versions, you do not need to modify Spark engine configurations. Spark uses the modified Hive engine configurations, so you only need to modify the Hive engine configurations.

  • To modify Presto engine configurations, follow the preceding steps for the Hive engine, but make the change on the hive.properties tab. After the modification succeeds, restart the Presto service. Presto engine configurations can be modified only in EMR 5.8.0 and later 5.x versions, or in EMR 3.42.0 and later 3.x versions.

  • You do not need to modify Impala engine configurations. Impala uses the modified Hive engine configurations, so you only need to modify the Hive engine configurations.
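The version rules in this note can be summarized as a small decision helper. This is a sketch under assumptions: the function names are hypothetical, versions are assumed to be three-part numbers such as 5.8.0, and only the 3.x and 5.x lines mentioned above are handled.

```python
import re
from typing import List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn 'EMR 5.8.0', 'EMR-5.8.0', or '5.8.0' into (5, 8, 0)."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def engines_to_modify(emr_version: str) -> List[str]:
    """List the engines whose configurations are modified for a given EMR version."""
    v = parse_version(emr_version)
    # Thresholds from the note, keyed by major version.
    spark_shared_until = {5: (5, 6, 0), 3: (3, 40, 0)}  # at or below: Spark uses Hive settings
    presto_from = {5: (5, 8, 0), 3: (3, 42, 0)}         # at or above: Presto can be modified
    if not v or v[0] not in spark_shared_until:
        raise ValueError(f"No guidance in this topic for version {emr_version!r}")
    major = v[0]
    engines = ["Hive"]  # Hive is always modified; Impala uses the Hive settings
    if v > spark_shared_until[major]:
        engines.append("Spark")
    if v >= presto_from[major]:
        engines.append("Presto")
    return engines
```

For example, `engines_to_modify("EMR 5.6.0")` yields only Hive, while `engines_to_modify("EMR 5.8.0")` yields Hive, Spark, and Presto. Each engine in the result also needs a restart after its configurations are changed.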