DataWorks: Metadata collection

Last Updated: Dec 17, 2024

The metadata collection feature of DataWorks Data Map collects metadata from various data sources into Data Map, where you can manage the collected metadata in a centralized manner and view it by data source. This topic describes how to create a crawler to collect metadata from a data source to DataWorks.

Prerequisites

A data source is added to a workspace. For information about how to add a data source, see the topics in the Data source management directory.

Overview

After you add a data source to a workspace, DataWorks can collect metadata of the data source. After you enable the metadata collection feature in Data Map, DataWorks collects all existing metadata once, collects incremental metadata every day, and aggregates the full and incremental metadata in Data Map. You can then perform various operations on the metadata in Data Map. For example, you can check the overview of data, manage tables by category and group, and view data lineage.

Note
  • If the default collection plan does not meet your business requirements, you can modify the collection plan. For more information, see Manage metadata crawlers.

  • After you associate a MaxCompute data source or an E-MapReduce (EMR) data source that uses Data Lake Formation (DLF) for metadata storage with DataStudio, the system automatically performs O&M operations on the crawler that is used to collect metadata from the MaxCompute or EMR data source. You do not need to manually manage the crawler.
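The full-then-incremental collection described in this overview can be pictured with a small sketch. This is only a conceptual illustration: the source does not document how DataWorks stores or merges metadata, so the catalog shape (table name mapped to a column list), the use of None to mark a dropped table, and the function name apply_increment are all hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical shapes for illustration only:
# a catalog maps a table name to its column names,
# and a daily delta maps a table name to new columns, or None if the table was dropped.
Catalog = Dict[str, List[str]]
Delta = Dict[str, Optional[List[str]]]


def apply_increment(catalog: Catalog, delta: Delta) -> Catalog:
    """Return a new catalog with one day's incremental changes applied."""
    merged = dict(catalog)  # copy so the earlier snapshot is left untouched
    for table, columns in delta.items():
        if columns is None:
            merged.pop(table, None)  # table was dropped at the source
        else:
            merged[table] = columns  # table was created or altered
    return merged


# One full collection, followed by two daily incremental collections.
full = {"orders": ["id", "amount"], "users": ["id", "name"]}
day1 = {"orders": ["id", "amount", "currency"], "tmp_load": ["id"]}
day2 = {"tmp_load": None}

catalog = apply_increment(apply_increment(full, day1), day2)
print(sorted(catalog))
```

The point of the sketch is that the aggregated view always reflects the full collection plus every later increment, which is why a one-time full collection is enough.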

Supported data source types and metadata collection methods

| Data source type | Metadata collection method | Crawler available in Data Map | Update timeliness: Table/Field | Update timeliness: Partition | Update timeliness: Data lineage |
| --- | --- | --- | --- | --- | --- |
| MaxCompute | Associate a data source with DataStudio; automatic metadata collection | No | Regular project: real-time. External project: T+1 | Region in the Chinese mainland: real-time. Region outside China: T+1 | T+1 |
| EMR (Metadata storage method: DLF) | Register an open source cluster in SettingCenter; automatic metadata collection | No | Real-time | Real-time | Real-time |
| EMR (Metadata storage method: HMS or RDS) | Register an open source cluster in SettingCenter; automatic metadata collection | Yes | Real-time | Real-time | Real-time |
| Hologres | Associate a data source with DataStudio; manual metadata collection | Yes | Depends on the custom collection plan | Not supported | Real-time |
| AnalyticDB for PostgreSQL | Associate a data source with DataStudio; manual metadata collection | Yes | Depends on the custom collection plan | Not supported | Real-time |
| AnalyticDB for MySQL | Associate a data source with DataStudio; manual metadata collection | Yes | Depends on the custom collection plan | Not supported | Real-time |
| AnalyticDB for Spark | Associate a computing resource with DataStudio; manual metadata collection | Yes | Real-time | Not supported | Real-time |
| CDH Hive | Register an open source cluster in SettingCenter; automatic metadata collection | Yes | Depends on the custom collection plan | Real-time | Real-time |
| DLF | Automatic metadata collection | No | Real-time | Real-time | Not applicable |
| Other data source types, such as MySQL, PostgreSQL, SQL Server, Oracle, Tablestore, StarRocks, and ClickHouse | Add a data source in SettingCenter; manual metadata collection | Yes | Depends on the custom collection plan | Not supported | Not supported |

Note
  • EMR (both metadata storage methods): Make sure that EMR_HOOK is enabled for the cluster.

  • AnalyticDB for MySQL: You must submit a ticket to enable the data lineage feature for an AnalyticDB for MySQL instance.

  • AnalyticDB for Spark: You can associate AnalyticDB for Spark computing resources only with Data Studio (new version). AnalyticDB for Spark and AnalyticDB for MySQL use the same entry point for metadata collection.
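When planning which data sources to onboard, it can help to hold the support matrix above in a small lookup. The sketch below transcribes only a few rows, and the names Timeliness, MATRIX, and sources_with are hypothetical helpers for illustration, not part of any DataWorks API. The "MySQL" entry stands for the "Other data source types" row.

```python
from typing import List, NamedTuple


class Timeliness(NamedTuple):
    crawler_in_data_map: bool  # "Whether the crawler is available in Data Map"
    table_field: str
    partition: str
    lineage: str


# Partial transcription of the support matrix (illustrative subset of rows).
MATRIX = {
    "Hologres": Timeliness(True, "custom plan", "not supported", "real-time"),
    "CDH Hive": Timeliness(True, "custom plan", "real-time", "real-time"),
    "DLF": Timeliness(False, "real-time", "real-time", "not applicable"),
    "EMR (DLF)": Timeliness(False, "real-time", "real-time", "real-time"),
    "EMR (HMS or RDS)": Timeliness(True, "real-time", "real-time", "real-time"),
    "MySQL": Timeliness(True, "custom plan", "not supported", "not supported"),
}


def sources_with(aspect: str, value: str = "real-time") -> List[str]:
    """List the source types whose `aspect` column equals `value`."""
    return sorted(name for name, row in MATRIX.items() if getattr(row, aspect) == value)


print(sources_with("partition"))
print(sources_with("lineage", "not supported"))
```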

Limits

  • You can collect only the metadata of data sources that you configured in the workspaces to which the current logon account belongs. If you want to collect metadata of data sources in another workspace, you can contact the workspace administrator to add your account to the workspace as a member. For more information, see Add workspace members and assign roles to them.

  • If you want to collect metadata of a data source for which whitelist-based access control is enabled, you must add the CIDR blocks or IP addresses of DataWorks in the region where the related workspace resides to the IP address whitelist of the data source. For more information, see Configure IP address whitelists for metadata collection.

  • We recommend that you do not collect metadata of a data source that resides in a different region from your workspace. If you want to collect metadata across regions, configure a public network address when you create a data source. For more information, see Add and manage data sources.

  • You cannot use a MySQL metadata crawler to collect the metadata of an OceanBase data source.
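For the whitelist requirement above, a quick check with the Python standard library can confirm that every required CIDR block is covered by the whitelist of a data source before you run a crawler. This is a minimal sketch: the function name missing_from_whitelist is hypothetical, and the addresses are reserved documentation ranges used as placeholders, not the actual DataWorks CIDR blocks. Take the real values from Configure IP address whitelists for metadata collection.

```python
import ipaddress
from typing import Iterable, List


def missing_from_whitelist(required_cidrs: Iterable[str], whitelist: Iterable[str]) -> List[str]:
    """Return the required CIDR blocks that no single whitelist entry fully covers."""
    allowed = [ipaddress.ip_network(entry, strict=False) for entry in whitelist]
    missing = []
    for cidr in required_cidrs:
        block = ipaddress.ip_network(cidr, strict=False)
        # subnet_of() raises TypeError across IP versions, so compare versions first.
        covered = any(
            block.version == net.version and block.subnet_of(net) for net in allowed
        )
        if not covered:
            missing.append(cidr)
    return missing


# Placeholder values only (RFC 5737 documentation ranges).
required = ["198.51.100.0/24", "203.0.113.16/28"]
whitelist = ["198.51.100.0/24", "203.0.113.0/30"]

print(missing_from_whitelist(required, whitelist))
```

An empty result means every required block is covered. A block that is covered only by the union of several smaller whitelist entries is still reported as missing, which keeps the check simple and conservative.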

Procedure

  1. Go to the DataMap page in the DataWorks console.

  2. In the left-side navigation pane of the DataMap page, click Collect Metadata.

    On the page that appears, you can manage the metadata crawlers of the associated data source types from the data source perspective. If no data source is available, you can click Create Data Source to go to the Data Sources page and create a data source in SettingCenter.

View metadata crawlers

  • Overall statistics on metadata collection

    On the Collect Metadata page, you can view the overall information about metadata collection from the data source perspective. You can view the number of data sources for which a crawler is created.

  • Metadata collection details

    To view the details of metadata collection of a data source type, you can click Manage in the upper-right corner of the data source type. On the Data Sources for Which Crawler Is Created tab, select a workspace and view the following information about a crawler in the workspace: Status, Execution Plan, Last Run At, Last Running Time/s, Average Running Time/s, and Tables Found During Last Run.

Manage metadata crawlers

Click Manage in the upper-right corner of the desired data source type. The Data Sources for Which Crawler Is Created tab appears. On this tab, you can view the list of data sources of the selected data source type or the list of data sources for which a crawler is created in the selected workspace. You can perform the following operations on an existing crawler.

Run a metadata crawler

To run a metadata crawler manually, find the desired data source on the Data Sources for Which Crawler Is Created tab and click Run in the Actions column of the data source. The crawler collects the metadata of the data source once.

Modify the collection plan of a metadata crawler

Go to the Data Sources for Which Crawler Is Created tab, find the desired data source, and then click Edit in the Actions column of the data source to modify the collection plan of the metadata crawler. The collection plans include manual metadata collection and periodic metadata collection.

  • Manual metadata collection: The crawler runs only when you trigger it. Each time you want to collect the metadata of the data source to Data Map or update the collected metadata, you must run the crawler manually.

  • Periodic metadata collection: You do not need to trigger the crawler manually. The system periodically collects the metadata of the data source to Data Map and updates the collected metadata based on the collection plan.

Delete a metadata crawler

Go to the Data Sources for Which Crawler Is Created tab, find the desired data source, and then click Remove in the Actions column of the data source to delete the metadata crawler of the data source. After you delete the metadata crawler of the data source, the data source is moved to the Data Sources for Which No Crawler Is Created tab and the metadata of the data source is no longer collected.

Create a metadata crawler

After you add a data source or register a cluster to a workspace, you can go to Data Map to enable the metadata collection feature. You can view information about metadata collection for the data source or cluster on the Data Sources for Which Crawler Is Created tab.

If you want to recollect the metadata of a data source after you delete the metadata crawler of the data source, you can create a metadata crawler for the data source on the Data Sources for Which No Crawler Is Created tab.

  1. Click Data Sources for Which No Crawler Is Created.

  2. Find the desired data source and click Create Crawler in the Actions column of the data source. In the Configure Collection Plan dialog box, configure the parameters.

    Note

    Parameters that you need to configure in the Configure Collection Plan dialog box vary based on the data source type.

    Parameter or section

    Description

    Resource Group Name

    Select the resource group that is connected to the data source whose metadata you want to collect. You can select one of the following resource groups in Data Map based on your business requirements:

    • Default resource group named default

    • Your exclusive resource group for scheduling

    • Your exclusive resource group for Data Integration

    • Your serverless resource group

    Test Network Connectivity

    After you select a resource group, you can click Test Network Connectivity to re-test the network connectivity between the resource group and the data source whose metadata you want to collect. If the message The connectivity test failed. is displayed, check the items described in the Limits section of this topic, such as the IP address whitelist of the data source and cross-region access.

    Collection Plan

    The metadata collection plan. Valid values: Manual Crawling, Monthly, Weekly, Daily, and Hourly. The collection plan that is generated varies based on the collection cycle. The system collects metadata from the data source based on the collection cycle that you specify.

    • Manual Crawling: You can manually trigger the crawler to collect metadata of the data source to Data Map and update the collected metadata based on your business requirements.

    • Monthly: The system automatically collects metadata of the data source once at a specified point in time on several specific days of each month.

      Important

      Some months do not have a 29th, 30th, or 31st day. We recommend that you do not select the last few days of a month.

    • Weekly: The system automatically collects metadata of the data source once at a specified point in time on several specific days of each week.

      If you do not configure the Time parameter, the system automatically collects metadata of the data source once at 00:00:00 on the specific days of each week.

    • Daily: The system automatically collects metadata of the data source once at a specified point in time of each day.

    • Hourly: The system automatically collects metadata of the data source once at the Nth minute of each hour.

  3. Verify that the configurations of the crawler are correct and click Confirmation.

    The system collects metadata of the data source based on the configured collection plan. If you select Manual Crawling, you can find the desired data source on the Data Sources for Which Crawler Is Created tab and click Run in the Actions column of the data source to manually collect the metadata of the data source based on your business requirements.
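The Monthly and Hourly options of the Collection Plan parameter can be reasoned about with a short sketch. It shows which months lack a given day of the month, which is the reason for the recommendation to avoid the last few days, and it computes the next hourly run under one reading of "the Nth minute of each hour". The function names are hypothetical, and the source does not state how DataWorks handles a month that lacks the selected day, so the sketch only reports those months.

```python
import calendar
from datetime import datetime, timedelta
from typing import List


def months_without_day(year: int, day: int) -> List[int]:
    """Months (1-12) of `year` that have fewer than `day` days."""
    return [m for m in range(1, 13) if calendar.monthrange(year, m)[1] < day]


def next_hourly_run(now: datetime, minute: int) -> datetime:
    """Next time strictly after `now` whose minute field equals `minute`."""
    candidate = now.replace(minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(hours=1)  # this hour's slot has already passed
    return candidate


print(months_without_day(2025, 31))  # [2, 4, 6, 9, 11]
print(months_without_day(2025, 29))  # [2]
print(next_hourly_run(datetime(2025, 1, 1, 10, 45), 30))  # 2025-01-01 11:30:00
```

A monthly plan on the 31st would therefore have no matching day in five months of 2025, whereas a plan on the 1st through the 28th matches every month.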

What to do next

After the metadata is collected, you can perform various operations on the metadata in Data Map. For example, you can check the overview of data, manage tables by category and group, and view data lineage. For more information, see View resource information, Search for tables, and Table management from the business perspective: Data albums.