Metadata collection

Updated at: 2025-04-18 18:08

DataWorks Data Map provides the Metadata Collection feature, which allows you to centrally manage metadata from different systems. You can view the metadata collected from various data sources in Data Map. This topic describes how to create a crawler to collect metadata from various data sources to DataWorks.

Prerequisites

A data source is added to a workspace. For more information about how to add a data source, see Computing resource management.

Overview

After you add a data source to a workspace, DataWorks can collect its metadata. When you enable metadata collection in Data Map, DataWorks collects all existing metadata in a single full run, collects incremental metadata daily, and aggregates both in Data Map. You can then work with the metadata in Data Map. For example, you can check the data overview, manage tables by category and group, and view data lineage.

Note
  • If the default collection plan does not meet your requirements, you can modify the collection plan of a crawler. For more information, see Manage metadata crawlers.

  • After you associate a MaxCompute data source or an E-MapReduce (EMR) data source that uses Data Lake Formation (DLF) for metadata storage with DataStudio, the system automatically performs O&M operations on the crawler that is used to collect metadata from the MaxCompute or EMR data source. You do not need to manually manage the crawler.

  • If you have created a physical table in a data source but cannot find the table in DataStudio, you can manually collect metadata from the data source to resolve this issue.
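The full-plus-incremental flow described in the overview can be pictured with a small sketch. It is only an illustration of the general idea: the `TableMeta` and `Change` types, their fields, and the rule that a dropped table is removed from the catalog are assumptions made for this example, not the actual DataWorks data model or API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class TableMeta:
    """Hypothetical metadata record; the field names are illustrative only."""
    name: str
    columns: tuple


@dataclass(frozen=True)
class Change:
    """One incremental change: an upsert carries `meta`, a drop carries None."""
    table: str
    meta: Optional[TableMeta]


def apply_incremental(catalog: Dict[str, TableMeta],
                      changes: List[Change]) -> Dict[str, TableMeta]:
    """Return a new catalog with one batch of incremental changes applied."""
    merged = dict(catalog)  # leave the previous snapshot untouched
    for change in changes:
        if change.meta is None:
            merged.pop(change.table, None)  # assumed: table dropped at the source
        else:
            merged[change.table] = change.meta  # table created or altered
    return merged


# Full collection: everything that exists when collection is first enabled.
full_snapshot = {
    "orders": TableMeta("orders", ("id", "amount")),
    "users": TableMeta("users", ("id", "name")),
}

# One day's incremental collection (made-up sample data).
daily_changes = [
    Change("orders", TableMeta("orders", ("id", "amount", "currency"))),
    Change("users", None),
    Change("events", TableMeta("events", ("id", "ts"))),
]

catalog = apply_incremental(full_snapshot, daily_changes)
print(sorted(catalog))  # ['events', 'orders']
```

Returning a new dictionary instead of mutating the snapshot keeps the earlier state available for comparison, which is convenient when checking what a given run changed.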

Supported data source types and metadata collection methods

| Data source type | Metadata collection method | Crawler visible in Data Map | Update timeliness: table/field | Update timeliness: partition | Update timeliness: data lineage |
| --- | --- | --- | --- | --- | --- |
| AnalyticDB for PostgreSQL | Associate a data source with DataStudio; manual metadata collection | Yes | Depends on the custom collection plan | Not supported | Real-time |
| AnalyticDB for MySQL | Associate a data source with DataStudio; manual metadata collection | Yes | Depends on the custom collection plan | Not supported | Real-time. Note: You need to submit a ticket to enable the data lineage feature for your AnalyticDB for MySQL instance. |
| AnalyticDB for Spark | Associate a computing resource with DataStudio (Note: You can associate AnalyticDB for Spark computing resources only with Data Studio (new version).); manual metadata collection (Note: AnalyticDB for Spark and AnalyticDB for MySQL use the same entry point for metadata collection.) | Yes | Real-time | Not supported | Real-time |
| CDH Hive | Register an open source cluster in SettingCenter; automatic metadata collection | Yes | Depends on the custom collection plan | Real-time | Real-time |
| DLF | Automatic metadata collection | No | Real-time | Real-time | N/A |
| E-MapReduce (DLF). Note: You need to enable EMR_HOOK for the cluster. | Register an open source cluster in SettingCenter; automatic metadata collection | No | Real-time | Real-time | Real-time |
| E-MapReduce (HMS / RDS). Note: You need to enable EMR_HOOK for the cluster. | Register an open source cluster in SettingCenter; automatic metadata collection | Yes | Real-time | Real-time | Real-time |
| Hologres | Associate a data source with DataStudio; manual metadata collection | Yes | Depends on the custom collection plan | Not supported | Real-time |
| Lindorm | Associate a data source with DataStudio; manual metadata collection | Yes | Depends on the custom collection plan | Not supported | Real-time |
| MaxCompute | Associate a data source with DataStudio; automatic metadata collection | No | Regular project: real-time. External project: T+1 | Region in the Chinese mainland: real-time. Region outside China: T+1 | T+1 |
| Other data source types, such as MySQL, PostgreSQL, SQL Server, Oracle, Tablestore, StarRocks, and ClickHouse | Add a data source in SettingCenter; manual metadata collection | Yes | Depends on the custom collection plan | Not supported | Not supported |

Limits

  • You can collect only the metadata of data sources that you configured in the workspaces to which the current logon account belongs. If you want to collect metadata of data sources in other workspaces, you can contact the workspace administrator to add you as a workspace member. For more information, see Add workspace members.

  • If you want to collect metadata of a data source for which whitelist-based access control is enabled, you must add the CIDR blocks or IP addresses of DataWorks in the region where the related workspace resides to the IP address whitelist of the data source. For more information, see Whitelist configuration for metadata collection from data sources with whitelist-based access control.

  • We recommend that you do not collect metadata of a data source that resides in a different region from your workspace. If you want to collect metadata across regions, configure a public network address when you create a data source. For more information, see Create and manage data sources.

  • You cannot use a MySQL metadata crawler to collect the metadata of an OceanBase data source.
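Before you run a crawler against a data source that uses whitelist-based access control, it can help to confirm that every required DataWorks address is already covered by the whitelist. The following is a minimal sketch that uses only the Python standard library. The function name and the sample addresses are placeholders (the addresses come from the reserved documentation ranges); take the real CIDR blocks for your region from the whitelist configuration topic referenced above.

```python
import ipaddress
from typing import Iterable, List


def missing_from_whitelist(required: Iterable[str],
                           whitelist: Iterable[str]) -> List[str]:
    """Return the required CIDR blocks or IPs that no whitelist entry fully covers."""
    allowed = [ipaddress.ip_network(entry, strict=False) for entry in whitelist]
    missing = []
    for block in required:
        network = ipaddress.ip_network(block, strict=False)
        covered = any(
            # subnet_of() raises TypeError across IP versions, so compare versions first
            network.version == entry.version and network.subnet_of(entry)
            for entry in allowed
        )
        if not covered:
            missing.append(block)
    return missing


# Placeholder values only; these are not real DataWorks addresses.
required_blocks = ["192.0.2.0/25", "198.51.100.7"]
current_whitelist = ["192.0.2.0/24", "203.0.113.0/24"]

print(missing_from_whitelist(required_blocks, current_whitelist))  # ['198.51.100.7']
```

The check is deliberately conservative: a required block that is covered only by the combination of several smaller whitelist entries is still reported as missing.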

Go to the metadata collection page

  1. Go to Data Map.

  2. In the left navigation bar, click Metadata Collection.

    You can manage metadata crawlers for configured data sources in Data Source View. If no data source is available, you can click Create Data Source to go to the data source configuration page and create a data source.

View metadata crawlers

  • Overall statistics

    On the Metadata Collection page, you can view the overview of metadata collection in Data Source View. The overview mainly displays the number of data sources for which crawlers are created.

  • Details list

    You can also click the Manage button in the upper-right corner of a data source type to go to the details page. On this page, you can view the Status, Collection Plan, Last Run Time, Last Run Duration, Average Run Duration, and the number of tables updated and added during the last run of the crawler for the specified workspace.

Manage metadata crawlers

Click the Manage button in the upper-right corner of a data source. The Collected List tab appears by default. You can perform the following operations on existing crawlers.

Run a metadata crawler

To collect metadata once on demand, find the target data source on the Collected List tab and click Run in the Actions column.

Modify the collection plan of a metadata crawler

Go to the Collected List tab and click Edit in the Actions column of the target data source crawler to modify the collection plan of the crawler. You can select Manual Crawling or Periodic Crawling.

  • Manual Crawling: After you configure a metadata crawler for the target data source, you need to manually trigger the crawler to collect and update metadata as needed.

  • Periodic Crawling: After you configure a metadata crawler for the target data source, you do not need to manually trigger the crawler. The system periodically collects and updates metadata based on the configured collection plan.

Delete a metadata crawler

You can find the target data source on the Collected List tab and click Remove in the Actions column to delete the metadata crawler for the current data source. After the crawler is deleted, the data source is moved to the Not Collected List tab, and metadata is no longer collected from the data source.

Create a metadata crawler

After you add a data source or register a cluster, you can enable metadata collection in Data Map and view the metadata collection status of the target data source on the Collected List tab.

After you delete a metadata crawler, you can create a metadata crawler again on the Not Collected List tab if you want to restart metadata collection. The following procedure describes how to create a metadata crawler:

  1. Click the Not Collected List tab at the top of the list.

  2. Find the target data source, click Operation in the Metadata Acquisition column, and configure the parameters in the Configure Collection Plan dialog box that appears.

    Note

    Parameters that you need to configure in the Configure Collection Plan dialog box vary based on the data source type.

    Parameter

    Description

    Resource Group Name

    Select the resource group that is connected to the data source whose metadata you want to collect. Data Map allows you to select one of the following resource groups based on your requirements:

    • The default resource group default.

    • Your exclusive resource group for scheduling.

    • Your exclusive resource group for Data Integration.

    • Your serverless resource group.

    Connectivity Test

    After you select a resource group, you can click Test Connectivity to test the connectivity between the resource group and the data source. If Connectivity Test Failed is displayed:

    Collection Plan

    The options include Manual Crawling, Monthly, Weekly, Daily, and Hourly. The collection plan that is generated varies based on the collection cycle. The system collects metadata from the data source based on the collection cycle that you specify.

    • Manual Crawling: You can manually trigger the crawler to collect metadata of the data source to Data Map and update the collected metadata based on your business requirements.

    • Monthly: The system automatically collects metadata of the data source once at a specified point in time on several specific days of each month.

      Important

      Some months do not have a 29th, 30th, or 31st day. We recommend that you do not select the last few days of a month.

    • Weekly: The system automatically collects metadata of the data source once at a specified point in time on several specific days of each week.

      If you do not specify a Time, the system collects metadata at 00:00:00 on the specified days of each week by default.

    • Daily: The system automatically collects metadata of the data source once at a specified point in time of each day.

    • Hourly: The system automatically collects metadata of the data source once at the Nth minute of each hour.

  3. After you confirm that the configuration is correct, click OK.

    The system collects metadata based on the configured collection plan. If you select Manual Crawling, you can go to the Collected List tab, find the target data source, and click Run in the Actions column to manually run the collection task based on your business requirements.
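The collection cycles described for the Collection Plan parameter can be sketched as a small function that computes the next run time. This is only a model for reasoning about the options, not how DataWorks schedules crawlers internally: the function name, the argument names, and the weekday numbering (1 = Monday to 7 = Sunday) are assumptions made for this example. Manual Crawling is not modeled because it has no scheduled run.

```python
from datetime import datetime, time, timedelta
from typing import Sequence


def next_run(after: datetime, cycle: str, *, at: time = time(0, 0),
             days: Sequence[int] = (), minute: int = 0) -> datetime:
    """Return the first scheduled run strictly later than `after`.

    cycle   -- "monthly", "weekly", "daily", or "hourly"
    at      -- time of day for monthly, weekly, and daily plans
    days    -- days of the month (monthly) or ISO weekdays 1-7 (weekly)
    minute  -- minute of the hour for hourly plans
    """
    if cycle == "hourly":
        candidate = after.replace(minute=minute, second=0, microsecond=0)
        if candidate <= after:
            candidate += timedelta(hours=1)
        return candidate
    if cycle not in ("monthly", "weekly", "daily"):
        raise ValueError(f"unknown cycle: {cycle}")

    # Walk forward one calendar day at a time until a day matches the plan.
    for offset in range(366):
        day = (after + timedelta(days=offset)).date()
        if cycle == "daily":
            matches = True
        elif cycle == "weekly":
            matches = day.isoweekday() in days
        else:  # monthly
            matches = day.day in days
        candidate = datetime.combine(day, at)
        if matches and candidate > after:
            return candidate
    raise ValueError("no matching day within a year; check `days`")


# A monthly plan on the 31st skips April, which has only 30 days.
print(next_run(datetime(2025, 4, 1), "monthly", days=(31,)))  # 2025-05-31 00:00:00
```

The last line shows why selecting the last few days of a month is discouraged: a plan that runs only on the 31st does not run at all in months with 30 or fewer days.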

What to do next

After the metadata is collected, you can perform various operations on the metadata in Data Map. For example, you can check the overview of data, manage tables by category and group, and view data lineage. For more information, see Data overview, Search for tables, and Business view management: data albums.
