Metadata collection

DataWorks Data Map provides the Metadata Collection feature, which allows you to centrally manage metadata from different systems. You can view the metadata collected from various data sources in Data Map. This topic describes how to create a crawler to collect metadata from various data sources to DataWorks.

Prerequisites

A data source is added to a workspace. For more information about how to add a data source, see Computing resource management.

Overview

After you add a data source to a workspace, DataWorks can collect metadata of the data source. After you enable the metadata collection feature in Data Map, DataWorks collects all existing metadata at once, collects incremental metadata daily, and then aggregates the full and incremental metadata in Data Map. You can then work with the metadata in Data Map. For example, you can check the data overview, manage tables by category and group, and view data lineage.
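The way a full collection and later incremental collections combine can be pictured with a minimal Python sketch. This is only an illustration, not how DataWorks implements the aggregation: the table names, the dictionary layout, and the `apply_incremental` helper are all hypothetical.

```python
# Hypothetical illustration: metadata is modeled as {table_name: {column: type}}.
# None of these names come from DataWorks itself.

def apply_incremental(full, incremental):
    """Return a new catalog: the full snapshot with one day's changes applied.

    A value of None in `incremental` marks a table that was dropped.
    """
    merged = dict(full)
    for table, columns in incremental.items():
        if columns is None:
            merged.pop(table, None)   # table dropped since the last run
        else:
            merged[table] = columns   # table added or its schema changed
    return merged


full_snapshot = {
    "orders": {"id": "bigint", "amount": "double"},
    "users": {"id": "bigint"},
}
daily_changes = {
    "orders": {"id": "bigint", "amount": "double", "currency": "string"},
    "users": None,
    "payments": {"id": "bigint"},
}

catalog = apply_incremental(full_snapshot, daily_changes)
print(sorted(catalog))  # ['orders', 'payments']
```

The sketch returns a new dictionary instead of changing the full snapshot in place, so the original snapshot stays available for comparison.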

Note

If the default collection plan does not meet your requirements, you can modify the collection plan of a crawler. For more information, see Manage metadata crawlers.
After you associate a MaxCompute data source or an E-MapReduce (EMR) data source that uses Data Lake Formation (DLF) for metadata storage with DataStudio, the system automatically performs O&M operations on the crawler that is used to collect metadata from the MaxCompute or EMR data source. You do not need to manually manage the crawler.
If you have created a physical table in a data source but cannot find the table in DataStudio, you can manually collect metadata from the data source to resolve this issue.

Supported data source types and metadata collection methods

Data source type	Metadata collection method	Whether the crawler can be viewed in Data Map	Metadata update timeliness: Table/Field	Metadata update timeliness: Partition	Metadata update timeliness: Data lineage
AnalyticDB for PostgreSQL	Associate a data source with DataStudio; Manual metadata collection	Yes	Depends on the custom collection plan	Not supported	real-time
AnalyticDB for MySQL	Associate a data source with DataStudio; Manual metadata collection	Yes	Depends on the custom collection plan	Not supported	real-time. Note: You need to submit a ticket to enable the data lineage feature for your AnalyticDB for MySQL instance.
AnalyticDB for Spark	Associate a computing resource with DataStudio. Note: You can associate AnalyticDB for Spark computing resources only with Data Studio (new version). Manual metadata collection. Note: AnalyticDB for Spark and AnalyticDB for MySQL use the same entry point for metadata collection.	Yes	real-time	Not supported	real-time
CDH Hive	Register an open source cluster in SettingCenter; Automatic metadata collection	Yes	Depends on the custom collection plan	real-time	real-time
DLF	Automatic metadata collection	No	real-time	real-time	N/A
E-MapReduce (DLF). Note: You need to enable EMR_HOOK for the cluster.	Register an open source cluster in SettingCenter; Automatic metadata collection	No	real-time	real-time	real-time
E-MapReduce (HMS / RDS). Note: You need to enable EMR_HOOK for the cluster.	Register an open source cluster in SettingCenter; Automatic metadata collection	Yes	real-time	real-time	real-time
Hologres	Associate a data source with DataStudio; Manual metadata collection	Yes	Depends on the custom collection plan	Not supported	real-time
Lindorm	Associate a data source with DataStudio; Manual metadata collection	Yes	Depends on the custom collection plan	Not supported	real-time
MaxCompute	Associate a data source with DataStudio; Automatic metadata collection	No	Regular project: real-time; External project: T+1	Region in the Chinese mainland: real-time; Region outside China: T+1	T+1
Other data source types, such as MySQL, PostgreSQL, SQL Server, Oracle, Tablestore, StarRocks, and ClickHouse	Add a data source in SettingCenter; Manual metadata collection	Yes	Depends on the custom collection plan	Not supported	Not supported

Limits

You can collect only the metadata of data sources that you configured in the workspaces to which the current logon account belongs. If you want to collect metadata of data sources in other workspaces, you can contact the workspace administrator to add you as a workspace member. For more information, see Add workspace members.
If you want to collect metadata of a data source for which whitelist-based access control is enabled, you must add the CIDR blocks or IP addresses of DataWorks in the region where the related workspace resides to the IP address whitelist of the data source. For more information, see Whitelist configuration for metadata collection from data sources with whitelist-based access control.
We recommend that you do not collect metadata of a data source that resides in a different region from your workspace. If you want to collect metadata across regions, configure a public network address when you create a data source. For more information, see Create and manage data sources.
You cannot use a MySQL metadata crawler to collect the metadata of an OceanBase data source.
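Before you run a crawler against a data source that uses whitelist-based access control, it can help to check whether the whitelist already covers the required address ranges. The following Python sketch shows one way to do that with the standard `ipaddress` module. The CIDR blocks and the `missing_from_whitelist` helper are hypothetical: the addresses are taken from the ranges reserved for documentation, not from DataWorks, and you need to substitute the CIDR blocks listed for your region.

```python
import ipaddress

# Hypothetical values from the reserved documentation ranges (RFC 5737).
# Replace them with the DataWorks CIDR blocks published for your region
# and with the actual whitelist of your data source.
REQUIRED_CIDRS = ["192.0.2.0/25", "198.51.100.0/24"]
CURRENT_WHITELIST = ["192.0.2.0/24", "203.0.113.10"]


def missing_from_whitelist(required, whitelist):
    """Return the required CIDR blocks not fully covered by any whitelist entry."""
    allowed = [ipaddress.ip_network(entry, strict=False) for entry in whitelist]
    missing = []
    for cidr in required:
        network = ipaddress.ip_network(cidr, strict=False)
        # subnet_of() is True when every address in `network` is inside `entry`.
        if not any(network.subnet_of(entry) for entry in allowed):
            missing.append(cidr)
    return missing


print(missing_from_whitelist(REQUIRED_CIDRS, CURRENT_WHITELIST))
# ['198.51.100.0/24']
```

The sketch assumes that all entries are IPv4. `subnet_of` raises a `TypeError` if an IPv4 network is compared with an IPv6 network, so mixed lists would need to be split by version first.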

Go to the metadata collection page

Go to Data Map.
In the left-side navigation pane, click Metadata Collection.
You can manage metadata crawlers for configured data sources in Data Source View. If no data source is available, you can click Create Data Source to go to the data source configuration page and create a data source.

View metadata crawlers

Overall statistics
On the Metadata Collection page, you can view the overview of metadata collection in Data Source View. The overview mainly displays the number of data sources for which crawlers are created.
Details list
You can also click the Manage button in the upper-right corner of a data source type to go to the details page. On this page, you can view the Status, Collection Plan, Last Run Time, Last Run Duration, Average Run Duration, and the number of tables updated and added during the last run of the crawler for the specified workspace.

Manage metadata crawlers

Click the Manage button in the upper-right corner of a data source. The Collected List tab appears by default. You can perform the following operations on existing crawlers.

Run a metadata crawler

To run a metadata crawler manually, find the target data source on the Collected List tab and click Run in the Actions column. The crawler then collects metadata once.

Modify the collection plan of a metadata crawler

Go to the Collected List tab and click Edit in the Actions column of the target data source crawler to modify the collection plan of the crawler. You can select Manual Crawling or Periodic Crawling.

Manual Crawling: After you configure a metadata crawler for the target data source, you need to manually trigger the crawler to collect and update metadata as needed.
Periodic Crawling: After you configure a metadata crawler for the target data source, you do not need to manually trigger the crawler. The system periodically collects and updates metadata based on the configured collection plan.

Delete a metadata crawler

You can find the target data source on the Collected List tab and click Remove in the Actions column to delete the metadata crawler for the current data source. After the crawler is deleted, the data source is moved to the Not Collected List tab, and metadata is no longer collected from the data source.

Create a metadata crawler

After you add a data source or register a cluster, you can enable metadata collection in Data Map and view the metadata collection status of the target data source on the Collected List tab.

After you delete a metadata crawler, you can create a metadata crawler again on the Not Collected List tab if you want to restart metadata collection. The following procedure describes how to create a metadata crawler:

Click the Not Collected List tab at the top of the list.

Find the target data source, click Operation in the Metadata Acquisition column, and configure the parameters in the Configure Collection Plan dialog box that appears.

Note

Parameters that you need to configure in the Configure Collection Plan dialog box vary based on the data source type.

Parameter	Description
Resource Group Name	Select the resource group that is connected to the data source whose metadata you want to collect. Data Map allows you to select one of the following resource groups based on your requirements: The default resource group `default`. Your exclusive resource group for scheduling. Your exclusive resource group for Data Integration. Your serverless resource group.
Connectivity Test	After you select a resource group, you can click Test Connectivity to test the connectivity between the resource group and the data source. If Connectivity Test Failed is displayed: Check whether whitelist-based access control is enabled for the data source. If you want to collect metadata of a data source for which whitelist-based access control is enabled, see Whitelist configuration for metadata collection from data sources with whitelist-based access control to configure the whitelist. If whitelist-based access control is not enabled for the data source, see Resource group operations and network connectivity to establish a network connection to the data source.
Collection Plan	The options are Manual Crawling, Monthly, Weekly, Daily, and Hourly. The system generates a collection plan based on the cycle that you select and collects metadata from the data source accordingly. Manual Crawling: You manually trigger the crawler to collect metadata of the data source to Data Map and update the collected metadata as needed. Monthly: The system automatically collects metadata of the data source once at a specified point in time on the specified days of each month. Important: Some months do not have a 29th, 30th, or 31st day. We recommend that you do not select the last few days of a month. Weekly: The system automatically collects metadata of the data source once at a specified point in time on the specified days of each week. If you do not specify a Time, the system collects metadata at 00:00:00 on the specified days of each week by default. Daily: The system automatically collects metadata of the data source once at a specified point in time of each day. Hourly: The system automatically collects metadata of the data source once at the `Nth minute` of each hour.

After you confirm that the configuration is correct, click OK.
The system collects metadata based on the configured collection plan. If you select Manual Crawling, you can go to the Collected List tab, find the target data source, and click Run in the Actions column to manually run the collection task based on your business requirements.
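The recommendation to avoid the last few days of a month in a Monthly plan can be checked with a short calculation. The following Python sketch lists the months of a given year in which a selected day does not exist. The `months_skipped` helper is hypothetical, and this topic does not state how DataWorks handles a missing day, so the sketch only shows in which months the question arises.

```python
import calendar


def months_skipped(year, day_of_month):
    """Return the month numbers in `year` that have no such day."""
    return [
        month
        for month in range(1, 13)
        # monthrange() returns (weekday of first day, number of days in month).
        if calendar.monthrange(year, month)[1] < day_of_month
    ]


print(months_skipped(2025, 31))  # [2, 4, 6, 9, 11]
print(months_skipped(2025, 29))  # [2]
print(months_skipped(2024, 29))  # [] because 2024 is a leap year
```

A day from 1 to 28 exists in every month, which is why the earlier days of the month are the safer choice for a monthly plan.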

What to do next

After the metadata is collected, you can perform various operations on the metadata in Data Map. For example, you can check the overview of data, manage tables by category and group, and view data lineage. For more information, see Data overview, Search for tables, and Business view management: data albums.
