Create a crawler to collect the metadata of each data source to DataWorks

DataWorks Data Map provides the metadata collection feature that allows you to collect metadata from various data sources to Data Map, manage the collected metadata in a centralized manner, and view the collected metadata by data source in Data Map. This topic describes how to create a crawler to collect metadata from each data source to DataWorks.

Prerequisites

A data source is added to a workspace. For information about how to add a data source, see the topics in the Data source management directory.

Overview

After you add a data source to a workspace, DataWorks can collect metadata of the data source. After you enable the metadata collection feature in Data Map, DataWorks first collects all existing metadata in a single full collection, then collects incremental metadata every day, and aggregates the full and incremental metadata in Data Map. You can then perform various operations on the metadata in Data Map. For example, you can check the overview of data, manage tables by category and group, and view data lineage.

Note

If the default collection plan does not meet your business requirements, you can modify the collection plan. For more information, see Manage metadata crawlers.
After you associate a MaxCompute data source or an E-MapReduce (EMR) data source that uses Data Lake Formation (DLF) for metadata storage with DataStudio, the system automatically performs O&M operations on the crawler that is used to collect metadata from the MaxCompute or EMR data source. You do not need to manually manage the crawler.
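The full-plus-incremental model described in the overview can be pictured with a small sketch. This is only a conceptual illustration: the source does not describe how DataWorks aggregates metadata internally, and the names `apply_increment`, `upsert`, and `drop`, as well as the sample tables, are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Metadata = Dict[str, dict]        # table name -> table attributes
Change = Tuple[str, str, dict]    # (operation, table name, attributes)


def apply_increment(catalog: Metadata, changes: Iterable[Change]) -> Metadata:
    """Return a new catalog with one batch of incremental changes applied."""
    merged = dict(catalog)
    for op, table, attrs in changes:
        if op == "upsert":        # table created or altered since the last run
            merged[table] = attrs
        elif op == "drop":        # table removed since the last run
            merged.pop(table, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


# Hypothetical data: one full collection followed by two daily increments.
full_snapshot = {"orders": {"columns": ["id", "amount"]}}
day_1 = [("upsert", "customers", {"columns": ["id", "name"]})]
day_2 = [
    ("upsert", "orders", {"columns": ["id", "amount", "currency"]}),
    ("drop", "customers", {}),
]

catalog = full_snapshot
for batch in (day_1, day_2):
    catalog = apply_increment(catalog, batch)
```

In this sketch the full collection runs once and each daily batch only carries what changed, which is why the catalog stays current without repeating the full collection.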

Supported data source types and metadata collection methods

Data source type	Metadata collection method	Whether the crawler is available in Data Map	Update timeliness: table/field	Update timeliness: partition	Update timeliness: data lineage
MaxCompute	Associate a data source with DataStudio; automatic metadata collection	No	Regular project: real-time; external project: T+1	Region in the Chinese mainland: real-time; region outside China: T+1	T+1
EMR (metadata storage method: DLF). Note: Make sure that EMR_HOOK is enabled for the cluster.	Register an open source cluster in SettingCenter; automatic metadata collection	No	Real-time	Real-time	Real-time
EMR (metadata storage method: HMS or RDS). Note: Make sure that EMR_HOOK is enabled for the cluster.	Register an open source cluster in SettingCenter; automatic metadata collection	Yes	Real-time	Real-time	Real-time
Hologres	Associate a data source with DataStudio; manual metadata collection	Yes	Depends on the custom collection plan	Not supported	Real-time
AnalyticDB for PostgreSQL	Associate a data source with DataStudio; manual metadata collection	Yes	Depends on the custom collection plan	Not supported	Real-time
AnalyticDB for MySQL	Associate a data source with DataStudio; manual metadata collection	Yes	Depends on the custom collection plan	Not supported	Real-time. Note: You must submit a ticket to enable the data lineage feature for an AnalyticDB for MySQL instance.
AnalyticDB for Spark	Associate a computing resource with DataStudio. Note: You can associate AnalyticDB for Spark computing resources only with Data Studio (new version). Manual metadata collection. Note: AnalyticDB for Spark and AnalyticDB for MySQL use the same entry point for metadata collection.	Yes	Real-time	Not supported	Real-time
CDH Hive	Register an open source cluster in SettingCenter; automatic metadata collection	Yes	Depends on the custom collection plan	Real-time	Real-time
DLF	Automatic metadata collection	No	Real-time	Real-time	Not applicable
Other data source types, such as MySQL, PostgreSQL, SQL Server, Oracle, Tablestore, StarRocks, and ClickHouse	Add a data source in SettingCenter; manual metadata collection	Yes	Depends on the custom collection plan	Not supported	Not supported

Limits

You can collect only the metadata of data sources that you configured in the workspaces to which the current logon account belongs. If you want to collect metadata of data sources in another workspace, you can contact the workspace administrator to add your account to the workspace as a member. For more information, see Add workspace members and assign roles to them.
If you want to collect metadata of a data source for which whitelist-based access control is enabled, you must add the CIDR blocks or IP addresses of DataWorks in the region where the related workspace resides to the IP address whitelist of the data source. For more information, see Configure IP address whitelists for metadata collection.
We recommend that you do not collect metadata of a data source that resides in a different region from your workspace. If you want to collect metadata across regions, configure a public network address when you create a data source. For more information, see Add and manage data sources.
You cannot use a MySQL metadata crawler to collect the metadata of an OceanBase data source.
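Before you run a crawler against a data source that uses whitelist-based access control, it can help to confirm that every required DataWorks address is covered by the whitelist. The following is a minimal sketch that uses the Python standard library `ipaddress` module. The function name `uncovered_addresses` is hypothetical, and the addresses are documentation-only placeholders, not real DataWorks CIDR blocks. Obtain the actual values from Configure IP address whitelists for metadata collection.

```python
import ipaddress
from typing import Iterable, List


def uncovered_addresses(required: Iterable[str], whitelist: Iterable[str]) -> List[str]:
    """Return the required CIDR blocks or IP addresses that no whitelist entry fully covers."""
    allowed = [ipaddress.ip_network(entry, strict=False) for entry in whitelist]
    missing = []
    for item in required:
        net = ipaddress.ip_network(item, strict=False)  # a single IP becomes a /32 or /128 network
        covered = any(
            net.version == entry.version and net.subnet_of(entry)
            for entry in allowed
        )
        if not covered:
            missing.append(item)
    return missing


# Placeholder values (RFC 5737 documentation ranges), not real DataWorks addresses.
required_by_dataworks = ["192.0.2.0/24", "198.51.100.7"]
data_source_whitelist = ["192.0.2.0/25", "198.51.100.0/28"]

print(uncovered_addresses(required_by_dataworks, data_source_whitelist))
```

With these placeholder values, the `/24` block is reported as missing because the whitelist covers only half of it, while the single address falls inside the `/28` entry.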

Procedure

Go to the DataMap page in the DataWorks console.
In the left-side navigation pane of the DataMap page, click Collect Metadata.
On the page that appears, you can manage the metadata crawlers of the associated data source types from the data source perspective. If no data source is available, you can click Create Data Source to go to the Data Sources page and create a data source in SettingCenter.

View metadata crawlers

Overall statistics on metadata collection
On the Collect Metadata page, you can view the overall information about metadata collection from the data source perspective. You can view the number of data sources for which a crawler is created.
Metadata collection details
To view the details of metadata collection of a data source type, you can click Manage in the upper-right corner of the data source type. On the Data Sources for Which Crawler Is Created tab, select a workspace and view the following information about a crawler in the workspace: Status, Execution Plan, Last Run At, Last Running Time/s, Average Running Time/s, and Tables Found During Last Run.

Manage metadata crawlers

Click Manage in the upper-right corner of the desired data source type. The Data Sources for Which Crawler Is Created tab appears. On this tab, you can view the list of data sources of the selected data source type or the list of data sources for which a crawler is created in the selected workspace. You can perform the following operations on an existing crawler.

Run a metadata crawler

To run a metadata crawler manually, find the desired data source on the Data Sources for Which Crawler Is Created tab and click Run in the Actions column of the data source. The crawler collects the metadata of the data source once.

Modify the collection plan of a metadata crawler

Go to the Data Sources for Which Crawler Is Created tab, find the desired data source, and then click Edit in the Actions column of the data source to modify the collection plan of the metadata crawler. The collection plans include manual metadata collection and periodic metadata collection.

Manual metadata collection: After you configure a metadata crawler for the desired data source and configure this collection plan for the crawler, you must manually trigger the crawler to collect metadata of the data source to Data Map and update the collected metadata based on your business requirements.
Periodic metadata collection: After you configure a metadata crawler for the desired data source and configure this collection plan for the crawler, you do not need to manually trigger the crawler to run. The system periodically collects metadata of the data source to Data Map and updates the collected metadata based on the collection plan.

Delete a metadata crawler

Go to the Data Sources for Which Crawler Is Created tab, find the desired data source, and then click Remove in the Actions column of the data source to delete the metadata crawler of the data source. After you delete the metadata crawler of the data source, the data source is moved to the Data Sources for Which No Crawler Is Created tab and the metadata of the data source is no longer collected.

Create a metadata crawler

After you add a data source or register a cluster to a workspace, you can go to Data Map to enable the metadata collection feature. You can view information about metadata collection for the data source or cluster on the Data Sources for Which Crawler Is Created tab.

If you want to recollect the metadata of a data source after you delete the metadata crawler of the data source, you can create a metadata crawler for the data source on the Data Sources for Which No Crawler Is Created tab.

Click Data Sources for Which No Crawler Is Created.

Find the desired data source and click Create Crawler in the Actions column of the data source. In the Configure Collection Plan dialog box, configure the parameters.

Note

Parameters that you need to configure in the Configure Collection Plan dialog box vary based on the data source type.


Parameter or section	Description
Resource Group Name	Select the resource group that is connected to the data source whose metadata you want to collect. You can select one of the following resource groups in Data Map based on your business requirements: the default resource group named `default`; your exclusive resource group for scheduling; your exclusive resource group for Data Integration; your serverless resource group.
Test Network Connectivity	After you select a resource group, you can click Test Network Connectivity to re-test the network connectivity between the resource group and the data source whose metadata you want to collect. If the message "The connectivity test failed." is displayed, locate the cause as follows: (1) Check whether whitelist-based access control is enabled for the data source. For information about how to configure IP address whitelists for metadata collection from a data source, see Configure IP address whitelists for metadata collection. (2) If whitelist-based access control is not enabled for the data source, establish a network connection between the resource group and the data source. For more information, see Network connectivity and operations on resource groups.
Collection Plan	The metadata collection plan. Valid values: Manual Crawling, Monthly, Weekly, Daily, and Hourly. The system collects metadata from the data source based on the collection cycle that you specify, and the generated collection plan varies based on that cycle. Manual Crawling: You manually trigger the crawler to collect metadata of the data source to Data Map and update the collected metadata based on your business requirements. Monthly: The system automatically collects metadata of the data source once at a specified point in time on several specific days of each month. Important: Some months do not have a 29th, 30th, or 31st day, so we recommend that you do not select the last few days of a month. Weekly: The system automatically collects metadata of the data source once at a specified point in time on several specific days of each week. If you do not configure the Time parameter, the system automatically collects metadata of the data source once at 00:00:00 on the specific days of each week. Daily: The system automatically collects metadata of the data source once at a specified point in time of each day. Hourly: The system automatically collects metadata of the data source once on the `Nth` minute of each hour.
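To make the collection cycles above concrete, the following sketch computes when a plan would next fire. It is an illustration under stated assumptions, not the DataWorks scheduler: the function `next_run`, its parameters, and the lowercase plan names are hypothetical, and a monthly plan is assumed to skip any month that lacks the selected day.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence


def next_run(plan: str, now: datetime, *, days: Sequence[int] = (),
             hour: int = 0, minute: int = 0) -> Optional[datetime]:
    """Return the first scheduled run strictly after `now`, or None if there is none.

    plan: "manual", "monthly", "weekly", "daily", or "hourly".
    days: days of the month (1-31) for monthly plans,
          ISO weekdays (1 = Monday ... 7 = Sunday) for weekly plans.
    """
    if plan == "manual":
        return None  # manual plans run only when a user clicks Run
    if plan == "hourly":
        candidate = now.replace(minute=minute, second=0, microsecond=0)
        return candidate if candidate > now else candidate + timedelta(hours=1)

    # Day-based plans: scan forward one calendar day at a time.
    day = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    for _ in range(366):
        if day > now:
            if plan == "daily":
                return day
            if plan == "weekly" and day.isoweekday() in days:
                return day
            if plan == "monthly" and day.day in days:
                return day
        day += timedelta(days=1)
    return None


now = datetime(2024, 2, 1, 12, 0)
print(next_run("daily", now, hour=2))        # 2024-02-02 02:00:00
print(next_run("monthly", now, days=(31,)))  # 2024-03-31 00:00:00
```

The monthly call shows why selecting the last few days of a month is discouraged: with day 31 selected, February has no matching day, so under this sketch's assumption the next collection falls on March 31.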

Verify that the configurations of the crawler are correct and click Confirmation.
The system collects metadata of the data source based on the configured collection plan. If you select Manual Crawling, you can find the desired data source on the Data Sources for Which Crawler Is Created tab and click Run in the Actions column of the data source to manually collect the metadata of the data source based on your business requirements.

What to do next

After the metadata is collected, you can perform various operations on the metadata in Data Map. For example, you can check the overview of data, manage tables by category and group, and view data lineage. For more information, see View resource information, Search for tables, and Table management from the business perspective: Data albums.