DataWorks Data Map provides a metadata collection feature that collects metadata from various data sources into Data Map, where you can manage the metadata in a centralized manner and view it by data source. This topic describes how to create a crawler to collect metadata from a data source into DataWorks.
Prerequisites
A data source is added to a workspace. For information about how to add a data source, see the topics in the Data source management directory.
Overview
After you add a data source to a workspace, DataWorks can collect metadata of the data source. After you enable the metadata collection feature in Data Map, DataWorks collects all existing metadata in one full collection, collects incremental metadata every day, and aggregates the full and incremental metadata in Data Map. You can then perform various operations on the metadata in Data Map. For example, you can view the data overview, manage tables by category and group, and view data lineage.
If the default collection plan does not meet your business requirements, you can modify the collection plan. For more information, see Manage metadata crawlers.
After you associate a MaxCompute data source or an E-MapReduce (EMR) data source that uses Data Lake Formation (DLF) for metadata storage with DataStudio, the system automatically performs O&M operations on the crawler that is used to collect metadata from the MaxCompute or EMR data source. You do not need to manually manage the crawler.
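The relationship between the one-time full collection and the daily incremental collections can be pictured with a small conceptual model. The following Python sketch is illustrative only and does not reflect how DataWorks is implemented: the names Catalog, Delta, apply_increment, and aggregate are hypothetical, and the convention that a value of None marks a dropped table is an assumption made for the example.

```python
from typing import Dict, Iterable, Optional

# Hypothetical model: table name -> metadata. In a delta, None marks a dropped table (assumption).
Catalog = Dict[str, dict]
Delta = Dict[str, Optional[dict]]


def apply_increment(catalog: Catalog, delta: Delta) -> Catalog:
    """Return a new catalog with one day's incremental changes applied."""
    merged = dict(catalog)
    for table, meta in delta.items():
        if meta is None:
            merged.pop(table, None)  # the table was dropped
        else:
            merged[table] = meta  # the table was created or altered
    return merged


def aggregate(full_snapshot: Catalog, daily_deltas: Iterable[Delta]) -> Catalog:
    """Start from the full collection and fold in each daily increment in order."""
    catalog = dict(full_snapshot)
    for delta in daily_deltas:
        catalog = apply_increment(catalog, delta)
    return catalog


# Sample data invented for the example.
full = {"orders": {"columns": ["id", "amount"]}, "users": {"columns": ["id"]}}
deltas = [
    {"orders": {"columns": ["id", "amount", "currency"]}},  # day 1: column added
    {"users": None, "events": {"columns": ["ts"]}},  # day 2: one table dropped, one created
]
result = aggregate(full, deltas)
print(sorted(result))
```

In this model the full snapshot is never modified in place, so the same starting point can be replayed with a different series of increments.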
Supported data source types and metadata collection methods
Data source type | Metadata collection method | Crawler available in Data Map | Update timeliness: table/field | Update timeliness: partition | Update timeliness: data lineage |
MaxCompute | | No | Regular project: real-time; external project: T+1 | Region in the Chinese mainland: real-time; region outside China: T+1 | T+1 |
EMR (metadata storage method: DLF). Note: Make sure that EMR_HOOK is enabled for the cluster. | | No | Real-time | Real-time | Real-time |
EMR (metadata storage method: HMS or RDS). Note: Make sure that EMR_HOOK is enabled for the cluster. | | Yes | Real-time | Real-time | Real-time |
Hologres | | Yes | Depends on the custom collection plan | Not supported | Real-time |
AnalyticDB for PostgreSQL | | Yes | Depends on the custom collection plan | Not supported | Real-time |
AnalyticDB for MySQL | | Yes | Depends on the custom collection plan | Not supported | Real-time. Note: You must submit a ticket to enable the data lineage feature for an AnalyticDB for MySQL instance. |
CDH Hive | | Yes | Depends on the custom collection plan | Real-time | Real-time |
DLF | Automatic metadata collection | No | Real-time | Real-time | Not applicable |
Other data source types, such as MySQL, PostgreSQL, SQL Server, Oracle, Tablestore, StarRocks, and ClickHouse | | Yes | Depends on the custom collection plan | Not supported | Not supported |
Limits
You can collect only the metadata of data sources that you configured in the workspaces to which the current logon account belongs. If you want to collect metadata of data sources in another workspace, you can contact the workspace administrator to add your account to the workspace as a member. For more information, see Add workspace members and assign roles to them.
If you want to collect metadata of a data source for which whitelist-based access control is enabled, you must add the CIDR blocks or IP addresses of DataWorks in the region where the related workspace resides to the IP address whitelist of the data source. For more information, see Configure IP address whitelists for metadata collection.
We recommend that you do not collect metadata of a data source that resides in a different region from your workspace. If you want to collect metadata across regions, configure a public network address when you create a data source. For more information, see Add and manage data sources.
You cannot use a MySQL metadata crawler to collect the metadata of an OceanBase data source.
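For the whitelist limit described above, it can help to confirm that a given address is covered by the entries you plan to add to the whitelist of the data source. The following Python sketch uses the standard ipaddress module. The function name is_whitelisted and all addresses are hypothetical placeholders from the reserved documentation ranges. They are not the actual DataWorks CIDR blocks, which are listed in Configure IP address whitelists for metadata collection.

```python
import ipaddress
from typing import Iterable


def is_whitelisted(address: str, whitelist: Iterable[str]) -> bool:
    """Return True if the address matches any whitelist entry (single IP address or CIDR block)."""
    ip = ipaddress.ip_address(address)
    # strict=False accepts entries whose host bits are set, such as 192.0.2.1/24.
    return any(ip in ipaddress.ip_network(entry, strict=False) for entry in whitelist)


# Placeholder entries only (assumption): one CIDR block and one single IP address.
whitelist = ["192.0.2.0/24", "198.51.100.17"]

print(is_whitelisted("192.0.2.45", whitelist))     # covered by the CIDR block
print(is_whitelisted("198.51.100.17", whitelist))  # matches the single IP address
print(is_whitelisted("203.0.113.9", whitelist))    # not covered by any entry
```

A check of this kind only validates the whitelist entries themselves. It does not replace the connectivity test in Data Map.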
Procedure
In the left-side navigation pane of the Data Map page, click Collect Metadata.
On the page that appears, you can switch between Data Source Perspective and Workspace Perspective to view or manage metadata collection from the selected perspective.
If you select Data Source Perspective, you can view the types of data sources that you configured in the workspaces to which the current logon account belongs and manage crawlers by data source type.
If you select Workspace Perspective, you can view the workspaces to which the current logon account belongs and manage the crawlers for data sources by workspace. If no data source is available in a workspace, you can click Create Data Source to go to the Data Source page and create a data source in SettingCenter.
View metadata crawlers
Overall statistics on metadata collection
On the Collect Metadata page, you can switch between Data Source Perspective and Workspace Perspective to view the overall information about metadata collection. You can view the number of data sources for which a crawler is created in the selected perspective.
Metadata collection details
To view the details of metadata collection, you can click the desired data source type or workspace, or click Manage in the upper-right corner of the desired data source type or workspace on the Data Source Perspective or Workspace Perspective tab. On the Data Sources for Which Crawler Is Created tab, you can view the following information about a crawler: Status, Execution Plan, Last Run At, Last Running Time/s, Average Running Time/s, and Tables Found During Last Run.
Manage metadata crawlers
Click Manage in the upper-right corner of the desired data source type or workspace. The Data Sources for Which Crawler Is Created tab appears. On this tab, you can view the list of data sources of the selected data source type or the list of data sources for which a crawler is created in the selected workspace. You can perform the following operations on an existing crawler.
Run a metadata crawler
You can manually run a metadata crawler. To run a metadata crawler, find the desired data source on the Data Sources for Which Crawler Is Created tab and click Run in the Actions column of the data source to collect the metadata of the data source once.
Modify the collection plan of a metadata crawler
Go to the Data Sources for Which Crawler Is Created tab, find the desired data source, and then click Edit in the Actions column of the data source to modify the collection plan of the metadata crawler. The collection plans include manual metadata collection and periodic metadata collection.
Manual metadata collection: If you configure this collection plan for the metadata crawler of a data source, you must manually trigger the crawler each time you want to collect metadata of the data source to Data Map and update the collected metadata.
Periodic metadata collection: If you configure this collection plan for the metadata crawler of a data source, you do not need to manually trigger the crawler. The system periodically collects metadata of the data source to Data Map and updates the collected metadata based on the collection plan.
Delete a metadata crawler
Go to the Data Sources for Which Crawler Is Created tab, find the desired data source, and then click Remove in the Actions column of the data source to delete the metadata crawler of the data source. After you delete the metadata crawler of the data source, the data source is moved to the Data Sources for Which No Crawler Is Created tab and the metadata of the data source is no longer collected.
Create a metadata crawler
After you add a data source or register a cluster to a workspace, you can go to Data Map to enable the metadata collection feature. You can view information about metadata collection for the data source or cluster on the Data Sources for Which Crawler Is Created tab.
If you want to recollect the metadata of a data source after you delete the metadata crawler of the data source, you can create a metadata crawler for the data source on the Data Sources for Which No Crawler Is Created tab.
Click Data Sources for Which No Crawler Is Created.
Find the desired data source and click Create Crawler in the Actions column of the data source. In the Configure Collection Plan dialog box, configure the parameters.
Note: The parameters that you need to configure in the Configure Collection Plan dialog box vary based on the data source type.
Parameter
Description
Resource Group Name
Select the resource group that is connected to the data source whose metadata you want to collect. You can select one of the following resource groups in Data Map based on your business requirements:
Default resource group named default
Your exclusive resource group for scheduling
Your exclusive resource group for Data Integration
Test Network Connectivity
After you select a resource group, you can click Test Network Connectivity to re-test the network connectivity between the resource group and the data source whose metadata you want to collect. If the message "The connectivity test failed." is displayed, use the following instructions to locate the cause:
Check whether whitelist-based access control is enabled for the data source. For information about how to configure IP address whitelists for metadata collection from a data source, see Configure IP address whitelists for metadata collection.
If whitelist-based access control is not enabled for the data source, you can establish a network connection between the resource group and data source. For more information, see Network connectivity and operations on resource groups.
Collection Plan
The metadata collection plan. Valid values: Manual Crawling, Monthly, Weekly, Daily, and Hourly. The system generates the collection plan from the collection cycle that you specify and collects metadata from the data source accordingly.
Manual Crawling: You can manually trigger the crawler to collect metadata of the data source to Data Map and update the collected metadata based on your business requirements.
Monthly: The system automatically collects metadata of the data source once at the specified time on each of the days of the month that you select.
Important: Some months do not have a 29th, 30th, or 31st day. We recommend that you do not select the last few days of a month.
Weekly: The system automatically collects metadata of the data source once at the specified time on each of the days of the week that you select.
If you do not configure the Time parameter, the system automatically collects metadata of the data source once at 00:00:00 on the specific days of each week.
Daily: The system automatically collects metadata of the data source once at a specified point in time of each day.
Hourly: The system automatically collects metadata of the data source once on the Nth minute of each hour.
Verify that the configurations of the crawler are correct and click Confirmation.
The system collects metadata of the data source based on the configured collection plan. If you select Manual Crawling, you can find the desired data source on the Data Sources for Which Crawler Is Created tab and click Run in the Actions column of the data source to manually collect the metadata of the data source based on your business requirements.
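To see why the last few days of a month are a poor choice for the Monthly cycle, and how the Hourly cycle resolves to a concrete time, you can work through the calendar arithmetic. The following Python sketch is a simplified illustration and not the DataWorks scheduler. The function names are hypothetical, and the sketch assumes that a monthly run is skipped when the selected day does not exist in a month.

```python
import calendar
from datetime import date, datetime, timedelta
from typing import List


def months_skipped(year: int, day: int) -> List[int]:
    """Return the months of the year that do not contain the selected day of the month."""
    return [m for m in range(1, 13) if calendar.monthrange(year, m)[1] < day]


def monthly_run_dates(year: int, days_of_month: List[int]) -> List[date]:
    """Return the dates on which a Monthly plan would run (assumption: missing days are skipped)."""
    runs = []
    for month in range(1, 13):
        last_day = calendar.monthrange(year, month)[1]
        for day in sorted(days_of_month):
            if day <= last_day:
                runs.append(date(year, month, day))
    return runs


def next_hourly_run(now: datetime, minute: int) -> datetime:
    """Return the first time strictly after `now` that falls on the Nth minute of an hour."""
    candidate = now.replace(minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(hours=1)
    return candidate


print(months_skipped(2025, 31))            # months without a 31st day
print(len(monthly_run_dates(2025, [31])))  # number of runs in the year if only day 31 is selected
print(next_hourly_run(datetime(2025, 1, 1, 10, 45), 30))
```

Under this assumption, selecting only the 31st day yields a run in seven months of the year, whereas selecting a day from 1 to 28 yields a run in every month.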
What to do next
After the metadata is collected, you can perform various operations on the metadata in Data Map. For example, you can check the overview of data, manage tables by category and group, and view data lineage. For more information, see View resource information, Search for tables, and Table management from the business perspective: Data albums.