Metadata collection in DataWorks Data Map allows you to centrally manage metadata from various data sources. Collected metadata is visible in Data Map. This topic describes how to create a crawler to collect metadata.
Overview
Metadata collection is essential for building an enterprise-level data map and managing data assets. It uses crawlers to automatically extract technical metadata (databases, tables, columns), data lineage, and partition information from DataWorks data sources (such as MaxCompute, Hologres, MySQL, and CDH Hive) across workspaces in the same region. This metadata is aggregated into DataWorks Data Map to provide a unified data view.
Metadata collection allows you to:
Build a unified data view: Break down data silos and centrally manage multi-source heterogeneous metadata.
Enable data discovery and search: Allow data consumers to quickly and accurately find the data they need.
Analyze full-link lineage: Trace data origins and destinations to facilitate impact analysis and troubleshooting.
Empower data governance: Perform data classification, grading, access control, quality monitoring, and lifecycle management based on complete metadata.
Billing
By default, each collection task consumes 0.25 CUs × task runtime. For more information, see Resource group fees. Each successful collection generates a scheduling instance. For more information, see Scheduling instance fees.
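The rule above is a simple product of the CU rate and the task runtime. As a rough illustration (the 2-hour runtime below is a made-up figure, not from this document):

```python
# Illustrative sketch only: estimate the resource consumption of one collection run.
# The 0.25 CU rate comes from the billing rule above; the runtime is hypothetical.
CU_RATE = 0.25          # CUs consumed per unit of task runtime (assumed here to be hours)
runtime_hours = 2.0     # hypothetical runtime of a single collection task

cu_consumed = CU_RATE * runtime_hours
print(f"Estimated consumption: {cu_consumed} CU-hours")  # 0.5 CU-hours

# Each successful run also produces one scheduling instance, billed separately
# (see "Scheduling instance fees").
```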
Limitations
If the data source uses whitelist access control, you must configure the database whitelist. For more information, see Metadata Collection Whitelist.
Cross-region metadata collection is not recommended. Ensure DataWorks and the data source are in the same region. To collect metadata across regions, use a public IP address when creating the data source. For more information, see Data Source Management.
The MySQL metadata crawler does not support OceanBase data sources.
Metadata collection is not supported for AnalyticDB for MySQL data sources with SSL enabled.
Entry point
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, click Go to Data Map.
In the left navigation pane, click to go to the metadata collection page.
Built-in crawlers
Built-in crawlers are preconfigured and run automatically by DataWorks in near real time to collect core metadata from engines integrated with DataWorks. You do not need to create them; you only need to manage their collection scope.
If you cannot find the target table in Data Map, go to the corresponding page to manually synchronize the table.
MaxCompute default crawler
This crawler collects metadata from MaxCompute projects in your account. On the details page, use Modify Data Scope to select projects and Permission Configurations to set metadata visibility within the tenant.
In the Built-in section of the metadata collection page, find the MaxCompute Default Crawler card and click Details.
The MaxCompute Default Crawler details page contains the Basic Information and Data Scope tabs.
Basic Information: Displays basic attributes of the crawler, such as collection type and mode. This information is read-only.
Data Scope: Manages which MaxCompute projects to collect.
Modify collection scope:
Click Data Scope and click Modify Data Scope.
In the dialog box, select or clear the MaxCompute projects to collect.
Important: The default scope includes all MaxCompute projects bound to workspaces in the current region under the current tenant. After you modify the scope, only metadata objects within the scope are visible in Data Map; metadata outside the scope is no longer visible.
Click OK to save the changes.
Configure metadata visibility:
In the Data Scope list, find the target project and click Permission Configurations in the Actions column.
Select a visibility policy based on your data governance requirements:
Public Within Tenant: All members in the tenant can search for and view metadata of this project.
Only members in the associated workspace can search and view: Only members of the associated workspaces can access the metadata of this project, ensuring data isolation.
DLF Default Crawler
To support real-time collection of DLF metadata, you must grant the Data Reader permission to the Service Linked Role AliyunServiceRoleForDataworksOnEmr in the DLF console.
The DLF Default Crawler collects metadata from Data Lake Formation (DLF) within your account.
In the Built-in section of the metadata collection page, find the DLF Default Crawler card and click Details to view basic information.
Click the Data Scope tab to view the list of DLF Catalogs included in the collection scope and their table counts.
By default, all accessible Catalogs (including DLF and DLF-Legacy versions) are collected.
Custom crawlers
Custom crawlers provide unified metadata management across environments and engines.
For conventional data sources
Custom crawlers are supported for traditional structured or semi-structured data sources such as Hologres, StarRocks, MySQL, Oracle, and CDH Hive. The system parses the physical table structure of the database to automatically extract and synchronize metadata such as field attributes, indexes, and partitions.
For metadata-type data sources (Catalog)
Custom crawlers can also directly collect metadata-type (Catalog) data sources, such as Paimon Catalog, which declare native lake format metadata that is not managed by DLF.
Create custom crawler
In the custom crawler list section of the metadata collection page, click Create Metadata Collection.
Select collection type: On the type selection page, select the target data source type to collect, such as Hologres or StarRocks.
Configure basic information and resource group:
Basic Configurations:
Select Workspace: Select the workspace containing the data source.
Select Data Source: Select a created target data source from the drop-down list. After selection, the system automatically displays details of the data source.
Name: Enter a name for the crawler for future identification. The default name is the same as the data source name.
Resource Group Configuration:
Resource Group: Select a resource group to run the collection task.
Test Network Connectivity: This step is critical. Click Test Network Connectivity to ensure that the resource group can successfully access the data source (see the illustrative sketch after this step).
Important: Check whether the data source has whitelist restrictions. To collect metadata from a data source with whitelist-based access control enabled, see Overview of network connectivity solutions and Configure a whitelist to configure the whitelist permissions.
If the data source does not have whitelist restrictions, see Network connectivity and operations on resource groups for network connectivity configuration.
If the connectivity test fails with the error "backend service call failed: test connectivity failed.not support data type", contact technical support to upgrade the resource group.
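As a rough local illustration of what the connectivity test verifies, the sketch below simply checks whether a database endpoint accepts TCP connections. The host, port, and timeout are hypothetical placeholders; the real test in DataWorks runs from the selected resource group, not from your machine, so a local success does not guarantee that the resource group can connect.

```python
import socket

# Hypothetical endpoint of the data source; replace with your own values.
DB_HOST = "rm-example.mysql.rds.aliyuncs.com"
DB_PORT = 3306
TIMEOUT_SECONDS = 5

def is_reachable(host: str, port: int, timeout: float) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    if is_reachable(DB_HOST, DB_PORT, TIMEOUT_SECONDS):
        print("Endpoint reachable.")
    else:
        print("Endpoint unreachable: check the whitelist and network configuration.")
```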
Configure metadata collection:
Collection Scope: Define the databases (Database/Schema) to collect. If the data source is database-granular, the corresponding database is selected by default. You can select additional databases outside the data source.
Important: A database can be configured in only one crawler. If a database cannot be selected, it is already being collected by another crawler.
If you narrow the collection scope, metadata outside the scope becomes unsearchable in Data Map.
Configure Intelligent Enhancement Settings and Collection Plan:
Intelligent Enhancement Settings (Beta):
AI Collection Description: When enabled, the system uses LLMs to automatically generate business descriptions for your tables and fields after metadata collection, greatly improving metadata readability and usability. After collection is complete, you can view the AI-generated information (such as table remarks and field descriptions) on the details page of the table object in Data Map.
Collection Plan:
Trigger Mode: Select Manual or Periodic.
Manual: The crawler runs only when manually triggered. This applies to one-time or on-demand collection.
Periodic: Configure a scheduled task (such as monthly, daily, weekly, or hourly). The system will automatically update metadata periodically.
To configure a minute-level schedule, select hourly collection and select all minute options; this effectively runs the task at 5-minute intervals (see the sketch after this step).
Important: Periodic collection is supported only for production environment data sources.
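The minute-level tip above can be pictured as follows. This is a toy sketch that assumes the hourly plan exposes minute options at a 5-minute granularity; that granularity is an assumption for illustration, not a statement about the console UI.

```python
# Toy illustration: an hourly plan with every minute option selected
# effectively fires at 5-minute intervals (granularity assumed for this sketch).
selected_minutes = list(range(0, 60, 5))  # 0, 5, 10, ..., 55

first_hour_triggers = [f"00:{m:02d}" for m in selected_minutes]
print(first_hour_triggers)  # ['00:00', '00:05', ..., '00:55']
```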
Save configuration: Click Save or Save and Run to complete the creation of the crawler.
Manage custom crawlers
After a crawler is created, it appears in the custom list. You can perform the following management operations:
List operations: In the list, you can directly Run, Stop, or Delete the crawler. Use the Filter and Search features at the top to quickly locate the target crawler.
Important: Deleting a metadata crawler removes its collected metadata objects from Data Map. Users can no longer search for or view these objects. This action cannot be undone.
View details and logs: Click the crawler name to view its details.
Basic Information: View all configuration items of the crawler.
Data Scope: View or Modify Data Scope.
If viewed before collection, the table count and latest update time will be empty.
The following data sources do not support scope modification: EMR Hive, CDH Hive, Lindorm, ElasticSearch, Tablestore (OTS), MongoDB, and AnalyticDB for Spark within AnalyticDB for MySQL.
Run Logs: Track the execution history of each collection task. You can view the start time, duration, status, and volume of collected data. When a task fails, clicking View Logs is the key entry point for locating and resolving issues.
Manually execute collection: In the upper-right corner, click Collect Metadata to immediately trigger a collection task. Use this to immediately view a newly created table in Data Map.
Next steps
After metadata is collected, you can use Data Map to:
Search for your collected tables in Data Map and view their details, field information, partitions, and data preview. For more information, see Metadata details.
Analyze upstream and downstream lineage relationships of tables to understand the full data processing link. For more information, see View lineages.
Add assets to data albums to organize and manage your data from a business perspective. For more information, see Data albums.
FAQ
Q: What should I do if metadata collection times out or fails for database data sources such as MySQL?
A: Ensure that the vSwitch CIDR block of the resource group is added to the database whitelist of the data source.
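To double-check the whitelist entry, you can verify that the address you expect the resource group to use actually falls within the vSwitch CIDR block you added. The CIDR block and IP address below are placeholders; a minimal check with Python's standard ipaddress module might look like this:

```python
import ipaddress

# Hypothetical values: the vSwitch CIDR block added to the whitelist and an address to verify.
VSWITCH_CIDR = "192.168.1.0/24"
ADDRESS_TO_CHECK = "192.168.1.37"

network = ipaddress.ip_network(VSWITCH_CIDR)
address = ipaddress.ip_address(ADDRESS_TO_CHECK)

if address in network:
    print(f"{ADDRESS_TO_CHECK} is covered by the whitelisted CIDR block {VSWITCH_CIDR}.")
else:
    print(f"{ADDRESS_TO_CHECK} is NOT covered; add the correct CIDR block to the database whitelist.")
```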
Collection scope and timeliness
Data tables
Data Source Type | Collection Mode | Collection granularity | Update timeliness (table/field) | Update timeliness (partition) | Update timeliness (lineage) |
MaxCompute | System default auto-collection | Instance | Standard project: Real-time; External project: T+1 | Chinese mainland regions: Real-time; Overseas regions: T+1 | Real-time |
Data Lake Formation (DLF) | System default auto-collection | Instance | Real-time | Real-time | Lineage is supported for DLF metadata of the Serverless Spark, Serverless StarRocks, and Serverless Flink engines; other engines are not supported. Important: For EMR clusters, you must enable EMR_HOOK. |
Hologres | Manually create crawler | Database | Depends on schedule | Real-time | |
EMR Hive | Manually create crawler | Instance | Depends on schedule | Depends on schedule | Real-time. Important: You must enable EMR_HOOK for the cluster. |
CDH Hive | Manually create crawler | Instance | Depends on schedule | Real-time | Real-time |
StarRocks | Manually create crawler | Database | | | Real-time. Important: Lineage collection is supported only in Instance Mode; lineage cannot be collected in Connection String Mode. |
AnalyticDB for MySQL | Manually create crawler | Database | Depends on schedule | | Real-time. Note: You must submit a ticket to enable the data lineage feature for AnalyticDB for MySQL instances. |
AnalyticDB for Spark | Manually create crawler | Instance | Real-time | Real-time | |
AnalyticDB for PostgreSQL | Manually create crawler | Database | Depends on schedule | Real-time | |
Lindorm | Manually create crawler | Instance | Depends on schedule | Real-time | |
Tablestore (OTS) | Manually create crawler | Instance | Depends on schedule | | |
MongoDB | Manually create crawler | Instance | Depends on schedule | | |
ElasticSearch | Manually create crawler | Instance | Depends on schedule | T+1 update | |
Paimon Catalog | Manually create crawler | Catalog | Depends on schedule | Depends on schedule | |
Other data source types (MySQL, PostgreSQL, SQL Server, Oracle, ClickHouse, SelectDB, etc.) | Manually create crawler | Database | Depends on schedule | | |
AnalyticDB for Spark and AnalyticDB for MySQL use the same metadata collection entry point.
Task code
Data Map supports searching for and quickly locating task code. The following table describes the supported scope.
Code source | Collection scope | Trigger method |
Data Studio | Data Studio - Create node and edit code | Auto collection |
Data Studio (Legacy) | Data Studio (Legacy) - Create node and edit code | |
Data Analysis | Data Analysis - Create SQL query and edit code | |
DataService Studio | DataService Studio - Create API data push service |
API assets
Data Map supports viewing DataService Studio API metadata:
API Type | Collection scope | Trigger method |
Generated API (Codeless UI) | DataService Studio - Create API via codeless UI | Auto collection |
Generated API (Code editor) | DataService Studio - Create API via code editor | |
Registered API | DataService Studio - Register API | |
Service Orchestration | DataService Studio - Create Service Orchestration |
AI assets
Data Map supports viewing and managing AI assets, and provides AI asset lineage to trace the origin, usage, and evolution of data and models. The following table describes the support for AI assets.
Type | Collection scope | Trigger method |
Dataset | | Auto collection |
AI Model | PAI - Model training task/Register model/Deploy model service | |
Algorithm Task | PAI - Training task/Workflow task/Distributed training task | |
Model Service | PAI - Deploy model service (EAS deployment) |
Workspace
Data Map supports viewing workspace metadata:
Project | Collection scope | Trigger method |
Workspace | DataWorks - Create workspace | Auto collection |