
DataWorks: Metadata collection

Last Updated: Jan 23, 2026

Metadata collection in DataWorks Data Map allows you to centrally manage metadata from various data sources. Collected metadata is visible in Data Map. This topic describes how to create a crawler to collect metadata.

Overview

Metadata collection is essential for building an enterprise-level data map and managing data assets. It uses crawlers to automatically extract technical metadata (databases, tables, columns), data lineage, and partition information from DataWorks data sources (such as MaxCompute, Hologres, MySQL, and CDH Hive) across workspaces in the same region. This metadata is aggregated into DataWorks Data Map to provide a unified data view.

Metadata collection allows you to:

  • Build a unified data view: Break down data silos and centrally manage multi-source heterogeneous metadata.

  • Enable data discovery and search: Allow data consumers to quickly and accurately find the data they need.

  • Analyze full-link lineage: Trace data origins and destinations to facilitate impact analysis and troubleshooting (a conceptual sketch follows this list).

  • Empower data governance: Perform data classification, grading, access control, quality monitoring, and lifecycle management based on complete metadata.
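
Conceptually, the impact analysis mentioned above is a reachability question over the lineage graph that Data Map builds from collected metadata. The following minimal Python sketch (an illustration with a hypothetical in-memory graph, not a DataWorks API) shows how downstream impact analysis works:

```python
from collections import deque

# Hypothetical lineage graph for illustration:
# table -> list of directly downstream tables.
lineage = {
    "ods_orders": ["dwd_orders"],
    "dwd_orders": ["dws_order_stats", "ads_order_report"],
    "dws_order_stats": ["ads_order_report"],
}

def downstream_impact(table: str) -> set:
    """Return every table transitively downstream of `table` (BFS)."""
    seen, queue = set(), deque([table])
    while queue:
        for nxt in lineage.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# A change to ods_orders affects all three downstream tables.
print(downstream_impact("ods_orders"))
```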

Billing

By default, each collection task consumes resources equal to 0.25 CUs × the task runtime. For more information, see Resource group fees. Each successful collection also generates a scheduling instance. For more information, see Scheduling instance fees.
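
For a rough sense of the formula, the sketch below works through one task run in Python; the CU unit price is a placeholder assumption, so check Resource group fees for actual rates.

```python
# Worked example of the consumption formula: 0.25 CUs x task runtime.
cu_rate = 0.25            # CUs consumed while the task runs
runtime_hours = 0.5       # e.g. a 30-minute collection run
price_per_cu_hour = 1.0   # placeholder assumption, not an actual rate

cu_hours = cu_rate * runtime_hours          # 0.125 CU-hours
estimated_fee = cu_hours * price_per_cu_hour
print(f"CU-hours: {cu_hours}, estimated fee: {estimated_fee}")
```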

Limitations

  • If the data source uses whitelist access control, you must configure the database whitelist. For more information, see Metadata Collection Whitelist.

  • Cross-region metadata collection is not recommended; ensure that DataWorks and the data source are in the same region. If you must collect metadata across regions, use a public IP address when creating the data source. For more information, see Data Source Management.

  • The MySQL metadata crawler does not support OceanBase data sources.

  • Metadata collection is not supported for AnalyticDB for MySQL data sources with SSL enabled.

Entry point

  1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Governance > Data Map. On the page that appears, click Go to Data Map.

  2. In the left-side navigation pane, click the metadata collection icon to go to the metadata collection page.

Built-in crawlers

Built-in crawlers are preconfigured and run automatically by DataWorks in near real time. They collect core metadata from engines integrated with DataWorks. You do not need to create them; you only need to manage their scope.

Important

If you cannot find the target table in Data Map, go to My Data > My Tools > Refresh Table Metadata to manually sync the table.

MaxCompute default crawler

This crawler collects metadata from MaxCompute projects in your account. On the details page, use Modify Data Scope to select projects and Permission Configurations to set metadata visibility within the tenant.

  1. In the Built-in section of the metadata collection page, find the MaxCompute Default Crawler card and click Details.

  2. The MaxCompute Default Crawler details page contains the Basic Information and Data Scope tabs.

    • Basic Information: Displays basic attributes of the crawler, such as collection type and mode. This information is read-only.

    • Data Scope: Manages which MaxCompute projects to collect.

  3. Modify collection scope:

    1. Click Data Scope and click Modify Data Scope.

    2. In the dialog box, select or clear the MaxCompute projects to collect.

      Important

      The default scope includes all MaxCompute projects bound to workspaces in the current region under the current tenant. After the scope is modified, only metadata objects within the scope are visible in Data Map. Metadata that is not selected will be invisible.

    3. Click OK to save the changes.

  4. Configure metadata visibility:

    • In the Data Scope list, find the target project and click Permission Configurations in the Actions column.

    • Select a visibility policy based on your data governance requirements:

      • Public Within Tenant: All members in the tenant can search for and view metadata of this project.

      • Only members in the associated workspace can search and view: Only members of the associated workspaces can access metadata of this project, ensuring data isolation.

DLF Default Crawler

Important

To support real-time collection of DLF metadata, you must grant the Data Reader permission to the Service Linked Role AliyunServiceRoleForDataworksOnEmr in the DLF console.

The DLF Default Crawler collects metadata from Data Lake Formation (DLF) within your account.

  1. In the Built-in section of the metadata collection page, find the DLF Default Crawler card and click Details to view basic information.

  2. Click the Data Scope tab to view the list of DLF Catalogs included in the collection scope and their table counts.

    By default, all accessible Catalogs (including DLF and DLF-Legacy versions) are collected.

Custom crawlers

Custom crawlers provide unified metadata management across environments and engines.

  • For conventional data sources

    Supports custom crawlers for traditional structured or semi-structured data sources such as Hologres, StarRocks, MySQL, Oracle, and CDH Hive. The system parses the physical table structure to automatically extract and synchronize metadata such as field attributes, indexes, and partitions (a conceptual sketch follows this list).

  • For metadata-type data sources (Catalog)

    Supports direct collection from metadata-type data sources, such as a Paimon Catalog, that hold self-managed native lake format metadata not managed by DLF.
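
As a rough mental model of what such a crawler does (an illustrative sketch, not the actual DataWorks implementation), the following Python snippet reads table and column metadata from a MySQL information_schema. The connection parameters are placeholders, and the pymysql dependency is an assumption.

```python
import pymysql  # assumed dependency: pip install pymysql

# Placeholder connection parameters for illustration only.
conn = pymysql.connect(host="mysql.example.internal", user="meta_reader",
                       password="***", database="information_schema")

with conn.cursor() as cur:
    # Column-level attributes, comparable to what a metadata crawler extracts.
    cur.execute("""
        SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, COLUMN_COMMENT
        FROM COLUMNS
        WHERE TABLE_SCHEMA = %s
        ORDER BY TABLE_NAME, ORDINAL_POSITION
    """, ("my_business_db",))
    for table, column, dtype, comment in cur.fetchall():
        print(table, column, dtype, comment)
conn.close()
```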

Create custom crawler

  1. In the custom crawler list section of the metadata collection page, click Create Metadata Collection.

  2. Select collection type: On the type selection page, select the target data source type to collect, such as Hologres or StarRocks.

  3. Configure basic information and resource group:

    • Basic Configurations:

      • Select Workspace: Select the workspace containing the data source.

      • Select Data Source: Select a created target data source from the drop-down list. After selection, the system automatically displays details of the data source.

      • Name: Enter a name for the crawler for future identification. The default name is the same as the data source name.

    • Resource Group Configuration:

      • Resource Group: Select a resource group to run the collection task.

      • Test Network Connectivity: This step is critical. Click Test Network Connectivity to ensure the resource group can successfully access the data source.

  4. Configure metadata collection:

    • Collection Scope: Define the databases (Database/Schema) to collect. If the data source is database-granular, the corresponding database is selected by default. You can select additional databases outside the data source.

      Important
      • A database can be configured in only one crawler. If a database cannot be selected, it is already being collected by another crawler.

      • If you narrow the collection scope, metadata outside the scope becomes unsearchable in Data Map.

  5. Configure Intelligent Enhancement Settings and Collection Plan:

    • Intelligent Enhancement Settings (Beta):

      • AI Collection Description: When enabled, the system uses LLMs to automatically generate business descriptions for your tables and fields after metadata collection, greatly improving metadata readability and usability. After collection is complete, you can view the AI-generated information (such as table remarks and field descriptions) on the details page of the table object in Data Map.

    • Collection Plan:

      • Trigger Mode: Select Manual or Periodic.

        • Manual: The crawler runs only when manually triggered. This applies to one-time or on-demand collection.

        • Periodic: Configure a scheduled task (such as monthly, daily, weekly, or hourly). The system will automatically update metadata periodically.

          To configure a minute-level schedule, select hourly collection and select all of the minute options; the task then runs at 5-minute intervals.
          Important

          Periodic collection is supported only for production environment data sources.

  6. Save configuration: Click Save or Save and Run to complete the creation of the crawler.

Manage custom crawlers

After a crawler is created, it appears in the custom list. You can perform the following management operations:

  • List operations: In the list, you can directly Run, Stop, or Delete the crawler. Use the Filter and Search features at the top to quickly locate the target crawler.

    Important

    Deleting a metadata crawler removes its collected metadata objects from Data Map; users can no longer search for or view these objects. This action cannot be undone.

  • View details and logs: Click the crawler name to view its details.

    • Basic Information: View all configuration items of the crawler.

    • Data Scope: View or Modify Data Scope.

      • If viewed before collection, the table count and latest update time are empty.

      • The following data sources do not support scope modification: EMR Hive, CDH Hive, Lindorm, ElasticSearch, Tablestore (OTS), MongoDB, and AnalyticDB for Spark within AnalyticDB for MySQL.
    • Run Logs: Track the execution history of each collection task. You can view the start time, duration, status, and volume of collected data. When a task fails, clicking View Logs is the key entry point for locating and resolving issues.

  • Manually execute collection: In the upper-right corner, click Collect Metadata to immediately trigger a collection task. Use this when you need a newly created table to appear in Data Map right away.

Next steps

After metadata is collected, you can use Data Map to:

  • Search for your collected tables in Data Map and view their details, field information, partitions, and data preview. For more information, see Metadata details.

  • Analyze upstream and downstream lineage relationships of tables to understand the full data processing link. For more information, see View lineages.

  • Add assets to data albums to organize and manage your data from a business perspective. For more information, see Data albums.

FAQ

  • Q: What should I do if metadata collection times out or fails for database sources such as MySQL?

    A: Ensure that the vSwitch CIDR block of the resource group is added to the database's whitelist.
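
A quick way to sanity-check the whitelist entry is to confirm that the client IP recorded by the database falls inside the vSwitch CIDR block you added. The following minimal sketch uses Python's standard ipaddress module; the addresses are placeholders for illustration.

```python
import ipaddress

# Placeholder values: replace with your resource group's vSwitch CIDR block
# and the client IP observed in the database's connection logs.
vswitch_cidr = ipaddress.ip_network("192.168.10.0/24")
client_ip = ipaddress.ip_address("192.168.10.37")

# Membership test: does the whitelisted CIDR block cover the client IP?
if client_ip in vswitch_cidr:
    print(f"{client_ip} is covered by {vswitch_cidr}; the whitelist entry should work.")
else:
    print(f"{client_ip} is NOT in {vswitch_cidr}; re-check the vSwitch CIDR block.")
```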

Collection scope and timeliness

Data tables

The last three columns describe update timeliness for table/field metadata, partitions, and lineage, respectively.

| Data Source Type | Collection Mode | Collection granularity | Table/field | Partition | Lineage |
| --- | --- | --- | --- | --- | --- |
| MaxCompute | System default auto-collection | Instance | Standard project: Real-time. External project: T+1 | Chinese mainland regions: Real-time. Overseas regions: T+1 | Real-time |
| Data Lake Formation (DLF) | System default auto-collection | Instance | Real-time | Real-time | Supported for DLF metadata of the Serverless Spark, Serverless StarRocks, and Serverless Flink engines; other engines are not supported. Important: For EMR clusters, you must enable EMR_HOOK. |
| Hologres | Manually create crawler | Database | Depends on schedule | Not supported | Real-time |
| EMR Hive | Manually create crawler | Instance | Depends on schedule | Depends on schedule | Real-time. Important: You must enable EMR_HOOK for the cluster. |
| CDH Hive | Manually create crawler | Instance | Depends on schedule | Real-time | Real-time |
| StarRocks | Manually create crawler | Database | Instance Mode: Real-time. Connection String Mode: Depends on schedule | Not supported | Real-time. Important: Lineage collection is supported only in Instance Mode, not in Connection String Mode. |
| AnalyticDB for MySQL | Manually create crawler | Database | Depends on schedule | Not supported | Real-time. Note: You need to submit a ticket to enable the data lineage feature for AnalyticDB for MySQL instances. |
| AnalyticDB for Spark | Manually create crawler | Instance | Real-time | Not supported | Real-time |
| AnalyticDB for PostgreSQL | Manually create crawler | Database | Depends on schedule | Not supported | Real-time |
| Lindorm | Manually create crawler | Instance | Depends on schedule | Not supported | Real-time |
| Tablestore (OTS) | Manually create crawler | Instance | Depends on schedule | Not supported | Not supported |
| MongoDB | Manually create crawler | Instance | Depends on schedule | Not supported | Not supported |
| ElasticSearch | Manually create crawler | Instance | Depends on schedule | Not supported | T+1 update |
| Paimon Catalog | Manually create crawler | Catalog | Depends on schedule | Depends on schedule | Not supported |
| Other data source types (MySQL, PostgreSQL, SQL Server, Oracle, ClickHouse, SelectDB, etc.) | Manually create crawler | Database | Depends on schedule | Not supported | Not supported |

Note

AnalyticDB for Spark and AnalyticDB for MySQL use the same metadata collection entry point.

Task code

Data Map supports code search and quick location. The following table describes the supported scope.

| Code source | Collection scope | Trigger method |
| --- | --- | --- |
| Data Studio | Data Studio - Create node and edit code | Auto collection |
| Data Studio (Legacy) | Data Studio (Legacy) - Create node and edit code | Auto collection |
| Data Analysis | Data Analysis - Create SQL query and edit code | Auto collection |
| DataService Studio | DataService Studio - Create API data push service | Auto collection |

API assets

Data Map supports viewing DataService Studio API metadata:

| API Type | Collection scope | Trigger method |
| --- | --- | --- |
| Generated API (Codeless UI) | DataService Studio - Create API via codeless UI | Auto collection |
| Generated API (Code editor) | DataService Studio - Create API via code editor | Auto collection |
| Registered API | DataService Studio - Register API | Auto collection |
| Service Orchestration | DataService Studio - Create Service Orchestration | Auto collection |

AI assets

Data Map supports viewing and managing AI assets, and provides AI asset lineage to trace the origin, usage, and evolution of data and models. The following table describes the support for AI assets.

| Type | Collection scope | Trigger method |
| --- | --- | --- |
| Dataset | PAI - Create dataset/Register dataset; DataWorks - Create dataset | Auto collection |
| AI Model | PAI - Model training task/Register model/Deploy model service | Auto collection |
| Algorithm Task | PAI - Training task/Workflow task/Distributed training task | Auto collection |
| Model Service | PAI - Deploy model service (EAS deployment) | Auto collection |

Workspace

Data Map supports viewing workspace metadata:

| Type | Collection scope | Trigger method |
| --- | --- | --- |
| Workspace | DataWorks - Create workspace | Auto collection |