
DataWorks: Data sources and synchronization

Last Updated: Feb 14, 2026

DataWorks Data Integration synchronizes data across multiple data sources, including MySQL, MaxCompute, Hologres, and Kafka. It offers solutions for various scenarios, such as T+1 batch extract, transform, and load (ETL), real-time data synchronization with latency in seconds, and full database migration.

Synchronization solution

| Synchronization type | Source granularity | Target granularity | Timeliness | Scenario |
| --- | --- | --- | --- | --- |
| Single-table batch | Single table | Single table/partition | T+1 or periodic | Periodic full load, periodic incremental load |
| Sharded database and table batch | Sharded tables | Single table/partition | T+1 or periodic | Periodic full load, periodic incremental load |
| Single-table real-time | Single table | Single table/partition | Seconds to minutes | Change Data Capture (CDC) |
| Whole-database batch | Full database or multiple tables | Multiple corresponding tables and their partitions | One-time or periodic | One-time or periodic full loads, one-time or periodic incremental loads, or a one-time full load followed by periodic incremental loads |
| Whole-database real-time | Full database or multiple tables | Multiple corresponding tables and their partitions | Seconds to minutes | Full load + Change Data Capture (CDC) |
| Whole-database full and incremental | Full database or multiple tables | Multiple corresponding tables and their partitions | Initial full load: batch; ongoing incremental loads: T+1 | Initial full load with periodic incremental loads |

Recommended synchronization solutions

When selecting a data synchronization solution, consider two core factors:

  1. Data timeliness: How frequently does your business require data to be synchronized? Does it need daily batch updates or real-time updates at the second or minute level?

  2. Synchronization scope and complexity: How many tables do you need to synchronize? Is the processing logic for these tables consistent? This determines whether you need a single-table or whole-database solution.

Based on these factors, we recommend two main types of synchronization solutions: batch and real-time.
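
The choice can be summarized as a simple rule: timeliness first, then scope and processing complexity. The following Python sketch only illustrates that rule; the function and its inputs are hypothetical and are not part of any DataWorks API.

```python
def recommend_solution(needs_realtime: bool, table_count: int, per_table_logic: bool) -> str:
    """Illustrative decision rule: timeliness first, then scope and complexity.

    needs_realtime:  True if second- or minute-level latency is required.
    table_count:     number of source tables to synchronize.
    per_table_logic: True if tables need individual, fine-grained transformations.
    """
    if needs_realtime:
        # Real-time requirements point to CDC-based solutions.
        return ("Single-table real-time" if table_count == 1 or per_table_logic
                else "Whole-database real-time")
    # Otherwise a T+1 / periodic batch solution is sufficient.
    if table_count == 1 or per_table_logic:
        return "Single-table batch"
    return "Whole-database batch"


# Example: 200 homogeneous tables, daily T+1 loads, no per-table logic.
print(recommend_solution(needs_realtime=False, table_count=200, per_table_logic=False))
# -> Whole-database batch
```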

1. Batch synchronization (T+1 or periodic)

Batch solutions are ideal for use cases without strict data timeliness requirements, such as T+1 periodic batch processing.

Important

Prerequisite: To perform incremental batch synchronization, the source table must contain a field for tracking increments, such as a timestamp like gmt_modified or an auto-incrementing ID. If no such field exists, you can only perform periodic full synchronization.
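
As an illustration of the incremental condition, the sketch below builds a time-based filter on a `gmt_modified` column for a single business date. It is only a hypothetical example of the kind of filter you might configure in the task; the column name, date handling, and helper function are assumptions.

```python
from datetime import date, timedelta

def incremental_filter(bizdate: date, column: str = "gmt_modified") -> str:
    """Return a WHERE condition that selects only rows changed on the given business date."""
    next_day = bizdate + timedelta(days=1)
    return (f"{column} >= '{bizdate:%Y-%m-%d} 00:00:00' "
            f"AND {column} < '{next_day:%Y-%m-%d} 00:00:00'")

# Example: the T+1 run for business date 2025-07-01 picks up rows modified that day.
print(incremental_filter(date(2025, 7, 1)))
# gmt_modified >= '2025-07-01 00:00:00' AND gmt_modified < '2025-07-02 00:00:00'
```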

1. Single-table batch

This approach is best suited for scenarios where you need to perform fine-grained processing on a small number of core, heterogeneous data tables.

  • Key benefit: Flexible processing logic.

    • Fine-grained transformations: Supports complex field mapping, data filtering, constant value assignment, function-based transformations, and even AI-assisted processing (see the sketch after this list).

    • Heterogeneous source integration: Ideal for processing data from non-standard sources such as APIs and log files.

  • Key limitation: High cost at scale.

    • High configuration overhead: Configuring and maintaining individual tasks for many tables requires significant effort.

    • High resource consumption: Each task is scheduled independently. The resource consumption of 100 single-table tasks is far greater than that of one whole-database task.
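
The snippet below illustrates what this kind of per-row processing logic looks like: data filtering, field mapping, a constant column, and a function-based transformation. It is a conceptual Python sketch, not DataWorks configuration; every table, field, and value in it is made up.

```python
from typing import Optional

def transform(row: dict) -> Optional[dict]:
    """Apply single-table processing: filter, map fields, add constants, derive values."""
    # Data filtering: drop rows that are not completed orders.
    if row.get("status") not in ("PAID", "SHIPPED"):
        return None
    return {
        # Field mapping: source column names -> target column names.
        "order_id": row["id"],
        "buyer": row["buyer_name"],
        # Constant value assignment.
        "source_system": "legacy_crm",
        # Function-based transformation (hypothetical exchange rate).
        "amount_cny": round(float(row["amount_usd"]) * 7.2, 2),
    }

# Example source row -> transformed target row (or None if filtered out).
print(transform({"id": 1, "buyer_name": "Ann", "status": "PAID", "amount_usd": "10.5"}))
```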

Recommended solution: Batch synchronization task for a single table.

2. Whole-database batch

Use this approach when you need to efficiently migrate a large number of homogeneous tables from one location to another.

  • Key benefits: High operational efficiency and low cost.

    • High efficiency: Configure hundreds of tables at once with automatic object matching, which greatly reduces development effort.

    • Cost-effective: Optimized resource scheduling keeps resource consumption low. For example, one whole-database task might consume 2 CUs, compared to 100 CUs for 100 single-table tasks.

    • Typical use cases: Building the ODS layer of a data warehouse, performing periodic database backups, and migrating data to the cloud.

  • Key limitation: Limited processing logic.

    • Designed primarily for replication, it does not support complex, table-specific transformation logic.

Recommended solution: Batch whole-database synchronization task.

2. Real-time synchronization (second- or minute-level)

Real-time solutions are ideal for use cases that require capturing data changes (inserts, updates, and deletes) from a source to support real-time analytics and business responses.

Important

Prerequisite: The source must support Change Data Capture (CDC) or be a message queue. For example, a MySQL database must have binary logging (binlog) enabled, or the source can be a Kafka instance.
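
For a MySQL source, you can verify this prerequisite by checking the binary log settings before configuring the task. The snippet below is a minimal sketch that assumes the `pymysql` client and placeholder connection details.

```python
import pymysql  # assumed client library: pip install pymysql

# Placeholder connection details for the MySQL source.
conn = pymysql.connect(host="mysql-host", port=3306, user="user", password="****")
try:
    with conn.cursor() as cursor:
        # The binary log must be enabled for CDC-based real-time synchronization.
        cursor.execute("SHOW VARIABLES LIKE 'log_bin'")
        print(cursor.fetchone())   # expected: ('log_bin', 'ON')
        # Row-based logging is typically required so that every changed row is captured.
        cursor.execute("SHOW VARIABLES LIKE 'binlog_format'")
        print(cursor.fetchone())   # expected: ('binlog_format', 'ROW')
finally:
    conn.close()
```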

Single-table real-time or whole-database real-time

The selection logic is similar to that for batch solutions:

  • Single-table real-time: Ideal for scenarios that involve complex processing of real-time change streams from a single, core table.

  • Whole-database real-time: The mainstream choice for building a real-time data warehouse, implementing real-time database disaster recovery, and creating a real-time data lake. It also offers significant advantages in efficiency and cost-effectiveness.

Recommended solutions: Real-time synchronization task for a single table, Real-time whole-database synchronization task.

3. Special use case: CDC to append-only targets

Important

Context: The CDC data captured by real-time synchronization includes three types of operations: Insert, Update, and Delete. For append-only storage systems such as MaxCompute non-Delta tables, which do not natively support physical Update or Delete operations, writing the CDC stream directly can lead to data inconsistencies. For example, Delete operations would not be reflected in the target table.

  • DataWorks solution: The Base + Log pattern

    • This solution is implemented as a whole-database full and incremental (near real-time) task. It creates a Base table (the full snapshot) and a Log table (the incremental log) at the target.

    • How it works: The CDC data stream is written to the Log table in real time. Then, on a T+1 basis, the system automatically schedules a task that merges the changes from the Log table into the Base table and generates an updated full snapshot. Data lands in the incremental Log table with minute-level latency, and the final, consistent state becomes visible after the T+1 merge. This balances real-time data capture with the eventual consistency required by a batch data warehouse. A conceptual sketch of the merge step follows the recommendation below.

Recommended solution: Whole-database full and incremental (near real-time) task.
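
Conceptually, the merge step replays the day's incremental log onto the previous full snapshot, keyed by primary key. The Python sketch below illustrates that Base + Log merge; it is not the DataWorks implementation, and the event format is an assumption made for the example.

```python
def merge_base_and_log(base: dict, log: list) -> dict:
    """Replay one day's CDC events, in order, onto the Base snapshot.

    base: {primary_key: row} full snapshot produced by the previous merge.
    log:  ordered CDC events, each shaped like
          {"op": "INSERT" | "UPDATE" | "DELETE", "pk": ..., "row": {...}} (assumed format).
    """
    snapshot = dict(base)
    for event in log:
        if event["op"] in ("INSERT", "UPDATE"):
            snapshot[event["pk"]] = event["row"]   # upsert the latest image of the row
        elif event["op"] == "DELETE":
            snapshot.pop(event["pk"], None)        # the delete becomes physical in the new snapshot
    return snapshot


# Example: the delete is only reflected after the T+1 merge produces the new snapshot.
base = {1: {"id": 1, "v": "a"}, 2: {"id": 2, "v": "b"}}
log = [{"op": "UPDATE", "pk": 1, "row": {"id": 1, "v": "a2"}},
       {"op": "DELETE", "pk": 2, "row": None}]
print(merge_base_and_log(base, log))   # {1: {'id': 1, 'v': 'a2'}}
```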

Data source capabilities

| Data source | Single-table batch | Single-table real-time | Whole-database batch | Whole-database real-time | Whole-database full and incremental |
| --- | --- | --- | --- | --- | --- |
| Public dataset data source | Read | - | - | - | - |
| Amazon S3 data source | Read/Write | - | - | - | - |
| Amazon Redshift data source | Read/Write | - | - | - | - |
| AnalyticDB for MySQL 2.0 data source | Read/Write | - | - | - | - |
| AnalyticDB for MySQL 3.0 data source | Read/Write | Write | Read | Write | - |
| AnalyticDB for PostgreSQL data source | Read/Write | - | Read | - | - |
| ApsaraDB for OceanBase data source | Read/Write | - | - | Read | Read |
| Azure Blob Storage data source | Read | - | - | - | - |
| BigQuery data source | Read | - | - | - | - |
| ClickHouse data source | Read/Write | - | - | - | - |
| COS data source | Read | - | - | - | - |
| Databricks data source | Read | - | - | - | - |
| DataHub data source | Read/Write | Read/Write | - | Write | - |
| Data Lake Formation data source | Read/Write | Write | Write | Write | - |
| DB2 data source | Read/Write | - | Read | - | - |
| Doris data source | Read/Write | Write | Read | - | - |
| DM (Dameng) data source | Read/Write | - | Read | - | - |
| DRDS (PolarDB-X 1.0) data source | Read/Write | - | Read | - | - |
| Elasticsearch data source | Read/Write | Write | Write | Write | - |
| FTP data source | Read/Write | - | - | - | - |
| GBase8a data source | Read/Write | - | - | - | - |
| HBase data source | HBase: Read/Write; HBase 2.x SQL: Read; HBase 1.1.x SQL: Write | - | - | - | - |
| HDFS data source | Read/Write | - | - | - | - |
| Hive data source | Read/Write | - | Read/Write | - | - |
| Hologres data source | Read/Write | Read/Write | Read/Write | Write | - |
| HttpFile data source | Read | - | - | - | - |
| Kafka data source | Read/Write | Read/Write | - | Write | - |
| KingbaseES (Renda Jingcang) data source | Read/Write | - | - | - | - |
| Lindorm data source | Read/Write | Write | - | Write | - |
| LogHub (SLS) data source | Read/Write | Read | - | - | - |
| MaxCompute data source | Read/Write | Write | Write | Write | Write |
| MariaDB data source | Read/Write | - | - | - | - |
| MaxGraph data source | Write | - | - | - | - |
| ApsaraDB for Memcache data source | Write | - | - | - | - |
| MetaQ data source | Read | - | - | - | - |
| Milvus data source | Read/Write | - | - | - | - |
| MongoDB data source | Read/Write | - | - | Read | - |
| MySQL data source | Read/Write | Read | Read | Read | Read |
| OpenSearch data source | Write | - | - | - | - |
| Oracle data source | Read/Write | Read | Read | Read | Read |
| OSS data source | Read/Write | - | Write | Write | - |
| OSS-HDFS data source | Read/Write | - | Write | Write | - |
| PolarDB data source | Read/Write | Read | Read | Read | Read |
| PolarDB-X 2.0 data source | Read/Write | - | Read | Read | - |
| PostgreSQL data source | Read/Write | - | Read | Read | - |
| Redis data source | Write | - | - | - | - |
| RestAPI (HTTP) data source | Read/Write | - | - | - | - |
| Salesforce data source | Read/Write | - | - | - | - |
| SAP HANA data source | Read/Write | - | - | - | - |
| Sensors Data (Shen Ce) data source | Write | - | - | - | - |
| Snowflake data source | Read/Write | - | - | - | - |
| StarRocks data source | Read/Write | Write | Write | Write | - |
| SQL Server data source | Read/Write | - | Read | - | - |
| Tablestore data source | Read/Write | Write | - | - | - |
| Tablestore Stream data source | Read/Write | - | - | - | - |
| TiDB data source | Read/Write | - | - | - | - |
| TSDB data source | Write | - | - | - | - |
| Vertica data source | Read/Write | - | - | - | - |
| TOS data source | Read | - | - | - | - |

References

Use these articles to get started with Data Integration.