DataWorks Data Integration synchronizes data among various data sources, including MySQL, MaxCompute, Hologres, and Kafka. It provides batch synchronization for T+1 Extract, Transform, and Load (ETL) jobs, Real-time Data Synchronization for data replication with second-level latency, and Whole-database Migration.
Synchronization solutions
Type | Source granularity | Target granularity | Timeliness | Synchronization scenario |
Single-table batch | Single table | Single table/partition | T+1 or periodic | Periodic full or incremental synchronization |
Sharding batch | Multiple tables with identical schema | Single table/partition | T+1 or periodic | Periodic full or incremental synchronization |
Single-table real-time | Single table | Single table/partition | Second-to-minute latency | Change Data Capture (CDC) |
Whole-database batch | Whole database or multiple tables | Matching tables and partitions | One-time or periodic | One-time or periodic full/incremental synchronization. Supports an initial full synchronization followed by periodic incremental updates. |
Whole-database real-time | Whole database or multiple tables | Matching tables and partitions | Second-to-minute latency | Full synchronization + Change Data Capture (CDC) |
Whole-database full and incremental | Whole database or multiple tables | Matching tables and partitions | Initial full synchronization: batch; subsequent incremental synchronization: T+1 | Full synchronization + Change Data Capture (CDC) |
Synchronization approaches
When choosing a data synchronization approach, consider these two core questions:
Latency requirements: How often do you need to synchronize data? Is a daily update (batch) sufficient, or do you need real-time updates in seconds or minutes (real-time)?
Scale and complexity: How many tables do you need to synchronize? Is the processing logic for these tables consistent (single table vs. whole database)?
Based on these factors, we recommend two main categories of synchronization approaches: batch synchronization and real-time synchronization.
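As a rough sketch, this decision logic can be expressed as a small helper. This is purely illustrative and not a DataWorks API; the function and parameter names (`choose_approach`, `latency`, `table_count`) are invented for the example.

```python
def choose_approach(latency: str, table_count: int) -> str:
    """Map latency needs and table count to a synchronization approach.

    latency: "batch" for T+1/periodic needs, "realtime" for
             second- to minute-level needs.
    table_count: number of tables sharing homogeneous processing logic.
    """
    scope = "single-table" if table_count == 1 else "whole-database"
    if latency == "batch":
        return f"{scope} batch synchronization"
    return f"{scope} real-time synchronization"

print(choose_approach("batch", 1))       # a single table, T+1 ETL
print(choose_approach("realtime", 200))  # replicating a whole database live
```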
1. Batch synchronization (T+1/periodic)
A batch approach is for scenarios that are not time-sensitive (for example, T+1) and require periodic batch processing.
Prerequisite: To implement incremental batch synchronization, the source table must contain a field to track incremental changes, such as a timestamp like gmt_modified or an auto-incrementing ID. If no such field exists, you must perform periodic full synchronization.
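For example, an incremental batch task typically filters the source by the tracking field, bounded by the previous and current run times (a watermark). The sketch below shows one way to build such a filter; the table and column names (`orders`, `gmt_modified`) and the helper itself are illustrative, not part of DataWorks.

```python
from datetime import datetime

def incremental_query(table: str, tracking_col: str,
                      last_run: datetime, this_run: datetime) -> str:
    """Build a SQL filter that selects only rows changed since the last run."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE {tracking_col} >= '{last_run:%Y-%m-%d %H:%M:%S}' "
        f"AND {tracking_col} < '{this_run:%Y-%m-%d %H:%M:%S}'"
    )

# Each run advances the watermark, so every row is picked up exactly once.
sql = incremental_query("orders", "gmt_modified",
                        datetime(2024, 5, 1), datetime(2024, 5, 2))
print(sql)
```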
1.1. Single-table batch
Use this approach when you need to perform fine-grained processing on a small number of core, heterogeneous data tables.
Key advantage: Flexible processing logic.
Fine-grained transformation: Supports complex field mapping, data filtering, constant value assignment, function-based transformations, and even AI-assisted processing.
Heterogeneous source integration: The optimal choice for handling non-standard data sources such as APIs and log files.
Key limitation: High cost at scale.
High configuration overhead: Synchronizing a large number of tables requires significant effort to configure and maintain each task individually.
High resource consumption: Each task is scheduled independently. The resource consumption of 100 single-table tasks greatly exceeds that of one whole-database task.
Recommended approach for Single-table batch: Batch synchronization task for a single table
1.2. Whole-database batch
Use this approach when you need to efficiently "move" a large number of homogeneous data tables from one location to another.
Key advantage: High O&M efficiency and low cost.
High efficiency: Configure hundreds of tables in a single operation with automatic object matching, dramatically improving development efficiency.
Cost-effectiveness: Resources are scheduled and optimized holistically, resulting in very low costs. For example, the resource consumption of one whole-database task might be 2 CUs, compared to 100 CUs for 100 single-table tasks.
Typical use cases: Building an Operational Data Store (ODS) layer for a data warehouse, creating periodic database backups, and migrating data to the cloud.
Key limitation: Limited processing logic.
This approach is primarily for data replication and does not support complex transformation logic for individual tables.
Recommended approach for Whole-database batch: Batch whole-database synchronization task.
2. Real-time synchronization (second- to minute-level)
A real-time approach is for scenarios that require capturing real-time data changes (inserts, updates, and deletes) from the source to support real-time analytics and business responses.
Prerequisite: The source must support Change Data Capture (CDC) or be a message queue itself. For example, MySQL must have binary logging enabled, or the source could be a Kafka instance.
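Conceptually, a CDC stream is an ordered sequence of change events that a consumer replays against a target keyed by primary key. The sketch below models this with invented field names (loosely patterned after MySQL binlog row events); it is not the DataWorks event format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CdcEvent:
    op: str             # "INSERT", "UPDATE", or "DELETE"
    key: int            # primary key of the affected row
    row: Optional[dict] # new row image; None for DELETE

# A CDC stream replays source changes in commit order.
stream = [
    CdcEvent("INSERT", 1, {"id": 1, "status": "created"}),
    CdcEvent("UPDATE", 1, {"id": 1, "status": "paid"}),
    CdcEvent("DELETE", 1, None),
]

# Applying the stream to a target that supports updates and deletes:
target: dict = {}
for ev in stream:
    if ev.op == "DELETE":
        target.pop(ev.key, None)
    else:
        target[ev.key] = ev.row

print(target)  # row 1 was deleted, so the target is empty: {}
```

Note that a target which cannot apply the DELETE (an append-only store) would still hold the stale row, which is exactly the problem addressed in section 3 below.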
Single-table or whole-database real-time
The decision logic is similar to the batch approach:
Single-table real-time: Suitable for use cases that involve complex processing of real-time change streams from a single, core table.
Whole-database real-time: The mainstream choice for building a real-time data warehouse, implementing real-time database disaster recovery, and integrating with real-time data lakes. It also offers significant advantages in efficiency and cost-effectiveness.
Recommended real-time approaches: Real-time synchronization task for a single table, Real-time whole-database synchronization task
3. Real-time CDC to append-only targets
Context: Real-time synchronization captures CDC data, which includes Insert, Update, and Delete operations. For Append-Only storage systems like non-Delta tables in MaxCompute that do not natively support physical Update or Delete operations, writing the CDC stream directly can lead to inconsistent data states. For example, the target table will not reflect a delete operation.
DataWorks solution: Base + Log pattern
This solution is implemented as a whole-database full and incremental task. It creates a Base table (for full snapshots) and a Log table (for incremental logs) at the target.
How it works: The CDC data stream is written to the Log table in real time. Then, on a T+1 schedule, the system automatically runs a task that merges the changes from the Log table into the Base table, generating an updated full snapshot. With this approach, data is written to the Log table within minutes, while the final, merged state is visible only after the T+1 task completes. This balances real-time data capture with the eventual consistency required by a batch data warehouse.
Recommended approach: Whole-database full and incremental (near real-time) task.
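The T+1 merge step can be illustrated with a minimal sketch: changes are applied in commit order, the latest log entry per key wins, and deletes drop the row from the new snapshot. This illustrates the pattern only, not DataWorks' actual merge implementation; the `(seq, op, key, row)` log layout is an assumption.

```python
def merge_base_and_log(base: dict, log: list) -> dict:
    """Produce the next full snapshot from the current base and the CDC log.

    base: {key: row} -- yesterday's full snapshot (the Base table).
    log:  [(seq, op, key, row)] -- incremental entries; seq gives commit order.
    """
    snapshot = dict(base)
    for _seq, op, key, row in sorted(log):  # replay changes in commit order
        if op == "DELETE":
            snapshot.pop(key, None)         # deletes are finally applied here
        else:
            snapshot[key] = row             # INSERT or UPDATE upserts the row
    return snapshot

base = {1: {"id": 1, "amount": 10}, 2: {"id": 2, "amount": 20}}
log = [
    (1, "UPDATE", 1, {"id": 1, "amount": 15}),
    (2, "DELETE", 2, None),
    (3, "INSERT", 3, {"id": 3, "amount": 30}),
]
print(merge_base_and_log(base, log))
# {1: {'id': 1, 'amount': 15}, 3: {'id': 3, 'amount': 30}}
```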
Data source read/write capabilities
Data source | Single-table batch | Single-table real-time | Whole-database batch | Whole-database real-time | Whole-database full and incremental |
Elasticsearch | Read/Write | Write | Write | Write | - |
GBase8a | Read/Write | - | - | - | - |
HBase | Read/Write (HBase20xsql: Read; HBase11xsql: Write) | - | - | - | - |
Hive | Read/Write | - | Read/Write | - | - |
Vertica | Read/Write | - | - | - | - |
Related documents
This guide lists essential Data Integration documents to help you get started.
For data source configuration, see Data source management.
For more use cases, see the following:
Use case: Batch synchronization for a single table
Use case: Real-time whole-database synchronization
For answers to common questions about Data Synchronization, see the Data Integration FAQ.