DataWorks Data Integration synchronizes data among various data sources, including MySQL, MaxCompute, Hologres, and Kafka. It provides batch synchronization for T+1 Extract, Transform, and Load (ETL) jobs, Real-time Data Synchronization for data replication with second-level latency, and Whole-database Migration.
Synchronization solutions
Type | Source granularity | Target granularity | Timeliness | Synchronization scenario |
Single-table batch | Single table | Single table/partition | T+1 or periodic | Periodic full or incremental synchronization |
Sharding batch | Multiple tables with identical schema | Single table/partition | T+1 or periodic | Periodic full or incremental synchronization |
Single-table real-time | Single table | Single table/partition | Second-to-minute latency | Change Data Capture (CDC) |
Whole-database batch | Whole database or multiple tables | Matching tables and partitions | One-time or periodic | One-time or periodic full/incremental synchronization. Supports an initial full synchronization followed by periodic incremental updates. |
Whole-database real-time | Whole database or multiple tables | Matching tables and partitions | Second-to-minute latency | Full synchronization + Change Data Capture (CDC) |
Whole-database full and incremental | Whole database or multiple tables | Matching tables and partitions | Initial full synchronization: batch; subsequent incremental synchronization: T+1 | Full synchronization + Change Data Capture (CDC) |
Synchronization approaches
When choosing a data synchronization approach, consider these two core questions:
Latency requirements: How often do you need to synchronize data? Is a daily update (batch) sufficient, or do you need real-time updates in seconds or minutes (real-time)?
Scale and complexity: How many tables do you need to synchronize? Is the processing logic for these tables consistent (single table vs. whole database)?
Based on these factors, we recommend two main categories of synchronization approaches: batch synchronization and real-time synchronization.
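As a rough sketch, this decision logic can be expressed as a small helper. This is purely illustrative and not a DataWorks API; the function and parameter names (`choose_approach`, `latency`, `table_count`) are invented for the example.

```python
def choose_approach(latency: str, table_count: int) -> str:
    """Map latency needs and table count to a synchronization approach.

    latency: "batch" for T+1/periodic needs, "realtime" for
             second- to minute-level needs.
    table_count: number of tables sharing homogeneous processing logic.
    """
    scope = "single-table" if table_count == 1 else "whole-database"
    if latency == "batch":
        return f"{scope} batch synchronization"
    return f"{scope} real-time synchronization"

print(choose_approach("batch", 1))       # a single table, T+1 ETL
print(choose_approach("realtime", 200))  # replicating a whole database live
```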
1. Batch synchronization (T+1/periodic)
A batch approach is for scenarios that are not time-sensitive (for example, T+1) and require periodic batch processing.
Prerequisite: To implement incremental batch synchronization, the source table must contain a field to track incremental changes, such as a timestamp like gmt_modified or an auto-incrementing ID. If no such field exists, you must perform periodic full synchronization.
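For example, an incremental batch task typically filters the source by the tracking field, bounded by the previous and current run times (a watermark). The sketch below shows one way to build such a filter; the table and column names (`orders`, `gmt_modified`) and the helper itself are illustrative, not part of DataWorks.

```python
from datetime import datetime

def incremental_query(table: str, tracking_col: str,
                      last_run: datetime, this_run: datetime) -> str:
    """Build a SQL filter that selects only rows changed since the last run."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE {tracking_col} >= '{last_run:%Y-%m-%d %H:%M:%S}' "
        f"AND {tracking_col} < '{this_run:%Y-%m-%d %H:%M:%S}'"
    )

# Each run advances the watermark, so every row is picked up exactly once.
sql = incremental_query("orders", "gmt_modified",
                        datetime(2024, 5, 1), datetime(2024, 5, 2))
print(sql)
```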
1.1. Single-table batch
Use this approach when you need to perform fine-grained processing on a small number of core, heterogeneous data tables.
Key advantage: Flexible processing logic.
Fine-grained transformation: Supports complex field mapping, data filtering, constant value assignment, function-based transformations, and even AI-assisted processing.
Heterogeneous source integration: The optimal choice for handling non-standard data sources such as APIs and log files.
Key limitation: High cost at scale.
High configuration overhead: Synchronizing a large number of tables requires significant effort to configure and maintain each task individually.
High resource consumption: Each task is scheduled independently. The resource consumption of 100 single-table tasks greatly exceeds that of one whole-database task.
Recommended approach for Single-table batch: Batch synchronization task for a single table
1.2. Whole-database batch
Use this approach when you need to efficiently "move" a large number of homogeneous data tables from one location to another.
Key advantage: High O&M efficiency and low cost.
High efficiency: Configure hundreds of tables in a single operation with automatic object matching, dramatically improving development efficiency.
Cost-effectiveness: Resources are scheduled and optimized holistically, resulting in very low costs. For example, the resource consumption of one whole-database task might be 2 CUs, compared to 100 CUs for 100 single-table tasks.
Typical use cases: Building an Operational Data Store (ODS) layer for a data warehouse, creating periodic database backups, and migrating data to the cloud.
Key limitation: Limited processing logic.
This approach is primarily for data replication and does not support complex transformation logic for individual tables.
Recommended approach for Whole-database batch: Batch whole-database synchronization task.
2. Real-time synchronization (second- to minute-level)
A real-time approach is for scenarios that require capturing real-time data changes (inserts, updates, and deletes) from the source to support real-time analytics and business responses.
Prerequisite: The source must support Change Data Capture (CDC) or be a message queue itself. For example, MySQL must have binary logging enabled, or the source could be a Kafka instance.
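Conceptually, a CDC stream is an ordered sequence of change events that a consumer replays against a target keyed by primary key. The sketch below models this with invented field names (loosely patterned after MySQL binlog row events); it is not the DataWorks event format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CdcEvent:
    op: str             # "INSERT", "UPDATE", or "DELETE"
    key: int            # primary key of the affected row
    row: Optional[dict] # new row image; None for DELETE

# A CDC stream replays source changes in commit order.
stream = [
    CdcEvent("INSERT", 1, {"id": 1, "status": "created"}),
    CdcEvent("UPDATE", 1, {"id": 1, "status": "paid"}),
    CdcEvent("DELETE", 1, None),
]

# Applying the stream to a target that supports updates and deletes:
target: dict = {}
for ev in stream:
    if ev.op == "DELETE":
        target.pop(ev.key, None)
    else:
        target[ev.key] = ev.row

print(target)  # row 1 was deleted, so the target is empty: {}
```

Note that a target which cannot apply the DELETE (an append-only store) would still hold the stale row, which is exactly the problem addressed in section 3 below.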
Single-table or whole-database real-time
The decision logic is similar to the batch approach:
Single-table real-time: Suitable for use cases that involve complex processing of real-time change streams from a single, core table.
Whole-database real-time: The mainstream choice for building a real-time data warehouse, implementing real-time database disaster recovery, and integrating with real-time data lakes. It also offers significant advantages in efficiency and cost-effectiveness.
Recommended real-time approaches: Real-time synchronization task for a single table, Real-time whole-database synchronization task
3. Real-time CDC to append-only targets
Context: Real-time synchronization captures CDC data, which includes Insert, Update, and Delete operations. For Append-Only storage systems like non-Delta tables in MaxCompute that do not natively support physical Update or Delete operations, writing the CDC stream directly can lead to inconsistent data states. For example, the target table will not reflect a delete operation.
DataWorks solution: Base + Log pattern
This solution is implemented as a whole-database full and incremental task. It creates a Base table (for full snapshots) and a Log table (for incremental logs) at the target.
How it works: The CDC data stream is written to the Log table in real time. Then, on a T+1 schedule, the system automatically runs a task that merges the changes from the Log table into the Base table, generating an updated full snapshot. With this approach, data is written to the Log table within minutes, while the final, merged state is visible only after the T+1 task completes. This balances real-time data capture with the eventual consistency required by a batch data warehouse.
Recommended approach: Whole-database full and incremental (near real-time) task.
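The T+1 merge step can be illustrated with a minimal sketch: changes are applied in commit order, the latest log entry per key wins, and deletes drop the row from the new snapshot. This illustrates the pattern only, not DataWorks' actual merge implementation; the `(seq, op, key, row)` log layout is an assumption.

```python
def merge_base_and_log(base: dict, log: list) -> dict:
    """Produce the next full snapshot from the current base and the CDC log.

    base: {key: row} -- yesterday's full snapshot (the Base table).
    log:  [(seq, op, key, row)] -- incremental entries; seq gives commit order.
    """
    snapshot = dict(base)
    for _seq, op, key, row in sorted(log):  # replay changes in commit order
        if op == "DELETE":
            snapshot.pop(key, None)         # deletes are finally applied here
        else:
            snapshot[key] = row             # INSERT or UPDATE upserts the row
    return snapshot

base = {1: {"id": 1, "amount": 10}, 2: {"id": 2, "amount": 20}}
log = [
    (1, "UPDATE", 1, {"id": 1, "amount": 15}),
    (2, "DELETE", 2, None),
    (3, "INSERT", 3, {"id": 3, "amount": 30}),
]
print(merge_base_and_log(base, log))
# {1: {'id': 1, 'amount': 15}, 3: {'id': 3, 'amount': 30}}
```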
Data source read/write capabilities
Data source | Single-table batch | Single-table real-time | Whole-database batch | Whole-database real-time | Whole-database full and incremental |
Elasticsearch | Read/Write | Write | Write | Write | - |
GBase8a | Read/Write | - | - | - | - |
HBase | Read/Write (HBase20xsql: Read; HBase11xsql: Write) | - | - | - | - |
Hive | Read/Write | - | Read/Write | - | - |
Vertica | Read/Write | - | - | - | - |
Related documents
This guide lists essential Data Integration documents to help you get started.
For data source configuration, see Data source management.
For more use cases, see the following:
Use case: Batch synchronization for a single table
Use case: Real-time whole-database synchronization
For answers to common questions about Data Synchronization, see the Data Integration FAQ.