Data Integration



DataWorks Data Integration supports data synchronization in complex network environments. On the DataStudio page, you can configure a batch synchronization node by using the codeless UI to periodically synchronize offline data, or create a real-time synchronization node to synchronize incremental data from a single table or a database in real time. This topic provides an overview of data synchronization.

Background information

In addition to the data synchronization nodes that you create on the DataStudio page, DataWorks allows you to create data synchronization solutions in Data Integration, such as a solution that synchronizes both full and incremental data or a solution that synchronizes all data from a database in batch mode. For more information, see Supported data source types and data synchronization solutions.

Limits

You can create a data synchronization node on the DataStudio page only after you are assigned the Development role. For information about how to add a RAM user to a workspace as a member and assign a role to the RAM user, see Add a RAM user to a workspace as a member and assign roles to the member.

Batch synchronization feature

Use scenarios
The batch synchronization feature allows you to synchronize data from a single table to another single table, or from tables in sharded databases to a single table. When you configure a batch synchronization node, you can use scheduling parameters to periodically synchronize full or incremental data to specific partitions in a destination table. You can also use the data backfill feature in Operation Center to synchronize historical data to specific tables or partitions in a destination database or data warehouse, based on the configurations of the batch synchronization node.
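The relationship between a data backfill range and the destination partitions can be sketched as follows. This is a minimal illustration, not DataWorks code: the function name partitions_for_backfill, the partition key ds, and the yyyymmdd value format are assumptions chosen for the example.

```python
from datetime import date, timedelta


def partitions_for_backfill(start: date, end: date, key: str = "ds") -> list:
    """Return one partition spec per day in [start, end], inclusive.

    The partition key name and the yyyymmdd format are assumptions
    made for this sketch, not values taken from DataWorks.
    """
    if end < start:
        raise ValueError("end must not be earlier than start")
    days = (end - start).days + 1
    return [
        f"{key}={(start + timedelta(days=i)).strftime('%Y%m%d')}"
        for i in range(days)
    ]


# Backfilling four days of historical data yields four partitions,
# one per run of the batch synchronization node.
parts = partitions_for_backfill(date(2024, 2, 27), date(2024, 3, 1))
print(parts)
```

Each entry corresponds to one run of the same node configuration, which is why a single batch synchronization node can serve both the daily schedule and a backfill.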
Supported data sources
Data Integration supports batch synchronization of data among more than 40 types of data sources, such as relational databases, unstructured storage systems, big data storage systems, and message queues. DataWorks allows you to synchronize data between structured or semi-structured data sources by defining sources and destinations and using Reader and Writer plug-ins provided by Data Integration.

Feature description

Description	Reference
Data Integration provides Reader and Writer plug-ins that read data from sources and write data to destinations. Add the desired data sources to DataWorks, and then select them when you create a batch synchronization node to specify the source to read from and the destination to write to.	Supported data source types, Reader plug-ins, and Writer plug-ins; Overview of the batch synchronization feature
After you add the desired data sources to DataWorks, you can configure a batch synchronization node for the data sources by using the codeless user interface (UI).	Configure a batch synchronization node by using the codeless UI (2.0)
You must configure a batch synchronization node by using the code editor in the following scenarios: (1) the data source that you want to use cannot be added to DataWorks; (2) the data source that you want to use does not support the codeless UI; (3) the parameters of the Reader or Writer plug-in that you want to use can be configured only by using the code editor.	Configure a batch synchronization node by using the code editor (2.0)
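To make the Reader and Writer pairing concrete, the following sketch assembles a script of the kind that a code editor configuration might contain. Everything in it is an assumption for illustration: the helper build_sync_script, the key names (type, steps, stepType, parameter, category, order, hops), the plug-in identifiers, and the data source and table names are not confirmed by this topic. Check the code editor topic and the plug-in references for the actual format.

```python
import json


def build_sync_script(reader_type, reader_params, writer_type, writer_params):
    """Assemble a hypothetical batch synchronization script as a dict.

    All key names are illustrative assumptions, not a confirmed schema.
    """
    return {
        "type": "job",
        "steps": [
            {
                "stepType": reader_type,
                "parameter": reader_params,
                "name": "Reader",
                "category": "reader",
            },
            {
                "stepType": writer_type,
                "parameter": writer_params,
                "name": "Writer",
                "category": "writer",
            },
        ],
        # The Reader step feeds the Writer step.
        "order": {"hops": [{"from": "Reader", "to": "Writer"}]},
    }


# Hypothetical data source and table names.
script = build_sync_script(
    "mysql",
    {"datasource": "my_mysql_source", "table": ["orders"], "column": ["id", "amount"]},
    "odps",
    {
        "datasource": "my_odps_target",
        "table": "ods_orders",
        "partition": "ds=${bizdate}",
        "column": ["id", "amount"],
    },
)
print(json.dumps(script, indent=2))
```

The sketch keeps the source and destination as two separate steps so that either side can be swapped for a different plug-in without touching the other.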

Real-time synchronization feature

The real-time synchronization feature allows you to combine multiple types of data sources to form a star-shaped data synchronization link. You can synchronize data between different types of data sources in real time. You can configure the input and output of a real-time synchronization node to synchronize data from a single table to another single table or synchronize all data from a database to a destination. For more information, see Data source types that support real-time synchronization and Overview of the real-time synchronization feature.

Scheduling configurations of a data synchronization node

Scheduling dependencies between nodes

Batch synchronization node
- Ancestor node of a batch synchronization node: Data synchronization nodes can use the lineage-based scheduling dependencies that DataWorks supports. You can also configure the root node of the workspace to which the batch synchronization node belongs, or a zero load node, as the ancestor node. This way, the batch synchronization node is scheduled by the root node or the zero load node.
- Descendant node of a batch synchronization node: If an SQL node depends on a batch synchronization node and you want the automatic parsing feature to establish the scheduling dependency between the two nodes, we recommend that you configure the table generated by the batch synchronization node as the output of that node in the Project name.Table name format.
Descendant node of a real-time synchronization node
DataWorks can establish lineage-based scheduling dependencies only on tables that are generated by auto triggered nodes. If a node needs to depend on a real-time synchronization node and process the table data that the real-time synchronization node generates, you cannot configure scheduling dependencies for the node based on table lineages. Instead, configure the root node of the workspace to which the node belongs, or a zero load node, as the ancestor node of the node. This way, the node is scheduled by the root node or the zero load node.
Note
To ensure that a real-time synchronization node can generate data as expected, you can configure a monitoring rule for the node.
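The dependency rules in this section can be modeled with a short sketch: a node depends on whichever auto triggered nodes declare its input tables as outputs, and falls back to the workspace root node when no such node exists, as is the case for tables written by a real-time synchronization node. The functions output_name and pick_ancestors, the node names, and the root node name my_project_root are hypothetical. This is a simplified model, not the DataWorks automatic parsing logic.

```python
def output_name(project: str, table: str) -> str:
    """Format a node output in the Project name.Table name form."""
    return f"{project}.{table}"


def pick_ancestors(input_tables, outputs_by_node, root_node):
    """Choose ancestor nodes for a node that reads input_tables.

    outputs_by_node maps each auto triggered node to the set of outputs
    it declares. If no node declares any of the input tables, the node
    falls back to the root node (a zero load node would work the same way).
    """
    wanted = set(input_tables)
    ancestors = sorted(
        node for node, outputs in outputs_by_node.items() if outputs & wanted
    )
    return ancestors or [root_node]


# A batch synchronization node declares its destination table as output.
outputs = {"sync_orders": {output_name("my_project", "ods_orders")}}

# An SQL node reading that table depends on the batch synchronization node.
batch_case = pick_ancestors(["my_project.ods_orders"], outputs, "my_project_root")

# A table written by a real-time synchronization node has no auto triggered
# producer, so the consuming node is attached to the root node instead.
realtime_case = pick_ancestors(["my_project.rt_events"], outputs, "my_project_root")

print(batch_case, realtime_case)
```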

Scheduling parameter configuration of a batch synchronization node

DataWorks provides the built-in variable ${bizdate} for batch synchronization nodes. By default, the value of the scheduling parameter $bizdate is assigned to ${bizdate}.
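The following sketch shows how a ${bizdate} placeholder in a node configuration might be resolved at run time. It assumes that the value is the day before the scheduled run date in yyyymmdd format. This topic does not state that rule, so verify it against the scheduling parameter documentation. The function render is hypothetical.

```python
from datetime import date, timedelta


def render(template: str, run_date: date) -> str:
    """Replace ${bizdate} with the assumed data timestamp for run_date.

    Assumption: the data timestamp is the day before the scheduled
    run date, formatted as yyyymmdd.
    """
    bizdate = (run_date - timedelta(days=1)).strftime("%Y%m%d")
    return template.replace("${bizdate}", bizdate)


# A run scheduled for March 1, 2024 writes to the partition for February 29.
rendered = render("ds=${bizdate}", date(2024, 3, 1))
print(rendered)
```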

For information about how to use scheduling parameters in data synchronization, see Description for using scheduling parameters in data synchronization.
For information about the use scenarios of scheduling parameters in data synchronization, see Common use scenarios of scheduling parameters.
