The batch synchronization feature of Data Integration provides readers and writers for you to read data from and write data to data sources. You can specify a source and a destination for your batch synchronization task, and configure scheduling parameters for the task. This way, you can use the task to synchronize full data or incremental data from the source to the destination. This topic describes the capabilities provided by the batch synchronization feature.
Limits
The batch synchronization feature provided by DataWorks does not support data synchronization across time zones. If the data sources of a batch synchronization task reside in a different time zone from the resource group that is used to run the task, errors may occur during data synchronization.
Billing
Data synchronization tasks in Data Integration consume Data Integration resources, and you are charged for the resources that you use. The scheduling system issues batch synchronization tasks in Data Integration to the related exclusive resource groups for scheduling, and the resource groups schedule the tasks. Scheduling fees are generated during this process.
Note: For more information about the billing details of scheduling fees, see Billing overview.
For more information about the task issuing mechanism, see Overview.
If you configure the public IP address of a data source when you add the data source to DataWorks, and you use the data source for your batch synchronization task, Internet traffic is generated when you run the task. You are charged for the generated Internet traffic. For more information about the billing details of Internet traffic, see Billing of Internet traffic.
Overview
The following table describes the capabilities of the batch synchronization feature.
Capability | Description |
Data synchronization between heterogeneous data sources | Data Integration supports data synchronization between more than 40 types of data sources, including relational databases, unstructured storage systems, big data storage systems, and message queues. You can specify a source and a destination for your batch synchronization task and use the related reader and writer to synchronize data between the data sources. All structured data sources and semi-structured data sources are supported. For more information, see Supported data source types and data synchronization solutions. |
Data synchronization from or to data sources that are deployed in complex network environments | The batch synchronization feature supports data synchronization from or to Alibaba Cloud data sources, data centers, self-managed data sources that are hosted on Elastic Compute Service (ECS) instances, and data sources that do not belong to Alibaba Cloud. You can select appropriate network connectivity solutions to establish network connections between your resource group and data sources based on the network environments in which the data sources are deployed. Before you configure a data synchronization task, you must make sure that network connections are established between your resource group for Data Integration and data sources. For more information about how to establish a network connection between a resource group and a data source, see Network connectivity solutions. |
Data synchronization scenarios | The batch synchronization feature allows you to synchronize data from a single table to another single table or synchronize data from tables in sharded databases to a single table. You can configure scheduling parameters for a batch synchronization task and use the task to periodically synchronize full data and incremental data in the source to the related partition in the destination table. You can also configure scheduling parameters for a batch synchronization task and use the data backfill feature provided in Operation Center to backfill the historical data of a specific period of time for the task. This way, you can use the task to synchronize the historical data to the specified partition or table in the destination database or data warehouse. For more information about scheduling parameters, see Supported formats of scheduling parameters. |
Task configuration methods | You can use one of the following methods to configure a batch synchronization task:
Note For more information about the settings that are supported for configuring a batch synchronization task, see Configurations for a batch synchronization task. |
O&M for batch synchronization tasks |
|
Configurations for a batch synchronization task
Configuration | Description |
Synchronize full or incremental data | You can configure a filter condition and scheduling parameters when you configure a batch synchronization task to synchronize incremental data from the source. The parameters that need to be configured to implement incremental synchronization vary based on the reader type. For more information, see Configure a batch synchronization task to synchronize only incremental data. |
Configure field mappings, and add fields to a source table and assign values to the fields | When you configure a batch synchronization task, you can configure mappings between fields in the source and fields in the destination. The values of the fields in the source are written to the fields of the same data type in the destination based on the mappings.
|
Specify the maximum number of parallel threads that can be used and the maximum transmission rate |
|
Enable the distributed execution mode | Batch synchronization tasks for specific types of data sources can be run in distributed execution mode. If you enable the distributed execution mode for a batch synchronization task when you configure the task, the system splits the task into slices and uses multiple machines to run the task at the same time. In this case, the more ECS instances, the higher the data synchronization speed. If you have a high requirement for data synchronization performance, you can run your batch synchronization task in distributed execution mode. If you run a batch synchronization task in distributed execution mode, fragment resources of ECS instances can be utilized. This helps improve resource utilization. Note Support for the distributed execution mode varies based on the data source type. For more information about whether a data source supports the distributed execution mode, see the topic for the related reader or writer and parameters displayed in the DataWorks console. |
Specify the maximum number of dirty data records that are allowed | Data Integration allows the generation of dirty data records by default. Data Integration also allows you to specify the maximum number of dirty data records that are allowed during data synchronization and define the impacts of dirty data records.
Note Dirty data is data that is meaningless to the business, does not match the specified data type, or causes an exception during data synchronization. Any data record that fails to be written to the destination is considered dirty data. For example, when a batch synchronization task attempts to write VARCHAR-type data in a source to an INT-type field in a destination, a data conversion error occurs, and the data fails to be written to the destination. In this case, the data is dirty data. When you configure a batch synchronization task, you can specify whether dirty data can be generated. You can also specify the maximum number of dirty data records that are allowed during data synchronization. If the number of generated dirty data records exceeds the upper limit that you specify, the synchronization task fails. |
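The dirty data threshold described in the preceding table can be illustrated with a minimal Python sketch. This is not the Data Integration implementation: the function name sync_records, the exception DirtyDataLimitExceeded, and the max_dirty argument are hypothetical names chosen for illustration. The sketch mimics writing VARCHAR-type source values to an INT-type destination field and fails once the number of dirty data records exceeds the limit.

```python
from typing import Iterable, List, Optional, Tuple


class DirtyDataLimitExceeded(Exception):
    """Hypothetical error: more dirty records were seen than the limit allows."""


def sync_records(rows: Iterable[str], max_dirty: Optional[int]) -> Tuple[List[int], List[str]]:
    """Convert string values to integers, collecting records that fail to convert.

    max_dirty=None means that any number of dirty records is tolerated.
    max_dirty=0 means that the first dirty record makes the task fail.
    """
    written: List[int] = []
    dirty: List[str] = []
    for raw in rows:
        try:
            # Stands in for writing a VARCHAR value to an INT field.
            written.append(int(raw))
        except ValueError:
            dirty.append(raw)
            # The task fails only when the count exceeds the configured limit.
            if max_dirty is not None and len(dirty) > max_dirty:
                raise DirtyDataLimitExceeded(
                    f"{len(dirty)} dirty records exceed the limit of {max_dirty}"
                )
    return written, dirty


if __name__ == "__main__":
    ok, bad = sync_records(["1", "2", "abc", "4"], max_dirty=1)
    print(ok, bad)
```

With a limit of 1, the single non-numeric value is tolerated and recorded as dirty data. With a limit of 0, the same input raises the error on the first failed conversion.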
Description for using scheduling parameters in data synchronization
Batch synchronization of data in a table
If you use a batch synchronization task to synchronize data, you can configure scheduling parameters for the task to specify the path and scope of the data that you want to synchronize and the location to which you want to write the data. The method used to configure scheduling parameters for a batch synchronization task is the same as the method used to configure scheduling parameters for other types of tasks.
When a batch synchronization task is run, the scheduling parameters configured for the task are replaced with the actual values based on the value formats of the scheduling parameters. Then, the batch synchronization task synchronizes data based on the values.
Example: You want to configure a batch synchronization task to synchronize the order data that is generated on the previous day in an order table in a MySQL data source to the partition for the current day in a destination MaxCompute table every day. The source order table contains the field gmt_created that specifies the time when an order is created, and the destination MaxCompute table contains the partition field ds.
The incremental order data in the source order table is obtained based on a filter condition that contains WHERE.
bizdate_yesterday specifies the date on which an incremental order is created. The date is one day earlier than the date on which the task is scheduled to run. The value format of the parameter is ${yyyy-mm-dd}.
bizdate_today specifies the date on which data of an incremental order is synchronized. The task is scheduled to run on the day indicated by this date. The value format of the parameter is $[yyyy-mm-dd].
bizdate_today and bizdate_yesterday are the names of scheduling parameters. You can specify the names based on your business requirements. When the task is run, the bizdate_today and bizdate_yesterday parameters are replaced with actual values based on the value formats of the parameters.
The partition in the destination MaxCompute table is also specified by a scheduling parameter. $bizdate specifies the data timestamp of the task. When the task is run, the partition filter expression configured for the task is replaced with the data timestamp specified by the scheduling parameter. For more information about how to configure and use scheduling parameters, see Configure and use scheduling parameters.
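The substitution in this example can be sketched in Python as follows. This is an illustration under stated assumptions, not the scheduling system itself: the functions resolve_parameters and render_template are hypothetical, and the filter text in FILTER_TEMPLATE is an assumed example of a condition on gmt_created, because the exact condition is not shown in this topic. The sketch relies only on the behavior described above: ${yyyy-mm-dd} resolves to the day before the scheduled day, and $[yyyy-mm-dd] resolves to the scheduled day.

```python
from datetime import date, timedelta
from typing import Dict

# Assumed filter condition for illustration; placeholders use ${name} syntax.
FILTER_TEMPLATE = (
    "gmt_created >= '${bizdate_yesterday}' AND gmt_created < '${bizdate_today}'"
)


def resolve_parameters(scheduled_day: date) -> Dict[str, str]:
    """Return the values that the two scheduling parameters take for one run."""
    previous_day = scheduled_day - timedelta(days=1)
    return {
        # bizdate_yesterday=${yyyy-mm-dd}: one day before the scheduled day
        "bizdate_yesterday": previous_day.strftime("%Y-%m-%d"),
        # bizdate_today=$[yyyy-mm-dd]: the day on which the task is scheduled to run
        "bizdate_today": scheduled_day.strftime("%Y-%m-%d"),
    }


def render_template(template: str, params: Dict[str, str]) -> str:
    """Replace every ${name} placeholder with its resolved value."""
    for name, value in params.items():
        template = template.replace("${" + name + "}", value)
    return template


if __name__ == "__main__":
    params = resolve_parameters(date(2024, 3, 15))
    print(render_template(FILTER_TEMPLATE, params))
```

For a task scheduled to run on 2024-03-15, the rendered condition selects orders created on 2024-03-14.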
Batch synchronization of all data in a database
For a batch synchronization task that is used to synchronize all data in a database, only the following scheduling parameters can be configured:
bizdate=${yyyymmdd} year=$[yyyy] month=$[mm] day=$[dd] hour=$[hh24]
When you configure the task, you must define the following variables: ${bizdate}, ${year}, ${month}, ${day}, and ${hour}.
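As a rough illustration of how these five parameters might resolve for one run, consider the following Python sketch. The function resolve_database_sync_parameters is hypothetical. The sketch assumes, consistent with the single-table example in this topic, that a ${...} format is evaluated against the day before the scheduled time and that a $[...] format is evaluated against the scheduled time itself.

```python
from datetime import datetime, timedelta
from typing import Dict


def resolve_database_sync_parameters(scheduled_time: datetime) -> Dict[str, str]:
    """Resolve bizdate, year, month, day, and hour for one scheduled run."""
    previous_day = scheduled_time - timedelta(days=1)
    return {
        # bizdate=${yyyymmdd}: assumed to be the day before the scheduled time
        "bizdate": previous_day.strftime("%Y%m%d"),
        # The remaining parameters use $[...] and follow the scheduled time.
        "year": scheduled_time.strftime("%Y"),    # $[yyyy]
        "month": scheduled_time.strftime("%m"),   # $[mm]
        "day": scheduled_time.strftime("%d"),     # $[dd]
        "hour": scheduled_time.strftime("%H"),    # $[hh24]
    }


if __name__ == "__main__":
    print(resolve_database_sync_parameters(datetime(2024, 1, 1, 5, 30)))
```

A run scheduled for 05:30 on January 1, 2024 shows why the two formats differ: bizdate falls in the previous year, while year, month, day, and hour describe the scheduled time.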
Example: If you want to configure a data synchronization task that synchronizes full data from a source database once and then periodically synchronizes incremental data from the source database to MaxCompute, you can configure the following filter condition to obtain the daily incremental data:
STR_TO_DATE('${bizdate}', '%Y%m%d') <= columnName AND columnName < DATE_ADD(STR_TO_DATE('${bizdate}', '%Y%m%d'), interval 1 day)
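The filter condition in this example selects a half-open, one-day window that starts at 00:00 on the day indicated by ${bizdate}. The following Python sketch renders the condition for a given value and checks which timestamps fall inside the window. The helper names render_filter, window, and in_window are hypothetical, and the sketch only mirrors the date arithmetic of the condition. It does not run against a database.

```python
from datetime import datetime, timedelta
from typing import Tuple

FILTER_TEMPLATE = (
    "STR_TO_DATE('${bizdate}', '%Y%m%d') <= columnName "
    "AND columnName < DATE_ADD(STR_TO_DATE('${bizdate}', '%Y%m%d'), interval 1 day)"
)


def render_filter(bizdate: str) -> str:
    """Substitute a yyyymmdd value into the filter condition."""
    datetime.strptime(bizdate, "%Y%m%d")  # raises ValueError for a malformed value
    return FILTER_TEMPLATE.replace("${bizdate}", bizdate)


def window(bizdate: str) -> Tuple[datetime, datetime]:
    """Return the [start, end) range that the filter condition selects."""
    start = datetime.strptime(bizdate, "%Y%m%d")
    return start, start + timedelta(days=1)


def in_window(value: datetime, bizdate: str) -> bool:
    """Check a timestamp against the range: start <= value < end."""
    start, end = window(bizdate)
    return start <= value < end


if __name__ == "__main__":
    print(render_filter("20240314"))
    print(in_window(datetime(2024, 3, 14, 23, 59, 59), "20240314"))
    print(in_window(datetime(2024, 3, 15, 0, 0, 0), "20240314"))
```

Because the upper bound is exclusive, a record stamped exactly at midnight of the next day belongs to the next run, so consecutive daily runs neither overlap nor skip records.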