DataWorks: Overview of the batch synchronization feature

Last Updated: Jul 11, 2024

The batch synchronization feature of Data Integration provides readers and writers for you to read data from and write data to data sources. You can specify a source and a destination for your batch synchronization task, and configure scheduling parameters for the task. This way, you can use the task to synchronize full data or incremental data from the source to the destination. This topic describes the capabilities provided by the batch synchronization feature.

Limits

The batch synchronization feature provided by DataWorks does not support data synchronization across time zones. If the data sources of a batch synchronization task reside in a different time zone from the resource group that is used to run the task, errors may occur during data synchronization.

Billing

  • Data synchronization tasks in Data Integration occupy Data Integration resources, and you are charged for the resources that you use. The scheduling system issues batch synchronization tasks to the related exclusive resource groups for scheduling, and the resource groups schedule the tasks. Scheduling fees are generated during this process.

    Note
    • For more information about the billing details of scheduling fees, see Billing overview.

    • For more information about the task issuing mechanism, see Overview.

  • If you configure the public IP address of a data source when you add the data source to DataWorks, and you use the data source for your batch synchronization task, Internet traffic is generated when you run the task. You are charged for the generated Internet traffic. For more information about the billing details of Internet traffic, see Billing of Internet traffic.

Overview

The following figure shows the capabilities of the batch synchronization feature.

[Figure: Batch synchronization capabilities]

Capability

Description

Data synchronization between heterogeneous data sources

Data Integration supports data synchronization between more than 40 types of data sources, including relational databases, unstructured storage systems, big data storage systems, and message queues. You can specify a source and a destination for your batch synchronization task and use the related reader and writer to synchronize data between the data sources. All structured data sources and semi-structured data sources are supported. For more information, see Supported data source types and data synchronization solutions.

Data synchronization from or to data sources that are deployed in complex network environments

The batch synchronization feature supports data synchronization from or to Alibaba Cloud data sources, data centers, self-managed data sources that are hosted on Elastic Compute Service (ECS) instances, and data sources that do not belong to Alibaba Cloud. You can select appropriate network connectivity solutions to establish network connections between your resource group and data sources based on the network environments in which the data sources are deployed. Before you configure a data synchronization task, you must make sure that network connections are established between your resource group for Data Integration and data sources. For more information about how to establish a network connection between a resource group and a data source, see Network connectivity solutions.

Data synchronization scenarios

The batch synchronization feature allows you to synchronize data from a single table to another single table or synchronize data from tables in sharded databases to a single table. You can configure scheduling parameters for a batch synchronization task and use the task to periodically synchronize full data and incremental data in the source to the related partition in the destination table. You can also configure scheduling parameters for a batch synchronization task and use the data backfill feature provided in Operation Center to backfill the historical data of a specific period of time for the task. This way, you can use the task to synchronize the historical data to the specified partition or table in the destination database or data warehouse. For more information about scheduling parameters, see Supported formats of scheduling parameters.

Note
  • Data synchronization from tables in sharded databases is supported for database types such as MySQL, SQL Server, Oracle, PostgreSQL, PolarDB, and AnalyticDB. For more information, see Scenario: Configure a batch synchronization task to synchronize data from tables in sharded databases.

  • The batch synchronization feature can be used to synchronize data only from a single table or tables in sharded databases to a single table. If you want to synchronize data from tables in multiple databases to multiple tables, you can use a full-database batch synchronization task. For more information about how to select a data synchronization feature, see Overview.

Task configuration methods

You can configure a batch synchronization task by using methods such as the codeless UI or the code editor.

Note

For more information about the settings that are supported for configuring a batch synchronization task, see Configurations for a batch synchronization task.

O&M for batch synchronization tasks

  • Monitor the status of a batch synchronization task: You can monitor the status of a batch synchronization task and configure monitoring and alerting settings for the task based on a condition such as Incomplete, Error, or Complete. You can also configure DataWorks to send alert notifications to the specified alert recipient by email, text message, DingTalk chatbot, or webhook URL. For more information, see Create a custom alert rule.

  • Monitor the quality of table data: You can monitor the quality of table data that is synchronized to a destination. You can configure monitoring rules only for tables of specific types of databases. For more information, see Overview.

  • Isolate the same data source in different environments: You can add the same data source separately for the development environment and production environment. When you configure a batch synchronization task, the data source in the development environment is used. When you commit the task to and run the task in the production environment, the data source in the production environment is used. You can use the data source isolation feature to isolate the same data source in different environments.

Configurations for a batch synchronization task

[Figure: Task configuration]

Configuration

Description

Synchronize full or incremental data

You can configure a filter condition and scheduling parameters when you configure a batch synchronization task to synchronize incremental data from the source. The parameters that need to be configured to implement incremental synchronization vary based on the reader type. For more information, see Configure a batch synchronization task to synchronize only incremental data.

Configure field mappings, and add fields to a source table and assign values to the fields

When you configure a batch synchronization task, you can configure mappings between fields in the source and fields in the destination. The values of the fields in the source are written to the fields of the same data type in the destination based on the mappings.

  • Methods used to configure field mappings:

    • If you configure a batch synchronization task by using the codeless UI, you can map fields in the source to fields with the same names in the destination, map the fields in a row of the source to the fields in the same row of the destination, or customize mappings between all or specific fields in the source and all or specific fields in the destination. Data in the source fields that do not have mapped destination fields is not synchronized. Make sure that destination fields that do not have mapped source fields have default values or the default values of these destination fields are NULL. Otherwise, data may fail to be written to the destination.

    • If you configure a batch synchronization task by using the code editor, the system establishes mappings between fields in the source and fields in the destination based on the fields that you specify when you configure the related reader and writer. The number of fields to which you want to write data must be the same as the number of fields from which you want to read data. If the numbers are different, the batch synchronization task fails.

  • Add fields to a source table and assign values to the fields: You can add fields, such as constants and variables, to a source table. If you add variables to a source table as fields, you can assign values to the variables.
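The two mapping styles described above can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the logic that DataWorks runs: map_by_name and map_by_position are hypothetical helper names, and fields are represented as plain strings. map_by_name mirrors same-name mapping, in which unmapped source fields are not synchronized. map_by_position mirrors the code editor rule that the number of fields to read must equal the number of fields to write.

```python
from typing import Dict, List


def map_by_name(source: List[str], destination: List[str]) -> Dict[str, str]:
    """Map each source field to the destination field that has the same name.

    Source fields without a same-name destination field are left unmapped,
    so their data would not be synchronized.
    """
    destination_names = set(destination)
    return {name: name for name in source if name in destination_names}


def map_by_position(source: List[str], destination: List[str]) -> Dict[str, str]:
    """Map the n-th source field to the n-th destination field.

    Raises ValueError when the field counts differ, which models the
    code editor rule that a count mismatch makes the task fail.
    """
    if len(source) != len(destination):
        raise ValueError(
            f"field count mismatch: {len(source)} source, "
            f"{len(destination)} destination"
        )
    return dict(zip(source, destination))
```

For example, map_by_name(["id", "name", "extra"], ["id", "name", "ds"]) maps only id and name. The destination field ds then needs a default value or must accept NULL, as noted above.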

Specify the maximum number of parallel threads that can be used and the maximum transmission rate

  • When you configure a batch synchronization task, you can specify the maximum number of parallel threads that can be used to read data from the source and write data to the destination.

  • When you configure a batch synchronization task, you can specify the maximum transmission rate to prevent heavy read workloads on the source or heavy write workloads on the destination.

    Note

    If you do not specify the maximum transmission rate when you configure a batch synchronization task, data is transmitted at the maximum transmission rate that is allowed by the hardware.

Enable the distributed execution mode

Batch synchronization tasks for specific types of data sources can be run in distributed execution mode. If you enable the distributed execution mode for a batch synchronization task when you configure the task, the system splits the task into slices and runs the slices on multiple machines at the same time. In this mode, the data synchronization speed increases with the number of ECS instances. If you have high requirements for data synchronization performance, you can run your batch synchronization task in distributed execution mode. This mode can also use fragmented resources on ECS instances, which helps improve resource utilization.

Note

Support for the distributed execution mode varies based on the data source type. For more information about whether a data source supports the distributed execution mode, see the topic for the related reader or writer and parameters displayed in the DataWorks console.

Specify the maximum number of dirty data records that are allowed

Data Integration allows the generation of dirty data records by default. Data Integration also allows you to specify the maximum number of dirty data records that are allowed during data synchronization and define the impacts of dirty data records.

  • If you do not allow the generation of dirty data records and dirty data records are generated during data synchronization, the batch synchronization task fails.

  • If you allow the generation of dirty data records and specify the maximum number of dirty data records that are allowed during data synchronization, one of the following situations occurs:

    • If the number of dirty data records that are generated does not exceed the upper limit, the dirty data records are ignored and not written to the destination, and the batch synchronization task continues to run.

    • If the number of dirty data records that are generated exceeds the upper limit, the batch synchronization task fails.

Note

Dirty data is data that is meaningless to the business, does not match the specified data type, or causes an exception during data synchronization. A data record that fails to be written to the destination because of an exception is considered dirty data. For example, when a batch synchronization task attempts to write VARCHAR-type data from a source to an INT-type field in a destination, a data conversion error occurs and the data fails to be written to the destination. In this case, the data is dirty data. When you configure a batch synchronization task, you can specify whether dirty data is allowed and the maximum number of dirty data records that are allowed during data synchronization. If the number of generated dirty data records exceeds the upper limit that you specify, the synchronization task fails.
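The dirty data threshold can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not the Data Integration implementation: sync_records and DirtyDataLimitExceeded are hypothetical names, a conversion error stands in for any write exception, max_dirty=0 means that no dirty data is allowed, and max_dirty=None means that no limit is set. The task fails as soon as the dirty record count exceeds the limit.

```python
from typing import Any, Callable, Iterable, List, Optional, Tuple


class DirtyDataLimitExceeded(Exception):
    """Raised when more dirty records are generated than the limit allows."""


def sync_records(
    records: Iterable[Any],
    convert: Callable[[Any], Any],
    max_dirty: Optional[int],
) -> Tuple[List[Any], int]:
    """Convert and collect records, skipping dirty ones up to max_dirty.

    Returns the written records and the number of dirty records skipped.
    """
    written: List[Any] = []
    dirty = 0
    for record in records:
        try:
            written.append(convert(record))
        except (ValueError, TypeError):
            # The record cannot be converted to the destination type,
            # so it counts as dirty and is not written.
            dirty += 1
            if max_dirty is not None and dirty > max_dirty:
                raise DirtyDataLimitExceeded(
                    f"{dirty} dirty records exceed the limit of {max_dirty}"
                )
    return written, dirty
```

With the VARCHAR-to-INT example from the note, sync_records(["1", "2", "abc", "4"], int, max_dirty=1) skips "abc" and writes the other three values, whereas the same call with max_dirty=0 raises DirtyDataLimitExceeded.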

Use scheduling parameters in data synchronization

Batch synchronization of data in a table

If you use a batch synchronization task to synchronize data, you can configure scheduling parameters for the task to specify the path and scope of the data that you want to synchronize and the location to which you want to write the data. The method used to configure scheduling parameters for a batch synchronization task is the same as the method used to configure scheduling parameters for other types of tasks.

When a batch synchronization task is run, the scheduling parameters configured for the task are replaced with the actual values based on the value formats of the scheduling parameters. Then, the batch synchronization task synchronizes data based on the values.

Example: You want to configure a batch synchronization task to synchronize the order data that is generated on the previous day in an order table in a MySQL data source to the partition for the current day in a destination MaxCompute table every day. The source order table contains the field gmt_created that specifies the time when an order is created, and the destination MaxCompute table contains the partition field ds.

The incremental order data in the source order table is obtained by using a WHERE filter condition.

  • bizdate_yesterday specifies the date on which an incremental order is created. The date is one day earlier than the date on which the task is scheduled to run. The value format of the parameter is ${yyyy-mm-dd}.

  • bizdate_today specifies the date on which the incremental order data is synchronized, which is the date on which the task is scheduled to run. The value format of the parameter is $[yyyy-mm-dd].

  • bizdate_today and bizdate_yesterday are the names of scheduling parameters. You can specify the names based on your business requirements. When the task is run, the bizdate_today and bizdate_yesterday parameters are replaced with actual values based on the value formats of the parameters.

The partition in the destination MaxCompute table is also specified by a scheduling parameter. $bizdate specifies the data timestamp of the task. When the task is run, the partition filter expression configured for the task is replaced with the data timestamp specified by the scheduling parameter. For more information about how to configure and use scheduling parameters, see Configure and use scheduling parameters.
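The replacement of value formats with actual values can be sketched in Python. This is a minimal illustration, not the DataWorks scheduling engine: resolve_format is a hypothetical helper, it handles only the yyyy, mm, and dd tokens, and it assumes, as in the example above, that a ${...} format resolves against the day before the scheduled date and that a $[...] format resolves against the scheduled date itself.

```python
from datetime import date, timedelta


def _render(pattern: str, day: date) -> str:
    """Replace the yyyy, mm, and dd tokens in pattern with values from day."""
    return (
        pattern.replace("yyyy", f"{day.year:04d}")
        .replace("mm", f"{day.month:02d}")
        .replace("dd", f"{day.day:02d}")
    )


def resolve_format(value_format: str, scheduled: date) -> str:
    """Resolve a value format for a task scheduled to run on the given date.

    Assumption for this sketch: ${...} uses the day before the scheduled
    date, and $[...] uses the scheduled date.
    """
    if value_format.startswith("${") and value_format.endswith("}"):
        return _render(value_format[2:-1], scheduled - timedelta(days=1))
    if value_format.startswith("$[") and value_format.endswith("]"):
        return _render(value_format[2:-1], scheduled)
    raise ValueError(f"unsupported value format: {value_format}")
```

For a task that is scheduled to run on July 11, 2024, resolve_format("${yyyy-mm-dd}", date(2024, 7, 11)) returns "2024-07-10" for bizdate_yesterday, and resolve_format("$[yyyy-mm-dd]", date(2024, 7, 11)) returns "2024-07-11" for bizdate_today.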

Batch synchronization of all data in a database

For a batch synchronization task that is used to synchronize all data in a database, only the following scheduling parameters can be configured:

bizdate=${yyyymmdd} year=$[yyyy] month=$[mm] day=$[dd] hour=$[hh24]

When you configure the task, the following variables must be defined: ${bizdate}, ${year}, ${month}, ${day}, and ${hour}.

Example: If you want to configure a data synchronization task to synchronize full data from a source database in a single run and then periodically synchronize incremental data from the source database to MaxCompute, you can configure the filter condition STR_TO_DATE('${bizdate}', '%Y%m%d') <= columnName AND columnName < DATE_ADD(STR_TO_DATE('${bizdate}', '%Y%m%d'), interval 1 day) to obtain the daily incremental data that you want to periodically synchronize to MaxCompute.
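The filter condition above selects a half-open one-day window: the start of the bizdate day is included, and the start of the next day is excluded. The following Python sketch reproduces that window outside the database so that the boundary behavior is easy to check. The daily_window and in_window names are hypothetical, and the sketch assumes that columnName holds a date-time value.

```python
from datetime import datetime, timedelta
from typing import Tuple


def daily_window(bizdate: str) -> Tuple[datetime, datetime]:
    """Return the [start, end) window for a bizdate in yyyymmdd format."""
    start = datetime.strptime(bizdate, "%Y%m%d")
    return start, start + timedelta(days=1)


def in_window(value: datetime, bizdate: str) -> bool:
    """Check whether value falls in the window: start <= value < end."""
    start, end = daily_window(bizdate)
    return start <= value < end
```

Because the upper bound is exclusive, a record stamped exactly at midnight belongs to the following day's window only. Consecutive daily runs therefore neither skip nor duplicate records.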