DataWorks provides a codeless user interface (UI) that lets you develop a synchronization task without writing code. You only need to select the data sources that you want to use for data synchronization and configure the required scheduling parameters to periodically synchronize full or incremental data from a single source table or tables in sharded source databases to a single destination table. This topic describes how to configure a batch synchronization task by using the codeless UI. The configurations that are required vary based on the data source type. For more information, see Supported data source types, readers, and writers.
Background information
The batch synchronization feature of Data Integration provides readers and writers for you to read data from and write data to data sources. You can configure batch synchronization tasks for different types of data sources by using the codeless UI or code editor to synchronize data from a single source table or tables in sharded source databases to a single destination table. For more information, see Overview of the batch synchronization feature.
Usage notes
You cannot configure batch synchronization tasks for specific types of data sources by using the codeless UI. If the system displays a message indicating that the current data source does not support the codeless UI when you configure a batch synchronization task, you can click the icon in the top toolbar of the configuration tab of the task in DataStudio to switch to the code editor and configure the task by using the code editor. For more information, see Configure a batch synchronization task by using the code editor.
The codeless UI is easy to use but provides only limited features. If you want to make finer-grained configurations for your batch synchronization task, you can click the Conversion script icon in the top toolbar of the configuration tab of the task in DataStudio to switch to the code editor and configure the task by using the code editor.
Prerequisites
The data sources that you want to use are prepared. Before you configure a synchronization task, you must add the database from which you want to read data and the database to which you want to write data to the desired workspace as data sources on the Data Sources page in Management Center in the DataWorks console. You can then select these data sources when you configure the synchronization task. For information about how to configure readers and writers of different data source types, see the topics in the list of data sources directory.
Note
Before you configure a batch synchronization task, you must carefully read the parameter descriptions in the topics for the related data source types. This ensures that you can successfully configure a batch synchronization task.
For information about the data source types that are supported by batch synchronization and the addition of data sources, see Supported data source types, readers, and writers.
For information about the items that you need to understand before you add a data source, see Overview.
An exclusive resource group for Data Integration that meets your business requirements is purchased. For more information, see Create and use an exclusive resource group for Data Integration.
Network connections are established between the exclusive resource group for Data Integration and the data sources. For more information, see Network connectivity solutions.
Go to the DataStudio page
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
Step 1: Create a batch synchronization task
Create a workflow. For more information, see Create a workflow.
Create a batch synchronization task.
You can use one of the following methods to create a batch synchronization task:
Method 1: Log on to the DataWorks console and go to the Scheduled Workflow pane of the DataStudio page. In the Scheduled Workflow pane, find the created workflow and click its name. Right-click Data Integration and choose .
Method 2: Log on to the DataWorks console and go to the Scheduled Workflow pane of the DataStudio page. In the Scheduled Workflow pane, find the created workflow and double-click its name. In the Data Integration section of the workflow editing tab that appears, drag Offline synchronization to the canvas on the right.
In the Create Node dialog box, configure the parameters to create a batch synchronization task.
Step 2: Establish network connections between the exclusive resource group for Data Integration and the data sources
Select the source, destination, and exclusive resource group for Data Integration, and establish network connections between the resource group and data sources.
Note
The shared resource group for Data Integration (debugging) of DataWorks is being discontinued. We recommend that you use serverless resource groups. For more information about serverless resource groups, see Use serverless resource groups. For more information about the discontinuation, see Notice on the discontinuation of the shared resource group for Data Integration (debugging).
If you use a serverless resource group to run a synchronization task, you can specify an upper limit for the number of CUs that can be used to run a synchronization task. If an out of memory (OOM) error is reported for the synchronization task due to insufficient resources, you can appropriately change the upper limit.
Step 3: Configure the source and destination
In the data source selection section, select the tables from which you want to read data and the tables to which you want to write data, and specify a filter condition when you configure the source.
Important
The operations that you can perform when you configure a batch synchronization task vary based on the reader or writer type. The following tables describe the common operations that you can perform when you configure a batch synchronization task. For information about the operations supported by a reader or writer and how to perform the operations, see the topic for the related reader or writer. For more information, see Supported data source types, readers, and writers.
Step 4: Configure mappings between source fields and destination fields
After the source and destination are configured, you must configure mappings between source fields and destination fields. After the mappings between source fields and destination fields are configured, the batch synchronization task writes the values of the source fields to the destination fields of the same data type based on the mappings.
The data type of a source field may differ from that of its mapped destination field. In this case, values of the source field that cannot be converted to the destination data type cannot be written to the destination field. Values that fail to be written to the destination are considered dirty data. You can specify the maximum number of dirty data records that are allowed during data synchronization in the Configure channel control policies step.
Note
If a source field has no mapped destination field, data in the source field cannot be synchronized to the destination.
If the mappings that are automatically established by the system do not meet your business requirements, you must manually modify the mappings.
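The relationship between field mappings, data types, and dirty data can be illustrated with a minimal Python sketch. It is not how Data Integration is implemented. The field names, the `MAPPINGS` and `DEST_TYPES` tables, and the `write_record` helper are all hypothetical and exist only to show the behavior described above: unmapped source fields are skipped, and a value that cannot be converted to the destination type makes the record dirty.

```python
from typing import Any, Callable, Dict, Optional, Tuple

# Hypothetical destination schema: destination field -> converter for its data type.
DEST_TYPES: Dict[str, Callable[[Any], Any]] = {"user_id": int, "user_name": str}

# Hypothetical field mappings: source field -> destination field.
# The source field "memo" has no mapping, so it is never synchronized.
MAPPINGS: Dict[str, str] = {"id": "user_id", "name": "user_name"}


def write_record(record: Dict[str, Any]) -> Tuple[Optional[Dict[str, Any]], bool]:
    """Return (destination_row, is_dirty) for one source record."""
    row: Dict[str, Any] = {}
    for src_field, dest_field in MAPPINGS.items():
        try:
            # Convert the source value to the destination field's data type.
            row[dest_field] = DEST_TYPES[dest_field](record[src_field])
        except (TypeError, ValueError):
            # A conversion failure means the record cannot be written: dirty data.
            return None, True
    return row, False


records = [
    {"id": "1", "name": "alice", "memo": "ignored"},
    {"id": "abc", "name": "bob", "memo": "ignored"},  # "abc" cannot become an INT
]
results = [write_record(r) for r in records]
written = [row for row, dirty in results if not dirty]
dirty_count = sum(1 for _, dirty in results if dirty)
```

In this sketch, the first record is written without the unmapped `memo` field, and the second record is counted as dirty because its `id` value cannot be converted to an integer.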
You can map source fields to destination fields whose names are the same as those of source fields or map fields in a row of the source to the fields in the same row of the destination. When you configure field mappings, you can also perform the following operations:
Add fields to a source table and assign values to the fields: You can click Add a row in the source field list to add fields to the source table. The fields can be constants, scheduling parameters, and built-in variables that are enclosed in single quotation marks ('), such as '123', '${Scheduling parameter name}', and '#{Built-in variable name}#'.
Note
If you add scheduling parameters to the source table as fields, you can assign values to the scheduling parameters when you configure scheduling properties for the batch synchronization task. For information about scheduling parameters, see Supported formats of scheduling parameters.
Manually add built-in variables to a source table as fields and map the fields to the fields in the destination table. The added built-in variables are synchronized to the destination table during data synchronization.
The following table describes the built-in variables that are available for different readers.
Built-in variable | Description | Reader |
'#{DATASOURCE_NAME_SRC}#' | The name of a source. | MySQL Reader, MySQL Reader (Sharding), PolarDB Reader, PolarDB Reader (Sharding), PostgreSQL Reader |
'#{DB_NAME_SRC}#' | The name of the database to which a source table belongs. | MySQL Reader, MySQL Reader (Sharding), PolarDB Reader, PolarDB Reader (Sharding), PostgreSQL Reader |
'#{SCHEMA_NAME_SRC}#' | The name of the schema to which a source table belongs. | PolarDB Reader, PolarDB Reader (Sharding), PostgreSQL Reader |
'#{TABLE_NAME_SRC}#' | The name of a source table. | MySQL Reader, MySQL Reader (Sharding), PolarDB Reader, PolarDB Reader (Sharding), PostgreSQL Reader |
Edit fields in a source table: You can click the icon in the source field list to perform the following operations:
Use a function that is supported by the source to process fields in the source table. For example, you can use the Max(id) function to implement synchronization of data in the row with the largest ID in the source table.
If only some fields in the source table are displayed when you configure field mappings, edit the fields in the source table.
Note
Functions are not supported if you configure a batch synchronization task that uses MaxCompute Reader.
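The three kinds of values that can be added as source fields (constants, scheduling parameters, and built-in variables) differ only in their quoting syntax. The following Python sketch shows one way to tell them apart. It is an illustration, not part of DataWorks: the `classify_field` function and the regular expressions are assumptions based only on the formats '123', '${Scheduling parameter name}', and '#{Built-in variable name}#' given above, and `bizdate` is a hypothetical scheduling parameter name.

```python
import re
from typing import Tuple

# Patterns derived from the documented formats. All values are wrapped in single quotes.
SCHEDULING_PARAM = re.compile(r"^'\$\{(\w+)\}'$")  # '${name}'
BUILT_IN_VAR = re.compile(r"^'#\{(\w+)\}#'$")      # '#{NAME}#'
CONSTANT = re.compile(r"^'(.*)'$")                  # any other quoted value


def classify_field(expr: str) -> Tuple[str, str]:
    """Classify a source field expression and return (kind, inner value)."""
    match = SCHEDULING_PARAM.match(expr)
    if match:
        return "scheduling_parameter", match.group(1)
    match = BUILT_IN_VAR.match(expr)
    if match:
        return "built_in_variable", match.group(1)
    match = CONSTANT.match(expr)
    if match:
        return "constant", match.group(1)
    # Anything unquoted is treated as a column name or a source-side function call.
    return "source_field_or_expression", expr


for expr in ["'123'", "'${bizdate}'", "'#{TABLE_NAME_SRC}#'", "Max(id)"]:
    print(expr, "->", classify_field(expr))
```

The more specific patterns are checked before the generic constant pattern, because a quoted scheduling parameter or built-in variable would otherwise also match as a constant.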
Step 5: Configure channel control policies
You can configure channel control policies to define attributes for data synchronization. For information about the related parameters, see Channel control settings for batch synchronization.
Parameter | Description |
Task Expected Maximum Concurrency | The maximum number of parallel threads that the batch synchronization task uses to read data from the source or write data to the destination. Note: The actual number of parallel threads that are used during data synchronization may be less than or equal to the specified threshold due to the specifications of the exclusive resource group for Data Integration. You are charged for the exclusive resource group for Data Integration based on the number of parallel threads that are used. For more information, see Performance metrics. DataWorks uses resource groups for scheduling to issue batch synchronization tasks in Data Integration to resource groups for Data Integration and run the tasks by using the resource groups for Data Integration. You are charged for using resource groups for scheduling to schedule batch synchronization tasks based on the number of tasks. For more information about the task issuing mechanism, see Mechanism for issuing tasks that are run on old-version resource groups. |
Synchronization rate | Specifies whether to enable throttling. If you enable throttling, you can specify a maximum transmission rate to prevent heavy read workloads on the source. The minimum value of this parameter is 1 MB/s. If you do not enable throttling, data is transmitted at the maximum transmission rate allowed by the hardware based on the specified maximum number of parallel threads. Note: The bandwidth is a metric provided by Data Integration and does not represent the actual traffic of an elastic network interface (ENI). In most cases, the ENI traffic is one to two times the channel traffic. The actual ENI traffic depends on the serialization of the data storage system. |
Policy for Dirty Data Records | The maximum number of dirty data records allowed. Important: If a large amount of dirty data is generated during data synchronization, the overall data synchronization speed is affected. If this parameter is not configured, dirty data records are allowed, and the batch synchronization task continues to run when dirty data records are generated. If you set this parameter to 0, no dirty data records are allowed, and the task fails if any dirty data record is generated. If you specify a value that is greater than 0 and the number of dirty data records is less than or equal to that value, the dirty data records are ignored and are not written to the destination, and the task continues to run. If the number of dirty data records is greater than that value, the task fails. Note: Dirty data is data that is meaningless to the business, does not match the specified data type, or leads to an exception during data synchronization. A data record that fails to be written to the destination because of an exception is considered dirty data. For example, when a batch synchronization task attempts to write VARCHAR-type data in a source to an INT-type field in a destination, a data conversion error occurs, the data fails to be written to the destination, and the data is considered dirty data. When you configure a batch synchronization task, you can control whether dirty data is allowed and specify the maximum number of dirty data records that are allowed. If the number of generated dirty data records exceeds that limit, the task fails and exits. |
Distributed Execution | Specifies whether to enable the distributed execution mode for the batch synchronization task. If you enable the distributed execution mode for a batch synchronization task, the system splits the task into slices and distributes them to multiple Elastic Compute Service (ECS) instances for parallel running. In this case, the more ECS instances, the higher the data synchronization speed. If you do not enable the distributed execution mode for a batch synchronization task, the specified maximum number of parallel threads is used only for a single ECS instance to run the task. If you have a high requirement for data synchronization performance, you can use the distributed execution mode to run your batch synchronization task. If you run your batch synchronization task in distributed execution mode, fragment resources of ECS instances can be utilized. This improves resource utilization. Important: If you use an exclusive resource group and the resource group contains only one ECS instance, we recommend that you do not run your batch synchronization task in distributed execution mode. If one ECS instance can meet your business requirements for data transmission speed, you do not need to enable the distributed execution mode. This can simplify the execution mode of your task. The distributed execution mode can be enabled only if the maximum number of parallel threads that you specified is greater than or equal to 8. If you enable the distributed execution mode for a batch synchronization task, more resources will be occupied. If an out of memory (OOM) error is reported during the running of the batch synchronization task, you can disable the distributed execution mode. |
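The conditions for the distributed execution mode in the preceding table can be summarized as a small decision helper. The following Python sketch is only a reading aid. The function name `distributed_execution_advice` and its return strings are hypothetical. Only the two rules it encodes come from the table: the mode requires a maximum concurrency of at least 8, and it is not recommended when the exclusive resource group contains a single ECS instance.

```python
def distributed_execution_advice(max_concurrency: int, ecs_instances: int) -> str:
    """Summarize whether the distributed execution mode is worth enabling."""
    if max_concurrency < 8:
        # The mode can be enabled only when the maximum concurrency is >= 8.
        return "unavailable: maximum concurrency must be at least 8"
    if ecs_instances <= 1:
        # With one ECS instance there is nothing to distribute slices across.
        return "not recommended: the resource group contains only one ECS instance"
    return "available"


print(distributed_execution_advice(max_concurrency=4, ecs_instances=3))
print(distributed_execution_advice(max_concurrency=8, ecs_instances=1))
print(distributed_execution_advice(max_concurrency=16, ecs_instances=3))
```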
Note
In addition to the preceding configurations, the overall data synchronization speed of a batch synchronization task is also affected by factors such as the performance of the source and the network environment for data synchronization. For information about the data synchronization speed and performance tuning of a batch synchronization task, see Speed up or slow down the batch synchronization process.
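The three cases of the dirty data policy described in the preceding table (not configured, set to 0, and set to a value greater than 0) reduce to one comparison. The following Python sketch makes that explicit. The function name `task_fails` and the use of `None` to represent an unconfigured parameter are assumptions made for illustration.

```python
from typing import Optional


def task_fails(dirty_count: int, max_dirty: Optional[int]) -> bool:
    """Return True if the task fails under the dirty data policy.

    max_dirty is None when the parameter is not configured.
    """
    if max_dirty is None:
        # Not configured: dirty data records are allowed and the task keeps running.
        return False
    # 0 means no dirty data records are allowed; N > 0 tolerates up to N records.
    return dirty_count > max_dirty


print(task_fails(dirty_count=5, max_dirty=None))  # not configured
print(task_fails(dirty_count=1, max_dirty=0))     # no dirty data allowed
print(task_fails(dirty_count=10, max_dirty=10))   # at the limit
print(task_fails(dirty_count=11, max_dirty=10))   # over the limit
```

A limit of 0 needs no special handling, because any positive dirty record count already exceeds it.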
Step 6: Configure scheduling properties for the batch synchronization task
If you want DataWorks to periodically schedule your batch synchronization task, you must configure scheduling properties for the task. This step describes how to configure scheduling properties for a batch synchronization task. You can go to the configuration tab of the batch synchronization task and click Properties in the right-side navigation pane to configure scheduling properties for the task. For information about how to use scheduling parameters, see Description for using scheduling parameters in data synchronization.
Configure scheduling parameters: If you use variables when you configure the batch synchronization task, you can assign scheduling parameters to the variables as values.
Configure time properties: The time properties define the mode in which the batch synchronization task is scheduled in the production environment. In the Schedule section of the Properties tab of the task, you can configure properties such as the instance generation mode, scheduling type, and scheduling cycle for the task.
Configure the resource property: The resource property defines the exclusive resource group for scheduling that is used to issue the batch synchronization task to the related exclusive resource group for Data Integration. You can select the exclusive resource group for scheduling that you want to use in the Resource Group section of the Properties tab.
Note
DataWorks uses resource groups for scheduling to issue batch synchronization tasks in Data Integration to resource groups for Data Integration and uses the resource groups for Data Integration to run the tasks. You are charged for using the resource groups for scheduling to schedule batch synchronization tasks. For information about the task issuing mechanism, see Mechanism for issuing tasks that are run on old-version resource groups.
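Assigning a scheduling parameter to a variable is conceptually similar to template substitution: a placeholder in the task configuration, such as a variable in the filter condition of the source, is replaced with a concrete value for each scheduled run. The following Python sketch illustrates the idea with the standard library `string.Template` class, which uses the same `${name}` placeholder syntax. It does not reproduce how DataWorks resolves parameters. The column name `gmt_modified`, the variable name `bizdate`, and the date value are hypothetical.

```python
from string import Template

# Hypothetical filter condition that uses a variable in ${name} form.
filter_template = Template("gmt_modified >= '${bizdate}'")

# Hypothetical value that the scheduling system would assign for one run.
run_parameters = {"bizdate": "20240101"}

# substitute() raises KeyError if a variable has no assigned value, which
# mirrors the need to assign a value to every variable used in the task.
where_clause = filter_template.substitute(run_parameters)
print(where_clause)
```

Because the value changes with each scheduled run, the same task configuration can read a different slice of data every time, which is how incremental synchronization is typically driven.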
Step 7: Commit and deploy the batch synchronization task
If you want to periodically run the batch synchronization task, you must deploy the task to the production environment. For more information about how to deploy a task, see Deploy tasks.
What to do next
After the batch synchronization task is deployed to the production environment, you can go to Operation Center in the production environment to view the task. For information about how to perform O&M operations for a batch synchronization task, such as running and managing the task, monitoring the status of the task, and performing O&M for the resource group that is used to run the task, see O&M for batch synchronization tasks.