Create a real-time synchronization task in DataStudio - DataWorks

After you prepare data sources, network environments, and resources, you can create a real-time synchronization task to synchronize incremental data from a single table or a database in the source to the destination in real time. This topic describes how to create such a task and how to view the running information of the task.

Prerequisites

The data sources that you want to use are prepared. Before you configure a data synchronization task, you must prepare the data sources from which you want to read data and to which you want to write data so that you can select them when you configure the task. For information about the data source types that support real-time synchronization and the configuration of a data source, see Data source types that support real-time synchronization.
Note
For information about the items that you need to understand before you add a data source, see Overview.
An exclusive resource group for Data Integration that meets your business requirements is purchased. For more information, see Create and use an exclusive resource group for Data Integration.
Network connections are established between the exclusive resource group for Data Integration and the data sources. For more information, see Network connectivity solutions.

Go to the DataStudio page

You must go to the DataStudio page in the DataWorks console to create and configure a real-time synchronization task.

Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. Then, choose Data Modeling and Development > DataStudio in the left-side navigation pane. On the page that appears, select the desired workspace from the drop-down list and click Go to DataStudio.

Procedure

Step 1: Create a real-time synchronization task
Step 2: Configure a resource group
Step 3: Configure the real-time synchronization task
Step 4: Commit and deploy the real-time synchronization task

Step 1: Create a real-time synchronization task

Create a workflow. For more information, see Create a workflow.

Create a real-time data synchronization task.

You can use one of the following methods to create a real-time synchronization task.
- Method 1: In the Scheduled Workflow pane of the DataStudio page, find the desired workflow in the Business Flow section and click the name of the workflow. Then, right-click Data Integration and choose Create Node > Real-time Synchronization.
- Method 2: In the Scheduled Workflow pane of the DataStudio page, find the desired workflow in the Business Flow section and double-click the name of the workflow. In the Data Integration section of the workflow configuration tab that appears, drag Real-time Synchronization to the canvas on the right.

In the Create Node dialog box, configure the parameters that are described in the following table.

Parameter	Description
Node Type	The type of the task. Default value: Real-time Synchronization.
Sync Method	To synchronize incremental data from a single table, set this parameter to End-to-end ETL. This method synchronizes incremental data from one or more source tables to a single destination table. Note: If you use this method, data can be written to only one destination table. To write data to multiple destination tables, use one of the following solutions: (1) If you want to filter data, replace strings, or mask data during data synchronization, create multiple real-time synchronization tasks and use each task to synchronize incremental data to a single table in real time. (2) If you want to synchronize incremental data from multiple source tables to multiple destination tables, create multiple real-time synchronization tasks. For specific types of data sources, you can also create a real-time synchronization task to synchronize all incremental data from a database. (3) If you want to synchronize full data at a time and then synchronize incremental data in real time to a destination, create a synchronization task in Data Integration. For more information, see Configure a synchronization task in Data Integration. To synchronize all incremental data from a database, select a synchronization method that is used to synchronize database changes, such as Migration to MaxCompute.
Path	The directory in which you want to store the real-time synchronization task.
Name	The task name cannot exceed 128 characters in length and can contain only letters, digits, underscores (_), and periods (.).
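
The naming rule in the preceding table can be checked before you create a task. The following Python sketch is only an illustration and is not part of DataWorks: the function name is_valid_task_name is hypothetical, and the sketch assumes that "letters" means ASCII letters, which the console may interpret more broadly.

```python
import re

# Assumption: "letters" means ASCII letters (A-Z, a-z).
# The pattern allows 1 to 128 letters, digits, underscores, and periods.
_TASK_NAME_PATTERN = re.compile(r"[A-Za-z0-9_.]{1,128}")


def is_valid_task_name(name: str) -> bool:
    """Return True if the name satisfies the documented naming rule."""
    return _TASK_NAME_PATTERN.fullmatch(name) is not None
```

For example, is_valid_task_name("rt_sync.orders_01") returns True, and is_valid_task_name("rt-sync") returns False because a hyphen is not an allowed character.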

Step 2: Configure a resource group

You can use only exclusive resource groups for Data Integration to run real-time synchronization tasks. To configure a resource group for a real-time synchronization task, perform the following steps:
1. Double-click the name of the created task.
2. In the right-side navigation pane of the configuration tab of the task, click the Basic Configuration tab.
3. On the Basic Configuration tab, select the exclusive resource group for Data Integration that is connected to the data source from the Resource Group drop-down list.

Note

We recommend that you run a real-time synchronization task and a batch synchronization task on different resource groups. If you run the tasks on the same resource group, the tasks compete for CPU, memory, and network resources and affect each other. In this case, the batch synchronization task may slow down, or the real-time synchronization task may be delayed. In severe cases, out of memory (OOM) errors may occur due to insufficient resources.

Step 3: Configure the real-time synchronization task

Configure a real-time synchronization task to synchronize incremental data from a single table

Configure the source.
1. In the Input section of the configuration tab of the real-time synchronization task, drag the desired source type to the canvas on the right.
2. Click the source and configure the parameters in the panel that appears.
  For information about the source types that are supported for a real-time synchronization task used to synchronize incremental data from a single table and how to configure the related sources, see Data source types that support real-time synchronization.
Optional: Configure a data conversion component.
If you want to convert source data into a desired format during data synchronization, you can configure a data conversion component.
1. In the Conversion section of the configuration tab of the real-time data synchronization task, drag the desired data conversion component to the canvas on the right.
  The following data conversion components are supported for a real-time synchronization task that is used to synchronize incremental data from a single table:
  - Data filtering: You can use the data filtering component to filter data in a source based on specific rules, such as the field size. Only data that meets the rules is retained.
  - String replacement: You can use the string replacement component to replace field values of the STRING data type.
  - Data masking: You can use the data masking component to mask sensitive data in a single source table specified in a real-time synchronization task and enable the task to store the masked data in a specified database.
2. Click the component and configure the parameters in the panel that appears.
Configure the destination.
1. In the Output section of the configuration tab of the real-time synchronization task, drag the desired destination type to the canvas on the right.
2. Click the destination and configure the parameters in the panel that appears.
  For information about the destination types that are supported for a real-time synchronization task used to synchronize incremental data from a single table and how to configure the related destinations, see Data source types that support real-time synchronization.
Connect the source to the destination.
After the source and destination are configured, you can connect them by drawing lines. This way, data can be synchronized between the data sources based on the configurations.

Configure a real-time synchronization task to synchronize incremental data from a database

Select the tables from which you want to read data and configure mapping rules.
1. In the Source section of the Configure Source and Synchronization Rule step, configure the Type and Data source parameters.
2. Select the tables from which you want to read data.
  In the Select Source Table for Synchronization section, all tables in the selected data source are displayed in the Source Table list. You can select all or some tables from the Source Table list and click the icon to move the tables to the Selected Tables list.
  Important
  If a selected table does not have a primary key, data in the table cannot be synchronized in real time.
3. Configure mapping rules for the names of the source tables and the names of the destination tables.
  After you select the source databases and tables from which you want to synchronize incremental data, the real-time synchronization task by default writes the data to the destination schema and table that have the same names as the source database and table. If no such destination schema or table exists, the system automatically creates the schema or table in the destination. To write data to a destination schema or table that has a different name, configure a mapping rule in the Set Mapping Rules for Table/Database Names section. For example, you can specify a destination table name in a mapping rule to write data from multiple source tables to the same table. You can also specify prefixes in a mapping rule to write data to a database or to tables whose names start with a different prefix from the source database or source tables.
  - Conversion Rule for Table Name: This type of mapping rule allows you to use a regular expression to map the names of source tables to the names of the destination tables to which you want to write data.
    Example 1: Synchronize data from the source tables whose names start with the prefix doc_ to the destination tables whose names start with the prefix pre_.
    Example 2: Synchronize data from multiple source tables to the same destination table.
    To synchronize incremental data from table_01, table_02, and table_03 to my_table, you can configure a mapping rule of the Conversion Rule for Table Name type, and set Source to table.* and Destination to my_table.
  - Rule for Destination Table Name: This type of mapping rule allows you to use a built-in variable to specify the names of the destination tables to which you want to write data and add a prefix and a suffix to the names of the destination tables. The following built-in variables are supported:
    ${db_table_name_src_transed}: the name of the destination table that is mapped based on a mapping rule of the Conversion Rule for Table Name type
    ${db_name_src_transed}: the name of the destination schema that is mapped based on a mapping rule of the Rule for Conversion Between Source Database Name and Destination Schema Name type
    ${ds_name_src}: the name of the source
    For example, you can configure pre_${db_table_name_src_transed}_post to convert the table name my_table that is generated in the previous example to pre_my_table_post.
  - Rule for Conversion Between Source Database Name and Destination Schema Name: This type of mapping rule allows you to use a regular expression to specify the names of the destination schemas to which you want to write data.
    Example: Synchronize data from the source schemas whose names start with the prefix doc_ to the destination schemas whose names start with the prefix pre_.
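
The following Python sketch illustrates how the mapping rules described above could behave. It is not the DataWorks implementation. The function names are hypothetical, Python regular expression syntax (including the \1 backreference) is assumed and may differ from the syntax that the console accepts, and the pattern doc_(.*) is an assumed way to express the prefix replacement in Example 1. The sketch also assumes that a rule applies only when the pattern matches the whole table name.

```python
import re
from string import Template


def convert_table_name(source_pattern: str, destination: str, table_name: str) -> str:
    """Sketch of a Conversion Rule for Table Name.

    If the source pattern matches the whole table name, return the
    destination with backreferences expanded. Otherwise, keep the name.
    """
    match = re.fullmatch(source_pattern, table_name)
    if match is None:
        return table_name
    return match.expand(destination)


def render_destination_name(rule: str, table: str, schema: str, source: str) -> str:
    """Sketch of a Rule for Destination Table Name.

    string.Template uses the same ${name} placeholder form as the
    built-in variables listed above.
    """
    return Template(rule).substitute(
        db_table_name_src_transed=table,
        db_name_src_transed=schema,
        ds_name_src=source,
    )


# Example 1 (assumed pattern): the prefix doc_ becomes the prefix pre_.
print(convert_table_name(r"doc_(.*)", r"pre_\1", "doc_orders"))  # pre_orders

# Example 2: several source tables are written to one destination table.
for name in ("table_01", "table_02", "table_03"):
    print(convert_table_name("table.*", "my_table", name))  # my_table

# Prefix and suffix are added to the converted table name.
print(render_destination_name(
    "pre_${db_table_name_src_transed}_post", "my_table", "my_schema", "my_source"
))  # pre_my_table_post
```

The schema name and source name passed in the last call (my_schema and my_source) are placeholder values that the rule in this example does not use.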
Select a destination and configure destination tables or destination topics.
1. In the Set Destination Table step, configure the basic information of the destination. For example, you can configure the Write Mode and Automatic Partitioning by Time parameters. The required configurations vary based on the data source type. The parameters that are displayed in the DataWorks console prevail.
2. Click Refresh Source table and destination table mapping to map the source tables to destination tables.
  You can specify custom names for destination schemas and destination tables. You can also click Edit additional fields in the Actions column to add additional fields to a destination table and assign constants or variables to the additional fields as values. The required configurations vary based on the data source type. The parameters that are displayed in the DataWorks console prevail.
  Note
  The mapping may require a long period of time if data is synchronized from a large number of tables.
Optional: Configure DML processing rules at the table level.
DataWorks allows you to configure table-level DML processing rules for some synchronization tasks. You can configure processing rules for the messages that are generated for INSERT, UPDATE, and DELETE operations performed on a source table.
Note
Support for synchronizing data changes generated by DML operations varies based on the destination type. You can check whether a synchronization task supports the configuration of DML processing rules when you configure the synchronization task in the DataWorks console. For more information, see Supported DML and DDL operations.

Configure DDL processing rules.

DDL operations may be performed on a source table. When you configure a real-time synchronization task, you can configure processing rules for messages that are generated for different DDL operations based on your business requirements. Support for synchronizing data changes generated by DDL operations varies based on the destination type. For more information, see Supported DML and DDL operations.

Note

You can also configure processing rules for a specific destination type. To configure processing rules for a specific destination type, perform the following steps:
1. In the left-side navigation pane of the Data Integration page, choose Configuration Options > Processing Policy for DDL Messages in Real-time Sync.
2. On the Processing Policy for DDL Messages in Real-time Sync page, configure DDL processing rules.
The following table describes processing rules for different types of DDL messages.

DDL message type	Processing rule
CreateTable, DropTable, AddColumn, DropColumn, RenameTable, RenameColumn, ChangeColumn, TruncateTable	DataWorks processes a DDL message of each of these types based on one of the following rules. Normal treatment: DataWorks sends the DDL message to the destination, and the destination processes the DDL message. Different destinations respond to DDL messages in different ways. Therefore, DataWorks only sends the message to the destination. Ignore: DataWorks discards the DDL message and does not deliver the message to the destination. Alert: DataWorks discards the DDL message and generates an alert in real-time synchronization logs. The alert indicates that the message is discarded due to an execution error. Error: DataWorks terminates the real-time synchronization task and sets the task status to Failed.
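
The four processing rules can be read as a simple dispatch on the rule that is configured for each DDL message type. The following Python sketch only illustrates that reading. All names in it (DdlPolicy, SyncTaskFailed, handle_ddl_message, and the forward and log callbacks) are hypothetical and do not correspond to a DataWorks API.

```python
from enum import Enum
from typing import Callable, Dict


class DdlPolicy(Enum):
    NORMAL = "Normal treatment"
    IGNORE = "Ignore"
    ALERT = "Alert"
    ERROR = "Error"


class SyncTaskFailed(Exception):
    """Hypothetical stand-in for a task whose status is set to Failed."""


def handle_ddl_message(
    message_type: str,
    policies: Dict[str, DdlPolicy],
    forward: Callable[[str], None],
    log: Callable[[str], None],
) -> str:
    """Apply the configured rule to one DDL message and report the outcome."""
    policy = policies[message_type]
    if policy is DdlPolicy.NORMAL:
        # Only forward the message; the destination decides how to respond.
        forward(message_type)
        return "forwarded"
    if policy is DdlPolicy.IGNORE:
        # Discard silently.
        return "discarded"
    if policy is DdlPolicy.ALERT:
        # Discard, but record an alert in the synchronization log.
        log(f"ALERT: DDL message {message_type} was discarded")
        return "discarded"
    # DdlPolicy.ERROR: stop the task.
    raise SyncTaskFailed(f"DDL message {message_type} terminated the task")
```

In this sketch, a policies dictionary such as {"AddColumn": DdlPolicy.NORMAL, "DropTable": DdlPolicy.ERROR} would forward AddColumn messages to the destination and stop the task when a DropTable message arrives.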

Configure the resources that are required to run the real-time synchronization task.
- You can specify the maximum number of parallel threads that can be used to read data from the source and write data to the destination.
- You can specify whether dirty data is allowed during data synchronization.
  - Not allowed: If dirty data records are generated during data synchronization, the real-time synchronization task fails.
  - Allowed: If dirty data records are generated during data synchronization, the dirty data records are ignored rather than being written to the destination and the real-time synchronization task continues to run.
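
The difference between the two dirty data settings can be sketched as follows. This Python example is an illustration under stated assumptions, not the behavior of the DataWorks writer: the function write_records and its parameters are hypothetical, and a dirty data record is modeled as a record for which the write callback raises ValueError.

```python
from typing import Any, Callable, Iterable, Tuple


def write_records(
    records: Iterable[Any],
    write: Callable[[Any], None],
    allow_dirty_data: bool,
) -> Tuple[int, int]:
    """Write records and return (written_count, skipped_dirty_count).

    Assumption: write() raises ValueError for a dirty data record.
    """
    written = 0
    skipped = 0
    for record in records:
        try:
            write(record)
            written += 1
        except ValueError as error:
            if not allow_dirty_data:
                # Not allowed: the first dirty record fails the whole task.
                raise RuntimeError("task failed because of dirty data") from error
            # Allowed: the dirty record is ignored and the task continues.
            skipped += 1
    return written, skipped
```

With allow_dirty_data set to True, the call returns the number of records that were written and the number that were ignored. With allow_dirty_data set to False, the first dirty record raises an error, which corresponds to a failed task.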
Click Complete.

Step 4: Commit and deploy the real-time synchronization task

Click the icon in the top toolbar to save the task.
Click the icon in the top toolbar to commit the task.
In the Submit dialog box, enter a description in the Change description field.
Click Confirm.
If you use a workspace in standard mode, you must deploy the task to the production environment after you commit the task. In the top navigation bar, click Deploy. For more information, see Deploy tasks.

What to do next

After the configuration of the real-time synchronization task is complete, you can start and manage the task on the Real-time Synchronization Tasks page in Operation Center. To go to the Real-time Synchronization Tasks page, perform the following steps:
1. Log on to the DataWorks console and go to the Operation Center page.
2. In the left-side navigation pane of the Operation Center page, choose Real-time Node O&M > Real-time Synchronization Tasks.
For more information, see O&M for real-time synchronization tasks.

Appendix: Migrate tasks

After you configure a real-time synchronization task on the DataStudio page to synchronize data from a single table, you can click Migrate to Data Integration on the configuration tab of the task to migrate the task to the Data Integration page.

Note

Only the following types of real-time synchronization tasks can be migrated:

A real-time synchronization task used to synchronize data from a single Kafka topic to MaxCompute
A real-time synchronization task used to synchronize data from a single Kafka topic to Hologres

Find the desired real-time synchronization task used to synchronize data from a single table in the Scheduled Workflow pane of the DataStudio page, double-click the task name to go to the configuration tab of the task, and then click Migrate to Data Integration on the configuration tab to migrate the task.
In the upper-left corner of the DataStudio page, click the icon and choose All Products > Data Integration. The Synchronization Task page appears. On the Synchronization Task page, view the real-time synchronization task that is migrated from DataStudio.

Note

After you migrate a real-time synchronization task to Data Integration, you can directly perform O&M operations on the task in Data Integration without the need to go to Operation Center. The real-time synchronization task is no longer displayed in Operation Center. The migration operation does not affect saved task configurations or running tasks.
After you migrate a real-time synchronization task from DataStudio, the original task is moved to the recycle bin in DataStudio. You can modify the task and perform O&M operations on the task only on the Synchronization Task page in Data Integration.