DataWorks:Configure full and incremental synchronization

Last Updated: Jan 22, 2026

Data Integration supports full and incremental (near real-time) synchronization from sources such as ApsaraDB for OceanBase, MySQL, Oracle, and PolarDB to MaxCompute. This solution integrates full data migration with real-time incremental synchronization, merging data at the destination on a T+1 basis. This topic describes how to create a full and incremental (near real-time) task using MySQL as the source and MaxCompute as the destination.

How it works

The full and incremental synchronization task employs a unified process for initial full data migration and continuous incremental synchronization. Once the task starts, the system automatically creates and coordinates batch and real-time subtasks to merge and write data to the target table (Base table).

The core process consists of three phases:

  1. Full initialization: Upon startup, a batch synchronization subtask migrates table schemas and historical data from all source tables to the target Base table in MaxCompute. Once full data initialization is complete, this batch synchronization task is frozen.

  2. Incremental data synchronization: After the full migration concludes, a real-time synchronization subtask starts continuously capturing incremental changes (Insert, Update, Delete) from the source database (e.g., via MySQL binary logs) and writes them to a temporary Log table in MaxCompute in near real-time.

  3. Periodic merge (Merge): A daily (T+1) Merge task combines the incremental data accumulated in the Log table on day T with the full data in the Base table, generates the latest full snapshot for day T, and writes it to a new partition of the Base table.

Using a partitioned table as an example, the data flow is as follows:

(Figure: data flow for full and incremental synchronization to a partitioned table, from full initialization through incremental writes to the Log table and the daily merge into a new Base table partition)
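
Conceptually, the daily merge behaves like the following MaxCompute SQL. This is a simplified sketch only: the actual Merge task is generated by DataWorks, and the table names base_orders and log_orders, the column names, and the ${lastdate} parameter standing for the previous day's partition are illustrative assumptions.

    -- Simplified sketch of the T+1 merge; the task DataWorks generates may differ.
    -- Keep the latest version of each primary key from day T's Log partition,
    -- overlaid on the previous Base snapshot, and write a new Base partition.
    INSERT OVERWRITE TABLE base_orders PARTITION (ds = '${bizdate}')
    SELECT id, col_a, col_b
    FROM (
        SELECT id, col_a, col_b, op_type,
               ROW_NUMBER() OVER (PARTITION BY id
                                  ORDER BY change_seq DESC) AS rn
        FROM (
            -- Incremental changes captured on day T (Insert/Update/Delete)
            SELECT id, col_a, col_b, op_type, change_seq
            FROM log_orders
            WHERE ds = '${bizdate}'
            UNION ALL
            -- Previous full snapshot: sentinel op type, lowest sequence number
            SELECT id, col_a, col_b, 'I' AS op_type, -1 AS change_seq
            FROM base_orders
            WHERE ds = '${lastdate}'
        ) changes
    ) ranked
    WHERE rn = 1            -- the latest version of each primary key wins
      AND op_type <> 'D';   -- drop keys whose latest change is a Delete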

This synchronization task features the following:

  • Multi-table to multi-table/single-table: Supports synchronizing multiple source tables to corresponding target tables or merging data from multiple source tables into a single target table using mapping rules.

  • Task components: A full database synchronization task consists of a batch synchronization subtask for full initialization, a real-time synchronization subtask for incremental synchronization, and a Merge task for consolidating data.

  • Target table support: Supports writing data to both partitioned and non-partitioned tables in MaxCompute.

Usage notes

  • Resource requirements: The task requires a Serverless resource group or an exclusive resource group for Data Integration. When synchronizing in instance mode, the minimum resource specifications are 8 vCPUs and 16 GB for an exclusive resource group for Data Integration, or 2 CUs for a Serverless resource group.

  • Network connectivity: Ensure network connectivity between the Data Integration resource group and both the source (e.g., MySQL) and target (e.g., MaxCompute) data sources. For details, see Overview of network connectivity solutions.

  • Region restrictions: Synchronization is supported only for self-managed MaxCompute data sources located in the same region as the current DataWorks workspace. When using a self-managed MaxCompute data source, you must bind MaxCompute computing resources in DataWorks DataStudio; otherwise, MaxCompute SQL nodes cannot be created, causing the full synchronization "done" node creation to fail.

  • Scheduling resource group restrictions: The batch full synchronization subtask requires a configured resource group. Shared resource groups for scheduling are not supported.

  • Target table type restrictions: Synchronization to MaxCompute external tables is not supported.

Precautions

  • Primary key requirements: Tables without primary keys are not supported. You must manually specify one or more columns as the business primary key (Custom Primary Key) during configuration. A query for finding source tables without primary keys appears after this list.

  • AccessKey (AK) validity: If you use a temporary AccessKey (AK) for synchronization, the task fails when the AK expires (temporary AKs expire automatically after 7 days). The platform automatically restarts the task when it detects a failure caused by an expired temporary AK. If monitoring is configured for this task type, you also receive an alert.

  • Data visibility latency: On the day you configure a full and incremental (near real-time) task to MaxCompute, only the historical full data can be queried. Incremental data becomes available in MaxCompute after the Merge task completes on the following day. For details, see the data writing section in How it works.

  • Storage and lifecycle: The full and incremental (near real-time) synchronization task generates a full partition every day. To control storage costs, MaxCompute tables automatically created by the task have a default lifecycle of 30 days. If this does not meet your business needs, you can modify the lifecycle by clicking the corresponding MaxCompute table name during task configuration (for a SQL alternative, see the example after this list). For details, see Edit target table schema (Optional).

  • SLA: Data Integration uses the MaxCompute engine synchronization data tunnel for data upload and download (for SLA details, see Data upload scenarios and tools). Evaluate your technical choices based on the MaxCompute engine synchronization data tunnel SLA.

  • Binlog retention policy: Real-time synchronization relies on the binary logs (binlogs) of the source MySQL database. Ensure that the binlog retention period is long enough to prevent synchronization interruptions caused by missing start offsets during long pauses or failure retries (see the retention check after this list).
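
As noted above, tables without primary keys cannot be synchronized. The following MySQL query, run against the source, lists the tables in a database that lack one; the database name your_db is a placeholder:

    -- Find source tables without a primary key (replace your_db).
    SELECT t.table_name
    FROM information_schema.tables t
    LEFT JOIN information_schema.table_constraints c
           ON c.table_schema = t.table_schema
          AND c.table_name = t.table_name
          AND c.constraint_type = 'PRIMARY KEY'
    WHERE t.table_schema = 'your_db'
      AND t.table_type = 'BASE TABLE'
      AND c.constraint_name IS NULL;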
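
For the storage lifecycle, if you prefer to adjust it after the table already exists, MaxCompute also supports changing it with SQL; the table name below is illustrative:

    -- Keep daily snapshot partitions for 365 days instead of the default 30.
    ALTER TABLE base_orders SET LIFECYCLE 365;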
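
For binlog retention, on a self-managed MySQL 8.0 source you can check and extend the retention period as follows (on ApsaraDB RDS for MySQL, retention is managed through the console, and MySQL 5.7 uses the expire_logs_days variable instead):

    -- Check current binlog retention (MySQL 8.0).
    SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';
    -- Extend retention to 7 days.
    SET GLOBAL binlog_expire_logs_seconds = 604800;
    -- View the current binlog file and offset.
    SHOW MASTER STATUS;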

Billing

  • The full and incremental task consists of a batch synchronization task for the full phase, a real-time synchronization task for the incremental phase, and a periodic task for the periodic merge phase. Billing applies separately to each of these three phases. All three phases consume CUs of the resource group (see Billing of serverless resource groups). The periodic task also incurs task scheduling fees (see Scheduling instance fees).

  • Additionally, the full and incremental synchronization link to MaxCompute consumes MaxCompute computing resources for periodic merging of full and incremental data. These fees are charged directly by MaxCompute based on the size of synchronized full data and the merge cycle. For specific fees, see Billable items and billing methods.

Procedure

Step 1: Select synchronization type

  1. Go to the Data Integration page.

    Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Integration > Data Integration. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Integration.

  2. In the left navigation pane, click Synchronization Task. Then, click Create Synchronization Task at the top of the page. In the Create Synchronization Task dialog box, configure the following key information:

    • Source Type: MySQL.

    • Destination Type: MaxCompute.

    • Name: Enter a custom synchronization task name.

    • Task Type: Full increment of the whole warehouse (whole-database full and incremental synchronization).

    • Sync Procedure: Structural migration, Incremental synchronization, Full initialization, and Cycle Merge.

Step 2: Configure network connectivity

  1. In the Network and Resource Configuration section, select the Resource Group used for the synchronization task. You can allocate CUs separately for full and incremental synchronization in the Task Resource Usage section to optimize resource utilization.

  2. Select the added MySQL data source for Source and the added MaxCompute data source for Destination, and then click Test Connectivity.

  3. Ensure connectivity is successful for both source and target data sources, then click Next.

Step 3: Select tables

In the Source Table area, select the tables to sync from the source data source, and click the move icon to add them to the Selected Tables list.


If you have a large number of tables, use Database Filtering or Search For Tables to select tables with regular expressions. For example, the pattern order_\d+ matches sharded tables such as order_0 and order_1.

Step 4: Configure task settings

  • Log table time range: This parameter defines the time query range when merging data from the Log table to the target partition.

    Extend this range appropriately to ensure all data belonging to the partition is merged correctly and to prevent cross-day partition errors caused by data latency.

  • Merge task scheduling: Set the scheduling time for the daily Merge task. For details on setting scheduling times, see Schedule time.

  • Periodic scheduling parameters: Set scheduling parameters. These parameters can later be referenced in the partition settings to assign partition values, meeting the requirement of writing to partitions by date.

  • Table partition settings: Set partitions for the target table, including partition column names and assignment methods. The assignment column can reference scheduling parameters to generate partitions by date automatically. For example, you can define a parameter bizdate=$bizdate and assign the partition column as ds=${bizdate} so that each day's snapshot is written to its own date partition.

Step 5: Map target tables

In this step, you define mapping rules between source and target tables and specify rules such as primary keys, dynamic partitions, and DDL/DML configurations to determine how data is written.

The following configuration actions are available in this step:

Refresh Mapping Results

The system lists the selected source tables. Target table attributes take effect only after you refresh and confirm.

  • Select the tables to synchronize and click Batch Refresh Mapping.

  • Target Table Name: The target table name is generated automatically according to the Customize Mapping Rules for Destination Table Names setting. The default format is ${Source_Database_Name}_${Table_Name}. If the target table does not exist, the system creates it automatically.

Customize Mapping Rules for Destination Table Names (Optional)

The system uses a default table name generation rule: ${Source_Database_Name}_${Table_Name}. You can also click Edit in the Customize Mapping Rules for Destination Table Names column to add custom target table name rules.

  • Rule Name: Enter a name that clearly describes the business logic.

  • Target Table Name: Build the target table name by concatenating segments, entering each manually (Manual Input) or selecting a Built-in Variable. Supported variables include Source Data Source Name, Source Database Name, and Source Table Name.

  • Edit Built-in Variable: Built-in variables support string transformation based on the original value.

You can:

  1. Add prefix/suffix to name: Add a prefix or suffix to the source table name by defining a constant.


  2. Unified string replacement: Replace the string dev_ in source table names with prd_.


  3. Merge multiple tables into one table.


Edit Mapping of Field Data Types (Optional)

The system provides default mapping between source and target field types. You can click Edit Mapping of Field Data Types in the upper-right corner of the table to customize the field type mapping relationship between source and target tables, then click Apply and Refresh Mapping.

When editing field type mapping, ensure the field type conversion rules are correct; otherwise, type conversion failures may occur, leading to dirty data and affecting task execution.

Edit Destination Table Structure (Optional)

The system automatically creates non-existent target tables or reuses existing tables with the same name based on custom table name mapping rules.

DataWorks automatically generates the target table schema based on the source table schema. Manual intervention is not required for standard scenarios. You can also modify the table schema using the following methods:

  • Add field to single table: Click the add-field button in the Destination Table Name column to add a field.

  • Batch add fields: Select all tables to be synchronized, and choose Batch Modify > Destination Table Structure - Batch Modify and Add Fields at the bottom of the table.

  • Renaming columns is not supported.

For existing tables, you can only add fields. For new tables, you can add fields, partition fields, and set table types or properties. See the editing area in the interface for details.

Destination Table Column Assignment

Native fields are automatically mapped based on matching field names in the source and target tables. The Added Fields from the previous step require manual assignment. The procedure is as follows:

  • Single table assignment: Click the Configure button in the Destination Table Column Assignment column to assign values to target table fields.

  • Batch assignment: Select Batch Modify > Destination Table Column Assignment at the bottom of the list to batch assign values to identical fields in target tables.

You can assign constants or variables. Switch the type in Assignment Method. Supported methods include:

Table Field

  • Manual assignment: Directly enter a constant value, such as abc.

  • Select variable: Select a supported system variable from the drop-down list. View the meaning of each variable in its tooltip.

Source Split PK

You can select a field from the source table in the source split key drop-down list or select No Split. During synchronization, the task is split into multiple subtasks based on this field to enable concurrent and batch data reading.

We recommend using the primary key or a field with even data distribution as the source split key. String, float, and date types are not supported.

Source split keys are currently supported only for MySQL sources.
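
To gauge whether a candidate split key is evenly distributed, you can inspect its range and cardinality at the source before configuration (table and column names are illustrative):

    -- A key that is close to unique and evenly spread splits well.
    SELECT MIN(id) AS min_id,
           MAX(id) AS max_id,
           COUNT(*) AS total_rows,
           COUNT(DISTINCT id) AS distinct_ids
    FROM orders;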

Execute Full Synchronization

If full initialization is included in the sync procedure, you can individually deselect full data synchronization for specific tables. This applies to scenarios where full data has already been synchronized to the target through other means.

Full Synchronization Condition

Filters source data during the full synchronization phase. Enter the WHERE clause excluding the WHERE keyword.
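
For example, to limit the full phase to rows created since 2024, you might enter the following condition (gmt_create is an illustrative column name):

    gmt_create >= '2024-01-01'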

DML Rule

DML message processing is used to perform granular filtering and control on change data (Insert, Update, Delete) captured from the source before writing it to the target. This rule only takes effect during the incremental phase.

Full Data Merge Cycle

Only daily merging is currently supported.

Merge Primary Key

You can define the primary key by selecting one or more columns from the table.

  • Single column primary key: Select a single column with unique values (such as id) as the primary key.

  • Composite primary key: If there is no unique single column, you can select a combination of multiple columns. The combined values of these columns must uniquely identify each row of data to ensure uniqueness when writing data and prevent overwriting.
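
Before settling on a composite primary key, you can verify at the source that the column combination is actually unique; any row returned by a check like the following means the candidate key would cause rows to overwrite each other (names are illustrative):

    -- Candidate composite key (order_id, line_no): check for duplicates.
    SELECT order_id, line_no, COUNT(*) AS cnt
    FROM orders_detail
    GROUP BY order_id, line_no
    HAVING COUNT(*) > 1;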

Step 6: Configure DDL capabilities

Real-time synchronization tasks can detect metadata changes (DDL) in the source table structure and either synchronize the update to the destination or take another action, such as alerting, ignoring the change, or terminating execution.

Click Configure DDL Capability in the upper-right corner of the interface to set processing policies for each change type. Supported policies vary by channel.

  • Normal Processing: The destination processes the DDL change information from the source.

  • Ignore: The change message is ignored, and no modification is made at the destination.

  • Error: The whole database real-time synchronization task is terminated, and the status is set to Error.

  • Alert: An alert is sent to the user when such a change occurs at the source. You must configure DDL notification rules in Configure Alert Rule.

Note

When DDL synchronization adds a source column to the destination, existing records are not backfilled with data for the new column.
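
If historical rows do need values for such a new column, you can backfill them yourself at the destination, for example by rewriting the affected partition (a sketch with illustrative names; new_col stands for the column added by DDL synchronization):

    -- Rewrite one partition, filling the new column with a default value.
    INSERT OVERWRITE TABLE base_orders PARTITION (ds = '20250101')
    SELECT id, col_a, col_b, 'default_value' AS new_col
    FROM base_orders
    WHERE ds = '20250101';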

Step 7: Other configurations

Alarm configuration

1. Add Alarm


(1) Click Create Rule to configure alarm rules.

Set Alert Reason to monitor metrics like Business delay, Failover, Task status, DDL Notification, and Task Resource Utilization for the task. You can set CRITICAL or WARNING alarm levels based on specified thresholds.

  • By setting Configure Advanced Parameters, you can control the time interval for sending alarm messages to prevent alert fatigue and message backlogs.

  • If you select Business delay, Task status, or Task Resource Utilization as the alarm reason, you can also enable recovery notifications to notify recipients when the task returns to normal.

(2) Manage alarm rules.

For created alarm rules, use the alarm switch to enable or disable each rule. Alarms are sent to the specified recipients based on the alarm level.

2. View Alarm

In the task list, choose More > Configure Alert Rule to open the alarm event page and view the alarms that have occurred.

Resource group configuration

You can manage the resource group used by the task and its configuration in the Configure Resource Group panel in the upper-right corner of the interface.

1. View and switch resource groups

  • Click Configure Resource Group to view the resource group currently bound to the task.

  • To change the resource group, switch to another available resource group here.

2. Adjust resources and troubleshoot "insufficient resources" errors

  • If the task log displays a message such as Please confirm whether there are enough resources..., the available computing units (CUs) of the current resource group are insufficient to start or run the task. You can increase the number of CUs occupied by the task in the Configure Resource Group panel to allocate more computing resources.

For recommended resource settings, see Data Integration Recommended CUs. Adjust the settings based on actual conditions.

Note

The batch synchronization task in DataWorks is dispatched by the scheduling resource group to the Data Integration task execution resource group for execution. Therefore, in addition to using the Data Integration task execution resource group, the batch synchronization task also consumes resources from the scheduling resource group and incurs scheduling instance fees.

Advanced parameter configuration

For custom synchronization requirements, click Configure in the Advanced Settings column to modify advanced parameters.

  1. Click Advanced Settings in the upper-right corner of the interface to enter the advanced parameter configuration page.

  2. Modify the parameter values according to the prompts. The meaning of each parameter is explained after the parameter name.

Important

Understand parameters fully before modification to prevent issues like task delays, excessive resource consumption blocking other tasks, or data loss.

Step 8: Execute synchronization task

  1. After you complete the configuration, click Save or Complete to save the task.

  2. On the Data Integration > Synchronization Task page, find the created synchronization task and click Deploy in the Operation column. If you select Start immediately after deployment in the dialog box that appears and click Confirm, the task runs immediately. Otherwise, you must start the task manually.

    Note

    Data Integration tasks must be deployed to the production environment before they can be run. Therefore, you must deploy a new or modified task for the changes to take effect.

  3. Click the Name/ID of the task in the task list to view execution details.

Next steps

After configuring the task, you can manage it, add or remove tables, configure monitoring and alerting, and view key task running metrics. For details, see Perform O&M on a full and incremental sync task.

FAQ

Q: Why is the Base table data not updating as expected?

A: See the following causes and solutions:

Symptom 1: Validation failed for T-1 partition data in the incremental Log table.

  • Cause: The real-time synchronization task encountered an exception, preventing the T-1 partition data in the incremental Log table from being generated correctly.

  • Solution:

    • Check the failure cause of the real-time synchronization task and restart it from the failure offset.

      Note

      RDS binlog retention is limited. If the binlog for the required time point has been purged, real-time synchronization fails with an "offset not found" error.

    • Set up monitoring and alerting for the real-time synchronization subtask.

Symptom 2: Validation failed for T-2 partition data in the target Base table.

  • Cause: Either the Merge task failed the previous day, or the Base table partition for the previous day was deleted.

  • Solution:

    • Configure run status monitoring for the Merge node. For details, see Create a custom alert rule.

    • If the Base table T-2 partition was accidentally deleted, use the data backfill feature to restore it, and then rerun the Merge node. For details, see Data backfill instance O&M.
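
To quickly confirm whether the real-time subtask produced yesterday's Log partition before digging into task logs, you can query it directly (table name and date are illustrative):

    -- A zero count on a day with known source changes indicates a problem.
    SELECT COUNT(*) FROM log_orders WHERE ds = '20250101';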