This topic describes how to use Data Integration to synchronize full data and incremental data from a MySQL data source to a MaxCompute data source.
Prerequisites
A MaxCompute data source and a MySQL data source are added to DataWorks. You must add both data sources to DataWorks so that you can select them when you configure the data synchronization task. In this example, the MySQL data source doc_mysql1 is used. For more information about how to add the data sources, see Add a MySQL data source and Add a MaxCompute data source.
Note: For information about the items that you need to understand before you add a data source, see Overview.
An exclusive resource group for Data Integration that meets your business requirements is purchased. For more information, see Create and use an exclusive resource group for Data Integration.
Network connections between the exclusive resource group for Data Integration and the data sources are established. For more information, see Network connectivity solutions.
The data source environments are prepared.
MaxCompute: The MaxCompute V2.0 data type edition supports the DECIMAL data type. Before you run a synchronization task that uses the MaxCompute data source, you must enable the MaxCompute V2.0 data type edition for the data source. For more information, see MaxCompute V2.0 data type edition.
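For example, the project owner can enable the MaxCompute V2.0 data type edition at the project level with the following command. This is a minimal sketch; confirm the flag against the MaxCompute documentation for your project.

```sql
-- Enable the MaxCompute V2.0 data type edition (which includes DECIMAL)
-- at the project level. Run as the project owner.
setproject odps.sql.type.system.odps2=true;
```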
MySQL: You must enable the binary logging feature for the MySQL data source. For more information, see Prepare a MySQL environment.
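As a quick check, you can verify the binary logging configuration on the MySQL side with the following statements. Real-time incremental synchronization generally requires row-based binary logging; confirm the exact required settings in Prepare a MySQL environment.

```sql
-- Verify that binary logging is enabled; the Value column should show ON.
SHOW VARIABLES LIKE 'log_bin';
-- Real-time incremental synchronization generally requires ROW format.
SHOW VARIABLES LIKE 'binlog_format';
```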
Sample scenario
Data: Synchronize full data from the MySQL data source to MaxCompute base tables in a one-time batch, synchronize incremental data from the MySQL data source to a MaxCompute log table in real time, and periodically write the incremental data from the log table to the base tables to merge the full data and incremental data.
Table: Synchronize data from the tables whose names are prefixed with doc_ in the MySQL data source to the tables whose names are prefixed with ods_ in the MaxCompute data source.
Field: Add the additional field execute_time to the destination tables to record the time when a data record in a source table is changed.
The following table describes the mappings between the source tables and destination tables.
Source database | Source table | Destination table | Additional field
doc_demo1 | doc_tb1 | ods_tb1 | execute_time
doc_demo1 | doc_tb2_nopk | ods_tb2_nopk | execute_time
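For reference, the sample source tables might look like the following. These schemas are hypothetical: only the table names and the absence of a primary key on doc_tb2_nopk come from this example, and the columns are assumptions to be adjusted to your actual tables.

```sql
-- Hypothetical schemas for the sample source tables in database doc_demo1.
CREATE TABLE doc_tb1 (
  id  BIGINT PRIMARY KEY,   -- primary key used for deduplication
  val VARCHAR(255)          -- assumed business column
);

CREATE TABLE doc_tb2_nopk (
  id  BIGINT,               -- no PRIMARY KEY constraint; id is later
  val VARCHAR(255)          -- designated as the logical primary key
);
```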
Procedure
Create and configure a data synchronization task
Run the data synchronization task
Perform O&M operations on the synchronization task
Create and configure a data synchronization task
Select a synchronization task type.
Log on to the DataWorks console and go to the Data Integration homepage or the Tasks page in Data Integration to create a data synchronization task. Set the source type to MySQL and the destination type to MaxCompute, and select One-click Real-time Synchronization to MaxCompute.
Establish network connections.
Select doc_mysql1 as the source and the MaxCompute data source as the destination, and establish network connections between the exclusive resource group for Data Integration and the source and destination. The MaxCompute data source is generated after a MaxCompute project is associated with a DataWorks workspace.
Select source tables and configure a mapping rule.
In the Basic Configuration section, configure the parameters such as Solution Name and Location.
In the Data Source section, view information about the source.
In the Source Table section, select the tables from which you want to read data from the Source Table list. Then, click the icon to move the tables to the Selected Tables list.
In this example, the tables doc_tb1 and doc_tb2_nopk in the database doc_demo1 are selected.
In the Set Mapping Rules for Table/Database Names section, click Add Rule, select a rule type, and then configure a mapping rule of the selected type.
By default, data in a source table is written to the destination table that has the same name as the source table. You can configure a mapping rule to specify the name of a destination table. In this example, a mapping rule of the Conversion Rule for Table Name type is configured. The mapping rule defines that data in the source tables whose names are prefixed with doc_ is written to the destination tables whose names are prefixed with ods_.
Configure the destination tables.
Configure the Write Mode parameter.
You can set this parameter only to Write log table directly in real time. This way, incremental data in the source tables is written to a MaxCompute log table. Then, the system writes the incremental data in the MaxCompute log table to the MaxCompute base tables on a regular basis to merge the full data and incremental data.
Configure the Automatic Partitioning by Time parameter.
You can configure the Automatic Partitioning by Time parameter to write data to partitioned tables or non-partitioned tables. If you set this parameter to Partition Table, you can specify the partition key column. In this example, this parameter is set to Partition Table, and the partition key column is ds.
Note: If you set the Automatic Partitioning by Time parameter to Partition Table, you can click the icon to specify the partition key column for the destination tables.
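With these settings, each automatically created base table is, in outline, a MaxCompute table partitioned by ds. The following DDL is a sketch only: the column names and types are assumptions based on this example, and the actual statement is generated by the system (you can view it in a later step).

```sql
-- Sketch of an auto-generated MaxCompute base table partitioned by ds.
-- Column names and types are assumptions based on this example.
CREATE TABLE IF NOT EXISTS ods_tb1 (
  id           BIGINT,
  val          STRING,
  execute_time STRING   -- additional field added later in this example
)
PARTITIONED BY (ds STRING);
```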
Map source tables to destination tables.
Click Refresh source table and MaxCompute Table mapping. The system maps the source tables to destination tables based on the mapping rule that you configured in the Set Mapping Rules for Table/Database Names section. By default, if no mapping rule is configured, data in the source tables is synchronized to the destination tables that are named the same as the source tables. If no such destination tables exist, the system automatically creates the tables in the destination. You can also modify the table generation method of a destination MaxCompute table and add additional fields to a destination MaxCompute table.
You can perform the following operations in this step:
Synchronize data from a source table that does not have a primary key
The synchronization task cannot directly synchronize data from a source table that does not have a primary key. To synchronize such a table, click the icon in the Synchronized Primary Key column of the source table and specify a logical primary key. You can use a single field or a combination of fields in the source table as the primary key. The system removes duplicate data based on this primary key during data synchronization.
In this example, the source table doc_tb2_nopk does not have a primary key, and the field id is used as the primary key for the source table.
Select a table generation method
You can select Create Table or Use Existing Table from the drop-down list in the Table creation method column.
If you select Use Existing Table, all existing MaxCompute tables are displayed in the drop-down list in the MaxCompute Base Table name column. Select the table that you want to use from the drop-down list.
If you select Create Table, the name of the destination table that will be automatically created appears in the MaxCompute Base Table name column. You can click the table name to view and modify the table creation statement.
In this example, Create Table is selected.
Specify whether to synchronize full data
You can determine whether to turn on the switch in the Full Synchronization column to synchronize full data of the source tables to the destination.
If you turn off the switch in the Full Synchronization column of a source table, the data synchronization solution does not synchronize full data from that table before real-time incremental synchronization starts. You can turn off the switch for a source table whose full data has already been synchronized to the destination table.
In this example, full synchronization is performed for all source tables before real-time incremental synchronization is performed.
Add additional fields to a destination table and assign values to the fields
You can click Edit additional fields in the Actions column of a destination table to add additional fields to the table and assign values to the fields. You can assign constants and variables to additional fields as values.
Note: You can add additional fields to a destination table only if you select Create Table from the drop-down list in the Table creation method column of the table.
In this example, the additional field execute_time is added to the destination tables, and the variable _execute_time_ is assigned to the field as its value. The execute_time field records the time when a data record in a source table is changed.
Check the information about the destination tables, the fields in the destination tables, and the mappings between the source tables and destination tables. If the configurations are correct, click Next.
Configure rules for processing DML messages.
You can configure a processing rule for a single destination table or for multiple destination tables at a time. Processing rules can be configured for the following types of DML operations: INSERT, UPDATE, and DELETE. The following types of processing rules are supported:
Normal: If you configure this type of processing rule, a DML message from the source is sent to the destination, and the system performs the same operation on the destination as the operation that is performed on the source.
Ignore: If you configure this type of processing rule, a DML message from the source is discarded and not sent to the destination. The system does not perform any operation on the destination.
Conditional Normal Processing: If you configure this type of processing rule, you can specify a filter condition, such as id > 1000. The system filters data based on the filter expression: data that meets the expression is processed normally, and data that does not meet the expression is ignored.
If no processing rule is configured in this step, the Normal processing rule is used by default.
Note: If you change the configured processing rule from Normal to Ignore or Conditional Normal Processing, the data that is synchronized to the destination may become inconsistent with the data in the source.
Configure rules for processing DDL messages.
DDL operations may be performed on the source while the synchronization task is running. Data Integration provides default rules to process DDL messages. You can also configure processing rules for different DDL messages based on your business requirements. For more information, see Rules for processing DDL messages.
In this example, the default processing rules are used.
Configure resources required to run the data synchronization solution.
After you create a synchronization solution, the synchronization solution generates batch synchronization nodes for full data synchronization and real-time synchronization nodes for incremental data synchronization. You must configure the parameters in the Configure Resources step.
You can configure the exclusive resource groups for Data Integration that you want to use to run real-time synchronization nodes and batch synchronization nodes, and the resource groups for scheduling that you want to use to run batch synchronization nodes. You can also click Advanced Configuration to configure the Number of concurrent writes on the target side and Allow Dirty Data Records parameters.
Note: DataWorks uses resource groups for scheduling to issue the generated batch synchronization subtasks to resource groups for Data Integration and run the subtasks on the resource groups for Data Integration. Therefore, a batch synchronization subtask also occupies the resources of a resource group for scheduling, and you are charged for using the exclusive resource group for scheduling to schedule the batch synchronization subtasks. For information about the task issuing mechanism, see Mechanism for issuing nodes.
We recommend that you use different resource groups to run the generated batch and real-time synchronization subtasks. If you use the same resource group to run the subtasks, the subtasks compete for resources and affect each other. For example, CPU resources, memory resources, and networks used by the two types of subtasks may affect each other. In this case, the batch synchronization subtasks may slow down, or the real-time synchronization subtasks may be delayed. The batch or real-time synchronization subtasks may even be terminated by the out of memory (OOM) killer due to insufficient resources.
Run the data synchronization task
Commit and deploy the data synchronization task.
On the DataStudio page, click the icon in the upper-left corner and go to the Operation Center page.
In the left-side navigation pane of the Operation Center page, go to the Real-time Synchronization Tasks page. Find the deployed real-time synchronization task and click Submit in the Actions column.
Take note of the following names to facilitate subsequent O&M operations: the name of the MaxCompute log table, the names of the MaxCompute base tables, the name of the real-time synchronization node that is generated by the data synchronization solution, and the name of the merge node that merges the full data and incremental data.
Perform O&M operations on the synchronization task
The merge of full data and incremental data that are synchronized to MaxCompute consists of the following stages:
On the day on which the data synchronization solution is configured, the system runs the batch synchronization nodes that are generated by the solution to synchronize full data from the source tables to the MaxCompute base tables.
On the day on which the data synchronization solution is configured, the system starts to run the real-time synchronization node that is generated by the solution to synchronize incremental data from the source tables to the MaxCompute log table in real time after the full synchronization is complete.
On the next day, the system runs the merge node to write the incremental data to the MaxCompute base tables to merge the full data and incremental data. One day is required to merge the full data and incremental data.
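Conceptually, the merge combines the previous full partition of a base table with the day's incremental records in the log table and keeps the latest version of each primary key. The following MaxCompute SQL is an illustrative sketch only, not the statement that the generated merge node actually runs: the table and column names are the hypothetical ones from this example, the partition values are placeholders, and handling of DELETE operations is omitted.

```sql
-- Illustrative merge sketch: keep the latest version of each primary key.
-- ods_tb1 is the base table; ods_tb1_log is a hypothetical log table name;
-- '20240101' (previous full partition) and '20240102' (merged partition)
-- are placeholder partition values.
INSERT OVERWRITE TABLE ods_tb1 PARTITION (ds = '20240102')
SELECT id, val, execute_time
FROM (
  SELECT id, val, execute_time,
         ROW_NUMBER() OVER (
           PARTITION BY id             -- the synchronized primary key
           ORDER BY execute_time DESC  -- newest change wins
         ) AS rn
  FROM (
    SELECT id, val, execute_time
    FROM ods_tb1
    WHERE ds = '20240101'              -- yesterday's full data
    UNION ALL
    SELECT id, val, execute_time
    FROM ods_tb1_log
    WHERE ds = '20240102'              -- today's incremental changes
  ) t
) m
WHERE rn = 1;
```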
To prevent errors from occurring during the stages and ensure that data can be normally generated, you must configure monitoring and alerting settings for the real-time synchronization node and the merge node.
Configure monitoring and alerting settings for the real-time synchronization node generated by the data synchronization task.
Go to the Real Time DI page in Operation Center. For more information, see Go to the Real Time DI page.
Find the real-time synchronization node that is generated by the data synchronization solution, and click Alarm settings in the Operation column.
On the page that appears, click the Alarm Rules tab. On the Alarm Rules tab, click New rule.
Set the Indicators parameter to Status and the Alarm interval parameter to 5. This way, the system sends an alert notification for the node if the heartbeat of the node is lost for 5 minutes. Configure the other parameters based on your business requirements and click Confirm. For more information, see Perform operations on a real-time synchronization node.
Configure monitoring and alerting settings for the merge node.
To prevent errors from occurring during the merge of the full data and incremental data and ensure that data in the MaxCompute base tables can be normally generated, we recommend that you configure a custom alert rule for the merge node. For more information about how to configure a custom alert rule, see Create a custom alert rule.
Obtain the name of the merge node on the execution details page of the data synchronization solution. For more information, see View the running details of a synchronization task.
Go to the Rule Management page in Operation Center. On the Rule Management page, click Create Custom Rule to create a custom alert rule. For more information, see Create a custom alert rule.
In the Create Custom Rule dialog box, set the Object Type parameter to Node and the Object parameter to the name of the merge node, and set the Trigger Condition parameter to Error. Configure the other parameters based on your business requirements and click OK.
Appendix: Query the data that is synchronized to MaxCompute
One day is required to merge the full data and incremental data that are synchronized to MaxCompute. On the day on which the data synchronization solution is configured, you can query only the full data in the MaxCompute base tables. After the full data and incremental data are merged on the next day, you can query the merged data in the MaxCompute base tables.
Query the full data that is synchronized to MaxCompute
Create an ODPS SQL node on the DataStudio page and run a command to query the full data that is synchronized to MaxCompute. Before you run the command, check the status of the stage named "Start batch synchronization nodes to synchronize full data" on the execution details page of the data synchronization solution in Data Integration and make sure that the stage is complete. For more information about how to create an ODPS SQL node, see Create an ad hoc query.
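For example, a query like the following returns the full data in a base table. MAX_PT resolves to the latest partition that contains data; replace your_project and the table name with your own values.

```sql
-- Query the latest full-data partition of a base table.
SELECT *
FROM ods_tb1
WHERE ds = MAX_PT('your_project.ods_tb1');
```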
Query the incremental data that is synchronized to the MaxCompute log table in real time
View the name of the MaxCompute log table that is created in the stage named "Create a MaxCompute log table" on the execution details page of the data synchronization solution in Data Integration.
Find the created ODPS SQL node on the DataStudio page and run a command to query the incremental data that is synchronized to MaxCompute. For more information about how to create an ODPS SQL node, see Create an ad hoc query.
Note: For information about the format of data in a MaxCompute log table that is used to store incremental data, see Fields used for real-time synchronization.
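As a sketch, the following query reads one day's incremental records from the log table. The table name ods_tb1_log and the partition layout are hypothetical; use the actual log table name shown on the execution details page, and see Fields used for real-time synchronization for the real column layout.

```sql
-- Query incremental records in a hypothetical log table partition.
SELECT *
FROM ods_tb1_log
WHERE ds = '20240102';  -- placeholder partition value
```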