Real-time synchronization of all data in a MySQL database to StarRocks - DataWorks

DataWorks Data Integration allows you to run a real-time synchronization task to synchronize all data in a database to a destination. During synchronization, the full data in the database is synchronized in a single pass, and incremental data in the database is synchronized to the destination in real time. This topic describes how to create a real-time synchronization task to synchronize all data in a MySQL database to StarRocks.

Prerequisites

The required data sources are configured. Before you configure a synchronization task, you must configure the data sources from which you want to read data and to which you want to write data. This way, you can select the data sources when you configure a synchronization task. For information about the data source types that support real-time synchronization and the configuration of a data source, see Data source types that support real-time synchronization.
Note
For information about the items that you need to understand before you configure a data source, see Overview.
A new-version resource group is purchased. For more information, see Use serverless resource groups.
Network connections are established between the resource group and the data sources. For more information, see Establish a network connection between a resource group and a data source.

Limits

Real-time synchronization tasks support only new-version resource groups.

Precautions

Real-time synchronization of all data in a MySQL database to StarRocks requires that your destination StarRocks tables use the primary key model.
Of the DDL operations performed on the source, only TRUNCATE operations are synchronized when you run a real-time synchronization task to synchronize all data in a MySQL database to StarRocks. For other types of DDL operations, you can select Ignore or Critical as the processing rule for the messages that the operations generate when you configure the synchronization task.
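
The rule above can be pictured as a small dispatch function. The following is only a sketch and not DataWorks code: the names `DdlRule`, `DdlCriticalError`, and `handle_ddl` are hypothetical, and the behaviour attached to Ignore (drop the message) and Critical (raise an error) is an assumption, because this topic does not define them.

```python
from enum import Enum


class DdlRule(Enum):
    """Processing rules named in the text; the behaviour of each is assumed."""
    IGNORE = "Ignore"
    CRITICAL = "Critical"


class DdlCriticalError(RuntimeError):
    """Hypothetical error raised when a DDL message hits the Critical rule."""


def handle_ddl(statement: str, rule_for_other_ddl: DdlRule) -> str:
    """Return the action taken for one DDL statement from the source."""
    words = statement.strip().split(None, 1)
    keyword = words[0].upper() if words else ""
    if keyword == "TRUNCATE":
        # TRUNCATE is the only DDL operation whose effect is synchronized.
        return "apply"
    if rule_for_other_ddl is DdlRule.IGNORE:
        # Assumed meaning of Ignore: drop the message and keep running.
        return "skip"
    # Assumed meaning of Critical: surface an error instead of continuing.
    raise DdlCriticalError(f"unsupported DDL: {statement!r}")
```

For example, `handle_ddl("TRUNCATE TABLE orders", DdlRule.CRITICAL)` returns `"apply"`, whereas an `ALTER TABLE` statement is skipped or raises an error depending on the rule that you pass in.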

Procedure

Step 1: Select a synchronization type

Go to the Data Integration page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and Governance > Data Integration. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Integration.
In the left-side navigation pane of the Data Integration page, click Synchronization Task. In the upper part of the page that appears, select MySQL from the Source drop-down list and StarRocks from the Destination drop-down list, click Create, and then configure the following parameters:
- New Node Name: Specify a name for the synchronization task based on your business requirements.
- Synchronization Method: Select StarRocks Migration from the drop-down list.
- Synchronization Mode: Select Full initialization, Incremental synchronization, or both, based on your business requirements.

Step 2: Establish network connections

Select the prepared MySQL data source as the source, the prepared StarRocks data source as the destination, and the purchased resource group. Then, test network connectivity between the resource group and data sources. For more information, see Configure a synchronization task in Data Integration.

Step 3: Select the tables from which you want to synchronize data

In this step, select the tables from which you want to synchronize data in the Source Table list and click the icon to move them to the Selected Tables list. You can also filter source databases and source tables by using a regular expression.
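
To preview which tables a regular expression would pick before you enter it in the console, you can test the pattern locally. The following is a minimal sketch: the table names are made up, and it assumes that a pattern must match the whole table name, which this topic does not confirm.

```python
import re


def filter_tables(tables: list[str], pattern: str) -> list[str]:
    """Keep names that match the whole pattern (full-match semantics are an assumption)."""
    compiled = re.compile(pattern)
    return [name for name in tables if compiled.fullmatch(name)]


# Hypothetical source tables, including sharded order tables.
source_tables = ["order_00", "order_01", "order_archive", "customer"]
selected = filter_tables(source_tables, r"order_\d+")
print(selected)  # ['order_00', 'order_01']
```

If the console matches substrings instead, a pattern such as `order_` would also select `order_archive`, so anchoring the pattern explicitly is the safer choice.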

Step 4: Configure settings related to destination tables

After you select the tables from which you want to synchronize data, the selected tables are automatically displayed in the Mapping Rules for Destination Tables section, with the properties of the destination tables still to be mapped. You must define mappings between the source tables and destination tables to determine where data is read from and written to, and then click Refresh in the Actions column. You can refresh the mappings directly, or refresh them after you configure settings related to destination tables.

Note

You can select the rows in which source tables are displayed and click Batch Refresh Mapping Results. If the Customize Mapping Rules for Destination Table Names parameter is not configured, data in the source tables is automatically written to StarRocks tables that have the same names as the source tables. If no such tables exist in the destination, the system automatically creates them.
If you want to perform operations such as customizing destination table names but you cannot find the related columns, you can click the icon to specify the columns that you want the system to display.

Customize mapping rules for destination table names

In the Customize Mapping Rules for Destination Table Names column, click Edit. You can concatenate a built-in variable and a specified string into a destination table name. You can edit built-in variables. For example, you can replace built-in variables with strings.
Modify mapping rules for destination table names.
- Modify the mapping rule for a single destination table: Find the destination table for which you want to modify the mapping rule, and perform a modification for the table in the Customize Mapping Rules for Destination Table Names column.
- Modify mapping rules for multiple destination tables at a time: Select the destination tables for which you want to modify mapping rules, click Batch Modify in the lower part of the page, and then select Customize Mapping Rules for Destination Table Names. In the Configure Mapping Rules for Table Names dialog box, select a rule and click Determine.
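
The idea of concatenating a built-in variable with a fixed string can be illustrated with simple template expansion. This is only a sketch: the variable names `db_name` and `table_name` are hypothetical placeholders and are not the built-in variables that the console provides.

```python
from string import Template


def destination_table_name(rule: str, source_db: str, source_table: str) -> str:
    """Expand a naming rule; the variable names db_name and table_name are placeholders."""
    return Template(rule).substitute(db_name=source_db, table_name=source_table)


# A rule that concatenates variables with fixed strings.
print(destination_table_name("ods_${db_name}_${table_name}", "shop", "orders"))
# The same rule with one variable replaced by a literal string.
print(destination_table_name("ods_shop_${table_name}", "shop", "orders"))
```

Both calls produce `ods_shop_orders`. The second form pins the database part of the name regardless of the source database.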

Modify data type mappings

Default mappings exist between data types of source fields and data types of destination fields. You can click Edit Mapping of Field Data Types in the upper-right corner of the Mapping Rules for Destination Tables section to configure data type mappings between source fields and destination fields based on your business requirements. After the configuration is complete, click Apply and Refresh Mapping.

Add fields to destination tables, assign values to the new fields, and configure partition settings for destination tables

You can add fields to destination tables, assign values to the new fields, and configure partition settings for destination tables.

Add fields to destination tables.
- Add fields to a single destination table: Find the destination table to which you want to add fields and click the icon in the Destination Table Name column. In the dialog box that appears, add fields.
- Add fields to multiple destination tables at a time: Select the destination tables to which you want to add fields, click Batch Modify in the lower part of the page, and then click Destination Table Schema - Batch Modify and Add Field to add fields to the selected destination tables.
Configure partition settings for destination tables. Range partitioning is supported.
- No Partitioning: If you select this option for the Partitioning Method parameter, the related destination tables are non-partitioned tables.
- Partitioning by Specifying Interval: If you select this option for the Partitioning Method parameter, you must specify partition names, upper bounds for partition field values, and lower bounds for partition field values for the related destination tables.
- Partitioning by Specifying End Partition Value: If you select this option for the Partitioning Method parameter, you must specify partition names and upper bounds for partition field values for the related destination tables.
- Partitioning by Specifying Interval and Step Size: If you select this option for the Partitioning Method parameter, you must specify partition fields, start values, end values, and the partitioning step size for the related destination tables.
Assign values to the new fields.
- Assign values to the new fields that are added to a single destination table: Find the destination table in which you want to assign values to the newly added fields and click Configure in the Value assignment column. In the Additional Field dialog box, assign values to the fields.
- Assign values to the new fields that are added to multiple destination tables at a time: Select the destination tables in which you want to assign values to the newly added fields, click Batch Modify in the lower part of the page, and then click Value assignment to assign values to the fields in the selected destination tables.
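
For the Partitioning by Specifying Interval and Step Size option described above, it can help to see how a start value, an end value, and a step size translate into individual ranges. The following sketch assumes a date-typed partition field, a step size measured in days, and left-closed, right-open ranges. The function name and the `p<YYYYMMDD>` naming scheme are hypothetical and may differ from what Data Integration generates.

```python
from datetime import date, timedelta


def build_partitions(start: date, end: date, step_days: int) -> list[tuple[str, date, date]]:
    """Split [start, end) into consecutive ranges of step_days days each.

    Returns (name, lower_bound, upper_bound) tuples; the last range is
    clipped to end. The p<YYYYMMDD> naming scheme is hypothetical.
    """
    if step_days <= 0:
        raise ValueError("step_days must be positive")
    partitions = []
    lower = start
    while lower < end:
        upper = min(lower + timedelta(days=step_days), end)
        partitions.append((f"p{lower:%Y%m%d}", lower, upper))
        lower = upper
    return partitions


for name, lower, upper in build_partitions(date(2024, 1, 1), date(2024, 1, 10), 3):
    print(name, lower, upper)
```

With a start value of 2024-01-01, an end value of 2024-01-10, and a step size of 3 days, the sketch yields three ranges whose lower bounds are January 1, January 4, and January 7.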

Configure DML processing rules

Data Integration provides default DML processing rules. You can also configure DML processing rules for destination tables based on your business requirements.

Configure DML processing rules for a single destination table: Find the destination table for which you want to configure DML processing rules and click Configure in the Configure DML Rule column to configure DML processing rules for the table.
Configure DML processing rules for multiple destination tables at a time: Select the destination tables for which you want to configure DML processing rules, click Batch Modify in the lower part of the page, and then click Configure DML Rule to configure DML processing rules for the selected destination tables at a time.

Step 5: Configure alert rules

To prevent a failure of the synchronization task from delaying business data synchronization, you can configure alert rules for the real-time synchronization subtask that the synchronization task generates.

In the upper-right corner of the page, click Configure Alert Rule to go to the Configure Alert Rule panel.
In the Configure Alert Rule panel, click Add Alert Rule. In the Add Alert Rule dialog box, configure the parameters of the alert rule.
Manage alert rules. You can enable or disable alert rules that are created. You can also specify different alert recipients based on the severity levels of alerts.

Step 6: Configure advanced parameters

Data Integration provides default values for parameters such as Maximum read connections and Task concurrency. If you want to make fine-grained configurations to meet your business requirements, you can change the values of the parameters. For example, you can specify an appropriate value for the Maximum read connections parameter so that the current synchronization task does not put excessive pressure on the database and affect data production.

Note

To prevent unexpected errors or data quality issues, we recommend that you understand the meanings of the parameters before you change the values of the parameters.

Tab	Parameter	Description
Reader Config	Maximum read connections	The maximum number of parallel threads that can be used by the current synchronization task to read data from the source. You can configure this parameter to control the number of source database connections that can be occupied by the current synchronization task.
Writer Config	Maximum number of write connections	The maximum number of parallel threads that can be used by the current synchronization task to write data to the destination.
Runtime Config	Task concurrency	The number of parallel threads that can be used for the current synchronization task.
	Stream load data format	The data format that is used when data is synchronized to StarRocks. Valid values: `json` and `csv`. Default value: `json`. In most cases, you do not need to change the value of this parameter.
	Stream load row delimiter	If you set the Stream load data format parameter to `csv`, you can configure this parameter. Default value: `\x02`. Make sure that the data you want to synchronize does not contain the delimiter you specify. Otherwise, an error occurs during data synchronization.
	Stream load column separator	If you set the Stream load data format parameter to `csv`, you can configure this parameter. Default value: `\x01`. Make sure that the data you want to synchronize does not contain the delimiter you specify. Otherwise, an error occurs during data synchronization.
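
Because a delimiter that appears inside the data causes an error during synchronization, you may want to check sample data before you switch the Stream load data format parameter to `csv`. The following sketch is not how Data Integration encodes data. The function names are hypothetical, and it only illustrates why the default delimiters `\x01` and `\x02` must not occur in any value.

```python
ROW_DELIMITER = "\x02"     # default Stream load row delimiter from the table above
COLUMN_SEPARATOR = "\x01"  # default Stream load column separator from the table above


def find_conflicts(rows: list[list[str]]) -> list[tuple[int, int]]:
    """Return (row_index, column_index) of every value containing a delimiter."""
    return [
        (r, c)
        for r, row in enumerate(rows)
        for c, value in enumerate(row)
        if ROW_DELIMITER in value or COLUMN_SEPARATOR in value
    ]


def encode_csv(rows: list[list[str]]) -> str:
    """Join values into a delimiter-separated payload, refusing conflicting data."""
    conflicts = find_conflicts(rows)
    if conflicts:
        raise ValueError(f"delimiter found in data at {conflicts}")
    return ROW_DELIMITER.join(COLUMN_SEPARATOR.join(row) for row in rows)
```

If `find_conflicts` reports any positions for your sample data, either keep the default `json` format or choose delimiters that do not occur in the data.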

Step 7: Configure DDL processing rules

DDL operations may be performed on the source. You can click Configure DDL Capability in the upper-right corner of the page to configure rules to process DDL messages from the source based on your business requirements.

Note

Of the DDL operations performed on the source, only TRUNCATE operations are synchronized when you run a real-time synchronization task to synchronize all data in a MySQL database to StarRocks.

Step 8: Configure a resource group

You can click Configure Resource Group in the upper-right corner of the page to view and change the resource group that is used to run the current synchronization task.

Step 9: Run the synchronization task

After the configuration of the synchronization task is complete, click Complete in the lower part of the page.
Note
In the Tips message, click Confirmation. Then, you are navigated to the Nodes section of the Data Integration page. You can find the synchronization task and click its name in the Name/ID column to view the details of the task.
In the Nodes section, find the synchronization task and click Start in the Actions column.
Click the name or ID of the synchronization task to view the detailed running process of the task.

Step 10: Perform O&M on the synchronization task

After the synchronization task is started, you can click the name of the task in the Nodes section to go to the O&M details page of the task. The O&M details page displays the overview and detailed information of the synchronization task by synchronization step.

What to do next

After the configuration of the synchronization task is complete, you can manage the synchronization task, add tables to or remove tables from the synchronization task, and configure monitoring and alert settings for the synchronization task.