Synchronize all data in a MySQL database to SelectDB in real time - DataWorks

You can use DataWorks Data Integration to synchronize all data in a MySQL database to SelectDB or Apache Doris in real time. SelectDB is a commercial real-time data warehouse system developed based on the Apache Doris kernel. A data synchronization task that is used to synchronize data to SelectDB can be configured in the same manner as a data synchronization task that is used to synchronize data to Apache Doris. This topic describes how to synchronize all data in a MySQL database to SelectDB in real time.

Prepare environments

A data synchronization task used to synchronize all data in a MySQL database to SelectDB in real time requires a serverless resource group or an exclusive resource group for Data Integration. You must purchase a serverless resource group or an exclusive resource group for Data Integration and establish network connections between the resource group and the MySQL and SelectDB data sources. We recommend that you purchase a serverless resource group.

Prepare databases

If you have a MySQL database and a SelectDB database, skip this step.

Prepare a MySQL database. In this example, an ApsaraDB RDS for MySQL database is used.
1. Log on to the ApsaraDB RDS console. In the top navigation bar, select a region. In the left-side navigation pane, click Instances. The Instances page appears.
2. On the Instances page, click Create Instance to purchase an ApsaraDB RDS for MySQL instance. For more information, see Create an ApsaraDB RDS for MySQL instance.
3. After you create an ApsaraDB RDS for MySQL instance, create a database and tables in the instance for testing. For more information, see Create accounts and databases.
Prepare a SelectDB database. In this example, an ApsaraDB for SelectDB database is used.
1. Log on to the ApsaraDB for SelectDB console. In the top navigation bar, select a region. In the left-side navigation pane, click Instances. The Instances page appears.
2. On the Instances page, click Create Instance to purchase an ApsaraDB for SelectDB instance. Then, create a database in the instance for testing. For more information, see Create an instance and Create a database and write data to the database.

Prepare a DataWorks workspace

If you have a DataWorks workspace, skip this step.

Log on to the DataWorks console. In the top navigation bar, select the desired region. Then, click Workspace in the left-side navigation pane.
On the Workspaces page, click Create Workspace. In the Create Workspace panel, select Yes for the Isolate Development and Production Environments parameter to create a DataWorks workspace in standard mode. For more information, see Create a workspace.

Prepare a resource group

Before you create a data synchronization task in DataWorks to synchronize all data in a MySQL database to SelectDB in real time, we recommend that you create a serverless resource group to provide computing and periodic scheduling resources for your data synchronization task. If you have a serverless resource group or an exclusive resource group for Data Integration, skip the step.

Go to the Resource Groups page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, click Resource Group.
On the Exclusive Resource Groups tab of the Resource Groups page, click Create Resource Group in the upper-left corner to create a serverless resource group. For more information about how to create a serverless resource group, see Create and use a serverless resource group.
After the serverless resource group is created, go back to the Resource Groups page. On the Resource Groups page, find the created resource group and click Associate Workspace in the Actions column. In the Associate Workspace panel, find the workspace that you prepared and click Associate in the Actions column to associate the resource group with the workspace.

Note

If the data source has an IP address whitelist, you must configure the IP address whitelist based on the following requirements:

If you want to access the data source over a VPC, you must add the CIDR block of the vSwitch with which the resource group is associated to the IP address whitelist of the data source.
If you want to access the data source over the Internet, you must add the EIP configured for the VPC with which the serverless resource group is associated to the IP address whitelist of the data source. If you use an old-version resource group, add the EIP of the resource group to the IP address whitelist.

For more information about the available network connectivity solutions, see Network connectivity solutions.

Step 1: (Optional) Add data sources

Add a MySQL data source

Go to the Data Sources page.
1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose More > Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.
2. In the left-side navigation pane of the SettingCenter page, choose Data Sources > Data Sources.
In the upper-left corner of the Data Sources page, click Add Data Source. In the Add Data Source dialog box, select MySQL. The Add MySQL Data Source page appears.

On the Add MySQL data source page, configure the parameters for the MySQL data source. The following table describes the parameters. Retain the default values of parameters that are not described in the table.

Note

In this example, an ApsaraDB RDS for MySQL database that is created by the current Alibaba Cloud account in the China (Shanghai) region is added to DataWorks as a MySQL data source. If you add another type of MySQL data source, refer to MySQL data source for data preparation.

Parameter		Description
Development Environment	Production Environment
Data Source Name		Specify a name for the MySQL data source.
Data Source Description		Enter the description of the MySQL data source.
Configuration Mode		Select Alibaba Cloud Instance Mode.
Alibaba Cloud Account		Select Current Alibaba Cloud Account.
Region		Select the region in which the MySQL database resides from the drop-down list.
Instance		Select the instance that you want to use.
Database Name		Enter the name of the MySQL database that you created.
Username		The username that is used to log on to the MySQL database.
Password		The password of the account that is used to log on to the MySQL database.
Authentication Method		Select No Authentication.

After you complete the configuration, find the serverless resource group that is associated with the DataWorks workspace in the Connection Configuration section and click Test Network Connectivity in the Connection Status column to test the network connectivity between the MySQL data source and the serverless resource group. For more information about the available network connectivity solutions, see Network connectivity solutions.
If the network connectivity test is successful, click Complete Creation. The MySQL data source is added.

Add a SelectDB data source

Go to the Data Sources page.
1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose More > Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.
2. In the left-side navigation pane of the SettingCenter page, choose Data Sources > Data Sources.
In the upper-left corner of the Data Sources page, click Add Data Source. In the Add Data Source dialog box, select SelectDB. The Add SelectDB Data Source page appears.

On the Add SelectDB data source page, configure the parameters for the SelectDB data source. The following table describes the parameters. Retain the default values of parameters that are not described in the table.

Parameter		Description
Development Environment	Production Environment
Data Source Name		Specify a name for the data source.
Host Address/IP Address		Enter a public endpoint or VPC endpoint.
Port		`Set the value to 9030.`
Database Name		Enter the name of the SelectDB database that you created.
HTTP Connection Address		Specify a value in the `${Public or VPC address}:8080` format.
Username		`Set the value to admin.`
Password		Configure the password.

After you complete the configuration, find the serverless resource group that is associated with the DataWorks workspace in the Connection Configuration section and click Test Network Connectivity in the Connection Status column to test the network connectivity between the data source and the serverless resource group. For more information about the available network connectivity solutions, see Network connectivity solutions.
If the network connectivity test is successful, click Complete Creation. The SelectDB data source is added.

Step 2: Create a data synchronization task

Go to the Data Integration page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and Governance > Data Integration. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Integration.
Select the source type and destination type of the data synchronization task.
1. In the upper part of the Data Integration page, select MySQL from the Source drop-down list and SelectDB from the Destination drop-down list.
2. After you select the source type and destination type of the data synchronization task, click Create Synchronization Task to go to the Create Data Synchronization Solution page.

Configure the synchronization method.

On the Create Data Synchronization Solution page, configure the parameters in the Basic Settings and Network and Resource Configuration sections. You can also reconfigure the source type and destination type of the data synchronization task. The following table describes the parameters.

Section	Parameter	Description
Basic Settings	Source And Destination	You can reconfigure the source type and destination type of the data synchronization task specified in the preceding substep. In this example, the source type and destination type remain unchanged.
	New Node Name	By default, the system automatically names the data synchronization task that is used to synchronize all data in a MySQL database to SelectDB in real time. You can manually change the name of the data synchronization task.
	Synchronization Method	Select `SelectDB Migration` from the drop-down list.
	Synchronization Mode	If you set the Synchronization Method parameter to `SelectDB Migration`, you must configure the Synchronization Mode parameter. By default, the following values are selected: Structural migration Full initialization Incremental synchronization In this example, retain the values that are selected by default.
Network and Resource Configuration	Source	Select the MySQL data source that you added in Step 1.
	Resource Group	Select the serverless resource group that you created in the resource preparation step.
	Destination	Select the SelectDB data source that you added in Step 1.

Note

After the configuration is complete, you must click Test Connectivity to test the network connectivity between the data sources and the serverless resource group.

After you complete the configurations in the Basic Settings and Network and Resource Configuration sections, click Next.

Filter source databases and source tables based on database names and table names or based on a regular expression. Select the databases and tables from which you want to synchronize data in the Source Table list and click the icon to move the selected databases and tables to the Selected Tables list.
Configure settings related to destination tables.
After you select the tables from which you want to synchronize data, the selected tables are automatically displayed in the Mapping Rules for Destination Tables section. The properties of the destination tables are waiting to be mapped. You must manually define mappings between the source tables and destination tables to determine the data reading and writing relationships. Then, you can click Refresh in the Actions column. You can directly refresh mappings between source tables and destination tables. You can also refresh mappings between source tables and destination tables after you configure settings related to destination tables.
Note
- You can select the rows in which source tables are displayed and click Batch Refresh Mapping Results. If the Customize Mapping Rules for Destination Table Names parameter is not configured, the mapping rule in the ${Source database name}_${Source table name} format is used for the related destination tables. If no tables with the same names as the source tables exist in the destination, the system automatically creates the tables in the destination.
- You can click Edit in the Customize Mapping Rules for Destination Table Names column to configure mapping rules for destination table names based on your business requirements.
  You can concatenate a built-in variable and a specific string into a destination table name. You can edit built-in variables. For example, when you create a mapping rule, you can add a suffix to a variable that indicates a source table name to form a destination table name.
- If you want to synchronize data to a SelectDB table, you must specify the primary key of the table. By default, the SelectDB table uses the same primary key as the mapped source table. Make sure that the source table has a primary key. The primary key of a destination table is used for data deduplication when data is written to the table. Specify a primary key for the destination table based on your business requirements when you create the table.
1. Modify data type mappings.
  Default mappings exist between data types of source fields and data types of destination fields. You can click Edit Mapping of Field Data Types in the upper-right corner of the Mapping Rules for Destination Tables section to configure data type mappings between source fields and destination fields based on your business requirements. After the configuration is complete, click Apply and Refresh Mapping.
2. Add fields to a destination table and assign values to the added fields.
  If the destination SelectDB table is in the to-be-created state, you can perform the following operations to add fields to the table and assign values to the fields:
  1. Add fields to one or more destination tables.
    - Add fields to a single destination table: Find the destination table to which you want to add fields and click the icon in the Destination Table Name column. In the dialog box that appears, add fields.
    - Add fields to multiple destination tables at a time: Select the destination tables to which you want to add fields at a time, click Batch Modify in the lower part of the page, and then click Destination Table Schema - Batch Modify and Add Field.
  2. Assign values to the fields. You can perform one of the following operations to assign values to the fields:
    - Assign values to the fields that are added to a single destination table: Find the destination table in which you want to assign values to newly added fields and click Configure in the Value assignment column. In the Additional Field dialog box, assign values to the fields.
    - Assign values to the fields that are added to multiple destination tables at a time: Select the destination tables in which you want to assign values to newly added fields, click Batch Modify in the lower part of the page, and then click Value assignment to assign values to the same fields in the selected destination tables at a time.
    Note
    You can assign constants and variables to the fields that are added to a destination table. You can click the icon to switch the value assignment method.
Configure other settings.
1. Configure alert rules.
  To prevent the failure of the data synchronization task from causing latency on business data synchronization, you can configure different alert rules for the real-time synchronization subtask that will be generated by the data synchronization task.
  1. In the upper-right corner of the page, click Configure Alert Rule to go to the Configure Alert Rule panel.
  2. In the Configure Alert Rule panel, click Add Alert Rule. In the Add Alert Rule dialog box, configure the parameters to configure an alert rule.
    Note
    The alert rules that you configure in this step take effect for the real-time synchronization subtask that will be generated by the synchronization task. After the configuration of the synchronization task is complete, you can refer to Manage real-time synchronization tasks to go to the Real-time Synchronization Task page and modify alert rules configured for the real-time synchronization subtask.
  3. Manage alert rules. You can enable or disable alert rules that are created. You can also specify different alert recipients based on the severity levels of alerts.
2. Configure advanced parameters.
  Data Integration provides default values for parameters such as Maximum read connections and Task concurrency. If you want to make fine-grained configurations to meet your business requirements, you can change the values of the parameters. For example, you can specify an appropriate value for the Maximum read connections parameter to prevent the current synchronization task from imposing excessive pressure on the source database and data production from being affected.
  Note
  To prevent unexpected errors or data quality issues, we recommend that you understand the meanings of the parameters before you change the values of the parameters.
  1. In the upper-right corner of the page, click Configure Advanced Parameters.
  2. In the Configure Advanced Parameters panel, configure the parameters on the Reader Config tab based on your business requirements.
    Maximum read connections: The maximum number of parallel threads that can be used by the current data synchronization task to read data from the source. You can configure this parameter to limit the number of source database connections that can be occupied by the current data synchronization task.
3. Configure DDL processing rules.
  DDL operations may be performed on the source. You can click Configure DDL Capability in the upper-right corner of the page to configure rules to process DDL messages from the source based on your business requirements.
  Note
  For more information, see Configure rules to process DDL messages.
4. Configure a resource group.
  You can click Configure Resource Group in the upper-right corner of the page to view and change the resource group that is used to run the current synchronization task.
After the preceding configurations are complete, click Complete.

Step 3: Perform O&M operations on the data synchronization task

Start the data synchronization task

After you complete the configuration of the data synchronization task, you are redirected to the Nodes section of the Data Integration page. You can find the created data synchronization task and click Start in the Actions column to start the data synchronization task.

View the running status of the data synchronization task

After you complete the configuration of the data synchronization task, you can find the task in the Nodes section of the Data Integration page and click the task name in the Name/ID column to go to the running details page of the data synchronization task. The running details page displays the following information about the data synchronization task:

Basic information: You can view the basic information about the synchronization task, such as the data sources, resource group, and synchronization type.
Running Status: You can view the details of the data synchronization task in the schema migration stage, full synchronization stage, and real-time synchronization stage on the Schema Migration, Full Data Initialization, and Real-time Data Synchronization tabs.
- Schema Migration: This tab displays information such as whether the destination table is an automatically created table or an existing table. For an automatically created table, the DDL statement that is used to create the table is displayed.
- Full Data Initialization: This tab displays the statistical information about full synchronization, including the synchronization progress, the number of data records that are synchronized, and the start time and end time of full synchronization.
- Real-time Synchronization: This tab displays the statistical information about real-time synchronization, including the information in the Progress, DDL Records, DML Records, and Alert Information panes.
Operational Logs:: This tab displays a master-worker architecture and contains the Master and Worker panes. A data synchronization task corresponds to a master and one or more workers. You can view the details of the steps in the operation logs.
- Master: This pane displays information about task initialization, data sharding, and worker management.
- Worker: This pane displays the result of data shard reading, writing, and conversion.
Failover: This tab displays the failover information that is generated for the data synchronization task within a specific period of time. You can view the exception recovery records of the data synchronization task since the last time the task is started.
Resource Utilization: This tab displays monitoring information about resources occupied by the data synchronization task. You can view the monitoring information to learn about resource utilization and adjust the amount of resources allocated to the data synchronization task based on the resource utilization.

Rerun the data synchronization task

Directly rerun the data synchronization task.
In the Nodes section of the Data Integration page, find the data synchronization task and choose More > Rerun in the Actions column to rerun the data synchronization task without modifying the configurations of the data synchronization task.
Modify the configurations of the data synchronization task and then rerun the data synchronization task.
In the Nodes section of the Data Integration page, find the data synchronization task, modify the configurations of the data synchronization task, and then click Complete. Click Apply Updates that is displayed in the Actions column of the data synchronization task to rerun the data synchronization task for the latest configurations to take effect.