Create a real-time synchronization solution to synchronize data to Elasticsearch - DataWorks

You can create a real-time synchronization solution and use the solution to synchronize full and incremental data to Elasticsearch. This topic describes how to create a real-time synchronization solution to synchronize data to Elasticsearch.

Prerequisites

The required data sources are configured. Before you configure a data synchronization solution, you must configure the data sources from which you want to read data and to which you want to write data. This way, you can select the data sources when you configure a data synchronization solution. For information about the data source types that support the solution-based synchronization feature and the configuration of a data source, see Supported data source types and read and write operations.
Note For information about the items that you need to understand before you configure a data source, see Overview.
An exclusive resource group for Data Integration that meets your business requirements is purchased. For more information, see Create and use an exclusive resource group for Data Integration.
Network connections between the exclusive resource group for Data Integration and data sources are established. For more information, see Establish a network connection between a resource group and a data source.
The data source environments are prepared. Before you configure a data synchronization solution, you must create an account that can be used to access a database and grant the account the permissions required to perform specific operations on the database based on your configurations for data synchronization. For more information, see Overview.

Background information

A real-time synchronization solution allows upper-layer applications to search for, analyze, and develop data in real time, and is suitable for real-time monitoring of data updates in a business database.


Item	Description
Number of tables from which you can read data	You can read data from multiple source tables and write the data to multiple destination indexes. You can configure mapping rules for the source and destination. This way, you can read data from multiple tables and write the data to the same destination index.
Nodes	A real-time synchronization solution generates batch synchronization nodes to synchronize full data and real-time synchronization nodes to synchronize incremental data. The number of batch synchronization nodes that are generated by the solution varies based on the number of tables from which you read data.
Data write	After you run a real-time synchronization solution, full data in the source is written to the destination by using batch synchronization nodes. Then, incremental data in the source is written to the destination in real time by using real-time synchronization nodes.
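
The two-phase write described in the table can be pictured with a minimal sketch. It is not DataWorks code: the destination is modeled as a Python dictionary keyed by primary key, and the function names and the event format (`op`, `row`) are assumptions made for illustration only.

```python
from typing import Any

Row = dict[str, Any]


def full_load(rows: list[Row], key: str) -> dict[Any, Row]:
    """Batch phase: copy a snapshot of the source table, keyed by primary key."""
    return {row[key]: dict(row) for row in rows}


def apply_change(index: dict[Any, Row], event: dict[str, Any], key: str) -> None:
    """Real-time phase: apply one change event to the destination."""
    op, row = event["op"], event["row"]
    if op in ("insert", "update"):
        index[row[key]] = dict(row)  # upsert keeps one document per key
    elif op == "delete":
        index.pop(row[key], None)
    else:
        raise ValueError(f"unknown operation: {op}")


# Hypothetical snapshot and change events.
snapshot = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
events = [
    {"op": "update", "row": {"id": 1, "name": "a2"}},
    {"op": "insert", "row": {"id": 3, "name": "c"}},
    {"op": "delete", "row": {"id": 2}},
]

destination = full_load(snapshot, key="id")
for event in events:
    apply_change(destination, event, key="id")
```

Because every write is keyed by the primary key, replaying a change for an existing key overwrites the document instead of duplicating it.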

Procedure

Step 1: Select a synchronization solution
Step 2: Configure network connections for data synchronization
Step 3: Configure the source and synchronization rules
Step 4: Configure a destination index
Step 5: Configure rules to process DDL messages
Step 6: Configure the resources required by the synchronization solution
Step 7: Run the synchronization solution

Step 1: Select a synchronization solution

Go to the Data Integration page in the DataWorks console and click Create Data Synchronization Solution. On the Create Data Synchronization Solution page, select a source and a destination for data synchronization from the drop-down lists. Then, select One-click real-time synchronization to Elasticsearch. For more information, see Create a synchronization solution.

Step 2: Configure network connections for data synchronization

Select a source, a destination, and a resource group that is used to run nodes. Test the network connectivity to make sure that the resource group is connected to the source and destination. For more information, see Configure network connections for data synchronization.

Step 3: Configure the source and synchronization rules

In the Basic Configuration section, configure the parameters, such as the Solution Name and Location parameters, based on your business requirements.
In the Data Source section, confirm the information about the source.
In the Source Table section, select the tables from which you want to read data from the Source Table list. Then, click the icon to add the tables to the Selected Source Table list.
The Source Table list displays all tables in the source. You can select all or specific tables.
In the Conversion Rule for Table Name section, click Add rule, select a rule type, and then configure a mapping rule of the selected type.
By default, data in a source table is written to an Elasticsearch index that has the same name as the source table. You can specify a destination index name in a mapping rule to write data in multiple source tables to the same Elasticsearch index. You can also specify prefixes in a mapping rule to write data in source tables with a specific prefix to Elasticsearch indexes with the same names as the source tables but a different prefix. You can use regular expressions to convert the names of the destination indexes. You can also use built-in variables to add prefixes and suffixes to the names of destination indexes. For more information, see Configure the source and synchronization rules.
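
The effect of table name mapping rules can be sketched as follows. This is an illustration only, not the rule engine that DataWorks uses: the function `map_index_name`, the rule format (a regular expression plus a replacement template), and the table names are all hypothetical.

```python
import re


def map_index_name(table: str, rules: list[tuple[str, str]]) -> str:
    """Return the destination index name for a source table.

    Each rule is (regex pattern, replacement template). The first rule whose
    pattern matches the whole table name wins. If no rule matches, the index
    takes the same name as the source table, which mirrors the default.
    """
    for pattern, template in rules:
        match = re.fullmatch(pattern, table)
        if match:
            return match.expand(template)
    return table


rules = [
    (r"order_\d+", "orders"),  # write several source tables to one index
    (r"src_(.+)", r"es_\1"),   # replace the prefix and keep the rest of the name
]

mapped = {t: map_index_name(t, rules) for t in ["order_01", "order_02", "src_users", "customers"]}
```

In this sketch, `order_01` and `order_02` share the index `orders`, `src_users` is written to `es_users`, and `customers` keeps its name because no rule matches.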

Step 4: Configure a destination index

Click Refresh source table and Elasticsearch Index mapping to create a destination index based on the mapping rules that you configured in Step 3. If no mapping rule is configured in Step 3, data in the source table is written to the destination index that has the same name as the source table. If no destination index that has the same name as the source table exists, the system automatically creates such a destination index. You can also change the method of creating the destination index.

Note The name of the destination index is generated based on the mapping rules that you configured in Step 3.


Operation	Description
Select a primary key for a source table	If the tables in the source database have primary keys, the system removes duplicate data based on the primary keys during data synchronization. If a source table does not have a primary key, you can click the icon to specify one or more fields in the table as the primary key of the table. This way, the system removes duplicate data based on the primary key during the synchronization.
Select the method of creating a destination index	You can set the Index creation method parameter to Create Index or Use Existing Index. Create Index: If you select this method, the name of the Elasticsearch index that is automatically created appears in the Elasticsearch Index Name column. You can click the name of the index to modify the values of the parameters that are related to the index. Use Existing Index: If you select this method, you must select the desired index from the drop-down list in the Elasticsearch Index Name column.
Edit destination indexes	By default, the synchronization solution generates destination indexes based on the source tables. Therefore, field type conversion may occur. For example, if the data types of the fields in a destination index are different from the data types of the fields in a source table, the synchronization solution converts the fields in the source table to the data types that can be written to the destination index. You can click the name of a destination index in the Elasticsearch Index Name column to modify the values of the parameters that are related to the index.

Note You can edit a destination index only if you select Create Index from the drop-down list in the Index creation method column.

You can modify the following parameters of a destination index:

Dynamic Mapping Status: specifies whether to dynamically synchronize new fields in the source tables to the destination Elasticsearch indexes during synchronization. Valid values:
true: If the system detects that the source tables contain new fields, the system synchronizes the fields to the mapped destination Elasticsearch indexes. Then, the fields can be searched in the indexes. This is the default value.
false: If the system detects that the source tables contain new fields, the system synchronizes the fields to the mapped destination Elasticsearch indexes. However, the fields cannot be searched in the indexes after synchronization.
strict: If the system detects that the source tables contain new fields, the system does not synchronize the fields to the mapped destination Elasticsearch indexes and reports an error. You can view the details of the error in the node logs.
For more information about dynamic mappings, see the description of the dynamic parameter for open source Elasticsearch.
Shards and Replicas: the number of primary shards for each destination Elasticsearch index and the number of replica shards for each primary shard. The shards are distributed on different nodes in an Elasticsearch cluster. This way, distributed searches can be performed and the query efficiency of Elasticsearch is improved. For more information, see Terms.
Note The values of the Shards and Replicas parameters cannot be changed after you configure the parameters and run the real-time synchronization solution. The default values of the Shards and Replicas parameters are 1.
Partition settings: You can use a column in a source table as a partition key column. You must configure this parameter if you configure the Shards and Replicas parameters. By default, the Enable Partitioning for Elasticsearch Indexes check box is not selected. If you do not select the check box, the ID of a specific document is used to evenly route documents to shards. This prevents data skew. If you select the check box, the value of a specific column is used to insert a document into a specific shard or update a document in a specific shard.
Data field structure: This section allows you to configure the types and extended attributes of the fields in the mapped destination Elasticsearch indexes. For more information, see Field data types in open source Elasticsearch.
Note If you do not modify the values of the parameters that are related to a destination Elasticsearch index after the index is created, the system synchronizes data based on the default values of the parameters.
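
For readers who know open source Elasticsearch, the index parameters in this step roughly correspond to the settings and mappings of a create-index request. The sketch below assumes that correspondence. `build_index_body` and `pick_shard` are hypothetical helpers, the field names are invented, and CRC32 only stands in for the hash function that Elasticsearch actually uses for routing.

```python
import zlib
from typing import Optional

# Dynamic Mapping Status values mapped to the Elasticsearch "dynamic" setting.
DYNAMIC_VALUES = {"true": True, "false": False, "strict": "strict"}


def build_index_body(shards: int = 1, replicas: int = 1,
                     dynamic: str = "true",
                     properties: Optional[dict] = None) -> dict:
    """Build a create-index request body from the parameters in this step."""
    if dynamic not in DYNAMIC_VALUES:
        raise ValueError(f"dynamic must be one of {sorted(DYNAMIC_VALUES)}")
    if shards < 1 or replicas < 0:
        raise ValueError("shards must be >= 1 and replicas must be >= 0")
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        },
        "mappings": {
            "dynamic": DYNAMIC_VALUES[dynamic],
            "properties": properties or {},
        },
    }


def pick_shard(routing_value: str, shards: int) -> int:
    """Toy routing: hash the routing value, then take it modulo the shard count.

    With partitioning disabled the routing value would be the document ID;
    with partitioning enabled it would be the value of the partition key column.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % shards


body = build_index_body(
    shards=3,
    replicas=1,
    dynamic="strict",
    properties={"order_id": {"type": "keyword"}, "amount": {"type": "double"}},
)
```

The modulo step suggests why the shard count is fixed once data is written: changing it would send the same routing value to a different shard.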

Step 5: Configure rules to process DDL messages

DDL operations may be performed on a source while data is being synchronized. Data Integration provides default rules to process the resulting DDL messages. You can also configure processing rules for different DDL messages based on your business requirements. For more information, see Rules for processing DDL messages.
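
Conceptually, a DDL processing rule maps a type of DDL message to an action, with defaults that you can override. The sketch below only illustrates that idea. The action names, the DDL message types, and the default values are hypothetical and are not taken from DataWorks; the actual options are listed in Rules for processing DDL messages.

```python
from enum import Enum


class Action(Enum):
    NORMAL = "normal"  # pass the DDL message on to the destination
    IGNORE = "ignore"  # drop the message and keep synchronizing
    ALERT = "alert"    # drop the message and raise an alert
    ERROR = "error"    # stop the node with an error


# Hypothetical defaults, for illustration only.
DEFAULT_RULES = {
    "CREATE TABLE": Action.NORMAL,
    "ADD COLUMN": Action.NORMAL,
    "DROP TABLE": Action.IGNORE,
    "DROP COLUMN": Action.IGNORE,
}


def resolve_action(ddl_type, overrides=None, fallback=Action.ERROR):
    """Pick the action for a DDL message: overrides first, then defaults."""
    rules = {**DEFAULT_RULES, **(overrides or {})}
    return rules.get(ddl_type.upper(), fallback)


custom = {"DROP TABLE": Action.ALERT}
action_for_drop = resolve_action("drop table", custom)
action_for_add = resolve_action("ADD COLUMN", custom)
action_for_unknown = resolve_action("RENAME TABLE", custom)
```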

Step 6: Configure the resources required by the synchronization solution

After you create a synchronization solution, the synchronization solution generates batch synchronization nodes for full data synchronization and real-time synchronization nodes for incremental data synchronization. You must configure the parameters in the Configure Resources step.

You can configure the exclusive resource groups for Data Integration that you want to use to run real-time synchronization nodes and batch synchronization nodes, and the resource groups for scheduling that you want to use to run batch synchronization nodes. You can also click Advanced Configuration to configure the Number of concurrent writes on the target side and Allow Dirty Data Records parameters.

Note

DataWorks uses resource groups for scheduling to issue batch synchronization nodes to resource groups for Data Integration and runs the nodes on the resource groups for Data Integration. Therefore, a batch synchronization node also occupies the resources of a resource group for scheduling. You are charged fees for using the resource group for scheduling to schedule the batch synchronization nodes. For information about the node issuing mechanism, see Mechanism for issuing nodes.
We recommend that you use different resource groups to run batch and real-time synchronization nodes. If you use the same resource group for both, the two types of nodes compete for CPU, memory, and network resources. In this case, the batch synchronization nodes may slow down, or the real-time synchronization nodes may be delayed. Out of memory (OOM) errors may also occur due to insufficient resources.

Step 7: Run the synchronization solution

Go to the Tasks page in Data Integration and find the created data synchronization solution.
Click Submit and Run in the Actions column to run the data synchronization solution.
Click Execution details in the Actions column to view the execution details of the data synchronization solution.

What to do next

After a data synchronization solution is configured, you can manage the solution. For example, you can add tables to or remove tables from the solution, configure alerting and monitoring settings for the nodes that are generated by the solution, and view information about the running of the nodes. For more information, see Perform O&M for a data synchronization solution.