Create a batch synchronization solution to synchronize all data in a database to Elasticsearch

A batch synchronization solution allows you to use one of the following methods to synchronize data to Elasticsearch: Periodic Full Sync, Periodic Incremental Sync, Only One-time Full Sync, Only One-time Incremental Sync, and Incremental Sync after One-time Full Sync. This topic describes how to create a batch synchronization solution to synchronize all data in a database to Elasticsearch. In this example, Incremental Sync after One-time Full Sync is used.

Prerequisites

The required data sources are configured. Before you configure a data synchronization solution, you must configure the data sources from which you want to read data and to which you want to write data. This way, you can select the data sources when you configure a data synchronization solution. For information about the data source types that support the solution-based synchronization feature and the configuration of a data source, see Supported data source types and read and write operations.
Note For information about the items that you need to understand before you configure a data source, see Overview.
An exclusive resource group for Data Integration that meets your business requirements is purchased. For more information, see Create and use an exclusive resource group for Data Integration.
Network connections between the exclusive resource group for Data Integration and data sources are established. For more information, see Establish a network connection between a resource group and a data source.
The data source environments are prepared. Before you configure a data synchronization solution, you must create an account that can be used to access a database and grant the account the permissions required to perform specific operations on the database based on your configurations for data synchronization. For more information, see Overview.

Procedure

Step 1: Select a synchronization solution type
Step 2: Configure network connections for data synchronization
Step 3: Configure the source and mapping rules
Step 4: Configure a destination index
Step 5: Select a synchronization method
Step 6: Configure the resources required by the synchronization solution
Step 7: Run the synchronization solution

Step 1: Select a synchronization solution type

Go to the Create Data Synchronization Solution page in Data Integration in the DataWorks console. In the Synchronization Type section of the page, select a source type and a destination type from the drop-down lists. Then, select One-click batch synchronization to Elasticsearch. For more information, see Create a synchronization solution.

Step 2: Configure network connections for data synchronization

Select a source, a destination, and a resource group that is used to run nodes. Test the network connectivity to make sure that the resource group is connected to the source and destination. For more information, see Configure network connections for data synchronization.

Step 3: Configure the source and mapping rules

In the Basic Configuration section, configure the parameters, such as the Solution Name and Location parameters, based on your business requirements.
In the Data Source section, confirm the information about the source.
In the Source Table section, select the tables from which you want to read data from the Source Table list. Then, click the icon to add the tables to the Selected Source Table list.
The Source Table list displays all tables in the source. You can select all or specific tables.
In the Conversion Rule for Table Name section, click Add Rule, select a rule type, and then configure a mapping rule of the selected type.
By default, data in a source table is written to an Elasticsearch index that has the same name as the source table. You can specify a destination index name in a mapping rule to write data in multiple source tables to the same Elasticsearch index. You can also specify prefixes in a mapping rule to write data in source tables with a specific prefix to Elasticsearch indexes with the same names as the source tables but a different prefix. You can use regular expressions to convert the names of the destination indexes. You can also use built-in variables to add prefixes and suffixes to the names of destination indexes. For more information, see Configure the source and synchronization rules.
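
The naming behavior described above can be pictured with a short Python sketch. The rule types (`fixed`, `prefix`, `regex`), the rule dictionaries, and the function `map_index_name` are hypothetical and exist only for illustration. DataWorks applies the rules that you configure in the console, and its internal logic may differ.

```python
import re


def map_index_name(table, rules):
    """Return the destination index name for a source table.

    `rules` is an ordered list of hypothetical rule dicts. The first rule
    that matches wins. If no rule matches, the table name is returned
    unchanged, which mirrors the default behavior described in this step.
    """
    for rule in rules:
        kind = rule["type"]
        if kind == "fixed" and table in rule["tables"]:
            # Several source tables are written to the same index.
            return rule["index"]
        if kind == "prefix" and table.startswith(rule["old"]):
            # Swap one prefix for another and keep the rest of the name.
            return rule["new"] + table[len(rule["old"]):]
        if kind == "regex" and re.search(rule["pattern"], table):
            # Rewrite the name with a regular expression.
            return re.sub(rule["pattern"], rule["replacement"], table)
    return table


# Illustrative rules and table names only.
rules = [
    {"type": "fixed", "tables": {"orders_2023", "orders_2024"}, "index": "orders"},
    {"type": "prefix", "old": "ods_", "new": "es_"},
    {"type": "regex", "pattern": r"_tmp$", "replacement": ""},
]

for name in ["orders_2023", "ods_user", "log_tmp", "customer"]:
    print(name, "->", map_index_name(name, rules))
```

In this sketch, rule order matters: a table that matches more than one rule is named by the first match. Whether the console resolves overlapping rules the same way is not stated in this topic.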

Step 4: Configure a destination index

Click Refresh source table and Elasticsearch Index mapping to create a destination index based on the rules that you configured in the Mapping Rules for Table Names section in Step 3. If no mapping rule is configured in Step 3, data in the source table is written to the destination index that has the same name as the source table. If no destination index that has the same name as the source table exists, the system automatically creates such a destination index. You can also change the method of creating the destination index.

Note The name of the destination index is generated based on the mapping rules that you configured in the Mapping Rules for Table Names section.


Operation	Description
Select a primary key for a source table	If the tables in the source database have primary keys, the system removes duplicate data based on the primary keys during data synchronization. If a source table does not have a primary key, you can click the icon to specify one or more fields in the table as the primary key of the table. This way, the system removes duplicate data based on the primary key during the synchronization.
Select the method of creating a destination index	You can set the Index creation method parameter to Create Index or Use Existing Index. Create Index: If you select this method, the name of the Elasticsearch index that is automatically created appears in the Elasticsearch Index Name column. You can click the name of the index to modify the values of the parameters that are related to the index. Use Existing Index: If you select this method, you must select the desired index from the drop-down list in the Elasticsearch Index Name column.
Edit destination indexes	By default, the synchronization solution generates destination indexes based on the source tables. Therefore, field type conversion may occur. For example, if the data types of the fields in a destination index are different from the data types of the fields in a source table, the synchronization solution converts the fields in the source table to the data types that can be written to the destination index. You can click the name of a destination index in the Elasticsearch Index Name column to modify the values of the parameters that are related to the index. Note You can edit a destination index only if you select Create Index from the drop-down list in the Index creation method column.

You can modify the following parameters of a destination index:

Dynamic Mapping Status: specifies whether to dynamically synchronize new fields in the source tables to the destination Elasticsearch indexes during synchronization. Valid values:
true: If the system detects that the source tables contain new fields, the system synchronizes the fields to the mapped destination Elasticsearch indexes. Then, the fields can be searched in the indexes. This is the default value.
false: If the system detects that the source tables contain new fields, the system synchronizes the fields to the mapped destination Elasticsearch indexes. However, the fields cannot be searched in the indexes after synchronization.
strict: If the system detects that the source tables contain new fields, the system does not synchronize the fields to the mapped destination Elasticsearch indexes and reports an error. You can view the details of the error in the node logs.
For more information about dynamic mappings, see the description of the dynamic parameter for open source Elasticsearch.
Shards and Replicas: the number of primary shards for each destination Elasticsearch index and the number of replica shards for each primary shard. The shards are distributed on different nodes in an Elasticsearch cluster. This way, distributed searches can be performed and the query efficiency of Elasticsearch is improved. For more information, see Terms. The default value of both parameters is 1.
Note The values of the Shards and Replicas parameters cannot be changed after you configure the parameters and run the synchronization solution.
Partition settings: You can use a column in a source table as a partition key column. You must configure this parameter if you configure the Shards and Replicas parameters. By default, the Enable Partitioning for Elasticsearch Indexes check box is not selected. If you do not select the check box, the document ID is used to route documents evenly to shards, which prevents data skew. If you select the check box, the value of the specified column determines the shard into which a document is inserted or in which a document is updated.
Data field structure: This section allows you to configure the types and extended attributes of the fields in the mapped destination Elasticsearch indexes. For more information, see Field data types in open source Elasticsearch.
Note If you do not modify the values of the parameters that are related to a destination Elasticsearch index after the index is created, the system synchronizes data based on the default values of the parameters.
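
For orientation, the following Python sketch builds the kind of request body that open source Elasticsearch accepts when an index is created. It assumes that Dynamic Mapping Status, Shards, Replicas, and Data field structure correspond to the open source `dynamic`, `number_of_shards`, `number_of_replicas`, and `properties` settings. This topic does not state how DataWorks creates the index, so treat the mapping as an assumption. The function `build_index_body` and the field names are hypothetical.

```python
import json


def build_index_body(shards=1, replicas=1, dynamic="true", properties=None):
    """Build an index-creation body in the open source Elasticsearch format.

    The defaults (1 shard, 1 replica, dynamic mapping enabled) follow the
    defaults listed in this step.
    """
    if dynamic not in ("true", "false", "strict"):
        raise ValueError("dynamic must be 'true', 'false', or 'strict'")
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        },
        "mappings": {
            # 'strict' rejects documents that contain unmapped fields.
            "dynamic": dynamic,
            "properties": properties or {},
        },
    }


# Hypothetical fields for a table with an id and a free-text title.
body = build_index_body(
    shards=3,
    replicas=1,
    dynamic="strict",
    properties={"id": {"type": "keyword"}, "title": {"type": "text"}},
)
print(json.dumps(body, indent=2))
```

The sketch only assembles a dictionary and does not contact a cluster.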

Step 5: Select a synchronization method

In the Sync Rules step, select a synchronization method.

The following table describes the supported synchronization methods.


Method	Description
Only One-time Full Sync	If you select this method, you need to perform the synchronization operations only once to synchronize all data in the source to Elasticsearch.
Only One-time Incremental Sync	If you select this method, you need to perform synchronization operations only once to synchronize incremental data in the source to Elasticsearch based on the specified filter conditions.
Periodic Full Sync	If you select this method, you must specify a scheduling cycle for the batch synchronization solution. Then, the system synchronizes all data in the source to Elasticsearch each time the system runs the solution based on the specified scheduling cycle.
Periodic Incremental Sync	If you select this method, the system synchronizes only incremental data in the source to Elasticsearch each time the system runs the solution based on the specified filter conditions and scheduling cycle.
Incremental Sync after One-time Full Sync	If you select this method, the system first synchronizes all data to Elasticsearch. Then, the system synchronizes only incremental data in the source to Elasticsearch each time the system runs the solution based on the specified filter conditions and scheduling cycle.
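
As a quick reference, the following Python sketch lists the parameter sections that apply to each method. The Full Sync and Incremental Sync assignments follow the conditions stated in those sections of this topic. The Recurrence assignment is an assumption inferred from the methods whose descriptions mention a scheduling cycle. The names `SECTIONS_BY_METHOD` and `sections_for` are hypothetical.

```python
FULL = "Full Sync"
INCREMENTAL = "Incremental Sync"
RECURRENCE = "Recurrence"  # assumed to apply only to scheduled methods

SECTIONS_BY_METHOD = {
    "Only One-time Full Sync": [FULL],
    "Only One-time Incremental Sync": [INCREMENTAL],
    "Periodic Full Sync": [FULL, RECURRENCE],
    "Periodic Incremental Sync": [INCREMENTAL, RECURRENCE],
    "Incremental Sync after One-time Full Sync": [FULL, INCREMENTAL, RECURRENCE],
}


def sections_for(method):
    """Return the parameter sections to fill in for a synchronization method."""
    try:
        return SECTIONS_BY_METHOD[method]
    except KeyError:
        raise ValueError(f"unknown synchronization method: {method!r}") from None


print(sections_for("Incremental Sync after One-time Full Sync"))
```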

Configure parameters for the synchronization method that you select.

The parameters that you need to configure in the Full Sync, Incremental Sync, and Recurrence sections vary based on the synchronization method that you select. The following tables describe the parameters.

Full Sync

The parameters in this section are required only if you set the Solution parameter to Only One-time Full Sync, Periodic Full Sync, or Incremental Sync after One-time Full Sync.


Parameter	Description
Clear Index Data Before Writing	Valid values: Yes: The original data in the destination Elasticsearch indexes is deleted before data in the source is written to the indexes. No: The original data in the destination Elasticsearch indexes is retained when data in the source is written to the indexes. Important If you set this parameter to Yes, all original data in the destination Elasticsearch indexes is deleted. Exercise caution when you configure this parameter.
Write Policy	Valid values: Insert: The system inserts data into the destination Elasticsearch indexes during data synchronization. This is the default value. Update: If a document that has the same primary key value as a source record already exists in a destination Elasticsearch index, the system deletes the document from the index and then inserts the source record. Otherwise, the system directly inserts the record into the index.
Batch Size	The number of data records that can be written to Elasticsearch at a time. Default value: 1000. You can configure this parameter based on your actual network conditions and the amount of data that you want to synchronize to reduce network overheads.
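
The following Python sketch models two of the parameters above with an in-memory dictionary in place of an Elasticsearch index. It is an illustration under stated assumptions, not the DataWorks implementation: the document ID is assumed to be derived from the primary key fields, and the helper names `batches` and `update_write` are hypothetical. How the Insert policy treats a document whose ID already exists is not stated in this topic, so the sketch does not model it.

```python
from itertools import islice


def batches(records, batch_size=1000):
    """Yield lists of at most `batch_size` records (default matches Batch Size)."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


def update_write(index, record, key_fields):
    """Model the Update policy: delete an existing document, then insert.

    `index` is a dict that maps a document ID to a document. The ID is
    built from the primary key fields, which is an assumption.
    """
    doc_id = "_".join(str(record[field]) for field in key_fields)
    index.pop(doc_id, None)  # delete the old document if one exists
    index[doc_id] = record   # insert the new document
    return doc_id


# 2,500 records with a batch size of 1,000 produce three write requests.
print([len(chunk) for chunk in batches(range(2500), 1000)])

index = {}
update_write(index, {"id": 1, "name": "old"}, ["id"])
update_write(index, {"id": 1, "name": "new"}, ["id"])
print(index)
```

A larger batch size means fewer requests for the same amount of data, at the cost of larger payloads per request.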

Incremental Sync

The parameters in this section are required only if you set the Solution parameter to Only One-time Incremental Sync, Periodic Incremental Sync, or Incremental Sync after One-time Full Sync.

Note You can use scheduling parameters to specify the scope of the data that you want to synchronize and the location to which you want to write the data. For more information about how to use scheduling parameters, see Description for using scheduling parameters in data synchronization.


Parameter	Description
Write Policy	Valid values: Insert: The system inserts data into the destination Elasticsearch indexes during data synchronization. This is the default value. Update: If a document that has the same primary key value as a source record already exists in a destination Elasticsearch index, the system deletes the document from the index and then inserts the source record. Otherwise, the system directly inserts the record into the index.
Batch Size	The number of data records that can be written to Elasticsearch at a time. Default value: 1000. You can configure this parameter based on your actual network conditions and the amount of data that you want to synchronize to reduce network overheads.
Condition for Incremental Synchronization	You can use the SQL WHERE clause to extract incremental data from source tables. You need to enter only the WHERE clause in the Condition for Incremental Synchronization field. You do not need to enter the keyword WHERE. You can use built-in variables in the WHERE clause. For example, you can use the `${bdp.system.bizdate}` variable to specify the data timestamp and use the `${bdp.system.cyctime}` variable to specify the scheduling time.
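
The following Python sketch shows how a condition that contains `${bdp.system.bizdate}` could be resolved for one run. It assumes that the data timestamp is the day before the scheduled run date, in the yyyymmdd format. The column name `gmt_modified`, the table name `orders`, and the helpers `render_condition` and `build_query` are hypothetical. The `${bdp.system.cyctime}` variable is not covered here.

```python
from datetime import date, timedelta

BIZDATE = "${bdp.system.bizdate}"


def render_condition(condition, run_date):
    """Replace the bizdate variable with the assumed data timestamp.

    Assumption: the data timestamp is the day before `run_date`,
    formatted as yyyymmdd.
    """
    bizdate = (run_date - timedelta(days=1)).strftime("%Y%m%d")
    return condition.replace(BIZDATE, bizdate)


def build_query(table, condition):
    """Show where the condition ends up: after the WHERE keyword."""
    return f"SELECT * FROM {table} WHERE {condition}"


# Enter only the condition in the console, without the WHERE keyword.
condition = "gmt_modified >= '${bdp.system.bizdate}'"
resolved = render_condition(condition, date(2024, 3, 15))
print(build_query("orders", resolved))
```

How a yyyymmdd string compares with a date or timestamp column depends on the source database, so check the comparison against your own schema.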

Recurrence

In the Recurrence section, you need to configure the parameters that are used to run the data synchronization solution, such as Recurrence, Scheduling Period, and Pause Scheduling. The scheduling settings for a data synchronization solution that is used for batch synchronization of all data in a database are the same as the scheduling settings for a data synchronization node. For more information, see Configure time properties.

Step 6: Configure the resources required by the synchronization solution

This synchronization solution generates batch synchronization nodes for full data synchronization and incremental data synchronization. You can specify the names for the batch synchronization nodes and select resource groups for scheduling and resource groups for Data Integration. You can also view the maximum number of connections and parallel nodes allowed for the source database. If you want to perform fine-grained configurations for the nodes, you can modify related parameters in the Advanced Configuration section.

Note Batch synchronization nodes in DataWorks can run only after resource groups for scheduling issue them to resource groups for Data Integration. Therefore, resource groups for scheduling are also required. If you run nodes on exclusive resource groups for scheduling, you are charged for scheduling instances. For more information, see Mechanism for issuing nodes.

Step 7: Run the synchronization solution

Go to the Tasks page in Data Integration and find the created data synchronization solution.
Click Submit and Run in the Actions column to run the data synchronization solution.
Click Execution details in the Actions column to view the execution details of the data synchronization solution.

What to do next

After a data synchronization solution is configured, you can manage the solution. For example, you can add tables to or remove tables from the solution, configure alerting and monitoring settings for the nodes that are generated by the solution, and view information about the running of the nodes. For more information, see Perform O&M for a data synchronization solution.