You can create a real-time synchronization solution to synchronize both full and incremental data to Elasticsearch. This topic describes how to create and run such a solution.
Prerequisites
- The required data sources are configured. Before you configure a data synchronization solution, you must configure the source and destination data sources so that you can select them when you configure the solution. For information about the data source types that support the solution-based synchronization feature and the configuration of a data source, see Supported data source types and read and write operations.
  Note: For information about the items that you need to understand before you configure a data source, see Overview.
- An exclusive resource group for Data Integration that meets your business requirements is purchased. For more information, see Create and use an exclusive resource group for Data Integration.
- Network connections between the exclusive resource group for Data Integration and data sources are established. For more information, see Establish a network connection between a resource group and a data source.
- The data source environments are prepared. Before you configure a data synchronization solution, you must create an account that can be used to access a database and grant the account the permissions required to perform specific operations on the database based on your configurations for data synchronization. For more information, see Overview.
Background information
A real-time synchronization solution allows upper-layer applications to search for, analyze, and develop data in real time, and is suitable for real-time monitoring of data updates in a business database.

Item | Description |
---|---|
Number of tables from which you can read data | |
Nodes | A real-time synchronization solution generates batch synchronization nodes to synchronize full data and real-time synchronization nodes to synchronize incremental data. The number of batch synchronization nodes that are generated by the solution varies based on the number of tables from which you can read data. |
Data write | After you run a real-time synchronization solution, full data in the source is written to the destination by using batch synchronization nodes. Then, incremental data in the source is written to the destination in real time by using real-time synchronization nodes. |
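
The write order described in the table (full data first, then incremental changes) can be illustrated with a minimal Python sketch. It is not the Data Integration implementation: the in-memory dictionary that stands in for an Elasticsearch index, the change-event tuple format, and all function names are assumptions made for illustration only.

```python
from typing import Any, Dict, Iterable, Optional, Tuple

Row = Dict[str, Any]
# Hypothetical change event: (operation, document id, row). The row is None for deletes.
Change = Tuple[str, str, Optional[Row]]


def apply_full_load(index: Dict[str, Row], rows: Iterable[Row], key: str) -> None:
    """Batch phase: copy every row of the source snapshot into the index."""
    for row in rows:
        index[str(row[key])] = dict(row)


def apply_change(index: Dict[str, Row], change: Change) -> None:
    """Real-time phase: apply one insert, update, or delete event."""
    operation, doc_id, row = change
    if operation == "delete":
        index.pop(doc_id, None)
    else:  # "insert" and "update" both overwrite the document
        index[doc_id] = dict(row or {})


def synchronize(snapshot: Iterable[Row], changes: Iterable[Change], key: str = "id") -> Dict[str, Row]:
    """Write full data first, then replay incremental changes in order."""
    index: Dict[str, Row] = {}
    apply_full_load(index, snapshot, key)
    for change in changes:
        apply_change(index, change)
    return index
```

Replaying the changes only after the full load completes is what keeps the destination consistent with the source: an update that arrives during the batch phase is applied on top of the snapshot row rather than being overwritten by it.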
Procedure
- Step 1: Select a synchronization solution
- Step 2: Configure network connections for data synchronization
- Step 3: Configure the source and synchronization rules
- Step 4: Configure a destination index
- Step 5: Configure rules to process DDL messages
- Step 6: Configure the resources required by the synchronization solution
- Step 7: Run the synchronization solution
Step 1: Select a synchronization solution
Go to the Data Integration page in the DataWorks console and click Create Data Synchronization Solution. On the Create Data Synchronization Solution page, select a source and a destination for data synchronization from the drop-down lists. Then, select One-click real-time synchronization to Elasticsearch. For more information, see Create a synchronization solution.
Step 2: Configure network connections for data synchronization
Select a source, a destination, and a resource group that is used to run nodes. Test the network connectivity to make sure that the resource group is connected to the source and destination. For more information, see Configure network connections for data synchronization.
Step 3: Configure the source and synchronization rules
- In the Basic Configuration section, configure the parameters, such as the Solution Name and Location parameters, based on your business requirements.
- In the Data Source section, confirm the information about the source.
- In the Source Table section, select the tables from which you want to read data from the Source Table list. Then, click the icon to add the tables to the Selected Source Table list.
The Source Table list displays all tables in the source. You can select all or specific tables.
- In the Conversion Rule for Table Name section, click Add rule, select a rule type, and then configure a mapping rule of the selected type. By default, data in a source table is written to an Elasticsearch index that has the same name as the source table. You can specify a destination index name in a mapping rule to write data in multiple source tables to the same Elasticsearch index. You can also specify prefixes in a mapping rule to write data in source tables with a specific prefix to Elasticsearch indexes with the same names as the source tables but a different prefix. You can use regular expressions to convert the names of the destination indexes. You can also use built-in variables to add prefixes and suffixes to the names of destination indexes. For more information, see Configure the source and synchronization rules.
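
The effect of table name conversion rules can be sketched in Python as follows. This is only an illustration of the behavior described above (same-name default, merging several tables into one index, and prefix replacement through regular expressions). The `Rule` class, the `destination_index` function, and the first-match-wins ordering are assumptions, not the DataWorks rule engine.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    """Hypothetical mapping rule: a regular expression and a replacement template."""
    pattern: str      # must match the whole source table name
    replacement: str  # may reference capture groups, for example r"dst_\1"


def destination_index(table: str, rules: List[Rule]) -> str:
    """Return the destination index name for a source table.

    The first rule whose pattern matches the whole table name wins (an
    assumption of this sketch). If no rule matches, the index keeps the
    name of the source table, which mirrors the default described above.
    """
    for rule in rules:
        match = re.fullmatch(rule.pattern, table)
        if match:
            return match.expand(rule.replacement)
    return table


rules = [
    Rule(r"order_\d+", "orders"),   # write order_01, order_02, ... to one index
    Rule(r"src_(.*)", r"dst_\1"),   # replace the src_ prefix with dst_
]
for name in ["order_01", "order_02", "src_users", "customers"]:
    print(name, "->", destination_index(name, rules))
```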
Step 4: Configure a destination index
Click Refresh source table and Elasticsearch Index mapping to create a destination index based on the rules that you configured in the Conversion Rule for Table Name section in Step 3. If no mapping rule is configured in Step 3, data in a source table is written to the destination index that has the same name as the source table. If no such index exists, the system automatically creates it. You can also change the method of creating the destination index.

Operation | Description |
---|---|
Select a primary key for a source table | |
Select the method of creating a destination index | You can set the Index creation method parameter to Create Index or Use Existing Index. |
Edit destination indexes | By default, the synchronization solution generates destination indexes based on the source tables. Therefore, field type conversion may occur. For example, if the data types of the fields in a destination index are different from the data types of the fields in a source table, the synchronization solution converts the fields in the source table to the data types that can be written to the destination index. You can click the name of a destination index in the ElasticSearchIndex Name column to modify the values of the parameters that are related to the index. Note: You can edit a destination index only if you select Create Index from the drop-down list in the Index creation method column. Note: If you do not modify the values of the parameters that are related to a destination Elasticsearch index after the index is created, the system synchronizes data based on the default values of the parameters. |
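
To make the idea of field type conversion concrete, the following Python sketch derives an Elasticsearch index definition from the column types of a source table. The type mapping shown here is purely illustrative and is not the mapping that Data Integration applies. The `TYPE_MAP` table, the `build_index_body` function, and the fallback to `keyword` are assumptions. Only the `mappings.properties` layout of the request body follows the Elasticsearch create index API.

```python
import json
from typing import Any, Dict

# Illustrative mapping from relational column types to Elasticsearch field types.
# This is an assumption for the sketch, not the conversion table used by the product.
TYPE_MAP = {
    "bigint": "long",
    "int": "integer",
    "double": "double",
    "varchar": "keyword",
    "text": "text",
    "datetime": "date",
}


def build_index_body(columns: Dict[str, str]) -> Dict[str, Any]:
    """Build a create-index request body from {column name: source type}."""
    properties = {}
    for name, source_type in columns.items():
        # Drop a length or precision suffix such as "(255)" before the lookup.
        base_type = source_type.lower().split("(")[0].strip()
        # Unknown types fall back to "keyword" (an assumption of this sketch).
        properties[name] = {"type": TYPE_MAP.get(base_type, "keyword")}
    return {"mappings": {"properties": properties}}


body = build_index_body({"id": "BIGINT", "title": "varchar(255)", "created_at": "datetime"})
print(json.dumps(body, indent=2))
```

A body of this shape is what you would review or adjust when you edit a destination index that is created by the solution.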
Step 5: Configure rules to process DDL messages
DDL operations may be performed on the source while the synchronization solution is running. Data Integration provides default rules to process the resulting DDL messages. You can also configure processing rules for different DDL messages based on your business requirements. For more information, see Rules for processing DDL messages.
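
Conceptually, per-message DDL rules behave like a lookup table that maps each DDL message type to a processing action, with defaults that you can override. The Python sketch below illustrates that idea only. The message type names, the action names (`ignore`, `alert`, `error`), the default values, and the `handle_ddl` function are all hypothetical. For the rules that are actually supported, see Rules for processing DDL messages.

```python
from typing import Dict, List, Optional

# Hypothetical defaults: one action per DDL message type.
DEFAULT_RULES: Dict[str, str] = {
    "CREATE_TABLE": "ignore",
    "ADD_COLUMN": "ignore",
    "DROP_COLUMN": "alert",
    "DROP_TABLE": "error",
}


class DdlProcessingError(RuntimeError):
    """Raised when a rule says that the DDL message must stop the node."""


def handle_ddl(message_type: str, alerts: List[str],
               overrides: Optional[Dict[str, str]] = None) -> str:
    """Look up the action for a DDL message and carry it out.

    Overrides take precedence over the defaults. Message types without a
    rule are treated as errors (an assumption of this sketch).
    """
    rules = {**DEFAULT_RULES, **(overrides or {})}
    action = rules.get(message_type, "error")
    if action == "error":
        raise DdlProcessingError(f"DDL message {message_type} stops synchronization")
    if action == "alert":
        alerts.append(message_type)  # record the message and keep running
    return action
```

For example, passing `overrides={"DROP_TABLE": "alert"}` would record a dropped table as an alert instead of stopping the run.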
Step 6: Configure the resources required by the synchronization solution
After you create a synchronization solution, the synchronization solution generates batch synchronization nodes for full data synchronization and real-time synchronization nodes for incremental data synchronization. You must configure the parameters in the Configure Resources step.
You can configure the exclusive resource groups for Data Integration that you want to use to run real-time synchronization nodes and batch synchronization nodes, and the resource groups for scheduling that you want to use to run batch synchronization nodes. You can also click Advanced Configuration to configure the Number of concurrent writes on the target side and Allow Dirty Data Records parameters.
- DataWorks uses resource groups for scheduling to issue batch synchronization nodes to resource groups for Data Integration and runs the nodes on the resource groups for Data Integration. Therefore, a batch synchronization node also occupies the resources of a resource group for scheduling. You are charged fees for using the resource group for scheduling to schedule the batch synchronization nodes. For information about the node issuing mechanism, see Mechanism for issuing nodes.
- We recommend that you use different resource groups to run batch and real-time synchronization nodes. If you use the same resource group for both, the nodes compete for CPU, memory, and network resources and affect each other. In this case, the batch synchronization nodes may slow down, or the real-time synchronization nodes may be delayed. Out of memory (OOM) errors may also occur due to insufficient resources.
Step 7: Run the synchronization solution
- Go to the Tasks page in Data Integration and find the created data synchronization solution.
- Click Submit and Run in the Actions column to run the data synchronization solution.
- Click Execution details in the Actions column to view the execution details of the data synchronization solution.