A batch synchronization solution allows you to use one of the following methods to synchronize data to Elasticsearch: Periodic Full Sync, Periodic Incremental Sync, Only One-time Full Sync, Only One-time Incremental Sync, and Incremental Sync after One-time Full Sync. This topic describes how to create a batch synchronization solution to synchronize all data in a database to Elasticsearch. In this example, Incremental Sync after One-time Full Sync is used.
Prerequisites
- The required data sources are configured. Before you configure a data synchronization solution, you must configure the data sources from which you want to read data and to which you want to write data. This way, you can select the data sources when you configure a data synchronization solution. For information about the data source types that support the solution-based synchronization feature and the configuration of a data source, see Supported data source types and read and write operations. Note: For information about the items that you need to understand before you configure a data source, see Overview.
- An exclusive resource group for Data Integration that meets your business requirements is purchased. For more information, see Create and use an exclusive resource group for Data Integration.
- Network connections between the exclusive resource group for Data Integration and data sources are established. For more information, see Establish a network connection between a resource group and a data source.
- The data source environments are prepared. Before you configure a data synchronization solution, you must create an account that can be used to access a database and grant the account the permissions required to perform specific operations on the database based on your configurations for data synchronization. For more information, see Overview.
Procedure
- Step 1: Select a synchronization solution type
- Step 2: Configure network connections for data synchronization
- Step 3: Configure the source and mapping rules
- Step 4: Configure a destination index
- Step 5: Select a synchronization method
- Step 6: Configure the resources required by the synchronization solution
- Step 7: Run the synchronization solution
Step 1: Select a synchronization solution type
Go to the Create Data Synchronization Solution page in Data Integration in the DataWorks console. In the Synchronization Type section of the Create Data Synchronization Solution page, select a source type and a destination type for data synchronization from the drop-down lists. Then, select One-click batch synchronization to Elasticsearch. For more information, see Create a synchronization solution.
Step 2: Configure network connections for data synchronization
Select a source, a destination, and a resource group that is used to run nodes. Test the network connectivity to make sure that the resource group is connected to the source and destination. For more information, see Configure network connections for data synchronization.
Step 3: Configure the source and mapping rules
- In the Basic Configuration section, configure the parameters, such as the Solution Name and Location parameters, based on your business requirements.
- In the Data Source section, confirm the information about the source.
- In the Source Table section, select the tables from which you want to read data from the Source Table list. Then, click the icon to add the tables to the Selected Source Table list.
The Source Table list displays all tables in the source. You can select all or specific tables.
- In the Conversion Rule for Table Name section, click Add Rule, select a rule type, and then configure a mapping rule of the selected type. By default, data in a source table is written to an Elasticsearch index that has the same name as the source table. You can specify a destination index name in a mapping rule to write data in multiple source tables to the same Elasticsearch index. You can also specify prefixes in a mapping rule to write data in source tables with a specific prefix to Elasticsearch indexes with the same names as the source tables but a different prefix. You can use regular expressions to convert the names of the destination indexes. You can also use built-in variables to add prefixes and suffixes to the names of destination indexes. For more information, see Configure the source and synchronization rules.
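The effect of table name mapping rules can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not the implementation that Data Integration uses: the rule format (a list of regular expression and replacement pairs), the `destination_index` function, and the table names are all assumptions made for the example.

```python
import re
from typing import List, Tuple

# Hypothetical rule format: (regular expression, replacement template).
Rule = Tuple[str, str]


def destination_index(table: str, rules: List[Rule]) -> str:
    """Return the destination index name for a source table.

    The first rule whose pattern matches the whole table name is applied.
    If no rule matches, the index name defaults to the source table name.
    """
    for pattern, replacement in rules:
        match = re.fullmatch(pattern, table)
        if match:
            # expand() resolves group references such as \1 in the template.
            return match.expand(replacement)
    return table


rules = [
    (r"order_\d+", "orders"),   # write several sharded tables to one index
    (r"src_(.+)", r"dst_\1"),   # swap the prefix and keep the rest of the name
]

mapped = {t: destination_index(t, rules)
          for t in ["order_01", "order_02", "src_users", "products"]}
```

In this sketch, `order_01` and `order_02` both map to `orders`, `src_users` maps to `dst_users`, and `products` keeps its name because no rule matches it.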
Step 4: Configure a destination index
Click Refresh source table and Elasticsearch Index mapping to create a destination index based on the rules that you configured in the Conversion Rule for Table Name section in Step 3. If no mapping rule is configured in Step 3, data in a source table is written to the destination index that has the same name as the source table. If no such destination index exists, the system automatically creates one. You can also change the method of creating the destination index. The following table describes the operations that you can perform.
Operation | Description |
---|---|
Select a primary key for a source table | |
Select the method of creating a destination index | You can set the Index creation method parameter to Create Index or Use Existing Index. |
Edit destination indexes | By default, the synchronization solution generates destination indexes based on the source tables. Therefore, field type conversion may occur. For example, if the data types of the fields in a destination index are different from the data types of the fields in a source table, the synchronization solution converts the fields in the source table to the data types that can be written to the destination index. You can click the name of a destination index in the ElasticSearchIndex Name column to modify the values of the parameters that are related to the index. Note: You can edit a destination index only if you select Create Index from the drop-down list in the Index creation method column. Note: If you do not modify the values of the parameters that are related to a destination Elasticsearch index after the index is created, the system synchronizes data based on the default values of the parameters. |
Step 5: Select a synchronization method
- In the Sync Rules step, select a synchronization method. The following table describes the supported synchronization methods.
Method | Description |
---|---|
Only One-time Full Sync | If you select this method, you need to perform the synchronization operations only once to synchronize all data in the source to Elasticsearch. |
Only One-time Incremental Sync | If you select this method, you need to perform the synchronization operations only once to synchronize incremental data in the source to Elasticsearch based on the specified filter conditions. |
Periodic Full Sync | If you select this method, you must specify a scheduling cycle for the batch synchronization solution. Then, the system synchronizes all data in the source to Elasticsearch each time the system runs the solution based on the specified scheduling cycle. |
Periodic Incremental Sync | If you select this method, the system synchronizes only incremental data in the source to Elasticsearch each time the system runs the solution based on the specified filter conditions and scheduling cycle. |
Incremental Sync after One-time Full Sync | If you select this method, the system first synchronizes all data to Elasticsearch. Then, the system synchronizes only incremental data in the source to Elasticsearch each time the system runs the solution based on the specified filter conditions and scheduling cycle. |
- Configure parameters for the synchronization method that you select. The parameters that you need to configure in the Full Sync, Incremental Sync, and Recurrence sections vary based on the synchronization method that you select. The following tables describe the parameters.
- Full Sync
The parameters in this section are required only if you set the Solution parameter to Only One-time Full Sync, Periodic Full Sync, or Incremental Sync after One-time Full Sync.
Parameter | Description |
---|---|
Clear Index Data Before Writing | Valid values: Yes: The original data in the destination Elasticsearch indexes is deleted before data in the source is written to the indexes. No: The original data in the destination Elasticsearch indexes is retained when data in the source is written to the indexes. Important: If you set this parameter to Yes, all the original data in the destination Elasticsearch indexes is deleted before data in the source is written to the indexes. Exercise caution when you configure this parameter. |
Write Policy | Valid values: Insert: The system inserts data into the destination Elasticsearch indexes during data synchronization. This is the default value. Update: If the primary key field of a source table exists in a destination Elasticsearch index, the system deletes the existing document from the index and then inserts the data into the index. Otherwise, the system directly inserts the data into the index. |
Batch Size | The number of data records that can be written to Elasticsearch at a time. Default value: 1000. To reduce network overheads, you can adjust this parameter based on your network conditions and the amount of data that you want to synchronize. |
- Incremental Sync
The parameters in this section are required only if you set the Solution parameter to Only One-time Incremental Sync, Periodic Incremental Sync, or Incremental Sync after One-time Full Sync.
Note: You can use scheduling parameters to specify the scope of the data that you want to synchronize and the location to which you want to write the data. For more information about how to use scheduling parameters, see Description for using scheduling parameters in data synchronization.
Parameter | Description |
---|---|
Write Policy | Valid values: Insert: The system inserts data into the destination Elasticsearch indexes during data synchronization. This is the default value. Update: If the primary key field of a source table exists in a destination Elasticsearch index, the system deletes the existing document from the index and then inserts the data into the index. Otherwise, the system directly inserts the data into the index. |
Batch Size | The number of data records that can be written to Elasticsearch at a time. Default value: 1000. To reduce network overheads, you can adjust this parameter based on your network conditions and the amount of data that you want to synchronize. |
Condition for Incremental Synchronization | You can use an SQL WHERE clause to extract incremental data from source tables. Enter only the condition in the Condition for Incremental Synchronization field, without the keyword WHERE. You can use built-in variables in the condition. For example, you can use the ${bdp.system.bizdate} variable to specify the data timestamp and the ${bdp.system.cyctime} variable to specify the scheduling time. |
- Recurrence
In the Recurrence section, you need to configure the parameters that are used to run the data synchronization solution, such as Recurrence, Scheduling Period, and Pause Scheduling. The scheduling settings for a data synchronization solution that is used for batch synchronization of all data in a database are the same as the scheduling settings for a data synchronization node. For more information, see Configure time properties.
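The following Python sketch illustrates how a value entered in the Condition for Incremental Synchronization field might be turned into a query for each scheduled run when Incremental Sync after One-time Full Sync is selected. It is an illustration only, not the platform's implementation. It assumes that `${bdp.system.bizdate}` resolves to the day before the scheduled time in `yyyymmdd` format and that `${bdp.system.cyctime}` resolves to the scheduled time in `yyyymmddhh24miss` format. The function names, the `orders` table, and the `update_date` column are hypothetical.

```python
from datetime import datetime, timedelta


def render_condition(condition: str, scheduled_time: datetime) -> str:
    """Replace the built-in variables with values for one scheduled run.

    Assumed formats: bizdate is the previous day as yyyymmdd,
    cyctime is the scheduled time as yyyymmddhh24miss.
    """
    bizdate = (scheduled_time - timedelta(days=1)).strftime("%Y%m%d")
    cyctime = scheduled_time.strftime("%Y%m%d%H%M%S")
    return (condition
            .replace("${bdp.system.bizdate}", bizdate)
            .replace("${bdp.system.cyctime}", cyctime))


def query_for_run(table: str, condition: str,
                  scheduled_time: datetime, first_run: bool) -> str:
    """Build the read query: full data on the first run, filtered afterwards."""
    if first_run:
        # One-time full synchronization: no filter is applied.
        return f"SELECT * FROM {table}"
    # The field holds only the condition, so WHERE is added here.
    return f"SELECT * FROM {table} WHERE {render_condition(condition, scheduled_time)}"


condition = "update_date = '${bdp.system.bizdate}'"   # hypothetical column
run_at = datetime(2024, 3, 15, 2, 0, 0)

first_query = query_for_run("orders", condition, run_at, first_run=True)
later_query = query_for_run("orders", condition, run_at, first_run=False)
```

With these assumptions, `first_query` reads the whole table, and `later_query` is `SELECT * FROM orders WHERE update_date = '20240314'`.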
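The Batch Size and Write Policy parameters described in this step can be pictured with a small in-memory model. This is a toy sketch under stated assumptions, not how Data Integration writes to Elasticsearch: the destination index is modeled as a Python dictionary keyed by the primary key, and the `batches` and `apply_update_policy` helpers are hypothetical names introduced for the example.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, object]


def batches(records: Iterable[Record], batch_size: int = 1000) -> Iterator[List[Record]]:
    """Yield successive lists that contain at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


def apply_update_policy(index: Dict[object, Record],
                        batch: List[Record], primary_key: str) -> None:
    """Toy model of the Update policy on an in-memory index."""
    for record in batch:
        key = record[primary_key]
        index.pop(key, None)   # delete the existing document, if any
        index[key] = record    # then insert the new document


rows = [{"id": i, "name": f"user-{i}"} for i in range(2500)]
batch_sizes = [len(b) for b in batches(rows, batch_size=1000)]

index: Dict[object, Record] = {1: {"id": 1, "name": "stale"}}
for batch in batches(rows, batch_size=1000):
    apply_update_policy(index, batch, primary_key="id")
```

Here 2,500 records are split into batches of 1,000, 1,000, and 500, and the stale document whose `id` is 1 is replaced instead of duplicated. A larger batch size means fewer write requests for the same amount of data, which is the network overhead trade-off that the Batch Size parameter controls.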
Step 6: Configure the resources required by the synchronization solution
Step 7: Run the synchronization solution
- Go to the Tasks page in Data Integration and find the created data synchronization solution.
- Click Submit and Run in the Actions column to run the data synchronization solution.
- Click Execution details in the Actions column to view the execution details of the data synchronization solution.