You can build a real-time data warehouse by using the real-time write capability of Elasticsearch.
Prerequisites
A reader or conversion node is configured. For more information about the data sources that support real-time synchronization, see Data source types that support real-time synchronization.
Limits
DataWorks allows you to add Alibaba Cloud Elasticsearch V5.X, V6.X, and V7.X clusters as data sources. Self-managed Elasticsearch clusters are not supported.
Procedure
Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
In the Scheduled Workflow pane of the DataStudio page, move the pointer over the icon and choose .
Alternatively, find the desired workflow in the Scheduled Workflow pane, right-click the workflow name, and then choose .
In the Create Node dialog box, set the Sync Method parameter to End-to-end ETL, enter a name in the Name field, and configure the Path parameter.
Important: The node name cannot exceed 128 characters in length and can contain only letters, digits, underscores (_), and periods (.).
Click Confirm.
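The node name rule above can be checked before you open the dialog box. The following is a minimal sketch, not part of DataWorks: the function name is hypothetical, and it assumes that "letters" means ASCII letters, which may be stricter than what the console accepts.

```python
import re

# Assumption: "letters" is read as ASCII letters only; the console may accept more.
_NODE_NAME = re.compile(r"[A-Za-z0-9_.]{1,128}")


def is_valid_node_name(name: str) -> bool:
    """Return True if the name is 1 to 128 characters long and uses only
    letters, digits, underscores (_), and periods (.)."""
    return _NODE_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("rt_sync.es_v1", "bad-name", "x" * 129):
        print(candidate[:20], is_valid_node_name(candidate))
```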
On the configuration tab of the real-time synchronization node, drag Elasticsearch in the Output section to the canvas on the right, and connect the Elasticsearch node to the configured reader or conversion node in the canvas.
Click the Elasticsearch node. In the panel that appears, configure the parameters.
Parameter
Description
Data source
The name of the Elasticsearch data source that you added to DataWorks. You can select only an Elasticsearch data source.
If no Elasticsearch data source is available, click New data source on the right to go to the Data Sources page in Management Center to add an Elasticsearch data source. For more information, see Add an Elasticsearch data source.
Index
The name of the index to which you want to write data.
You can click Create Index on the right to create an index. You can create the index with the default index information, or first modify the index name, index type, dynamic mapping status, number of primary shards, number of replica shards, and index creation statement.
Index Type: This parameter is available only for Elasticsearch V6.X or earlier.
Dynamic Mapping Status: This parameter is used to specify the value of the dynamic parameter. The dynamic parameter determines whether Elasticsearch Writer dynamically writes new fields to the mappings of the index.
If you use an Elasticsearch cluster whose version is earlier than V7.10, this parameter has the following valid values: true, false, and strict.
If you use an Elasticsearch cluster whose version is V7.10 or later, this parameter has the following valid values: true, false, strict, and runtime.
where:
true: indicates that Elasticsearch Writer writes new fields to the mappings of the index and the fields can be searched.
false: indicates that Elasticsearch Writer does not write new fields to the mappings of the index. The fields are kept in the _source of the document but cannot be searched.
strict: indicates that if Elasticsearch Writer detects new fields, it returns an error message and does not write the fields to the mappings of the index.
runtime: indicates that Elasticsearch Writer writes new fields to the mappings of the index as runtime fields. The fields are not indexed, and their values are loaded from _source at query time.
For more information, see the dynamic parameter for open source Elasticsearch.
Shards: the number of primary shards. An index can be divided into multiple primary shards. The primary shards can be distributed among different nodes to support distributed searches. When you create an index, you must specify the number of primary shards for the index. After the index is created, you cannot change the number. For more information, see Terms.
Replicas: the number of replica shards for each primary shard. Replica shards provide fault tolerance and share the read request workload of the cluster. Set Replicas to 1 if the capacity of the cluster is insufficient, if a single backup is sufficient for each primary shard, or if the cluster encounters bottlenecks in write performance.
Statement for Creating Index: The fields of the index are defined in properties in the statement. You can modify the types of the fields.
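The index settings described above (dynamic mapping status, shards, replicas, and properties) can be pictured as one index creation body. The following is a minimal sketch under stated assumptions: the function names and the sample fields are hypothetical, and the statement that the Create Index dialog box generates may differ. The body uses the standard Elasticsearch settings and mappings layout. For versions earlier than V7.X, the mappings are nested under the index type.

```python
import json


def allowed_dynamic_values(version):
    """Valid values of the dynamic parameter for a (major, minor) version."""
    values = {"true", "false", "strict"}
    if version >= (7, 10):
        values.add("runtime")  # runtime is available in V7.10 or later only
    return values


def build_index_body(version, dynamic, shards, replicas, properties, index_type=None):
    """Assemble an index creation body (hypothetical helper, for illustration)."""
    if dynamic not in allowed_dynamic_values(version):
        raise ValueError(f"dynamic={dynamic!r} is not valid for version {version}")
    mappings = {"dynamic": dynamic, "properties": properties}
    if index_type is not None:
        # V6.X or earlier: mappings are keyed by the index type.
        mappings = {index_type: mappings}
    return {
        "settings": {"number_of_shards": shards, "number_of_replicas": replicas},
        "mappings": mappings,
    }


if __name__ == "__main__":
    # Sample fields are made up for the example.
    fields = {"order_id": {"type": "keyword"}, "amount": {"type": "double"}}
    print(json.dumps(build_index_body((7, 10), "strict", 3, 1, fields), indent=2))
```

Because the number of primary shards cannot be changed after the index is created, it is worth reviewing the shards value before you confirm the statement.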
Enable Partitioning for Elasticsearch Indexes
Specifies whether to enable the routing mechanism. You can customize the value of the routing parameter. The default value of routing is the ID of a document. A hash function converts the value of routing to a number. The number is divided by the number of primary shards, and the remainder identifies the primary shard in which the document is stored.
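The routing calculation described above amounts to a hash followed by a modulo operation. The following is a simplified sketch: the function name is hypothetical, and zlib.crc32 is only a stand-in for the hash function that Elasticsearch uses internally, so the shard numbers it returns do not match those of a real cluster.

```python
import zlib


def pick_primary_shard(routing: str, num_primary_shards: int) -> int:
    """Map a routing value to a primary shard number in [0, num_primary_shards)."""
    # Stand-in hash for illustration only; Elasticsearch applies its own hash function.
    hashed = zlib.crc32(routing.encode("utf-8"))
    # The remainder of the division selects the primary shard.
    return hashed % num_primary_shards


if __name__ == "__main__":
    for doc_id in ("order-1001", "order-1002", "order-1003"):
        print(doc_id, "->", pick_primary_shard(doc_id, 5))
```

The same routing value always maps to the same shard, which is why the number of primary shards cannot change after the index is created.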
Set Primary Key (By_Id)
The method used to assign values to the IDs of Elasticsearch documents during data synchronization. Valid values:
Primary Key: uses one of the columns in the source table as the primary key.
Composite Primary Key: combines multiple columns in the source table to form the primary key.
Note: If you set this parameter to Primary Key but the source does not have a primary key, or if you set this parameter to Composite Primary Key but the source does not have a primary key column, this parameter does not take effect. In this case, random values are automatically generated and assigned as the document IDs. This may result in data duplication.
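The following sketch illustrates why the fallback to random IDs can duplicate data. It is a hypothetical model, not the DataWorks implementation: the function name, the underscore separator for composite keys, and the use of a UUID as the random value are all assumptions.

```python
import uuid


def document_id(row: dict, key_columns: list, separator: str = "_") -> str:
    """Derive a document ID from one column (Primary Key) or several
    columns (Composite Primary Key) of a source row."""
    values = [row.get(column) for column in key_columns]
    if not key_columns or any(value is None for value in values):
        # No usable key: fall back to a random ID (assumed to behave like a UUID).
        return uuid.uuid4().hex
    return separator.join(str(value) for value in values)


if __name__ == "__main__":
    row = {"id": 7, "region": "cn", "amount": 12.5}
    print(document_id(row, ["id"]))            # stable ID from a single column
    print(document_id(row, ["id", "region"]))  # stable ID from two columns
    print(document_id(row, []))                # random ID, differs on every call
```

With a stable ID, writing the same source row again overwrites the existing document. With a random ID, each write creates a new document, which is how duplicates arise.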
Mappings
The field mappings between the source and destination. The synchronization node synchronizes data based on the field mappings.
In the top toolbar of the configuration tab of the real-time synchronization node, click the icon to save the node.