This topic describes how to use the data synchronization feature of DataWorks to migrate data from an Alibaba Cloud Elasticsearch cluster to MaxCompute.
Prerequisites
MaxCompute is activated.
For more information, see Activate MaxCompute.
DataWorks is activated.
For more information, see Activate DataWorks.
A workflow is created in your workspace in the DataWorks console.
In this example, your DataWorks workspace runs in basic mode. For more information about how to create a workflow, see Create a workflow.
An Alibaba Cloud Elasticsearch cluster is created.
Before you migrate data, you must make sure that your Alibaba Cloud Elasticsearch cluster works as expected. For more information about how to create an Alibaba Cloud Elasticsearch cluster, see Getting started.
In this example, the Alibaba Cloud Elasticsearch cluster uses the following configuration:
Region: China (Shanghai)
Zone: Zone B
Version: Elasticsearch 5.5.3 with Commercial Feature
Background information
Elasticsearch is a Lucene-based search server. It provides a distributed multi-tenant search engine that supports full-text search. Elasticsearch is an open source product that is released under the Apache License. It is a mainstream search engine for enterprises.
Alibaba Cloud Elasticsearch includes Elasticsearch 5.5.3 with Commercial Feature, Elasticsearch 6.3.2 with Commercial Feature, and Elasticsearch 6.7.0 with Commercial Feature. It also contains the commercial X-Pack plug-in. You can use Alibaba Cloud Elasticsearch in scenarios such as data analysis and search. Based on open source Elasticsearch, Alibaba Cloud Elasticsearch provides enterprise-class access control, security monitoring and alerting, and automatic reporting.
Procedure
Create a source table in Elasticsearch. For more information, see Use DataWorks to synchronize data from a MaxCompute project to an Alibaba Cloud Elasticsearch cluster.
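If you need to prepare sample data yourself, you can write documents to the source index with any Elasticsearch client. The following is a minimal sketch that uses the Python elasticsearch client; the endpoint, username, password, and sample field values are placeholders, and the index name and document type match the sync script that is used later in this topic.

# A minimal sketch, assuming the elasticsearch Python client. The endpoint,
# username, password, and sample field values are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch(
    ["http://es-cn-xxxx.xxxx.xxxx.xxxx.com:9200"],
    http_auth=("xxxx", "xxxx"),
)

# Create the index that the sync script reads from.
es.indices.create(index="es_index", ignore=400)

# Index one sample document whose fields match the reader column list.
doc = {
    "age": "44", "job": "blue-collar", "marital": "married",
    "education": "basic.4y", "default": "unknown", "housing": "yes",
    "loan": "no", "contact": "cellular", "month": "aug",
    "day_of_week": "thu", "duration": "210", "campaign": "1",
    "pdays": "999", "previous": "0", "poutcome": "nonexistent",
    "emp_var_rate": "1.4", "cons_price_idx": "93.444",
    "cons_conf_idx": "-36.1", "euribor3m": "4.963",
    "nr_employed": "5228.1", "y": "no",
}
es.index(index="es_index", doc_type="elasticsearch", body=doc)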
Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the region where the target workspace resides. Find the target workspace and click Go to DataStudio in the Actions column.
Create a destination table in MaxCompute.
Right-click the workflow and choose MaxCompute > Create Table.
In the Create Table dialog box, configure Name and click Create.
Note: If multiple MaxCompute compute engine instances are associated with the current workspace, you must select one from the Engine Instance drop-down list.
On the table editing page, click DDL Statement.
In the DDL dialog box, enter the following CREATE TABLE statement and click Generate Table Schema.
create table elastic2mc_bankdata (
    age             string,
    job             string,
    marital         string,
    education       string,
    default         string,
    housing         string,
    loan            string,
    contact         string,
    month           string,
    day_of_week     string,
    duration        string,
    campaign        string,
    pdays           string,
    previous        string,
    poutcome        string,
    emp_var_rate    string,
    cons_price_idx  string,
    cons_conf_idx   string,
    euribor3m       string,
    nr_employed     string,
    y               string
);
Click Submit to Production Environment.
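If you want to confirm the table structure outside the DataWorks console, the following is a minimal sketch that uses the PyODPS package; the AccessKey pair, project name, and endpoint are placeholders.

# A minimal sketch, assuming the PyODPS package. The AccessKey pair, project
# name, and endpoint are placeholders.
from odps import ODPS

o = ODPS(
    "<access_key_id>",
    "<access_key_secret>",
    "<your_project>",
    endpoint="<your_maxcompute_endpoint>",
)

# Print the schema of the destination table to confirm that the column list
# matches the writer configuration in the sync script.
table = o.get_table("elastic2mc_bankdata")
print(table.schema)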
Synchronize data.
Go to the Data Analytics page. Right-click the workflow and create a batch synchronization node.
In the Create Node dialog box, enter a name in the Name field and click Confirm.
In the top navigation bar, click the Switch to Code Editor icon to switch to the script mode.
In the script mode, click the Import Template icon.
In the Import Template dialog box, select the source connection type, source data source, target connection type, and target data source, and then click Confirm.
Configure the script.
The following code is used in this example. For more information about the code description, see Elasticsearch Reader.
{
  "type": "job",
  "steps": [
    {
      "stepType": "elasticsearch",
      "parameter": {
        "retryCount": 3,
        "column": [
          "age", "job", "marital", "education", "default", "housing",
          "loan", "contact", "month", "day_of_week", "duration", "campaign",
          "pdays", "previous", "poutcome", "emp_var_rate", "cons_price_idx",
          "cons_conf_idx", "euribor3m", "nr_employed", "y"
        ],
        "scroll": "1m",
        "index": "es_index",
        "pageSize": 1,
        "sort": {
          "age": "asc"
        },
        "type": "elasticsearch",
        "connTimeOut": 1000,
        "retrySleepTime": 1000,
        "endpoint": "http://es-cn-xxxx.xxxx.xxxx.xxxx.com:9200",
        "password": "xxxx",
        "search": {
          "match_all": {}
        },
        "readTimeOut": 5000,
        "username": "xxxx"
      },
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "odps",
      "parameter": {
        "partition": "",
        "truncate": true,
        "compress": false,
        "datasource": "odps_first",
        "column": [
          "age", "job", "marital", "education", "default", "housing",
          "loan", "contact", "month", "day_of_week", "duration", "campaign",
          "pdays", "previous", "poutcome", "emp_var_rate", "cons_price_idx",
          "cons_conf_idx", "euribor3m", "nr_employed", "y"
        ],
        "emptyAsNull": false,
        "table": "elastic2mc_bankdata"
      },
      "name": "Writer",
      "category": "writer"
    }
  ],
  "version": "2.0",
  "order": {
    "hops": [
      {
        "from": "Reader",
        "to": "Writer"
      }
    ]
  },
  "setting": {
    "errorLimit": {
      "record": "0"
    },
    "speed": {
      "throttle": false,
      "concurrent": 1,
      "dmu": 1
    }
  }
}
Note: On the Basic Information page of the created Alibaba Cloud Elasticsearch cluster, you can view the public endpoint and port number in the Public Network Access and Public Network Port fields.
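Before you run the sync node, you can check that the public endpoint is reachable and that the credentials are valid. The following is a minimal sketch that uses the requests package; the endpoint, username, and password are the same placeholders that appear in the sync script.

# A minimal connectivity check, assuming the requests package. The endpoint,
# username, and password are placeholders from the sync script.
import requests

resp = requests.get(
    "http://es-cn-xxxx.xxxx.xxxx.xxxx.com:9200",
    auth=("xxxx", "xxxx"),
    timeout=5,
)
# A 200 response with cluster information indicates that the cluster is reachable.
print(resp.status_code)
print(resp.json())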
Click the Run icon to run the code.
View the execution result on the Runtime Logs tab.
View the result.
Right-click the workflow and create an ODPS SQL node.
In the Create Node dialog box, enter a node name and click Submit.
On the configuration tab of the ODPS SQL node, enter the following statement:
SELECT * FROM elastic2mc_bankdata;
Click the Run icon to run the code.
View the query results on the Runtime Logs tab.
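You can also verify the result outside the DataWorks console. The following is a minimal sketch that uses the PyODPS package to run a verification query; the AccessKey pair, project name, and endpoint are placeholders.

# A minimal sketch, assuming the PyODPS package. The AccessKey pair, project
# name, and endpoint are placeholders.
from odps import ODPS

o = ODPS(
    "<access_key_id>",
    "<access_key_secret>",
    "<your_project>",
    endpoint="<your_maxcompute_endpoint>",
)

# Count the rows in the destination table. The count should match the number
# of documents in the es_index index.
with o.execute_sql("SELECT COUNT(*) FROM elastic2mc_bankdata;").open_reader() as reader:
    for record in reader:
        print(record)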