Alibaba Cloud Elasticsearch empowers you to perform full-text searches, multi-dimensional queries, and statistical analysis on data sourced from a MySQL database. This guide details how to leverage DataWorks' Data Integration service to efficiently synchronize data from an ApsaraDB RDS for MySQL instance to an Alibaba Cloud Elasticsearch cluster.
This guide focuses on batch (offline) data replication. To synchronize data in real time, see Real-time synchronization from a full MySQL database to Elasticsearch.
Prerequisites
Before you begin, ensure the following conditions are met and be aware of these critical points:
Usage notes
Alibaba Cloud Elasticsearch target only: You can only synchronize data to Alibaba Cloud Elasticsearch clusters. Self-managed Elasticsearch clusters are not supported.
Regional consistency: The ApsaraDB RDS for MySQL instance, Elasticsearch cluster, and DataWorks workspace must all be located in the same Alibaba Cloud region.
Time zone consistency: All three services (MySQL, Elasticsearch, DataWorks) should ideally be in the same time zone to prevent time-related data discrepancies during synchronization.
Billing
Elasticsearch cluster fees: See Billing rules.
Data Integration resource group fees: See Resource group fees.
Procedure
Step 1: Prepare source data in MySQL
Create your database and table in your ApsaraDB RDS for MySQL instance.
Create a database. See General workflow to use ApsaraDB RDS for MySQL.
Create a table and insert data:
-- Create table
CREATE TABLE `es_test` (
  `id` bigint(32) NOT NULL,
  `name` varchar(32) NULL,
  `age` bigint(32) NULL,
  `hobby` varchar(32) NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARACTER SET=utf8;

-- Insert data
INSERT INTO `es_test` (`id`,`name`,`age`,`hobby`) VALUES (1,'user1',22,'music');
INSERT INTO `es_test` (`id`,`name`,`age`,`hobby`) VALUES (2,'user2',23,'sport');
INSERT INTO `es_test` (`id`,`name`,`age`,`hobby`) VALUES (3,'user3',43,'game');
INSERT INTO `es_test` (`id`,`name`,`age`,`hobby`) VALUES (4,'user4',24,'run');
INSERT INTO `es_test` (`id`,`name`,`age`,`hobby`) VALUES (5,'user5',42,'basketball');
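If you load more test rows than the five above, writing the INSERT statements by hand becomes tedious. The following is a minimal Python sketch, not part of this procedure, that generates the same statements programmatically; the escaping is deliberately simplistic and assumes trusted string values (use parameterized queries against a real database).

```python
# Sketch: generate the INSERT statements for the es_test table.
# The rows mirror the sample data above.

ROWS = [
    (1, "user1", 22, "music"),
    (2, "user2", 23, "sport"),
    (3, "user3", 43, "game"),
    (4, "user4", 24, "run"),
    (5, "user5", 42, "basketball"),
]

def insert_statements(table, rows):
    """Build one INSERT statement per (id, name, age, hobby) tuple."""
    stmts = []
    for id_, name, age, hobby in rows:
        stmts.append(
            f"INSERT INTO `{table}` (`id`,`name`,`age`,`hobby`) "
            f"VALUES ({id_},'{name}',{age},'{hobby}');"
        )
    return stmts

for stmt in insert_statements("es_test", ROWS):
    print(stmt)
```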
Step 2: Purchase and configure a Data Integration resource group
Data synchronization tasks run within a Data Integration resource group. You must purchase one and attach the necessary network and workspace.
Log on to the DataWorks console.
In the top menu bar, select your region. In the left navigation pane, click Resource Groups.
Click Create Resource Group, configure the required parameters, and click Buy Now.
Attach VPC for network connectivity:
In the Actions column of the resource group you just created, click Network Settings.
In the Data Scheduling & Data Integration section, click Add VPC Association and attach the VPC to which your RDS instance and Elasticsearch cluster belong. This ensures Data Integration can connect to both.
Important: After you attach a VPC, you must add the vSwitch CIDR block of the VPC to the internal-facing whitelists of the ApsaraDB RDS for MySQL instance and the Elasticsearch cluster. For more information, see Configure IP address whitelist in the ApsaraDB RDS for MySQL documentation and Manage IP address whitelists in the Alibaba Cloud Elasticsearch documentation.
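Before testing connectivity, you can sanity-check locally that an address from the resource group actually falls within the vSwitch CIDR block you added to the whitelists. This stdlib-only Python sketch uses a placeholder CIDR and addresses, not values from your environment:

```python
import ipaddress

def ip_in_cidr(ip: str, cidr: str) -> bool:
    """Return True if the IP address belongs to the CIDR block."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr, strict=False)

# Hypothetical vSwitch CIDR block and addresses to check.
vswitch_cidr = "192.168.0.0/24"
print(ip_in_cidr("192.168.0.15", vswitch_cidr))  # True: covered by the whitelist entry
print(ip_in_cidr("10.0.0.5", vswitch_cidr))      # False: would be rejected
```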
Attach DataWorks workspace:
Return to the Resource Groups page.
In the Actions column of the resource group, click Attach Workspace and select your target DataWorks workspace.
Step 3: Add data sources in DataWorks
Add the ApsaraDB RDS for MySQL and Elasticsearch data sources to the Data Integration service within DataWorks.
Go to the Data Integration page in DataWorks.
Log on to the DataWorks console.
In the left navigation pane, click Workspace.
In the Actions column of your target workspace, choose .
In the left navigation pane, click Data Source.
Add a MySQL data source.
On the Data Sources page, click Add Data Source.
On the Add Data Source page, search for and select MySQL.
In the Add MySQL Data Source dialog box, configure the data source parameters in the Basic Information section.
For more information, see Configure a MySQL data source.
In the Connection Configuration section, click Test Connectivity. If the connectivity status is Connected, the connection is successful.
Click Complete.
Add an Elasticsearch data source. See Add an Elasticsearch data source for detailed instructions.
Step 4: Configure and run a batch data synchronization task
Create and run a batch (offline) synchronization task within DataWorks. This example uses the codeless UI.
For code editor configuration, see Configure a sync node in the code editor, MySQL Reader, and Elasticsearch Writer.
This topic uses the legacy Data Studio page as an example to demonstrate how to create an offline sync task.
Access Data Studio.
Log on to the DataWorks console.
In the left navigation pane, click Workspace.
In the Actions column of your target workspace, choose .
Create a sync task.
On the Data Studio tab, choose , and follow the on-screen instructions to configure the workflow.
Add a node to the workflow.
During node creation, set the node type to Data Integration, select MySQL as the source and Elasticsearch as the destination, set the synchronization method to batch, and enter a name.
Configure network and resources.
In the Data Source section, set Data Source to MySQL and select the data source you added previously.
In the My Resource Group section, select the Data Integration resource group you purchased.
In the Data Destination section, set Data Destination to Elasticsearch and select the data source that you added previously.
Click Next.
Configure task details.
In the Data Source section, select the specific MySQL table (e.g., es_test) to replicate.
In the Data Destination section, configure Elasticsearch-specific parameters (e.g., target index, primary key).
In the Field Mapping section, set the mapping relationship between the source and destination fields.
Important: Ensure that field mappings are explicitly validated and correctly aligned by field name, not by position or auto-inferred order. Misalignment (e.g., mapping a source string field like name to a target numeric field like id of type long) will cause Elasticsearch to reject the document, resulting in "dirty data" errors.
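The name-based alignment described above can be checked mechanically. This toy Python validator is not part of DataWorks; the type-compatibility table is illustrative and covers only the types used in the es_test example:

```python
# Toy validator: pair source and destination fields by name and flag
# type combinations that Elasticsearch would likely reject as dirty data.
COMPATIBLE = {
    "bigint": {"long", "integer"},
    "varchar": {"text", "keyword"},
}

def check_mapping(source_fields, dest_fields):
    """source_fields/dest_fields: {name: type}. Returns a list of problems."""
    problems = []
    for name, src_type in source_fields.items():
        if name not in dest_fields:
            problems.append(f"{name}: missing in destination")
        elif dest_fields[name] not in COMPATIBLE.get(src_type, set()):
            problems.append(
                f"{name}: {src_type} -> {dest_fields[name]} is incompatible"
            )
    return problems

src = {"id": "bigint", "name": "varchar", "age": "bigint", "hobby": "varchar"}
dst = {"id": "long", "name": "text", "age": "long", "hobby": "keyword"}
print(check_mapping(src, dst))  # [] -- aligned by name, all types compatible
```

Mapping the string field name onto a long destination, as in the warning above, would produce a non-empty problem list instead.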
For more information, see Configure a task in the codeless UI.
Run the task.
(Optional) Configure scheduling properties for the task. In the right-side bar, click Scheduling and configure parameters. For more information, see Scheduling.
Save the task configuration.
Run the task.
If scheduling was configured, the task runs automatically.
To run immediately, click the Run icon.
If the log contains Shell run successfully!, the task ran successfully.
Step 5: Verify data synchronization result
In the Kibana console, choose Dev Tools.
In the Console, run the following command to view the synchronized data.
POST /es_test/_search?pretty
{
  "query": { "match_all": {} }
}

Replace es_test with the index name that you specified in the data synchronization task. If the data is synchronized successfully, the documents that you inserted into the source table are returned.
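If you prefer to verify the result outside Kibana, you can call the _search endpoint over HTTP and inspect the hit count in the response. The Python sketch below shows only the response-parsing step against an abbreviated sample body; the HTTP call itself (endpoint, credentials) depends on your cluster and is omitted:

```python
import json

def hit_count(search_response: dict) -> int:
    """Extract the total hit count from an Elasticsearch _search response."""
    total = search_response["hits"]["total"]
    # Elasticsearch 7+ returns {"value": N, "relation": "eq"};
    # older versions return a plain integer.
    return total["value"] if isinstance(total, dict) else total

# Abbreviated sample of a successful response for the five test rows.
sample = json.loads("""
{
  "hits": {
    "total": {"value": 5, "relation": "eq"},
    "hits": [{"_source": {"id": 1, "name": "user1", "age": 22, "hobby": "music"}}]
  }
}
""")
print(hit_count(sample))  # 5
```

A count of 5 matches the five rows inserted into es_test in Step 1.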
