If your business data is stored in PolarDB for Xscale (PolarDB-X) and you want to perform full-text searches and semantic analytics on the data, you can use the Data Integration service of DataWorks to synchronize the data to Alibaba Cloud Elasticsearch at a latency ranging from minutes to longer intervals. This topic describes how to use the Data Integration service to synchronize data from PolarDB-X to Alibaba Cloud Elasticsearch in offline mode.
Background information
DataWorks is an end-to-end big data development and governance platform based on big data compute engines. DataWorks provides features such as data development, task scheduling, and data management. You can create synchronization tasks in DataWorks to rapidly synchronize data from various data sources to Alibaba Cloud Elasticsearch.
Prerequisites
Note
You can synchronize data only to Alibaba Cloud Elasticsearch. Self-managed Elasticsearch is not supported.
The PolarDB-X instance, Elasticsearch cluster, and DataWorks workspace must reside in the same region.
The PolarDB-X instance, Elasticsearch cluster, and DataWorks workspace must be in the same time zone. Otherwise, time-related data in the destination may be offset from the source after synchronization.
Procedure
Step 1: Prepare source data
Insert data into a table in the PolarDB-X instance.
For more information, see Basic SQL operations. The following figure shows the test data that is used in this example.
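Rows like the test data are inserted with standard INSERT statements. The following Python sketch only composes such a statement; the table name drdstest and the sample row are hypothetical stand-ins for the test data shown in the figure:

```python
def build_insert(table, row):
    """Compose a parameterized INSERT statement for one row.

    Returns the SQL text and the ordered values, ready to pass to any
    MySQL-compatible client's execute() method.
    """
    cols = ", ".join(row)
    placeholders = ", ".join(["%s"] * len(row))
    sql = f"INSERT INTO {table} ({cols}) VALUES ({placeholders})"
    return sql, list(row.values())

# Hypothetical sample row; the columns mirror the destination fields in Step 4.
sample = {
    "Name": "Wii Sports",
    "Platform": "Wii",
    "Genre": "Sports",
    "Publisher": "Nintendo",
}
sql, values = build_insert("drdstest", sample)
print(sql)
# INSERT INTO drdstest (Name, Platform, Genre, Publisher) VALUES (%s, %s, %s, %s)
```

Using a parameterized statement keeps the values separate from the SQL text, which any MySQL-compatible client accepted by PolarDB-X can execute.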
Step 2: Create an exclusive resource group for Data Integration
Create an exclusive resource group for Data Integration, and associate the resource group with a virtual private cloud (VPC) and the created workspace. Exclusive resource groups ensure fast and stable data transmission.
Log on to the DataWorks console.
In the top navigation bar, select a region. In the left-side navigation pane, click Resource Group.
On the Exclusive Resource Groups tab of the Resource Groups page, click Create Resource Group for Data Integration of Old Version.
On the DataWorks Exclusive Resources page, set Type to Exclusive Resource Groups for Data Integration, configure Resource Group Name, and then click Buy Now. On the page that appears, click Pay.
For more information, see Create an exclusive resource group for Data Integration.
On the Exclusive Resource Groups tab, find the created resource group and click Network Settings in the Actions column to associate the resource group with a VPC. For more information, see Associate the exclusive resource group for Data Integration with a VPC.
Note
In this example, an exclusive resource group for Data Integration is used to synchronize data over a VPC. For information about how to use an exclusive resource group for Data Integration to synchronize data over the Internet, see Configure an IP address whitelist.
The exclusive resource group must be able to connect to both the VPC where the PolarDB-X instance resides and the VPC where the Elasticsearch cluster resides before it can synchronize data between them. Therefore, you must associate the exclusive resource group with the VPC, zone, and vSwitch of the PolarDB-X instance and with those of the Elasticsearch cluster. For information about how to view the VPC where the Elasticsearch cluster resides, see View the basic information of a cluster.
Click the back icon in the upper-left corner of the page to return to the Resource Groups page.
On the Exclusive Resource Groups tab, find the resource group and click Associate Workspace in the Actions column to associate the resource group with the created workspace.
For more information, see Associate the exclusive resource group for Data Integration with a workspace.
Step 3: Add data sources
Add the PolarDB-X instance and Elasticsearch cluster to Data Integration as data sources.
Go to the Data Integration page.
Log on to the DataWorks console.
In the left-side navigation pane, click Workspace.
Find the workspace and choose the Data Integration entry in the Actions column.
In the left-side navigation pane of the Data Integration page, click Data source.
Add a PolarDB-X data source.
On the Data Sources page, click Add Data Source.
In the Add Data Source dialog box, search for and select DRDS.
In the Add DRDS Data Source dialog box, configure the parameters and test connectivity. After the connectivity test is passed, click Complete.
For more information, see Add a PolarDB-X data source.
Add an Elasticsearch data source in the same way. For more information, see Add an Elasticsearch data source.
Step 4: Configure and run a batch synchronization task
The exclusive resource group is used to run the batch synchronization task. The resource group obtains data from the source and writes the data to the Elasticsearch cluster.
Go to the DataStudio page of DataWorks.
Log on to the DataWorks console.
In the left-side navigation pane, click Workspace.
Find the workspace and choose the DataStudio entry in the Actions column.
Create a batch synchronization task.
In the left-side navigation pane, create a workflow.
Right-click the name of the newly created workflow and choose the option to create a batch synchronization node.
In the Create Node dialog box, configure the Name parameter and click Confirm.
Configure the network and resources.
For the source part, set Source to DRDS and Data Source Name to the name of the added PolarDB-X data source.
For the resource group part, select the created exclusive resource group.
For the destination part, set Destination to Elasticsearch and Data Source Name to the name of the added Elasticsearch data source.
Click Next.
Configure the task.
In the Source section, select the table whose data you want to synchronize.
In the Destination section, configure the parameters.
In the Field Mapping section, configure mappings between source fields and destination fields.
In this example, the default source fields are used. You only need to change the destination fields. Click the icon to the right of the destination fields. In the dialog box that appears, enter the following information:
{"name":"Name","type":"text"}
{"name":"Platform","type":"text"}
{"name":"Year_of_Release","type":"date"}
{"name":"Genre","type":"text"}
{"name":"Publisher","type":"text"}
{"name":"na_Sales","type":"float"}
{"name":"EU_Sales","type":"float"}
{"name":"JP_Sales","type":"float"}
{"name":"Other_Sales","type":"float"}
{"name":"Global_Sales","type":"float"}
{"name":"Critic_Score","type":"long"}
{"name":"Critic_Count","type":"long"}
{"name":"User_Score","type":"float"}
{"name":"User_Count","type":"long"}
{"name":"Developer","type":"text"}
{"name":"Rating","type":"text"}
In the Channel Control section, configure the parameters.
For more information, see Configure a batch synchronization task by using the codeless UI.
Run the task.
(Optional) Configure scheduling properties for the task. In the right-side navigation pane, click Properties. On the Properties tab, configure the parameters based on your business requirements. For more information about the parameters, see Scheduling configuration.
In the upper-left corner, click the Save icon to save the task.
In the upper-left corner, click the Submit icon to submit the task.
If you configure scheduling properties for the task, the task is automatically run on a regular basis. You can also click the Run icon in the upper-left corner to run the task immediately.
If Shell run successfully! is displayed in the operational logs, the task has run successfully.
Step 5: View the synchronized data
Log on to the Kibana console of the Elasticsearch cluster.
In the left-side navigation pane, click Dev Tools.
On the Console tab of the page that appears, run the following command to query the volume of data in the Elasticsearch cluster.
Note
You can compare the queried data volume with the volume of data in the source to check whether all data is synchronized.
GET drdstest/_search
{
  "query": {
    "match_all": {}
  }
}
If the command is successfully run, the result shown in the following figure is returned.
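The point of the match_all query is to compare the total hit count with the source row count. The comparison can be sketched in Python as follows (the sample response is a hypothetical, truncated snippet; note that hits.total is an integer in Elasticsearch 6.x and an object with a value key in 7.x and later):

```python
def total_hits(response):
    """Extract the total hit count from an Elasticsearch search response."""
    total = response["hits"]["total"]
    # 6.x returns an integer; 7.x and later return {"value": ..., "relation": ...}.
    return total if isinstance(total, int) else total["value"]

# Hypothetical truncated response from GET drdstest/_search.
sample_response = {"hits": {"total": 16, "hits": []}}
source_row_count = 16  # assumed row count of the PolarDB-X table
assert total_hits(sample_response) == source_row_count
```

If the two counts differ, check the task's operational logs for skipped or dirty records before rerunning the task.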
Run the following command to search for data by using a specific field:
GET drdstest/_search
{
  "query": {
    "term": {
      "Publisher.keyword": {
        "value": "Nintendo"
      }
    }
  }
}
If the command is successfully run, the result shown in the following figure is returned.
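The term query targets Publisher.keyword rather than Publisher because the field is mapped as text and therefore analyzed; the .keyword sub-field stores the unanalyzed value, which is what an exact match needs. A small Python helper (hypothetical, for illustration) builds the same request body:

```python
def term_query(field, value):
    """Build an exact-match term query against the field's keyword sub-field."""
    return {"query": {"term": {f"{field}.keyword": {"value": value}}}}

body = term_query("Publisher", "Nintendo")
# body matches the request body of the GET drdstest/_search example above.
assert body == {"query": {"term": {"Publisher.keyword": {"value": "Nintendo"}}}}
```

The same helper works for any text field in the index, for example term_query("Genre", "Sports").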