Elasticsearch: Use the reindex API to migrate data between Alibaba Cloud Elasticsearch clusters

Last Updated: Nov 14, 2024

If you want to migrate data between Alibaba Cloud Elasticsearch clusters, you can use the reindex API to reindex data. This topic describes how to use the reindex API to migrate data between Alibaba Cloud Elasticsearch clusters that are deployed in the original network architecture.

Scenarios

This section describes the use scenarios of the reindex API in data migration across Alibaba Cloud Elasticsearch clusters. You can select a solution based on your business data and the network architecture in which your Elasticsearch clusters are deployed.

  • You want to migrate data between Elasticsearch clusters.

  • The shards for an index in an Elasticsearch cluster are inappropriately distributed. For example, the volume of data stored in an index is excessively large, but the number of shards for the index is excessively small. In this case, you can use the reindex API to reindex data.

  • You want to modify mappings for an index that stores a large volume of data. In this case, you can use the reindex API to copy the data in the index.

The network architecture of Alibaba Cloud Elasticsearch was adjusted in October 2020. In the new network architecture, the cross-cluster reindex operation is limited. You need to use the PrivateLink service to establish private connections between VPCs before you perform the operation. The following table provides data migration solutions in different scenarios.

Note

Alibaba Cloud Elasticsearch clusters created before October 2020 are deployed in the original network architecture. Alibaba Cloud Elasticsearch clusters created in October 2020 or later are deployed in the new network architecture.

| Scenario | Network architecture | Solution |
|---|---|---|
| Migrate data between Alibaba Cloud Elasticsearch clusters | Both clusters are deployed in the original network architecture. | reindex API. For more information, see Use the reindex API to migrate data between Alibaba Cloud Elasticsearch clusters. |
| Migrate data between Alibaba Cloud Elasticsearch clusters | One of the clusters is deployed in the original network architecture. Note: The other cluster can be deployed in the original or new network architecture. | |
| Migrate data from a self-managed Elasticsearch cluster that runs on ECS instances to an Alibaba Cloud Elasticsearch cluster | The Alibaba Cloud Elasticsearch cluster is deployed in the original network architecture. | reindex API. For more information, see Use the reindex API to migrate data from a self-managed Elasticsearch cluster to an Alibaba Cloud Elasticsearch cluster. |
| Migrate data from a self-managed Elasticsearch cluster that runs on ECS instances to an Alibaba Cloud Elasticsearch cluster | The Alibaba Cloud Elasticsearch cluster is deployed in the new network architecture. | reindex API. For more information, see Migrate data from a self-managed Elasticsearch cluster to an Alibaba Cloud Elasticsearch cluster deployed in the new network architecture. |

Prerequisites

  • Two Alibaba Cloud Elasticsearch clusters are created. Both clusters are deployed in the original network architecture and reside in the same virtual private cloud (VPC) and the same vSwitch. In this example, an Elasticsearch V6.7.0 cluster is used as the local cluster, and an Elasticsearch V6.3.2 cluster is used as the remote cluster.

  • Test data is prepared.

    • Local cluster

      Create a destination index in the local cluster.

      PUT dest
      {
        "settings": {
          "number_of_shards": 5,
          "number_of_replicas": 1
        }
      }
    • Remote cluster

      Prepare the data that you want to migrate in the remote cluster. In this example, the data in the "Getting started" topic is used. For more information, see Getting started.

      Important

      If you want to use an Elasticsearch cluster of V7.0 or later as a remote cluster, you must set the index type to _doc.

Procedure

  1. Log on to the Alibaba Cloud Elasticsearch console.
  2. In the left-side navigation pane, click Elasticsearch Clusters.
  3. Navigate to the desired cluster.
    1. In the top navigation bar, select the resource group to which the cluster belongs and the region where the cluster resides.
    2. On the Elasticsearch Clusters page, find the cluster and click its ID.
  4. Configure a reindex whitelist for the local cluster.

    1. In the left-side navigation pane of the page that appears, choose Configuration and Management > Cluster Configuration.

    2. On the page that appears, click Modify Configuration on the right side of YML File Configuration.

    3. In the YML File Configuration panel, enter the endpoint and port number of the remote cluster in the Other Configurations field.

      The configurations of the reindex whitelist vary based on the number of zones of the remote cluster.

      • If the remote cluster is a single-zone cluster, configure the reindex whitelist in the <Endpoint of the cluster>:9200 format. Sample code:

        reindex.remote.whitelist: ["es-cn-09k1rgid9000g****.elasticsearch.aliyuncs.com:9200"]
      • If the remote cluster is a multi-zone cluster, the reindex whitelist must contain the IP addresses of all data nodes in the cluster and the port number of the cluster. Sample code:

        reindex.remote.whitelist: ["10.0.xx.xx:9200","10.0.xx.xx:9200","10.0.xx.xx:9200","10.15.xx.xx:9200","10.15.xx.xx:9200","10.15.xx.xx:9200"]
        Note

        You can obtain the IP addresses of all data nodes in a cluster from the Node Visualization tab on the Basic Information page of the cluster. For more information, see View the cluster status and node information.

      For information about the configuration of a reindex whitelist, see Configure a remote reindex whitelist.

    4. Select "This operation will restart the cluster. Continue?" and click OK.

      Then, the system restarts the Elasticsearch cluster. You can view the restart progress in the Tasks dialog box. After the cluster is restarted, the configuration is complete.

  5. In the local cluster, call the reindex API to reindex data.

    1. Log on to the Kibana console of the local cluster and go to the homepage of the Kibana console as prompted.

      For more information about how to log on to the Kibana console, see Log on to the Kibana console.
      Note

      In this example, an Elasticsearch V6.7.0 cluster is used. Operations on clusters of other versions may differ. If they do, follow what the console of your cluster shows.
    2. In the left-side navigation pane of the page that appears, click Dev Tools.

    3. On the Console tab, run the following command to call the reindex API to reindex data:

      POST _reindex
      {
        "source": {
          "remote": {
            "host": "http://es-cn-09k1rgid9000g****.elasticsearch.aliyuncs.com:9200",
            "username": "elastic",
            "password": "your_password"
          },
          "index": "product_info",
          "query": {
            "match": {
              "productName": "Wealth management"
            }
          }
        },
        "dest": {
          "index": "dest"
        }
      }

      | Category | Parameter | Description |
      |---|---|---|
      | source | host | The URL that is used to connect to the remote cluster. The URL must contain the protocol, endpoint, and port number. Example: https://otherhost:9200. The value of the host parameter varies based on the number of zones of the remote cluster. Single-zone cluster: configure this parameter in the http://<Endpoint of the remote cluster>:9200 format. You can obtain the endpoint on the Basic Information page of the remote cluster. For more information, see View the basic information of a cluster. Multi-zone cluster: configure this parameter in the http://<IP address of a data node in the remote cluster>:9200 format. |
      | source | username | The username that is used to connect to the remote cluster. This parameter is required only if basic authentication needs to be performed on requests that are sent to the remote cluster. The default username that is used to connect to Alibaba Cloud Elasticsearch clusters is elastic. Important: For security purposes, we recommend that you use HTTPS to send requests if basic authentication needs to be performed. Otherwise, the password is transmitted in plaintext. For Alibaba Cloud Elasticsearch clusters, you can use HTTPS in host only after you enable HTTPS for the clusters. For information about how to enable HTTPS for an Elasticsearch cluster, see Enable HTTPS. |
      | source | password | The password that is used to connect to the remote cluster. The password is specified when you create the cluster. If you forget the password, you can reset it. For information about the procedure and precautions for resetting a password, see Reset the access password for an Elasticsearch cluster. |
      | source | index | The source index in the remote cluster. |
      | source | query | The query that selects the data that you want to migrate. For more information, see Reindex API. |
      | dest | index | The destination index in the local cluster. |

      Note

      When you reindex data from a remote cluster, manual slicing and automatic slicing are not supported for the data. For more information, see Manual slicing and Automatic slicing.

      If the command is successfully run, the following result is returned:

      {
        "took" : 51,
        "timed_out" : false,
        "total" : 2,
        "updated" : 2,
        "created" : 0,
        "deleted" : 0,
        "batches" : 1,
        "version_conflicts" : 0,
        "noops" : 0,
        "retries" : {
          "bulk" : 0,
          "search" : 0
        },
        "throttled_millis" : 0,
        "requests_per_second" : -1.0,
        "throttled_until_millis" : 0,
        "failures" : [ ]
      }
  6. Run the following command to view the migrated data:

    GET dest/_search

    If the command is successfully run, the following result is returned:

    • Single-zone cluster: [Figure: data that was successfully migrated]

    • Multi-zone cluster: [Figure: data that was successfully migrated]
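
The response returned by the reindex request in step 5 can also be checked programmatically before you query the destination index. The following Python sketch is only an illustration: the function name summarize_reindex is hypothetical, and it assumes that the response body was saved as JSON text with the fields shown in step 5.

```python
import json


def summarize_reindex(response_text):
    """Summarize a _reindex response body that is given as JSON text."""
    body = json.loads(response_text)
    # Documents that the request created, updated, deleted, or skipped.
    processed = (body["created"] + body["updated"]
                 + body["deleted"] + body["noops"])
    return {
        "processed": processed,
        "total": body["total"],
        # Treat the run as clean only if nothing timed out, failed, or conflicted.
        "ok": (not body["timed_out"]
               and not body["failures"]
               and body["version_conflicts"] == 0),
    }


# Sample response copied from step 5, abridged to the fields that are used.
sample = '''{"took": 51, "timed_out": false, "total": 2, "updated": 2,
 "created": 0, "deleted": 0, "batches": 1, "version_conflicts": 0,
 "noops": 0, "failures": []}'''

if __name__ == "__main__":
    print(summarize_reindex(sample))
```

A result whose ok value is False suggests that you should inspect the failures and version_conflicts fields of the response before you rely on the migrated data.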

Summary

The configurations that are required to migrate data from a single-zone cluster are similar to the configurations that are required to migrate data from a multi-zone cluster. The following table describes the differences.

| Cluster type | Configuration of the reindex whitelist | Configuration of the host parameter |
|---|---|---|
| Single-zone cluster | <Endpoint of the cluster>:9200 | http://<Endpoint of the cluster>:9200 |
| Multi-zone cluster | <IP address of a data node>:9200 for every data node in the cluster | http://<IP address of a data node in the cluster>:9200 |
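
The difference between the two cluster types can be sketched as a small helper that derives both values from the same inputs. This is an illustrative Python sketch, not part of any Alibaba Cloud SDK: the function name remote_reindex_settings and its parameters are hypothetical. It uses the http scheme, as in the procedure. Switch to https only if HTTPS is enabled for the remote cluster.

```python
def remote_reindex_settings(cluster_type, endpoint=None, node_ips=None, port=9200):
    """Return the whitelist line and the host value for a remote cluster.

    cluster_type: "single-zone" or "multi-zone".
    endpoint: endpoint of the remote cluster (single-zone only).
    node_ips: IP addresses of all data nodes (multi-zone only).
    """
    if cluster_type == "single-zone":
        if not endpoint:
            raise ValueError("endpoint is required for a single-zone cluster")
        entries = [f"{endpoint}:{port}"]
    elif cluster_type == "multi-zone":
        if not node_ips:
            raise ValueError("node_ips is required for a multi-zone cluster")
        # The whitelist must list every data node of a multi-zone cluster.
        entries = [f"{ip}:{port}" for ip in node_ips]
    else:
        raise ValueError(f"unknown cluster type: {cluster_type}")

    quoted = ",".join(f'"{entry}"' for entry in entries)
    return {
        # Line to enter in the Other Configurations field of the YML file.
        "whitelist": f"reindex.remote.whitelist: [{quoted}]",
        # Any whitelisted address can serve as the host; the first is used here.
        "host": f"http://{entries[0]}",
    }


if __name__ == "__main__":
    # Placeholder IP addresses for illustration only.
    settings = remote_reindex_settings("multi-zone", node_ips=["10.0.0.1", "10.0.0.2"])
    print(settings["whitelist"])
    print(settings["host"])
```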

Additional information

When you use the reindex API to reindex data, you can specify a batch size and timeout periods.

  • Batch size

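    Reindexing from a remote cluster uses an on-heap buffer whose maximum size defaults to 100 MB. The size parameter specifies the number of documents in each batch, and its default value is 1000. If an index in the remote cluster contains large documents, you must change the batch size to a smaller value so that each batch fits in the buffer.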

    In the following example, size is set to 10.

    POST _reindex
    {
      "source": {
        "remote": {
          "host": "http://otherhost:9200"
        },
        "index": "source",
        "size": 10,
        "query": {
          "match": {
            "test": "data"
          }
        }
      },
      "dest": {
        "index": "dest"
      }
    }
  • Timeout periods

    You can use socket_timeout to specify a timeout period for socket reads and connect_timeout to specify a timeout period for connections. The default value of both parameters is 30s.

    In the following example, socket_timeout is set to 1m, and connect_timeout is set to 10s.

    POST _reindex
    {
      "source": {
        "remote": {
          "host": "http://otherhost:9200",
          "socket_timeout": "1m",
          "connect_timeout": "10s"
        },
        "index": "source",
        "query": {
          "match": {
            "test": "data"
          }
        }
      },
      "dest": {
        "index": "dest"
      }
    }
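
The two options above can be combined in a single request. The following Python sketch builds such a request body and prints it in a form that can be pasted into the Kibana console. It is an illustration under assumptions: the function name build_reindex_body is hypothetical, and the host, index names, and query are the placeholder values from the preceding examples.

```python
import json


def build_reindex_body(host, source_index, dest_index, size=None,
                       socket_timeout=None, connect_timeout=None, query=None):
    """Build the body of a remote _reindex request as a dictionary."""
    remote = {"host": host}
    # Timeouts belong to the remote connection, so they go under "remote".
    if socket_timeout is not None:
        remote["socket_timeout"] = socket_timeout
    if connect_timeout is not None:
        remote["connect_timeout"] = connect_timeout

    source = {"remote": remote, "index": source_index}
    # The batch size and the query are properties of the source itself.
    if size is not None:
        source["size"] = size
    if query is not None:
        source["query"] = query

    return {"source": source, "dest": {"index": dest_index}}


if __name__ == "__main__":
    body = build_reindex_body(
        "http://otherhost:9200", "source", "dest",
        size=10, socket_timeout="1m", connect_timeout="10s",
        query={"match": {"test": "data"}},
    )
    print("POST _reindex")
    print(json.dumps(body, indent=2))
```

Options that are left unset are omitted from the body, so the defaults of the cluster apply.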