All Products
Search
Document Center

ApsaraDB for SelectDB:Migrate data from an Elasticsearch cluster

Last Updated:Sep 23, 2024

This topic describes how to use a catalog to migrate offline data from an Elasticsearch cluster to an ApsaraDB for SelectDB instance.

Prerequisites

  • The IP addresses of all nodes in your Elasticsearch cluster are added to the IP address whitelist of an SelectDB instance. For more information, see Configure an IP address whitelist.

  • You have basic knowledge of catalogs and know the operations that you can perform on catalogs. For more information, see Data lakehouse.

Sample environment

In this example, data is migrated from the product_info index of the Elasticsearch cluster to the test_es2SelectDB table in an SelectDB database named es_db. You can modify the corresponding parameters based on your business requirements. Sample environment:

  • Destination database: es_db

  • Destination table: test_es2SelectDB

  • Data source: the product_info index of the Elasticsearch cluster

Prepare the sample source data

For more information, see Getting started.

Procedure

  1. Connect to the SelectDB instance. For more information, see Connect to an instance.

  2. Create a catalog for the Elasticsearch cluster.

    Note

    After a catalog is created for the Elasticsearch cluster, the system automatically generates a database named default_db in the catalog. All files in the Elasticsearch cluster are mapped to tables in the database.

    CREATE CATALOG es_catalog PROPERTIES (
        "type"="es",
        "hosts"="http://127.0.0.1:9200",
        "user"="test_user",
        "password"="test_passwd",
        "nodes_discovery"="false"
    );

    Parameters

    Parameter

    Required

    Default value

    Description

    es_catalog

    Yes

    No default value

    The name of the catalog. Change the name based on your business requirements.

    type

    Yes

    No default value

    The catalog type. Set the value to es.

    hosts

    Yes

    No default value

    The URL that is used to access the Elasticsearch cluster. You can set this parameter to the URL of the Server Load Balancer (SLB) instance of the Elasticsearch cluster.

    Format: http://<IP address>:<Port number>.

    user

    No

    No default value

    The account that is used to access the Elasticsearch cluster.

    password

    No

    No default value

    The password that is used to access the Elasticsearch cluster.

    doc_value_scan

    No

    true

    Specifies whether to enable the column-oriented storage feature of Elasticsearch or Apache Lucene for querying field values. Valid values:

    Note

    ApsaraDB for SelectDB can use this feature to automatically check whether column-oriented storage is enabled for the fields to be queried and query the values of all fields for which column-oriented storage is enabled. This feature significantly improves the query performance if you attempt to query only a small number of columns.

    • true

    • false

    keyword_sniff

    No

    true

    Specifies whether to detect fields of the TEXT type in Elasticsearch and perform queries by using the corresponding KEYWORD fields. Valid values:

    • true: The system matches the entire value of a field of the TEXT type before the value is tokenized into terms.

    • false: The system matches against individual terms that are generated after the value of a field of the TEXT type is tokenized.

    nodes_discovery

    No

    true

    Specifies whether to enable the node discovery feature of Elasticsearch. Valid values:

    • true: allows ApsaraDB for SelectDB to discover all available data nodes in the Elasticsearch cluster and connect to the data nodes.

    • false: does not allow ApsaraDB for SelectDB to access all available data nodes in the Elasticsearch cluster. ApsaraDB for SelectDB can access the data nodes only by using the URL of the Elasticsearch cluster.

    Important

    An Alibaba Cloud Elasticsearch cluster can be accessed only by using a URL. If your data source is an Alibaba Cloud Elasticsearch cluster, you must set this parameter to false.

    mapping_es_id

    No

    false

    Specifies whether to map the _id field that is specified as the primary key in the Elasticsearch index. Valid values: If the primary key of the index is automatically specified by Elasticsearch, you must set this parameter to true.

    Note

    If you do not specify the _id field as the primary key, Elasticsearch assigns a globally unique primary key _id to each file.

    • true: maps the field. In this case, the value of the field can be queried.

    • false: does not map the field. In this case, the value of the field cannot be queried.

    like_push_down

    No

    true

    Specifies whether to convert LIKE queries to wildcard queries and push them down to the Elasticsearch cluster. Valid values:

    • true: converts LIKE queries to wildcard queries.

      Note
      • This may increase the CPU consumption of the Elasticsearch cluster and compromise the query performance in specific scenarios.

      • You can set this parameter to true if you prioritize a high recall rate over high accuracy.

    • false: does not convert LIKE queries to wildcard queries.

    include_hidden_index

    No

    false

    Specifies whether to include the hidden indexes in query conditions. Valid values:

    • false

    • true

  3. Query the catalog.

    SHOW CATALOGS; --Check whether the catalog is created.

    The following output is returned:

    +--------------+--------------+----------+-----------+-------------------------+---------------------+------------------------+
    | CatalogId    | CatalogName  | Type     | IsCurrent | CreateTime              | LastUpdateTime      | Comment                |
    +--------------+--------------+----------+-----------+-------------------------+---------------------+------------------------+
    | 436009309195 | es_catalog | es      |           | 2024-08-06 17:09:08.058 | 2024-07-19 18:04:37 |                        |
    |            0 | internal     | internal | yes       | UNRECORDED              | NULL                | Doris internal catalog |
    +--------------+--------------+----------+-----------+-------------------------+---------------------+------------------------+
  4. Optional. Switch to the es_catalog external catalog.

    You can query and access the data in the es_catalog external catalog in the same way as you query and access data in the internal catalog. For more information about queries, see the "Query data from an Elasticsearch data source" section of the Elasticsearch data source topic.

    Note

    ApsaraDB for SelectDB can only read data in the external catalog.

    SWITCH es_catalog;
  5. Optional. Switch to the internal catalog.

    If you do not perform Step 4, skip this step.

    SWITCH internal;
  6. Optional. Create the destination database.

    If you have created the destination database, skip this step.

    CREATE database es_db;
  7. Switch to the destination database.

    USE es_db;
  8. Create the destination table.

    If you have created the destination table, check whether the data types of the columns in the destination table are mapped to those in the source table of the Elasticsearch cluster.Column type mappings

    If you have not created the destination table, make sure that the data types of the columns in the destination table are mapped to those in the source table of the Elasticsearch cluster. For more information about data type mappings, see the "Column type mappings" section of the Elasticsearch data source topic.

    CREATE TABLE test_Es2SelectDB
    (
        `annual_rate` varchar(200),
        `describe` text,
        `productName` varchar(200)
    )
    DISTRIBUTED BY HASH(productName) BUCKETS 4
    PROPERTIES("replication_num" = "1");
  9. Migrate data.

    INSERT INTO test_Es2SelectDB SELECT *  FROM es_catalog.default_db.product_info;
  10. Query the imported data in the destination table.

    SELECT *  FROM test_Es2SelectDB;

Migrate incremental data.

In actual production environments, Elasticsearch data is mainly divided into offline data and incremental data. To migrate incremental data, you can refer to the following migration solutions:

  • To migrate logs, if you want to retain only recent logs for data analysis, you can write the log data to the Elasticsearch cluster and SelectDB instance to accumulate log data until the amount of log data is sufficient for analysis.

  • To query incremental data, you can use the Unique key model provided by SelectDB to write the incremental data to the Elasticsearch cluster and SelectDB instance. You can migrate historical data in the same way as you migrate offline data. If historical data and incremental data are duplicated during migration, the Unique key model automatically removes duplicates based on the primary key to ensure data uniqueness.

Compatibility with ELK

SelectDB provides customized and enhanced Filebeat and Logstash plug-ins to efficiently collect and process log data. In addition, you can collect log data by importing data. For more information, see the following topics: