This topic describes how to use a catalog to migrate offline data from an Elasticsearch cluster to an ApsaraDB for SelectDB instance.
Prerequisites
The IP addresses of all nodes in your Elasticsearch cluster are added to the IP address whitelist of the SelectDB instance. For more information, see Configure an IP address whitelist.
You have basic knowledge of catalogs and know the operations that you can perform on catalogs. For more information, see Data lakehouse.
Sample environment
In this example, data is migrated from the product_info index of an Elasticsearch cluster to the test_Es2SelectDB table in a SelectDB database named es_db. You can modify the corresponding parameters based on your business requirements. Sample environment:
Destination database: es_db
Destination table: test_Es2SelectDB
Data source: the product_info index of the Elasticsearch cluster
Prepare the sample source data
For more information, see Getting started.
Procedure
Connect to the SelectDB instance. For more information, see Connect to an instance.
Create a catalog for the Elasticsearch cluster.
Note: After a catalog is created for the Elasticsearch cluster, the system automatically generates a database named default_db in the catalog. All indexes in the Elasticsearch cluster are mapped to tables in the database.
CREATE CATALOG es_catalog PROPERTIES (
    "type" = "es",
    "hosts" = "http://127.0.0.1:9200",
    "user" = "test_user",
    "password" = "test_passwd",
    "nodes_discovery" = "false"
);
Parameters
Parameter
Required
Default value
Description
es_catalog
Yes
No default value
The name of the catalog. Change the name based on your business requirements.
type
Yes
No default value
The catalog type. Set the value to es.
hosts
Yes
No default value
The URL that is used to access the Elasticsearch cluster. You can set this parameter to the URL of the Server Load Balancer (SLB) instance of the Elasticsearch cluster.
Format: http://<IP address>:<Port number>.
user
No
No default value
The account that is used to access the Elasticsearch cluster.
password
No
No default value
The password that is used to access the Elasticsearch cluster.
doc_value_scan
No
true
Specifies whether to use the column-oriented storage (doc values) of Elasticsearch or Apache Lucene to obtain field values. Valid values:
true
false
Note: When this feature is enabled, ApsaraDB for SelectDB automatically checks whether column-oriented storage is enabled for the fields to be queried and obtains the values of all fields for which it is enabled from doc values. This significantly improves the query performance if you query only a small number of columns.
keyword_sniff
No
true
Specifies whether to detect fields of the TEXT type in Elasticsearch and perform queries by using the corresponding KEYWORD fields. Valid values:
true: The system matches the entire value of a field of the TEXT type before the value is tokenized into terms.
false: The system matches against individual terms that are generated after the value of a field of the TEXT type is tokenized.
nodes_discovery
No
true
Specifies whether to enable the node discovery feature of Elasticsearch. Valid values:
true: allows ApsaraDB for SelectDB to discover all available data nodes in the Elasticsearch cluster and connect to the data nodes.
false: does not allow ApsaraDB for SelectDB to access all available data nodes in the Elasticsearch cluster. ApsaraDB for SelectDB can access the data nodes only by using the URL of the Elasticsearch cluster.
Important: An Alibaba Cloud Elasticsearch cluster can be accessed only by using a URL. If your data source is an Alibaba Cloud Elasticsearch cluster, you must set this parameter to false.
mapping_es_id
No
false
Specifies whether to map the _id field, which serves as the primary key of the Elasticsearch index. If the primary key of the index is automatically assigned by Elasticsearch, you must set this parameter to true. Valid values:
true: maps the _id field. In this case, the value of the field can be queried.
false: does not map the _id field. In this case, the value of the field cannot be queried.
Note: If you do not specify the _id field as the primary key, Elasticsearch assigns a globally unique _id value to each document.
like_push_down
No
true
Specifies whether to convert LIKE queries to wildcard queries and push them down to the Elasticsearch cluster. Valid values:
true: converts LIKE queries to wildcard queries.
Note: Wildcard queries may increase the CPU consumption of the Elasticsearch cluster and compromise the query performance in specific scenarios. Set this parameter to true if you prioritize a high recall rate over high accuracy.
false: does not convert LIKE queries to wildcard queries.
include_hidden_index
No
false
Specifies whether to include hidden indexes in queries. Valid values:
false
true
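For reference, the following statement is a sketch of a catalog that sets several of the optional parameters described above. The catalog name es_catalog_full is hypothetical, and the host, account, and password are the placeholder values from this topic's sample environment; adjust them to your business requirements.
CREATE CATALOG es_catalog_full PROPERTIES (
    "type" = "es",
    "hosts" = "http://127.0.0.1:9200",
    "user" = "test_user",
    "password" = "test_passwd",
    "doc_value_scan" = "true",     -- read field values from doc values where available
    "keyword_sniff" = "true",      -- query TEXT fields by their KEYWORD fields
    "nodes_discovery" = "false",   -- required for Alibaba Cloud Elasticsearch clusters
    "mapping_es_id" = "true",      -- expose the _id field so that it can be queried
    "like_push_down" = "true"      -- push LIKE queries down as wildcard queries
);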
Query the catalog.
SHOW CATALOGS; -- Check whether the catalog is created.
The following output is returned:
+--------------+-------------+----------+-----------+-------------------------+---------------------+------------------------+
| CatalogId    | CatalogName | Type     | IsCurrent | CreateTime              | LastUpdateTime      | Comment                |
+--------------+-------------+----------+-----------+-------------------------+---------------------+------------------------+
| 436009309195 | es_catalog  | es       |           | 2024-08-06 17:09:08.058 | 2024-07-19 18:04:37 |                        |
| 0            | internal    | internal | yes       | UNRECORDED              | NULL                | Doris internal catalog |
+--------------+-------------+----------+-----------+-------------------------+---------------------+------------------------+
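If you want to view the properties that the catalog was created with, recent Apache Doris kernels also support the following statement; whether it is available depends on the kernel version of your instance.
SHOW CREATE CATALOG es_catalog; -- view the statement and properties used to create the catalog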
Optional. Switch to the es_catalog external catalog.
You can query and access the data in the es_catalog external catalog in the same way as you query and access data in the internal catalog. For more information about queries, see the "Query data from an Elasticsearch data source" section of the Elasticsearch data source topic.
Note: ApsaraDB for SelectDB can only read data from the external catalog.
SWITCH es_catalog;
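After the switch, you can browse and query the mapped tables in the same way as internal tables. The following statements are a minimal sketch based on this topic's sample environment:
SHOW DATABASES; -- the automatically generated default_db appears in the list
SHOW TABLES FROM default_db; -- each Elasticsearch index is mapped to a table
SELECT * FROM default_db.product_info LIMIT 10; -- preview the source data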
Optional. Switch to the internal catalog.
If you do not perform Step 4, skip this step.
SWITCH internal;
Optional. Create the destination database.
If you have created the destination database, skip this step.
CREATE DATABASE es_db;
Switch to the destination database.
USE es_db;
Create the destination table.
If you have already created the destination table, check whether the data types of its columns map to those of the source table in the Elasticsearch cluster. If you have not created the destination table, define column data types that map to those of the source table. For more information about data type mappings, see the "Column type mappings" section of the Elasticsearch data source topic.
CREATE TABLE test_Es2SelectDB (
    `annual_rate` varchar(200),
    `describe` text,
    `productName` varchar(200)
)
DISTRIBUTED BY HASH(productName) BUCKETS 4
PROPERTIES("replication_num" = "1");
Migrate data.
INSERT INTO test_Es2SelectDB SELECT * FROM es_catalog.default_db.product_info;
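If the columns of the source and destination tables do not match one-to-one, for example when mapping_es_id is set to true and the mapped table exposes an additional _id column, you can list the columns explicitly instead of using SELECT *. A sketch based on this topic's sample schema:
INSERT INTO test_Es2SelectDB (annual_rate, `describe`, productName)
SELECT annual_rate, `describe`, productName
FROM es_catalog.default_db.product_info;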
Query the imported data in the destination table.
SELECT * FROM test_Es2SelectDB;
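You can also compare the row counts of the source and destination tables to confirm that the migration is complete:
SELECT COUNT(*) FROM test_Es2SelectDB; -- destination row count
SELECT COUNT(*) FROM es_catalog.default_db.product_info; -- source row count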
Migrate incremental data.
In actual production environments, Elasticsearch data generally consists of offline data and incremental data. You can use the following solutions to migrate incremental data:
To migrate logs when you want to retain only recent logs for analysis, write the log data to both the Elasticsearch cluster and the SelectDB instance, and let the log data accumulate in the SelectDB instance until the amount is sufficient for analysis.
To query incremental data, use the Unique key model provided by SelectDB and write the incremental data to both the Elasticsearch cluster and the SelectDB instance. You can migrate historical data in the same way as offline data. If historical data and incremental data overlap during migration, the Unique key model automatically removes duplicates based on the primary key to ensure data uniqueness, as shown in the sketch after this list.
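The following statement is a minimal sketch of such a Unique key table, based on this topic's sample schema. The table name is hypothetical, and the choice of productName as the key column is an assumption for illustration:
CREATE TABLE test_Es2SelectDB_unique (
    `productName` varchar(200),
    `annual_rate` varchar(200),
    `describe` text
)
UNIQUE KEY(`productName`)
DISTRIBUTED BY HASH(productName) BUCKETS 4
PROPERTIES("replication_num" = "1");
-- Rows that share the same productName value are merged on write, so duplicates
-- between historical and incremental data are removed automatically.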
Compatibility with ELK
SelectDB provides customized and enhanced Filebeat and Logstash plug-ins to efficiently collect and process log data. You can also collect log data by using the data import methods of SelectDB. For more information, see the following topics: