Use an Elasticsearch data source in ApsaraDB for SelectDB - ApsaraDB for SelectDB

This topic describes how to connect an Elasticsearch data source to ApsaraDB for SelectDB and how to use the Elasticsearch data source in ApsaraDB for SelectDB. This helps you perform federated analytics on data in the Elasticsearch data source.

Overview

Elasticsearch catalogs in ApsaraDB for SelectDB support automatic mapping of metadata from Elasticsearch data sources. Based on the distributed query planning capability of ApsaraDB for SelectDB and the full-text search capability of Elasticsearch, you can use Elasticsearch catalogs to provide a comprehensive online analytical processing (OLAP) solution for the following queries:

Multi-index distributed join queries in Elasticsearch
Multi-table join queries across ApsaraDB for SelectDB and Elasticsearch for full-text search and filtering

ApsaraDB for SelectDB supports Elasticsearch 5.X and later.

Create an Elasticsearch catalog

CREATE CATALOG test_es PROPERTIES (
    "type"="es",
    "hosts"="http://127.0.0.1:9200",
    "user"="test_user",
    "password"="test_passwd",
    "nodes_discovery"="false"
);

Elasticsearch does not have the concept of a database. Therefore, after you connect an Elasticsearch data source to ApsaraDB for SelectDB, ApsaraDB for SelectDB automatically creates a unique database named default_db. After you execute the SWITCH statement to switch to the Elasticsearch catalog, ApsaraDB for SelectDB automatically switches to the default_db database without requiring you to execute the USE default_db statement.

The following table describes the parameters.

Parameter	Required	Default value	Description
hosts	Yes	N/A	The URL that is used to access the Elasticsearch data source. You can specify one or more URLs or set this parameter to the URL of the Server Load Balancer (SLB) instance of the Elasticsearch data source.
user	No	N/A	The account that is used to access the Elasticsearch data source.
password	No	N/A	The password of the account that is used to access the Elasticsearch data source.
doc_value_scan	No	true	Specifies whether to enable the column-oriented storage feature of Elasticsearch or Apache Lucene for querying field values.
keyword_sniff	No	true	Specifies whether to detect fields of the TEXT type in Elasticsearch and perform queries by using the corresponding KEYWORD fields. If you set this parameter to false, the system matches data after the TEXT field is split into terms.
nodes_discovery	No	true	Specifies whether to enable the node discovery feature of Elasticsearch. Default value: true. Note Alibaba Cloud Elasticsearch uses an SLB instance as the entry for handling incoming requests. This prevents direct access to nodes of an Elasticsearch cluster. Therefore, you must set this parameter to false.
ssl	No	false	Specifies whether to enable HTTPS access for the Elasticsearch data source. ApsaraDB for SelectDB is configured to trust all HTTPS requests from the frontend (FE) and backends (BEs), regardless of SSL certificate validity.
mapping_es_id	No	false	Specifies whether to map the `_id` field in the Elasticsearch index.
like_push_down	No	true	Specifies whether to convert LIKE conditions to wildcards and push the LIKE conditions down to the Elasticsearch data source. This increases the CPU consumption of the Elasticsearch data source.
include_hidden_index	No	false	Specifies whether to include the hidden indexes. Default value: false.

Note

Only HTTP basic authentication is supported. Make sure that the account that is used to access the Elasticsearch data source has the permissions to access paths such as /_cluster/state/ and _nodes/http and read indexes. If you do not enable HTTPS access for an Elasticsearch cluster, you do not need to specify the account or password.
By default, if an index in Elasticsearch 5.x or 6.x contains multiple types, ApsaraDB for SelectDB reads data from the first type.

Query data from an Elasticsearch data source

After you create an Elasticsearch catalog in ApsaraDB for SelectDB, you can query external Elasticsearch tables in ApsaraDB for SelectDB in the same way as you query internal tables in ApsaraDB for SelectDB. However, you cannot use the data models such as rollup, pre-aggregation, and materialized views in ApsaraDB for SelectDB.

Basic query

SELECT * FROM es_table WHERE k1 > 1000 AND k3 ='term' OR k4 LIKE 'fu*z_';

Extended esquery

You can use the esquery(field, QueryDSL) function to push down queries that cannot be expressed in SQL, such as match_phrase and geoshape, to Elasticsearch for filtering. In the esquery(field, QueryDSL) function, the field parameter is used to associate an index, and the QueryDSL parameter specifies the JSON expression of the basic query domain-specific language (DSL) in Elasticsearch. The QueryDSL parameter is enclosed in braces ({}). The JSON expression must have only one root key, such as match_phrase, geo_shape, or bool. Examples:

match_phrase query:

SELECT * FROM es_table WHERE esquery(k4, '{ "match_phrase": { "k4": "selectdb on es" } }');

geo_shape query:

SELECT * FROM es_table WHERE esquery(k4, '{ "geo_shape": { "location": { "shape": { "type": "envelope", "coordinates": [ [ 13, 53 ], [ 14, 52 ] ] }, "relation": "within" } } }');

bool query:

SELECT * FROM es_table WHERE esquery(k4, ' { "bool": { "must": [ { "terms": { "k1": [ 11, 12 ] } }, { "terms": { "k2": [ 100 ] } } ] } }');

Column type mappings

Data type in Elasticsearch	Data type in ApsaraDB for SelectDB	Remarks
NULL	NULL	N/A
BOOLEAN	BOOLEAN	N/A
BYTE	TINYINT	N/A
SHORT	SMALLINT	N/A
INTEGER	INT	N/A
LONG	BIGINT	N/A
UNSIGNED_LONG	LARGEINT	N/A
FLOAT	FLOAT	N/A
HALF_FLOAT	FLOAT	N/A
DOUBLE	DOUBLE	N/A
SCALED_FLOAT	DOUBLE	N/A
DATE	DATE	Only the default, yyyy-MM-dd HH:mm:ss, yyyy-MM-dd, and epoch_millis formats are supported.
KEYWORD	STRING	N/A
TEXT	STRING	N/A
IP	STRING	N/A
NESTED	STRING	N/A
OBJECT	STRING	N/A
OTHER	UNSUPPORTED	N/A

ARRAY type

Elasticsearch does not support explicit ARRAY types, but a field in Elasticsearch can contain zero or more values. To denote a field as an ARRAY type, you can add a specific structural annotation to the _meta field of the index mapping for ApsaraDB for SelectDB. For more information about the _meta field in Elasticsearch 6.x and earlier, see _meta field.

For example, the doc index uses the following data structure:

{
  "array_int_field": [1, 2, 3, 4],
  "array_string_field": ["selectdb", "is", "the", "best"],
  "id_field": "id-xxx-xxx",
  "timestamp_field": "2022-11-12T12:08:56Z",
  "array_object_field": [
    {
      "name": "xxx",
      "age": 18
    }
  ]
}

To define the ARRAY fields in the preceding data structure, you can run the following commands to add their property configurations under the _meta.selectdb property of the index mapping:

# ES 7.x and later
curl -X PUT "localhost:9200/doc/_mapping?pretty" -H 'Content-Type:application/json' -d '
{
    "_meta": {
        "selectdb":{
            "array_fields":[
                "array_int_field",
                "array_string_field",
                "array_object_field"
            ]
        }
    }
}'

# ES 6.x and earlier
curl -X PUT "localhost:9200/doc/_mapping?pretty" -H 'Content-Type: application/json' -d '
{
    "_doc": {
        "_meta": {
            "selectdb":{
                "array_fields":[
                    "array_int_field",
                    "array_string_field",
                    "array_object_field"
                ]
            }
    }
    }
}

The array_fields parameter specifies fields of the ARRAY type.

Best practices

Filter condition pushdown

Elasticsearch catalogs support filter condition pushdown to Elasticsearch. Only the data that meets the filter conditions is returned. This improves query performance and reduces usage of CPU, memory, and I/O resources in both ApsaraDB for SelectDB and Elasticsearch.

Operators are converted to the following Elasticsearch queries.

SQL syntax	Syntax in Elasticsearch 5.x or later
=	term query
in	terms query
>, <, >=, and ⇐	range query
and	bool.filter
or	bool.should
not	bool.must_not
not in	bool.must_not + terms query
is_not_null	exists query
is_null	bool.must_not + exists query
esquery	Native JSON-formatted query DSL in Elasticsearch

Enable the columnar scan feature to accelerate queries

To accelerate queries, you can enable the columnar scan feature by setting the enable_docvalue_scan parameter to true.

After you enable this feature, ApsaraDB for SelectDB applies the following two principles when you query data from Elasticsearch:

Best effort: ApsaraDB for SelectDB automatically checks whether column-oriented storage is enabled for the fields to be queried. If the doc_value parameter is set to true, column-oriented storage is enabled for all the fields. In this case, ApsaraDB for SelectDB queries all field values from column-oriented storage.
Auto-downgrade: If column-oriented storage is disabled for one of the fields to be queried, ApsaraDB for SelectDB parses and queries all field values from the _source field for which row-oriented storage is enabled.

By default, ApsaraDB for SelectDB on Elasticsearch queries all columns from the _source field, which uses row-oriented storage and stores data in the JSON format. Compared with column-oriented storage, the row-oriented approach is less efficient in reading multiple data records at a time, especially when only a few columns need to be queried. If only a few columns are queried, the performance of querying data from the fields for which column-oriented storage is enabled is more than ten times greater than from the _source field.

Important

Column-oriented storage cannot be enabled for TEXT fields in Elasticsearch. Therefore, if TEXT fields are part of the query, ApsaraDB for SelectDB reverts to querying data from the _source field.
If the number of fields to be queried is greater than or equal to 25, the performance of querying data from the fields for which column-oriented storage is enabled is basically equivalent to that from the _source field.

Detect KEYWORD fields

You can enable detection for fields of the KEYWORD type by setting the enable_keyword_sniff parameter to true.

Elasticsearch allows you to directly import data without the need to create an index. After data is imported, Elasticsearch automatically creates an index. For STRING fields, Elasticsearch uses its multi-field feature to create both TEXT and KEYWORD fields.

The following sample code provides an example on how to map a field to both TEXT and KEYWORD types:

"k4": {
   "type": "text",
   "fields": {
      "keyword": {   
         "type": "keyword",
         "ignore_above": 256
      }
   }
}

For example, if you perform a query that contains filter conditions that are specified by the = operator on the k4 field, ApsaraDB for SelectDB on Elasticsearch converts the query to a term query in Elasticsearch. SQL filter conditions:

k4 = "SelectDB On ES"

Converted query DSL in Elasticsearch:

"term" : { "k4": "SelectDB On ES"}

Because the first type of the k4 field is TEXT, the value of the k4 field is tokenized into the following terms based on the tokenizer that is specified for the k4 field: SelectDB, On, and ES. If you do not specify a tokenizer for the field, the standard tokenizer is used. Send the following API request to tokenize the value in the K4 field:

POST /_analyze{ "analyzer": "standard", "text": "SelectDB On ES"}

The following results are returned:

{
   "tokens": [
      {
         "token": "selectdb",
         "start_offset": 0,
         "end_offset": 8,
         "type": "<ALPHANUM>",
         "position": 0
      },
      {
         "token": "on",
         "start_offset": 9,
         "end_offset": 11,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "es",
         "start_offset": 12,
         "end_offset": 15,
         "type": "<ALPHANUM>",
         "position": 2
      }
   ]
}

Perform the following term query:

"term" : { "k4": "SelectDB On ES"}

No term in the dictionary matches the SelectDB On ES term. Therefore, no results are returned. However, if you set the enable_keyword_sniff parameter to true, the system automatically converts k4 = "SelectDB On ES" to k4.keyword = "SelectDB On ES" to match the SQL semantics. Converted query DSL in Elasticsearch:

"term" : { "k4.keyword": "SelectDB On ES"}

The k4.keyword field is of the KEYWORD type, and the data written to Elasticsearch is a complete term. Therefore, the SelectDB On ES term can be matched.

Enable the automatic node discovery feature

You can enable the automatic node discovery feature by setting the nodes_discovery parameter to true.

After you enable the feature, ApsaraDB for SelectDB discovers all available data nodes to which shards are allocated in Elasticsearch. If the BEs of ApsaraDB for SelectDB cannot access the URLs of the data nodes of Elasticsearch, set the nodes_discovery parameter to false.

Note

Alibaba Cloud Elasticsearch uses an SLB instance as the entry for handling incoming requests. This prevents direct access to nodes of an Elasticsearch cluster. Therefore, you must set the nodes_discovery parameter to false.

Enable HTTPS access for an Elasticsearch cluster

You can enable HTTPS access for an Elasticsearch cluster by setting the ssl parameter to true.

ApsaraDB for SelectDB is configured to trust all HTTPS requests from the FE and BEs, regardless of SSL certificate validity.

Suggestions on time fields

Note

These suggestions are applicable only to external Elasticsearch tables. Time fields are automatically mapped to the DATE or DATETIME type in Elasticsearch catalogs.

In Elasticsearch, the time fields are flexible. However, if the type of a time field is not properly specified in an external Elasticsearch table, the filter conditions cannot be pushed down.

When you create an index, configure the date field to accommodate a wide range of formats to ensure maximum compatibility:

 "dt": {
     "type": "date",
     "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
 }

When you create the field in ApsaraDB for SelectDB, we recommend that you set the type parameter to date or datetime. You can also set the type parameter to varchar. You can execute the following SQL statements to push the filter conditions down to Elasticsearch:

SELECT * FROM doe WHERE k2 > '2020-06-21';

SELECT * FROM doe WHERE k2 < '2020-06-21 12:00:00'; 

SELECT * FROM doe WHERE k2 < 1593497011; 

SELECT * FROM doe WHERE k2 < now();

SELECT * FROM doe WHERE k2 < date_format(now(), '%Y-%m-%d');

Important

In Elasticsearch, if you do not set the format parameter for a time field, the default format of the time field is strict_date_optional_time||epoch_millis.
If the value of a DATE field that is imported into Elasticsearch is a timestamp, you must convert the timestamp to milliseconds. Elasticsearch requires timestamps to be in milliseconds for internal processing. Otherwise, operations that involve an external Elasticsearch table may result in an error.

Query the metadata field `_id` in Elasticsearch

If you import documents into Elasticsearch without specifying the _id field, Elasticsearch automatically assigns a globally unique _id field to each document,. The field serves as the primary key. You can also specify an _id field that indicates specific business significance for a document when you import the document. To query the value of the _id field from an external Elasticsearch table, you can add an _id field of the VARCHAR type when you create the table.

CREATE EXTERNAL TABLE `doe` (
  `_id` varchar COMMENT "",
  `city`  varchar COMMENT ""
) ENGINE=ELASTICSEARCH
PROPERTIES (
"hosts" = "http://127.0.0.1:8200",
"user" = "root",
"password" = "root",
"index" = "doe"
}

To query the value of the _id field from an Elasticsearch catalog, set the mapping_es_id parameter to true.

Important

The value of the _id field can be filtered only by using the = or IN operator.
The _id field must be of the VARCHAR type.

Appendix

The following diagram illustrates the principle by which ApsaraDB for SelectDB queries data from Elasticsearch.

+----------------------------------------------+
|                                              |
| SelectDB   +------------------+              |
|            |       FE         +--------------+-------+
|            |                  |  Request Shard Location
|            +--+-------------+-+              |       |
|               ^             ^                |       |
|               |             |                |       |
|  +-------------------+ +------------------+  |       |
|  |            |      | |    |             |  |       |
|  | +----------+----+ | | +--+-----------+ |  |       |
|  | |      BE       | | | |      BE      | |  |       |
|  | +---------------+ | | +--------------+ |  |       |
+----------------------------------------------+       |
   |        |          | |        |         |          |
   |        |          | |        |         |          |
   |    HTTP SCROLL    | |    HTTP SCROLL   |          |
+-----------+---------------------+------------+       |
|  |        v          | |        v         |  |       |
|  | +------+--------+ | | +------+-------+ |  |       |
|  | |               | | | |              | |  |       |
|  | |   DataNode    | | | |   DataNode   +<-----------+
|  | |               | | | |              | |  |       |
|  | |               +<--------------------------------+
|  | +---------------+ | | |--------------| |  |       |
|  +-------------------+ +------------------+  |       |
|   Same Physical Node                         |       |
|                                              |       |
|           +-----------------------+          |       |
|           |                       |          |       |
|           |      MasterNode       +<-----------------+
| ES        |                       |          |
|           +-----------------------+          |
+----------------------------------------------+

The FE sends a request to the host specified during table creation to obtain the information about the HTTP ports of all nodes and the shard distribution of the index. If the request fails, the FE sequentially traverses the host list until success or complete failure.
When a query is initiated, the FE generates a query execution plan based on the obtained information about nodes and index metadata, and sends the execution plan to the corresponding BEs.
The BEs concurrently obtain the data of the _source field or the fields for which column-oriented storage is enabled from each Elasticsearch index shard in streaming mode by using the HTTP Scroll method.
ApsaraDB for SelectDB returns the results to you after it calculates the results.

Overview

Create an Elasticsearch catalog

Query data from an Elasticsearch data source

Basic query

Extended esquery

Column type mappings

ARRAY type

Best practices

Filter condition pushdown

Enable the columnar scan feature to accelerate queries

Detect KEYWORD fields

Enable the automatic node discovery feature

Enable HTTPS access for an Elasticsearch cluster

Suggestions on time fields

Query the metadata field _id in Elasticsearch

Appendix

Query the metadata field `_id` in Elasticsearch