This topic describes how to connect an Elasticsearch data source to ApsaraDB for SelectDB and how to use the Elasticsearch data source in ApsaraDB for SelectDB. This helps you perform federated analytics on data in the Elasticsearch data source.
Overview
Elasticsearch catalogs in ApsaraDB for SelectDB support automatic mapping of metadata from Elasticsearch data sources. Based on the distributed query planning capability of ApsaraDB for SelectDB and the full-text search capability of Elasticsearch, you can use Elasticsearch catalogs to provide a comprehensive online analytical processing (OLAP) solution for the following queries:
Multi-index distributed join queries in Elasticsearch
Multi-table join queries across ApsaraDB for SelectDB and Elasticsearch for full-text search and filtering
ApsaraDB for SelectDB supports Elasticsearch 5.X and later.
Create an Elasticsearch catalog
CREATE CATALOG test_es PROPERTIES (
"type"="es",
"hosts"="http://127.0.0.1:9200",
"user"="test_user",
"password"="test_passwd",
"nodes_discovery"="false"
);
Elasticsearch does not have the concept of a database. Therefore, after you connect an Elasticsearch data source to ApsaraDB for SelectDB, ApsaraDB for SelectDB automatically creates a unique database named default_db
. After you execute the SWITCH statement to switch to the Elasticsearch catalog, ApsaraDB for SelectDB automatically switches to the default_db
database without requiring you to execute the USE default_db
statement.
The following table describes the parameters.
Parameter | Required | Default value | Description |
hosts | Yes | N/A | The URL that is used to access the Elasticsearch data source. You can specify one or more URLs or set this parameter to the URL of the Server Load Balancer (SLB) instance of the Elasticsearch data source. |
user | No | N/A | The account that is used to access the Elasticsearch data source. |
password | No | N/A | The password of the account that is used to access the Elasticsearch data source. |
doc_value_scan | No | true | Specifies whether to enable the column-oriented storage feature of Elasticsearch or Apache Lucene for querying field values. |
keyword_sniff | No | true | Specifies whether to detect fields of the TEXT type in Elasticsearch and perform queries by using the corresponding KEYWORD fields. If you set this parameter to false, the system matches data after the TEXT field is split into terms. |
nodes_discovery | No | true | Specifies whether to enable the node discovery feature of Elasticsearch. Default value: true. Note Alibaba Cloud Elasticsearch uses an SLB instance as the entry for handling incoming requests. This prevents direct access to nodes of an Elasticsearch cluster. Therefore, you must set this parameter to false. |
ssl | No | false | Specifies whether to enable HTTPS access for the Elasticsearch data source. ApsaraDB for SelectDB is configured to trust all HTTPS requests from the frontend (FE) and backends (BEs), regardless of SSL certificate validity. |
mapping_es_id | No | false | Specifies whether to map the |
like_push_down | No | true | Specifies whether to convert LIKE conditions to wildcards and push the LIKE conditions down to the Elasticsearch data source. This increases the CPU consumption of the Elasticsearch data source. |
include_hidden_index | No | false | Specifies whether to include the hidden indexes. Default value: false. |
Only HTTP basic authentication is supported. Make sure that the account that is used to access the Elasticsearch data source has the permissions to access paths such as
/_cluster/state/
and_nodes/http
and read indexes. If you do not enable HTTPS access for an Elasticsearch cluster, you do not need to specify the account or password.By default, if an index in Elasticsearch 5.x or 6.x contains multiple types, ApsaraDB for SelectDB reads data from the first type.
Query data from an Elasticsearch data source
After you create an Elasticsearch catalog in ApsaraDB for SelectDB, you can query external Elasticsearch tables in ApsaraDB for SelectDB in the same way as you query internal tables in ApsaraDB for SelectDB. However, you cannot use the data models such as rollup, pre-aggregation, and materialized views in ApsaraDB for SelectDB.
Basic query
SELECT * FROM es_table WHERE k1 > 1000 AND k3 ='term' OR k4 LIKE 'fu*z_';
Extended esquery
You can use the esquery(field, QueryDSL)
function to push down queries that cannot be expressed in SQL, such as match_phrase
and geoshape
, to Elasticsearch for filtering. In the esquery(field, QueryDSL)
function, the field parameter is used to associate an index, and the QueryDSL parameter specifies the JSON expression of the basic query domain-specific language (DSL) in Elasticsearch. The QueryDSL parameter is enclosed in braces ({}
). The JSON expression must have only one root key
, such as match_phrase
, geo_shape
, or bool
. Examples:
match_phrase
query:
SELECT * FROM es_table WHERE esquery(k4, '{ "match_phrase": { "k4": "selectdb on es" } }');
geo_shape
query:
SELECT * FROM es_table WHERE esquery(k4, '{ "geo_shape": { "location": { "shape": { "type": "envelope", "coordinates": [ [ 13, 53 ], [ 14, 52 ] ] }, "relation": "within" } } }');
bool
query:
SELECT * FROM es_table WHERE esquery(k4, ' { "bool": { "must": [ { "terms": { "k1": [ 11, 12 ] } }, { "terms": { "k2": [ 100 ] } } ] } }');
Column type mappings
Data type in Elasticsearch | Data type in ApsaraDB for SelectDB | Remarks |
NULL | NULL | N/A |
BOOLEAN | BOOLEAN | N/A |
BYTE | TINYINT | N/A |
SHORT | SMALLINT | N/A |
INTEGER | INT | N/A |
LONG | BIGINT | N/A |
UNSIGNED_LONG | LARGEINT | N/A |
FLOAT | FLOAT | N/A |
HALF_FLOAT | FLOAT | N/A |
DOUBLE | DOUBLE | N/A |
SCALED_FLOAT | DOUBLE | N/A |
DATE | DATE | Only the default, yyyy-MM-dd HH:mm:ss, yyyy-MM-dd, and epoch_millis formats are supported. |
KEYWORD | STRING | N/A |
TEXT | STRING | N/A |
IP | STRING | N/A |
NESTED | STRING | N/A |
OBJECT | STRING | N/A |
OTHER | UNSUPPORTED | N/A |
ARRAY type
Elasticsearch does not support explicit ARRAY types, but a field in Elasticsearch can contain zero or more values. To denote a field as an ARRAY type, you can add a specific structural annotation to the _meta field of the index mapping for ApsaraDB for SelectDB
. For more information about the _meta field in Elasticsearch 6.x and earlier, see _meta field.
For example, the doc
index uses the following data structure:
{
"array_int_field": [1, 2, 3, 4],
"array_string_field": ["selectdb", "is", "the", "best"],
"id_field": "id-xxx-xxx",
"timestamp_field": "2022-11-12T12:08:56Z",
"array_object_field": [
{
"name": "xxx",
"age": 18
}
]
}
To define the ARRAY fields in the preceding data structure, you can run the following commands to add their property configurations under the _meta.selectdb
property of the index mapping:
# ES 7.x and later
curl -X PUT "localhost:9200/doc/_mapping?pretty" -H 'Content-Type:application/json' -d '
{
"_meta": {
"selectdb":{
"array_fields":[
"array_int_field",
"array_string_field",
"array_object_field"
]
}
}
}'
# ES 6.x and earlier
curl -X PUT "localhost:9200/doc/_mapping?pretty" -H 'Content-Type: application/json' -d '
{
"_doc": {
"_meta": {
"selectdb":{
"array_fields":[
"array_int_field",
"array_string_field",
"array_object_field"
]
}
}
}
}
The array_fields
parameter specifies fields of the ARRAY type.
Best practices
Filter condition pushdown
Elasticsearch catalogs support filter condition pushdown to Elasticsearch. Only the data that meets the filter conditions is returned. This improves query performance and reduces usage of CPU, memory, and I/O resources in both ApsaraDB for SelectDB and Elasticsearch.
Operators are converted to the following Elasticsearch queries.
SQL syntax | Syntax in Elasticsearch 5.x or later |
= | term query |
in | terms query |
>, <, >=, and ⇐ | range query |
and | bool.filter |
or | bool.should |
not | bool.must_not |
not in | bool.must_not + terms query |
is_not_null | exists query |
is_null | bool.must_not + exists query |
esquery | Native JSON-formatted query DSL in Elasticsearch |
Enable the columnar scan feature to accelerate queries
To accelerate queries, you can enable the columnar scan feature by setting the enable_docvalue_scan
parameter to true.
After you enable this feature, ApsaraDB for SelectDB applies the following two principles when you query data from Elasticsearch:
Best effort: ApsaraDB for SelectDB automatically checks whether column-oriented storage is enabled for the fields to be queried. If the
doc_value
parameter is set to true, column-oriented storage is enabled for all the fields. In this case, ApsaraDB for SelectDB queries all field values from column-oriented storage.Auto-downgrade: If column-oriented storage is disabled for one of the fields to be queried, ApsaraDB for SelectDB parses and queries all field values from the
_source
field for which row-oriented storage is enabled.
By default, ApsaraDB for SelectDB on Elasticsearch queries all columns from the _source
field, which uses row-oriented storage and stores data in the JSON format. Compared with column-oriented storage, the row-oriented approach is less efficient in reading multiple data records at a time, especially when only a few columns need to be queried. If only a few columns are queried, the performance of querying data from the fields for which column-oriented storage is enabled is more than ten times greater than from the _source
field.
Column-oriented storage cannot be enabled for
TEXT
fields in Elasticsearch. Therefore, ifTEXT
fields are part of the query, ApsaraDB for SelectDB reverts to querying data from the_source
field.If the number of fields to be queried is greater than or equal to 25, the performance of querying data from the fields for which column-oriented storage is enabled is basically equivalent to that from the
_source
field.
Detect KEYWORD fields
You can enable detection for fields of the KEYWORD type by setting the enable_keyword_sniff
parameter to true.
Elasticsearch allows you to directly import data without the need to create an index. After data is imported, Elasticsearch automatically creates an index. For STRING fields, Elasticsearch uses its multi-field feature to create both TEXT
and KEYWORD
fields.
The following sample code provides an example on how to map a field to both TEXT and KEYWORD types:
"k4": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
For example, if you perform a query that contains filter conditions that are specified by the =
operator on the k4 field, ApsaraDB for SelectDB on Elasticsearch converts the query to a term query in Elasticsearch. SQL filter conditions:
k4 = "SelectDB On ES"
Converted query DSL in Elasticsearch:
"term" : { "k4": "SelectDB On ES"}
Because the first type of the k4 field is TEXT
, the value of the k4 field is tokenized into the following terms based on the tokenizer that is specified for the k4 field: SelectDB, On, and ES. If you do not specify a tokenizer for the field, the standard tokenizer is used. Send the following API request to tokenize the value in the K4 field:
POST /_analyze{ "analyzer": "standard", "text": "SelectDB On ES"}
The following results are returned:
{
"tokens": [
{
"token": "selectdb",
"start_offset": 0,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "on",
"start_offset": 9,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "es",
"start_offset": 12,
"end_offset": 15,
"type": "<ALPHANUM>",
"position": 2
}
]
}
Perform the following term query:
"term" : { "k4": "SelectDB On ES"}
No term in the dictionary matches the SelectDB On ES
term. Therefore, no results are returned. However, if you set the enable_keyword_sniff
parameter to true, the system automatically converts k4 = "SelectDB On ES"
to k4.keyword = "SelectDB On ES"
to match the SQL semantics. Converted query DSL in Elasticsearch:
"term" : { "k4.keyword": "SelectDB On ES"}
The k4.keyword
field is of the KEYWORD
type, and the data written to Elasticsearch is a complete term. Therefore, the SelectDB On ES term can be matched.
Enable the automatic node discovery feature
You can enable the automatic node discovery feature by setting the nodes_discovery
parameter to true.
After you enable the feature, ApsaraDB for SelectDB discovers all available data nodes to which shards are allocated in Elasticsearch. If the BEs of ApsaraDB for SelectDB cannot access the URLs of the data nodes of Elasticsearch, set the nodes_discovery parameter to false.
Alibaba Cloud Elasticsearch uses an SLB instance as the entry for handling incoming requests. This prevents direct access to nodes of an Elasticsearch cluster. Therefore, you must set the nodes_discovery
parameter to false
.
Enable HTTPS access for an Elasticsearch cluster
You can enable HTTPS access for an Elasticsearch cluster by setting the ssl
parameter to true.
ApsaraDB for SelectDB is configured to trust all HTTPS requests from the FE and BEs, regardless of SSL certificate validity.
Suggestions on time fields
These suggestions are applicable only to external Elasticsearch tables. Time fields are automatically mapped to the DATE or DATETIME type in Elasticsearch catalogs.
In Elasticsearch, the time fields are flexible. However, if the type of a time field is not properly specified in an external Elasticsearch table, the filter conditions cannot be pushed down.
When you create an index, configure the date field to accommodate a wide range of formats to ensure maximum compatibility:
"dt": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
}
When you create the field in ApsaraDB for SelectDB, we recommend that you set the type parameter to date
or datetime
. You can also set the type parameter to varchar
. You can execute the following SQL statements to push the filter conditions down to Elasticsearch:
SELECT * FROM doe WHERE k2 > '2020-06-21';
SELECT * FROM doe WHERE k2 < '2020-06-21 12:00:00';
SELECT * FROM doe WHERE k2 < 1593497011;
SELECT * FROM doe WHERE k2 < now();
SELECT * FROM doe WHERE k2 < date_format(now(), '%Y-%m-%d');
In Elasticsearch, if you do not set the
format
parameter for a time field, the default format of the time field isstrict_date_optional_time||epoch_millis
.If the value of a DATE field that is imported into Elasticsearch is a timestamp, you must convert the timestamp to
milliseconds
. Elasticsearch requires timestamps to be inmilliseconds
for internal processing. Otherwise, operations that involve an external Elasticsearch table may result in an error.
Query the metadata field _id
in Elasticsearch
If you import documents into Elasticsearch without specifying the _id
field, Elasticsearch automatically assigns a globally unique _id
field to each document,. The field serves as the primary key. You can also specify an _id
field that indicates specific business significance for a document when you import the document. To query the value of the _id field from an external Elasticsearch table, you can add an _id
field of the VARCHAR
type when you create the table.
CREATE EXTERNAL TABLE `doe` (
`_id` varchar COMMENT "",
`city` varchar COMMENT ""
) ENGINE=ELASTICSEARCH
PROPERTIES (
"hosts" = "http://127.0.0.1:8200",
"user" = "root",
"password" = "root",
"index" = "doe"
}
To query the value of the _id field from an Elasticsearch catalog, set the mapping_es_id
parameter to true.
The value of the
_id
field can be filtered only by using the=
orIN
operator.The
_id
field must be of theVARCHAR
type.
Appendix
The following diagram illustrates the principle by which ApsaraDB for SelectDB queries data from Elasticsearch.
+----------------------------------------------+
| |
| SelectDB +------------------+ |
| | FE +--------------+-------+
| | | Request Shard Location
| +--+-------------+-+ | |
| ^ ^ | |
| | | | |
| +-------------------+ +------------------+ | |
| | | | | | | | |
| | +----------+----+ | | +--+-----------+ | | |
| | | BE | | | | BE | | | |
| | +---------------+ | | +--------------+ | | |
+----------------------------------------------+ |
| | | | | | |
| | | | | | |
| HTTP SCROLL | | HTTP SCROLL | |
+-----------+---------------------+------------+ |
| | v | | v | | |
| | +------+--------+ | | +------+-------+ | | |
| | | | | | | | | | |
| | | DataNode | | | | DataNode +<-----------+
| | | | | | | | | | |
| | | +<--------------------------------+
| | +---------------+ | | |--------------| | | |
| +-------------------+ +------------------+ | |
| Same Physical Node | |
| | |
| +-----------------------+ | |
| | | | |
| | MasterNode +<-----------------+
| ES | | |
| +-----------------------+ |
+----------------------------------------------+
The FE sends a request to the host specified during table creation to obtain the information about the HTTP ports of all nodes and the shard distribution of the index. If the request fails, the FE sequentially traverses the host list until success or complete failure.
When a query is initiated, the FE generates a query execution plan based on the obtained information about nodes and index metadata, and sends the execution plan to the corresponding BEs.
The BEs concurrently obtain the data of the
_source
field or the fields for which column-oriented storage is enabled from each Elasticsearch index shard in streaming mode by using theHTTP Scroll
method.ApsaraDB for SelectDB returns the results to you after it calculates the results.