By Yizheng
Elasticsearch is a very popular distributed search engine that provides powerful and easy-to-use query and analysis features, including full-text search, fuzzy queries, multi-condition combination queries, and geo-location queries, as well as analysis and aggregation capabilities. Analyzing query performance in the general case is complex because of the wide range of query scenarios and the many other factors involved, such as machine model, parameter configuration, and cluster size. This article analyzes the overhead of several main query scenarios from the perspective of query principles and provides rough performance figures for reference.
This section mainly introduces some background knowledge about Lucene. You can skip this section if you are already familiar with it.
Lucene is the engine underlying Elasticsearch, and Lucene's performance largely determines Elasticsearch's query performance.
The most important aspect of Lucene is its handful of core data structures, which determine how data is retrieved. Let's take a brief look at them:
PostingsList: the inverted index itself; for each term, an ordered, block-compressed list of the docIDs that contain the term.
Term dictionary and term index (FST): map each term to its postings list, and support ordered term lookups such as prefix and range enumeration.
BKD-Tree: the index for numeric and geo fields, used to collect the docIDs falling within a range.
DocValues: a columnar doc-to-value store, used for sorting and aggregations.
After getting familiar with the data structures in Lucene and the basic query principles, we know that each query condition ultimately resolves to a set of matching docIDs: a term query reads the postings list of the term, a string range query enumerates matching terms through the FST, and a numeric range query collects docIDs through the BKD-Tree.
The question then is: given a combined query with several conditions, how does Lucene merge the results of the individual conditions into the final result? Simply put, how do we find the union and intersection of two sets of docIDs?
As described in the Lucene principle analysis article mentioned above, we can use the skipList to skip over non-matching docs and find the intersection of N postings lists.
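To make the idea concrete, here is a minimal, self-contained sketch of such an intersection. All names are illustrative, not Lucene's API, and a binary-search advance over plain sorted arrays stands in for the block-level skipList:

import java.util.ArrayList;
import java.util.List;

// Minimal sketch: intersect two sorted postings lists.
// A binary-search advance() stands in for Lucene's skipList.
final class PostingsIntersection {

    // Returns the docIDs present in both sorted arrays.
    static List<Integer> intersect(int[] shortList, int[] longList) {
        List<Integer> result = new ArrayList<>();
        int j = 0;
        for (int docId : shortList) {
            j = advance(longList, j, docId); // skip to first docID >= docId
            if (j == longList.length) break; // long list exhausted
            if (longList[j] == docId) result.add(docId);
        }
        return result;
    }

    // Index of the first element >= target, searching from position 'from'.
    static int advance(int[] docs, int from, int target) {
        int lo = from, hi = docs.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (docs[mid] < target) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    public static void main(String[] args) {
        int[] a = {2, 5, 8, 13, 21};
        int[] b = {1, 2, 3, 5, 7, 11, 13, 17, 19};
        System.out.println(intersect(a, b)); // prints [2, 5, 13]
    }
}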
For unions, there are several approaches (see the sketch after this list):
Approach 1: Keep the multiple ordered lists as they are, and put the current head of each list into a priority queue (a min-heap). The union can then be iterated by repeatedly popping the smallest docID off the heap and pushing that list's next docID onto it. Skipping is still possible, since each sub-list can be advanced with its own skipList. This method works well when the number of postings lists N is relatively small.
Approach 2: If there are too many postings lists (N is relatively large), the first approach is not cost-effective. In that case, we can directly merge the results into one ordered docID array.
Approach 3: The second approach stores raw docIDs, so its memory usage scales with the number of docIDs. Once the number of matching docs exceeds a certain threshold, constructing a BitSet uses less memory and speeds up union/intersection operations: each docID occupies only 1 bit, and the BitSet's size depends on the total number of docs in the segment, so the cost-effectiveness of a BitSet can be estimated from the total doc count and the current result count.
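Here is a minimal sketch of Approach 1 (illustrative plain-Java code; simple iterators over sorted lists stand in for Lucene's postings iterators):

import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Minimal sketch of Approach 1: union N sorted postings lists by keeping
// the current head of each list in a min-heap keyed by docID.
final class PostingsUnion {

    // One cursor per postings list: its current docID plus the rest of the list.
    private record Cursor(int docId, Iterator<Integer> rest) {}

    static void union(List<List<Integer>> postingsLists) {
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a.docId(), b.docId()));
        for (List<Integer> list : postingsLists) {
            Iterator<Integer> it = list.iterator();
            if (it.hasNext()) heap.add(new Cursor(it.next(), it));
        }
        int lastEmitted = -1;
        while (!heap.isEmpty()) {
            Cursor top = heap.poll();          // pop the smallest docID
            if (top.docId() != lastEmitted) {  // de-duplicate across lists
                System.out.println(top.docId());
                lastEmitted = top.docId();
            }
            if (top.rest().hasNext())          // push that list's next docID
                heap.add(new Cursor(top.rest().next(), top.rest()));
        }
    }

    public static void main(String[] args) {
        union(List.of(List.of(1, 4, 9), List.of(2, 4, 6), List.of(9, 10)));
        // prints 1 2 4 6 9 10, one per line
    }
}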
Since docIDs found through the BKD-Tree are unordered, they must first be converted into an ordered docID array or into a BitSet before being merged with other results.
If a query contains several conditions, it is optimal to evaluate the low-cost conditions first and then iterate over the resulting small collections. Lucene makes many optimizations in this regard: before running a query, it first estimates the cost of each sub-query and then chooses an appropriate query order.
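As an illustration, the clause ordering can be sketched with Lucene's DocIdSetIterator, whose cost() method returns an estimate of how many docs an iterator matches (this mirrors what Lucene's conjunction logic does internally; it is not the actual implementation):

import java.util.Comparator;
import java.util.List;
import org.apache.lucene.search.DocIdSetIterator;

// Sketch: order the clauses of a conjunction by estimated cost, so the
// cheapest (shortest) iterator leads and the others only advance() to
// the docIDs it produces.
final class CostOrdering {
    static void orderByCost(List<DocIdSetIterator> clauses) {
        clauses.sort(Comparator.comparingLong(DocIdSetIterator::cost));
    }
}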
By default, Lucene sorts by score (the computed relevance value); if other sort fields are specified, Lucene sorts the results by those fields instead. Does sorting significantly impact performance? Sorting does not touch every doc found. Instead, it maintains a heap and only guarantees that the first (offset + size) docs are ordered. Sorting performance therefore depends on (offset + size) and the number of docs found, as well as on the overhead of reading docValues. Since (offset + size) is usually small and reading docValues is very efficient, sorting does not hurt performance much.
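Here is a minimal sketch of that bounded heap (illustrative code; plain scores are used, but sort-field values read from docValues work the same way):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Sketch: keep only the best (offset + size) hits in a min-heap while
// iterating matches, so cost depends on the heap size, not on total hits.
final class TopHits {
    static List<Double> topK(Iterable<Double> scores, int offset, int size) {
        int capacity = offset + size;
        PriorityQueue<Double> heap = new PriorityQueue<>(capacity); // min-heap
        for (double s : scores) {
            if (heap.size() < capacity) {
                heap.add(s);
            } else if (s > heap.peek()) { // beats the worst kept hit
                heap.poll();
                heap.add(s);
            }
        }
        List<Double> best = new ArrayList<>(heap);
        best.sort(Collections.reverseOrder()); // best-first order
        return best;
    }

    public static void main(String[] args) {
        System.out.println(topK(List.of(0.3, 0.9, 0.1, 0.7, 0.5), 0, 2));
        // prints [0.9, 0.7]
    }
}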
The previous section covered query-related theory. In this section, we combine theory with practice and analyze query performance for specific scenarios based on test measurements. The tests use a 64-core machine with SSDs and a single-shard index, and analyze the computing overhead of several scenarios while ignoring the influence of the OS cache. The results are for reference only.
Create an index in ES with a single shard and no replicas. Prepare 10 million rows of data, each containing only a few tags and a unique ID, and write them all into the index. Tag1 takes only two values, a and b. Now find the entries with Tag1=a among the 10 million rows (about 5 million matches). How long does the following query take?
Request:
{
"query": {
"constant_score": {
"filter": {
"term": {
"Tag1": "a"
}
}
}
},
"size": 1
}
Response:
{"took":233,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":5184867,"max_score":1.0,"hits":...}
This request takes 233 ms and returns a total of 5,184,867 matching data entries.
The query condition Tag1="a" means scanning the postings list of the term Tag1="a". This list is 5,184,867 entries long, which is very long, and most of the time is spent scanning it. In this example, the point of scanning the whole list is to obtain the total number of matching records. Because constant_score is used, only one matching record needs to be returned and no scores need to be computed. In scenarios where scoring is required, Lucene computes a score based on how often the term appears in each doc and returns the results sorted by score.
So a postings list of over 5 million docIDs can be scanned in 233 ms. Since a single request executes on a single thread, one CPU core can scan on the order of 20 million docIDs in an inverted index per second.
Now let's switch to a shorter postings list, with a total length of about 10,000, which takes 3 ms to scan.
{"took":3,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":10478,"max_score":1.0,"hits":...}
Let's try to find the intersection of two term queries first:
Consider a term combination query over two postings lists of length 10,000 and 5,000,000 respectively, with about 5,000 entries matching after the merge. How does this query perform?
Request:
{
"size": 1,
"query": {
"constant_score": {
"filter": {
"bool": {
"must": [
{
"term": {
"Tag1": "a" //length of postings list 5,000,000
}
},
{
"term": {
"Tag2": "0" // length of postings list 10,000
}
}
]
}
}
}
}
}
Response:
{"took":21,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":5266,"max_score":2.0,"hits":...}
This request takes 21 ms, and the main action is to find the intersection of the two postings lists. Therefore, our analysis focuses on the skipList performance.
In this example, the postings lists are 10,000 and 5,000,000 entries long respectively, and over 5,000 docs still match after the merge. For the 10,000-entry list, skipping is almost pointless, because roughly half of its docs match; for the 5,000,000-entry list, each skip advances past about 1,000 docs on average. The minimum storage unit of a postings list is the block, which generally holds 128 docIDs, and no skipping happens inside a block: even after skipping to the right block, the docIDs within it must still be scanned sequentially. So the roughly 10,000 skips each land in a 128-docID block, meaning hundreds of thousands of docIDs are actually scanned, and around 20 ms is within the expected range.
Now let's find the union of the term queries. Replace "must" in the preceding bool query with "should", and here is the query result:
{"took":393,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":5190079,"max_score":1.0,"hits":...}
It takes 393 ms, longer than either single query: to find the union, both postings lists must be iterated in full.
Consider the 10 million data entries again. Each RecordID is a UUID, unique per doc. Find the UUIDs that begin with 0 through 7, probably over 5 million results. Let's look at the query performance in this scenario.
Request:
{
"query": {
"constant_score": {
"filter": {
"range": {
"RecordID": {
"gte": "0",
"lte": "8"
}
}
}
}
},
"size": 1
}
Response:
{"took":3001,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":5185663,"max_score":1.0,"hits":...}
Assume that we are going to query UUIDs beginning with "a". We may get around 600,000 results. How about the performance?
Request:
{
"query": {
"constant_score": {
"filter": {
"range": {
"RecordID": {
"gte": "a",
"lte": "b"
}
}
}
}
},
"size": 1
}
Response:
{"took":379,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":648556,"max_score":1.0,"hits":...}
This query mainly exercises FST performance. From the results we can see that FST queries perform much worse than postings-list scans: scanning 5 million entries in a postings list takes less than 300 ms, while covering the same amount of data through the FST takes 3 seconds, more than ten times slower. For UUID strings, FST range scanning covers about 1 million entries per second.
Consider a string range query (5 million matching entries) combined with the two term queries (about 5,000 matching entries after intersection). In total, about 2,600 entries meet all conditions. Let's test the performance.
Request:
{
"query": {
"constant_score": {
"filter": {
"bool": {
"must": [
{
"range": {
"RecordID": {
"gte": "0",
"lte": "8"
}
}
},
{
"term": {
"Tag1": "a"
}
},
{
"term": {
"Tag2": "0"
}
}
]
}
}
}
},
"size": 1
}
Results:
{"took":2849,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":2638,"max_score":1.0,"hits":...}
In this example, most of the query time is spent scanning the FST. First, the terms matching the range are enumerated through the FST; then the docID list of each matched term is read to construct a BitSet; finally, that BitSet is intersected with the postings lists of the two term queries.
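A minimal sketch of that merge, with java.util.BitSet standing in for Lucene's FixedBitSet and all names illustrative:

import java.util.BitSet;
import java.util.List;

// Sketch: fold the postings of every term matched by the FST range into
// one BitSet, then intersect it with the result of the term queries.
final class RangeToBitSet {
    static BitSet rangeBitSet(List<int[]> postingsPerMatchedTerm,
                              BitSet termQueryResult, int maxDoc) {
        BitSet rangeBits = new BitSet(maxDoc);
        for (int[] postings : postingsPerMatchedTerm) {
            for (int docId : postings) {
                rangeBits.set(docId); // union of all matched terms' docIDs
            }
        }
        rangeBits.and(termQueryResult); // intersection with the term queries
        return rangeBits;
    }
}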
For the numeric type, we again search the 10 million entries for about 5 million targets and see how that performs.
Request:
{
"query": {
"constant_score": {
"filter": {
"range": {
"Number": {
"gte": 100000000,
"lte": 150000000
}
}
}
}
},
"size": 1
}
Response:
{"took":567,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":5183183,"max_score":1.0,"hits":...}
This scenario mainly tests BKD-Tree performance, and the results are quite good: finding 5 million docs takes around 500 ms, only about twice the time of scanning an inverted index, and far faster than the FST. Geo-location queries are also implemented with the BKD-Tree and likewise perform well.
Now let's look at a complex query scenario: the numeric range covers 5 million entries, and the two term conditions are added on top, with just over 2,600 entries matching all conditions. Let's evaluate the performance.
Request:
{
"query": {
"constant_score": {
"filter": {
"bool": {
"must": [
{
"range": {
"Number": {
"gte": 100000000,
"lte": 150000000
}
}
},
{
"term": {
"Tag1": "a"
}
},
{
"term": {
"Tag2": "0"
}
}
]
}
}
}
},
"size": 1
}
Response:
{"took":27,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":2638,"max_score":1.0,"hits":...}
The result is unexpected: this query takes only 27 ms! In the previous example, the numeric range query alone took more than 500 ms; yet after adding two more conditions, the query time drops to 27 ms. Why?
Lucene has an optimization here: the range clause is executed as an IndexOrDocValuesQuery, which automatically decides whether to query the index (BKD-Tree) or the DocValues. In this example, Lucene first intersects the two term queries and gets a little over 5,000 docIDs; it then reads the docValues of those 5,000-odd docIDs and keeps the ones whose values fall inside the numeric range. Since only around 5,000 docs' docValues have to be read, the query is fast.
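At the Lucene level, such a query can be built roughly as follows, assuming the Number field is indexed both as a point (BKD-Tree) and as docValues; this is a minimal sketch, not Elasticsearch's actual code:

import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.search.IndexOrDocValuesQuery;
import org.apache.lucene.search.Query;

// Sketch: wrap a BKD-Tree range query and a docValues range query in an
// IndexOrDocValuesQuery; Lucene picks the cheaper side per segment
// (the BKD index when the range leads, docValues when other clauses
// have already narrowed the candidates).
final class NumberRangeQuery {
    static Query build(long lower, long upper) {
        Query onIndex = LongPoint.newRangeQuery("Number", lower, upper);
        Query onDocValues =
                SortedNumericDocValuesField.newSlowRangeQuery("Number", lower, upper);
        return new IndexOrDocValuesQuery(onIndex, onDocValues);
    }
}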
I will end this article with a question: since more data to scan means poorer performance, is it possible to stop a query after enough data has been obtained?
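As a hint at one possible answer: Lucene allows a collector to terminate a segment early by throwing CollectionTerminatedException, which IndexSearcher catches before moving on to the next segment. A minimal sketch with a hypothetical collector class:

import org.apache.lucene.search.CollectionTerminatedException;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.SimpleCollector;

// Sketch: stop collecting a segment once `limit` hits have been gathered.
// IndexSearcher swallows CollectionTerminatedException and moves on.
final class EarlyTerminatingCollector extends SimpleCollector {
    private final int limit;
    private int count;

    EarlyTerminatingCollector(int limit) {
        this.limit = limit;
    }

    @Override
    public void collect(int doc) {
        if (++count >= limit) {
            throw new CollectionTerminatedException();
        }
    }

    @Override
    public ScoreMode scoreMode() {
        return ScoreMode.COMPLETE_NO_SCORES; // counting only, no scoring needed
    }
}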