Analysis methods, matching methods, relevance-based score calculation, and sort expressions
This topic explains why actual search results may differ from expected results in specific scenarios. It also describes how OpenSearch produces and ranks search results, and what you can do to improve them.
In most cases, searches are performed in one of the following ways:
Use LIKE clauses to query a database for substring matches.
Use a search engine such as Baidu or Google. After you enter a search query, the engine analyzes the query into terms based on their semantic meanings. This analysis is difficult to get right and is one of the core challenges for a search engine. The engine then combines the terms to match relevant documents, scores the matched documents, sorts them by score, and returns them to you.
OpenSearch works in the same way. Its search results are mainly affected by the following factors: the analysis method, the matching method, and relevance-based score calculation.
The following sections describe how each of these factors works in OpenSearch, its impact on search results, and the scenarios to which it applies.
Analysis method
Before you proceed, make sure that you are familiar with the available analysis methods. For more information, see Text analyzers.
Matching method
How it works
After a search query is analyzed into terms, the matching method determines how these terms are used to retrieve documents. By default, OpenSearch combines the terms of a single search query with the logical AND relationship: only documents that contain all of the terms obtained after analysis are returned. OpenSearch also supports other matching methods. You can use the logical operators AND, OR, RANK, ANDNOT, and () to combine query conditions. These operators take precedence in the following descending order: () > ANDNOT > AND > OR > RANK.
Examples
| Logical operator | Syntax | Description |
| --- | --- | --- |
| Default | query=title:'Apple Mobile phone' | Searches for documents whose title contains both Apple and Mobile phone. By default, the terms obtained after analysis are combined with AND. |
| AND | query=title:'Apple' AND cate:'Mobile phone' | Searches for documents whose title contains Apple and whose cate field contains Mobile phone. The AND operator returns the intersection of the query results. |
| OR | query=title:'Apple' OR cate:'Mobile phone' | Searches for documents whose title contains Apple or whose cate field contains Mobile phone. The OR operator returns the union of the query results. |
| RANK | query=title:'Apple' RANK cate:'Mobile phone' | Searches for documents whose title contains Apple. Documents whose cate field contains Mobile phone are given extra points. |
| ANDNOT | query=title:'Apple' ANDNOT cate:'Mobile phone' | Searches for documents whose title contains Apple and whose cate field does not contain Mobile phone. |
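These operators can be combined in a single query, and parentheses override the default precedence. For example, the following sketch (the tag field and its values are hypothetical) matches documents whose title contains Apple or Samsung, whose cate field contains Mobile phone, and whose tag field does not contain Refurbished:
query=(title:'Apple' OR title:'Samsung') AND cate:'Mobile phone' ANDNOT tag:'Refurbished'
Because () has the highest precedence, the OR condition is evaluated first.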
FAQ
Question: How do I search for documents that start with a specific term, such as KFC? Answer: This is not supported. OpenSearch does not allow you to retrieve documents based on terms at specific positions.
Relevance-based score calculation
The preceding section describes the methods that are used to retrieve documents. After documents are retrieved, relevance determines how they are sorted. OpenSearch allows you to use sort clauses to customize sort configurations. If you do not specify a sort clause, sort=-RANK is used by default. A sort clause can sort documents by multiple dimensions, each in ascending or descending order. For example, if you specify sort=-RANK;+bonus, documents are first sorted by relevance score in descending order, and documents with the same score are then sorted by the bonus field in ascending order. This section focuses on how to use RANK to implement relevance-based score calculation in OpenSearch. You can use RANK in a rough sort expression or a fine sort expression.
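For example, the following sort clause (sales is a hypothetical numeric field) sorts documents by relevance score in descending order, breaks ties by sales in descending order, and breaks the remaining ties by bonus in ascending order:
sort=-RANK;-sales;+bonus
Each dimension is applied only to break ties left by the previous one.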
How it works
In OpenSearch, documents are first scored based on a rough sort expression. The number of documents that are scored in the rough sort is determined by the rank_size parameter, which is one million. The N documents with the highest rough sort scores, typically several hundred documents, are then scored and sorted based on a fine sort expression. After the fine sort, documents are returned based on the values of the start and hit parameters. If the number of documents to be returned is greater than N, the remaining documents are returned in the order of their rough sort scores.
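For example, assuming that the start and hit parameters are passed in the config clause of a query, as in a typical OpenSearch request, the following sketch skips the first 10 results and returns the next 10 after both sort phases are complete:
config=start:10,hit:10&&query=title:'Apple'&&sort=-RANK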
Rough sort expression: As the preceding description shows, the rough sort stage has a major impact on search performance, such as latency. It also determines whether high-quality documents make it into the fine sort stage and are ultimately returned. Therefore, use a rough sort expression that is simple but effective. OpenSearch allows you to build a rough sort expression from a few simple forward index fields, or from the static_bm25 or timeliness function.
Fine sort expression: After the top N high-quality documents are selected by the rough sort expression, you can use a fine sort expression to sort them precisely. Fine sort expressions support mathematical and logical operations. In addition, OpenSearch provides a rich set of functions and features for typical scenarios, such as online-to-offline (O2O) scenarios, to meet most relevance sorting requirements.
OpenSearch also provides built-in application schemas and sort expressions that you can reference or use directly in various scenarios.
Examples
| Scenario | Expression | Description |
| --- | --- | --- |
| Forum: rough sort | static_bm25() | Roughly calculates the text score. |
| Forum: fine sort | text_relevance(title)*3+text_relevance(body) + if(text_relevance(title)>0.07,timeliness(create_timestamp),timeliness(create_timestamp)*0.5) + (topped+special+atan(hits)*0.5+atan(replies))*0.1 | Calculates the text score, the timeliness score, and the scores of other attributes. |
| O2O: rough sort | sold_score+general_score*2 | Calculates the score from the sales volume and the comprehensive store score, which is computed offline. |
| O2O: fine sort | 2*sold_score+0.5*reward - 10*distance(lon,lat,u_posx,u_posy) + if((flags&2)=2,2,0)+if(is_open=5,10,0) + special_score | Calculates the scores of the sales volume, delivery speed, punctuality rate, distance, busy status, operation status, and manual intervention. |
| Fiction: rough sort | static_bm25()*0.7+hh_hot*0.00003 | Calculates the text score and the popularity score. |
| Fiction: fine sort | pow(min(0.5,max(text_relevance(category),max(text_relevance(title),text_relevance(author)))),2) + general_score*2 + 1.5*(1/(1+pow(2.718281,-((log10(hh_hot)-2)*2-5)))) | Calculates the scores of category relevance, title relevance, author relevance, novel quality, and popularity. |
| E-commerce: rough sort | static_bm25()+general_score*2+timeliness(end_time) | Calculates the text score, the comprehensive commodity score, and the expiration time score. |
| E-commerce: fine sort | text_relevance(title)*3+text_relevance(category) + general_score*2+boughtScore*2 + tag_match(ctr_query_value,doc_value,mul,sum,false,true)+.. | Calculates the scores of text relevance, category relevance, popularity, seller ratings, click-through rate (CTR) estimation, and feature rules. |
FAQ
Question: Why is the seller_id field in the fine sort expression text_relevance(seller_id) reported as not found? Answer: The text_relevance() function supports only fields of the TEXT and SHORT_TEXT types.
Question: Why is error 2112 reported? Answer: The fields referenced in the expression must be covered by the fields in the query clause. For example, the query clause query=default:'keyword' searches the default index, which covers the title and body fields. The expression text_relevance(title)+text_relevance(author) references the title and author fields, but the author field is not covered by the query clause. In this case, error 2112 is reported.
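To avoid this error, make sure that every field referenced by text_relevance() is covered by the query clause. For example, if the default index covers the title and body fields, the following combination is valid:
query=default:'keyword'
text_relevance(title)+text_relevance(body)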
Tips
After documents are retrieved, the sort expressions score each document. Scores that do not depend on the search query can be calculated offline in advance and stored in a dedicated field, such as general_score. The sort expression can then read the offline scores from this field. This saves a large number of query-time calculations and improves search performance.
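For example, in the e-commerce fine sort expression above, general_score*2 adds the offline score without any query-time computation; only the query-dependent part, text_relevance(title), must be calculated for each search:
text_relevance(title)*3 + general_score*2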
You can use the tag_match function to perform multi-dimensional operations on the features in the query clause and in documents. This function is widely applicable in e-commerce scenarios. If you have similar requirements, try this function, as shown in the sketch below.
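For example, the e-commerce fine sort expression above uses tag_match for CTR estimation:
tag_match(ctr_query_value,doc_value,mul,sum,false,true)
Based on that example, this call matches the features passed in the ctr_query_value parameter of the query against those stored in the doc_value field of each document; the mul and sum arguments presumably specify that the weights of matched features are multiplied pairwise and the products are summed.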
OpenSearch provides a rich set of functions and features, which can significantly improve search results when used properly.
Relevance is a combination of many factors. You can adjust the weight of each factor to meet your requirements for search result quality.
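For example, the forum fine sort expression above weights title relevance three times as heavily as body relevance: text_relevance(title)*3+text_relevance(body). If title matches should dominate even more strongly, you can raise the weight, for example to text_relevance(title)*5+text_relevance(body), and compare the results.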