distinct clause - OpenSearch - Alibaba Cloud Documentation Center

Overview

You can include a distinct clause in a query statement to disperse the documents that the query retrieves. This diversifies the returned results and improves user experience. For example, a query may retrieve a large number of documents, and several documents from the same user may receive high scores and high ranks. As a result, most of the results displayed on a page come from that one user, which degrades the display effect and user experience. In this case, you can include a distinct clause in the query statement. The system extracts documents from the retrieved set based on the rules that you specify in the clause, and then reorders the documents so that documents from each user are displayed.

Syntax

{
  "distinct" : {
     "default": {  
      "dist_key" : "field",
      "dist_count":number,
      "dist_times" : number,
      "dist_filter" : "filter_expression",
      "reserved" : boolean,
      "max_item_count" : number,
      "grade" : []
    },
    "rank": {  
      "dist_key" : "field",
      "dist_count":number,
      "dist_times" : number,
      "dist_filter" : "filter_expression",
      "reserved" : boolean,
      "max_item_count" : number,
      "grade" : []
    },
    "rerank": {
      "dist_key" : "field",
      "dist_count":number,
      "dist_times" : number,
      "dist_filter" : "filter_expression",
      "reserved" : boolean,
      "max_item_count" : number,
      "grade" : []
    }
  }
}

To ensure the dispersing effect, OpenSearch Retrieval Engine Edition disperses documents in both the rough sort phase and the fine sort phase. You can specify the same rule or different rules for the two phases. The rule that applies in each phase depends on which of the default, rank, and rerank rules you specify:

If you specify only the default rule, the default rule is used for dispersing documents in both the rough sort and fine sort phases.
If you specify only the rank rule, the rank rule is used for dispersing documents in the rough sort phase.
If you specify only the rerank rule, the rerank rule is used for dispersing documents in the fine sort phase.
If you specify both the default rule and the rank rule, the rank rule is used for dispersing documents in the rough sort phase, and the default rule is used for dispersing documents in the fine sort phase.
If you specify both the default rule and the rerank rule, the default rule is used for dispersing documents in the rough sort phase, and the rerank rule is used for dispersing documents in the fine sort phase.
If you specify both the rank rule and the rerank rule, the rank rule is used for dispersing documents in the rough sort phase, and the rerank rule is used for dispersing documents in the fine sort phase.
If you specify the default, rank, and rerank rules at the same time, the rank rule is used for dispersing documents in the rough sort phase, and the rerank rule is used for dispersing documents in the fine sort phase.
You must specify at least one of the default, rank, and rerank rules.
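
The rule selection described above can be summarized in a short Python sketch. This is only an illustration of the listed cases, not the engine's implementation. The function name resolve_phase_rules and the user_id field are hypothetical, and the sketch assumes that a phase without an applicable rule simply has no dispersing rule (represented as None).

```python
from typing import Optional, Tuple


def resolve_phase_rules(distinct: dict) -> Tuple[Optional[dict], Optional[dict]]:
    """Return (rough_sort_rule, fine_sort_rule) for a distinct clause.

    Hypothetical helper: rank overrides default in the rough sort phase,
    and rerank overrides default in the fine sort phase.
    """
    if not any(name in distinct for name in ("default", "rank", "rerank")):
        raise ValueError("specify at least one of: default, rank, rerank")
    default = distinct.get("default")
    rough = distinct.get("rank", default)    # rough sort: rank, else default
    fine = distinct.get("rerank", default)   # fine sort: rerank, else default
    return rough, fine


# Usage: default plus rerank. The default rule applies in the rough sort
# phase and the rerank rule applies in the fine sort phase.
clause = {
    "default": {"dist_key": "company_id"},
    "rerank": {"dist_key": "user_id"},  # user_id is a made-up field name
}
rough_rule, fine_rule = resolve_phase_rules(clause)
print(rough_rule["dist_key"], fine_rule["dist_key"])
```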

Parameters

dist_key: required. The attribute field based on which you want to disperse the documents that are obtained.
dist_count: optional. The number of documents to extract for each dist_key value in each round of extraction. Default value: 1.
dist_times: optional. The number of rounds of extraction. Default value: 1.
dist_filter: optional. The filter conditions. Documents that are filtered out by the specified conditions do not participate in dispersing. In the fine sort phase, the system sorts these filtered documents together with the documents that are extracted by using the distinct clause. By default, all documents that are obtained participate in dispersing.
reserved: optional. Specifies whether to retain the remaining documents that are not extracted. Valid values: true and false. Default value: true. If you set this parameter to false, the system discards the documents that are not extracted. In this case, the total number of matching results may be inaccurate.
max_item_count: optional. The maximum number of documents that can be retained in the DISTINCT calculation. The effective limit is max(max_item_count, hit).

To ensure that the final results can be properly returned by page, you can set this parameter to the maximum number of documents that can be queried. For example, if 10 results are returned per page and up to 100 pages are returned, you can set this parameter to 1000.

grade: optional. The threshold values based on which the system classifies documents into different grades. The system extracts documents from each grade based on the threshold value that you specify for the grade. If you do not include this parameter in the distinct clause, the system classifies all documents into one grade by default. The system classifies documents into grades based on the relevance scores that are calculated in the rough sort phase. If you specify multiple grades, separate the threshold values with vertical bars (|). The number of grades that you can specify is not limited.

Example 1: grade:3.0. In this case, documents are classified into two grades based on the specified threshold value. The documents with a score less than 3.0 are classified into the first grade. The documents with a score greater than or equal to 3.0 are classified into the second grade.

Example 2: grade:3.0|5.0. In this case, documents are classified into three grades. The documents with a score less than 3.0 are classified into the first grade. The documents with a score greater than or equal to 3.0 but less than 5.0 are classified into the second grade. The documents with a score greater than or equal to 5.0 are classified into the third grade.

Documents in different grades must be sorted in the same order as the documents in the rough sort phase. If the documents are sorted in descending order in the rough sort phase, documents in different grades are also sorted in descending order. If the documents are sorted in ascending order in the rough sort phase, documents in different grades are also sorted in ascending order.
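
The grade boundaries in the two examples above can be checked with a small Python sketch. It is an illustration only: the helper names parse_grades and grade_of are hypothetical, the thresholds are assumed to be written in ascending order and separated by vertical bars as in the examples, and the sketch does not model how the engine extracts documents within a grade.

```python
from bisect import bisect_right
from typing import List


def parse_grades(spec: str) -> List[float]:
    """Parse a threshold string such as "3.0|5.0" into [3.0, 5.0]."""
    return [float(token) for token in spec.split("|")] if spec else []


def grade_of(score: float, thresholds: List[float]) -> int:
    """Return the 1-based grade of a rough sort score.

    A score equal to a threshold falls into the higher grade, which matches
    "greater than or equal to 3.0" in the examples. With no thresholds,
    every document is in grade 1.
    """
    return bisect_right(thresholds, score) + 1


thresholds = parse_grades("3.0|5.0")
for score in (2.9, 3.0, 4.9, 5.0):
    print(score, "-> grade", grade_of(score, thresholds))
```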

Example:

"distinct" : {
     "default": {  
      "dist_key" : "company_id",
      "dist_count":2,
      "dist_times" : 10
    }
}
In this example, the system performs 10 rounds of document extraction based on the company_id field and extracts two documents for each company_id value in each round. The system assigns lower ranks to the documents that are not extracted.

dist_count and dist_times

The following cases show how the values of the dist_count and dist_times parameters in a distinct clause affect the results that the system returns.

Assume that a query retrieves the following six documents. Each document contains two attribute fields: id and name. The id field is the primary key field, and the name field is used as the distinct key.

doc 1: id:1 name:a

doc 2: id:2 name:a

doc 3: id:3 name:a

doc 4: id:4 name:b

doc 5: id:5 name:c

doc 6: id:6 name:c

Case 1:

"distinct" : {
     "default": {  
      "dist_key" : "name",
      "dist_count":2,
      "dist_times" : 1
    }
}
The following results are returned in sequence: doc 1, doc 2, doc 4, doc 5, and doc 6.

Case 2:

"distinct" : {
     "default": {  
      "dist_key" : "name",
      "dist_count":1,
      "dist_times" : 2
    }
}
The following results are returned in sequence: doc 1, doc 4, doc 5, doc 2, and doc 6.

Case 3:

"distinct" : {
     "default": {  
      "dist_key" : "name",
      "dist_count":1,
      "dist_times" : 1
    }
}
The following results are returned in sequence: doc 1, doc 4, and doc 5.
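
The three cases can be reproduced with a simplified Python model of round-based extraction. This is a sketch for building intuition, not the engine's algorithm: the function name disperse is hypothetical, the six documents are assumed to arrive already ranked in the order listed above, and grades, dist_filter, and max_item_count are ignored. Documents that are never extracted are returned separately, because whether they are ranked lower or discarded depends on the reserved parameter.

```python
from collections import defaultdict

# The six sample documents, in their assumed original rank order.
DOCS = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "a"},
    {"id": 3, "name": "a"},
    {"id": 4, "name": "b"},
    {"id": 5, "name": "c"},
    {"id": 6, "name": "c"},
]


def disperse(docs, dist_key, dist_count=1, dist_times=1):
    """Run dist_times rounds; each round takes up to dist_count documents
    per dist_key value, in rank order. Returns (extracted, not_extracted)."""
    extracted = []
    remaining = list(docs)
    for _ in range(dist_times):
        taken_per_key = defaultdict(int)  # per-round counter for each key value
        leftover = []
        for doc in remaining:
            key = doc[dist_key]
            if taken_per_key[key] < dist_count:
                taken_per_key[key] += 1
                extracted.append(doc)
            else:
                leftover.append(doc)
        remaining = leftover
    return extracted, remaining


def ids(docs):
    return [doc["id"] for doc in docs]


print(ids(disperse(DOCS, "name", dist_count=2, dist_times=1)[0]))  # Case 1
print(ids(disperse(DOCS, "name", dist_count=1, dist_times=2)[0]))  # Case 2
print(ids(disperse(DOCS, "name", dist_count=1, dist_times=1)[0]))  # Case 3
```

Under these assumptions, the printed id sequences correspond to the result orders listed for Case 1, Case 2, and Case 3.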

distinct uniq plug-in

If you set the reserved parameter to false, the values of the total and viewtotal parameters in the returned results may be inaccurate. If you want to display the results by page or perform other operations based on the values of these parameters, errors may occur. To resolve this issue, OpenSearch provides the distinct uniq plug-in to ensure that the values of the total and viewtotal parameters are accurate when the dist_times, dist_count, and reserved parameters are set to 1, 1, and false.

To use the distinct uniq plug-in, include duniqfield:field in a kvpairs clause.

Note:

The value of the duniqfield parameter in a kvpairs clause must be the same as the value of the dist_key parameter in a distinct clause.
This plug-in works only if the dist_times parameter is set to 1, the dist_count parameter is set to 1, and the reserved parameter is set to false. If you change the values of these parameters to other values, this plug-in does not work.
For performance reasons, this plug-in can return up to 5,000 query results for each query. If the number of query results is greater than 5,000, this plug-in returns 5,000 results.

Example:

{
  "distinct" : {
    "default": {  
      "dist_key" : "company_id",
      "dist_count":1,
      "dist_times" : 1,
      "reserved" : false
    }
  },
  "kvpairs" : {
    "duniqfield":"company_id"
  }
}
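
Before you send a query that relies on the distinct uniq plug-in, you can check the conditions from the preceding note on the client side. The following Python sketch is a hypothetical helper, not part of any OpenSearch SDK. It assumes that the query is available as a dictionary shaped like the preceding example and that omitted parameters take the default values listed in the Parameters section.

```python
def uniq_plugin_problems(query: dict, rule: str = "default") -> list:
    """Return a list of reasons why the distinct uniq plug-in would not work.

    Hypothetical client-side check; an empty list means that all conditions
    described in the note are met for the given rule.
    """
    problems = []
    dist = query.get("distinct", {}).get(rule, {})
    kvpairs = query.get("kvpairs", {})

    if "duniqfield" not in kvpairs:
        problems.append("kvpairs clause does not contain duniqfield")
    elif kvpairs["duniqfield"] != dist.get("dist_key"):
        problems.append("duniqfield differs from dist_key")

    # Documented defaults: dist_count=1, dist_times=1, reserved=true.
    if dist.get("dist_count", 1) != 1:
        problems.append("dist_count must be 1")
    if dist.get("dist_times", 1) != 1:
        problems.append("dist_times must be 1")
    if dist.get("reserved", True) is not False:
        problems.append("reserved must be false")
    return problems


query = {
    "distinct": {
        "default": {
            "dist_key": "company_id",
            "dist_count": 1,
            "dist_times": 1,
            "reserved": False,
        }
    },
    "kvpairs": {"duniqfield": "company_id"},
}
print(uniq_plugin_problems(query))
```

Because reserved defaults to true, a clause that omits the reserved parameter fails this check even if dist_count and dist_times are left at their default values.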

Usage notes

The fields that you specify in a distinct clause must be attribute fields that are defined in the schema.json file.
Fields of the ARRAY type are not supported. Only fields of the INT and LITERAL types are supported.