distinct clause

A distinct clause diversifies search results. Suppose a query retrieves a large number of documents and several documents from the same user score highly and rank near the top. Without scattering, most of the results on a page come from that one user, which hurts both the display and the user experience. A distinct clause extracts a limited number of documents for each user so that documents from different users are displayed.

Syntax

dist_key:field,dist_count:1,dist_times:1,reserved:false

Parameter	Type	Required	Valid value	Default value	Description
dist_key	string	Yes			The field to be scattered.
dist_times	int	No		1	The number of extractions.
dist_count	int	No		1	The number of documents to be extracted in one extraction.
reserved	true/false	No	true/false	true	Specifies whether to retain the remaining documents after extraction. If this parameter is set to false, the remaining documents are discarded. As a result, the total number of matching results is inaccurate.
update_total_hit	true/false	No	true/false	false	Takes effect when the reserved parameter is set to false. If update_total_hit is set to true, the system subtracts the number of discarded documents from the value of the total_hit response parameter. The resulting total_hit value may be inaccurate. If update_total_hit is set to false, the value of total_hit includes the number of discarded documents.
dist_filter	string	No			The filter condition. The documents that are filtered out are not scattered, but are sorted together with the first group of scattered documents. By default, all documents are scattered.
grade	float	No			The score thresholds used to classify documents into categories. The documents in each category are scattered separately based on the other parameters in the distinct clause. If you do not configure the grade parameter, all documents are classified into one category. Separate multiple thresholds with vertical bars (|). The number of thresholds is not limited. Example 1: grade:3.0. Documents are classified into two categories. Documents with a score less than 3.0 are in the first category, and documents with a score greater than or equal to 3.0 are in the second category. Example 2: grade:3.0|5.0. Documents are classified into three categories. Documents with a score less than 3.0 are in the first category, documents with a score greater than or equal to 3.0 but less than 5.0 are in the second category, and documents with a score greater than or equal to 5.0 are in the third category. The categories are sorted in the same order as the documents in the first category: if those documents are sorted in descending order, the categories are also sorted in descending order, and vice versa.
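
The threshold rule for the grade parameter can be sketched in Python as follows. This is only an illustration of the classification described above, not OpenSearch code. The function names parse_grade and grade_category are hypothetical, categories are numbered from 0 (the first category), and the thresholds are assumed to be listed in ascending order.

```python
from bisect import bisect_right


def parse_grade(grade):
    """Split a grade value such as '3.0|5.0' into a list of float thresholds."""
    return [float(threshold) for threshold in grade.split("|")]


def grade_category(score, thresholds):
    """Return the 0-based category index for a score.

    A score lower than the first threshold falls into category 0. A score
    greater than or equal to a threshold falls into the next category.
    Assumes thresholds are sorted in ascending order.
    """
    return bisect_right(thresholds, score)


thresholds = parse_grade("3.0|5.0")
print(grade_category(2.9, thresholds))  # 0: score < 3.0
print(grade_category(3.0, thresholds))  # 1: 3.0 <= score < 5.0
print(grade_category(5.0, thresholds))  # 2: score >= 5.0
```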

Description of the dist_count and dist_times parameters

The following examples describe the usage and meanings of the dist_count and dist_times parameters. Six documents are provided. id is the primary key field, and name is the field to be scattered.

doc1: id:11 name:a

doc2: id:22 name:a

doc3: id:33 name:a

doc4: id:44 name:b

doc5: id:55 name:c

doc6: id:66 name:c

Example 1: distinct=dist_key:name,dist_count:2,dist_times:1,reserved:false. In this example, one extraction is performed, and up to two documents are extracted for each value of name. The following results are obtained after scattering: doc1, doc2, doc4, doc5, and doc6. doc3 is discarded because reserved is set to false.

Example 2: distinct=dist_key:name,dist_count:1,dist_times:2,reserved:false. In this example, two extractions are performed. In each extraction, one document is extracted for each value of name. The following results are obtained after scattering: doc1, doc4, doc5, doc2, and doc6. doc3 is discarded.

Example 3: distinct=dist_key:name,dist_count:1,dist_times:1,reserved:false. In this example, one extraction is performed, and one document is extracted for each value of name. The following results are obtained after scattering: doc1, doc4, and doc5. The remaining documents are discarded.
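
The three examples can be reproduced with a small Python model of the extraction rounds. This is a simplified sketch for reasoning about dist_count, dist_times, and reserved, not the actual OpenSearch implementation. The function name scatter is hypothetical, and the input list is assumed to be already in ranked order.

```python
from collections import defaultdict


def scatter(docs, dist_key, dist_count=1, dist_times=1, reserved=True):
    """Model the scattering of ranked documents by the dist_key field.

    Each round walks the not-yet-extracted documents in ranked order and
    extracts at most dist_count documents per distinct key value.
    """
    remaining = list(docs)
    result = []
    for _ in range(dist_times):
        taken = defaultdict(int)  # documents extracted per key in this round
        leftover = []
        for doc in remaining:
            key = doc[dist_key]
            if taken[key] < dist_count:
                taken[key] += 1
                result.append(doc)
            else:
                leftover.append(doc)
        remaining = leftover
    if reserved:
        # Retained documents are ranked after all extracted documents.
        result.extend(remaining)
    return result


docs = [
    {"id": 11, "name": "a"},
    {"id": 22, "name": "a"},
    {"id": 33, "name": "a"},
    {"id": 44, "name": "b"},
    {"id": 55, "name": "c"},
    {"id": 66, "name": "c"},
]


def ids(result):
    return [doc["id"] for doc in result]


print(ids(scatter(docs, "name", dist_count=2, dist_times=1, reserved=False)))
# [11, 22, 44, 55, 66]
print(ids(scatter(docs, "name", dist_count=1, dist_times=2, reserved=False)))
# [11, 44, 55, 22, 66]
print(ids(scatter(docs, "name", dist_count=1, dist_times=1, reserved=False)))
# [11, 44, 55]
```

With reserved left at its default of true, the same model appends the documents that were not extracted to the end of the result instead of dropping them.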

Usage notes

A distinct clause is optional.
The fields referenced in a distinct clause must be configured as attribute fields when you define an application schema.
You cannot specify fields of the ARRAY type in a distinct clause. Only fields of the INT type and LITERAL type are supported.
You can specify only one field to be scattered.
The sort feature cannot remove duplicates automatically. However, you can use a distinct clause to remove duplicates. For example, to deduplicate documents that have the same title, specify title as the field to be scattered and perform one extraction in which one document is extracted.

distinct uniq plug-in

If the reserved parameter is set to false, the values of the total and viewtotal parameters in the search results are inaccurate. If you implement pagination or perform other processing based on these values, errors may occur. To resolve this issue, OpenSearch provides the distinct uniq plug-in, which keeps the values of the total and viewtotal parameters accurate when the dist_times, dist_count, and reserved parameters are set to 1, 1, and false, respectively. To use the distinct uniq plug-in, include duniqfield:field in a kvpairs clause. Example: kvpairs=duniqfield:name.

Notes

The field specified in duniqfield:field must be the same as the value of the dist_key parameter in the distinct clause.
This plug-in works only when the dist_times, dist_count, and reserved parameters are set to 1, 1, and false, respectively. If any of these values differs, the plug-in does not work.
For performance reasons, this plug-in returns a maximum of 5,000 search results in each query even if more than 5,000 search results are obtained.
If you use this plug-in in a query that hits millions of data records, timeouts may occur.
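
The first two notes can be checked on the client before a query is sent. The following Python sketch is one possible way to do so. The helper names parse_clause and duniq_applicable are hypothetical, the defaults are taken from the parameter table above, and the simple parser assumes that no parameter value (for example, a dist_filter condition) contains a comma.

```python
def parse_clause(clause):
    """Parse a clause body such as 'k1:v1,k2:v2' into a dict of strings.

    Assumption: values contain no commas, so splitting on ',' is safe.
    """
    return dict(item.split(":", 1) for item in clause.split(","))


def duniq_applicable(distinct, kvpairs):
    """Return True if the distinct uniq plug-in conditions are met."""
    d = parse_clause(distinct)
    kv = parse_clause(kvpairs)
    return (
        d.get("dist_times", "1") == "1"  # default value is 1
        and d.get("dist_count", "1") == "1"  # default value is 1
        and d.get("reserved", "true") == "false"  # default value is true
        and kv.get("duniqfield") == d.get("dist_key")
    )


print(duniq_applicable(
    "dist_key:company_id,dist_count:1,dist_times:1,reserved:false",
    "duniqfield:company_id",
))  # True
print(duniq_applicable(
    "dist_key:company_id,dist_count:2,dist_times:1,reserved:false",
    "duniqfield:company_id",
))  # False: dist_count is not 1
```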

Examples

You want to search for documents in which the value of the create_time field is greater than 1402301230 and that contain "Zhejiang University". The retrieved documents are scattered based on the company_id field. A total of 10 extractions are performed. In each extraction, two documents are extracted for each value of company_id. Because reserved uses its default value of true, the remaining documents are retained and ranked at the back.
```
query=default:'Zhejiang University'&&filter=create_time>1402301230&&distinct=dist_key:company_id,dist_count:2,dist_times:10
```
You want to search for documents that contain "Zhejiang University". The retrieved documents are scattered based on the company_id field. One extraction is performed, and one document is extracted. The remaining documents after extraction are discarded, and only the extracted documents are returned.
```
query=default:'Zhejiang University'&&distinct=dist_key:company_id,dist_count:1,dist_times:1,reserved:false&&kvpairs=duniqfield:company_id
```
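
If you assemble such query strings in application code, a small helper can keep the clause syntax consistent. The following Python sketch is illustrative only. build_distinct and build_query are hypothetical helpers and not part of any OpenSearch SDK, and URL encoding and request signing are out of scope here.

```python
def build_distinct(dist_key, **options):
    """Build a distinct clause such as 'distinct=dist_key:f,dist_count:2'.

    Options are emitted in the order given. Booleans are rendered as the
    lowercase literals true and false used by the clause syntax.
    """
    parts = [f"dist_key:{dist_key}"]
    for name, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{name}:{value}")
    return "distinct=" + ",".join(parts)


def build_query(*clauses):
    """Join clauses with '&&', as in the examples above."""
    return "&&".join(clauses)


print(build_query(
    "query=default:'Zhejiang University'",
    "filter=create_time>1402301230",
    build_distinct("company_id", dist_count=2, dist_times=10),
))
# query=default:'Zhejiang University'&&filter=create_time>1402301230&&distinct=dist_key:company_id,dist_count:2,dist_times:10
```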