Aggregation - Tablestore - Alibaba Cloud Documentation Center

You can perform aggregation operations to obtain the minimum, maximum, sum, and average values, and the count and distinct count of rows. You can also perform aggregation operations to group results by field value, range, geographical location, or filter, and perform nested queries. You can perform multiple aggregation operations for complex queries.

Note

Tablestore SDK for Python V5.2.1 or later supports the aggregation feature.

Background information

The following table describes the aggregation methods.

Method	Description
Minimum value	The aggregation method that can be used to return the minimum value of a field. This method can be used in a similar manner as the SQL MIN function.
Maximum value	The aggregation method that can be used to return the maximum value of a field. This method can be used in a similar manner as the SQL MAX function.
Sum	The aggregation method that can be used to return the sum of all values for a numeric field. This method can be used in a similar manner as the SQL SUM function.
Average value	The aggregation method that can be used to return the average of all values for a numeric field. This method is used in a similar manner as the SQL AVG function.
Count	The aggregation method that can be used to return the total number of values for a specified field or the total number of rows in a search index. This method can be used in a similar manner as the SQL COUNT function.
Distinct count	The aggregation method that can be used to return the number of distinct values for a field. This method can be used in a similar manner as the SQL COUNT(DISTINCT) function.
Percentile statistics	A percentile value indicates the relative position of a value in a dataset. For example, when you collect statistics for the response time of each request during the routine O&M of your system, you must analyze the response time distribution by using percentiles such as p25, p50, p90, and p99.
Group by field value	The aggregation method that can be used to group query results based on field values. The values that are the same are grouped together. The identical value of each group and the number of identical values in each group are returned. Note The calculated number may be different from the actual number if the number of values in a group is very large.
Group by range	The aggregation method that can be used to group query results based on the value ranges of a field. Field values that are within a specified range are grouped together. The number of values in each range is returned.
Group by geographical location	The aggregation method that can be used to group query results based on geographical locations to a central point. Query results in distances that are within a specified range are grouped together. The number of items in each range is returned.
Group by filter	The aggregation method that can be used to filter the query results and group them together to obtain the number of results that match each filter. Results are returned in the order in which the filters are specified.
Query by histogram	The aggregation method that can be used to group query results based on specific data intervals. Field values that are within the same range are grouped together. The value range of each group and the number of values in each group are returned.

Prerequisites

The OTSClient is initialized. For more information, see Initialization.
A data table is created. Data is written to the table.
A search index is created for the data table. For more information, see Create search indexes.

Minimum value

The aggregation method that can be used to return the minimum value of a field. This method can be used in a similar manner as the SQL MIN function.

Parameters

Parameter	Description
name	The unique name of the aggregation operation. You can query the results of a specific aggregation operation based on this name.
field	The name of the field that is used to perform the aggregation operation. Only the LONG, DOUBLE, and DATE types are supported.
missing	The default value for the field that is used to perform the aggregation operation on a row when the field value is empty. If you do not specify a value for missing, the row is ignored. If you specify a value for missing, the value of this parameter is used as the field value of the row.

Example

The following code provides an example on how to use aggregation to obtain the lowest score from the scores of people aged 18:

query = TermQuery('age', 18)
agg = Min('score', name = 'min')

search_response = client.search(table_name, index_name,
                                SearchQuery(query, next_token = None, limit=0, aggs=[agg]),
                                columns_to_get = ColumnsToGet(return_type = ColumnReturnType.ALL_FROM_INDEX))

for agg_result in search_response.agg_results:
    print('{\n"name":"%s",\n"value":%s\n}\n' % (agg_result.name, str(agg_result.value)))

Maximum value

The aggregation method that can be used to return the maximum value of a field. This method can be used in a similar manner as the SQL MAX function.

Parameters

Parameter	Description
name	The unique name of the aggregation operation. You can query the results of a specific aggregation operation based on this name.
field	The name of the field that is used to perform the aggregation operation. Only the LONG, DOUBLE, and DATE types are supported.
missing	The default value for the field that is used to perform the aggregation operation on a row when the field value is empty. If you do not specify a value for missing, the row is ignored. If you specify a value for missing, the value of this parameter is used as the field value of the row.

Example

The following code provides an example on how to use aggregation to obtain the highest score from the scores of people aged 18. By default, if a person has no score, a value of 0 is used as the score for aggregation.

query = TermQuery('age', 18)
agg = Max('score', missing_value = 0, name = 'max')

search_response = client.search(table_name, index_name,
                                SearchQuery(query, next_token = None, limit=0, aggs=[agg]),
                                columns_to_get = ColumnsToGet(return_type = ColumnReturnType.ALL_FROM_INDEX))

for agg_result in search_response.agg_results:
    print('{\n"name":"%s",\n"value":%s\n}\n' % (agg_result.name, str(agg_result.value)))

Sum

The aggregation method that can be used to return the sum of all values for a numeric field. This method can be used in a similar manner as the SQL SUM function.

Parameters

Parameter	Description
name	The unique name of the aggregation operation. You can query the results of a specific aggregation operation based on this name.
field	The name of the field that is used to perform the aggregation operation. Only the LONG and DOUBLE types are supported.
missing	The default value for the field that is used to perform the aggregation operation on a row when the field value is empty. If you do not specify a value for missing, the row is ignored. If you specify a value for missing, the value of this parameter is used as the field value of the row.

Example

The following code provides an example on how to use aggregation to obtain the sum of the scores of people aged 18:

query = TermQuery('age', 18)
agg = Sum('score', name = 'sum')

search_response = client.search(table_name, index_name,
                                SearchQuery(query, next_token = None, limit=2, aggs=[agg]),
                                columns_to_get = ColumnsToGet(return_type = ColumnReturnType.ALL_FROM_INDEX))

for agg_result in search_response.agg_results:
    print('{\n"name":"%s",\n"value":%s\n}\n' % (agg_result.name, str(agg_result.value)))

Average value

The aggregation method that can be used to return the average of all values for a numeric field. This method is used in a similar manner as the SQL AVG function.

Parameters

Parameter	Description
name	The unique name of the aggregation operation. You can query the results of a specific aggregation operation based on this name.
field	The name of the field that is used to perform the aggregation operation. Only the LONG, DOUBLE, and DATE types are supported.
missing	The default value for the field that is used to perform the aggregation operation on a row when the field value is empty. If you do not specify a value for missing, the row is ignored. If you specify a value for missing, the value of this parameter is used as the field value of the row.

Example

The following code provides an example on how to use aggregation to obtain the average score of people aged 18:

query = TermQuery('age', 18)
agg = Avg('score', name = 'avg')

search_response = client.search(table_name, index_name,
                                SearchQuery(query, next_token = None, limit=2, aggs=[agg]),
                                columns_to_get = ColumnsToGet(return_type = ColumnReturnType.ALL_FROM_INDEX))

for agg_result in search_response.agg_results:
    print('{\n"name":"%s",\n"value":%s\n}\n' % (agg_result.name, str(agg_result.value)))

Count

The aggregation method that can be used to return the total number of values for a specified field or the total number of rows in a search index. This method can be used in a similar manner as the SQL COUNT function.

Note

You can use the following methods to query the total number of rows in a search index or the total number of rows that meet the query conditions:

Use the count feature of aggregation and specify count(*) in the request.
Use the query feature to obtain the number of rows that meet the query conditions. Set setGetTotalCount to true in the query. Use MatchAllQuery to obtain the total number of rows in a search index.

You can use the name of a column as the value of the count expression to query the number of rows that contain the column in a search index. This method is suitable for scenarios that involve sparse columns.

Parameters

Parameter	Description
name	The unique name of the aggregation operation. You can query the results of a specific aggregation operation based on this name.
field	The name of the field that is used to perform the aggregation operation. Only the LONG, DOUBLE, BOOLEAN, KEYWORD, GEO_POINT, and DATE types are supported.

Example

The following code provides an example on how to use aggregation to obtain the number of people aged 18 who have scores:


query = TermQuery('age', 18)
agg = Count('score', name = 'count')

search_response = client.search(table_name, index_name,
                                SearchQuery(query, next_token = None, limit=2, aggs=[agg]),
                                columns_to_get = ColumnsToGet(return_type = ColumnReturnType.ALL_FROM_INDEX))

for agg_result in search_response.agg_results:
    print('{\n"name":"%s",\n"value":%s\n}\n' % (agg_result.name, str(agg_result.value)))

Distinct count

The aggregation method that can be used to return the number of distinct values for a field. This method can be used in a similar manner as the SQL COUNT(DISTINCT) function.

Note

The number of distinct values is an approximate number.

If the total number of rows before the distinct count feature is used is less than 10,000, the calculated result is close to the exact value.
If the total number of rows before the distinct count feature is used is greater than or equal to 100 million, the error rate is approximately 2%.

Parameters

Parameter	Description
name	The unique name of the aggregation operation. You can query the results of a specific aggregation operation based on this name.
field	The name of the field that is used to perform the aggregation operation. Only the LONG, DOUBLE, BOOLEAN, KEYWORD, GEO_POINT, and DATE types are supported.
missing	The default value for the field that is used to perform the aggregation operation on a row when the field value is empty. If you do not specify a value for missing, the row is ignored. If you specify a value for missing, the value of this parameter is used as the field value of the row.

Example

The following code provides an example on how to use aggregation to obtain the number of unique names of people aged 18:

query = TermQuery('age', 18)
agg = DistinctCount('name', name = 'distinct_name')

search_response = client.search(table_name, index_name,
                                SearchQuery(query, next_token = None, limit=2, aggs=[agg]),
                                columns_to_get = ColumnsToGet(return_type = ColumnReturnType.ALL_FROM_INDEX))

for agg_result in search_response.agg_results:
    print('{\n"name":"%s",\n"value":%s\n}\n' % (agg_result.name, str(agg_result.value)))

Percentile statistics

A percentile value indicates the relative position of a value in a dataset. For example, when you collect statistics for the response time of each request during the routine O&M of your system, you must analyze the response time distribution by using percentiles such as p25, p50, p90, and p99.

Note To improve the accuracy of the results, we recommend that you specify extreme percentile values such as p1 and p99. If you use extreme percentile values instead of other values such as p50, the returned results are more accurate.

Parameters

Parameter	Description
name	The unique name of the aggregation operation. You can query the results of a specific aggregation operation based on this name.
field	The name of the field that is used to perform the aggregation operation. Only the LONG, DOUBLE, and DATE types are supported.
percentiles	The percentiles such as p50, p90, and p99. You can specify one or more percentiles.
missing_value	The default value for the field that is used to perform the aggregation operation on a row when the field value is empty. If you do not specify a value for missing_value, the row is ignored. If you specify a value for missing_value, the value of this parameter is used as the field value of the row.

Example

query = TermQuery('product', '10010')
agg = Percentiles('latency', percentiles_list = [50, 90, 95])

search_response = client.search(table_name, index_name,
                                SearchQuery(query, next_token = None, limit=2, aggs=[agg]),
                                columns_to_get = ColumnsToGet(return_type = ColumnReturnType.ALL_FROM_INDEX))

for agg_result in search_response.agg_results:
    print('percentiles:%s' % agg_result.name)
    for item in agg_result.value:
        print('%s:%s' % (str(item.key), str(item.value)))

Group by field value

The aggregation method that can be used to group query results based on field values. The values that are the same are grouped together. The identical value of each group and the number of identical values in each group are returned.

Note

The calculated number may be different from the actual number if the number of values in a group is very large.

Parameters

Parameter	Description
name	The unique name of the aggregation operation. You can query the results of a specific aggregation operation based on this name.
field	The name of the field that is used to perform the aggregation operation. Only the LONG, DOUBLE, BOOLEAN, KEYWORD, and DATE types are supported.
size	The number of groups that you want to return. Default value: 10. Maximum value: 2000. If the number of groups exceeds 2,000, only the first 2,000 groups are returned.
group_by_sort	The sorting rules for groups. By default, groups are sorted based on the number of items in the groups in descending order. If you configure multiple sorting rules, the groups are sorted based on the order in which the rules are configured. Supported parameters: Sort by value in alphabetical order. Sort by value in reverse alphabetical order. Sort by row count in ascending order. Sort by row count in descending order. Sort by the values that are obtained from sub-aggregation results in ascending order. Sort by the values that are obtained from sub-aggregation results in descending order.
sub_aggs and sub_group_bys	The sub-aggregation operation. You can perform the sub-aggregation operation based on the grouping results. Scenario Query the number of products in each category, and the maximum and minimum product prices in each category. Method Group query results by product category to obtain the number of products in each category. Then, perform two sub-aggregation operations to obtain the maximum and minimum product prices in each category. Sample result Fruits: 5. The maximum price is CNY 15. The minimum price is CNY 3. Toiletries: 10. The maximum price is CNY 98. The minimum price is CNY 1. Electronic devices: 3. The maximum price is CNY 8,699. The minimum price is CNY 2,300. Other products: 15. The maximum price is CNY 1,000. The minimum price is CNY 80.

Example 1

The following code provides an example on how to group people aged 18 by score and obtain the top 10 common scores and the number of people in each group:

query = TermQuery('age', 18)
group_by = GroupByField('score', size = 10)

search_response = client.search(table_name, index_name,
                                SearchQuery(query, next_token = None, limit=20, group_bys = [group_by]),
                                columns_to_get = ColumnsToGet(return_type = ColumnReturnType.ALL_FROM_INDEX))

for group_by in search_response.group_by_results:
    print("name:%s" % group_by.name)
    print("groups:")
    for item in group_by.items:
        print("key:%s, count:%d" % (item.key, item.row_count))

Example 2

The following code provides an example on how to group people aged 18 by score and obtain the two least common scores and the number of people in each group:

group_by = GroupByField('score', size = 2, group_by_sort = [RowCountSort(sort_order=SortOrder.ASC)])
search_response = client.search(table_name, index_name,
                                SearchQuery(TermQuery('age', 18), limit=100, get_total_count=True, group_bys = [group_by]),
                                ColumnsToGet(return_type=ColumnReturnType.ALL_FROM_INDEX))

for group_by in search_response.group_by_results:
    print("name:%s" % group_by.name)
    print("groups:")
    for item in group_by.items:
        print("key:%s, count:%d" % (item.key, item.row_count))

Example 3

The following code provides an example on how to group people aged 18 by score and obtain the top two common scores and the number of people in each group. Then, sort each group of people by primary key and obtain the information of the top three people in each group:

sort = RowCountSort(sort_order = SortOrder.DESC)
sub_agg = [TopRows(limit=3,sort=Sort([PrimaryKeySort(sort_order=SortOrder.DESC)]), name = 't1')]

group_by = GroupByField('l', size = 2, group_by_sort = [sort], sub_aggs = sub_agg)
search_response = client.search(table_name, index_name,
                                SearchQuery(TermQuery('age', 18), limit=100, get_total_count=True, group_bys = [group_by]),
                                ColumnsToGet(return_type=ColumnReturnType.ALL_FROM_INDEX))

for group_by in search_response.group_by_results:
    print("name:%s" % group_by.name)
    print("groups:")
    for item in group_by.items:
        print("\tkey:%s, count:%d" % (item.key, item.row_count))
        for sub_agg in item.sub_aggs:
            print("\t\tname:%s:" % sub_agg.name)
            for entry in sub_agg.value:
                print("\t\t\tvalue:%s" % str(entry))

Example 4

The following code provides an example on how to group people aged 18 by score and gender:

sort = RowCountSort(sort_order = SortOrder.ASC)
sub_group = GroupByField('sex', size = 10, group_by_sort = [sort])

group_by = GroupByField('score', size = 10, group_by_sort = [sort], sub_group_bys = [sub_group])
search_response = client.search(table_name, index_name,
                                SearchQuery(TermQuery('age', 18), limit=100, get_total_count=True, group_bys = [group_by]),
                                ColumnsToGet(return_type=ColumnReturnType.ALL_FROM_INDEX))

for group_by in search_response.group_by_results:
    print("name:%s" % group_by.name)
    print("groups:")
    for item in group_by.items:
        print("\tkey:%s, count:%d" % (item.key, item.row_count))
        for sub_group in item.sub_group_bys:
            print("\t\tname:%s:" % sub_group.name)
            for sub_item in sub_group.items:
                print("\t\t\tkey:%s, count:%s" % (str(sub_item.key), str(sub_item.row_count)))

Group by range

The aggregation method that can be used to group query results based on the value ranges of a field. Field values that are within a specified range are grouped together. The number of values in each range is returned.

Parameters

Parameter	Description
name	The unique name of the aggregation operation. You can query the results of a specific aggregation operation based on this name.
field	The name of the field that is used to perform the aggregation operation. Only the LONG and DOUBLE types are supported.
range[double_from, double_to)	The value ranges for grouping. The value range can start from Double.MIN_VALUE and end with Double.MAX_VALUE.
sub_aggs and sub_group_bys	The sub-aggregation operation. You can perform the sub-aggregation operation based on the grouping results. For example, after you group query results by sales volume and by province, you can obtain the province that has the largest proportion of sales volume in a specified range. You must specify a value for GroupByField in GroupByRange to perform this query.

Example

The following code provides an example on how to group people aged 18 by score range and obtain the numbers of people with scores in the [80, 90) and [90, 100) intervals:

query = TermQuery('age', 18)
group_by = GroupByRange(field_name = 'score', ranges = [(80, 90),(90, 100)])

search_response = client.search(table_name, index_name,
                                SearchQuery(query, next_token = None, limit=0, group_bys = [group_by]),
                                columns_to_get = ColumnsToGet(return_type = ColumnReturnType.ALL_FROM_INDEX))

for group_by in search_response.group_by_results:
    print("name:%s" % group_by.name)
    print("groups:")
    for item in group_by.items:
        print("range:%.1f~%.1f, count:%d" % (item.range_from, item.range_to, item.row_count))

Group by geographical location

The aggregation method that can be used to group query results based on geographical locations to a central point. Query results in distances that are within a specified range are grouped together. The number of items in each range is returned.

Parameters

Parameter	Description
name	The unique name of the aggregation operation. You can query the results of a specific aggregation operation based on this name.
field	The name of the field that is used for the aggregation operation. Only the GEOPOINT type is supported.
origin(double lat, double lon)	The longitude and latitude of the central point. lat specifies the latitude of the central point. lon specifies the longitude of the central point.
range[double_from, double_to)	The distance ranges that are used for grouping. Unit: meters. The value range can start from Double.MIN_VALUE and end with Double.MAX_VALUE.
sub_aggs and sub_group_bys	The sub-aggregation operation. You can perform the sub-aggregation operation based on the grouping results.

Example

The following code provides an example on how to group people aged 18 by the geographical distance from their school to their home and obtain the numbers of people who live within one kilometer from the school and people who live one to two kilometers away from school: The latitude and longitude of the school is (31,116).

query = TermQuery('age', 18)
group_by = GroupByGeoDistance(field_name = 'address', origin=GeoPoint(31, 116), ranges = [(0, 1000), (1000,2000)])

search_response = client.search(table_name, index_name,
                                SearchQuery(query, next_token = None, limit=2, group_bys = [group_by]),
                                columns_to_get = ColumnsToGet(return_type = ColumnReturnType.ALL_FROM_INDEX))

for group_by in search_response.group_by_results:
    print("name:%s" % group_by.name)
    print("groups:")
    for item in group_by.items:
        print("range:%.1f~%.1f, count:%d" % (item.range_from, item.range_to, item.row_count))

Group by filter

The aggregation method that can be used to filter the query results and group them together to obtain the number of results that match each filter. Results are returned in the order in which the filters are specified.

Parameters

Parameter	Description
name	The unique name of the aggregation operation. You can query the results of a specific aggregation operation based on this name.
filter	The filters that can be used for the query. Results are returned in the order in which the filters are specified.
sub_aggs and sub_group_bys	The sub-aggregation operation. You can perform the sub-aggregation operation based on the grouping results.

Example

The following code provides an example on how to group people aged 18 by filter and obtain the numbers of people who scored 100 in the math exam and who scored 100 in the Chinese exam:

query = TermQuery('age', 18)
filter1 = TermQuery('math', 100)
filter2 = TermQuery('chinese', 100)
filters = [filter1, filter2]
group_by = GroupByFilter(filters)

search_response = client.search(
    table_name, index_name,
    SearchQuery(query, next_token = None, limit=2, group_bys = [group_by]),
    columns_to_get = ColumnsToGet(return_type = ColumnReturnType.ALL_FROM_INDEX))

for group_by in search_response.group_by_results:
    print("name:%s" % group_by.name)
    print("groups:")
    i = 0
    for item in group_by.items:
        print("filter:%s=%s, count:%d" % (str(filters[i].field_name), str(filters[i].column_value), item.row_count))
        i=i+1

Query by histogram

The aggregation method that can be used to group query results based on specific data intervals. Field values that are within the same range are grouped together. The value range of each group and the number of values in each group are returned.

Parameters

Parameter	Description
name	The unique name of the aggregation operation. You can query the results of a specific aggregation operation based on this name.
field	The name of the field that is used to perform the aggregation operation. Only the LONG and DOUBLE types are supported.
interval	The data interval that is used to obtain aggregation results.
field_range[min,max]	The range that is used together with the interval parameter to limit the number of groups. `The value that is calculated by using the formula cannot exceed 2,000.`
min_doc_count	The minimum number of rows. If the number of rows in a group is less than the minimum number of rows, the aggregation results for the group are not returned.
missing_value	The default value for the field that is used to perform the aggregation operation on a row when the field value is empty. If you do not specify a value for missing_value, the row is ignored. If you specify a value for missing_value, the value of this parameter is used as the field value of the row.

Example

query = TermQuery('product', '10010')
group_by = GroupByHistogram(field_name = 'latency', interval = 100, field_range = FieldRange(0, 10000), missing_value = 0)

search_response = client.search(table_name, index_name,
                                SearchQuery(query, next_token = None, limit=2, group_bys = [group_by]),
                                columns_to_get = ColumnsToGet(return_type = ColumnReturnType.ALL_FROM_INDEX))

for group_by in search_response.group_by_results:
    print("name:%s" % group_by.name)
    print("groups:")
    for item in group_by.items:
        print("%s:%s" % (item.key, item.value))