All Products
Search
Document Center

Hologres:General-purpose aggregate functions supported by Hologres

Last Updated:Dec 02, 2024

Hologres is compatible with PostgreSQL and allows you to use the standard PostgreSQL syntax for data development.

Aggregate functions that are compatible with PostgreSQL

The following table describes the general-purpose aggregate functions supported by Hologres. The functions supported by Hologres are a subset of the PostgreSQL functions. For more information about how to use these functions, see Aggregate Functions in the PostgreSQL documentation.

Function

Description

Example

Returned result

array_agg(anyelement)

Aggregates the values of an expression into an array. This function does not support the JSON, JSONB, TIMETZ, INTERVAL, INET, OID, and UUID data types and array types.

array_agg(c1)

{1,2}
{true,false}
{a,b}
{1.1,2.2}

avg(bigint)

Calculates the average value of the non-NULL values of a BIGINT expression.

avg(c1)

2.000000

avg(float8)

Calculates the average value of the non-NULL values of a FLOAT8 expression.

avg(c1)

2

avg(float4)

Calculates the average value of the non-NULL values of a FLOAT4 expression.

avg(c1)

2

avg(int)

Calculates the average value of the non-NULL values of an INT expression.

avg(c1)

2.000000

bit_and(bigint)

Performs bitwise AND operations on the non-NULL values of a BIGINT expression.

bit_and(c1)

0

bit_and(int)

Performs bitwise AND operations on the non-NULL values of an INT expression.

bit_and(c1)

0

bit_or(bigint)

Performs bitwise OR operations on the non-NULL values of a BIGINT expression.

bit_or(c1)

3

bit_or(int)

Performs bitwise OR operations on the non-NULL values of an INT expression.

bit_or(c1)

3

bool_and(bool)

Performs AND operations on a BOOLEAN expression. This function returns true if all the values returned by the expression are true and returns false if one or more values are false.

bool_and(c1)

f

bool_or(bool)

Performs OR operations on a BOOLEAN expression. This function returns true if one or more values returned by the expression are true and returns false if no value is true.

bool_or(c1)

t

count(*)

Returns the number of rows in a table.

count(*)

3

count(bigint)

Returns the number of rows defined by a BIGINT expression.

Note

This function ignores NULL values.

count(c1)

3

count(numeric)

Returns the number of rows defined by a NUMERIC expression.

Note

This function ignores NULL values.

count(c1)

3

every(bool)

Performs AND operations on a BOOLEAN expression. This function returns true if all the values returned by the expression are true and returns false if one or more values are false.

N/A

N/A

max(bigint)

Returns the maximum value returned by a BIGINT expression.

max(c1)

3

max(float8)

Returns the maximum value returned by a FLOAT8 expression.

max(c1)

3

max(float4)

Returns the maximum value returned by a FLOAT4 expression.

max(c1)

3

max(int)

Returns the maximum value returned by an INT expression.

max(c1)

3

max(numeric)

Returns the maximum value returned by a NUMERIC expression.

max(c1)

3

min(bigint)

Returns the minimum value returned by a BIGINT expression.

min(c1)

1

min(float8)

Returns the minimum value returned by a FLOAT8 expression.

min(c1)

1

min(float4)

Returns the minimum value returned by a FLOAT4 expression.

min(c1)

1

min(int)

Returns the minimum value returned by an INT expression.

min(c1)

1

min(numeric)

Returns the minimum value returned by a NUMERIC expression.

min(c1)

1

sum(bigint)

Returns the sum of the values returned by a BIGINT expression.

sum(c1)

6

sum(float8)

Returns the sum of the values returned by a FLOAT8 expression.

sum(c1)

6

sum(float4)

Returns the sum of the values returned by a FLOAT4 expression.

sum(c1)

6

sum(int)

Returns the sum of the values returned by an INT expression.

sum(c1)

6

sum(numeric)

Returns the sum of the values returned by a NUMERIC expression.

sum(c1)

6.0

string_agg(expression, delimiter)

Concatenates the non-NULL values of an expression into a string by using a delimiter.

string_agg(c1, '-')

a-b-c

corr(Y, X)

Returns the correlation coefficient of input values.

corr(c1, c2)

N/A

covar_pop(Y, X)

Returns the population covariance of input values.

covar_pop(c1, c2)

N/A

covar_samp(Y, X)

Returns the sample covariance of input values.

covar_samp(c1, c2)

N/A

regr_avgx(Y, X)

Calculates the average value of the independent variables.

reg_avgx(c1, c2)

N/A

regr_avgy(Y, X)

Calculates the average value of the dependent variables.

reg_avgy(c1, c2)

N/A

regr_count(Y, X)

Returns the number of non-NULL data records of two input parameters.

regr_count(c1, c2)

N/A

regr_intercept(Y, X)

Returns the intercept on the y-axis of the linear regression line that best fits the dependent and independent variables.

reg_intercept(c1, c2)

N/A

regr_r2(Y, X)

Returns the square of the correlation coefficient.

regr_r2(c1, c2)

N/A

regr_slope(Y, X)

Returns the slope of the linear regression line that best fits the dependent and independent variables.

regr_slope(c1, c2)

N/A

regr_sxx(Y, X)

Returns the sum of the squares of the independent variable by using the following formula: sum(X^2) - sum(X)^2/N.

regr_sxx(c1, c2)

N/A

regr_sxy(Y, X)

Returns the sum of the products of the dependent and independent variables by using the following formula: sum(X*Y) - sum(X) * sum(Y)/N.

regr_sxy(c1, c2)

N/A

regr_syy(Y, X)

Returns the sum of the squares of the dependent variable by using the following formula: sum(Y^2) - sum(Y)^2/N.

regr_syy(c1, c2)

N/A

stddev(int)

Calculates the sample standard deviation of the values returned by an INT expression.

stddev(c1)

N/A

stddev(numeric)

Calculates the sample standard deviation of the values returned by a NUMERIC expression.

stddev(c1)

N/A

stddev(float8)

Calculates the sample standard deviation of the values returned by a FLOAT8 expression.

stddev(c1)

N/A

stddev_pop(int)

Calculates the population standard deviation of the values returned by an INT expression.

stddev_pop(c1)

N/A

stddev_pop(numeric)

Calculates the population standard deviation of the values returned by a NUMERIC expression.

stddev_pop(c1)

N/A

stddev_pop(float8)

Calculates the population standard deviation of the values returned by a FLOAT8 expression.

stddev_pop(c1)

N/A

stddev_samp(int)

Calculates the sample standard deviation of the values returned by an INT expression.

stddev_samp(c1)

N/A

stddev_samp(numeric)

Calculates the sample standard deviation of the values returned by a NUMERIC expression.

stddev_samp(c1)

N/A

stddev_samp(float8)

Calculates the sample standard deviation of the values returned by a FLOAT8 expression.

stddev_samp(c1)

N/A

variance(int)

Calculates the sample variance of the values returned by an INT expression.

variance(c1)

N/A

variance(numeric)

Calculates the sample variance of the values returned by a NUMERIC expression.

variance(c1)

N/A

var_pop(float8)

Calculates the population variance of the values returned by a FLOAT8 expression.

var_pop(c1)

N/A

var_pop(int)

Calculates the population variance of the values returned by an INT expression.

var_pop(c1)

N/A

var_pop(numeric)

Calculates the population variance of the values returned by a NUMERIC expression.

var_pop(c1)

N/A

var_samp(float8)

Calculates the sample variance of the values returned by a FLOAT8 expression.

var_samp(c1)

N/A

var_samp(int)

Calculates the sample variance of the values returned by an INT expression.

var_samp(c1)

N/A

var_samp(numeric)

Calculates the sample variance of the values returned by a NUMERIC expression.

var_samp(c1)

N/A

Other aggregate functions

APPROX_COUNT_DISTINCT

  • Syntax

    The APPROX_COUNT_DISTINCT function is used to calculate the number of rows that have distinct values in a column. The result of this function is an approximate value.

    APPROX_COUNT_DISTINCT ( <column> )

    The following table describes the parameters in the preceding syntax.

    Parameter

    Description

    column

    The column for which you want to calculate the approximate number of rows that have distinct values.

    The APPROX_COUNT_DISTINCT function uses the HyperLogLog approximating analysis to perform an inaccurate COUNT DISTINCT operation. An inaccurate COUNT DISTINCT operation allows you to improve query performance especially when a large number of discrete values exist in the column. This ensures that the average margin of error is in the range of 0.1% to 1%. This function is suitable for scenarios in which high query performance is required and an approximate result is acceptable.

    If you can afford the cost of more resources, you can also perform a precise COUNT DISTINCT (column) operation by using the COUNT DISTINCT function.

  • Adjust the margin of error

    To adjust the margin of error, you can execute the following statement to change the value of the hg_experimental_approx_count_distinct_precision parameter:

    SET hg_experimental_approx_count_distinct_precision = 20;
    • Valid values: 12 to 20. Default value: 17.

    • This parameter specifies the number of bits that is used as the index of buckets in the HyperLogLog algorithm. A greater value indicates more buckets and higher accuracy.

    • A greater value indicates that more computing time and memory are consumed. However, the overhead is still smaller than the overhead of the COUNT DISTINCT (column) function. Therefore, we recommend that you use the APPROX_COUNT_DISTINCT function rather than the COUNT DISTINCT ( column ) function.

    • If you set this parameter to a value greater than 17, Hologres uses the HyperLogLog++ algorithm to correct the error of the returned result in order to reduce the margin of error. For example, if you set the hg_experimental_approx_count_distinct_precision parameter to 20, the margin of error can be reduced to 0.01% to 0.2%.

  • Examples

    Execute the following statement to calculate the approximate number of rows that have distinct values in the O_CUSTKEY column:

    SELECT APPROX_COUNT_DISTINCT ( O_CUSTKEY ) FROM ORDERS;
    
    -- Set the hg_experimental_approx_count_distinct_precision parameter to 20 for all sessions and calculate the approximate number of rows that have distinct values in the O_CUSTKEY column.
    ALTER DATABASE dbname SET hg_experimental_approx_count_distinct_precision = 20;
    SELECT APPROX_COUNT_DISTINCT ( O_CUSTKEY ) FROM ORDERS;
    
    -- Set the hg_experimental_approx_count_distinct_precision parameter to 20 for the current session.
    SET hg_experimental_approx_count_distinct_precision = 20;
    SELECT APPROX_COUNT_DISTINCT ( O_CUSTKEY ) FROM ORDERS;

UNIQ

  • Limits

    • Only Hologres V1.3 and later support the UNIQ function. If the version of your instance is earlier than V1.3, manually upgrade your Hologres instance in the Hologres console or join the DingTalk group for technical support. For more information about how to manually upgrade your Hologres instance in the Hologres console, see the Manual upgrade (beta) section in "Instance upgrades". For more information about how to obtain technical support, see Obtain online support for Hologres.

    • The UNIQ function delivers better performance than the COUNT DISTINCT function only if the SQL statement contains the GROUP BY clause and values of the field based on which the GROUP BY operation is performed are evenly distributed.

  • Syntax

    The UNIQ function is used to calculate the number of rows that have distinct values in a column. Duplicated values are counted as a single value.

    UNIQ ( < column > );
    • The following table describes the parameters in the preceding syntax.

      Parameter

      Description

      column

      The column for which you want to calculate the number of rows that have distinct values.

      The following data types are supported: SMALLINT, INTEGER, BIGINT, REAL, DOUBLE PRECISION, TEXT, VARCHAR, TIMESTAMP, TIMESTAMPTZ, DATE, TIMETZ, and UUID.

    • Take note of the following items when you use the function:

      • In most cases, the UNIQ function has better performance than the COUNT DISTINCT function if the GROUP BY key has high cardinality. In addition, the UNIQ function occupies less memory space than the COUNT DISTINCT function. If the memory space is insufficient for using the COUNT DISTINCT function, you can use the UNIQ function instead.

      • The performance of the COUNT DISTINCT function is optimized in Hologres V2.1 and later for various scenarios, such as scenarios where one or more COUNT DISTINCT functions are used, data skew occurs, or no GROUP BY clause is used. In these scenarios in Hologres V2.1 or later, you do not need to use the UNIQ function because the COUNT DISTINCT function delivers high performance. For more information, see the "Optimize the COUNT DISTINCT function" section in Optimize performance of queries on Hologres internal tables.

  • Examples

    Execute the following statement to calculate the number of rows that have distinct values in the O_CUSTKEY column of the ORDERS table:

    SELECT UNIQ ( O_CUSTKEY ) FROM ORDERS;
    
    -- Execute the following statement to group data by O_ORDERSTATUS and calculate the number of rows that have distinct values in the O_CUSTKEY column in each group:
    SELECT O_ORDERSTATUS, UNIQ ( O_CUSTKEY ) FROM ORDERS GROUP BY O_ORDERSTATUS;

MAX_BY and MIN_BY

  • Limits

    This function is supported in Hologres V1.3.36 and later. If the version of your Hologres instance is earlier than V1.3.36, join the Hologres DingTalk group to apply for an upgrade. For more information, see Obtain online support for Hologres.

  • Syntax

    The MAX_BY or MIN_BY function is used to find the row in which the maximum or minimum value of a column (column y) is included and return the value of another column (column x) in the same row. Syntax:

    Note

    Values of the numeric type are sorted based on the numeric values, and values of non-numeric types are sorted in alphabetical order.

    MAX_BY(x, y);
    MIN_BY(x, y);
    • The following table describes the parameters in the preceding syntax.

      Parameter

      Description

      y

      The name of a column whose values you want to sort.

      x

      The name of another column whose value you want to query.

      • If you specify to find the row in which the maximum value of column y is included, the MAX_BY function returns the value of column x in the same row.

      • If you specify to find the row in which the minimum value of column y is included, the MIN_BY function returns the value of column x in the same row.

    • Take note of the following items when you use the MAX_BY and MIN_BY functions:

      • If multiple rows in column y contain the maximum value, the MAX_BY function returns the maximum value of the values in column x in the same rows of column y.

      • If multiple rows in column y contain the minimum value, the MIN_BY function returns the minimum value of the values in column x in the same rows of column y.

      • Null values in column y are not involved in the computing. If all values in column y are null, the functions return NULL.

  • Examples

    • Perform a GROUP BY operation on the column named id to group data. Then, use the MAX_BY function to find the row in which the maximum value of the column named cost is included in each group and return the value of the column named name in the same row. Sample statement:

      SELECT id, max_by(name, cost) FROM test GROUP BY id;
      
       id | max_by
      ----+--------
        2 | bb
        1 | aaa
        3 | c
      (3 rows)
    • Use the MAX_BY function to find the rows in which the maximum value of the cost column is included and return the maximum value of the values in the name column in the same rows. Sample statement:

      select max_by(name, cost) from test;
      
       max_by
      --------
       bb
      (1 row)
    • Use the MIN_BY function to find the row in which the minimum value of the cost column is included and return the value of the name column in the same row. Sample statement:

      SELECT min_by(name, cost) FROM test;
      
       min_by
      --------
       cc
      (1 row)