The MaxCompute V2.0 extension function COLLECT_SET aggregates the values of the column specified by colname into an array with distinct elements.
Usage notes
MaxCompute V2.0 provides additional functions. If the functions that you use involve new data types, you must enable the MaxCompute V2.0 data type edition. The new data types include TINYINT, SMALLINT, INT, FLOAT, VARCHAR, TIMESTAMP, and BINARY.
Session level: To use the MaxCompute V2.0 data type edition, you must add
set odps.sql.type.system.odps2=true;before the SQL statement that you want to execute, and commit and execute them together.Project level: The project owner can enable the MaxCompute V2.0 data type edition for the project based on the project requirements. The configuration takes effect after 10 to 15 minutes. Sample command:
setproject odps.sql.type.system.odps2=true;For more information about
setproject, see Project operations. For more information about the precautions that you must take when you enable the MaxCompute V2.0 data type edition at the project level, see Data type editions.
If you use an SQL statement that includes multiple aggregate functions and the project resources are insufficient, memory overflow may occur. We recommend that you optimize the SQL statement or purchase computing resources based on your business requirements.
Syntax
array collect_set(<colname>)Parameters
colname: required. The name of the column to be aggregated.
Return value
A value of the ARRAY type is returned. If the value of colname is NULL, the row is not included in the calculation.
Sample data
This section provides sample source data and examples for you to understand how to use the functions. Create a table named emp and insert the sample data into the table. Sample statement:
create table if not exists emp
(empno bigint,
ename string,
job string,
mgr bigint,
hiredate datetime,
sal bigint,
comm bigint,
deptno bigint);
tunnel upload emp.txt emp; -- Replace emp.txt with the actual path (path and name) to which you upload the data file.The emp.txt file contains the following sample data:
7369,SMITH,CLERK,7902,1980-12-17 00:00:00,800,,20
7499,ALLEN,SALESMAN,7698,1981-02-20 00:00:00,1600,300,30
7521,WARD,SALESMAN,7698,1981-02-22 00:00:00,1250,500,30
7566,JONES,MANAGER,7839,1981-04-02 00:00:00,2975,,20
7654,MARTIN,SALESMAN,7698,1981-09-28 00:00:00,1250,1400,30
7698,BLAKE,MANAGER,7839,1981-05-01 00:00:00,2850,,30
7782,CLARK,MANAGER,7839,1981-06-09 00:00:00,2450,,10
7788,SCOTT,ANALYST,7566,1987-04-19 00:00:00,3000,,20
7839,KING,PRESIDENT,,1981-11-17 00:00:00,5000,,10
7844,TURNER,SALESMAN,7698,1981-09-08 00:00:00,1500,0,30
7876,ADAMS,CLERK,7788,1987-05-23 00:00:00,1100,,20
7900,JAMES,CLERK,7698,1981-12-03 00:00:00,950,,30
7902,FORD,ANALYST,7566,1981-12-03 00:00:00,3000,,20
7934,MILLER,CLERK,7782,1982-01-23 00:00:00,1300,,10
7948,JACCKA,CLERK,7782,1981-04-12 00:00:00,5000,,10
7956,WELAN,CLERK,7649,1982-07-20 00:00:00,2450,,10
7956,TEBAGE,CLERK,7748,1982-12-30 00:00:00,1300,,10Examples
Example 1: Aggregate the salary (sal) values of all employees into an array with only distinct values. Sample statement:
SELECT collect_set(sal) FROM emp;Sample result:
+------+ | _c0 | +------+ | [800, 950, 1100, 1250, 1300, 1500, 1600, 2450, 2850, 2975, 3000, 5000] | +------+Example 2: Use COLLECT_SET with
group byto group all employees by department (deptno) and aggregate the salary (sal) values of employees in each group into an array with only distinct values. Sample statement:SELECT deptno, collect_set(sal) FROM emp GROUP BY deptno;Sample result:
+------------+------+ | deptno | _c1 | +------------+------+ | 10 | [1300, 2450, 5000] | | 20 | [800, 1100, 2975, 3000] | | 30 | [950, 1250, 1500, 1600, 2850] | +------------+------+
Related functions
COLLECT_SET is an aggregate function. For more information about functions that calculate the average value or aggregate parameters from multiple input records, see Aggregate functions.