MaxCompute allows you to query data by using SELECT
statements. This topic describes the syntax of SELECT
statements in MaxCompute. This topic also describes how to use SELECT statements to perform operations, such as nested queries, sorting, and queries by group.
Before you execute SELECT
statements, make sure that you are granted the Select permission on the destination table. For more information, see MaxCompute permissions.
You can execute the statements that are described in this topic on the following platforms:
Overview
SELECT
statements are used to query data that meets the specified conditions from a table. The following table describes the query operations that can be performed in different scenarios.
Query operation | Description |
Allows you to perform further queries based on the result of a query. | |
Allows you to obtain the intersection, union, or supplementary set of two datasets. | |
Allows you to perform | |
Allows you to filter data in the left table by using the right table and obtain the data that appears only in the left table. | |
Allows you to explicitly specify | |
If two tables that you want to join contain hot key values, a long tail issue may occur. You can extract hot key values from the two tables, separately calculate the join result of hot key values and the join result of non-hot key values, and then join the calculated data. | |
Allows you to use LATERAL VIEW with a user-defined table-valued function (UDTF) to split one row of data into multiple rows. | |
Allows you to aggregate and analyze data from multiple dimensions. | |
| |
Allows you to modify the split size to control the parallelism of subtasks. | |
In MaxCompute, you can perform time travel queries and incremental queries on Transaction Table 2.0 tables.
|
Limits
After a
SELECT
statement is executed, a maximum of 10,000 rows of results can be displayed. The size of the returned result must be less than 10 MB. This limit does not apply toSELECT
clauses.SELECT
clauses return all results in response to the query from the upper layer.If you execute a
SELECT
statement to query data from a partitioned table, you cannot perform a full table scan on the table.If your project was created after 20:00:00 on January 10, 2018, you cannot perform a full table scan on a partitioned table in the project. This limit applies when you execute a SELECT statement to query data from the table. To query data from a partitioned table, you must specify the partitions that you want to scan. This reduces unnecessary I/O and conserves computing resources. This also reduces your computing costs if you use the pay-as-you-go billing method.
To perform a full table scan on a partitioned table, add the
set odps.sql.allow.fullscan=true;
command before the SQL statement that is used for the full table scan. Then, commit and run the added command with the SQL statement. For example, if you want to perform a full table scan on thesale_detail
partitioned table, execute the following statement:set odps.sql.allow.fullscan=true; select * from sale_detail;
If you want to query clustered tables, you can execute a SELECT statement to perform bucket pruning only when the number of partitions that can be scanned in a single table is less than or equal to 400. If bucket pruning does not take effect, the number of data records that are scanned increases. If you use the pay-as-you-go billing method and bucket pruning does not take effect, your costs increase. If you use the subscription billing method and bucket pruning does not take effect, your SQL computing performance deteriorates.
Syntax
[with <cte>[, ...] ]
select [all | distinct] <select_expr>[, <except_expr>][, <replace_expr>] ...
from <table_reference>
[where <where_condition>]
[group by {<col_list>|rollup(<col_list>)}]
[having <having_condition>]
[window <window_clause>]
[order by <order_condition>]
[distribute by <distribute_condition> [sort by <sort_condition>]|[ cluster by <cluster_condition>] ]
[limit <number>]
For more information about the sequence to execute clauses in a SELECT statement, see Sequence for executing clauses in a SELECT statement.
Sample data
This topic provides sample source data and sample statements to demonstrate how to prepare source data. The following sample statements show how to create the sale_detail table and insert data into this table.
-- Create a partitioned table named sale_detail.
create table if not exists sale_detail
(
shop_name string,
customer_id string,
total_price double
)
partitioned by (sale_date string, region string);
-- Add partitions to the sale_detail table.
alter table sale_detail add partition (sale_date='2013', region='china');
-- Insert data into the sale_detail table.
insert into sale_detail partition (sale_date='2013', region='china') values ('s1','c1',100.1),('s2','c2',100.2),('s3','c3',100.3);
Query data in the sale_detail table. Sample statement:
select * from sale_detail;
-- The following result is returned:
+------------+------------+------------+------------+------------+
| shop_name | price | customer | sale_date | region |
+------------+------------+------------+------------+------------+
| s1 | 100.1 | c1 | 2013 | china |
| s2 | 100.2 | c2 | 2013 | china |
| s3 | 100.3 | c3 | 2013 | china |
+------------+------------+------------+------------+------------+
WITH clause (CTE)
The WITH clause is optional. The WITH clause contains one or more common table expressions (CTEs). A CTE is used as a temporary table in the runtime environment. You can reference the temporary table in subsequent queries. When you use a CTE, you must comply with the following rules:
The name of a CTE must be unique in a WITH clause.
A CTE that is defined in a WITH clause can be referenced only by other CTEs that are defined in the same WITH clause.
For example, A is the first CTE in a WITH clause and B is the second CTE in the same WITH clause.
If A references A, the reference is invalid. Incorrect usage of CTEs:
with A as (select 1 from A) select * from A;
If A references B and B references A, the references are invalid. Circular reference is not supported. Incorrect usage of CTEs:
with A as (select * from B ), B as (select * from A ) select * from B;
Sample statement of correct usage:
with
A as (select 1 as C),
B as (select * from A)
select * from B;
The following result is returned:
+---+
| c |
+---+
| 1 |
+---+
Column expression (select_expr)
select_expr is required. select_expr
is in the format of col1_name, col2_name, column expression,...
. This format indicates a common column or partition key column that you want to query, or a regular expression that you use to query data. When you use select_expr, you must comply with the following rules:
Specify the names of the columns from which you want to read data.
The following statement reads data of the
shop_name
column from thesale_detail
table. Sample statement:select shop_name from sale_detail;
The following result is returned:
+------------+ | shop_name | +------------+ | s1 | | s2 | | s3 | +------------+
Use an asterisk (
*
) to represent all columns. You can also use an asterisk (*) withwhere_condition
to specify filter conditions.The following statement queries data of all columns from the
sale_detail
table. Sample statement:-- Enable a full table scan only for the current session. set odps.sql.allow.fullscan=true; select * from sale_detail;
The following result is returned:
+------------+-------------+-------------+------------+------------+ | shop_name | customer_id | total_price | sale_date | region | +------------+-------------+-------------+------------+------------+ | s1 | c1 | 100.1 | 2013 | china | | s2 | c2 | 100.2 | 2013 | china | | s3 | c3 | 100.3 | 2013 | china | +------------+-------------+-------------+------------+------------+
The following statement uses an asterisk (*) with
where_condition
to specify filter conditions. Sample statement:select * from sale_detail where shop_name='s1';
The following result is returned:
+------------+-------------+-------------+------------+------------+ | shop_name | customer_id | total_price | sale_date | region | +------------+-------------+-------------+------------+------------+ | s1 | c1 | 100.1 | 2013 | china | +------------+-------------+-------------+------------+------------+
Use a regular expression.
The following statement queries data of all columns whose names start with
sh
from thesale_detail
table. Sample statement:select `sh.*` from sale_detail;
The following result is returned:
+------------+ | shop_name | +------------+ | s1 | | s2 | | s3 | +------------+
The following statement queries data of all columns whose names are not
shop_name
from thesale_detail
table. Sample statement:select `(shop_name)?+.+` from sale_detail;
The following result is returned:
+-------------+-------------+------------+------------+ | customer_id | total_price | sale_date | region | +-------------+-------------+------------+------------+ | c1 | 100.1 | 2013 | china | | c2 | 100.2 | 2013 | china | | c3 | 100.3 | 2013 | china | +-------------+-------------+------------+------------+
The following statement queries data of all columns except the columns whose names are
shop_name
andcustomer_id
from thesale_detail
table. Sample statement:select `(shop_name|customer_id)?+.+` from sale_detail;
The following result is returned:
+-------------+------------+------------+ | total_price | sale_date | region | +-------------+------------+------------+ | 100.1 | 2013 | china | | 100.2 | 2013 | china | | 100.3 | 2013 | china | +-------------+------------+------------+
The following statement queries data of all columns except the columns whose names start with
t
from thesale_detail
table. Sample statement:select `(t.*)?+.+` from sale_detail;
The following result is returned:
+------------+-------------+------------+------------+ | shop_name | customer_id | sale_date | region | +------------+-------------+------------+------------+ | s1 | c1 | 2013 | china | | s2 | c2 | 2013 | china | | s3 | c3 | 2013 | china | +------------+-------------+------------+------------+
NoteIf the name of col2 is the prefix of the name of col1 and you want to exclude multiple columns, you must make sure that the name of col1 is placed before that of col2. The longer column name is placed before the shorter column name. For example, two partitions of a partitioned table do not need to be queried. One partition is named
ds
and the other is nameddshh
. The name of the ds partition is the prefix for the name of the dshh partition. Therefore, theselect `(dshhds)?+.+` from t;
expression is correct, but theselect `(dsdshh)?+.+` from t;
expression is incorrect.
Use
DISTINCT
before the name of a column to filter out duplicate values from the column and return only distinct values. If you useALL
before the name of a column, all values of the column, including duplicate values, are returned. If DISTINCT is not used,ALL
is used.The following statement queries data of the region column from the sale_detail table and returns only one distinct value. Sample statement:
select distinct region from sale_detail;
The following result is returned:
+------------+ | region | +------------+ | china | +------------+
The following statement specifies multiple columns after the
DISTINCT
option. TheDISTINCT
option takes effect on all the specified columns instead of a single column. Sample statement:select distinct region, sale_date from sale_detail;
The following result is returned:
+------------+------------+ | region | sale_date | +------------+------------+ | china | 2013 | +------------+------------+
In the following statement, DISTINCT is used together with a window function. DISTINCT is used to deduplicate the computing results of the window function. Sample statement:
set odps.sql.allow.fullscan=true; select distinct sale_date, row_number() over (partition by customer_id order by total_price) as rn from sale_detail;
The following result is returned:
+-----------+------------+ | sale_date | rn | +-----------+------------+ | 2013 | 1 | +-----------+------------+
In the following statement, DISTINCT cannot be used together with GROUP BY. For example, if you execute the following statement, an error is returned.
select distinct shop_name from sale_detail group by shop_name; -- The error message "GROUP BY cannot be used with SELECT DISTINCT" is returned.
Column exclusion expression (except_expr)
except_expr is optional. except_expr
is in the except(col1_name, col2_name, ...)
format. You can use except_expr to read data from most columns in a table and exclude data from a small number of columns in the table. For example, you can execute the select * except(col1_name, col2_name, ...) from ...;
statement to read data from all columns except the col1 and col2 columns.
Sample statement:
-- Read data from all columns, except the region column, in the sale_detail table.
select * except(region) from sale_detail;
The following result is returned:
+-----------+-------------+-------------+-----------+
| shop_name | customer_id | total_price | sale_date |
+-----------+-------------+-------------+-----------+
| s1 | c1 | 100.1 | 2013 |
| s2 | c2 | 100.2 | 2013 |
| s3 | c3 | 100.3 | 2013 |
+-----------+-------------+-------------+-----------+
Column modification expression (replace_expr)
replace_expr is optional. replace_expr
is in the replace(exp1 [as] col1_name, exp2 [as] col2_name, ...)
format. You can use replace_expr to read data from most columns in a table and modify data of a small number of columns in the table. For example, you can execute the select * replace(exp1 as col1_name, exp2 as col2_name, ...) from ...;
statement to replace the data of the col1 column with the calculation result of exp1, and replace the data of the col2 column with the calculation result of exp2 when you read data from a table.
Sample statement:
-- Read data from the sale_detail table and modify the data in the total_price and region columns.
select * replace(total_price+100 as total_price, 'shanghai' as region) from sale_detail;
The following result is returned:
+-----------+-------------+-------------+-----------+--------+
| shop_name | customer_id | total_price | sale_date | region |
+-----------+-------------+-------------+-----------+--------+
| s1 | c1 | 200.1 | 2013 | shanghai |
| s2 | c2 | 200.2 | 2013 | shanghai |
| s3 | c3 | 200.3 | 2013 | shanghai |
+-----------+-------------+-------------+-----------+--------+
Destination table information (table_reference)
table_reference is required. table_reference
specifies the table that you want to query. When you use table_reference, you must comply with the following rules:
Specify the name of a destination table. Sample statement:
select customer_id from sale_detail;
The following result is returned:
+-------------+ | customer_id | +-------------+ | c1 | | c2 | | c3 | +-------------+
Use a nested subquery. Sample statement:
select * from (select region,sale_date from sale_detail) t where region = 'china';
The following result is returned:
+------------+------------+ | region | sale_date | +------------+------------+ | china | 2013 | | china | 2013 | | china | 2013 | +------------+------------+
WHERE clause (where_condition)
where_condition is optional. where_condition
specifies a filter condition. If where_condition is used for a partitioned table, column pruning can be performed. When you use where_condition, you must comply with the following rules:
Use where_condition with relational operators to obtain the data that meets the specified conditions. Relational operators include:
>
,<
,=
,>=
,<=
, and<>
LIKE
andRLIKE
IN
,NOT IN
BETWEEN…AND
For more information, see Relational operators.
The following statement specifies the partitions that you want to scan in
where_condition
. This prevents a full table scan. Sample statement:select * from sale_detail where sale_date >= '2008' and sale_date <= '2014'; -- The preceding statement is equivalent to the following statement: select * from sale_detail where sale_date between '2008' and '2014';
The following result is returned:
+------------+-------------+-------------+------------+------------+ | shop_name | customer_id | total_price | sale_date | region | +------------+-------------+-------------+------------+------------+ | s1 | c1 | 100.1 | 2013 | china | | s2 | c2 | 100.2 | 2013 | china | | s3 | c3 | 100.3 | 2013 | china | +------------+-------------+-------------+------------+------------+
NoteYou can execute the EXPLAIN statement to check whether partition pruning takes effect. A common user-defined function (UDF) or the partition condition settings of JOIN may cause partition pruning to fail. For more information, see Check whether partition pruning is effective.
Use UDF-based partition pruning. If you use UDFs, MaxCompute executes the UDFs as small jobs and backfills partitions with the results of these jobs.
Implementation methods
Add an annotation to a UDF class when you write a UDF.
@com.aliyun.odps.udf.annotation.UdfProperty(isDeterministic=true)
NoteThe UDF annotation
com.aliyun.odps.udf.annotation.UdfProperty
is defined in the odps-sdk-udf.jar file. To use this annotation, you must update the version of the referenced odps-sdk-udf to 0.30.X or later.Add
set odps.sql.udf.ppr.deterministic = true;
before the SQL statement that you want to execute. Then, all UDFs in the SQL statement are considereddeterministic
UDFs. The preceding SET command backfills partitions with the results of jobs. A maximum of 1,000 partitions can be backfilled with the results of jobs. If you add an annotation to a UDF class, an error that indicates more than 1,000 partitions are backfilled may be returned. To ignore this error, you can run theset odps.sql.udf.ppr.to.subquery = false;
command. After you run this command, UDF-based partition pruning is no longer in effect.
Precautions
When you use a UDF to perform partition pruning, the UDF must be placed in the
WHERE
clause that is used to query a table. UDF-based partition pruning takes effect only when you place the UDF in the WHERE clause.Sample statement of correct usage:
--Place a UDF in the WHERE clause that is used to query the source table. select key, value from srcp where udf(ds) = 'xx';
Sample statement of incorrect usage:
--Place a UDF after the ON condition in the JOIN clause. UDF-based partition pruning does not take effect. select A.c1, A.c2 from srcp1 A join srcp2 B on A.c1 = B.c1 and udf(A.ds) ='xx';
In a column expression (select_expr), if the column that is renamed a column alias uses a function, the column alias cannot be referenced in the
WHERE
clause. Incorrect sample statement:select task_name ,inst_id ,settings ,GET_JSON_OBJECT(settings, '$.SKYNET_ID') as skynet_id ,GET_JSON_OBJECT(settings, '$.SKYNET_NODENAME') as user_agent from Information_Schema.TASKS_HISTORY where ds = '20211215' and skynet_id is not null limit 10;
GROUP BY (col_list)
GROUP BY is optional. In most cases, GROUP BY
is used with aggregate functions to group columns based on the specified common columns, partition key columns, or regular expressions. When you use GROUP BY
, you must comply with the following rules:
GROUP BY
takes precedence overSELECT
. Therefore, columns inGROUP BY
can be specified by column names of the input table ofSELECT
or an expression that is formed by columns of the input table of SELECT. When you use GROUP BY, take note of the following points:If columns in
GROUP BY
are specified by a regular expression, the complete expression must be used.The columns that do not use aggregate functions in a
SELECT
statement must be specified inGROUP BY
.
Examples:
The following statement groups table data by the column name region. In this case, data is grouped based on the values of the region column. Sample statement:
select region from sale_detail group by region;
The following result is returned:
+------------+ | region | +------------+ | china | +------------+
The following statement groups table data based on the values of the region column and returns the total sales of each group. Sample statement:
select sum(total_price) from sale_detail group by region;
The following result is returned:
+------------+ | _c0 | +------------+ | 300.6 | +------------+
The following statement groups table data based on the values of the region column and returns distinct values and total sales of each group. Sample statement:
select region, sum (total_price) from sale_detail group by region;
The following result is returned:
+------------+------------+ | region | _c1 | +------------+------------+ | china | 300.6 | +------------+------------+
The following statement groups table data based on the alias of an output column in a
SELECT
statement. Sample statement:select region as r from sale_detail group by r; -- The preceding statement is equivalent to the following statement: select region as r from sale_detail group by region;
The following result is returned:
+------------+ | r | +------------+ | china | +------------+
The following statement groups table data based on the column expression. Sample statement:
select 2 + total_price as r from sale_detail group by 2 + total_price;
The following result is returned:
+------------+ | r | +------------+ | 102.1 | | 102.2 | | 102.3 | +------------+
If some columns in a
SELECT
statement do not use aggregate functions, these columns must be specified inGROUP BY
. Otherwise, an error is returned. Sample statement of incorrect usage:select region, total_price from sale_detail group by region;
Sample statement of correct usage:
select region, total_price from sale_detail group by region, total_price;
The following result is returned:
+------------+-------------+ | region | total_price | +------------+-------------+ | china | 100.1 | | china | 100.2 | | china | 100.3 | +------------+-------------+
If you add the
set hive.groupby.position.alias=true;
command before a SELECT statement, integer constants in theGROUP BY
clause are considered column numbers in aSELECT
statement. Sample statement:-- Run this command with the following SELECT statement. set odps.sql.groupby.position.alias=true; -- 1 indicates the region column, which is the first column read by the following SELECT statement. This statement groups table data based on the values of the region column and returns distinct values of the region column and total sales of each group. select region, sum(total_price) from sale_detail group by 1;
The following result is returned:
+------------+------------+ | region | _c1 | +------------+------------+ | china | 300.6 | +------------+------------+
HAVING clause (having_condition)
having_condition is optional. In most cases, having_condition
is used with aggregate functions to filter data. Sample statement:
-- Insert data into the sale_detail table to display the data rendering effect.
insert into sale_detail partition (sale_date='2014', region='shanghai') values ('null','c5',null),('s6','c6',100.4),('s7','c7',100.5);
-- Use having_condition with aggregate functions to filter data.
select region,sum(total_price) from sale_detail
group by region
having sum(total_price)<305;
The following result is returned:
+------------+------------+
| region | _c1 |
+------------+------------+
| china | 300.6 |
| shanghai | 200.9 |
+------------+------------+
ORDER BY (order_condition)
order_condition is optional. ORDER BY
is used to sort all data records based on a specified common column or partition key column. ORDER BY can also be used to sort all data records based on a specified constant. When you use ORDER BY
, you must comply with the following rules:
By default, data is sorted in ascending order. If you want to sort data in descending order, the
DESC
keyword is required.By default,
ORDER BY
is followed byLIMIT <number>
to limit the number of data rows that are displayed in the output. If ORDER BY is not followed byLIMIT <number>
, an error is returned. You can also work around thislimit
. For more information, see LIMIT .The following statement queries data from the sale_detail table, sorts data records in ascending order based on the values of the total_price column, and then displays the first two records. Sample statement:
select * from sale_detail order by total_price limit 2;
The following result is returned:
+------------+-------------+-------------+------------+------------+ | shop_name | customer_id | total_price | sale_date | region | +------------+-------------+-------------+------------+------------+ | s1 | c1 | 100.1 | 2013 | china | | s2 | c2 | 100.2 | 2013 | china | +------------+-------------+-------------+------------+------------+
The following statement queries data from the sale_detail table, sorts data records in descending order based on the values of the total_price column, and then displays the first two records. Sample statement:
select * from sale_detail order by total_price desc limit 2;
The following result is returned:
+------------+-------------+-------------+------------+------------+ | shop_name | customer_id | total_price | sale_date | region | +------------+-------------+-------------+------------+------------+ | s3 | c3 | 100.3 | 2013 | china | | s2 | c2 | 100.2 | 2013 | china | +------------+-------------+-------------+------------+------------+
NULL is the smallest value when you use
ORDER BY
to sort data. This is also the case in MySQL. However, this is not the case in Oracle.The following statement queries data from the sale_detail table, sorts data records in ascending order based on the values of the total_price column, and then displays the first two records. Sample statement:
select * from sale_detail order by total_price limit 2;
The following result is returned:
+------------+-------------+-------------+------------+------------+ | shop_name | customer_id | total_price | sale_date | region | +------------+-------------+-------------+------------+------------+ | s1 | c1 | 100.1 | 2013 | china | | s2 | c2 | 100.2 | 2013 | china | +------------+-------------+-------------+------------+------------+
ORDER BY
is followed by the alias of an output column of aSELECT
statement. If you do not specify the alias of an output column of aSELECT
statement, the name of this column is used as the alias of this column.The following statement adds the alias of an output column after
ORDER BY
. Sample statement:select total_price as t from sale_detail order by total_price limit 3; -- The preceding statement is equivalent to the following statement: select total_price as t from sale_detail order by t limit 3;
The following result is returned:
+------------+ | t | +------------+ | 100.1 | | 100.2 | | 100.3 | +------------+
If you add the
set hive.orderby.position.alias=true;
command before a SELECT statement, integer constants in theORDER BY
clause are considered column numbers in theSELECT
statement. Sample statement:-- Run this command with the following SELECT statement. set odps.sql.orderby.position.alias=true; select * from sale_detail order by 3 limit 3;
The following result is returned:
+------------+-------------+-------------+------------+------------+ | shop_name | customer_id | total_price | sale_date | region | +------------+-------------+-------------+------------+------------+ | s1 | c1 | 100.1 | 2013 | china | | s2 | c2 | 100.2 | 2013 | china | | s3 | c3 | 100.3 | 2013 | china | +------------+-------------+-------------+------------+------------+
An
OFFSET
clause can be used with anORDER BY...LIMIT
clause to specify the number of rows to skip. The format isORDER BY...LIMIT m OFFSET n
, which can be abbreviated asORDER BY...LIMIT n, m
.LIMIT m
specifies that m rows of data are returned.OFFSET n
specifies that n rows are skipped before data is returned. If you do not want to skip rows, you can useOFFSET 0
in the statement that you want to execute. You can also execute the statement without specifying an OFFSET clause.The following statement sorts the data of the sale_detail table in ascending order based on the values of the total_price column and displays three rows of data starting from the third row. Sample statement:
select customer_id,total_price from sale_detail order by total_price limit 3 offset 2; -- The preceding statement is equivalent to the following statement: select customer_id,total_price from sale_detail order by total_price limit 2, 3;
The following result is returned:
+-------------+-------------+ | customer_id | total_price | +-------------+-------------+ | c3 | 100.3 | +-------------+-------------+
The queried data contains only three rows of data. In this case, only the third row is returned.
Range clustering can also be used for global sorting acceleration. In common scenarios in which ORDER BY is used, all sorted data is distributed to the same instance to ensure global sorting. However, concurrent processing cannot be fully utilized in these scenarios. You can use the partitioning step of range clustering to implement concurrent global sorting. If you want to perform global sorting, you must sample data and divide the data into ranges, sort data in each range in parallel, and then obtain the result of global sorting. For more information, see Global sorting acceleration.
DISTRIBUTE BY hash partition (distribute_condition)
distribute_condition is optional. DISTRIBUTE BY
is used to perform hash partitioning on data based on the values of specific columns.
DISTRIBUTE BY
controls how the output of a mapper is distributed among reducers. To prevent the same data from being distributed to different reducers, you can use DISTRIBUTE BY
. This ensures that the same group of data is distributed to the same reducer.
The alias of an output column of a SELECT
statement must be specified. If you execute a SELECT
statement to query data of a column and the alias of this column is not specified, the column name is used as the alias. Sample statement:
-- The following statement queries the values of the region column from the sale_detail table and performs hash partitioning on data based on the values of the region column.
select region from sale_detail distribute by region;
-- The preceding statement is equivalent to the following statements:
select region as r from sale_detail distribute by region;
select region as r from sale_detail distribute by r;
SORT BY (sort_condition)
sort_condition is optional. In most cases, SORT BY is used with DISTRIBUTE BY
. When you use SORT BY
, you must comply with the following rules:
By default, data is sorted in ascending order. If you want to sort data in descending order, the
DESC
keyword is required.If
SORT BY
is preceded byDISTRIBUTE BY
,SORT BY
sorts the result ofDISTRIBUTE BY
based on the values of a specified column.The following statements query the values of the region and total_price columns from the sale_detail table, perform hash partitioning on the query results based on the values of the region column, and then sort the partitioning results in ascending order based on the values of the total_price column. Sample statement:
-- Insert data into the sale_detail table to display the data rendering effect. insert into sale_detail partition (sale_date='2014', region='shanghai') values ('null','c5',null),('s6','c6',100.4),('s7','c7',100.5); select region,total_price from sale_detail distribute by region sort by total_price;
The following result is returned:
+------------+-------------+ | region | total_price | +------------+-------------+ | shanghai | NULL | | china | 100.1 | | china | 100.2 | | china | 100.3 | | shanghai | 100.4 | | shanghai | 100.5 | +------------+-------------+
The following statement queries the values of the region and total_price columns from the sale_detail table, performs hash partitioning on the query results based on the values of the region column, and then sorts the partitioning results in descending order based on the values of the total_price column. Sample statement:
select region,total_price from sale_detail distribute by region sort by total_price desc;
The following result is returned:
+------------+-------------+ | region | total_price | +------------+-------------+ | shanghai | 100.5 | | shanghai | 100.4 | | china | 100.3 | | china | 100.2 | | china | 100.1 | | shanghai | NULL | +------------+-------------+
If
SORT BY
is not preceded byDISTRIBUTE BY
,SORT BY
sorts the data that is distributed to each reducer.This ensures that the output data of each reducer is sorted in order and increases the storage compression ratio. If data is filtered during data reading, this method reduces the amount of data that is read from disks and improves the efficiency of subsequent global sorting. Sample statement:
select region,total_price from sale_detail sort by total_price desc;
The following result is returned:
+------------+-------------+ | region | total_price | +------------+-------------+ | china | 100.3 | | china | 100.2 | | china | 100.1 | | shanghai | 100.5 | | shanghai | 100.4 | | shanghai | NULL | +------------+-------------+
Columns in an
ORDER BY, DISTRIBUTE BY, or SORT BY
clause must be specified by the aliases of the output columns in aSELECT
statement. Column aliases can be Chinese.In MaxCompute, an
ORDER BY, DISTRIBUTE BY, or SORT BY
clause is executed after aSELECT
statement. Therefore, columns in ORDER BY, DISTRIBUTE BY, or SORT BY must be specified by the aliases of the output columns in theSELECT
statement.ORDER BY
cannot be used at the same time withDISTRIBUTE BY
orSORT BY
. Similarly,GROUP BY
cannot be used at the same time withDISTRIBUTE BY
orSORT BY
.
LIMIT <number>
LIMIT <number> is optional. The number
in the limit <number>
is a constant that restricts the number of rows that can be displayed. The value of number is a 32-bit integer, and the maximum value is 2,147,483,647.
LIMIT <number>
is used to scan and filter data for a distributed query system. When you use LIMIT <number>
, the amount of data returned is not reduced. Therefore, computing costs are not reduced.
The following sections describe the limits of LIMIT <number> and how to work around these limits.
ORDER BY
must be used withLIMIT <number>
.ORDER BY
sorts all data of a single node. By default, ORDER BY is used withLIMIT <number>
to prevent a single node from processing large amounts of data. You can work around thislimit
by using the following methods:To work around the
limit
for a project, run thesetproject odps.sql.validate.orderby.limit=false;
command.To work around the limit for a session, commit and run the
set odps.sql.validate.orderby.limit=false;
command with the SQL statement that you want to execute.NoteAfter you work around this
limit
, if a single node has large amounts of data to sort, more resources and time are consumed.
Limited rows are displayed.
If you execute a
SELECT
statement without aLIMIT <number>
clause or the number specified in theLIMIT <number>
clause exceeds the maximum number (n) of rows that can be displayed, a maximum of n rows can be displayed.The maximum number of rows that can be displayed varies based on projects. You can use one of the following methods to control the maximum number:
If project data protection is disabled, modify the odpscmd config.ini file.
Set
use_instance_tunnel
to true in the odpscmd config.ini file. If theinstance_tunnel_max_record
parameter is not configured, the number of rows that can be displayed is not limited. Otherwise, the number of rows that can be displayed is limited by theinstance_tunnel_max_record
parameter. The maximum value of theinstance_tunnel_max_record
parameter is 10000. For more information about InstanceTunnel, see Usage notes.If project data protection is enabled, the number of rows that can be displayed is limited by the
READ_TABLE_MAX_ROW
parameter. The maximum value of this parameter is 10000.
NoteYou can run the
SHOW SecurityConfiguration;
command to view the value ofProjectProtection
. IfProjectProtection
is set to true, you can determine whether to disable project data protection based on your business requirements. You can run theset ProjectProtection=false;
command to disable project data protection. By default,ProjectProtection
is set to false. For more information about project data protection, see Project data protection.
Window clause (window_clause)
For more information about the window clause, see Syntax.
Split size hint
The split size hint is optional. You can modify the split size to control the parallelism of subtasks and adjust the computing performance. Split sizes can be applied to tables. The unit of a split size is MB. The default value is 256 MB.
Scenarios
If a large number of subtasks are waiting for resources in a job but no resources can be allocated to the subtasks, you can increase the split size to reduce the parallelism of subtasks. This way, the time for starting and stopping subtasks is reduced.
If the parallelism of subtasks is low, such as hundreds of subtasks, and the current subtask does not return results within an expected period of time, check whether some resources in the resource pool are available. If sufficient resources are available, you can decrease the split size to increase the parallelism of subtasks. This can reduce the duration for which the job runs.
Precautions
If you use the split size hint for a clustered table and bucketing operations are performed on the table for optimizing the computing performance, the split size hint is invalid.
You can change the value of the split size to a value that is a factor or multiple of 256 MB, such as 128 MB or 512 MB.
If data in a table is read multiple times in an SQL statement, the smallest split size is used for splitting. For example, the src table is read twice:
If one split size is set to 1 MB and the other split size is set to 10 MB, 1 MB is used for splitting.
If one split size is set to 1 MB and the other split size is not configured, 1 MB is used for splitting.
Sample statement:
-- Set the split size to 1 MB. This setting indicates that a job is split into subtasks based on a size of 1 MB when data in the src table is read. select a.key from src a /*+split_size(1)*/ join src2 b on a.key=b.key;