All Products
Search
Document Center

MaxCompute:Subqueries

Last Updated:Nov 15, 2024

You can use a subquery if you want to perform a further query based on the result of a query. For example, you can use subqueries if you want to separately calculate aggregate values, check whether a record exists, filter data returned by a query, associate with an update or delete operation, simplify a JOIN operation to obtain a value, use a query result as a derived table for a main query, or use a query result as the basis for sorting or grouping and row-by-row comparison. This topic describes the definition and usage of subqueries that are supported in MaxCompute.

Description

Subqueries are nested inside a statement to perform complex data queries. MaxCompute supports the following types of subqueries:

  • Basic subquery

    A subquery in a FROM clause is used as a temporary table for complex computing or data conversions in queries.

  • IN SUBQUERY

    You can use IN subqueries in the WHERE clause if you want to match a group of values that are returned by a subquery. IN subqueries are applicable if you want to query data from rows that match specified conditions.

  • NOT IN SUBQUERY

    NOT IN subqueries are used to exclude one collection from another. If you use a NOT IN subquery in the WHERE clause, rows that match the conditions of the subquery are removed from the result of the main query.

  • EXISTS SUBQUERY

    EXISTS subqueries are used in main queries to check whether subqueries return any rows. EXISTS subqueries are applicable if you want to check whether a record exists in a subquery regardless of the returned content.

  • NOT EXISTS SUBQUERY

    NOT EXISTS subqueries work opposite to EXISTS subqueries. Records in main queries are selected only if subqueries return no rows. NOT EXISTS subqueries are applicable if you want to query rows that are not matched in subqueries.

  • SCALAR SUBQUERY

    Scalar subqueries are subqueries that return exactly one column value from one row. In most cases, you can use scalar subqueries in SELECT statements or WHERE or HAVING clauses. Scalar subqueries are applicable if you want to calculate a specific aggregate value or obtain a column value from a row.

Note

Some subqueries such as SCALAR, IN, NOT IN, EXISTS, and NOT EXISTS can be converted into JOIN operations during execution. MAPJOIN is an efficient JOIN algorithm. If the result of the SUBQUERY is a small table, you can use a HINT in the subquery statement to explicitly specify the MAPJOIN algorithm.

Sample data

Sample source data is provided for you to better understand the examples in this topic. Sample statements:

-- Create a partitioned table named sale_detail. 
create table if not exists sale_detail
(
shop_name     string,
customer_id   string,
total_price   double
)
partitioned by (sale_date string, region string);

-- Add partitions to the sale_detail table. 
alter table sale_detail add partition (sale_date='2013', region='china') partition (sale_date='2014', region='shanghai');

-- Insert data into the sale_detail table. 
insert into sale_detail partition (sale_date='2013', region='china') values ('s1','c1',100.1),('s2','c2',100.2),('s3','c3',100.3);
insert into sale_detail partition (sale_date='2014', region='shanghai') values ('null','c5',null),('s6','c6',100.4),('s7','c7',100.5);

Query data in the sale_detail table. Sample statement:

set odps.sql.allow.fullscan=true;
select * from sale_detail; 
-- The following result is returned: 
+------------+-------------+-------------+------------+------------+
| shop_name  | customer_id | total_price | sale_date  | region     |
+------------+-------------+-------------+------------+------------+
| s1         | c1          | 100.1       | 2013       | china      |
| s2         | c2          | 100.2       | 2013       | china      |
| s3         | c3          | 100.3       | 2013       | china      |
| null       | c5          | NULL        | 2014       | shanghai   |
| s6         | c6          | 100.4       | 2014       | shanghai   |
| s7         | c7          | 100.5       | 2014       | shanghai   |
+------------+-------------+-------------+------------+------------+

Basic subquery

The object of a common query is a destination table. You can also use a SELECT statement as a query object. In this case, the SELECT statement is considered a subquery. A subquery in a FROM clause can be used as a table. You can join the subquery with other tables or subqueries. For more information about JOIN operations, see JOIN.

  • Syntax

    • Syntax 1

      select <select_expr> from (<select_statement>) [<sq_alias_name>];
    • Syntax 2

      select (<select_statement>) from <table_name>;
  • Parameters

    • select_expr: required. The value of this parameter is in the format of col1_name, col2_name, Regular expression,.... This format indicates common columns or partition key columns that you want to query or regular expressions that are used for a query.

    • select_statement: required. This parameter specifies a subquery clause. If you use Syntax 2, the subquery result can have only one row. For more information about the syntax, see SELECT syntax.

    • sq_alias_name: optional. This parameter specifies the alias of a subquery.

    • table_name: required. This parameter specifies the name of the table that you want to query.

  • Examples

    • Example 1: Use Syntax 1. Sample statements:

      set odps.sql.allow.fullscan=true;
      select * from (select shop_name from sale_detail) a;

      The following result is returned:

      +------------+
      | shop_name  |
      +------------+
      | s1         |
      | s2         |
      | s3         |
      | null       |
      | s6         |
      | s7         |
      +------------+
    • Example 2: Use Syntax 2. Sample statements:

      set odps.sql.allow.fullscan=true;
      select (select * from sale_detail where shop_name='s1') from sale_detail;

      The following result is returned:

      +------------+-------------+-------------+------------+------------+
      | shop_name  | customer_id | total_price | sale_date  | region     |
      +------------+-------------+-------------+------------+------------+
      | s1         | c1          | 100.1       | 2013       | china      |
      | s1         | c1          | 100.1       | 2013       | china      |
      | s1         | c1          | 100.1       | 2013       | china      |
      | s1         | c1          | 100.1       | 2013       | china      |
      | s1         | c1          | 100.1       | 2013       | china      |
      | s1         | c1          | 100.1       | 2013       | china      |
      +------------+-------------+-------------+------------+------------+
    • Example 3: Use Syntax 1. In this example, a subquery in a FROM clause is used as a table, and the subquery is joined with other tables or subqueries. Sample statements:

      -- Create a table and join the table with a subquery. 
      create table shop as select shop_name,customer_id,total_price from sale_detail;
      select a.shop_name, a.customer_id, a.total_price from
      (select * from shop) a join sale_detail on a.shop_name = sale_detail.shop_name;

      The following result is returned:

      +------------+-------------+-------------+
      | shop_name  | customer_id | total_price |
      +------------+-------------+-------------+
      | null       | c5          | NULL        |
      | s6         | c6          | 100.4       |
      | s7         | c7          | 100.5       |
      | s1         | c1          | 100.1       |
      | s2         | c2          | 100.2       |
      | s3         | c3          | 100.3       |
      +------------+-------------+-------------+

IN SUBQUERY

IN SUBQUERY is used in a similar manner to LEFT SEMI JOIN.

  • Syntax

    • Syntax 1

      select<select_expr1>from<table_name1>where<select_expr2>in(select<select_expr3>from<table_name2>);
      -- The preceding statement is equivalent to the following statement with LEFT SEMI JOIN: 
      select<select_expr1>from<table_name1><alias_name1>leftsemijoin<table_name2><alias_name2>on<alias_name1>.<select_expr2>=<alias_name2>.<select_expr3>;
      Note

      If select_expr2 specifies partition key columns, select <select_expr2> from <table_name2> is not converted into LEFT SEMI JOIN. A separate job is started to run a subquery. MaxCompute compares the subquery results with the columns that you specify in select_expr2 in sequence. If the partitions of the table specified by table_name1 contain the columns in select_expr2 and these columns are not included in the results, MaxCompute does not read data from these partitions. This ensures that partition pruning is still valid.

    • Syntax 2

      MaxCompute supports IN SUBQUERY and correlated conditions. where <table_name2_colname> = <table_name1>.<colname> is a correlated condition. MaxCompute V1.0 does not support expressions that reference source tables from both subqueries and main queries. MaxCompute V2.0 supports such expressions. These filter conditions are part of the ON condition in SEMI JOIN operations.

      select<select_expr1>from<table_name1>where<select_expr2>in(select<select_expr3>from<table_name2>where
      <table_name1>.<col_name>=<table_name2>.<col_name>);
      Note

      MaxCompute supports IN SUBQUERY that does not serve as a JOIN condition. For example, a non-WHERE clause uses IN SUBQUERY, or a WHERE clause uses IN SUBQUERY that cannot be converted into a JOIN condition. In this case, IN SUBQUERY cannot be converted into SEMI JOIN. A separate job must be started to run a subquery. Correlated conditions are not supported.

    • Syntax 3

      IN SUBQUERY supports multi-column subqueries based on the preceding capabilities and limits. This rule also applies to PostgreSQL. If you use Syntax 3 for IN SUBQUERY, you do not need to split a query into subqueries. Multi-column subqueries reduce one JOIN operation and save computing resources. You can use multi-column subqueries in the following ways:

      • Use a simple SELECT statement in which you specify multiple columns for the IN SUBQUERY expression.

      • Use aggregate functions for the IN SUBQUERY expression. For more information about aggregate functions, see Aggregate functions.

      • Use constants for the IN SUBQUERY expression.

  • Parameters

    • select_expr1: required. The value of this parameter is in the format of col1_name, col2_name, Regular expression,.... This parameter specifies the common columns or partition key columns that you want to query or regular expressions that are used for a query.

    • table_name1 and table_name2: required. The parameters specify the names of the tables.

    • select_expr2 and select_expr3: required. The parameters specify the names of the columns in the tables specified by table_name1 and table_name2. Columns in the two tables are mapped to each other.

    • col_name: required. This parameter specifies the name of a column in the table.

  • Usage notes

    If you use the IN SUBQUERY expression, null values are automatically excluded from the returned result of the subquery.

  • Examples

    • Example 1: Use Syntax 1. Sample statements:

      set odps.sql.allow.fullscan=true;
      select * from sale_detail where total_price in (select total_price from shop);

      The following result is returned:

      +-----------+-------------+-------------+-----------+--------+
      | shop_name | customer_id | total_price | sale_date | region |
      +-----------+-------------+-------------+-----------+--------+
      | s1        | c1          | 100.1       | 2013      | china  |
      | s2        | c2          | 100.2       | 2013      | china  |
      | s3        | c3          | 100.3       | 2013      | china  |
      | s6        | c6          | 100.4       | 2014      | shanghai |
      | s7        | c7          | 100.5       | 2014      | shanghai |
      +-----------+-------------+-------------+-----------+--------+
    • Example 2: Use Syntax 2. Sample statements:

      set odps.sql.allow.fullscan=true;
      select * from sale_detail where total_price in (select total_price from shop where customer_id = shop.customer_id);

      The following result is returned:

      +-----------+-------------+-------------+-----------+--------+
      | shop_name | customer_id | total_price | sale_date | region |
      +-----------+-------------+-------------+-----------+--------+
      | s1        | c1          | 100.1       | 2013      | china  |
      | s2        | c2          | 100.2       | 2013      | china  |
      | s3        | c3          | 100.3       | 2013      | china  |
      | s6        | c6          | 100.4       | 2014      | shanghai |
      | s7        | c7          | 100.5       | 2014      | shanghai |
      +-----------+-------------+-------------+-----------+--------+
    • Example 3: Multiple columns are specified in a SELECT statement for a subquery. Sample statements:

      -- Sample data is reconstructed to help you understand this example. 
      create table if not exists t1(a bigint,b bigint,c bigint,d bigint,e bigint);
      create table if not exists t2(a bigint,b bigint,c bigint,d bigint,e bigint);
      insert into table t1 values (1,3,2,1,1),(2,2,1,3,1),(3,1,1,1,1),(2,1,1,0,1),(1,1,1,0,1);
      insert into table t2 values (1,3,5,0,1),(2,2,3,1,1),(3,1,1,0,1),(2,1,1,0,1),(1,1,1,0,1);
      -- Scenario 1: The IN SUBQUERY expression is a simple SELECT statement in which you specify multiple columns. 
      select a, b from t1 where (c, d) in (select a, b from t2 where e = t1.e);
      -- The following result is returned: 
      +------------+------------+
      | a          | b          |
      +------------+------------+
      | 1          | 3          |
      | 2          | 2          |
      | 3          | 1          |
      +------------+------------+
      -- Scenario 2: The IN SUBQUERY expression uses aggregate functions. 
      select a, b from t1 where (c, d) in (select max(a), b from t2 where e = t1.e group by b having max(a) > 0);
      -- The following result is returned: 
      +------------+------------+
      | a          | b          |
      +------------+------------+
      | 2          | 2          |
      +------------+------------+
      -- Scenario 3: The IN SUBQUERY expression uses constants. 
      select a, b from t1 where (c, d) in ((1, 3), (1, 1));
      -- The following result is returned: 
      +------------+------------+
      | a          | b          |
      +------------+------------+
      | 2          | 2          |
      | 3          | 1          |
      +------------+------------+

NOT IN SUBQUERY

NOT IN SUBQUERY is used in a similar manner to LEFT ANTI JOIN. However, if the values of a row are NULL for a specified column in the table that you want to query, the value of the expression in NOT IN SUBQUERY is NULL. In this case, the WHERE condition is invalid, and no data is returned. This processing logic is different from that of LEFT ANTI JOIN.

  • Syntax

    • Syntax 1

      select <select_expr1> from <table_name1> where <select_expr2> not in (select <select_expr2> from <table_name2>);
      -- The preceding statement is equivalent to the following statement with LEFT ANTI JOIN. 
      select <select_expr1> from <table_name1> <alias_name1> left anti join <table_name2> <alias_name2> on <alias_name1>.<select_expr1> = <alias_name2>.<select_expr2>;
      Note

      If select_expr2 specifies partition key columns, select <select_expr2> from <table_name2> is not converted into LEFT ANTI JOIN. A separate job is started to run a subquery. MaxCompute compares the subquery results with the columns specified in select_expr2 in sequence. If the partitions of the table specified by table_name1 contain the columns in select_expr2 and these columns are not included in the results, MaxCompute does not read data from these partitions. This ensures that partition pruning is still valid.

    • Syntax 2

      MaxCompute supports NOT IN SUBQUERY and correlated conditions. where <table_name2_colname> = <table_name1>.<colname> is a correlated condition. MaxCompute V1.0 does not support expressions that reference source tables from both subqueries and main queries. MaxCompute V2.0 supports such expressions. These expressions are part of the ON condition in ANTI JOIN operations.

      select <select_expr1> from <table_name1> where <select_expr2> not in (select <select_expr2> from <table_name2> where <table_name2_colname> = <table_name1>.<colname>);
      Note

      MaxCompute supports NOT IN SUBQUERY that does not serve as a JOIN condition. For example, a non-WHERE clause uses NOT IN SUBQUERY, or a WHERE clause uses NOT IN SUBQUERY that cannot be converted into a JOIN condition. In this case, NOT IN SUBQUERY cannot be converted into ANTI JOIN. A separate job must be started to run a subquery. Correlated conditions are not supported.

    • Syntax 3

      NOT IN SUBQUERY supports multi-column subqueries based on the preceding capabilities and limits. This rule also applies to PostgreSQL. If you use Syntax 3 for NOT IN SUBQUERY, you do not need to split a query into multiple subqueries. Multi-column subqueries reduce one JOIN operation and save computing resources. You can use multi-column subqueries in the following ways:

      • Use a simple SELECT statement in which you specify multiple columns for the NOT IN SUBQUERY expression.

      • Use aggregate functions for the NOT IN SUBQUERY expression. For more information about aggregate functions, see Aggregate functions.

      • Use constants for the NOT IN SUBQUERY expression.

  • Parameters

    • select_expr1: required. The value of this parameter is in the format of col1_name, col2_name, Regular expression,.... This parameter specifies the common columns or partition key columns that you want to query or regular expressions that are used for a query.

    • table_name1 and table_name2: required. The parameters specify the names of the tables.

    • select_expr2 and select_expr3: required. The parameters specify the names of the columns in the tables specified by table_name1 and table_name2. Columns in the two tables are mapped to each other.

    • col_name: required. This parameter specifies the name of a column in the table.

  • Usage notes

    If you use the NOT IN SUBQUERY expression, null values are automatically excluded from the returned result of the subquery.

  • Examples

    • Example 1: Use Syntax 1. Sample statements:

      -- Create a table named shop1 and insert data into the table. 
      create table shop1 as select shop_name,customer_id,total_price from sale_detail;
      insert into shop1 values ('s8','c1',100.1);
      
      select * from shop1 where shop_name not in (select shop_name from sale_detail);

      The following result is returned:

      +------------+-------------+-------------+
      | shop_name  | customer_id | total_price |
      +------------+-------------+-------------+
      | s8         | c1          | 100.1       |
      +------------+-------------+-------------+
    • Example 2: Use Syntax 2. Sample statements:

      set odps.sql.allow.fullscan=true;
      select * from shop1 where shop_name not in (select shop_name from sale_detail where customer_id = shop1.customer_id);

      The following result is returned:

      +------------+-------------+-------------+
      | shop_name  | customer_id | total_price |
      +------------+-------------+-------------+
      | s8         | c1          | 100.1       |
      +------------+-------------+-------------+
    • Example 3: NOT IN SUBQUERY does not serve as a JOIN condition. Sample statements:

      set odps.sql.allow.fullscan=true;
      select * from shop1 where shop_name not in (select shop_name from sale_detail) and total_price < 100.3;

      NOT IN SUBQUERY cannot be converted into ANTI JOIN. This is because the WHERE clause includes an AND operator. A separate job is started to run a subquery.

      The following result is returned:

      +------------+-------------+-------------+
      | shop_name  | customer_id | total_price |
      +------------+-------------+-------------+
      | s8         | c1          | 100.1       |
      +------------+-------------+-------------+
    • Example 4: If the values of a row in the table from which you want to query data are NULL, no data is returned. Sample statements:

      -- Create a table named sale and insert data into the table. 
      create table if not exists sale
      (
      shop_name     string,
      customer_id   string,
      total_price   double
      )
      partitioned by (sale_date string, region string);
      alter table sale add partition (sale_date='2013', region='china');
      insert into sale partition (sale_date='2013', region='china') values ('null','null',null),('s2','c2',100.2),('s3','c3',100.3),('s8','c8',100.8);
      
      set odps.sql.allow.fullscan=true;
      select * from sale where shop_name not in (select shop_name from sale_detail);

      The following result is returned:

      +------------+-------------+-------------+------------+------------+
      | shop_name  | customer_id | total_price | sale_date  | region     |
      +------------+-------------+-------------+------------+------------+
      +------------+-------------+-------------+------------+------------+
    • Example 5: Multiple columns are specified in a SELECT statement for a subquery. Sample statements:

      -- Sample data is reconstructed to help you understand this example. The sample data is the same as that of IN SUBQUERY. 
      create table if not exists t1(a bigint,b bigint,c bigint,d bigint,e bigint);
      create table if not exists t2(a bigint,b bigint,c bigint,d bigint,e bigint);
      insert into table t1 values (1,3,2,1,1),(2,2,1,3,1),(3,1,1,1,1),(2,1,1,0,1),(1,1,1,0,1);
      insert into table t2 values (1,3,5,0,1),(2,2,3,1,1),(3,1,1,0,1),(2,1,1,0,1),(1,1,1,0,1);
      -- Scenario 1: The NOT IN SUBQUERY expression is a simple SELECT statement in which you specify multiple columns. 
      select a, b from t1 where (c, d) not in (select a, b from t2 where e = t1.e);
      -- The following result is returned: 
      +------------+------------+
      | a          | b          |
      +------------+------------+
      | 2          | 1          |
      | 1          | 1          |
      +------------+------------+
      -- Scenario 2: The NOT IN SUBQUERY expression uses aggregate functions. 
      select a, b from t1 where (c, d) not in (select max(a), b from t2 where e = t1.e group by b having max(a) > 0);
      -- The following result is returned: 
      +------------+------------+
      | a          | b          |
      +------------+------------+
      | 1          | 3          |
      | 3          | 1          |
      | 2          | 1          |
      | 1          | 1          |
      +------------+------------+
      -- Scenario 3: The NOT IN SUBQUERY expression uses constants. 
      select a, b from t1 where (c, d) not in ((1, 3), (1, 1));
      -- The following result is returned: 
      +------------+------------+
      | a          | b          |
      +------------+------------+
      | 1          | 3          |
      | 2          | 1          |
      | 1          | 1          |
      +------------+------------+

EXISTS SUBQUERY

When you use an EXISTS SUBQUERY clause, if the subquery returns at least one row of data, True is returned. If the subquery does not return data, False is returned.

MaxCompute supports only the WHERE subqueries that have correlated conditions. To use an EXISTS SUBQUERY clause, you must convert this clause into LEFT SEMI JOIN.

  • Syntax

    select <select_expr> from <table_name1> where exists (select <select_expr> from <table_name2> where <table_name2_colname> = <table_name1>.<colname>);
  • Parameters

    • select_expr: required. The value of this parameter is in the format of col1_name, col2_name, Regular expression,.... This format indicates common columns or partition key columns that you want to query or regular expressions that are used for a query.

    • table_name1 and table_name2: required. The parameters specify the names of the tables.

    • col_name: required. This parameter specifies the name of a column in the table.

  • Usage notes

    If you use the EXISTS SUBQUERY clause, null values are automatically excluded from the returned result of the subquery.

  • Examples

    set odps.sql.allow.fullscan=true;
    select * from sale_detail where exists (select * from shop where customer_id = sale_detail.customer_id);
    -- The preceding statement is equivalent to the following statement: 
    select * from sale_detail a left semi join shop b on a.customer_id = b.customer_id;

    The following result is returned:

    +------------+-------------+-------------+------------+------------+
    | shop_name  | customer_id | total_price | sale_date  | region     |
    +------------+-------------+-------------+------------+------------+
    | null       | c5          | NULL        | 2014       | shanghai   |
    | s6         | c6          | 100.4       | 2014       | shanghai   |
    | s7         | c7          | 100.5       | 2014       | shanghai   |
    | s1         | c1          | 100.1       | 2013       | china      |
    | s2         | c2          | 100.2       | 2013       | china      |
    | s3         | c3          | 100.3       | 2013       | china      |
    +------------+-------------+-------------+------------+------------+

NOT EXISTS SUBQUERY

When you use a NOT EXISTS SUBQUERY clause, if no data is returned, True is returned. Otherwise, False is returned.

MaxCompute supports only the WHERE subqueries that have correlated conditions. To use a NOT EXISTS SUBQUERY clause, you must convert this clause into LEFT ANTI JOIN.

  • Syntax

    select <select_expr> from <table_name1> where not exists (select <select_expr> from <table_name2> where <table_name2_colname> = <table_name1>.<colname>);
  • Parameters

    • select_expr: required. The value of this parameter is in the format of col1_name, col2_name, Regular expression,.... This format indicates common columns or partition key columns that you want to query or regular expressions that are used for a query.

    • table_name1 and table_name2: required. The parameters specify the names of the tables.

    • col_name: required. This parameter specifies the name of a column in the table.

  • Usage notes

    If you use the NOT EXISTS SUBQUERY clause, null values are automatically excluded from the returned result of the subquery.

  • Examples

    set odps.sql.allow.fullscan=true;
    select * from sale_detail where not exists (select * from shop where shop_name = sale_detail.shop_name);
    -- The preceding statement is equivalent to the following statement: 
    select * from sale_detail a left anti join shop b on a.shop_name = b.shop_name;

    The following result is returned:

    +------------+-------------+-------------+------------+------------+
    | shop_name  | customer_id | total_price | sale_date  | region     |
    +------------+-------------+-------------+------------+------------+
    +------------+-------------+-------------+------------+------------+

SCALAR SUBQUERY

If the output result of a SCALAR SUBQUERY clause contains only one row and one column of data, the result can be used as a scalar for data computations.

All SCALAR SUBQUERY clauses whose output result contains only one row and one column of data can be rewritten based on the following syntax. If the output result of SCALAR SUBQUERY contains only one row of data and one MAX or MIN operator is nested outside SCALAR SUBQUERY, the result does not change.

  • Syntax

    select <select_expr> from <table_name1> where (<select count(*) from <table_name2> where <table_name2_colname> = <table_name1>.<colname>) <Scalar operator> <scalar_value>;
    -- The preceding statement is equivalent to the following statement: 
    select <table_name1>.<select_expr> from <table_name1> left semi join (select <colname>, count(*) from <table_name2> group by <colname> having count(*) <Scalar operator> <scalar_value>) <table_name2> on <table_name1>.<colname> = <table_name2>.<colname>;
    Note
    • The output result of select count(*) from <table_name2> where <table_name2_colname> = <table_name1>.<colname> is a row set. The output contains only one row and one column of data. In this case, the result can be used as a scalar. In practical application, SCALAR SUBQUERY is converted into JOIN as much as possible.

    • The output result of SCALAR SUBQUERY can be used as a scalar only if you can confirm in the compilation phase that SCALAR SUBQUERY returns only one row and one column of data. If you cannot make this confirmation until the running phase, the compiler reports an error. The compiler can compile the statements that meet the following requirements:

      • The SELECT list of SCALAR SUBQUERY uses aggregate functions that are not included in the parameters of a specified user-defined table-valued function (UDTF).

      • SCALAR SUBQUERY that uses aggregate functions does not include a GROUP BY clause.

    SCALAR SUBQUERY also supports multi-column subqueries based on the preceding capabilities and limits.

    • A SELECT list is a SCALAR SUBQUERY expression in which you specify multiple columns. The expression must be an equality expression.

    • Columns in a SELECT list can be an expression of the BOOLEAN type. Only equivalent comparison is supported.

    • A WHERE clause supports multi-column comparison. Only equivalent comparison is supported.

  • Parameters

    • select_expr: required. The value of this parameter is in the format of col1_name, col2_name, Regular expression,.... This format indicates common columns or partition key columns that you want to query or regular expressions that are used for a query.

    • table_name1 and table_name2: required. The parameters specify the names of the tables.

    • col_name: required. This parameter specifies the name of a column in the table.

    • Scalar operator: required. The scalar operator can be greater than (>), less than (<), equal to (=), greater than or equal to (>=), or less than or equal to (<=).

    • scalar_value: required. This parameter specifies a scalar value.

  • Limits

    • SCALAR SUBQUERY can reference columns from main queries. If SCALAR SUBQUERY uses multiple-level nesting, only the outermost column can be referenced.

      -- Sample statement that you can execute: 
      select * from t1 where (select count(*) from t2 where t1.a = t2.a) = 3; 
      -- Sample statement that you cannot execute. This is because columns from main queries cannot be referenced in a SELECT statement for a subquery. 
      select * from t1 where (select count(*) from t2 where (select count(*) from t3 where t3.a = t1.a) = 2) = 3; 
    • SCALAR SUBQUERY can be used only in a WHERE clause.

      -- Scalar subqueries cannot be referenced in a SELECT statement for a subquery. 
      select * from t1 where (select t1.b + count(*) from t2) = 3; 
      -- The columns returned by the SELECT statement cannot reference outer columns. 
      select(selectcount(t1.a)fromt2wheret2.a=t1.a)fromt1;
      select(selectt1.afromt2wheret2.a=t1.a)fromt1;
  • Examples

    • Example 1: Common usage. Sample statements:

      set odps.sql.allow.fullscan=true;
      select * from shop where (select count(*) from sale_detail where sale_detail.shop_name = shop.shop_name) >= 1;

      The following result is returned:

      +------------+-------------+-------------+
      | shop_name  | customer_id | total_price |
      +------------+-------------+-------------+
      | s1         | c1          | 100.1       |
      | s2         | c2          | 100.2       |
      | s3         | c3          | 100.3       |
      | null       | c5          | NULL        |
      | s6         | c6          | 100.4       |
      | s7         | c7          | 100.5       |
      +------------+-------------+-------------+
    • Example 2: Multiple columns are specified in a SELECT statement for a subquery. Sample statements:

      -- Sample data is reconstructed to help you understand this example. 
      create table if not exists ts(a bigint,b bigint,c double);
      create table if not exists t(a bigint,b bigint,c double);
      insert into table ts values (1,3,4.0),(1,3,3.0);
      insert into table t values (1,3,4.0),(1,3,5.0);
      -- Scenario 1: The SELECT statement contains a scalar subquery expression in which multiple columns are specified. The expression must be an equality expression. Sample statements that you cannot execute: select (select a, b from t where c > ts.c) as (a, b), a from ts;
      select (select a, b from t where c = ts.c) as (a, b), a from ts;
      -- The following result is returned: 
      +------------+------------+------------+
      | a          | b          | a2         |
      +------------+------------+------------+
      | 1          | 3          | 1          |
      | NULL       | NULL       | 1          |
      +------------+------------+------------+
      -- Scenario 2: The SELECT statement contains an expression of the BOOLEAN type. Only equivalent comparison is supported. Sample statements that you cannot execute: select (a,b) > (select a,b from ts where c = t.c) from t;
      select (a,b) = (select a,b from ts where c = t.c) from t;
      -- The following result is returned: 
      +------+
      | _c0  |
      +------+
      | true |
      | false |
      +------+
      -- Scenario 3: A WHERE clause supports multi-column comparison. Only equivalent comparison is supported. Sample statements that you cannot execute: select * from t where (a,b) > (select a,b from ts where c = t.c);
      select * from t where c > 3.0 and (a,b) = (select a,b from ts where c = t.c);
      -- The following result is returned: 
      +------------+------------+------------+
      | a          | b          | c          |
      +------------+------------+------------+
      | 1          | 3          | 4.0        |
      +------------+------------+------------+
      select * from t where c > 3.0 or (a,b) = (select a,b from ts where c = t.c);
      -- The following result is returned: 
      +------------+------------+------------+
      | a          | b          | c          |
      +------------+------------+------------+
      | 1          | 3          | 4.0        |
      | 1          | 3          | 5.0        |
      +------------+------------+------------+

References

A large number of subqueries or improper use of subqueries may cause slow queries, especially in a big data environment. You can use temporary tables or materialized views instead of subqueries, or reconstruct multiple subqueries into a JOIN operation to improve query efficiency. For more information, see Recommendations and management of materialized views and JOIN.