This topic provides answers to some frequently asked questions about data query language (DQL) operations in MaxCompute.
The FAQ in this topic are grouped into the following categories: GROUP BY, ORDER BY, subqueries, INTERSECT/UNION/EXCEPT/MINUS, JOIN, MAPJOIN, and others.
What do I do if the "Repeated key in GROUP BY" error message appears when I execute a MaxCompute SQL statement?
Problem description
When I execute a MaxCompute SQL statement, the following error message appears:
FAILED: ODPS-0130071:Semantic analysis exception - Repeated key in GROUP BY.
Cause
SELECT DISTINCT cannot be followed by a constant.
Solution
Split the execution logic of the SQL statement into two layers. This way, the DISTINCT logic without constants is processed at the inner layer, and constant data is added to the outer layer.
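The following sketch shows the two-layer rewrite. The table t and the column c1 are hypothetical placeholders.
-- Fails: a constant follows SELECT DISTINCT.
-- select distinct c1, 'flag' as c2 from t;
-- Rewrite: process DISTINCT at the inner layer and add the constant at the outer layer.
select c1, 'flag' as c2
from (
    select distinct c1 from t
) tmp;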
What do I do if the "Expression not in GROUP BY key" error message appears when I execute a MaxCompute SQL statement?
Problem description
When I execute a MaxCompute SQL statement, the following error message appears:
FAILED: ODPS-0130071:Semantic analysis exception - Expression not in GROUP BY key : line 1:xx 'xxx'
Cause
Columns that are not specified in the GROUP BY clause cannot be directly referenced. For more information, see GROUP BY (col_list).
Solution
Modify the SQL statement to ensure that the columns that are queried by using SELECT are the columns that are specified in the GROUP BY clause or the columns that are processed by using aggregate functions, such as SUM or COUNT.
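The following sketch shows both fixes. The table t and the columns c1 and c2 are hypothetical.
-- Fails: c2 is neither in the GROUP BY clause nor aggregated.
-- select c1, c2 from t group by c1;
-- Fix 1: aggregate the column that is not in the GROUP BY clause.
select c1, sum(c2) as c2_sum from t group by c1;
-- Fix 2: add the column to the GROUP BY clause.
select c1, c2 from t group by c1, c2;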
Table B is generated after I execute GROUP BY on Table A. The number of rows in Table B is less than the number of rows in Table A, but the physical storage capacity of Table B is ten times the physical storage capacity of Table A. Why?
In MaxCompute, data is stored in columnar compression mode. If adjacent values in a column are similar, the data compression ratio is high. If odps.sql.groupby.skewindata is set to true, data is scattered when an SQL statement writes it. In this case, the data compression ratio is low. If you want to obtain a high data compression ratio, you can sort the data when you execute an SQL statement to write it, as shown in the following sketch.
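This hedged sketch assumes hypothetical tables a and b and a column c1 whose values repeat frequently.
-- Sorting by c1 places similar values in adjacent rows of the column,
-- which increases the columnar compression ratio.
insert overwrite table b
select * from a distribute by c1 sort by c1;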
Is query performance affected when I use GROUP BY to query 10 billion data records? Is the data amount limited when I use GROUP BY to query data?
No, query performance is not affected when you use GROUP BY to query 10 billion data records. The data amount is not limited when you use GROUP BY to query data. For more information about GROUP BY, see GROUP BY (col_list).
After I query data in MaxCompute, how are the query results sorted?
The data of MaxCompute tables is arranged in random order. If you do not configure order settings, the data is also returned in random order.
If you want to obtain sorted data, configure order settings for the data. For example, you can specify order by xx limit n in the SQL statement to sort the data. If you want to sort the full data, set n after limit to the total number of data records + 1.
If you sort large amounts of full data, query performance is significantly affected and memory overflow may occur. We recommend that you do not perform this operation.
Does MaxCompute support ORDER BY FIELD NULLS LAST?
Yes, MaxCompute supports ORDER BY FIELD NULLS LAST. For more information about the syntax supported by MaxCompute, see Differences in the support for SQL statements.
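A minimal example, assuming a hypothetical table t with a nullable column c1:
select * from t order by c1 nulls last limit 100;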
What do I do if the "ORDER BY must be used with a LIMIT clause" error message appears when I execute a MaxCompute SQL statement?
Problem description
When I execute a MaxCompute SQL statement, the following error message appears:
FAILED: ODPS-0130071:[1,27] Semantic analysis exception - ORDER BY must be used with a LIMIT clause, please set odps.sql.validate.orderby.limit=false to use it.
Cause
The ORDER BY clause sorts all data on a single node. By default, the ORDER BY clause must be used with the LIMIT clause to prevent a single node from processing large amounts of data.
Solution
You can remove the limit on the simultaneous execution of the ORDER BY and LIMIT clauses for a project or session.
To remove the limit for a project, run the setproject odps.sql.validate.orderby.limit=false; command.
To remove the limit for a session, commit and run the set odps.sql.validate.orderby.limit=false; command together with the SQL statement that you want to execute.
Note: If a single node has large amounts of data to sort after you remove the limit, a large number of resources and a long period of time are consumed.
For more information about ORDER BY, see ORDER BY (order_condition).
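For example, the following hedged snippet removes the limit at the session level. The table t and the column c1 are hypothetical.
set odps.sql.validate.orderby.limit=false;
select * from t order by c1;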
When I execute a MaxCompute SQL statement with NOT IN that is followed by a subquery, the subquery is expected to return tens of thousands of data records. However, if a subquery that follows IN or NOT IN returns partition data, the maximum number of data records that can be returned is 1,000. How do I ensure that the subquery returns the expected number of data records and the logic of NOT IN is implemented?
You can use LEFT OUTER JOIN to query the data.
select * from a where a.ds not in (select ds from b);
Replace the preceding statement with the following statement:
select a.* from a left outer join (select distinct ds from b) bb on a.ds=bb.ds where bb.ds is null;
How do I merge two tables that are not associated with each other?
You can perform UNION ALL operations to merge the tables vertically. If you want to merge the tables horizontally, use the row_number function to add an ID column to each table, join the tables on the ID columns, and then read the fields of both tables. For more information, see UNION or ROW_NUMBER.
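The following hedged sketch shows a horizontal merge. The tables t1 and t2 and their columns are hypothetical, and the ORDER BY columns in the window functions determine how the rows are paired.
select a.c1, b.c2
from (select c1, row_number() over (order by c1) as rn from t1) a
join (select c2, row_number() over (order by c2) as rn from t2) b
on a.rn = b.rn;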
What do I do if the "ValidateJsonSize error" error message appears when I perform UNION ALL operations?
Problem description
When I execute an SQL statement that contains 200 UNION ALL operations, such as select count(1) as co from client_table union all ..., the following error message appears:
FAILED: build/release64/task/fuxiWrapper.cpp(344): ExceptionBase: Submit fuxi Job failed, { "ErrCode": "RPC_FAILED_REPLY", "ErrMsg": "exception: ExceptionBase:build/release64/fuxi/fuximaster/fuxi_master.cpp(1018): ExceptionBase: StartAppFail: ExceptionBase:build/release64/fuxi/fuximaster/app_master_mgr.cpp(706): ExceptionBase: ValidateJsonSize error: the size of compressed plan is larger than 1024KB\nStack
Causes
Cause 1: After an SQL statement is converted into an execution plan, the length of the execution plan exceeds 1024 KB, which is the maximum size allowed by the underlying architecture. As a result, an SQL execution error is returned. The length of the execution plan is not directly related to the length of the SQL statement. Therefore, the length of the execution plan cannot be estimated.
Cause 2: The length of the execution plan exceeds the limit due to a large number of partitions.
Cause 3: The SQL statement fails to run due to excessive small files.
Solutions
Solution to Cause 1: If an SQL statement is excessively long, we recommend that you split the statement into multiple statements. This prevents the generated execution plan from exceeding the maximum length.
Solution to Cause 2: If a large number of partitions exist, adjust the number of partitions. For more information, see Partition.
Solution to Cause 3: If excessive small files exist, merge the small files.
What do I do if the "Both left and right aliases encountered in JOIN" error message appears when I perform a JOIN operation?
Problem description
When I execute a MaxCompute SQL statement, the following error message appears:
FAILED: ODPS-0130071:Semantic analysis exception - Both left and right aliases encountered in JOIN : line 3:3 'xx'. If you really want to perform this join, try mapjoin
Causes
Cause 1: The ON condition in the SQL statement includes a non-equi join, such as table1.c1 > table2.c3.
Cause 2: One side of the JOIN condition in the SQL statement references columns from both tables, such as table1.col1 = concat(table1.col2, table2.col3).
Solutions
Solution to Cause 1: Change the non-equi join in the ON condition in the SQL statement to an equi-join.
Note: If you must use a non-equi join, you can add a MAPJOIN hint to the SQL statement. For more information, see ODPS-0130071.
Solution to Cause 2: If one of the tables is small, use the MAPJOIN method.
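The following sketch shows the MAPJOIN approach that is mentioned in the Note and in the solution to Cause 2. It assumes that table2 is small enough to be loaded into memory.
select /*+ mapjoin(b) */ a.*
from table1 a join table2 b
on a.c1 > b.c3;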
What do I do if the "Maximum 16 join inputs allowed" error message appears when I perform a JOIN operation?
Problem description
When I execute a MaxCompute SQL statement, the following error message appears:
FAILED: ODPS-0123065:Join exception - Maximum 16 join inputs allowed
Cause
A MaxCompute SQL statement can perform MAPJOIN on a maximum of six small tables and can consecutively join a maximum of 16 tables.
Solution
Join several small tables into a temporary table as an input table to reduce the number of input tables.
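A hedged sketch of the workaround, assuming hypothetical small tables s1, s2, and s3 that share a join key k:
-- Pre-join the small tables into one temporary table.
create table tmp_small as
select s1.k, s1.a, s2.b, s3.c
from s1 join s2 on s1.k = s2.k
join s3 on s1.k = s3.k;
-- Then use tmp_small as a single input in the original query.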
What do I do if the number of returned data records is greater than the number of data records in one of the source tables after I perform a JOIN operation?
Problem description
After I execute the following MaxCompute SQL statement, the number of returned data records is greater than the number of data records in the table1 table.
select count(*) from table1 a left outer join table2 b on a.ID = b.ID;
Cause
In the preceding SQL statement, a left outer join is performed on the ID fields of table1 and table2. Therefore, the following situations may occur:
If the data that you want to join cannot be found in table2, table1 still returns a data record.
If the data that you want to join cannot be found in table1 but can be found in table2, no data is returned.
If the data that you want to join can be found in both table1 and table2, the join logic is the same as that of an inner join. If an ID has multiple matching values in both table1 and table2, the returned result is the Cartesian product of the matching rows.
The following table provides sample data in table1.
+----+--------+
| id | values |
+----+--------+
| 1  | a      |
| 1  | b      |
| 2  | c      |
+----+--------+
The following table provides sample data in table2.
+----+--------+
| id | values |
+----+--------+
| 1  | A      |
| 1  | B      |
| 3  | D      |
+----+--------+
The following table lists the returned results after select count(*) from table1 a left outer join table2 b on a.ID = b.ID; is executed.
+-----+---------+------+---------+
| id1 | values1 | id2  | values2 |
+-----+---------+------+---------+
| 1   | b       | 1    | B       |
| 1   | b       | 1    | A       |
| 1   | a       | 1    | B       |
| 1   | a       | 1    | A       |
| 2   | c       | NULL | NULL    |
+-----+---------+------+---------+
Both tables have data whose value of the ID field is 1. Therefore, the Cartesian product operation is performed and four data records are returned.
Only table1 has data whose value of the ID field is 2. Therefore, one data record is returned.
Only table2 has data whose value of the ID field is 3. Therefore, no data is returned.
Solution
Check whether the increase in the number of data records is caused by duplicate data in table2. In the following example statement, limit 10 is added to prevent data from flooding your screen if table2 has a large amount of data. You need to check only the first few data records to determine the cause of the increase.
select id, count(*) as cnt from table2 group by id having cnt>1 limit 10;
If you do not want to perform the Cartesian product operation when duplicate data exists and you want to achieve the effect that is equivalent to IN in SQL, you can use the following statement:
select * from table1 a left outer join (select distinct id from table2) b on a.id = b.id;
I specify the partition condition when I perform a JOIN operation, but the system prompts that full table scan is prohibited. Why?
Problem description
When I execute the following statement in two projects, the statement is successfully executed only in one of the projects.
select t.stat_date from fddev.tmp_001 t left outer join (select '20180830' as ds from fddev.dual ) t1 on t.ds = 20180830 group by t.stat_date;
The following error message appears:
Table(fddev,tmp_001) is full scan with all partitions,please specify partitions predicates.
Cause
To specify a partition in a SELECT statement, you must use a WHERE clause. The ON condition in your SELECT statement does not comply with the SQL standard.
The set odps.sql.outerjoin.supports.filters=false command is run in the project in which the statement is successfully executed. This configuration converts the condition in the ON clause into a filter condition to allow non-standard SQL statements. The configuration is compatible with the Hive syntax but does not comply with the SQL standard.
Solution
We recommend that you place the partition filter conditions in a WHERE clause.
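For example, the statement in the problem description can be rewritten as the following hedged sketch, in which the partition predicate is moved into a WHERE clause:
select t.stat_date
from fddev.tmp_001 t
left outer join (select '20180830' as ds from fddev.dual) t1
on t.ds = t1.ds
where t.ds = '20180830'
group by t.stat_date;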
When I perform a JOIN operation, does partition pruning take effect if the partition pruning condition is specified in an ON clause or a WHERE clause?
Partition pruning takes effect if the partition pruning condition is specified in a WHERE clause. If the partition pruning condition is specified in an ON clause, partition pruning takes effect on the secondary table. Partition pruning does not take effect on the primary table. Therefore, a full table scan is triggered. For more information about partition pruning, see Check whether partition pruning is effective.
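A minimal sketch, assuming that ds is the partition column of the hypothetical tables a and b: placing the predicates in the WHERE clause ensures that pruning takes effect on both sides.
select a.id
from a join b on a.id = b.id
where a.ds = '20240101' and b.ds = '20240101';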
How do I use MAPJOIN to cache multiple small tables?
You can specify the aliases of the tables that you want to cache in the MAPJOIN statement.
For example, a table named iris exists in a project. The table has the following data:
+------------------------------------------+
| Field | Type | Label | Comment |
+------------------------------------------+
| sepal_length | double | | |
| sepal_width | double | | |
| petal_length | double | | |
| petal_width | double | | |
| category | string | | |
+------------------------------------------+
The following sample code shows how to use MAPJOIN to cache small tables.
select
/*+ mapjoin(b,c) */
a.category,
b.cnt as cnt_category,
c.cnt as cnt_all
from iris a
join
(
select count(*) as cnt, category from iris group by category
) b
on a.category = b.category
join
(
select count(*) as cnt from iris
) c;
Can I exchange the large table and small tables that are specified in the MAPJOIN statement?
The large table and small tables in the MAPJOIN statement are distinguished based on the size of the space that is used by each table.
The system loads all data in the specified small tables to the memory of the program that performs the JOIN operation. This helps accelerate the execution of the JOIN operation. If you exchange the large table and small tables in the MAPJOIN statement, no error is returned but the processing performance decreases.
After I configure filter conditions in a MaxCompute SQL statement, an error message indicating that the size of the input data exceeds 100 GB appears. What do I do?
Filter partitions before you obtain data, and filter the non-partition fields after you obtain data. The input size is measured based on the size of the table after the partition filters are applied but before the non-partition fields are filtered.
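A minimal sketch, assuming a hypothetical table t in which ds is a partition column and amount is a non-partition column:
-- The partition filter on ds determines the input size.
-- The filter on amount is applied after the data is obtained.
select * from t where ds = '20240101' and amount > 100;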
Does the WHERE condition of a fuzzy query in MaxCompute SQL support regular expressions?
Yes, the WHERE condition of a fuzzy query in MaxCompute SQL supports regular expressions. For example, select * from user_info where address rlike '[0-9]{9}'; searches for address values that contain nine consecutive digits.
If I want to synchronize only 100 data records, how do I use LIMIT to specify the number of data records that I want to synchronize in a WHERE clause?
LIMIT cannot be used in a WHERE clause. You can execute an SQL statement to read 100 data records before you synchronize data.
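For example, you can stage the records in a temporary table before synchronization. This is a hedged sketch; source_table and tmp_sync are hypothetical names.
create table tmp_sync as select * from source_table limit 100;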
How can I improve the query efficiency? Can I adjust the partition settings?
If you use partition fields to partition a table, a full table scan is not triggered when partitions are added or when partition data is updated or read. This improves the efficiency of data processing. For more information, see Table operations.
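A minimal sketch, assuming a hypothetical table t that is partitioned by a ds column: specifying the partition in the query prevents a full table scan.
create table t (c1 string) partitioned by (ds string);
select c1 from t where ds = '20240101';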
Does MaxCompute SQL support the WITH AS statement?
Yes, MaxCompute SQL supports the WITH AS statement. MaxCompute supports SQL-compliant common table expressions (CTEs) to improve the readability and execution efficiency of SQL statements. For more information, see Common table expressions.
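A minimal CTE sketch; the table t and its columns are hypothetical.
with a as (
    select c1, c2 from t where c2 > 0
)
select c1, count(*) as cnt from a group by c1;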
How do I split one row of data into multiple rows?
You can use Lateral View with table generation functions, such as SPLIT and EXPLODE, to split one row of data into multiple rows of data and aggregate the split data.
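The following hedged sketch splits a comma-separated string column into rows. The table t and its columns are hypothetical.
-- t has columns (id, tags), where tags looks like 'a,b,c'.
select id, tag
from t
lateral view explode(split(tags, ',')) tbl as tag;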
After I specify use_instance_tunnel=false and instance_tunnel_max_record=10 in the odps_config.ini file of the client, the SELECT statement still generates a large number of output records. Why?
To use instance_tunnel_max_record to control the number of output records, you must change use_instance_tunnel=false to use_instance_tunnel=true.
How do I use a regular expression to determine whether the values of a field are in Chinese?
The following statement shows an example.
select 'Field name' rlike '[\\x{4e00}-\\x{9fa5}]+';