In most cases, you need to analyze query statements or table schemas to find performance bottlenecks during development. MaxCompute SQL provides the EXPLAIN statement to help you analyze query statements. This topic describes the features and syntax of the EXPLAIN statement and provides examples of how to use it.
Description
The EXPLAIN statement displays the execution plan structure of a DQL statement in MaxCompute SQL. This statement helps you understand how an SQL statement is executed and provides guidance for optimizing it. A query statement may correspond to multiple jobs, and a job may correspond to multiple tasks.
If the query statement is complex and the output of the EXPLAIN statement exceeds 4 MB, the threshold specified by the API of the upper-layer application is reached. As a result, the output cannot be completely displayed. To address this issue, you can split the query statement into multiple subqueries and execute the EXPLAIN statement on each subquery to obtain the structure of each job.
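For example, a query whose inner aggregation is complex can be split by materializing the subquery and analyzing each part separately. The following sketch uses the tables from the Sample data section of this topic; the tmp_agg table name and the filter threshold are illustrative.

```sql
-- Step 1: materialize the complex subquery, and inspect its plan on its own.
CREATE TABLE tmp_agg AS
SELECT customer_id, SUM(total_price) AS sum_price
FROM sale_detail
GROUP BY customer_id;

EXPLAIN
SELECT customer_id, SUM(total_price) AS sum_price
FROM sale_detail
GROUP BY customer_id;

-- Step 2: inspect the plan of the outer query against the materialized result.
EXPLAIN
SELECT customer_id FROM tmp_agg WHERE sum_price > 100;
```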
Syntax
EXPLAIN <dml query>;
dml query: required. The SELECT statement. For more information, see SELECT syntax.
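For instance, to view the execution plan of a simple query over the sale_detail table that is created in the Sample data section of this topic:

```sql
-- EXPLAIN analyzes the plan; it does not execute the query or return query results.
EXPLAIN
SELECT shop_name, total_price
FROM sale_detail
WHERE sale_date = '2013' AND region = 'china';
```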
Return value
The output results of the EXPLAIN statement include the following information:
Dependencies between jobs
For example, "job0 is root job". If a query requires only job0, only this one row of data is displayed.
Dependencies between tasks
In Job job0:
root Tasks: M1, M2
J3_1_2_Stg1 depends on: M1, M2
job0 contains the following tasks: M1, M2, and J3_1_2_Stg1. MaxCompute runs the J3_1_2_Stg1 task after the M1 and M2 tasks are run.
Naming conventions of tasks:
MaxCompute provides four task types: map, reduce, join, and local work. The first letter of a task name indicates the type of the task. For example, M2Stg1 is a map task.
The digit that follows the first letter indicates the task ID. This ID is unique among all tasks that correspond to a specific query.
Digits separated by underscores (_) represent the direct dependencies of a task. For example, J3_1_2_Stg1 indicates that the task with ID 3 depends on the M1 and M2 tasks.
Dependencies between all operators in a task
The operator string describes the execution semantics of a task. Operator string structure:
In Task M2:
    Data source: mf_mc_bj.sale_detail_jt/sale_date=2013/region=china  # Data source describes the input of the task.
    TS: mf_mc_bj.sale_detail_jt/sale_date=2013/region=china  # TableScanOperator
    FIL: ISNOTNULL(customer_id)  # FilterOperator
    RS: order: +  # ReduceSinkOperator
        nullDirection: *
        optimizeOrderBy: False
        valueDestLimit: 0
        dist: HASH
        keys:
              customer_id
        values:
              customer_id (string)
              total_price (double)
        partitions:
              customer_id
In Task J3_1_2:
    JOIN:  # JoinOperator
        StreamLineRead1 INNERJOIN StreamLineRead2
        keys:
            0:customer_id
            1:customer_id
    AGGREGATE: group by:customer_id  # GroupByOperator
        UDAF: SUM(total_price) (__agg_0_sum)[Complete],SUM(total_price) (__agg_1_sum)[Complete]
    RS: order: +
        nullDirection: *
        optimizeOrderBy: True
        valueDestLimit: 10
        dist: HASH
        keys:
              customer_id
        values:
              customer_id (string)
              __agg_0 (double)
              __agg_1 (double)
        partitions:
In Task R4_3:
    SEL: customer_id,__agg_0,__agg_1  # SelectOperator
    LIM: limit 10  # LimitOperator
    FS: output: Screen  # FileSinkOperator
        schema:
          customer_id (string) AS ashop
          __agg_0 (double) AS ap
          __agg_1 (double) AS bp
Operator descriptions:
TableScanOperator (TS): describes the logic of FROM statement blocks in a query statement. The alias of the input table is displayed in the output results of the EXPLAIN statement.
SelectOperator (SEL): describes the logic of SELECT statement blocks in a query statement. The columns that are passed to the next operator are displayed in the output results of the EXPLAIN statement. Multiple columns are separated by commas (,). If a column is specified, the value is displayed in the <alias>.<column_name> format. If an expression is specified, the value is displayed as a list of functions, such as func1(arg1_1, arg1_2, func2(arg2_1, arg2_2)). If a constant is specified, the constant value is displayed.
FilterOperator (FIL): describes the logic of WHERE statement blocks in a query statement. The output results of the EXPLAIN statement include a WHERE expression, which is in a form similar to that of SelectOperator.
JoinOperator (JOIN): describes the logic of JOIN statement blocks in a query statement. The output results of the EXPLAIN statement show which tables are joined and in which way.
GroupByOperator (AGGREGATE): describes the logic of aggregate operations. This operator is displayed if an aggregate function is used in a query statement. The content of the aggregate function is displayed in the output results of the EXPLAIN statement.
ReduceSinkOperator (RS): describes the logic of data distribution between tasks. If the result of a task is transferred to another task, ReduceSinkOperator must be used to distribute the data at the last stage of the task. The result sorting method, the distributed keys and values, and the columns that are used to calculate the hash value are displayed in the output results of the EXPLAIN statement.
FileSinkOperator (FS): describes the logic of storage operations on the final data records. If INSERT statement blocks are included in a query statement, the name of the table into which the data is inserted is displayed in the output results of the EXPLAIN statement.
LimitOperator (LIM): describes the logic of LIMIT statement blocks in a query statement. The number of returned rows that is specified in a LIMIT statement block is displayed in the output results of the EXPLAIN statement.
MapjoinOperator (HASHJOIN): describes JOIN operations on large tables. This operator is similar to JoinOperator.
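As a rough guide to how query clauses map to these operators, consider the following annotated query. This is a sketch; the actual plan shape and operator order depend on the optimizer.

```sql
EXPLAIN
SELECT customer_id,            -- SEL: SelectOperator
       SUM(total_price) AS sp  -- AGGREGATE: GroupByOperator
FROM sale_detail               -- TS: TableScanOperator
WHERE region = 'china'         -- FIL: FilterOperator
GROUP BY customer_id           -- RS: rows redistributed by customer_id between tasks
LIMIT 10;                      -- LIM: LimitOperator
```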
Sample data
Sample source data is provided for you to better understand the examples in this topic. The following statements show how to create the sale_detail and sale_detail_jt tables and insert data into the tables.
-- Create two partitioned tables named sale_detail and sale_detail_jt.
CREATE TABLE IF NOT EXISTS sale_detail
(
shop_name STRING,
customer_id STRING,
total_price DOUBLE
)
PARTITIONED BY (sale_date STRING, region STRING);
CREATE TABLE IF NOT EXISTS sale_detail_jt
(
shop_name STRING,
customer_id STRING,
total_price DOUBLE
)
PARTITIONED BY (sale_date STRING, region STRING);
-- Add partitions to the two tables.
ALTER TABLE sale_detail ADD PARTITION (sale_date='2013', region='china') PARTITION (sale_date='2014', region='shanghai');
ALTER TABLE sale_detail_jt ADD PARTITION (sale_date='2013', region='china');
-- Insert data into the tables.
INSERT INTO sale_detail PARTITION (sale_date='2013', region='china') VALUES ('s1','c1',100.1),('s2','c2',100.2),('s3','c3',100.3);
INSERT INTO sale_detail PARTITION (sale_date='2014', region='shanghai') VALUES ('null','c5',null),('s6','c6',100.4),('s7','c7',100.5);
INSERT INTO sale_detail_jt PARTITION (sale_date='2013', region='china') VALUES ('s1','c1',100.1),('s2','c2',100.2),('s5','c2',100.2);
-- Query data from the sale_detail and sale_detail_jt tables. Sample statements:
SET odps.sql.allow.fullscan=true;
SELECT * FROM sale_detail;
-- The following result is returned:
+------------+-------------+-------------+------------+------------+
| shop_name | customer_id | total_price | sale_date | region |
+------------+-------------+-------------+------------+------------+
| s1 | c1 | 100.1 | 2013 | china |
| s2 | c2 | 100.2 | 2013 | china |
| s3 | c3 | 100.3 | 2013 | china |
| null | c5 | NULL | 2014 | shanghai |
| s6 | c6 | 100.4 | 2014 | shanghai |
| s7 | c7 | 100.5 | 2014 | shanghai |
+------------+-------------+-------------+------------+------------+
SET odps.sql.allow.fullscan=true;
SELECT * FROM sale_detail_jt;
-- The following result is returned:
+------------+-------------+-------------+------------+------------+
| shop_name | customer_id | total_price | sale_date | region |
+------------+-------------+-------------+------------+------------+
| s1 | c1 | 100.1 | 2013 | china |
| s2 | c2 | 100.2 | 2013 | china |
| s5 | c2 | 100.2 | 2013 | china |
+------------+-------------+-------------+------------+------------+
-- Create a table for the JOIN operation.
SET odps.sql.allow.fullscan=true;
CREATE TABLE shop AS SELECT shop_name, customer_id, total_price FROM sale_detail;
Examples
The following examples are based on the Sample data.
Example 1
Query statements:
SELECT a.customer_id AS ashop, SUM(a.total_price) AS ap, COUNT(b.total_price) AS bp
FROM (SELECT * FROM sale_detail_jt WHERE sale_date='2013' AND region='china') a
INNER JOIN (SELECT * FROM sale_detail WHERE sale_date='2013' AND region='china') b
ON a.customer_id=b.customer_id
GROUP BY a.customer_id
ORDER BY a.customer_id
LIMIT 10;
Obtain the execution plan of the query statement:
EXPLAIN
SELECT a.customer_id AS ashop, SUM(a.total_price) AS ap, COUNT(b.total_price) AS bp
FROM (SELECT * FROM sale_detail_jt WHERE sale_date='2013' AND region='china') a
INNER JOIN (SELECT * FROM sale_detail WHERE sale_date='2013' AND region='china') b
ON a.customer_id=b.customer_id
GROUP BY a.customer_id
ORDER BY a.customer_id
LIMIT 10;
The following result is returned:
job0 is root job

In Job job0:
root Tasks: M1
M2_1 depends on: M1
R3_2 depends on: M2_1
R4_3 depends on: R3_2

In Task M1:
    Data source: doc_****.default.sale_detail/sale_date=2013/region=china
    TS: doc_****.default.sale_detail/sale_date=2013/region=china
        Statistics: Num rows: 3.0, Data size: 324.0
    FIL: ISNOTNULL(customer_id)
        Statistics: Num rows: 2.7, Data size: 291.6
    RS: valueDestLimit: 0
        dist: BROADCAST
        keys:
        values:
              customer_id (string)
              total_price (double)
        partitions:
        Statistics: Num rows: 2.7, Data size: 291.6
In Task M2_1:
    Data source: doc_****.default.sale_detail_jt/sale_date=2013/region=china
    TS: doc_****.default.sale_detail_jt/sale_date=2013/region=china
        Statistics: Num rows: 3.0, Data size: 324.0
    FIL: ISNOTNULL(customer_id)
        Statistics: Num rows: 2.7, Data size: 291.6
    HASHJOIN:
        Filter1 INNERJOIN StreamLineRead1
        keys:
            0:customer_id
            1:customer_id
        non-equals:
            0:
            1:
        bigTable: Filter1
        Statistics: Num rows: 3.6450000000000005, Data size: 787.32
    RS: order: +
        nullDirection: *
        optimizeOrderBy: False
        valueDestLimit: 0
        dist: HASH
        keys:
              customer_id
        values:
              customer_id (string)
              total_price (double)
              total_price (double)
        partitions:
              customer_id
        Statistics: Num rows: 3.6450000000000005, Data size: 422.82000000000005
In Task R3_2:
    AGGREGATE: group by:customer_id
        UDAF: SUM(total_price) (__agg_0_sum)[Complete],COUNT(total_price) (__agg_1_count)[Complete]
        Statistics: Num rows: 1.0, Data size: 116.0
    RS: order: +
        nullDirection: *
        optimizeOrderBy: True
        valueDestLimit: 10
        dist: HASH
        keys:
              customer_id
        values:
              customer_id (string)
              __agg_0 (double)
              __agg_1 (bigint)
        partitions:
        Statistics: Num rows: 1.0, Data size: 116.0
In Task R4_3:
    SEL: customer_id,__agg_0,__agg_1
        Statistics: Num rows: 1.0, Data size: 116.0
    SEL: customer_id ashop, __agg_0 ap, __agg_1 bp, customer_id
        Statistics: Num rows: 1.0, Data size: 216.0
    FS: output: Screen
        schema:
          ashop (string)
          ap (double)
          bp (bigint)
        Statistics: Num rows: 1.0, Data size: 116.0

OK
Example 2
Query statements:
SELECT /*+ mapjoin(a) */ a.customer_id AS ashop, SUM(a.total_price) AS ap, COUNT(b.total_price) AS bp
FROM (SELECT * FROM sale_detail_jt WHERE sale_date='2013' AND region='china') a
INNER JOIN (SELECT * FROM sale_detail WHERE sale_date='2013' AND region='china') b
ON a.total_price<b.total_price
GROUP BY a.customer_id
ORDER BY a.customer_id
LIMIT 10;
Obtain the execution plan of the query statement:
EXPLAIN
SELECT /*+ mapjoin(a) */ a.customer_id AS ashop, SUM(a.total_price) AS ap, COUNT(b.total_price) AS bp
FROM (SELECT * FROM sale_detail_jt WHERE sale_date='2013' AND region='china') a
INNER JOIN (SELECT * FROM sale_detail WHERE sale_date='2013' AND region='china') b
ON a.total_price<b.total_price
GROUP BY a.customer_id
ORDER BY a.customer_id
LIMIT 10;
The following result is returned:
job0 is root job

In Job job0:
root Tasks: M1
M2_1 depends on: M1
R3_2 depends on: M2_1
R4_3 depends on: R3_2

In Task M1:
    Data source: doc_****.sale_detail_jt/sale_date=2013/region=china
    TS: doc_****.sale_detail_jt/sale_date=2013/region=china
        Statistics: Num rows: 3.0, Data size: 324.0
    RS: valueDestLimit: 0
        dist: BROADCAST
        keys:
        values:
              customer_id (string)
              total_price (double)
        partitions:
        Statistics: Num rows: 3.0, Data size: 324.0
In Task M2_1:
    Data source: doc_****.sale_detail/sale_date=2013/region=china
    TS: doc_****.sale_detail/sale_date=2013/region=china
        Statistics: Num rows: 3.0, Data size: 24.0
    HASHJOIN:
        StreamLineRead1 INNERJOIN TableScan2
        keys:
            0:
            1:
        non-equals:
            0:
            1:
        bigTable: TableScan2
        Statistics: Num rows: 9.0, Data size: 1044.0
    FIL: LT(total_price,total_price)
        Statistics: Num rows: 6.75, Data size: 783.0
    AGGREGATE: group by:customer_id
        UDAF: SUM(total_price) (__agg_0_sum)[Partial_1],COUNT(total_price) (__agg_1_count)[Partial_1]
        Statistics: Num rows: 2.3116438356164384, Data size: 268.1506849315069
    RS: order: +
        nullDirection: *
        optimizeOrderBy: False
        valueDestLimit: 0
        dist: HASH
        keys:
              customer_id
        values:
              customer_id (string)
              __agg_0_sum (double)
              __agg_1_count (bigint)
        partitions:
              customer_id
        Statistics: Num rows: 2.3116438356164384, Data size: 268.1506849315069
In Task R3_2:
    AGGREGATE: group by:customer_id
        UDAF: SUM(__agg_0_sum)[Final] __agg_0,COUNT(__agg_1_count)[Final] __agg_1
        Statistics: Num rows: 1.6875, Data size: 195.75
    RS: order: +
        nullDirection: *
        optimizeOrderBy: True
        valueDestLimit: 10
        dist: HASH
        keys:
              customer_id
        values:
              customer_id (string)
              __agg_0 (double)
              __agg_1 (bigint)
        partitions:
        Statistics: Num rows: 1.6875, Data size: 195.75
In Task R4_3:
    SEL: customer_id,__agg_0,__agg_1
        Statistics: Num rows: 1.6875, Data size: 195.75
    SEL: customer_id ashop, __agg_0 ap, __agg_1 bp, customer_id
        Statistics: Num rows: 1.6875, Data size: 364.5
    FS: output: Screen
        schema:
          ashop (string)
          ap (double)
          bp (bigint)
        Statistics: Num rows: 1.6875, Data size: 195.75

OK