DataWorks provides ODPS Script nodes for you to develop MaxCompute script tasks in the code editor provided by the MaxCompute V2.0 SQL engine. This topic describes how to use an ODPS Script node to develop a MaxCompute script task.
Prerequisites
An ODPS Script node is created. For more information, see Create and manage ODPS nodes.
Background information
The SQL engine of MaxCompute supports the script mode. In script mode, an SQL script file that contains multiple SQL statements is compiled as a whole. Each statement in the SQL script file does not need to be separately compiled. The script mode is used in the scenario where a single statement with nested subqueries needs to be rewritten, or a complex script must be split into multiple statements. After the SQL script file is compiled, it is submitted to MaxCompute, and an execution plan is generated. This way, the statements in the SQL script file are scheduled in one queue and are executed once. This allows you to fully utilize the resources in MaxCompute. For more information, see SQL in script mode. In DataWorks, you can create ODPS Script nodes to develop code for MaxCompute script tasks in script mode, and integrate MaxCompute script tasks with other tasks.
Scenarios
You can use the script mode in the following scenarios:
A single statement with nested subqueries needs to be rewritten, or a complex script must be split into multiple statements.
The script mode is not applicable in this scenario: Data of multiple data sources needs to be added and a long time interval for data preparation exists between different data sources. For example, data of a data source is ready at 01:00, and data of another data source is ready at 07:00. In this case, the table variables are not suitable for packaging multiple statements in an SQL script file.
You must assign a constant value to a variable and execute the
SELECT * FROM Variable
statement to convert the variable into a scalar value. The scalar value is used together with other columns for calculation. The constant value can also be stored in a table that has only one row. The following statements provide an example. For more information about the syntax of the SELECT * FROM Variable statement, see Subqueries.@a := SELECT 10; -- Assign the constant value 10 to @a. You can also use SELECT col1 FROM t1 to assign the constant value stored in table t1 that has only one row to @a. @b := SELECT key,value+(SELECT * FROM @a) FROM t2 WHERE key >10000; -- Use the value of @a with the value in table t2 for calculation. SELECT * FROM @b;
Syntax
A complete MaxCompute script contains SET statements, DDL statements, and DML statements. You can include zero or more statements of each type in a script. However, statements of different types cannot be mixed. The at signs (@) preceding statements indicate that these statements are connected by using variables. Syntax:
-- SET statements
set odps.sql.type.system.odps2=true;
[set odps.stage.reducer.num=***;]
[...]
-- DDL statements
create table table1 xxx;
[create table table2 xxx;]
[...]
-- DML statements
@var1 := SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table3
[WHERE where_condition];
@var2 := SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table4
[WHERE where_condition];
@var3 := SELECT [ALL | DISTINCT] var1.select_expr, var2.select_expr, ...
FROM @var1 join @var2 on ...;
INSERT OVERWRITE|INTO TABLE [PARTITION (partcol1=val1, partcol2=val2 ...)]
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM @var3;
[@var4 := SELECT [ALL | DISTINCT] var1.select_expr, var.select_expr, ... FROM @var1
UNION ALL | UNION
SELECT [ALL | DISTINCT] var1.select_expr, var.select_expr, ... FROM @var2;
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
AS
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM var4;]
Limits
The following limits are imposed when you use ODPS Script nodes:
ODPS Script nodes support SET statements, DML statements, and some DDL statements in script mode. The DDL statements that display results, such as
DESC
andSHOW
, are not supported.
You can write only one statement that displays results, such as a SELECT statement, in a script. If you write multiple such statements in a script, an error occurs. We recommend that you do not use SELECT statements that display results in a script.
You can write only one
CREATE TABLE AS
statement in a script, and this statement must be the last statement in the script. We recommend that you write CREATE TABLE statements and INSERT statements in different sections to separate them.In script mode, if one statement in a script fails, the whole script fails.
In script mode, a job is generated to process data only after all input data is prepared for a script.
In script mode, if you specify a statement for writing data to a table and then a statement for reading data from the table in the same script, an error occurs. For example, if you write the following statements in a script, an error occurs:
insert overwrite table src2 select * from src where key > 0; @a := select * from src2; select * from @a;
To avoid the error, change the preceding statements to the following statements:
@a := select * from src where key > 0; insert overwrite table src2 select * from @a; select * from @a;
Simple code editing example
ODPS Script nodes allow you to write SQL statements based on your business logic in a similar way as you use a common programming language. You do not need to consider how to organize the SQL statements. The following example describes how to use an ODPS Script node:
create table if not exists dest(key string , value bigint) ;
create table if not exists dest2(key string,value bigint ) ;
@a := select * from src where value >0;
@b := select * from src2 where key is not null;
@c := select * from src3 where value is not null;
@d := select a.key,b.value from @a left outer join @b on a.key=b.key and b.value>0;
@e := select a.key,c.value from @a inner join @c on a.key=c.key;
@f := select * from @d union select * from @e union select * from @a;
insert overwrite table dest select * from @f;
@g := select e.key,c.value from @e join @c on e.key=c.key;
insert overwrite table dest2 SELECT * from @g;
In this example, the src, src1, and src2 tables are used. Statements to create the tables:
-- Create tables.
create table if not EXISTS src(key string ,value BIGINT);
insert into src values ('1',11) ;
insert into src values ('1',11) ;
create table if not EXISTS src2(key string ,value BIGINT);
insert into src2 values ('1',22);
insert into src2 values ('2',22);
create table if not EXISTS src3(key string ,value BIGINT);
insert into src3 values ('1',33);
insert into src3 values ('3',33);
Advanced code editing examples
The script mode of MaxCompute supports the IF statement. The IF statement allows the system to automatically select the execution logic based on specific conditions. MaxCompute supports the following IF syntax based on the condition types: BOOLEAN and BOOLEAN-type scalar subquery.
The IF condition is a BOOLEAN-type expression.
A BOOLEAN-type expression in the
IF ELSE
statement determines the branch to execute during code compilation. Sample code:-- Data processing set odps.sql.allow.fullscan=true; set odps.optimizer.cbo.rule.filter.black=LM; @date := '${var}'; @row TABLE(key int,value bigint); -- Declare the row variable. The row type is TABLE and the schema is the STRING data type. IF ( cast(@date as bigint) % 2 == 0 ) BEGIN @row := SELECT key,value from src1; END ELSE BEGIN @row := SELECT key,value from src2; END INSERT OVERWRITE TABLE dest1 partition(p='${var}') SELECT key,value FROM @row;
NoteA variable named var is defined in the code. You need to assign a value to the var variable in the Scheduling Parameter section of the Properties tab.
The IF condition is a BOOLEAN-type scalar subquery.
A BOOLEAN-type scalar subquery in the
IF ELSE
statement determines the branch to execute during job running rather than during code compilation. Therefore, you must submit multiple jobs. Sample code:@i bigint; @t table(id bigint, value bigint); IF ((SELECT count(*) FROM src WHERE a = '5') > 1) BEGIN @i := 1; @t := select @i, @i*2; END ELSE BEGIN @i := 2; @t := select @i, @i*2; END select id, value from @t;
Embedded user-defined function (UDF) development
You can use the MaxCompute script mode to implement embedded UDF development by embedding Java or Python code into an SQL script. For more information, see Code-embedded UDFs.
What to do next
After you complete the development of a task by using the created node, you can perform the following operations:
Configure scheduling properties: You can configure properties for periodic scheduling of the node. If you want the system to periodically schedule and run the task, you must configure items for the node, such as rerun settings and scheduling dependencies. For more information, see Overview.
Debug the node: You can debug and test the code of the node to check whether the code logic meets your expectations. For more information, see Debugging procedure.
Deploy the node: After you complete all development operations, you can deploy the node. After the node is deployed, the system periodically schedules the node based on the scheduling properties of the node. For more information, see Deploy nodes.