×
Community Blog MaxCompute Unleashed - Part 9: Script Mode and Parameter View

MaxCompute Unleashed - Part 9: Script Mode and Parameter View

Part 9 of the "Unleash the Power of MaxCompute" series introduces the script mode and parameterized views of MaxCompute.

By Haiqing

MaxCompute (formerly known as ODPS) is a leading distributed big data processing platform developed by Alibaba Cloud. It is widely utilized, especially within the Alibaba Group, and supports the core businesses of multiple business units (BUs). ODPS V2.0, continuously optimizing performance, aims to enhance the user experience and expression capabilities of SQL, as well as improve the productivity of ODPS developers.

Building upon the SQL engine of ODPS V2.0, MaxCompute simplifies SQL compilation and enhances the language's expressiveness. We present a series of articles titled Unleash the Power of MaxCompute to explore the capabilities of MaxCompute (ODPS V2.0).

In the eighth article, parameterized views are mentioned. This article introduces the script mode and parameterized views of MaxCompute.

Scenario 1: In a project with complex business processing logic, there is a need to read multiple tables, perform table joins, and join the results. Additionally, different tables need to be output at different stages of the operation. Even with the use of Common Table Expressions (CTEs), it becomes challenging to express the logic. As a result, the business logic has to be split into multiple jobs and submitted sequentially, leading to complex jobs and poor performance.

Scenario 2: In the current project, there is a data table that needs to be shared with multiple teams. To ensure security, a view is created using an algorithm mode. Other teams find this algorithm useful and wish to utilize it. However, the underlying access to the data table is different, and some mode recognition parameters may vary. Consequently, a new view needs to be created. If any bugs are identified in the original view, they can only be fixed individually.

The script mode and parameterized views provided by MaxCompute basically solve the problems in the above scenarios.

Script Mode

This article will take MaxCompute Studio as an example. First, install MaxCompute Studio, import and test a MaxCompute project, create a project, and create a new MaxCompute script file, as shown in the following figure:

1

Note: Make sure to select the script mode in MaxCompute Studio. In this mode, the compiler will compile the entire file as a whole, rather than compiling individual statements one by one. The script is also submitted as a whole.

Upon observing the script, you can see that it reads data from multiple tables, such as src, src1, and src2, and performs various operations such as filtering, unions, and joins. Finally, the results are output to multiple tables, dest and dest1. Multiple statements are connected using table variables starting with @. After compilation, submit the script for execution. You can use the job view in MaxCompute Studio to view the execution plan, as shown in the following figure:

2

By right-clicking on any task in the figure and selecting expand all from the pop-up menu, you can see all the operators of the execution plan and their relationships in a single view, as shown in the following figure:

3

You can observe that although there are multiple statements, the execution plan forms an interconnected DAG (Directed Acyclic Graph). Such an execution plan cannot be generated using traditional SQL statements, including CTE or MULTI INSERT. This ensures that the execution plan is queued and executed only once, efficiently utilizing the computing resources of MaxCompute. Additionally, it simplifies the writing process. You can focus on the business logic and write in a similar manner to ordinary programming languages without worrying about optimizing performance through statement organization, such as using CTE or MULTI INSERT.

You can observe that although there are multiple statements, the execution plan forms an interconnected DAG (Directed Acyclic Graph). Such an execution plan cannot be generated using traditional SQL statements, including CTE or MULTI INSERT. This ensures that the execution plan is queued and executed only once, efficiently utilizing the computing resources of MaxCompute. Additionally, it simplifies the writing process. You can focus on the business logic and write in a similar manner to ordinary programming languages without worrying about optimizing performance through statement organization, such as using CTE or MULTI INSERT.

The script mode also supports SET statements and some DDL statements. Example:

4

A complete script consists of SET, DDL, and DML statements, which are arranged in sequence. Each part can have 0 to multiple statements, but statements of different types cannot be interleaved.

It should be noted that in the script mode, if a table is written and then read, an error will be reported to avoid confusion. For example, the following statement will report an error.

insert overwrite table src2 select * from src where key > 0;
@a := select * from src2;
select * from @a;

We recommend that you avoid writing before reading. For example, you can modify the preceding script to:

@a := select * from src where key > 0;
insert overwrite table src2 select * from @a;
select * from @a;

In this way, the content that will be written to src2 can be printed out.

There are some limitations in the script mode.

• At present, a script supports at most one on-screen statement. Otherwise, an error is reported. We recommend that you do not execute the SELECT statement displayed on the screen in a script.

• At present, a script supports at most one CREATE TABLE AS statement and must be the last statement. We recommend that you write the CREATE TABLE statements separately from the INSERT statements, such as the scripts that mix DDL and DML statements mentioned above.

Use the Script Mode Through MaxCompute CMD

The script mode is a new mode. If you use MaxCompute CMD to submit scripts, you must use a version later than 0.27. We recommend that you download and install the latest version. After installation, use the new -s parameter to submit.

If you want to edit the myscript.sql file in the script mode, run the odpscmd command as follows:

odpscmd -s myscript.mysql;

Note: Similar to the -f and -e options, the -s option is a command-line option for the MaxCompute client but not a command in an interactive environment. If the MaxCompute client (odpscmd) is used in an interactive environment, the script mode and table variables are not supported.

If you want to pass some setting but do not want to write it in the script, you can use the -e parameter to pass the setting separately as follows.

odpscmd -e 'set odps.sql.setting1=a;odps.sql.setting2=b='-s myscript.sql

Use the Script Mode Through DataWorks

In DataWorks, you can create an ODPS Script node in the script mode, as shown in the following figure.

5

Edit the script mode in this node, and then click the Run button in the toolbar to submit the script to MaxCompute for execution. You can use the Logview URL in the output to view the graphical execution plan and results.

When Should You Not Combine Multiple Small SQL Statements Into a Large Script Mode SQL?

The script mode can support complex logic, allowing a script with multiple rows to be compiled into an execution plan. This makes it easy to connect statements in multiple SQL nodes using table variables and assemble them into a large script mode SQL. However, it is not suitable for all scenarios.

For example, let's consider a scenario where upstream data comes from two tables that are updated daily. One table is ready at 1:00 a.m., and the other table is ready at 7:00 a.m. An SQL node processes the first data source at 1:00 a.m., and the result is stored as an intermediate table. Another SQL node performs further operations on this intermediate table and the second data source at 7:00 a.m. to obtain the result data. In this scenario, it is not suitable to combine the SQL statements in the two nodes into a large script. This is because the SQL node could have preprocessed the data earlier. If a large script is combined, a job can only be generated after all the data arrives at 7:00 a.m. This means that the preprocessing that could have been done earlier now needs to be completed in a large job, resulting in a longer End-to-End (E2E) time for the entire process and a higher probability of errors.

The script mode is more suitable for rewriting a single statement that uses layers of nested subqueries or for scripts that need to be split into multiple statements due to their complexity.

Parameterized View

The traditional view in MaxCompute allows encapsulating a complex SQL script. For example, it can access multiple tables and perform intricate operations. However, the caller can treat the view like an ordinary table, without needing to know the underlying implementation. This provides encapsulation and code reuse, making it widely used.

However, traditional views have a limitation: they cannot accept parameters passed by the caller. For instance, a view may read an underlying table and expect the caller to provide a table for filtering the data or pass other parameters. Unfortunately, traditional views do not support these requirements. This restricts their ability to reuse code, often resulting in the need to duplicate similar code multiple times during development.

MaxCompute supports parameterized views, which allow passing in any table or other variables to customize their behavior.

To start, create a new MaxCompute script file, as shown in the following figure:

6

As you can see, pv has two parameters, a table parameter and a string parameter. The parameter can be any table or basic type. Create another script to call this view, as shown in the following figure:

7

You can see that pv1 can be called by different parameters. Table parameters can be physical tables, views, table variables, or table aliases in the CTE, and common parameters can be variables or constants.

Click the graph page of this script (the red box at the bottom of the above figure) to see the corresponding execution plan:

8

As you can see, the MaxCompute compiler compiles the script into an execution plan regardless of how many times the script is called and what parameters are used.

In fact, a parameterized view is not necessarily an SQL statement but can contain multiple statements like a script. Example:

9

The statements between BEGIN and END are the script of this view.

Note: The last statement, @pv2 := ..., is equivalent to a RETURN statement in other languages. The method assigns a value to an implicit table variable with the same name as the view.

The matching rules for actual and formal view parameters are the same as those specified in a normal programming language. If view parameters can be implicitly converted, these parameters can be matched. For example, the BIGINT value can match the parameters of the DOUBLE type. Table variables are more complex, but the rules are also simple. If the schema of one table can be inserted into another table, it can also be used to match the table type parameters of the same table schema.

In some situations, you can declare the return type to make the code easier to read. You can declare the return type. Example:

10

The RETURNS @ret TABLE (x string, y string) defines the following information:

• The return type is TABLE (x string, y string). That is, the type returned to the caller. Note that the schema of the table can be customized here.

• The return parameter is @ret. A value is assigned to the parameter in the view script.

You can consider a view that contains no BEGIN/END or return variables as a simplified parameterized view.

There are some additional restrictions on scripts when parameterizing a view.

• Only DML statements can be used in scripts. The INSERT and CREATE TABLE AS statements cannot be included in scripts.

• PRINT statements cannot be included in scripts.

Conclusion

In conclusion, MaxCompute provides a new script mode and parameterized views, which significantly improve developers' programming efficiency and code reusability. Additionally, performance is also enhanced. We highly recommend utilizing script mode and parameterized views whenever possible.

0 1 0
Share on

Alibaba Cloud MaxCompute

137 posts | 20 followers

You may also like

Comments

Alibaba Cloud MaxCompute

137 posts | 20 followers

Related Products