By Peng Qiu, senior technical expert of Alibaba Cloud
MaxCompute (formerly known as ODPS) is a leading distributed big data processing platform developed by Alibaba Cloud. It is widely utilized, especially within the Alibaba Group, and supports the core businesses of multiple business units (BUs). ODPS V2.0, continuously optimizing performance, aims to enhance the user experience and expression capabilities of SQL, as well as improve the productivity of ODPS developers.
Building upon the SQL engine of ODPS V2.0, MaxCompute simplifies SQL compilation and enhances the language's expressiveness. We present a series of articles titled Unleash the Power of MaxCompute to explore the capabilities of MaxCompute (ODPS V2.0).
In the previous article, I introduced CTE, VALUES, SEMIJOIN. This article introduces the support of MaxCompute for other scripting languages - SELECT TRANSFORM.
• Scenario 1
I needed to migrate my system to the MaxCompute platform, which originally relied on scripts like python, shell, and ruby. Converting these scripts into UDFs, UDAFs, and UDTFs for MaxCompute migration was time-consuming and required extensive testing. I wished for a simpler migration approach.
• Scenario 2
While SQL is adept at set operations, I required more intricate calculations on individual data records. The existing built-in functions and UDF framework were not flexible enough, and I lacked familiarity with Java/Python. I preferred writing scripts and wanted to input data into my script, perform custom calculations, and output results. MaxCompute could handle data splitting, distributed execution of my scripts, and management of input/output tables and relational operations like JOIN and UNION.
To achieve these features, you can utilize SELECT TRANSFORM.
This article uses MaxCompute Studio as an example. First, please install MaxCompute Studio, connect to a MaxCompute project, and create a new MaxCompute script file, as follows.
Submitting a job allows you to view the execution plan (expanded view):
SELECT TRANSFORM enables SQL users to execute a shell command on the server. It separates input data fields with tabs, passes each line as stdin to the shell command, and reads data from stdout as output for downstream processing. The shell command is essentially a Unix utility that can also invoke other script interpreters such as python, java, php, awk, and ruby.
This command is compatible with Hive's Transform feature. For more information, refer to the Hive document.
Some key points to note are::
set odps.sql.session.resources=foo.sh,bar.txt;
command. Multiple resource files can be specified, separated by commas (,). Alternatively, the resources clause can be used after the using clause to specify resources ('foo.sh', 'bar.txt'). Both methods are equivalent (see the example of run tests with odps).Currently, odps select transform is fully compatible with Hive syntax, features, and behaviors. Most Hive scripts can be directly added to the SELECT TRANSFORM statement, while others may require minor changes. Additionally, many performance optimizations have been implemented using a more efficient language (C++) compared to Hive.
In theory, all features that can be implemented by UDTFs can also be implemented by SELECT TRANSFORM. However, SELECT TRANSFORM is much more flexible than UDTFs. It supports not only Java and Python but also other scripts and tools like shell and perl. The process of writing SELECT TRANSFORM is simple, making it particularly suitable for implementing adhoc features. Here are a few examples:
1. Generating data out of nothing.
select transform(script) using 'sh' as (data) from
(
select 'for i in `seq 1 50`; do echo $i; done' as script
) t;
Or use python
select transform('for i in xrange(1, 51): print i;') using 'python' as (data);
The provided statement creates a data table with 50 rows, containing values from 1 to 50. Generating test data becomes much easier with this feature. Although it may seem simple, it used to be a pain point in ODPS without a convenient way to create data, which made testing and exploration inconvenient for beginners. While this functionality can also be implemented using UDTFs, it requires a complex process: go to the IDE -> write UDTFs -> package -> add jar/python -> create function -> execute -> drop function -> drop resource.
2. awk users will find this feature very useful
select transform(*) using "awk '//{print $2}'" as (data) from src;
The above statement only outputs value as it is, but users who are familiar with awk have lived a life of writing awk scripts instead of SQL.
3. Run tests with ODPS.
select transform(key, value)
using 'java -cp a.jar org.junit.runner.JUnitCore MyTestClass'
resources 'a.jar'
from testdata;
Or
set odps.sql.session.resources=a.jar;
select transform(key, value)
using 'java -cp a.jar org.junit.runner.JUnitCore MyTestClass'
from testdata;
This example is to illustrate that a lot of utility of Java can be run directly. Although Java and Python have existing UDTF frameworks, it is simpler to write with SELECT TRANSFORM, and there is no additional dependency or format requirement. Even offline scripts can be used directly.
4. Support other scripting languages.
select transform (key, value) using "perl -e 'while($input = <STDIN>){print $input;}'" from src;
The above example uses Perl. This is not just a language support extension. For some simple functionalities, awk, python, perl, and shell all support writing scripts directly in the command without the need for script files and resource uploads, making the development process simpler. Additionally, PHP and Ruby are not currently supported on our computing cluster, so these two scripting languages are not supported.
5. It can be used in chaining operations, preprocessing input data with distribute by and sort by.
select transform(key, value) using 'cmd2' from
(
select transform(*) using 'cmd1' from
(
select * from data distribute by col2 sort by col1
) t distribute by key sort by value
) t2;
Or you can use the keywords of the map and reduce to make the logic clear.
@a := select * from data distribute by col2 sort by col1;
@b := map * using 'cmd1' distribute by col1 sort by col2 from @a;
reduce * using 'cmd2' from @b;
In theory, the OpenMR models can be mapped to the calculation process described above. It's important to note that the syntax of map, reduce, and select transform is essentially the same. The choice of keywords or writing method does not affect the process or the final result.
Both SELECT TRANSFORM and UDTFs have their own advantages in terms of performance. After comparing and testing in various scenarios, SELECT TRANSFORM tends to perform better in most scenarios when dealing with smaller data volumes. On the other hand, UDTFs have an advantage when dealing with larger data volumes. SELECT TRANSFORM is particularly suitable for adhoc data analysis due to its simpler development process.
The MaxCompute SQL engine based on ODPS V2.0 provides the SELECT TRANSFORM feature, which significantly simplifies the reference to script code while improving performance. We highly recommend using SELECT TRANSFORM whenever possible.
• [1]: The string after USING is used to start a subprocess directly in the background to invoke commands. This does not involve invoking a shell, so certain shell syntaxes, such as input/output redirection and pipelines, are not supported. If you need to use a shell command and treat the actual command as data input, refer to the Generating data out of nothing section.
• [2]: The actual paths for Java and Python can be obtained from the JAVA_HOME and PYTHON_HOME environment variables.
137 posts | 20 followers
FollowAlibaba Cloud MaxCompute - January 29, 2024
Alibaba Cloud MaxCompute - January 29, 2024
Alibaba Cloud MaxCompute - February 18, 2024
Alibaba Cloud MaxCompute - February 7, 2024
Alibaba Cloud MaxCompute - January 22, 2024
Alibaba Cloud MaxCompute - February 7, 2024
137 posts | 20 followers
FollowConduct large-scale data warehousing with MaxCompute
Learn MoreAlibaba Cloud provides big data consulting services to help enterprises leverage advanced data technology.
Learn MoreAlibaba Cloud experts provide retailers with a lightweight and customized big data consulting service to help you assess your big data maturity and plan your big data journey.
Learn MoreRealtime Compute for Apache Flink offers a highly integrated platform for real-time data processing, which optimizes the computing of Apache Flink.
Learn MoreMore Posts by Alibaba Cloud MaxCompute