MaxCompute allows you to write user-defined table-valued functions (UDTFs) in Java or Python to extend the capabilities of MaxCompute functions and accommodate your business requirements. This topic describes the types, limits, usage notes, and development process of UDTFs. This topic also describes how to use UDTFs.
Background information
You can use UDTFs to return multiple values for a single data input. The input and
output data of a UDTF have a one-to-many relationship. Each time a UDTF reads a row
of data, the UDTF returns multiple values. The returned values are considered a table.
MaxCompute allows you to write UDTFs in Java or Python.
UDTF type | Description |
---|---|
Java UDTF | MaxCompute allows you to write UDTFs in Java. For more information, see Java UDTFs. |
Python UDTF | MaxCompute allows you to write UDTFs in Python 2 and Python 3.
|
The built-in functions of MaxCompute include some UDTFs, such as EXPLODE. For more information about built-in UDTFs, see Other functions or Complex type functions.
Limits
- You cannot access the Internet by using user-defined functions (UDFs). If you want to access the Internet by using UDFs, fill in the network connection application form based on your business requirements and submit the application. After the application is approved, the MaxCompute technical support team will contact you and help you establish network connections. For more information about how to fill in the network connection application form, see Network connection process.
- If you use a UDTF in a
SELECT
statement, you cannot specify other columns or use other expressions in this statement. The following sample code shows an incorrect SQL statement.-- The statement contains a UDTF and another column. select value, user_udtf(key) as mycol ...
- UDTFs cannot be nested. The following sample code shows an incorrect SQL statement.
-- A UDTF named user_udtf2 is nested in a UDTF named user_udtf1. select user_udtf1(user_udtf2(key)) as mycol...;
- A UDTF cannot be used with a
GROUP BY
,DISTRIBUTE BY
, orSORT BY
clause in the sameSELECT
statement. The following sample code shows an incorrect SQL statement.-- A UDTF is used together with a GROUP BY clause. select user_udtf(key) as mycol ... group by mycol;
Usage notes
Before you use UDFs, take note of the following items:
- UDFs cannot compete with built-in functions in performance. We recommend that you preferentially use built-in functions to implement your business logic.
- If you use a UDF in SQL statements, the memory usage of a computing job may exceed
the default allocated memory size if a large amount of data is computed and data skew
occurs. In this case, you can run the
set odps.sql.udf.joiner.jvm.memory=xxxx;
command at the session level to resolve the issue. For more information about the MaxCompute UDF FAQ, see FAQ about MaxCompute UDFs. - If the name of a UDF is the same as that of a built-in function, the UDF is preferentially
called. For example, if UDF CONCAT and built-in function CONCAT both exist in MaxCompute,
the system automatically calls UDF CONCAT instead of the built-in function CONCAT.
If you want to call the built-in function, you must add the symbol
::
before the built-in function, for example,select ::concat('ab', 'c');
.
Development process
The following figure shows how to write a MaxCompute UDTF in Java and Python.
- The following figure demonstrates how to write a MaxCompute UDF in Java.
No. Required Description Platform References 1 No Before you can use Maven to write code, you must add the related SDK dependencies to the POM file. This ensures that the code can be compiled. The following SDK dependency shows an example:
You can search for<dependency> <groupId>com.aliyun.odps</groupId> <artifactId>odps-sdk-udf</artifactId> <version>0.29.10-public</version> </dependency>
odps-sdk-udf
from Maven repositories to obtain the version of the SDK dependency.IntelliJ IDEA (Maven) None 2 Yes Write a UDF based on your business requirements. IntelliJ IDEA (Maven) and MaxCompute Studio Develop a UDF in Java 3 Yes Debug the UDF by running it on your on-premises machine or by performing unit testing to check whether the result meets expectations. 4 Yes Debug the UDF code to ensure that the code is packaged into a JAR file after it is successfully run on your on-premises machine. 5 Yes Upload the JAR file as a resource to your MaxCompute project. MaxCompute client, MaxCompute Studio, and DataWorks - MaxCompute client
- MaxCompute Studio
Package a Java program, upload the package, and create a MaxCompute UDF
6 Yes Create a UDF based on the JAR file that you uploaded. 7 No Call the UDF in the query data code. None - The following figure demonstrates how to write a MaxCompute UDF in Python.
No. Required Description Platform References 1 Yes Write a UDF based on your business requirements. MaxCompute Studio Develop a Python UDF 2 Yes Debug the UDF by running it on your on-premises machine or by performing unit testing to check whether the result meets expectations. 3 Yes Upload Python files or required resources, such as file resources, table resources, and third-party packages, to a MaxCompute project. MaxCompute client, MaxCompute Studio, and DataWorks - MaxCompute client
- MaxCompute Studio
4 Yes Create a UDF based on the uploaded Python files or required resources. 5 No Call the UDF in the query data code. None
Instructions
Use the following methods to call UDFs:
- Use a UDF in a MaxCompute project: The method is similar to that of using built-in functions.
- Use a UDF across projects: Use a UDF of Project B in Project A. The following statement
shows an example:
select B:udf_in_other_project(arg0, arg1) as res from table_t;
. For more information about resource sharing across projects, see Package-based resource sharing across projects.