MaxCompute allows you to write a user-defined function (UDF) in Python 3 to meet your business logic requirements. This topic describes how to write a UDF in Python 3.
UDF code structure
You can use MaxCompute Studio to write UDF code in Python 3. The UDF code must contain the following components:
Module import: required.
UDF code must include from odps.udf import annotate, which is used to import the function signature. This way, MaxCompute can identify the function signature that is defined in the code. If you want to reference files or tables in UDF code, the UDF code must also include from odps.distcache import get_cache_file or from odps.distcache import get_cache_table.
Function signature: required.
The function signature is in the @annotate(<signature>) format. The signature parameter is used to define the data types of the input parameters and return value of the UDF. For more information about function signatures, see Function signatures and data types.
Custom Python class: required.
A custom Python class is the organizational unit of UDF code. This class defines the variables and methods that are used to meet your business requirements. In UDF code, you can also reference third-party libraries that are installed in MaxCompute or reference files or tables. For more information, see Third-party libraries or Reference resources.
evaluate method: required.
The evaluate method is contained in the custom Python class. The evaluate method defines the input parameters and return value of the UDF. Each Python class can contain only one evaluate method.
Sample code:
# Import the function signature.
from odps.udf import annotate
# The function signature.
@annotate("bigint,bigint->bigint")
# The custom Python class.
class MyPlus(object):
    # The evaluate method.
    def evaluate(self, arg0, arg1):
        if None in (arg0, arg1):
            return None
        return arg0 + arg1
Limits
Access the Internet by using UDFs
By default, MaxCompute does not allow you to access the Internet by using UDFs. If you want to access the Internet by using UDFs, fill in the network connection application form based on your business requirements and submit the application. After the application is approved, the MaxCompute technical support team will contact you and help you establish network connections. For more information about how to fill in the network connection application form, see Network connection process.
Access a VPC by using UDFs
By default, MaxCompute does not allow you to access resources in VPCs by using UDFs. To use UDFs to access resources in a VPC, you must establish a network connection between MaxCompute and the VPC. For more information about related operations, see Use UDFs to access resources in VPCs.
Read table data by using UDFs, UDAFs, or UDTFs
You cannot use UDFs, UDAFs, or UDTFs to read data from the following types of tables:
Table on which schema evolution is performed
Table that contains complex data types
Table that contains JSON data types
Transactional table
Precautions
Python 3 is incompatible with Python 2. For this reason, you cannot use Python 2 code and Python 3 code in a single SQL statement.
Python Software Foundation announced the end of life (EOL) for Python 2 in early 2020. Therefore, for an existing MaxCompute project, we recommend that you port Python 2 UDFs. For a new project, we recommend that you use Python 3 to write all Python UDFs.
Development process
When you develop a UDF, you must make preparations, write UDF code, upload the Python program, create the UDF, debug the UDF, and call the UDF. MaxCompute allows you to use multiple tools to develop a UDF, such as MaxCompute Studio, DataWorks, and the MaxCompute client (odpscmd). This section provides examples on how to develop a UDF by using MaxCompute Studio, DataWorks, and the MaxCompute client (odpscmd).
Use MaxCompute Studio
Make preparations.
Before you use MaxCompute Studio to develop and debug a UDF, you must install MaxCompute Studio and connect MaxCompute Studio to a MaxCompute project. For more information about how to install MaxCompute Studio and connect MaxCompute Studio to a MaxCompute project, see the following topics:
Write UDF code.
In the Project section, right-click scripts under the MaxCompute script module and choose .
In the Create new MaxCompute python class dialog box, enter a class name in the Name field, select python UDF from the Kind drop-down list, and then click OK.
Write UDF code in the code editor.
from odps.udf import annotate
@annotate("string,bigint->string")
class GetUrlChar(object):
    def evaluate(self, url, n):
        if n == 0:
            return ""
        try:
            index = url.find(".htm")
            if index < 0:
                return ""
            a = url[:index]
            index = a.rfind("/")
            b = a[index + 1:]
            c = b.split("-")
            if len(c) < n:
                return ""
            return c[-n]
        except Exception:
            return "Internal error"
Note: You can debug the UDF on your on-premises machine if necessary. For more information, see Test the Python UDF.
Upload the Python program and create the UDF.
Right-click the desired Python program in the scripts folder and select Deploy to server…. In the Submit resource and register function dialog box, configure the name of the function and click OK. For more information, see Upload a Python program and create a MaxCompute UDF.
In this example, the function name is UDF_GET_URL_CHAR.
Call the UDF.
In the left-side navigation pane, click the Project Explore tab. Right-click the MaxCompute project to which the UDF belongs, select Open Console, enter the SQL statement that is used to call the UDF, and then press Enter to execute the SQL statement. Sample statement:
set odps.sql.python.version=cp37; -- Enable Python 3.
select UDF_GET_URL_CHAR("http://www.taobao.com/a.htm", 1);
The following result is returned:
+-----+
| _c0 |
+-----+
| a   |
+-----+
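The GetUrlChar logic used in this procedure can also be checked in a plain Python 3 interpreter by calling evaluate directly. The following is a minimal sketch, not the MaxCompute Studio test workflow. It assumes that the odps package might not be installed locally and, in that case, substitutes a no-op annotate decorator. That fallback is a hypothetical stand-in that only lets the class definition load.

```python
try:
    from odps.udf import annotate  # Present in the MaxCompute runtime.
except ImportError:
    # Assumption: local stand-in that ignores the signature
    # and returns the class unchanged.
    def annotate(signature):
        return lambda cls: cls


@annotate("string,bigint->string")
class GetUrlChar(object):
    def evaluate(self, url, n):
        if n == 0:
            return ""
        try:
            index = url.find(".htm")
            if index < 0:
                return ""
            a = url[:index]
            index = a.rfind("/")
            b = a[index + 1:]
            c = b.split("-")
            if len(c) < n:
                return ""
            return c[-n]
        except Exception:
            return "Internal error"


udf = GetUrlChar()
print(udf.evaluate("http://www.taobao.com/a.htm", 1))      # a
print(udf.evaluate("http://www.taobao.com/x-y-z.htm", 2))  # y
print(udf.evaluate(None, 1))                               # Internal error
```

A NULL input reaches evaluate as None, so url.find raises an exception and the UDF returns "Internal error" instead of failing the job.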
Use DataWorks
Make preparations.
Before you use DataWorks to develop and debug a UDF, you must activate DataWorks and associate a DataWorks workspace with a MaxCompute project. For more information, see DataWorks.
Write UDF code.
You can write UDF code by using a Python development tool and package the code as a code package. Sample UDF code:
from odps.udf import annotate
@annotate("string,bigint->string")
class GetUrlChar(object):
    def evaluate(self, url, n):
        if n == 0:
            return ""
        try:
            index = url.find(".htm")
            if index < 0:
                return ""
            a = url[:index]
            index = a.rfind("/")
            b = a[index + 1:]
            c = b.split("-")
            if len(c) < n:
                return ""
            return c[-n]
        except Exception:
            return "Internal error"
Upload the Python program and create the UDF.
You can upload the code package that you package in the DataWorks console and create the UDF. For more information, see the following topics:
Call the UDF.
After you create a UDF, you can create an ODPS SQL node in the DataWorks console. You can write and run SQL statements in the ODPS SQL node to call and debug the UDF. For more information about how to create an ODPS SQL node, see Develop a MaxCompute SQL task. Sample statement:
set odps.sql.python.version=cp37; -- Enable Python 3.
select UDF_GET_URL_CHAR("http://www.taobao.com/a.htm", 1);
Use the MaxCompute client (odpscmd)
Make preparations.
Before you use the MaxCompute client to develop and debug a UDF, you must download the MaxCompute client installation package (GitHub), install the MaxCompute client, and then configure the config file to connect to the MaxCompute project. For more information, see MaxCompute client (odpscmd).
Write UDF code.
You can write UDF code by using a Python development tool and package the code as a code package. Sample UDF code:
from odps.udf import annotate
@annotate("string,bigint->string")
class GetUrlChar(object):
    def evaluate(self, url, n):
        if n == 0:
            return ""
        try:
            index = url.find(".htm")
            if index < 0:
                return ""
            a = url[:index]
            index = a.rfind("/")
            b = a[index + 1:]
            c = b.split("-")
            if len(c) < n:
                return ""
            return c[-n]
        except Exception:
            return "Internal error"
Upload the Python program and create the UDF.
You can upload the Python code package that you created on the MaxCompute client and create the UDF. For more information, see the following topics:
Call the UDF.
After you create a UDF, you can write and run SQL statements to call and debug the UDF. Sample statement:
set odps.sql.python.version=cp37; -- Enable Python 3.
select UDF_GET_URL_CHAR("http://www.taobao.com/a.htm", 1);
Third-party libraries
NumPy is not installed in the Python 3 runtime environment in MaxCompute. To use a NumPy UDF, you must manually upload a NumPy wheel package. If you obtain this package from Python Package Index (PyPI) or an image, the package is named numpy-<Version>-cp37-cp37m-manylinux1_x86_64.whl. For more information about how to upload a file, see Resource operations or Reference third-party packages in Python UDFs.
For more information about standard libraries that are supported by Python 3, see The Python Standard Library.
Function signatures and data types
Format of function signatures:
@annotate(<signature>)
The signature parameter is a string that specifies the data types of input parameters and return value. When you run a UDF, the data types of the input parameters and return value of the UDF must be consistent with the data types specified in the function signature. The data type consistency is checked during semantic parsing. If the data types are inconsistent, an error is returned. Format of a signature:
'arg_type_list -> type'
Parameter description:
arg_type_list: specifies the data types of input parameters. If multiple input parameters are used, their data types are separated by commas (,). The following data types are supported: BIGINT, STRING, DOUBLE, BOOLEAN, DATETIME, DECIMAL, FLOAT, BINARY, DATE, DECIMAL(precision,scale), CHAR, and VARCHAR. Complex data types, such as ARRAY, MAP, and STRUCT, and nested complex data types are also supported.
arg_type_list can be represented by an asterisk (*) or left empty ('').
If arg_type_list is represented by an asterisk (*), any number of input parameters is allowed.
If arg_type_list is left empty (''), no input parameters are used.
type: specifies the data type of the return value. For a UDF, only one column of values is returned. The following data types are supported: BIGINT, STRING, DOUBLE, BOOLEAN, DATETIME, DECIMAL, FLOAT, BINARY, DATE, and DECIMAL(precision,scale). Complex data types, such as ARRAY, MAP, and STRUCT, and nested complex data types are also supported.
When you write UDF code, you can select a data type based on the MaxCompute data type edition that is used by your MaxCompute project. For more information about MaxCompute data type editions and the data types supported in each edition, see Data type editions.
The following table provides examples of valid function signatures.
Function signature | Description |
'bigint,double->string' | The data types of the input parameters are BIGINT and DOUBLE and the data type of the return value is STRING. |
'*->string' | Any number of input parameters is accepted and the data type of the return value is STRING. |
'->double' | No input parameters are used and the data type of the return value is DOUBLE. |
'array<bigint>->struct<x:string, y:int>' | The data type of the input parameter is ARRAY<BIGINT> and the data type of the return value is STRUCT<x:STRING, y:INT>. |
'->map<bigint, string>' | No input parameters are used and the data type of the return value is MAP<BIGINT, STRING>. |
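The asterisk and empty forms of arg_type_list can be illustrated with two small classes. This is a sketch under assumptions: the class names ConcatArgs and PiValue are hypothetical, the code assumes that a '*' signature delivers the arguments to evaluate as positional parameters, and the annotate fallback is a local stand-in for environments where the odps package is not installed.

```python
import math

try:
    from odps.udf import annotate  # Present in the MaxCompute runtime.
except ImportError:
    # Assumption: local stand-in that returns the class unchanged.
    def annotate(signature):
        return lambda cls: cls


@annotate("*->string")
class ConcatArgs(object):
    # '*' in the signature: evaluate accepts any number of arguments.
    def evaluate(self, *args):
        # Skip NULL inputs, which arrive as None.
        return "-".join(str(a) for a in args if a is not None)


@annotate("->double")
class PiValue(object):
    # Empty arg_type_list: evaluate takes no input parameters.
    def evaluate(self):
        return math.pi


print(ConcatArgs().evaluate("a", 1, None, 2.5))  # a-1-2.5
print(PiValue().evaluate())
```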
The following table describes the mappings between the data types that are supported in MaxCompute SQL and the Python 3 data types. You must write Python UDFs based on the mappings to ensure the consistency of data types.
MaxCompute SQL Type | Python 3 Type |
BIGINT | int |
STRING | str (Unicode) |
DOUBLE | float |
BOOLEAN | bool |
DATETIME | datetime.datetime |
FLOAT | float |
CHAR | str (Unicode) |
VARCHAR | str (Unicode) |
BINARY | bytes |
DATE | datetime.date |
DECIMAL | decimal.Decimal |
ARRAY | list |
MAP | dict |
STRUCT | collections.namedtuple |
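The mappings can be made concrete with a UDF that receives an ARRAY, a MAP, and a DATETIME value. The following is an illustrative sketch only: the class name SummarizeRow and the output format are hypothetical, and the annotate fallback is a local stand-in for environments where the odps package is not installed.

```python
import datetime

try:
    from odps.udf import annotate  # Present in the MaxCompute runtime.
except ImportError:
    # Assumption: local stand-in that returns the class unchanged.
    def annotate(signature):
        return lambda cls: cls


@annotate("array<bigint>,map<string,bigint>,datetime->string")
class SummarizeRow(object):
    def evaluate(self, numbers, weights, ts):
        # ARRAY arrives as list, MAP as dict, DATETIME as datetime.datetime.
        if None in (numbers, weights, ts):
            return None
        total = sum(n for n in numbers if n is not None)
        heaviest = max(weights, key=weights.get) if weights else ""
        return "%d|%s|%s" % (total, heaviest, ts.date().isoformat())


row = SummarizeRow().evaluate(
    [1, 2, None, 3],
    {"a": 1, "b": 5},
    datetime.datetime(2024, 1, 2, 3, 4, 5),
)
print(row)  # 6|b|2024-01-02
```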
Reference resources
You can reference files or tables in Python 3 UDF code by using the odps.distcache module.
odps.distcache.get_cache_file(resource_name, mode): returns the content of a specified file, opened in the mode that you specify.
resource_name is a string that specifies the name of an existing file resource in your MaxCompute project. If the resource name is invalid or the resource does not exist, an error is returned.
The value of mode is of the STRING type. Default value: 't'. If the value of mode is 't', the file is opened in text mode. If the value of mode is 'b', the file is opened in binary mode.
The return value is a file-like object. If this object is no longer used, you must call the close method to release the open file.
The following code shows how to reference a file.
from odps.udf import annotate
from odps.distcache import get_cache_file
@annotate('bigint->string')
class DistCacheExample(object):
    def __init__(self):
        cache_file = get_cache_file('test_distcache.txt')
        kv = {}
        for line in cache_file:
            line = line.strip()
            if not line:
                continue
            k, v = line.split()
            kv[int(k)] = v
        cache_file.close()
        self.kv = kv
    def evaluate(self, arg):
        return self.kv.get(arg)
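Because the file object must be closed after use, one option is to move the parsing into a helper that closes the file even if a line is malformed. The sketch below is an assumption-laden illustration: load_kv is a hypothetical helper name, and io.StringIO stands in for the object that get_cache_file returns so that the parsing can be tried outside MaxCompute.

```python
import io


def load_kv(cache_file):
    """Parse 'key value' lines from a file-like object into a dict, then close it."""
    kv = {}
    try:
        for line in cache_file:
            line = line.strip()
            if not line:
                continue
            k, v = line.split()
            kv[int(k)] = v
    finally:
        # Release the file even if parsing raises an exception.
        cache_file.close()
    return kv


# Assumption: local stand-in for get_cache_file('test_distcache.txt').
sample = io.StringIO("1 one\n\n2 two\n")
kv = load_kv(sample)
print(kv.get(2))  # two
```

Inside a UDF, __init__ could then set self.kv = load_kv(get_cache_file('test_distcache.txt')).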
odps.distcache.get_cache_table(resource_name): returns the content of a specified table.
resource_name specifies the name of the table in your MaxCompute project. If the table name is invalid or the table does not exist, an error is returned. Data of the following types in the table can be read: BIGINT, STRING, DOUBLE, BOOLEAN, DATETIME, FLOAT, CHAR, VARCHAR, BINARY, DATE, DECIMAL, ARRAY, MAP, and STRUCT.
The return value is a generator. The caller iterates over the generator to obtain the table content. Each iteration returns one record of the ARRAY type.
The following code shows how to reference a table.
from odps.udf import annotate
from odps.distcache import get_cache_table
@annotate('->string')
class DistCacheTableExample(object):
    def __init__(self):
        self.records = list(get_cache_table('udf_test'))
        self.counter = 0
        self.ln = len(self.records)
    def evaluate(self):
        if self.counter > self.ln - 1:
            return None
        ret = self.records[self.counter]
        self.counter += 1
        return str(ret)
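A referenced table is often used as a lookup dictionary instead of being returned record by record. The following sketch shows that pattern under stated assumptions: build_lookup and fake_cache_table are hypothetical names, the table is assumed to have at least two columns (key first, value second), and the local generator only imitates the record-per-iteration behavior described for get_cache_table.

```python
def build_lookup(records):
    """Map the first column of each record to its second column."""
    lookup = {}
    for record in records:
        # Each record is an indexable sequence of column values.
        lookup[record[0]] = record[1]
    return lookup


def fake_cache_table():
    # Assumption: local stand-in for get_cache_table('udf_test').
    yield (1, "one")
    yield (2, "two")


lookup = build_lookup(fake_cache_table())
print(lookup.get(2))  # two
print(lookup.get(3))  # None
```

Inside a UDF, __init__ could call build_lookup(get_cache_table('udf_test')) once, and evaluate could then return self.lookup.get(arg) for each row.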
Usage notes
After you develop a Python 3 UDF, you can use MaxCompute SQL to call the UDF. For more information about how to call a Python 3 UDF, see Development process. To call a Python 3 UDF, enable Python 3 and then call the UDF:
Enable Python 3
By default, a MaxCompute project uses Python 2 for Python UDFs. If you want to use UDFs that are written in Python 3, add the following command before the SQL statement that you want to execute. Then, commit and execute the command and the statement together.
set odps.sql.python.version=cp37;
Call a UDF
Use a UDF in a MaxCompute project: The method is similar to that of using built-in functions.
Use a UDF across projects: Use a UDF of Project B in Project A. The following statement shows an example:
select B:udf_in_other_project(arg0, arg1) as res from table_t;
For more information about cross-project sharing, see Cross-project resource access based on packages.
Port Python 2 UDFs
Python Software Foundation announced the EOL for Python 2 in early 2020. Therefore, we recommend that you port Python 2 UDFs.
In a new project or an existing project for which you write UDFs in Python for the first time, we recommend that you use Python 3 to write all Python UDFs.
In an existing project where a large number of Python 2 UDFs exist, proceed with caution when you enable Python 3. If you want to replace Python 2 UDFs with Python 3 UDFs, use the following methods:
Use Python 3 to write new UDFs and enable Python 3 for new jobs at the session level. For more information about how to enable Python 3, see Enable Python 3.
Rewrite Python 2 UDFs in a manner in which the UDFs are compatible with Python 2 and Python 3. For more information about how to rewrite UDFs, see Porting Python 2 Code to Python 3.
Note: If you want to write a public UDF that is shared among multiple projects, we recommend that you use a UDF that is compatible with Python 2 and Python 3.
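Division is one place where Python 2 and Python 3 behave differently, so it is a common target when a UDF is rewritten to run on both. The following sketch shows one possible approach, not a complete porting guide. The class names Ratio and FloorDiv are hypothetical, and the annotate fallback is a local stand-in for environments where the odps package is not installed.

```python
try:
    from odps.udf import annotate  # Present in the MaxCompute runtime.
except ImportError:
    # Assumption: local stand-in that returns the class unchanged.
    def annotate(signature):
        return lambda cls: cls


@annotate("bigint,bigint->double")
class Ratio(object):
    def evaluate(self, numerator, denominator):
        if numerator is None or denominator is None or denominator == 0:
            return None
        # float() forces true division on Python 2 as well as Python 3.
        return float(numerator) / denominator


@annotate("bigint,bigint->bigint")
class FloorDiv(object):
    def evaluate(self, numerator, denominator):
        if numerator is None or denominator is None or denominator == 0:
            return None
        # '//' is floor division in both Python 2 and Python 3.
        return numerator // denominator


print(Ratio().evaluate(1, 2))     # 0.5
print(FloorDiv().evaluate(7, 2))  # 3
```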