This topic describes how to reference a third-party package in a PyODPS node. For more information about how to generate a third-party package for PyODPS, see Generate a third-party package for PyODPS.
Prerequisites
MaxCompute is activated. For more information, see Activate MaxCompute.
DataWorks is activated. For more information, see Activate DataWorks.
Upload a third-party package
Before you reference a third-party package, you must make sure that the package has been uploaded to MaxCompute as an archive resource. You can upload a third-party package by using one of the following methods:
Use code to upload a third-party package. In the following sample code, replace packages.tar.gz with the path and name of the package that you want to upload.

import os
from odps import ODPS

# Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_ID to the AccessKey ID of your Alibaba Cloud account.
# Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_SECRET to the AccessKey secret of your Alibaba Cloud account.
# We recommend that you do not directly use your AccessKey ID or AccessKey secret.
o = ODPS(
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='<your-default-project>',
    endpoint='<your-end-point>',
)
o.create_resource("test_packed.tar.gz", "archive", fileobj=open("packages.tar.gz", "rb"))
Use DataWorks to upload a third-party package. For more information, see Step 1: Create a resource or upload an existing resource.
Reference a third-party package in a Python UDF
Before you reference a third-party package in a Python user-defined function (UDF), you must modify the Python UDF. Procedure:
1. Add a reference to the third-party package in the __init__ method of the UDF class.
2. Reference the third-party package in the UDF code, such as the evaluate function or the process method.
Example
In this example, a Python UDF references the SciPy package and calls the psi function in SciPy.
Run the following command to package SciPy.
pyodps-pack -o scipy-bundle.tar.gz scipy
Write the following code and save it as a file named test_psi_udf.py.

import sys
from odps.udf import annotate

@annotate("double->double")
class MyPsi(object):
    def __init__(self):
        # Add the path of the package to the reference path.
        sys.path.insert(0, "work/scipy-bundle.tar.gz/packages")

    def evaluate(self, arg0):
        # Put the import statement inside the evaluate method.
        from scipy.special import psi
        return float(psi(arg0))
Code description: The __init__ method adds work/scipy-bundle.tar.gz/packages to sys.path. This is because MaxCompute decompresses all archive resources that are referenced by the UDF into folders in the work directory, and each folder has the same name as the corresponding resource. The packages directory is the subdirectory of the package that is generated by using pyodps-pack. The import statement for SciPy is placed inside the evaluate method body because the third-party package is available only at runtime. When the UDF is parsed on the MaxCompute server, the parsing environment does not contain the third-party package, so an import statement outside the method body causes an error. The resulting layout is sketched below.
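For illustration, the layout that the UDF sees at runtime looks roughly like the following sketch. Only work, the resource name, and the packages subdirectory come from the description above; the contents under packages are hypothetical:

work/
    scipy-bundle.tar.gz/        # the archive resource, decompressed by MaxCompute
        packages/               # subdirectory generated by pyodps-pack
            scipy/
            ...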
Upload test_psi_udf.py as a MaxCompute Python resource and upload scipy-bundle.tar.gz as an archive resource.
Create a UDF named test_psi_udf, reference the two uploaded resource files, and specify the class name as test_psi_udf.MyPsi.
You can perform Step 3 and Step 4 in a PyODPS node or on the MaxCompute client.
In a PyODPS node:
import os
from odps import ODPS

# Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_ID to the AccessKey ID of your Alibaba Cloud account.
# Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_SECRET to the AccessKey secret of your Alibaba Cloud account.
# We recommend that you do not directly use your AccessKey ID or AccessKey secret.
o = ODPS(
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='<your-default-project>',
    endpoint='<your-end-point>',
)
bundle_res = o.create_resource(
    "scipy-bundle.tar.gz", "archive", fileobj=open("scipy-bundle.tar.gz", "rb")
)
udf_res = o.create_resource(
    "test_psi_udf.py", "py", fileobj=open("test_psi_udf.py", "rb")
)
o.create_function(
    "test_psi_udf", class_type="test_psi_udf.MyPsi", resources=[bundle_res, udf_res]
)
On the MaxCompute client:
add archive scipy-bundle.tar.gz;
add py test_psi_udf.py;
create function test_psi_udf as test_psi_udf.MyPsi using test_psi_udf.py,scipy-bundle.tar.gz;
After you complete the preceding operations, use the UDF to execute SQL statements:

set odps.pypy.enabled=false;
set odps.isolation.session.enable=true;
select test_psi_udf(sepal_length) from iris;
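If you want to run the same query from a PyODPS node instead of the MaxCompute client, the following is a minimal sketch. It assumes the o entry object created as in the earlier snippets and an existing iris table with a sepal_length column:

# A minimal sketch, assuming the `o` entry object from the earlier snippets
# and an existing `iris` table with a `sepal_length` column.
hints = {
    "odps.pypy.enabled": "false",
    "odps.isolation.session.enable": "true",
}
with o.execute_sql(
    "select test_psi_udf(sepal_length) from iris", hints=hints
).open_reader() as reader:
    for record in reader:
        print(record[0])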
Reference a third-party package in PyODPS DataFrame
You can reference a third-party package in PyODPS DataFrame by specifying the libraries parameter in the execute or persist method. This section uses the map method as an example. The procedure is similar if you use the apply or map_reduce method.
Run the following command to package SciPy:
pyodps-pack -o scipy-bundle.tar.gz scipy
A table named test_float_col is available. This table contains only one column, which is of the FLOAT type.

   col1
0  3.75
1  2.51
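If you want to reproduce this example, the following sketch creates and populates the table. The o entry object is assumed to be created as in the earlier snippets; DOUBLE is used in the schema for simplicity, so substitute FLOAT if your project enables MaxCompute 2.0 data types:

# A sketch that creates the sample table; `o` is assumed to be an existing
# ODPS entry object. DOUBLE is used for simplicity; use FLOAT if your project
# enables MaxCompute 2.0 data types.
table = o.create_table("test_float_col", "col1 double", if_not_exists=True)
with table.open_writer() as writer:
    writer.write([[3.75], [2.51]])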
Run the following code to calculate the value of psi(col1):

import os
from odps import ODPS, options

def my_psi(v):
    from scipy.special import psi
    return float(psi(v))

# If the isolation feature is enabled for your project, you are not required to configure the following option:
options.sql.settings = {"odps.isolation.session.enable": True}

# Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_ID to the AccessKey ID of your Alibaba Cloud account.
# Set the environment variable ALIBABA_CLOUD_ACCESS_KEY_SECRET to the AccessKey secret of your Alibaba Cloud account.
# We recommend that you do not directly use your AccessKey ID or AccessKey secret.
o = ODPS(
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='<your-default-project>',
    endpoint='<your-end-point>',
)
df = o.get_table("test_float_col").to_df()
# Run the following code and obtain the result.
df.col1.map(my_psi).execute(libraries=["scipy-bundle.tar.gz"])
# Save the result to another table.
df.col1.map(my_psi).persist("result_table", libraries=["scipy-bundle.tar.gz"])
Optional. If you want to use the same third-party package for all subsequent DataFrame operations, you can set a global option.

from odps import options

options.df.libraries = ["scipy-bundle.tar.gz"]
After you set this global option, PyODPS DataFrame methods can reference the third-party package without the libraries parameter.
Reference a third-party package in DataWorks
A DataWorks PyODPS node provides built-in third-party packages and also provides the load_resource_package method for you to reference other packages. For more information, see Use a third-party package.
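The following is a hypothetical sketch of how such a call can look inside a DataWorks PyODPS node, assuming the archive resource scipy-bundle.tar.gz has been uploaded to the workspace; check the linked topic for the exact usage of load_resource_package:

# A hypothetical sketch for a DataWorks PyODPS node; the resource name is an
# assumption. See the linked topic for the exact usage of load_resource_package.
load_resource_package("scipy-bundle.tar.gz")
from scipy.special import psi
print(float(psi(1.0)))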
Manually upload and reference a third-party package
You can follow the instructions in this section to maintain an existing project or environment. For newly created projects, we recommend that you use pyodps-pack.
In some existing projects, you may manually upload all WHL dependencies and reference them in your code, or you may use an early version of MaxCompute that does not support binary packages. This section describes how to manually upload and reference a third-party package in these projects. In this example, python_dateutil is used in the map method.
Run the pip download command in Linux Bash to download the third-party package and its dependencies to a directory. Two packages are downloaded: six-1.10.0-py2.py3-none-any.whl and python_dateutil-2.5.3-py2.py3-none-any.whl.

pip download python-dateutil -d /to/path/
Note: The downloaded packages must support the Linux operating system. We recommend that you run this command on Linux, or request Linux-compatible wheels as shown in the sketch below.
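If you cannot run the command on Linux, pip can be asked to download wheels for a specific platform. The following is a sketch, assuming a pip version that supports these flags; the tags match the Python 3.7 rows in the table later in this topic:

pip download python-dateutil -d /to/path/ --only-binary :all: --implementation cp --python-version 37 --abi cp37m --platform manylinux1_x86_64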
Upload the downloaded packages to MaxCompute.
Method 1: Use code.
# You must make sure that file name extensions are valid.
odps.create_resource('six.whl', 'file', file_obj=open('six-1.10.0-py2.py3-none-any.whl', 'rb'))
odps.create_resource('python_dateutil.whl', 'file', file_obj=open('python_dateutil-2.5.3-py2.py3-none-any.whl', 'rb'))
Method 2: Use DataWorks.
You can upload and submit the destination resources by following instructions in Step 1: Create a resource or upload an existing resource.
Reference the third-party package.
In this example, a DataFrame object contains only a field of the STRING type. The field contains the following content:

               datestr
0  2016-08-26 14:03:29
1  2015-08-26 14:03:29
Specify the third-party libraries in the global configuration:

from odps import options

def get_year(t):
    from dateutil.parser import parse
    return parse(t).strftime('%Y')

options.df.libraries = ['six.whl', 'python_dateutil.whl']
df.datestr.map(get_year).execute()

The following result is returned:

  datestr
0    2016
1    2015
Alternatively, specify the libraries in the libraries parameter of the method that triggers execution, such as execute:

def get_year(t):
    from dateutil.parser import parse
    return parse(t).strftime('%Y')

df.datestr.map(get_year).execute(libraries=['six.whl', 'python_dateutil.whl'])

The following result is returned:

  datestr
0    2016
1    2015
By default, PyODPS supports third-party libraries that contain only pure Python code and do not involve file operations. Later versions of MaxCompute also support Python libraries that contain binary code or involve file operations. Such libraries must have names that end with specific platform suffixes. The following table describes the suffixes that are supported for each platform and Python version.
Platform | Python version | Supported suffix |
RHEL 5 x86_64 | Python 2.7 | cp27-cp27m-manylinux1_x86_64 |
RHEL 5 x86_64 | Python 3.7 | cp37-cp37m-manylinux1_x86_64 |
RHEL 7 x86_64 | Python 2.7 | cp27-cp27m-manylinux1_x86_64, cp27-cp27m-manylinux2010_x86_64, cp27-cp27m-manylinux2014_x86_64 |
RHEL 7 x86_64 | Python 3.7 | cp37-cp37m-manylinux1_x86_64, cp37-cp37m-manylinux2010_x86_64, cp37-cp37m-manylinux2014_x86_64 |
RHEL 7 Arm64 | Python 3.7 | cp37-cp37m-manylinux2014_aarch64 |
All WHL packages must be uploaded to MaxCompute as archive resources. Before you upload a package, you must convert it into a ZIP file by changing the file name extension. You must also set the odps.isolation.session.enable parameter to true for the job or for your project. The following example shows how to upload SciPy and use its special functions:
# Packages that contain binary code must be uploaded as archive resources. You must convert WHL packages into ZIP files before you upload the packages.
odps.create_resource('scipy.zip', 'archive', file_obj=open('scipy-0.19.0-cp27-cp27m-manylinux1_x86_64.whl', 'rb'))

# If the isolation feature is enabled for your project, you are not required to configure the following option:
options.sql.settings = {'odps.isolation.session.enable': True}

def my_psi(value):
    # We recommend that you put the import statement inside the function to import third-party libraries. This prevents runtime errors caused by structural differences of binary packages between operating systems.
    from scipy.special import psi
    return float(psi(value))

df.float_col.map(my_psi).execute(libraries=['scipy.zip'])
If a package that contains binary code is distributed only as source code, you can build it into a WHL file by running the following shell command on Linux and then upload the file. WHL files generated in macOS or Windows cannot be used in MaxCompute.

python setup.py bdist_wheel