This topic describes how to use a PyODPS node in DataWorks to reference a third-party package. You can reference a common Python script or a third-party open source package.
Background information
If a third-party package needs to be referenced when you run a PyODPS node on a DataWorks resource group, you need to install the third-party package based on the resource group that you use.
If you use an exclusive resource group for scheduling, install the third-party package on the O&M Assistant page.
If you use a new-version resource group (general-purpose resource group), follow the instructions for custom images to install the third-party package.
Note: If a third-party package needs to be referenced in user-defined functions (UDFs) in task code, you cannot use the preceding methods to install the third-party package. For information about how to reference third-party packages in UDFs, see Example: Reference third-party packages in Python UDFs.
If your PyODPS node needs to access a data source or service in a special network environment, such as a virtual private cloud (VPC) or data center, use a new-version resource group or an old-version exclusive resource group for scheduling to run the node, and establish a network connection between the resource group and the data source or service.
For information about the PyODPS syntax, see PyODPS documentation.
PyODPS nodes are classified into two types: PyODPS 2 and PyODPS 3. The two types of PyODPS nodes use different Python versions at the underlying layer. PyODPS 2 nodes use Python 2, and PyODPS 3 nodes use Python 3. You can create a PyODPS node based on the Python version in use. For more information about how to create a PyODPS node, see Create a PyODPS 2 node and Create a PyODPS 3 node.
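If it is unclear which interpreter a node runs on, the major version can be read from the standard library. The following is a minimal sketch; the variable names are illustrative and are not part of DataWorks or PyODPS.

```python
import sys

# Major version of the interpreter that is executing this code.
python_major = sys.version_info[0]

# Map the interpreter to the node type described above:
# Python 2 corresponds to PyODPS 2, Python 3 to PyODPS 3.
matching_node_type = 'PyODPS 2' if python_major == 2 else 'PyODPS 3'

print('Python %d is running, which matches a %s node.' % (python_major, matching_node_type))
```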
Limits
Due to the specifications of resources in the resource group that is used to run a node, we recommend that you use a PyODPS node to locally process no more than 50 MB of data. If a PyODPS node processes more than 50 MB of data, an out-of-memory (OOM) exception may occur, and the system may report Got killed. We recommend that you do not write excessive data processing code for a PyODPS node. For more information, see Overview.
If the system reports Got killed, memory usage has exceeded the limit and the system has terminated the related processes. We recommend that you avoid local data operations. The memory limit does not apply to SQL or DataFrame tasks that are initiated by PyODPS, because these tasks run in MaxCompute. The exception is to_pandas, which downloads data to the node and is therefore subject to the limit.
You can use the NumPy and pandas libraries that are pre-installed in DataWorks to run functions other than UDFs on an exclusive resource group for scheduling. Third-party packages that contain binary code are not supported.
For compatibility reasons, options.tunnel.use_instance_tunnel is set to False in DataWorks by default. If you want to globally enable InstanceTunnel, you must set this parameter to True.
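The limits above can be illustrated with a short sketch: enable InstanceTunnel globally, then let MaxCompute do the aggregation so that only a one-row result reaches the node. The helper names configure_instance_tunnel and count_rows are hypothetical, as is the table name in the usage comment. The sketch assumes that the MaxCompute entry object is available as o inside the node and that options is the object imported from the odps package.

```python
def configure_instance_tunnel(options, enabled=True):
    """Globally switch InstanceTunnel on or off (hypothetical helper).

    `options` is expected to be the object obtained with
    `from odps import options`.
    """
    options.tunnel.use_instance_tunnel = enabled


def count_rows(o, table_name):
    """Count rows inside MaxCompute and fetch only the single result row.

    `o` is assumed to be the MaxCompute entry object of the node.
    """
    sql = 'SELECT COUNT(*) AS cnt FROM %s' % table_name
    with o.execute_sql(sql).open_reader() as reader:
        for record in reader:
            return record[0]
    return None


# Possible usage inside a PyODPS node (not executed here):
#   from odps import options
#   configure_instance_tunnel(options)
#   print(count_rows(o, 'my_table'))  # 'my_table' is a hypothetical table name
```

Because the COUNT runs as a SQL task, the node holds only one small record in memory instead of the table data.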
Reference a common Python script
Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
Create a Python resource
On the DataStudio page, move the pointer over the icon and choose .
Alternatively, you can click the name of the desired workflow in the Business Flow section, right-click MaxCompute, and then choose .
In the Create Resource dialog box, configure the Name parameter. In this example, the Name parameter is set to pyodps_packagetest.py.
Important: The resource name can contain only letters, digits, periods (.), underscores (_), and hyphens (-). It must end with .py.
Click Create.
On the configuration tab of the newly created Python resource, enter the common Python script that you want to reference. In this example, the following script is used:
# import os
# print os.getcwd()
# print os.path.abspath('.')
# print os.path.abspath('..')
# print os.path.abspath(os.curdir)
def printname():
    print 'test2'
print 123
Click the icon in the top toolbar.
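The example script uses Python 2 print statements, so it suits a PyODPS 2 node. If the same resource is meant to be referenced from a PyODPS 3 node, a variant along the following lines should work. This is a sketch, not part of the original example. It relies only on the fact that a print call with a single parenthesized argument behaves the same under Python 2 and Python 3.

```python
# Variant of the example resource that runs under both Python 2 and Python 3.

def printname():
    # A single parenthesized argument prints identically on both interpreters.
    print('test2')


# Module-level statement: executed once, when the resource is imported.
print(123)
```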
Create a PyODPS 2 node.
In the Business Flow section, find the workflow in which you want to create a PyODPS 2 node, right-click MaxCompute, and then choose .
In the Create Node dialog box, configure the Name parameter. In this example, the Name parameter is set to pyodps_testpackage.
Note: The node name cannot exceed 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
Click Confirm.
Open the configuration tab of the newly created PyODPS 2 node. Then, right-click the name of the Python resource in the Resource folder of your workflow and select Insert Resource Path.
After the resource is referenced, the ##@resource_reference{"pyodps_packagetest.py"} statement is automatically written in the code editor of the PyODPS 2 node.
Enter the code that is used to reference the common Python script in the code editor of the PyODPS 2 node. In this example, the following code is used:
##@resource_reference{"pyodps_packagetest.py"} # This statement is required to reference the created Python resource.
import sys
import os
sys.path.append(os.path.dirname(os.path.abspath('pyodps_packagetest.py'))) # Import the resource to the workspace.
import pyodps_packagetest # Reference the resource. You must delete the .py suffix in the resource name.
pyodps_packagetest.printname() # Call the method.
Click the icon in the top toolbar and view the results on the Runtime Log tab in the lower part of the configuration tab.
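The node code works because the directory that contains the resource file is appended to sys.path, after which the file can be imported under its name without the .py suffix. The following self-contained sketch reproduces that mechanism outside DataWorks with a temporary file. The file name demo_packagetest.py and its content are stand-ins invented for the demonstration, and the function returns its value instead of printing so that the result can be inspected.

```python
import importlib
import os
import sys
import tempfile

# Stand-in for the resource file that DataWorks places next to the node code.
resource_name = 'demo_packagetest.py'  # hypothetical file name
work_dir = tempfile.mkdtemp()
resource_path = os.path.join(work_dir, resource_name)
with open(resource_path, 'w') as handle:
    handle.write("def printname():\n    return 'test2'\n")

# Same idea as in the node: add the directory of the file to the search path.
resource_dir = os.path.dirname(os.path.abspath(resource_path))
if resource_dir not in sys.path:
    sys.path.append(resource_dir)

# The module name is the file name without the .py suffix.
module_name = os.path.splitext(resource_name)[0]
module = importlib.import_module(module_name)

result = module.printname()
print(result)
```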
Reference a third-party open source package
Before you reference a third-party open source package, you must use pip to install the package and configure the package based on the resource group that you select.
Use a new-version resource group (general-purpose resource group) to configure a third-party open source package
New-version resource groups support custom images. When you create a custom image, you can select and install a third-party open source package based on your business requirements. Then, you can use the image when you configure scheduling properties for a PyODPS node.
Configuration description
In the Create Image panel, configure a third-party open source package based on your business requirements.
Key parameters:
Image Name/ID: Set the value to dataworks_pyodps_task_pod.
Supported Task Type: Select PyODPS 2 or PyODPS 3.
Installation Package: Select the desired third-party open source package.
Use an exclusive resource group for scheduling to configure a third-party open source package
If you use an exclusive resource group for scheduling, install the desired third-party open source package on the O&M Assistant page of that resource group. For more information, see Use the O&M Assistant feature.
PyODPS nodes include PyODPS 2 nodes and PyODPS 3 nodes.
If you want to use a PyODPS 2 node to reference the third-party open source package, run the following command to install the package:
pip install <Package that you want to reference> -i https://pypi.tuna.tsinghua.edu.cn/simple
If you are prompted to upgrade pip after you run the preceding command, run the following command:
pip install --upgrade pip -i https://pypi.tuna.tsinghua.edu.cn/simple
If you want to use a PyODPS 3 node to reference the third-party open source package, run the following command to install the package:
/home/tops/bin/pip3 install <Package that you want to reference> -i https://pypi.tuna.tsinghua.edu.cn/simple
After the package is installed, use an import statement in the node code to import the package. For example, run the pip3 install oss2 command on the O&M Assistant page to install the oss2 package. Then, run the import oss2 statement in the PyODPS 3 node to import and reference oss2.
If you are prompted to upgrade pip after you run the installation command, run the following command:
/home/tops/bin/pip3 install --upgrade pip -i https://pypi.tuna.tsinghua.edu.cn/simple
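After installation, a quick check from inside the node can confirm that the package is importable by the interpreter that actually runs the node. The following is a minimal sketch. check_package is a hypothetical helper, and oss2 is used only because it is the package named in the example above.

```python
import importlib
import sys


def check_package(name):
    """Return the version string of an importable package, or None.

    Returns 'unknown' when the package imports but exposes no __version__.
    """
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, '__version__', 'unknown')


# Path of the interpreter that runs this code. Comparing it with the pip
# binary used for installation helps spot a Python 2 / Python 3 mismatch.
print('Interpreter: %s' % sys.executable)

version = check_package('oss2')
if version is None:
    print('oss2 is not importable from this interpreter.')
else:
    print('oss2 is available, version: %s' % version)
```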
If the following error occurs when you use the PyODPS 3 node, submit a ticket to apply for permissions:
"/home/admin/usertools/tools/cmd-0.sh:Line 3: /home/tops/bin/python3: The file or directory does not exist."