If a task that is run on an exclusive resource group for scheduling depends on a third-party package, you must use the O&M Assistant feature to install the third-party package on the exclusive resource group for scheduling first. This ensures that the task can be run as expected. DataWorks O&M Assistant provides multiple built-in third-party packages. You can directly install and use a built-in third-party package. If the built-in third-party packages cannot meet your business requirements, you can use a Shell command to upload the required package or resource file for installation.
Prerequisites
An exclusive resource group for scheduling is purchased. Only exclusive resource groups for scheduling support the O&M Assistant feature. For information about how to purchase an exclusive resource group for scheduling, see Create and use an exclusive resource group for scheduling.
The AliyunDataWorksFullAccess policy or a policy that contains the ModifyResourceGroup permission is attached to the account that you want to use. For more information, see Manage permissions on the DataWorks services and the entities in the DataWorks console by using RAM policies.
Limits
You must take note of the following limits when you use the O&M Assistant feature:
You can use the O&M Assistant feature only on an exclusive resource group for scheduling. You cannot use the feature on an exclusive resource group for Data Integration or a serverless resource group.
You can create commands that install third-party packages, but you cannot modify a command after it is created.
You can upload only a resource file whose size is no more than 50 MB to an exclusive resource group for scheduling.
Note: If you want to upload a MaxCompute resource file whose size is greater than 50 MB, you can upload the resource file in a visualized manner in the DataWorks console. For more information, see Create and use MaxCompute resources.
A third-party Python package that is installed on an exclusive resource group for scheduling by using the O&M Assistant feature can be referenced only when the exclusive resource group for scheduling is used to run PyODPS tasks.
Note: For information about how to reference a third-party Python package in a Python user-defined function (UDF) of MaxCompute, see Example: Reference third-party packages in Python UDFs.
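The 50 MB upload limit listed above can be checked locally before you upload a file. The following Shell sketch is illustrative only: the function name fits_upload_limit and the variable MAX_BYTES are hypothetical, and the sketch assumes that 50 MB means 50 x 1024 x 1024 bytes, which the limit description does not specify.

```shell
#!/usr/bin/env bash
# Hypothetical pre-upload check; not part of DataWorks.
# Assumption: 50 MB is interpreted as 50 * 1024 * 1024 bytes.
MAX_BYTES=$((50 * 1024 * 1024))

# Returns 0 if the file is within the limit, 1 if it is too large,
# and 2 if the path is not a regular file.
fits_upload_limit() {
  local file="$1"
  [ -f "$file" ] || return 2
  local size
  # wc -c prints the byte count; tr strips padding that some platforms add.
  size=$(wc -c < "$file" | tr -d '[:space:]')
  [ "$size" -le "$MAX_BYTES" ]
}
```

For example, fits_upload_limit ./my_package.tar.gz && echo "OK to upload" prints the message only when the file (a placeholder name here) is small enough.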
Go to the O&M Assistant page
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, click Resource Group.
On the Exclusive Resource Groups tab of the Resource Groups page, find the resource group whose purpose is Data Scheduling, click the icon in the Actions column of the resource group, and then select O&M Assistant.
You can create a command based on your business requirements and run the command to install the third-party package that is required to run tasks.
Note: You can use the O&M Assistant feature only on an exclusive resource group for scheduling. You cannot use the feature on an exclusive resource group for Data Integration or a serverless resource group.
Install a third-party package
Create a command that can be used to install a third-party package.
The installed third-party package can be referenced when the exclusive resource group for scheduling is used to run scheduling tasks. On the O&M Assistant page, click Create Command. In the Create Command panel, you can set the Command Type parameter to Quick Installation or Manual Installation.
Method 1: Quick Installation
This method is used to install a built-in third-party package.
The following table describes the key parameters that you must configure when you install a built-in third-party package by using this method.
Parameter
Description
Command Name
The name of the command. You can specify a name based on your business requirements.
Command Type
The method used to create the command. Select Quick Installation.
If you select this method, DataWorks automatically generates a Shell command that can be used to install the third-party package you select.
Package to Install
The third-party package that you want to install and the version of the package.
DataWorks provides multiple built-in third-party packages of the Python2, Python3, and Yum types. You can select a built-in third-party package based on your business requirements. Examples of commonly used built-in third-party packages:
aliyun-python-sdk-core: the core library of Alibaba Cloud SDK for Python. This library provides the basic API operation call and authentication features and is the basic library used for interactions with Alibaba Cloud services.
NumPy: the basic library used for scientific computing and data analysis. This library provides a high-performance computing feature for multi-dimensional arrays and numeric values.
Pandas: provides high-performance, easy-to-use data structure and data analysis tools and is used to process and analyze structured data.
You can view all built-in third-party packages that are supported in the DataWorks console.
Generated Shell Script
The Shell command that is generated by DataWorks based on the built-in third-party package you select.
You can run the Shell command to install the built-in third-party package.
For example, after you select the aliyun-python-sdk-core package, DataWorks automatically generates the pip install aliyun-python-sdk-core command. You can run the command to install the package.
Timeout
The timeout period for running the command. Unit: seconds. If the running duration of the command exceeds the timeout period that you specify, the command is forcefully stopped.
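To estimate a suitable Timeout value, you can rehearse a command on your own machine with a time limit. The sketch below uses the GNU coreutils timeout utility as a stand-in. It is not how DataWorks enforces the Timeout parameter, and the wrapper name run_with_timeout is hypothetical. GNU timeout exits with status 124 when it stops a command that ran past the limit.

```shell
#!/usr/bin/env bash
# Local rehearsal only: GNU coreutils `timeout` stands in for the Timeout
# parameter. This is an analogy, not the DataWorks implementation.

# Usage: run_with_timeout <seconds> <command> [args...]
run_with_timeout() {
  local seconds="$1"
  shift
  local status=0
  # timeout sends SIGTERM to the command once the limit is reached.
  timeout "$seconds" "$@" || status=$?
  if [ "$status" -eq 124 ]; then
    echo "command exceeded ${seconds}s and was stopped" >&2
  fi
  return "$status"
}
```

For example, run_with_timeout 600 pip install aliyun-python-sdk-core rehearses the generated command with a 600-second limit. The value 600 is an arbitrary example.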
Method 2: Manual Installation
If the built-in third-party packages cannot meet your business requirements, you can manually specify a Shell command and use the command to upload the required package or resource file.
Note: If you use this method, you cannot run pip commands to install third-party packages.
Parameter
Description
Command Name
The name of the command. You can specify a name based on your business requirements.
Command Type
Select Manual Installation.
You must manually enter the Shell command that you want to run to upload a package or resource file from your on-premises machine.
Command Content
Enter the content of the command that you want to run. Example:
yum install -y git
Note: The success rate of manually entered Shell commands cannot be guaranteed.
After a resource file is uploaded, you must use the absolute path of the resource when you reference the resource in nodes and tasks in DataStudio.
Installation Directories
The directory to which you want to install the third-party package by using the command. DataWorks automatically adds the directory to the related directory whitelist to ensure that the directory is accessible. If you specify multiple directories, separate them with semicolons (;).
Note: You can install the third-party package in a /home/ directory or a directory other than /home/. If you want to install the third-party package in a /home/ directory, you can use only the /home/admin/usertools/tools/ path of the exclusive resource group for scheduling.
If no directory is specified, the third-party package is automatically installed in the /home/admin/usertools/tools/ path.
Timeout
The timeout period for running the command. Unit: seconds. If the running duration of the command exceeds the timeout period that you specify, the command is forcefully stopped.
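The Installation Directories rules described above can be sketched as two small Shell helpers: one checks a directory against the /home/ restriction, and the other joins several directories with semicolons. Both function names are hypothetical. The sketch also assumes that subdirectories of /home/admin/usertools/tools/ are acceptable and that relative paths are not, neither of which the parameter description states.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for preparing the Installation Directories value.

# Returns 0 if the directory satisfies the /home/ rule, 1 otherwise.
# Assumptions: subdirectories of the tools path are allowed;
# relative paths are rejected.
is_allowed_install_dir() {
  case "$1" in
    /home/admin/usertools/tools|/home/admin/usertools/tools/*) return 0 ;;
    /home|/home/*) return 1 ;;   # any other /home/ location is not allowed
    /*) return 0 ;;              # absolute path outside /home/
    *) return 1 ;;               # relative path
  esac
}

# Joins all arguments with semicolons, the separator the field expects.
join_install_dirs() {
  local IFS=';'
  echo "$*"
}
```

For example, join_install_dirs /opt/mytools /home/admin/usertools/tools/ prints /opt/mytools;/home/admin/usertools/tools/, where /opt/mytools is a placeholder directory.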
After you complete the configuration, click Create.
Run the command.
After the command is created, you can find the command on the O&M Assistant page and click Run command in the Actions column of the command to install the third-party package. After the third-party package is installed, you can use the package to run scheduling tasks on the exclusive resource group for scheduling.
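After a command such as yum install -y git has run, you may want to confirm that the tool is actually available before scheduling tasks depend on it. The following sketch assumes that you run it in a Shell task on the same exclusive resource group for scheduling. The function name check_command is hypothetical, and command -v is a standard POSIX shell built-in.

```shell
#!/usr/bin/env bash
# Hypothetical post-install check; assumes it runs where the package was installed.

# Prints the resolved path if the tool is on PATH; otherwise returns 1.
check_command() {
  local tool="$1"
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool found at $(command -v "$tool")"
  else
    echo "$tool not found" >&2
    return 1
  fi
}
```

For example, check_command git reports where git is installed, or fails with a non-zero status if the installation did not succeed.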
Manage a command
After a command is created, you can find the command on the O&M Assistant page and perform the following operations that are related to the command.
You can click View Detailed Environment Configuration to view the overall environment configuration of the related resource group. For example, you can view the third-party package that is installed, and the version and status of the package.
View information of the command: You can view information such as the running status, execution ID, and content of the command.
View the running result: You can view whether the command succeeded or failed. If the command failed, you can analyze the cause of the failure based on the logs and resolve the issue.
Contact technical support: If you cannot resolve the issue that you encounter, you can follow the instructions that are displayed to join the DataWorks user DingTalk group and contact technical support.
What to do next
After a third-party package is installed, you can reference the package when you use the related exclusive resource group for scheduling to run scheduling tasks. References: