DataWorks allows you to migrate tasks from open source scheduling engines, such as Oozie, Azkaban, Airflow, and DolphinScheduler, to DataWorks. This topic describes how to export tasks from open source scheduling engines.
Background information
Before you import a task of an open source scheduling engine to DataWorks, you must export the task to your on-premises machine or Object Storage Service (OSS). For more information about the import procedure, see Import tasks of open source engines.
Limits
For Airflow tasks, you can export only tasks of Airflow 1.10.x. In addition, the export of Airflow tasks depends on Python 3.6 or later.
Export a task from Oozie
Export requirements
The package must contain XML-formatted definition files and configuration files of a task. The package is exported in the ZIP format.
Structure of the package
Oozie task descriptions are stored in a Hadoop Distributed File System (HDFS) directory. For example, on the Apache Oozie official website, each subdirectory under the apps directory in the Examples package is a task of Oozie. Each subdirectory contains XML-formatted definition files and configuration files of a task. The following figure shows the structure of the exported package.
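As an illustration, a package based on the Apache Oozie examples might be organized as follows (directory and file names are taken from the Oozie examples and will differ for your own tasks):
apps/
  hive/
    job.properties
    workflow.xml
    script.q
  map-reduce/
    job.properties
    workflow.xml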
Export a job from Azkaban
Download a flow
You can download a specific flow in the Azkaban console.
Log on to the Azkaban console and go to the Projects page.
Select a project whose package you want to download. On the project page, click Flows to show all flows of the project.
Click Download in the upper-right corner of the page to download the package of the project.
Native Azkaban packages can be exported. No format limit is imposed on the packages of Azkaban. The exported package in the ZIP format contains information about all jobs and relationships of a specific project of Azkaban. You can directly upload the ZIP package exported from the Azkaban console to the Scheduling Engine Import page in DataWorks.
Conversion logic
The following table describes the mappings between Azkaban items and DataWorks items and the conversion logic.
Azkaban item | DataWorks item | Conversion logic |
Flow | Workflow in DataStudio | Jobs in a flow are placed in the workflow that corresponds to the flow and used as nodes in the workflow. Nested flows in a flow are converted into separate workflows in DataWorks. After the conversion, the dependencies between the nodes in the workflow are automatically established. |
Command-type job | Shell node | In DataWorks on EMR mode, a command-type job is converted into an E-MapReduce (EMR) Shell node. You can specify the mapped node type by configuring the related parameter in the Advanced Settings dialog box. If the command line of a command-type job calls other scripts, the script files identified during analysis can be registered as DataWorks resource files, and these resource files are referenced in the converted Shell code. |
Hive-type job | ODPS SQL node | In DataWorks on MaxCompute mode, a Hive-type job is converted into an ODPS SQL node. You can specify the mapped node type by configuring the related parameter in the Advanced Settings dialog box. |
Other types of nodes that are not supported by DataWorks | Zero load node or Shell node | You can specify the mapped node type by configuring the related parameter in the Advanced Settings dialog box. |
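For reference, a command-type job in a classic Azkaban project package is defined in a .job properties file similar to the following (the job name, script path, and dependency name are illustrative):
# process_data.job
type=command
command=sh ./scripts/process_data.sh
dependencies=load_data
Dependencies declared with the dependencies property are what the conversion uses to establish node dependencies in the converted DataWorks workflow.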
Export a task from Airflow
Procedure
Go to the runtime environment of Airflow.
Use the Python library of Airflow to load the directed acyclic graph (DAG) folder that is scheduled on Airflow. The DAG Python file is stored in the DAG folder.
Use the export tool to read the task information and dependencies stored in the DAG Python file based on the Python library of Airflow in memory. Then, write the generated DAG information to a JSON file and export the file.
You can download the export tool on the Scheduling Engine Export page of Cloud tasks in DataWorks Migration Assistant. For information about how to go to the Scheduling Engine Export page, see Export tasks from other open source engines.
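For reference, the following minimal Python sketch (not the export tool itself; the folder path and output file name are illustrative) shows how the Airflow 1.10 library can be used to load a DAG folder and dump task IDs and their downstream dependencies to JSON:
import json
from airflow.models import DagBag

# Load every DAG definition file in the scheduled DAG folder (illustrative path).
dagbag = DagBag(dag_folder="/path/to/airflow/dag/folder/")

dag_info = {}
for dag_id, dag in dagbag.dags.items():
    # For each task, record the IDs of the tasks that depend on it.
    dag_info[dag_id] = {task.task_id: sorted(task.downstream_task_ids) for task in dag.tasks}

# Write the collected DAG information to a JSON file.
with open("output.json", "w") as f:
    json.dump(dag_info, f, indent=2)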
Usage notes for the export tool
Perform the following steps to use the export tool:
Run the following command to decompress the airflow-exporter.tgz package:
tar zxvf airflow-exporter.tgz
Run the following command to set the PYTHONPATH parameter to the directory of the Python library:
export PYTHONPATH=/usr/local/lib/python3.6/site-packages
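If you are not sure which directory to use for PYTHONPATH, the following command prints the site-packages directory that contains the installed airflow package (it assumes that airflow can already be imported by python3.6):
python3.6 -c "import os, airflow; print(os.path.dirname(os.path.dirname(airflow.__file__)))"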
Run the following command to export the task from Airflow:
cd airflow-exporter
python3.6 ./parser -d /path/to/airflow/dag/folder/ -o output.json
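Optionally, you can check that the exported file is well-formed JSON before you compress it:
python3.6 -m json.tool output.json > /dev/null && echo "output.json is valid JSON"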
Run the following command to compress the exported output.json file into a ZIP package:
zip out.zip output.json
After the ZIP package is generated, you can perform the following operations to create an import task and import the ZIP package to DataWorks: Go to the Migration Assistant page in the DataWorks console. In the left-side navigation pane, choose Scheduling Engine Import. For more information, see Import tasks of open source engines.
Export a node from DolphinScheduler
How it works
The DataWorks export tool obtains the JSON configurations of a process in DolphinScheduler by calling the API operation that is used to export multiple processes from DolphinScheduler at a time. The export tool generates a ZIP file based on the JSON configurations. In the left-side navigation pane of the Migration Assistant page, you can choose Scheduling Engine Import and create an import task of the DolphinScheduler type to import the ZIP file. DataWorks Migration Assistant parses and converts code and dependencies of nodes in a process in the ZIP file into valid file configurations for related DataWorks nodes. For information about how to import a task of an open source scheduling engine, see Import tasks of open source engines.
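As an illustration of this mechanism, the following Python sketch calls a DolphinScheduler batch export operation with an API token and saves the returned JSON. The endpoint path and parameter names follow the DolphinScheduler 1.3.x API and are assumptions that you should verify against the API documentation for your version; in practice, the export tool described later in this topic performs this call for you.
import requests

# Illustrative values; replace with your DolphinScheduler host, token, project name, and process definition IDs.
host = "http://dolphinschedulerhost:port"
token = "your_api_token"
process_definition_ids = "1,2,3"

# Assumed 1.3.x-style endpoint for batch-exporting process definitions; verify against your version.
resp = requests.get(
    host + "/dolphinscheduler/projects/project_name/process/export",
    headers={"token": token},
    params={"processDefinitionIds": process_definition_ids},
)
resp.raise_for_status()

# Save the exported process definitions for inspection or packaging.
with open("process_definitions.json", "wb") as f:
    f.write(resp.content)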
Limits
Limits on version: You can export nodes only from DolphinScheduler 1.3.x, 2.x, and 3.x.
Limits on node type conversion:
SQL nodes: Only some types of compute engines support the conversion of SQL nodes. During the node type conversion, the syntax of SQL code is not converted and the SQL code is not modified.
Cron expressions: In specific scenarios, cron expressions may be pruned or cron expressions may not be supported. You must check whether the scheduling time that is configured meets your business requirements. For information about scheduling time, see Configure time properties.
Python nodes: DataWorks does not provide Python nodes. DataWorks can convert a Python node in DolphinScheduler into a Python file resource and a Shell node that references the Python file resource. However, issues may occur when scheduling parameters of the Python node are passed. Therefore, debugging and checks are required. For information about scheduling parameters, see Configure and use scheduling parameters.
Depend nodes: DataWorks cannot retain cross-cycle scheduling dependencies that are configured for Depend nodes. If cross-cycle scheduling dependencies are configured for a Depend node, DataWorks converts them into same-cycle scheduling dependencies of the mapped auto triggered node in DataWorks. For information about how to configure same-cycle scheduling dependencies, see Configure same-cycle scheduling dependencies.
Conversion logic
The following table describes the mappings between DolphinScheduler items and DataWorks items and the conversion logic.
DolphinScheduler item | DataWorks item | Conversion logic |
Process | Workflow in DataStudio | Nodes in a DolphinScheduler process are converted into nodes in a DataWorks workflow. In addition, zero load nodes used as the start and end nodes in the DataWorks workflow are automatically added. |
SubProcess node | | |
Conditions node | Merge node | A dependency item and a conditional relation of a Conditions node are converted into a DataWorks merge node and related logic. The outermost logical dependency configured for the Conditions node uses the mapped merge node and another two merge nodes to determine whether a success or failure path is used. Note: DataWorks cannot retain cross-cycle scheduling dependencies that are configured for a Conditions node. If cross-cycle scheduling dependencies are configured for a Conditions node, DataWorks converts them into same-cycle scheduling dependencies of the mapped auto triggered node in DataWorks. |
Depend node | Zero load node | The dependencies of a Depend node are converted into the input of the mapped zero load node based on the scheduling configurations. Concatenation rule for the input of the mapped zero load node: |
SQL node | Types of the compute engines that support conversion of SQL nodes in DolphinScheduler: | You can specify the type of mapped compute engine node by configuring the related parameter in the Advanced Settings dialog box of an import task. Note: The syntax of SQL code is not converted and the SQL code is not modified. |
Python node | A Python file resource and a Shell node that references the Python file resource | Issues may occur when scheduling parameters of the Python node are passed. Therefore, debugging and checks are required. For information about scheduling parameters, see Configure and use scheduling parameters. |
MR node | | You can specify the type of mapped compute engine node by configuring the related parameter in the Advanced Settings dialog box of an import task. |
Spark node | | You can specify the type of mapped compute engine node by configuring the related parameter in the Advanced Settings dialog box of an import task. |
Sqoop node | Batch synchronization node that is configured in script mode | The data sources used for the batch synchronization node vary based on your business requirements. For information about how to configure a batch synchronization node by using the code editor, see Configure a batch synchronization task by using the code editor. |
Other types of nodes that are not supported by DataWorks | Zero load node | N/A. |
Environment preparations
Dependency settings: Install Java Development Kit (JDK) 1.8 or later, and Python 2.7 or later.
Export procedure
Decompress the export tool package.
Run the following commands to decompress the export tool package:
$ tar xzvf migrationx-reader.zip
$ cd migrationx-reader/
Generate a token used to call DolphinScheduler APIs.
For information about how to generate a token, see DolphinScheduler documentation.
Export a ZIP file.
Run the following command to export the required ZIP file:
$ python ./bin/reader.py -a dolphinscheduler -e http://dolphinschedulerhost:port -t token -v 1.3.9 -p project_name -f ds_dump.zip
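Optionally, list the contents of the generated package to confirm that the process definitions were exported:
$ unzip -l ds_dump.zip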
After you export the ZIP file, choose Scheduling Engine Import in the left-side navigation pane of the Migration Assistant page, and create an import task of the DolphinScheduler type to import the ZIP file. DataWorks Migration Assistant parses and converts code and dependencies of nodes in a process in the ZIP file into valid file configurations for related DataWorks nodes. For information about how to import a task of an open source scheduling engine, see Import tasks of open source engines.
Export tasks from other open source engines
DataWorks provides a standard template for you to export tasks of open source engines other than Oozie, Azkaban, Airflow, and DolphinScheduler. Before you export a task, you must download the standard template and modify the content based on the file structure in the template. You can go to the Scheduling Engine Export page to download the standard template and view the file structure.
Log on to the DataWorks console. In the top navigation bar, select the desired region. Then, choose in the left-side navigation pane. On the page that appears, select the desired workspace from the drop-down list and click Go to DataStudio.
On the DataStudio page, click the icon in the upper-left corner and choose .
In the left-side navigation pane, choose .
Click the Standard Template tab.
On the Standard Template tab, click Standard Format Template to download the template.
Modify the content of the template to generate a package that you want to export.