DataWorks: Create a workflow

Last Updated: Nov 28, 2024

This topic describes how to create a workflow, create nodes in the workflow, and configure node dependencies. After you create a workflow, you can use the DataStudio service to perform data analysis and computing in the workspace.

Prerequisites

The bank_data table, which stores business data, and the result_table table, which stores results, have been created in a workspace, and data has been imported into the bank_data table. For more information, see Create tables and upload data.

Background information

The DataStudio service in DataWorks allows you to configure node dependencies by dragging lines between nodes in a workflow. You can process data and configure node dependencies based on the workflow. You can create multiple workflows in a workspace. For more information, see Create a workflow.

Create a workflow

  1. Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and Governance > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.

  2. On the DataStudio page, move the pointer over the Create icon and select Create Workflow.

  3. In the Create Workflow dialog box, configure Workflow Name and Description.

  4. Click Create.

Create nodes and configure node dependencies

In the workflow, create a zero load node named start and an ODPS SQL node named insert_data, and configure the insert_data node to depend on the start node.

Important
  • A zero load node is a control node that is used to maintain and control its descendant nodes in a workflow. A zero load node does not generate data.

  • If other nodes depend on a zero load node and O&M personnel set the status of the zero load node to Failed, the pending descendant nodes cannot run. During O&M, you can disable a zero load node to prevent descendant nodes from reading incorrect data produced by ancestor nodes.

  • In most cases, the root node of the workspace is used as the ancestor node of a zero load node in a workflow. The root node of a workspace is named in the Workspace name_root format.

  • DataWorks automatically creates an output name for a node. The name is in the Workspace name.Node name format. If a workspace contains two nodes with the same name, their output names conflict, so rename one of the two nodes.
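The two naming formats in the notes above can be illustrated with a short local sketch. This is not a DataWorks API: the helper functions and the workspace name demo_ws are hypothetical and exist only to show how the names are composed and why two nodes with the same name collide.

```python
from collections import Counter


def root_node_name(workspace: str) -> str:
    """Name of the workspace root node, in the `Workspace name_root` format."""
    return f"{workspace}_root"


def output_name(workspace: str, node: str) -> str:
    """Default output name of a node, in the `Workspace name.Node name` format."""
    return f"{workspace}.{node}"


def duplicated_node_names(nodes: list[str]) -> list[str]:
    """Return node names that occur more than once and would share an output name."""
    counts = Counter(nodes)
    return sorted(name for name, n in counts.items() if n > 1)


# "demo_ws" is a hypothetical workspace name used only for illustration.
workspace = "demo_ws"
nodes = ["start", "insert_data"]

print(root_node_name(workspace))                   # demo_ws_root
print([output_name(workspace, n) for n in nodes])  # ['demo_ws.start', 'demo_ws.insert_data']
print(duplicated_node_names(nodes + ["start"]))    # ['start']
```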

When you design a workflow, we recommend that you create a zero load node as the root node of the workflow to control the entire workflow. To design a workflow, perform the following steps:

  1. In the Scheduled Workflow pane on the left, find the workflow that you created under Business Flow and double-click the workflow name. On the configuration tab that appears, choose General > Zero-Load Node.

    You can also drag Zero-Load Node to the canvas on the right side to go to the Create Node dialog box.

    Create and use a zero load node

  2. In the Create Node dialog box, configure the Path parameter and set the Name parameter to start. Then, click Confirm.

  3. Use the same method to create an ODPS SQL node named insert_data.

  4. Drag a line from the start node to the insert_data node to configure the start node as the ancestor node of the insert_data node.

    Configure node dependencies
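The dependency drawn in the steps above can be thought of as a small directed graph. The following is a minimal local sketch, not how DataWorks schedules nodes internally: it uses Python's standard graphlib module to derive a valid run order and shows which descendants are blocked when a node is in the Failed state. The workspace name demo_ws (and therefore the root node name demo_ws_root) is an assumption made for the example.

```python
from graphlib import TopologicalSorter

# Map each node to the set of its ancestor (parent) nodes.
# "demo_ws_root" stands in for the workspace root node; the workspace
# name "demo_ws" is hypothetical.
ancestors = {
    "start": {"demo_ws_root"},
    "insert_data": {"start"},
}


def run_order(deps: dict[str, set[str]]) -> list[str]:
    """Return an order in which every node comes after all of its ancestors."""
    return list(TopologicalSorter(deps).static_order())


def blocked_by(deps: dict[str, set[str]], failed: str) -> set[str]:
    """Return all descendants of `failed`, which cannot run while it is Failed."""
    blocked: set[str] = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for node, parents in deps.items():
            if current in parents and node not in blocked:
                blocked.add(node)
                frontier.append(node)
    return blocked


print(run_order(ancestors))            # ['demo_ws_root', 'start', 'insert_data']
print(blocked_by(ancestors, "start"))  # {'insert_data'}
```

Because every node in the workflow descends from start, setting start to Failed blocks the whole workflow, which is why a zero load node is a convenient control point.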

Configure the ancestor node of the zero load node

In a workflow, a zero load node is used to control the entire workflow and serves as the ancestor node of all nodes in the workflow.

In most cases, a zero load node depends on the root node of the workspace.

  1. Double-click the name of the zero load node to go to the node configuration tab.

  2. Click Properties in the right-side navigation pane.

  3. In the Dependencies section of the Properties tab, click Add Root Node to configure the root node of the workspace as the ancestor node of the zero load node.

  4. Save and commit the node.

    Important

    You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the node.

    1. Click the Save icon in the top toolbar to save the node.

    2. Click the Submit icon in the top toolbar to commit the node.

    3. In the Commit Node dialog box, enter your comments in the Change description field.

    4. Click Determine to confirm the commit.

Edit and run the ODPS SQL node

This section describes how to use SQL code in the ODPS SQL node insert_data to count the single people who have mortgage loans, grouped by education level, and save the query result to the result_table table. Descendant nodes can then use the result for further analysis or presentation.
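To get a feel for the aggregation before running it in DataWorks, the same logic can be tried locally on a few rows. The sketch below uses Python's built-in sqlite3 module rather than MaxCompute, so it is only an approximation: the column types and the sample rows are made up for illustration, only the three columns that the query touches are modeled, and the overwrite is emulated by clearing the target table first because SQLite has no INSERT OVERWRITE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Only the columns used by the query are modeled; the column types are assumptions.
conn.execute("CREATE TABLE bank_data (education TEXT, marital TEXT, housing TEXT)")
conn.execute("CREATE TABLE result_table (education TEXT, num INTEGER)")

# Made-up sample rows, for illustration only.
rows = [
    ("university.degree", "single", "yes"),
    ("university.degree", "single", "yes"),
    ("high.school", "single", "yes"),
    ("high.school", "married", "yes"),  # excluded: not single
    ("basic.9y", "single", "no"),       # excluded: no mortgage loan
]
conn.executemany("INSERT INTO bank_data VALUES (?, ?, ?)", rows)

# Emulate INSERT OVERWRITE: clear the target table, then insert the new result.
conn.execute("DELETE FROM result_table")
conn.execute(
    """
    INSERT INTO result_table
    SELECT education, COUNT(marital) AS num
    FROM bank_data
    WHERE housing = 'yes' AND marital = 'single'
    GROUP BY education
    """
)

result = conn.execute(
    "SELECT education, num FROM result_table ORDER BY education"
).fetchall()
print(result)  # [('high.school', 1), ('university.degree', 2)]
```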

  1. Go to the configuration tab of the ODPS SQL node and enter the following code.

    For more information about the syntax, see Overview of MaxCompute SQL.

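    -- Overwrite result_table with the number of single people who have a mortgage loan, per education level.
    INSERT OVERWRITE TABLE result_table
    SELECT education
        , COUNT(marital) AS num
    FROM bank_data
    WHERE housing = 'yes'      -- Has a mortgage loan.
        AND marital = 'single' -- Is single.
    GROUP BY education;
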
  2. Right-click bank_data in the code and select Delete Input.

    The bank_data table is not generated by an auto-triggered node. It was created and loaded manually. For more information, see Create tables and upload data. If the SELECT statement in the code of a node references a table that is not generated by an auto-triggered node, right-click the table name in the code and select Delete Input. Alternatively, add a rule comment at the top of the code so that the system does not automatically parse a dependency for that table.

    Note

    Scheduling dependencies ensure that a node can obtain the table data generated by an ancestor node that is scheduled to run. If the ancestor node is not scheduled to run, the system cannot monitor whether that node has generated the latest table data. Therefore, if a node uses a SELECT statement to query a table that is not generated by an auto-triggered node, you must manually delete the dependency that the system automatically generates from the SELECT statement.

  3. Click the Save icon in the top toolbar to save the code and prevent code loss.

  4. Click the Run icon in the top toolbar.

    After the node is run, you can view the run log and result in the lower part of the tab.
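The reason for deleting the bank_data input in step 2 can be sketched with a toy model of dependency parsing. This is an illustration only and does not reproduce the DataWorks parser: the regular expressions, the function names, and the idea of comparing parsed inputs against a set of outputs produced by scheduled nodes are all assumptions made for the example.

```python
import re

SQL = """
INSERT OVERWRITE TABLE result_table
SELECT education, COUNT(marital) AS num
FROM bank_data
WHERE housing = 'yes' AND marital = 'single'
GROUP BY education;
"""


def parsed_inputs(sql: str) -> set[str]:
    """Tables read by the statement: names that follow FROM or JOIN."""
    pattern = r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)"
    return set(re.findall(pattern, sql, flags=re.IGNORECASE))


def parsed_outputs(sql: str) -> set[str]:
    """Tables written by the statement: names that follow INSERT OVERWRITE/INTO [TABLE]."""
    pattern = r"\bINSERT\s+(?:OVERWRITE|INTO)\s+(?:TABLE\s+)?([A-Za-z_][\w.]*)"
    return set(re.findall(pattern, sql, flags=re.IGNORECASE))


def unresolved_inputs(sql: str, scheduled_outputs: set[str]) -> set[str]:
    """Inputs that no scheduled node produces; these are candidates for Delete Input."""
    return parsed_inputs(sql) - scheduled_outputs


# bank_data was uploaded manually, so no scheduled node lists it as an output.
scheduled_outputs: set[str] = set()

print(parsed_inputs(SQL))                         # {'bank_data'}
print(parsed_outputs(SQL))                        # {'result_table'}
print(unresolved_inputs(SQL, scheduled_outputs))  # {'bank_data'}
```

In this model, bank_data is read by the node but produced by no scheduled node, so a dependency on it could never be satisfied. That is the situation in which the input is deleted by hand.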

Commit a workflow

  1. After you run and debug the ODPS SQL node insert_data, return to the configuration tab of the workflow.

  2. Click the Submit icon in the top toolbar.

  3. In the Commit dialog box, select the node that you want to commit, enter your comments in the Change description field, select a value for Forcefully Modify, and then select Ignore I/O Inconsistency Alerts.

  4. Click Commit.

    After the workflow is committed, you can view the node status in the node list of the workflow. If the status icon is not displayed to the left of the node name, the node is committed. If the icon is displayed, the node is not committed.

What to do next

You have learned how to create and commit a workflow. In the next tutorial, you create a data synchronization node to synchronize data between different types of data sources. For more information, see Create a synchronization task.