DataWorks: Scheduling dependencies

Last Updated: Nov 28, 2024

This topic provides answers to some frequently asked questions about scheduling dependencies.

What are scheduling dependencies?

Scheduling dependencies define the relationships between nodes. After you configure scheduling dependencies for a node, the node can start to run only after its ancestor node is successfully run.

Note

A successful run of the ancestor node is only one of the prerequisites for a node to run. For information about how to configure scheduling dependencies, see Scheduling dependency configuration guide.

Why are scheduling dependencies required?

Scheduling dependencies ensure that a node can obtain the data that it requires from its ancestor node when the node is scheduled to run. DataWorks uses the status of the ancestor node to determine whether the latest data has been generated in the related table: after the ancestor node is successfully run, the data is considered ready, and the current node can read it. This prevents the current node from reading invalid data, or no data, because it started before the ancestor node generated the data as expected.

How do I configure scheduling dependencies for a node?

Use the output of a node as the input of another node to establish a dependency between the nodes.

Note
  • The system automatically configures an input or output for an SQL node by using one of the following methods:

    • The system automatically finds the node that generates the table specified in the SELECT statement in the code of the SQL node, and configures the output of the node as the input of the SQL node based on the automatic parsing feature.

    • The system automatically configures the table specified in the INSERT or CREATE statement in the code of the SQL node as the output of the SQL node based on the automatic parsing feature.

  • You must manually configure the table generated by a synchronization node of Data Integration as the output of the node in the Project name.Table name format. This way, the system can automatically find the node that generates the table and configure the output as the input of the descendant node of the node based on the automatic parsing feature.

  • The output name of a node must be in the Project name.Table name format and must be unique. This way, the system can find the node that generates the output based on the output name.
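The parsing rules in the preceding note can be approximated with a short sketch. The following Python code is not the DataWorks parser. It is an illustration that assumes a naive regular-expression match on FROM, JOIN, INSERT, and CREATE TABLE clauses. The names parse_io and qualify and the project name demo are hypothetical.

```python
import re

# Naive patterns for illustration only. They ignore subqueries, CTEs,
# comments, and quoted identifiers, which a real SQL parser must handle.
INPUT_RE = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)
OUTPUT_RE = re.compile(
    r"\b(?:INSERT\s+(?:INTO|OVERWRITE)\s+TABLE"
    r"|INSERT\s+INTO"
    r"|CREATE\s+TABLE(?:\s+IF\s+NOT\s+EXISTS)?)"
    r"\s+([A-Za-z_][\w.]*)",
    re.IGNORECASE,
)


def qualify(table, project):
    """Return the name in 'Project name.Table name' form (assumed convention)."""
    return table if "." in table else f"{project}.{table}"


def parse_io(sql, project):
    """Guess the input and output names of an SQL node from its code."""
    outputs = {qualify(t, project) for t in OUTPUT_RE.findall(sql)}
    inputs = {qualify(t, project) for t in INPUT_RE.findall(sql)}
    # A table that the node writes itself is not treated as an input.
    return inputs - outputs, outputs
```

For a statement such as INSERT OVERWRITE TABLE tb_2 SELECT ... FROM tb_1, the sketch reports demo.tb_1 as an input and demo.tb_2 as an output, which mirrors how an output name in the Project name.Table name format lets the system find the node that generates a table.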

Which scenarios do not support scheduling dependencies?

Scheduling dependencies ensure that an auto triggered node can successfully obtain the latest data from its ancestor node at regular intervals. DataWorks can monitor only the data that is generated by auto triggered nodes.

If a node uses a SELECT statement to query data of a table that is not generated by an auto triggered node, you must manually delete the scheduling dependency that is automatically generated based on the SELECT statement for the node. Tables that are not generated by auto triggered nodes include the following types:

  • Tables uploaded from on-premises machines to DataWorks

  • Dimension tables

  • Tables that are not generated by nodes scheduled by DataWorks

  • Tables generated by manually triggered nodes

How do I delete a table on which a node does not depend?

Go to the Scheduled Workflow pane of the DataStudio page, find the node for which you want to delete a scheduling dependency, and then go to the configuration tab of the node. In the code of the node, find the name of the table that you want to delete, right-click the table name, and then select Delete Input. Then, configure settings related to the automatic parsing feature in the Dependencies section of the Properties tab of the node to enable the system to automatically configure scheduling dependencies for the node.

When I commit a node, the system reports an error that the output name of the ancestor node of the node does not exist. What do I do?

For information about the causes of and solutions to this issue, see When I commit Node A, the system reports an error that the output name of the dependent ancestor node of Node A does not exist. What do I do?.

When I commit a node, the system reports an error that the input and output of the node are inconsistent with the data lineage in the code developed for the node. What do I do?

For information about the causes of and solutions to this issue, see When I commit a node, the system reports an error that the input and output of the node are not consistent with the data lineage in the code developed for the node. What do I do?.

The system automatically adds an output name to Parent Nodes for my node based on the automatic parsing feature, but an error message indicating that the output represented by the output name does not exist appears. What do I do?

Figure: Commit failed

The system fails to find the node that generates the output based on the output name.

This error may be caused by the following reasons:

  • The node that generates the output is not committed. You can commit the node and try again.

  • The node that generates the output is committed, but the output name of the node is different from the output name that is automatically added by the system.

Note
  • If tb_2 in the preceding figure is the output table of a node, you must add tb_2 to Output Name of Current Node for the node in the Project name.Table name format. For more information, see Scheduling dependency configuration guide.

  • If tb_2 is a table that is not generated by an auto triggered node, you must right-click the table name in the code of the node that references the table and select Delete Input to delete the table. In the Dependencies section of the Properties tab, configure settings related to the automatic parsing feature to enable the system to automatically configure scheduling dependencies for the node.

For information about the tables that are not generated by auto triggered nodes, see Which scenarios do not support scheduling dependencies?.

The name and ID of the descendant node of my node are empty and cannot be specified in the output of my node. Why does this happen?

After you configure the output of a node as the input of another node, scheduling dependencies are established between the two nodes. If a node has no descendant node, the name and ID of the descendant node are empty. After you configure a descendant node for your node, the name and ID of the descendant node are automatically displayed.

How do I delete the tables on which my node does not depend?

On the configuration tab of the node, find the name of the table that you want to delete in the code of the node, right-click the table name, and then select Delete Input. In the Dependencies section of the Properties tab of the node, configure settings related to the automatic parsing feature to enable the system to automatically configure scheduling dependencies for the node.

What rules are used when a node needs to depend on its ancestor nodes to run?

In the scheduling system of DataWorks, scheduling dependencies are configured to ensure that a node can successfully obtain the required data generated by another node. You can determine whether to configure scheduling dependencies between nodes based on the data lineages of the tables generated by the nodes. For more information, see Scheduling dependency configuration guide.

What is the output name of a node used for?

The output name of a node is used to establish a dependency with another node. For example, if the output name of Node A is ABC and Node B uses ABC as its input name, a dependency is established between Node A and Node B.

Can a node have multiple output names?

Yes, a node can have multiple output names. Each output name identifies the node that owns it. If a node (Node A) needs to depend on another node (Node B), Node A can reference any output name of Node B as its input name. This way, a dependency is established between Node A and Node B.

Can multiple nodes have the same output name?

No, multiple nodes cannot have the same output name. The output name of each node must be unique. This way, if a node references the output of another node, the system can find the node that generates the output based on the unique output name and the automatic parsing feature, and a dependency can be established between the two nodes. If multiple nodes generate data to the same table, you must determine the last node that generates data to the table, and change the output names of the remaining nodes to ensure that the output names of all nodes are unique. This ensures that another node can successfully obtain the required data from the table.

If two auto triggered nodes in the same workspace generate data to the same table, the system reports the following error message for one of the nodes in automatic parsing scenarios: The ${nodename1} node and the ${nodename2} node in the ${projectname} workspace use the same output name ${node_outputname}. Multiple nodes cannot have the same output name.
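The uniqueness rule can be illustrated with a small check. The following Python sketch is not DataWorks code. It assumes a hypothetical mapping from node names to their output names and reports every output name that more than one node claims.

```python
from collections import defaultdict


def find_duplicate_outputs(node_outputs):
    """Map each output name claimed by several nodes to those nodes.

    node_outputs is a hypothetical dict: node name -> list of output names.
    """
    owners = defaultdict(list)
    for node, outputs in node_outputs.items():
        for name in outputs:
            owners[name].append(node)
    # Keep only the output names that have more than one owner.
    return {name: nodes for name, nodes in owners.items() if len(nodes) > 1}
```

An empty result means that every output name resolves to exactly one node. Any entry in the result corresponds to the conflict that the preceding error message describes.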

How do I prevent DataWorks from parsing temporary tables when DataWorks parses the scheduling dependencies of a node?

On the configuration tab of the node, right-click a temporary table name in the SQL code for the node and select Delete Input or Delete Output. In the Dependencies section of the Properties tab, click Parse Input and Output from Code to parse the input and output for the node.

How do I configure an ancestor node for the start node of a workflow?

If you want to configure an ancestor node for the start node of a workflow, you can create a zero load node in the workflow and use the zero load node as the start node of the workflow. Then, you can configure the root node of the workspace as the ancestor node of the zero load node. For information about how to use zero load nodes, see Create and use a zero load node.

Why do I find a non-existent output name of Node B when I enter an output name to search for the ancestor nodes of Node A?

DataWorks searches for the ancestor nodes of a node among the output names of nodes that are committed and deployed to the scheduling system based on the automatic parsing feature. After Node B is committed, if you delete the output name of Node B and do not commit Node B to the scheduling system again, the deleted output name of Node B can still be found.

When I undeploy a node, the system displays an error message indicating that the node has descendant nodes and cannot be undeployed. However, no descendant nodes can be found for the node on the Properties tab. Why does this happen?

You can undeploy a node only when no nodes depend on it in either the development environment or the production environment. The Properties tab may not show all of these nodes. Go to Operation Center in the development environment and in the production environment to check whether any nodes still depend on the node.

Why do some scheduling dependencies of nodes appear as dashed lines in Operation Center?

If the scheduling dependencies of a node appear as dashed lines, cross-cycle scheduling dependencies are configured for the node. For information about cross-cycle scheduling dependencies, see Scenario 2: Configure scheduling dependencies for a node that depends on last-cycle instances.

I configure the instance generated for a node scheduled by hour in the current cycle to depend on the instance generated for the node in the previous cycle. What are the impacts on this node and its descendant node?

  • Impact on the current node: The instance generated for the node in the current cycle can start to run only after the instance generated for the node in the previous cycle is successfully run.

    Scenario: If a node that is scheduled by hour starts to run at 00:00 and needs to run every hour, the instance generated for the node in the second cycle can start to run only after the instance generated for the node in the first cycle is successfully run.

  • Impact on the descendant node of the current node: If the current node has a descendant node that is scheduled by day, the instance generated for the descendant node no longer directly depends on multiple instances generated for the current node but instead directly depends only on a specific instance generated for the current node. In this case, the instance generated for the descendant node indirectly depends on multiple instances generated for the current node.

How do I configure a scheduling dependency in which a node scheduled by day depends on a node scheduled by hour?

  • Scenario 1: Configure the instance generated for a node scheduled by day to depend on all instances generated on the current day for a node scheduled by hour.

    Configure the node scheduled by day to directly depend on the node scheduled by hour. This way, the instance generated for the node scheduled by day depends on all instances generated on the current day for the node scheduled by hour.

  • Scenario 2: Configure the instance generated for a node scheduled by day to depend on a specific instance generated on the current day for a node scheduled by hour.

    • For the node scheduled by hour, configure the instance generated for the node in the current cycle to depend on the instance generated for the node in the previous cycle. This indicates that you must set the Cross-Cycle Dependency (Original Previous-Cycle Dependency) parameter to Instances of Current Node for the node scheduled by hour in the Dependencies section of the Properties tab in the DataWorks console.

    • For the node scheduled by day, configure the node to depend on the node scheduled by hour. This indicates that you must add the output name of the node scheduled by hour to Parent Nodes for the node scheduled by day in the Dependencies section of the Properties tab in the DataWorks console.

  • Scenario 3: Configure the instance generated for a node scheduled by day to depend on all instances generated on the previous day for a node scheduled by hour.

    • In the Dependencies section of the Properties tab of the node scheduled by day, set the Cross-Cycle Dependency (Original Previous-Cycle Dependency) parameter to Other Nodes and enter the ID of the node scheduled by hour in the field that appears.

    • In the Dependencies section of the Properties tab of the node scheduled by day, remove the output name of the node scheduled by hour from Parent Nodes for the node scheduled by day.

Note

If you configured a node scheduled by day to depend on a node scheduled by hour in the Dependencies section of the Properties tab, you must remove the output name of the node scheduled by hour from Parent Nodes for the node scheduled by day. Otherwise, the instance generated for the node scheduled by day depends on all instances that are generated on the previous day and the current day for the node scheduled by hour.
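The three scenarios differ only in which hourly instances the daily instance waits for. The following Python sketch illustrates that difference. It is not how DataWorks computes dependencies. It assumes that the node scheduled by hour starts at 00:00 and runs every hour, and the function names and the daily_hour parameter are hypothetical.

```python
from datetime import date, datetime, time, timedelta


def hourly_instances(day):
    """Scheduling times of the 24 hourly instances generated on a day."""
    return [datetime.combine(day, time(hour=h)) for h in range(24)]


def upstream_for_daily(scenario, day, daily_hour=0):
    """Hourly instances that the daily instance of `day` depends on."""
    if scenario == 1:
        # Direct dependency: all instances generated on the current day.
        return hourly_instances(day)
    if scenario == 2:
        # Hourly node depends on its previous cycle: only the instance whose
        # scheduling time matches that of the daily node.
        return [datetime.combine(day, time(hour=daily_hour))]
    if scenario == 3:
        # Cross-cycle dependency on the hourly node: all instances
        # generated on the previous day.
        return hourly_instances(day - timedelta(days=1))
    raise ValueError(f"unknown scenario: {scenario}")
```

In scenario 2 the daily instance still depends indirectly on the earlier hourly instances of the day, because each hourly instance waits for the one before it.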

When does a node scheduled by day start to run if I configure a node scheduled by hour as the ancestor node of the node scheduled by day?

Principle: If a node scheduled by hour is configured as the ancestor node of a node scheduled by day, the instance generated for the node scheduled by day depends on all instances generated on the current day for the node scheduled by hour. This indicates that the instance generated for the node scheduled by day can start to run only after the last instance generated on the current day for the node scheduled by hour is successfully run.

Scenarios:

  • The node scheduled by hour starts to run at 00:00 and runs every hour. In this case, the instance generated for the node scheduled by day can start to run only after all 24 instances generated for the node scheduled by hour are successfully run.

  • View the scheduling dependencies of the node scheduled by day in Operation Center: Find the node scheduled by day on the Cycle Task page of Operation Center, open the directed acyclic graph (DAG) of the node, right-click the node name in the DAG, and then select Show Ancestor Nodes to view all 24 instances generated on the current day for the node scheduled by hour. The dependencies of the node scheduled by day in the DAG appear as solid lines.
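The timing rule described above can be expressed as a one-line calculation. The following Python sketch assumes that an instance starts as soon as both its own scheduling time has arrived and every ancestor instance has finished. It ignores resource queuing and other prerequisites, and the function name earliest_start is hypothetical.

```python
from datetime import datetime


def earliest_start(scheduled_at, upstream_finish_times):
    """Earliest time at which the daily instance can start to run."""
    # The instance waits for its own scheduling time and for the last
    # ancestor instance to finish, whichever comes later.
    return max([scheduled_at, *upstream_finish_times])
```

For example, if the daily node is scheduled at 00:00 and each of the 24 hourly instances finishes five minutes after the hour, the daily instance cannot start before 23:05.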

How do I configure the instance generated for a node scheduled by day to depend on a specific instance that is generated on the current day for a node scheduled by hour?

Principle: To configure the instance generated for a node scheduled by day to depend on a specific instance generated on the current day for a node scheduled by hour, you must complete two configurations. First, configure the instance generated for the node scheduled by hour in the current cycle to depend on the instance generated for the same node in the previous cycle. Second, set the scheduling time of the node scheduled by day to the scheduling time of the specific instance generated for the node scheduled by hour.

Scenario: Configure the instance generated for a node scheduled by day to depend on an instance that is generated on the current day for a node scheduled by hour and starts to run at 12:00.

  • Dependency configuration:

    • For the node scheduled by hour: Go to the Properties tab of the node, and set the Cross-Cycle Dependency (Original Previous-Cycle Dependency) parameter to Instances of Current Node in the Dependencies section.

    • For the node scheduled by day: Set the scheduling time of the node to 12:00.

  • View dependencies in Operation Center:

    • Find the node scheduled by day on the Cycle Task page of Operation Center, open the DAG of the node, right-click the node name in the DAG, and then select Show Ancestor Nodes to view the instance that is generated on the current day for the node scheduled by hour and starts to run at 12:00. The dependencies of the node scheduled by day in the DAG appear as solid lines.

    • Find the node scheduled by hour on the Cycle Task page of Operation Center, open the DAG of the node, right-click the node name in the DAG, and then select Show Ancestor Nodes to view the instance that starts to run at 11:00. The instance that starts to run at 12:00 depends on the instance that starts to run at 11:00. The scheduling dependency of the node scheduled by hour appears as a dashed line. This is because the following configuration is performed for the node scheduled by hour: The instance generated for the node in the current cycle depends on the instance generated for the node in the previous cycle.

How do I configure the instance generated for a node scheduled by day to depend on all the instances that are generated on the previous day instead of the current day for a node scheduled by hour?

Principle: If you want to configure the instance generated for a node scheduled by day to depend on all the instances that are generated on the previous day for a node scheduled by hour, you must configure a cross-cycle dependency on the node scheduled by hour for the node scheduled by day.

Scenario: Configure the instance generated for a node scheduled by day to depend on all the instances that are generated on the previous day for the node scheduled by hour.

  • Dependency configuration:

    • For the node scheduled by day: Go to the Properties tab of the node, set the Cross-Cycle Dependency (Original Previous-Cycle Dependency) parameter to Other Nodes in the Dependencies section, and then enter the ID of the node scheduled by hour in the field that appears.

    • For the node scheduled by hour: You do not need to configure scheduling dependencies.

  • View dependencies in Operation Center:

    Find the node scheduled by day on the Cycle Task page of Operation Center, open the DAG of the node, right-click the node name in the DAG, and then select Show Ancestor Nodes to view all instances generated on the previous day for the node scheduled by hour. The scheduling dependencies of the node scheduled by day appear as dashed lines because this node is configured with a cross-cycle dependency on the node scheduled by hour.

In which scenarios do I need to configure the instance generated for a node in the current cycle to depend on the instance generated for the node in the previous cycle?

Scenario: If a node needs to use data that is generated by the same node in the previous cycle, you can configure the instance generated for the node in the current cycle to depend on the instance generated for the same node in the previous cycle. In this case, the instance generated for the node in the current cycle can start to run only after the instance generated for the same node in the previous cycle is successfully run. This ensures that the instance in the current cycle can successfully obtain data from the instance in the previous cycle.

  • The instance generated for a node in the current cycle needs to use the data of the instance generated for the same node in the previous cycle. In this case, you must set the Cross-Cycle Dependency (Original Previous-Cycle Dependency) parameter to Instances of Current Node for the node in the Dependencies section of the Properties tab.

  • A node scheduled by hour depends on a node scheduled by day. After the instance generated on a day for the node scheduled by day is successfully run, the scheduling time of all instances generated on the same day for the node scheduled by hour arrives. As a result, all instances of the node scheduled by hour are run in parallel. To resolve this issue, set the Cross-Cycle Dependency (Original Previous-Cycle Dependency) parameter to Instances of Current Node in the Dependencies section of the Properties tab for the node scheduled by hour.
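The second case can be made concrete with a small simulation. The following Python sketch is an illustration under simplified assumptions: every hourly instance takes the same time to run, resources are unlimited, and the dependency of the first instance on the previous day is ignored. The function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta


def hourly_start_times(scheduled, upstream_done, duration, self_dependency):
    """Start time of each hourly instance after the daily ancestor finishes."""
    starts = []
    prev_finish = None
    for sched in scheduled:
        # An instance needs its scheduling time to arrive and the
        # daily ancestor instance to have finished.
        start = max(sched, upstream_done)
        if self_dependency and prev_finish is not None:
            # With the dependency on the previous cycle, it also waits
            # for the previous instance of the same node.
            start = max(start, prev_finish)
        starts.append(start)
        prev_finish = start + duration
    return starts
```

If the daily node finishes at 02:30, the 00:00, 01:00, and 02:00 instances all start at 02:30 without the dependency on the previous cycle. With it, they start one after another, for example at 02:30, 02:40, and 02:50 for a 10-minute run.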

How do I configure dependencies for a node that needs to depend on multiple nodes?

If a node needs to depend on multiple nodes, you must determine whether to configure scheduling dependencies between the node and these nodes. If the node strongly depends on the table data generated by these nodes, we recommend that you configure scheduling dependencies between the node and these nodes. For more information about how to determine whether to configure scheduling dependencies between nodes, see Why are scheduling dependencies required?.

For example, Node A is scheduled by hour and generates Table A, and Node B is scheduled by day and generates Table B. Node C depends on Node A and Node B and needs to use data in Table A and Table B.

If you add the output name of Node A to Parent Nodes for Node C, but do not add the output name of Node B to Parent Nodes for Node C, Node C may start to run even if Node B is still running. As a result, Node C fails to obtain data in Table B, and an error occurs on Node C. To resolve this issue, you must add the output names of both Node A and Node B to Parent Nodes for Node C.

If a node does not strongly depend on the table data generated by another node and the node can obtain the data even if the latest data is not generated by another node, you do not need to configure a scheduling dependency between the two nodes.
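The Node C example can be sketched as a readiness check. The following Python code is an illustration only. The output names proj.table_a and proj.table_b and the function name blocking_parents are hypothetical.

```python
def blocking_parents(parent_outputs, succeeded_outputs):
    """Return the parent output names whose nodes have not succeeded yet."""
    return [name for name in parent_outputs if name not in succeeded_outputs]


# Node A has finished, but Node B is still running.
succeeded = {"proj.table_a"}

# Only Node A is listed in Parent Nodes: nothing blocks Node C,
# so it may start before Table B is ready.
only_a = blocking_parents(["proj.table_a"], succeeded)

# Both nodes are listed in Parent Nodes: Node C waits for Node B.
both = blocking_parents(["proj.table_a", "proj.table_b"], succeeded)
```

The scheduler can only wait for ancestors that are declared. A table that the node reads but that is missing from Parent Nodes is not part of the check.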

Node B scheduled by day depends on Node A scheduled by hour, and Node B starts to run only after all the instances that are generated on the current day for Node A are successfully run. Will the running of Node B be affected if Node A still runs on the next day?

The instance generated for Node B depends on all the instances that are generated on the current day for Node A. Node B automatically runs every day after all the instances of Node A are successfully run. If the last instance of Node A is successfully run on the next day, Node B still runs, but later than its scheduled time. Scheduling parameters are still replaced as expected.

Node A runs every hour on the hour, and Node B runs once every day. How do I configure Node B to automatically run after the first instance of Node A is successfully run every day?

When you configure time properties for Node A in the Dependencies section of the Properties tab, you must set the Cross-Cycle Dependency (Original Previous-Cycle Dependency) parameter to Instances of Current Node. In addition, you must set the scheduling time to 00:00 for Node B. This way, the instance generated for Node B depends only on the instance that is generated at 00:00 every day for Node A. This instance is the first instance of Node A on that day.

How do I configure Node A, Node B, and Node C to run in sequence once per hour?

  1. Dependencies: Configure the output of Node A as the input of Node B and the output of Node B as the input of Node C.

  2. Scheduling cycle: Configure Node A, Node B, and Node C to be scheduled by hour.
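The run order that results from step 1 can be derived from the output and input names. The following Python sketch uses the standard graphlib module (Python 3.9 or later). It is an illustration, not DataWorks code, and the output names and the function name run_order are hypothetical.

```python
from graphlib import TopologicalSorter


def run_order(outputs, inputs):
    """Order nodes so that each node runs after the nodes it depends on.

    outputs: node name -> its output name (assumed unique per node).
    inputs:  node name -> list of output names that the node uses as inputs.
    """
    # Resolve each output name to the node that produces it.
    producer = {out: node for node, out in outputs.items()}
    # An input name that no node produces raises KeyError here.
    graph = {
        node: {producer[name] for name in inputs.get(node, [])}
        for node in outputs
    }
    return list(TopologicalSorter(graph).static_order())


order = run_order(
    outputs={"A": "proj.a_out", "B": "proj.b_out", "C": "proj.c_out"},
    inputs={"B": ["proj.a_out"], "C": ["proj.b_out"]},
)
```

Because all three nodes are scheduled by hour, this order repeats once in every hourly cycle.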

How do I configure scheduling dependencies across workflows or configure scheduling dependencies across workspaces that reside in the same region?

Principle: A dependency is established when the output of one node is used as the input of another node. To establish the dependency, add the output name of the ancestor node to Parent Nodes for the descendant node. The two nodes can belong to different workflows or to different workspaces.

I have configured rerun properties for my node, but the node does not rerun after it fails. In addition, the error message "Task Run Timed Out, Killed by System!!!" appears. What do I do?

  • Problem description:

    The Rerun parameter in the Schedule section of the Properties tab is set to Allow Regardless of Running Status or Allow upon Failure Only for the node. However, the node does not rerun after it fails, and the error message Task Run Timed Out, Killed by System!!! appears when the node is run.

  • Possible cause:

    The Timeout definition parameter in the Schedule section of the Properties tab is configured for the node. If the running duration of the node exceeds the value of the Timeout definition parameter, the node automatically stops. A node that fails due to a timeout is not automatically rerun, regardless of the Rerun setting.

  • Solution:

    Manually rerun the node.