DataWorks: Configure a Check node

Last Updated: Nov 13, 2024

DataWorks allows you to use a Check node to check the availability of MaxCompute partitioned tables, File Transfer Protocol (FTP) files, Object Storage Service (OSS) objects, Hadoop Distributed File System (HDFS) files, OSS-HDFS objects, and real-time synchronization tasks based on check policies. If the condition that is specified in the check policy is met, the task on the Check node is successfully run. If the running of a task depends on an object, you can configure the task as a descendant task of a Check node that checks the availability of that object. This way, the descendant task is triggered to run only after the Check node passes its check. This topic describes the supported check objects and check policies and how to configure a Check node.

Supported check objects and check policies

Check nodes can perform checks only on data sources and real-time synchronization tasks. The following information describes the check policies that DataWorks supports:

  • Data sources

    • MaxCompute partitioned tables

      Note

You can use a Check node to check the availability of only a MaxCompute partitioned table. Non-partitioned MaxCompute tables are not supported.

      The following check policies are provided in Check nodes to help you check the availability of a MaxCompute partitioned table:

      • Policy 1: Check whether a specified partition exists.

        If the partition exists, the system considers that the operation of writing data to the partition is complete and the MaxCompute partitioned table is available.

      • Policy 2: Check whether data in a specified partition is updated within a specified period of time.

        If the data in the partition is not updated within the specified period of time, the system considers that the operation of writing data to the partition is complete and the MaxCompute partitioned table is available.

    • FTP files, OSS objects, HDFS files, or OSS-HDFS objects

      If a specified FTP file, OSS object, HDFS file, or OSS-HDFS object exists, the system considers that the file or object is available.

  • Real-time synchronization tasks

For this type of check object, the scheduled start time of the Check node is used as the reference time. If the real-time synchronization task has finished synchronizing all data that was generated at or before the reference time, the system considers that the task passes the check and is available.

In addition, you must specify the interval at which the check is triggered and a condition for stopping the check task on the Check node. The condition can be an end time for the check or a maximum number of checks. If the check object still fails the check after the end time is reached or the maximum number of checks is performed, the Check node exits and enters the failed state. For more information about policy configuration, see the Step 2: Configure a check policy for the Check node section in this topic.

Note

The preceding check objects can be periodically checked by using Check nodes. You must configure the scheduling time of the task on a Check node based on the expected check start time. The Check node remains in the running state after its scheduling conditions are met. If the condition in the check policy is met, the task on the Check node is successfully run. If the check is still not passed when the stop condition is reached, the task on the Check node fails. For more information about scheduling configuration, see the Step 3: Configure scheduling properties for the Check node section in this topic.
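The following minimal Python sketch, provided for illustration only, shows the polling semantics described above: a check repeats at a fixed interval until it passes or a stop condition is reached, and the 24-hour maximum running duration caps the number of checks. The run_check function and its check_fn callable are hypothetical names, not part of DataWorks.

```python
import time

def run_check(check_fn, interval_minutes: int, max_checks: int) -> bool:
    """Poll check_fn until it passes or the stop condition is reached.

    check_fn: a callable that returns True when the check object is available.
    interval_minutes: the check interval; DataWorks allows 1 to 30 minutes.
    max_checks: capped by the 24-hour (1,440-minute) maximum running duration,
    that is, at most 1440 // interval_minutes checks.
    """
    max_checks = min(max_checks, 1440 // interval_minutes)
    for attempt in range(1, max_checks + 1):
        if check_fn():
            return True  # condition met: the Check node succeeds
        if attempt < max_checks:
            time.sleep(interval_minutes * 60)  # wait one check interval
    return False  # stop condition reached: the Check node fails

# Example: with a 5-minute interval, at most 1440 // 5 = 288 checks are possible.
```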

Limits

  • Limits on resource groups

    • You can use serverless resource groups or exclusive resource groups for scheduling to run tasks on Check nodes. We recommend that you use serverless resource groups. For more information about how to purchase a serverless resource group, see Create and use a serverless resource group.

  • Limits on node features

    • A Check node can be used to check only one object. If your task depends on multiple objects, such as multiple MaxCompute partitioned tables, you must create multiple Check nodes to separately check these objects.

    • The check interval of a Check node ranges from 1 to 30 minutes.

  • Limits on DataWorks editions

    You can use Check nodes only in DataWorks Professional Edition or a more advanced edition. If you are using DataWorks Basic Edition or DataWorks Standard Edition, you must upgrade your DataWorks service to DataWorks Professional Edition or a more advanced edition before you use Check nodes. For more information, see the Edition upgrade and downgrade section of the "Billing of DataWorks editions" topic.

Prerequisites

  • Before you use a Check node to perform a check based on a data source, you must first prepare the data source that you want to use. The following list describes the required preparations:

    • MaxCompute partitioned table:

      1. A MaxCompute data source is added to DataWorks and is associated with DataStudio. You must add a MaxCompute project to a DataWorks workspace as a MaxCompute data source before you can use the data source to access data in the MaxCompute project.

      2. A MaxCompute partitioned table is created.

    • FTP file: An FTP data source is added. You must add the FTP service to a DataWorks workspace as an FTP data source before you can use the data source to access data of the FTP service. For more information, see FTP data source.

    • OSS object: An OSS data source is added, and an AccessKey pair is configured to access the OSS data source. You must add an OSS bucket to a DataWorks workspace as an OSS data source before you can use the data source to access data in the bucket.

      Note

      In a Check node, you can access an OSS data source only by using an AccessKey pair. You cannot use an OSS data source that is added in RAM role-based authorization mode.

    • HDFS file: An HDFS data source is added. You must add the HDFS service to a DataWorks workspace as an HDFS data source before you can use the data source to access data in HDFS files. For more information, see HDFS data source.

    • OSS-HDFS object: An OSS-HDFS data source is added. You must add the OSS-HDFS service to a DataWorks workspace as an OSS-HDFS data source before you can use the data source to access data of the OSS-HDFS service. For more information, see OSS-HDFS data source.

  • If you want to use a Check node to perform a check based on a real-time synchronization task, the task must synchronize data from Kafka to MaxCompute. Create such a real-time synchronization task before you perform the check. For more information, see Configure a real-time synchronization task in DataStudio.

Step 1: Create a Check node

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and Governance > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.

  2. Move the pointer over the Create icon and choose Create Node > General > Check Node.

    In the Create Node dialog box, configure the Path and Name parameters as prompted and click Confirm.

Step 2: Configure a check policy for the Check node

You can configure a check policy to perform a check based on a data source or a real-time synchronization task.

Data source

Configure a check policy for a MaxCompute partitioned table


The parameters are described as follows.

• Data Source Type: Select MaxCompute.

• Data Source Name: The name of the data source to which the MaxCompute partitioned table that you want to check belongs. If no data source is available, click New data source to add one. For more information about how to add a MaxCompute data source, see Add a MaxCompute data source.

• Table Name: The name of the MaxCompute partitioned table that you want to check.

  Note

  You can select only a MaxCompute partitioned table that belongs to the specified data source.

• Partition: The name of the partition in the MaxCompute partitioned table that you want to check. You can click Preview Table Information next to Table Name to obtain the partition name. You can also use scheduling parameters to obtain the partition name. For more information about how to use scheduling parameters, see Supported formats of scheduling parameters.

• Condition For Check Passing: Specifies the check method and the check passing condition of the partitioned table. Valid values:

  • Partition Existed: checks whether the specified partition exists.

    • If the partition exists, the partitioned table passes the check, and the system considers that the partitioned table is available.

    • If the partition does not exist, the partitioned table fails the check, and the system considers that the partitioned table is unavailable.

  • Last Modification Time Not Updated for Specific Duration: checks whether data in the specified partition is updated within a specified period of time. This method is based on the LastModifiedTime property of the partition.

    • If data in the partition is not updated within the specified period of time, the partitioned table passes the check, and the system considers that the data write operation is complete and the partitioned table is available.

    • If data in the partition is updated within the specified period of time, the partitioned table fails the check, and the system considers that the data write operation is not complete and the partitioned table is unavailable.

    Note
    • The period of time can be set to 5, 10, 15, 20, 25, or 30 minutes.

    • For more information about the LastModifiedTime property, see Change the value of LastModifiedTime.

• Policy For Stopping Check: The policy for stopping the check task on the current Check node. You can specify either an end time for the check or a maximum number of checks, together with the check interval.

  • Time for Stopping Check: Specify the check interval and the end time for the check task. If the check is still not passed when the end time is reached, the check task exits and enters the failed state.

    Note
    • The check interval ranges from 1 to 30 minutes.

    • If the check task starts to run later than the specified end time due to a delay of an upstream task, the check task performs the check only once after the upstream task finishes running.

  • Checks Allowed Before Check Node Stops: Specify the check interval and the maximum number of checks. If the maximum number of checks is reached and the check is still not passed, the check task exits and enters the failed state.

    Note
    • The check interval ranges from 1 to 30 minutes.

    • The maximum running duration of a check task is 24 hours (1,440 minutes), so the maximum number of checks depends on the check interval. For example, a 5-minute interval allows a maximum of 288 checks, and a 10-minute interval allows a maximum of 144 checks. You can view the exact values in the DataWorks console.
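For readers who want to reproduce the two check methods outside DataWorks, the following is a minimal sketch that uses the PyODPS SDK. The credentials, endpoint, project, table, and partition names are placeholders, and the attribute used for the last-modified check may vary across PyODPS versions; this is an illustration, not how the Check node itself is implemented.

```python
from datetime import datetime, timedelta

from odps import ODPS  # PyODPS SDK: pip install pyodps

# Placeholder credentials, endpoint, and names; replace with your own.
o = ODPS('<access_key_id>', '<access_key_secret>', project='my_project',
         endpoint='https://service.cn-shanghai.maxcompute.aliyun.com/api')
table = o.get_table('my_partitioned_table')
partition_spec = 'ds=20241113'

# Policy 1: check whether the specified partition exists.
partition_exists = table.exist_partition(partition_spec)

# Policy 2: check whether the partition has stayed unmodified for a duration
# (the attribute name below may differ across PyODPS versions).
if partition_exists:
    partition = table.get_partition(partition_spec)
    idle = datetime.now() - partition.last_data_modified_time
    write_complete = idle >= timedelta(minutes=10)  # e.g. quiet for 10 minutes
```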

Configure a check policy for an FTP file


The parameters are described as follows.

• Data Source Type: Select FTP.

• Data Source Name: The name of the data source to which the FTP file that you want to check belongs. If no data source is available, click New data source to add one. For more information about how to add an FTP data source, see FTP data source.

• File Path: The storage path of the FTP file that you want to check. Example: /var/ftp/test/. If the specified path exists, the system considers that a file with the same name exists. You can directly enter a path or use scheduling parameters to obtain the path. For more information about how to use scheduling parameters, see Supported formats of scheduling parameters.

• Condition For Check Passing: Specifies the check passing condition of the FTP file.

  • If the FTP file exists, the check is passed, and the system considers that the FTP file is available.

  • If the FTP file does not exist, the check is not passed, and the system considers that the FTP file is unavailable.

• Policy For Stopping Check: The policy for stopping the check task on the current Check node. You can specify either an end time for the check or a maximum number of checks, together with the check interval.

  • Time for Stopping Check: Specify the check interval and the end time for the check task. If the check is still not passed when the end time is reached, the check task exits and enters the failed state.

    Note
    • The check interval ranges from 1 to 30 minutes.

    • If the check task starts to run later than the specified end time due to a delay of an upstream task, the check task performs the check only once after the upstream task finishes running.

  • Checks Allowed Before Check Node Stops: Specify the check interval and the maximum number of checks. If the maximum number of checks is reached and the check is still not passed, the check task exits and enters the failed state.

    Note
    • The check interval ranges from 1 to 30 minutes.

    • The maximum running duration of a check task is 24 hours (1,440 minutes), so the maximum number of checks depends on the check interval. For example, a 5-minute interval allows a maximum of 288 checks, and a 10-minute interval allows a maximum of 144 checks. You can view the exact values in the DataWorks console.
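The existence check for an FTP file can be approximated outside DataWorks with Python's standard ftplib module, as in the following sketch. The host, credentials, and file path are placeholders, and FTP servers differ in how they answer SIZE and NLST requests, so treat this as a best-effort illustration rather than the Check node's actual logic.

```python
import posixpath
from ftplib import FTP, error_perm

def ftp_path_exists(ftp: FTP, path: str) -> bool:
    """Best-effort existence check; FTP servers vary in SIZE/NLST behavior."""
    try:
        ftp.size(path)  # usually succeeds only for existing regular files
        return True
    except error_perm:
        # Fall back to listing the parent directory and comparing names.
        parent, name = posixpath.split(path.rstrip('/'))
        listing = ftp.nlst(parent or '/')
        return any(posixpath.basename(entry) == name for entry in listing)

# Placeholder host, credentials, and path; replace with your data source values.
with FTP('ftp.example.com') as ftp:
    ftp.login('user', 'password')
    print(ftp_path_exists(ftp, '/var/ftp/test/data.csv'))
```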

Configure a check policy for an OSS object


The parameters are described as follows.

• Data Source Type: Select OSS.

• Data Source Name: The name of the data source to which the OSS object that you want to check belongs. If no data source is available, click New data source to add one. For more information about how to add an OSS data source, see OSS data source.

• File Path: The storage path of the OSS object that you want to check. To view the storage path of an OSS object, log on to the OSS console, go to the details page of the desired bucket, and choose Object Management > Objects in the left-side navigation pane. The path must conform to the object path format of OSS.

  • If the path ends with a forward slash (/), the Check node checks whether a folder with the same name exists in OSS. For example, if the path is user/, the Check node checks whether a folder named user exists.

  • If the path does not end with a forward slash (/), the Check node checks whether an object with the same name exists in OSS. For example, if the path is user, the Check node checks whether an object named user exists.

  Note

  After you select a data source, the system automatically uses the bucket that is configured in the data source. You do not need to specify bucket information in the path. After you enter a path, you can click View Complete Path Information to view the endpoint and bucket information of the OSS data source in the development environment.

• Condition For Check Passing: Specifies the check passing condition of the OSS object.

  • If the OSS object exists, the check is passed, and the system considers that the OSS object is available.

  • If the OSS object does not exist, the check is not passed, and the system considers that the OSS object is unavailable.

• Policy For Stopping Check: The policy for stopping the check task on the current Check node. You can specify either an end time for the check or a maximum number of checks, together with the check interval.

  • Time for Stopping Check: Specify the check interval and the end time for the check task. If the check is still not passed when the end time is reached, the check task exits and enters the failed state.

    Note
    • The check interval ranges from 1 to 30 minutes.

    • If the check task starts to run later than the specified end time due to a delay of an upstream task, the check task performs the check only once after the upstream task finishes running.

  • Checks Allowed Before Check Node Stops: Specify the check interval and the maximum number of checks. If the maximum number of checks is reached and the check is still not passed, the check task exits and enters the failed state.

    Note
    • The check interval ranges from 1 to 30 minutes.

    • The maximum running duration of a check task is 24 hours (1,440 minutes), so the maximum number of checks depends on the check interval. For example, a 5-minute interval allows a maximum of 288 checks, and a 10-minute interval allows a maximum of 144 checks. You can view the exact values in the DataWorks console.
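The folder and object semantics described above can be reproduced with the oss2 SDK, as in the following sketch. The AccessKey pair, endpoint, bucket name, and path are placeholders; because OSS simulates folders as key prefixes, the folder branch checks whether at least one object exists under the prefix.

```python
import oss2  # pip install oss2

# Placeholder AccessKey pair, endpoint, and bucket name; replace with the
# values from your OSS data source (Check nodes require AccessKey access).
auth = oss2.Auth('<access_key_id>', '<access_key_secret>')
bucket = oss2.Bucket(auth, 'https://oss-cn-shanghai.aliyuncs.com', 'my-bucket')

path = 'user/'  # trailing slash: folder check; no trailing slash: object check

if path.endswith('/'):
    # OSS folders are key prefixes; the folder "exists" if at least one
    # object (including a zero-byte folder marker) carries the prefix.
    result = bucket.list_objects(prefix=path, max_keys=1)
    exists = bool(result.object_list)
else:
    exists = bucket.object_exists(path)

print(exists)
```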

Configure a check policy for an HDFS file

The parameters are described as follows.

• Data Source Type: Select HDFS.

• Data Source Name: The name of the data source to which the HDFS file that you want to check belongs. If no data source is available, click New data source to add one. For more information about how to add an HDFS data source, see HDFS data source.

• File Path: The storage path of the HDFS file that you want to check. Example: /user/dw_test/dw. If the specified path exists, the system considers that a file with the same name exists. You can directly enter a path or use scheduling parameters to obtain the path. For more information about how to use scheduling parameters, see Supported formats of scheduling parameters.

• Condition For Check Passing: Specifies the check passing condition of the HDFS file.

  • If the HDFS file exists, the check is passed, and the system considers that the HDFS file is available.

  • If the HDFS file does not exist, the check is not passed, and the system considers that the HDFS file is unavailable.

• Policy For Stopping Check: The policy for stopping the check task on the current Check node. You can specify either an end time for the check or a maximum number of checks, together with the check interval.

  • Time for Stopping Check: Specify the check interval and the end time for the check task. If the check is still not passed when the end time is reached, the check task exits and enters the failed state.

    Note
    • The check interval ranges from 1 to 30 minutes.

    • If the check task starts to run later than the specified end time due to a delay of an upstream task, the check task performs the check only once after the upstream task finishes running.

  • Checks Allowed Before Check Node Stops: Specify the check interval and the maximum number of checks. If the maximum number of checks is reached and the check is still not passed, the check task exits and enters the failed state.

    Note
    • The check interval ranges from 1 to 30 minutes.

    • The maximum running duration of a check task is 24 hours (1,440 minutes), so the maximum number of checks depends on the check interval. For example, a 5-minute interval allows a maximum of 288 checks, and a 10-minute interval allows a maximum of 144 checks. You can view the exact values in the DataWorks console.
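Outside DataWorks, an equivalent existence check can be sketched with the Python hdfs package, a WebHDFS client. The WebHDFS endpoint, user, and path below are placeholders.

```python
from hdfs import InsecureClient  # pip install hdfs (WebHDFS client)

# Placeholder WebHDFS endpoint and user; replace with your HDFS data source values.
client = InsecureClient('http://namenode.example.com:9870', user='dw_test')

# status() returns the path's metadata, or None when strict=False and the path
# does not exist.
exists = client.status('/user/dw_test/dw', strict=False) is not None
print(exists)
```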

Configure a check policy for an OSS-HDFS object

The parameters are described as follows.

• Data Source Type: Select OSS_HDFS.

• Data Source Name: The name of the data source to which the OSS-HDFS object that you want to check belongs. If no data source is available, click New data source to add one. For more information about how to add an OSS-HDFS data source, see OSS-HDFS data source.

• File Path: The storage path of the OSS-HDFS object that you want to check. To view the storage path of an OSS-HDFS object, log on to the OSS console, go to the details page of the desired bucket, and choose Object Management > Objects in the left-side navigation pane. The path must conform to the object path format of OSS-HDFS.

  • If the path ends with a forward slash (/), the Check node checks whether a folder with the same name exists in OSS-HDFS. For example, if the path is user/, the Check node checks whether a folder named user exists.

  • If the path does not end with a forward slash (/), the Check node checks whether an object with the same name exists in OSS-HDFS. For example, if the path is user, the Check node checks whether an object named user exists.

• Condition For Check Passing: Specifies the check passing condition of the OSS-HDFS object.

  • If the OSS-HDFS object exists, the check is passed, and the system considers that the OSS-HDFS object is available.

  • If the OSS-HDFS object does not exist, the check is not passed, and the system considers that the OSS-HDFS object is unavailable.

• Policy For Stopping Check: The policy for stopping the check task on the current Check node. You can specify either an end time for the check or a maximum number of checks, together with the check interval.

  • Time for Stopping Check: Specify the check interval and the end time for the check task. If the check is still not passed when the end time is reached, the check task exits and enters the failed state.

    Note
    • The check interval ranges from 1 to 30 minutes.

    • If the check task starts to run later than the specified end time due to a delay of an upstream task, the check task performs the check only once after the upstream task finishes running.

  • Checks Allowed Before Check Node Stops: Specify the check interval and the maximum number of checks. If the maximum number of checks is reached and the check is still not passed, the check task exits and enters the failed state.

    Note
    • The check interval ranges from 1 to 30 minutes.

    • The maximum running duration of a check task is 24 hours (1,440 minutes), so the maximum number of checks depends on the check interval. For example, a 5-minute interval allows a maximum of 288 checks, and a 10-minute interval allows a maximum of 144 checks. You can view the exact values in the DataWorks console.
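The trailing-slash rule that decides between a folder check and an object check can be summarized in a few lines of Python, shown below purely for illustration.

```python
import posixpath

def describe_check(path: str) -> str:
    """Mirror the Check node's trailing-slash rule for OSS-HDFS paths."""
    name = posixpath.basename(path.rstrip('/'))
    kind = 'folder' if path.endswith('/') else 'object'
    return f'check whether a {kind} named "{name}" exists'

print(describe_check('user/'))  # folder check
print(describe_check('user'))   # object check
```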

Real-time synchronization task


The parameters are described as follows.

• Check Object: Select Real-time Synchronization Task.

• Real-time Synchronization Task: The name of the real-time synchronization task that you want to check.

  Note
  • Only real-time synchronization tasks that synchronize data from Kafka to MaxCompute are supported.

  • If such a real-time synchronization task exists but cannot be selected, check whether the task is deployed to the production environment. If it is not, deploy the task first.

• Policy For Stopping Check: The policy for stopping the check task on the current Check node. You can specify either an end time for the check or a maximum number of checks, together with the check interval.

  • Time for Stopping Check: Specify the check interval and the end time for the check task. If the check is still not passed when the end time is reached, the check task exits and enters the failed state.

    Note
    • The check interval ranges from 1 to 30 minutes.

    • If the check task starts to run later than the specified end time due to a delay of an upstream task, the check task performs the check only once after the upstream task finishes running.

  • Checks Allowed Before Check Node Stops: Specify the check interval and the maximum number of checks. If the maximum number of checks is reached and the check is still not passed, the check task exits and enters the failed state.

    Note
    • The check interval ranges from 1 to 30 minutes.

    • The maximum running duration of a check task is 24 hours (1,440 minutes), so the maximum number of checks depends on the check interval. For example, a 5-minute interval allows a maximum of 288 checks, and a 10-minute interval allows a maximum of 144 checks. You can view the exact values in the DataWorks console.
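The reference-time judgment described in the Supported check objects and check policies section can be illustrated with the following sketch. The synced_up_to watermark is a hypothetical stand-in for the synchronization progress that DataWorks tracks internally; it is not an exposed API.

```python
from datetime import datetime

def sync_task_passes(scheduled_time: datetime, synced_up_to: datetime) -> bool:
    """The Check node's scheduled start time is the reference time: the task
    passes once all data generated at or before that time has been synced.
    synced_up_to is a hypothetical watermark, used only for illustration."""
    return synced_up_to >= scheduled_time

# The watermark has caught up with the reference time, so the check passes.
print(sync_task_passes(datetime(2024, 11, 13, 1, 0),
                       datetime(2024, 11, 13, 1, 5)))  # True
```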

Step 3: Configure scheduling properties for the Check node

If you want the system to periodically run the task on the Check node to check the specified object, click Properties in the right-side navigation pane of the configuration tab of the Check node and configure the scheduling properties based on your business requirements. For more information, see Overview.

You must configure scheduling properties such as scheduling dependencies and scheduling time for a Check node in the way that you configure scheduling properties for other types of nodes. Each node in DataWorks must be configured with upstream dependencies. If the Check node does not have ancestor nodes, you can select a zero load node or the root node in the current workspace as the ancestor node of the Check node based on the complexity of your business. For more information, see Create and use a zero load node.

Note

You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit a task on the node.

Step 4: Commit and deploy a task on the node

After you configure the node, you must commit and deploy it. The system then periodically runs the task on the node based on the scheduling configurations.

  1. Click the Save icon in the top toolbar to save the task.

  2. Click the Submit icon in the top toolbar to commit the task on the node.

    In the Submit dialog box, configure the Change description parameter. Then, based on your business requirements, determine whether to review the task code and perform smoke testing after you commit the task.

    Note
    • You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit a task on the node.

    • You can use the code review feature to ensure the code quality of tasks and prevent task execution errors caused by invalid task code. If you enable the code review feature, the task code that is committed can be deployed only after the task code passes the code review. For more information, see Code review.

    • To ensure that a task on the node you created can be run as expected, we recommend that you perform smoke testing before you deploy the task. For more information, see Perform smoke testing.

If the workspace that you use is in standard mode, after you commit the task on the node, you must click Deploy in the upper-right corner of the node configuration tab to deploy the task to the production environment for running. For more information, see Deploy tasks.

What to do next

After you commit and deploy a task on the Check node to Operation Center in the production environment, DataWorks runs the task on the Check node on a regular basis based on the scheduling configurations of the node. You can view the check results of the node and perform O&M operations in Operation Center. For more information, see Perform basic O&M operations on auto triggered tasks.