DataWorks: Configure a Check node

Last Updated: Nov 13, 2024

DataWorks allows you to use a Check node to check the availability of MaxCompute partitioned tables, File Transfer Protocol (FTP) files, Object Storage Service (OSS) objects, Hadoop Distributed File System (HDFS) files, OSS-HDFS objects, and real-time synchronization tasks based on check policies. If the condition that is specified in the check policy is met, the task on the Check node is successfully run. If the running of a task depends on an object, you can configure the task as a descendant task of a Check node that checks the availability of that object. This way, the descendant task is triggered to run only after the Check node passes its check. This topic describes the supported check objects and check policies and how to configure a Check node.

Supported check objects and check policies

Check nodes can perform checks only on data sources and real-time synchronization tasks. The following information describes the check policies that DataWorks supports:

  • Data sources

    • MaxCompute partitioned tables

      Note

You can use a Check node to check the availability of only a MaxCompute partitioned table. Non-partitioned MaxCompute tables are not supported.

      The following check policies are provided in Check nodes to help you check the availability of a MaxCompute partitioned table:

      • Policy 1: Check whether a specified partition exists.

        If the partition exists, the system considers that the operation of writing data to the partition is complete and the MaxCompute partitioned table is available.

      • Policy 2: Check whether data in a specified partition is updated within a specified period of time.

        If the data in the partition is not updated within the specified period of time, the system considers that the operation of writing data to the partition is complete and the MaxCompute partitioned table is available.

    • FTP files, OSS objects, HDFS files, or OSS-HDFS objects

      If a specified FTP file, OSS object, HDFS file, or OSS-HDFS object exists, the system considers that the file or object is available.

  • Real-time synchronization tasks

For this type of check object, the scheduled start time of the Check node is used as the reference time. If the real-time synchronization task has finished synchronizing all data that was generated at or before the reference time, the system considers that the task passes the check and is available.

In addition, you must specify the interval at which the check is triggered and a condition for stopping the check task on the Check node. The condition can be an end time for the check or a maximum number of checks. If the check object still fails the check after the end time is reached or the maximum number of checks is performed, the Check node exits and enters the failed state. For more information about policy configuration, see the Step 2: Configure a check policy for the Check node section in this topic.

Note

The preceding check objects can be periodically checked by using Check nodes. You must configure the scheduling time of the task on a Check node based on the expected check start time. The Check node remains in the running state after its scheduling conditions are met. If the condition in the check policy is met, the task on the Check node is successfully run. If the check is still not passed when the stop condition is reached, the task on the Check node fails. For more information about scheduling configuration, see the Step 3: Configure scheduling properties for the Check node section in this topic.
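The following minimal Python sketch, provided for illustration only, shows the polling semantics described above: a check repeats at a fixed interval until it passes or a stop condition is reached, and the 24-hour maximum running duration caps the number of checks. The run_check function and its check_fn callable are hypothetical names, not part of DataWorks.

```python
import time

def run_check(check_fn, interval_minutes: int, max_checks: int) -> bool:
    """Poll check_fn until it passes or the stop condition is reached.

    check_fn: a callable that returns True when the check object is available.
    interval_minutes: the check interval; DataWorks allows 1 to 30 minutes.
    max_checks: capped by the 24-hour (1,440-minute) maximum running duration,
    that is, at most 1440 // interval_minutes checks.
    """
    max_checks = min(max_checks, 1440 // interval_minutes)
    for attempt in range(1, max_checks + 1):
        if check_fn():
            return True  # condition met: the Check node succeeds
        if attempt < max_checks:
            time.sleep(interval_minutes * 60)  # wait one check interval
    return False  # stop condition reached: the Check node fails

# Example: with a 5-minute interval, at most 1440 // 5 = 288 checks are possible.
```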

Limits

  • Limits on resource groups

    • You can use serverless resource groups or exclusive resource groups for scheduling to run tasks on Check nodes. We recommend that you use serverless resource groups. For more information about how to purchase a serverless resource group, see Create and use a serverless resource group.

  • Limits on node features

    • A Check node can be used to check only one object. If your task depends on multiple objects, such as multiple MaxCompute partitioned tables, you must create multiple Check nodes to separately check these objects.

    • The check interval of a Check node ranges from 1 to 30 minutes.

  • Limits on DataWorks editions

    You can use Check nodes only in DataWorks Professional Edition or a more advanced edition. If you are using DataWorks Basic Edition or DataWorks Standard Edition, you must upgrade your DataWorks service to DataWorks Professional Edition or a more advanced edition before you use Check nodes. For more information, see the Edition upgrade and downgrade section of the "Billing of DataWorks editions" topic.

Prerequisites

  • Before you use a Check node to perform a check based on a data source, you must first prepare the data source that you want to use. The following list describes the required preparations:

    • MaxCompute partitioned table:

      1. A MaxCompute data source is added to DataWorks and is associated with DataStudio. You must add a MaxCompute project to a DataWorks workspace as a MaxCompute data source before you can use the data source to access data in the MaxCompute project.

      2. A MaxCompute partitioned table is created.

    • FTP file: An FTP data source is added. You must add the FTP service to a DataWorks workspace as an FTP data source before you can use the data source to access data of the FTP service. For more information, see FTP data source.

    • OSS object: An OSS data source is added, and an AccessKey pair is configured to access the OSS data source. You must add an OSS bucket to a DataWorks workspace as an OSS data source before you can use the data source to access data in the bucket.

      Note

      In a Check node, you can access an OSS data source only by using an AccessKey pair. You cannot use an OSS data source that is added in RAM role-based authorization mode.

    • HDFS file: An HDFS data source is added. You must add the HDFS service to a DataWorks workspace as an HDFS data source before you can use the data source to access data in HDFS files. For more information, see HDFS data source.

    • OSS-HDFS object: An OSS-HDFS data source is added. You must add the OSS-HDFS service to a DataWorks workspace as an OSS-HDFS data source before you can use the data source to access data of the OSS-HDFS service. For more information, see OSS-HDFS data source.

  • If you want to use a Check node to perform a check based on a real-time synchronization task, the task must synchronize data from Kafka to MaxCompute. Create such a real-time synchronization task before you perform the check. For more information, see Configure a real-time synchronization task in DataStudio.

Step 1: Create a Check node

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and Governance > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.

  2. Move the pointer over the Create icon and choose Create Node > General > Check Node.

    In the Create Node dialog box, configure the Path and Name parameters as prompted and click Confirm.

Step 2: Configure a check policy for the Check node

You can configure a check policy to perform a check based on a data source or a real-time synchronization task.

Data source

Configure a check policy for a MaxCompute partitioned table


The parameters are described as follows.

• Data Source Type: Select MaxCompute.

• Data Source Name: The name of the data source to which the MaxCompute partitioned table that you want to check belongs. If no data source is available, click New data source to add one. For more information about how to add a MaxCompute data source, see Add a MaxCompute data source.

• Table Name: The name of the MaxCompute partitioned table that you want to check.

  Note

  You can select only a MaxCompute partitioned table that belongs to the specified data source.

• Partition: The name of the partition in the MaxCompute partitioned table that you want to check. You can click Preview Table Information next to Table Name to obtain the partition name. You can also use scheduling parameters to obtain the partition name. For more information about how to use scheduling parameters, see Supported formats of scheduling parameters.

• Condition For Check Passing: Specifies the check method and the check passing condition of the partitioned table. Valid values:

  • Partition Existed: checks whether the specified partition exists.

    • If the partition exists, the partitioned table passes the check, and the system considers that the partitioned table is available.

    • If the partition does not exist, the partitioned table fails the check, and the system considers that the partitioned table is unavailable.

  • Last Modification Time Not Updated for Specific Duration: checks whether data in the specified partition is updated within a specified period of time. This method is based on the LastModifiedTime property of the partition.

    • If data in the partition is not updated within the specified period of time, the partitioned table passes the check, and the system considers that the data write operation is complete and the partitioned table is available.

    • If data in the partition is updated within the specified period of time, the partitioned table fails the check, and the system considers that the data write operation is not complete and the partitioned table is unavailable.

    Note
    • The period of time can be set to 5, 10, 15, 20, 25, or 30 minutes.

    • For more information about the LastModifiedTime property, see Change the value of LastModifiedTime.

• Policy For Stopping Check: The policy for stopping the check task on the current Check node. You can specify either an end time for the check or a maximum number of checks, together with the check interval.

  • Time for Stopping Check: Specify the check interval and the end time for the check task. If the check is still not passed when the end time is reached, the check task exits and enters the failed state.

    Note
    • The check interval ranges from 1 to 30 minutes.

    • If the check task starts to run later than the specified end time due to a delay of an upstream task, the check task performs the check only once after the upstream task finishes running.

  • Checks Allowed Before Check Node Stops: Specify the check interval and the maximum number of checks. If the maximum number of checks is reached and the check is still not passed, the check task exits and enters the failed state.

    Note
    • The check interval ranges from 1 to 30 minutes.

    • The maximum running duration of a check task is 24 hours (1,440 minutes), so the maximum number of checks depends on the check interval. For example, a 5-minute interval allows a maximum of 288 checks, and a 10-minute interval allows a maximum of 144 checks. You can view the exact values in the DataWorks console.
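For readers who want to reproduce the two check methods outside DataWorks, the following is a minimal sketch that uses the PyODPS SDK. The credentials, endpoint, project, table, and partition names are placeholders, and the attribute used for the last-modified check may vary across PyODPS versions; this is an illustration, not how the Check node itself is implemented.

```python
from datetime import datetime, timedelta

from odps import ODPS  # PyODPS SDK: pip install pyodps

# Placeholder credentials, endpoint, and names; replace with your own.
o = ODPS('<access_key_id>', '<access_key_secret>', project='my_project',
         endpoint='https://service.cn-shanghai.maxcompute.aliyun.com/api')
table = o.get_table('my_partitioned_table')
partition_spec = 'ds=20241113'

# Policy 1: check whether the specified partition exists.
partition_exists = table.exist_partition(partition_spec)

# Policy 2: check whether the partition has stayed unmodified for a duration
# (the attribute name below may differ across PyODPS versions).
if partition_exists:
    partition = table.get_partition(partition_spec)
    idle = datetime.now() - partition.last_data_modified_time
    write_complete = idle >= timedelta(minutes=10)  # e.g. quiet for 10 minutes
```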

Configure a check policy for an FTP file


The parameters are described as follows.

• Data Source Type: Select FTP.

• Data Source Name: The name of the data source to which the FTP file that you want to check belongs. If no data source is available, click New data source to add one. For more information about how to add an FTP data source, see FTP data source.

• File Path: The storage path of the FTP file that you want to check. Example: /var/ftp/test/. If the specified path exists, the system considers that a file with the same name exists. You can directly enter a path or use scheduling parameters to obtain the path. For more information about how to use scheduling parameters, see Supported formats of scheduling parameters.

• Condition For Check Passing: Specifies the check passing condition of the FTP file.

  • If the FTP file exists, the check is passed, and the system considers that the FTP file is available.

  • If the FTP file does not exist, the check is not passed, and the system considers that the FTP file is unavailable.

• Policy For Stopping Check: The policy for stopping the check task on the current Check node. You can specify either an end time for the check or a maximum number of checks, together with the check interval.

  • Time for Stopping Check: Specify the check interval and the end time for the check task. If the check is still not passed when the end time is reached, the check task exits and enters the failed state.

    Note
    • The check interval ranges from 1 to 30 minutes.

    • If the check task starts to run later than the specified end time due to a delay of an upstream task, the check task performs the check only once after the upstream task finishes running.

  • Checks Allowed Before Check Node Stops: Specify the check interval and the maximum number of checks. If the maximum number of checks is reached and the check is still not passed, the check task exits and enters the failed state.

    Note
    • The check interval ranges from 1 to 30 minutes.

    • The maximum running duration of a check task is 24 hours (1,440 minutes), so the maximum number of checks depends on the check interval. For example, a 5-minute interval allows a maximum of 288 checks, and a 10-minute interval allows a maximum of 144 checks. You can view the exact values in the DataWorks console.
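The existence check for an FTP file can be approximated outside DataWorks with Python's standard ftplib module, as in the following sketch. The host, credentials, and file path are placeholders, and FTP servers differ in how they answer SIZE and NLST requests, so treat this as a best-effort illustration rather than the Check node's actual logic.

```python
import posixpath
from ftplib import FTP, error_perm

def ftp_path_exists(ftp: FTP, path: str) -> bool:
    """Best-effort existence check; FTP servers vary in SIZE/NLST behavior."""
    try:
        ftp.size(path)  # usually succeeds only for existing regular files
        return True
    except error_perm:
        # Fall back to listing the parent directory and comparing names.
        parent, name = posixpath.split(path.rstrip('/'))
        listing = ftp.nlst(parent or '/')
        return any(posixpath.basename(entry) == name for entry in listing)

# Placeholder host, credentials, and path; replace with your data source values.
with FTP('ftp.example.com') as ftp:
    ftp.login('user', 'password')
    print(ftp_path_exists(ftp, '/var/ftp/test/data.csv'))
```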

Configure a check policy for an OSS object


The parameters are described as follows.

• Data Source Type: Select OSS.

• Data Source Name: The name of the data source to which the OSS object that you want to check belongs. If no data source is available, click New data source to add one. For more information about how to add an OSS data source, see OSS data source.

• File Path: The storage path of the OSS object that you want to check. To view the storage path of an OSS object, log on to the OSS console, go to the details page of the desired bucket, and choose Object Management > Objects in the left-side navigation pane. The path must conform to the object path format of OSS.

  • If the path ends with a forward slash (/), the Check node checks whether a folder with the same name exists in OSS. For example, if the path is user/, the Check node checks whether a folder named user exists.

  • If the path does not end with a forward slash (/), the Check node checks whether an object with the same name exists in OSS. For example, if the path is user, the Check node checks whether an object named user exists.

  Note

  After you select a data source, the system automatically uses the bucket that is configured in the data source. You do not need to specify bucket information in the path. After you enter a path, you can click View Complete Path Information to view the endpoint and bucket information of the OSS data source in the development environment.

• Condition For Check Passing: Specifies the check passing condition of the OSS object.

  • If the OSS object exists, the check is passed, and the system considers that the OSS object is available.

  • If the OSS object does not exist, the check is not passed, and the system considers that the OSS object is unavailable.

• Policy For Stopping Check: The policy for stopping the check task on the current Check node. You can specify either an end time for the check or a maximum number of checks, together with the check interval.

  • Time for Stopping Check: Specify the check interval and the end time for the check task. If the check is still not passed when the end time is reached, the check task exits and enters the failed state.

    Note
    • The check interval ranges from 1 to 30 minutes.

    • If the check task starts to run later than the specified end time due to a delay of an upstream task, the check task performs the check only once after the upstream task finishes running.

  • Checks Allowed Before Check Node Stops: Specify the check interval and the maximum number of checks. If the maximum number of checks is reached and the check is still not passed, the check task exits and enters the failed state.

    Note
    • The check interval ranges from 1 to 30 minutes.

    • The maximum running duration of a check task is 24 hours (1,440 minutes), so the maximum number of checks depends on the check interval. For example, a 5-minute interval allows a maximum of 288 checks, and a 10-minute interval allows a maximum of 144 checks. You can view the exact values in the DataWorks console.
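The folder and object semantics described above can be reproduced with the oss2 SDK, as in the following sketch. The AccessKey pair, endpoint, bucket name, and path are placeholders; because OSS simulates folders as key prefixes, the folder branch checks whether at least one object exists under the prefix.

```python
import oss2  # pip install oss2

# Placeholder AccessKey pair, endpoint, and bucket name; replace with the
# values from your OSS data source (Check nodes require AccessKey access).
auth = oss2.Auth('<access_key_id>', '<access_key_secret>')
bucket = oss2.Bucket(auth, 'https://oss-cn-shanghai.aliyuncs.com', 'my-bucket')

path = 'user/'  # trailing slash: folder check; no trailing slash: object check

if path.endswith('/'):
    # OSS folders are key prefixes; the folder "exists" if at least one
    # object (including a zero-byte folder marker) carries the prefix.
    result = bucket.list_objects(prefix=path, max_keys=1)
    exists = bool(result.object_list)
else:
    exists = bucket.object_exists(path)

print(exists)
```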

Configure a check policy for an HDFS file

The parameters are described as follows.

• Data Source Type: Select HDFS.

• Data Source Name: The name of the data source to which the HDFS file that you want to check belongs. If no data source is available, click New data source to add one. For more information about how to add an HDFS data source, see HDFS data source.

• File Path: The storage path of the HDFS file that you want to check. Example: /user/dw_test/dw. If the specified path exists, the system considers that a file with the same name exists. You can directly enter a path or use scheduling parameters to obtain the path. For more information about how to use scheduling parameters, see Supported formats of scheduling parameters.

• Condition For Check Passing: Specifies the check passing condition of the HDFS file.

  • If the HDFS file exists, the check is passed, and the system considers that the HDFS file is available.

  • If the HDFS file does not exist, the check is not passed, and the system considers that the HDFS file is unavailable.

• Policy For Stopping Check: The policy for stopping the check task on the current Check node. You can specify either an end time for the check or a maximum number of checks, together with the check interval.

  • Time for Stopping Check: Specify the check interval and the end time for the check task. If the check is still not passed when the end time is reached, the check task exits and enters the failed state.

    Note
    • The check interval ranges from 1 to 30 minutes.

    • If the check task starts to run later than the specified end time due to a delay of an upstream task, the check task performs the check only once after the upstream task finishes running.

  • Checks Allowed Before Check Node Stops: Specify the check interval and the maximum number of checks. If the maximum number of checks is reached and the check is still not passed, the check task exits and enters the failed state.

    Note
    • The check interval ranges from 1 to 30 minutes.

    • The maximum running duration of a check task is 24 hours (1,440 minutes), so the maximum number of checks depends on the check interval. For example, a 5-minute interval allows a maximum of 288 checks, and a 10-minute interval allows a maximum of 144 checks. You can view the exact values in the DataWorks console.
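Outside DataWorks, an equivalent existence check can be sketched with the Python hdfs package, a WebHDFS client. The WebHDFS endpoint, user, and path below are placeholders.

```python
from hdfs import InsecureClient  # pip install hdfs (WebHDFS client)

# Placeholder WebHDFS endpoint and user; replace with your HDFS data source values.
client = InsecureClient('http://namenode.example.com:9870', user='dw_test')

# status() returns the path's metadata, or None when strict=False and the path
# does not exist.
exists = client.status('/user/dw_test/dw', strict=False) is not None
print(exists)
```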

Configure a check policy for an OSS-HDFS object

The parameters are described as follows.

• Data Source Type: Select OSS_HDFS.

• Data Source Name: The name of the data source to which the OSS-HDFS object that you want to check belongs. If no data source is available, click New data source to add one. For more information about how to add an OSS-HDFS data source, see OSS-HDFS data source.

• File Path: The storage path of the OSS-HDFS object that you want to check. To view the storage path of an OSS-HDFS object, log on to the OSS console, go to the details page of the desired bucket, and choose Object Management > Objects in the left-side navigation pane. The path must conform to the object path format of OSS-HDFS.

  • If the path ends with a forward slash (/), the Check node checks whether a folder with the same name exists in OSS-HDFS. For example, if the path is user/, the Check node checks whether a folder named user exists.

  • If the path does not end with a forward slash (/), the Check node checks whether an object with the same name exists in OSS-HDFS. For example, if the path is user, the Check node checks whether an object named user exists.

• Condition For Check Passing: Specifies the check passing condition of the OSS-HDFS object.

  • If the OSS-HDFS object exists, the check is passed, and the system considers that the OSS-HDFS object is available.

  • If the OSS-HDFS object does not exist, the check is not passed, and the system considers that the OSS-HDFS object is unavailable.

• Policy For Stopping Check: The policy for stopping the check task on the current Check node. You can specify either an end time for the check or a maximum number of checks, together with the check interval.

  • Time for Stopping Check: Specify the check interval and the end time for the check task. If the check is still not passed when the end time is reached, the check task exits and enters the failed state.

    Note
    • The check interval ranges from 1 to 30 minutes.

    • If the check task starts to run later than the specified end time due to a delay of an upstream task, the check task performs the check only once after the upstream task finishes running.

  • Checks Allowed Before Check Node Stops: Specify the check interval and the maximum number of checks. If the maximum number of checks is reached and the check is still not passed, the check task exits and enters the failed state.

    Note
    • The check interval ranges from 1 to 30 minutes.

    • The maximum running duration of a check task is 24 hours (1,440 minutes), so the maximum number of checks depends on the check interval. For example, a 5-minute interval allows a maximum of 288 checks, and a 10-minute interval allows a maximum of 144 checks. You can view the exact values in the DataWorks console.
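The trailing-slash rule that decides between a folder check and an object check can be summarized in a few lines of Python, shown below purely for illustration.

```python
import posixpath

def describe_check(path: str) -> str:
    """Mirror the Check node's trailing-slash rule for OSS-HDFS paths."""
    name = posixpath.basename(path.rstrip('/'))
    kind = 'folder' if path.endswith('/') else 'object'
    return f'check whether a {kind} named "{name}" exists'

print(describe_check('user/'))  # folder check
print(describe_check('user'))   # object check
```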

Real-time synchronization task


The parameters are described as follows.

• Check Object: Select Real-time Synchronization Task.

• Real-time Synchronization Task: The name of the real-time synchronization task that you want to check.

  Note
  • Only real-time synchronization tasks that synchronize data from Kafka to MaxCompute are supported.

  • If such a real-time synchronization task exists but cannot be selected, check whether the task is deployed to the production environment. If it is not, deploy the task first.

• Policy For Stopping Check: The policy for stopping the check task on the current Check node. You can specify either an end time for the check or a maximum number of checks, together with the check interval.

  • Time for Stopping Check: Specify the check interval and the end time for the check task. If the check is still not passed when the end time is reached, the check task exits and enters the failed state.

    Note
    • The check interval ranges from 1 to 30 minutes.

    • If the check task starts to run later than the specified end time due to a delay of an upstream task, the check task performs the check only once after the upstream task finishes running.

  • Checks Allowed Before Check Node Stops: Specify the check interval and the maximum number of checks. If the maximum number of checks is reached and the check is still not passed, the check task exits and enters the failed state.

    Note
    • The check interval ranges from 1 to 30 minutes.

    • The maximum running duration of a check task is 24 hours (1,440 minutes), so the maximum number of checks depends on the check interval. For example, a 5-minute interval allows a maximum of 288 checks, and a 10-minute interval allows a maximum of 144 checks. You can view the exact values in the DataWorks console.
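The reference-time judgment described in the Supported check objects and check policies section can be illustrated with the following sketch. The synced_up_to watermark is a hypothetical stand-in for the synchronization progress that DataWorks tracks internally; it is not an exposed API.

```python
from datetime import datetime

def sync_task_passes(scheduled_time: datetime, synced_up_to: datetime) -> bool:
    """The Check node's scheduled start time is the reference time: the task
    passes once all data generated at or before that time has been synced.
    synced_up_to is a hypothetical watermark, used only for illustration."""
    return synced_up_to >= scheduled_time

# The watermark has caught up with the reference time, so the check passes.
print(sync_task_passes(datetime(2024, 11, 13, 1, 0),
                       datetime(2024, 11, 13, 1, 5)))  # True
```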

Step 3: Configure scheduling properties for the Check node

If you want the system to periodically run the task on the Check node to check the specified object, click Properties in the right-side navigation pane of the configuration tab of the Check node and configure the scheduling properties based on your business requirements. For more information, see Overview.

You must configure scheduling properties such as scheduling dependencies and scheduling time for a Check node in the way that you configure scheduling properties for other types of nodes. Each node in DataWorks must be configured with upstream dependencies. If the Check node does not have ancestor nodes, you can select a zero load node or the root node in the current workspace as the ancestor node of the Check node based on the complexity of your business. For more information, see Create and use a zero load node.

Note

You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit a task on the node.

Step 4: Commit and deploy a task on the node

After you configure the node, you must commit and deploy it. The system then periodically runs the task on the node based on the scheduling configurations.

  1. Click the Save icon in the top toolbar to save the task.

  2. Click the Submit icon in the top toolbar to commit the task on the node.

    In the Submit dialog box, configure the Change description parameter. Then, based on your business requirements, determine whether to review the task code and perform smoke testing after you commit the task.

    Note
    • You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit a task on the node.

    • You can use the code review feature to ensure the code quality of tasks and prevent task execution errors caused by invalid task code. If you enable the code review feature, the task code that is committed can be deployed only after the task code passes the code review. For more information, see Code review.

    • To ensure that a task on the node you created can be run as expected, we recommend that you perform smoke testing before you deploy the task. For more information, see Perform smoke testing.

If the workspace that you use is in standard mode, after you commit the task on the node, you must click Deploy in the upper-right corner of the node configuration tab to deploy the task to the production environment for running. For more information, see Deploy tasks.

What to do next

After you commit and deploy a task on the Check node to Operation Center in the production environment, DataWorks runs the task on the Check node on a regular basis based on the scheduling configurations of the node. You can view the check results of the node and perform O&M operations in Operation Center. For more information, see Perform basic O&M operations on auto triggered tasks.