DataWorks: Create and use a data quality monitoring node

Last Updated: Nov 14, 2024

DataWorks allows you to create a data quality monitoring node and add monitoring rules to the node to monitor the data quality of a specific table of a data source. For example, you can use the data quality monitoring node to check whether dirty data exists. You can also configure a custom scheduling policy for the data quality monitoring node to periodically run a data quality monitoring task to check data. This topic describes how to create and use a data quality monitoring node to monitor the data quality of tables.

Background information

To ensure data quality, DataWorks Data Quality detects changes in source data and tracks dirty data that is generated during the extract, transform, load (ETL) process at the earliest opportunity. DataWorks Data Quality automatically blocks the running of tasks that involve dirty data to effectively stop the spread of dirty data to descendant tasks. This helps prevent unexpected dirty data from affecting the smooth running of tasks and business decision-making, reduces the time required to troubleshoot issues, and prevents the waste of resources caused by rerunning tasks. For more information, see Data Quality overview.

Limits

  • Supported data source types: MaxCompute, E-MapReduce (EMR), Hologres, Cloudera's Distribution Including Apache Hadoop (CDH) Hive, AnalyticDB for PostgreSQL, AnalyticDB for MySQL, and StarRocks.

  • Scope of tables that can be monitored:

    • You can monitor only the tables of a data source that is added to the workspace to which the current data quality monitoring node belongs.

    • Each data quality monitoring node can monitor the data quality of only one table. However, you can add multiple monitoring rules to a data quality monitoring node. The monitoring scope varies based on the type of a table.

      • Non-partitioned table: By default, all data in the table is monitored.

      • Partitioned table: You must specify a partition filter expression to determine the partition whose data you want to monitor.

      Note

      If you want to monitor the data quality of multiple tables, create multiple data quality monitoring nodes.

  • Supported operations:

    • After you create data quality monitoring rules in DataStudio, you can run, modify, and publish the monitoring rules or perform other management operations on the monitoring rules only in DataStudio. In DataWorks Data Quality, you can view the monitoring rules but cannot run or perform management operations on them.

    • If you modify the monitoring rules configured in a data quality monitoring node and deploy the node, the original monitoring rules are replaced.

Prerequisites

  • A workflow is created.

    In DataStudio, development operations are performed on different types of data sources based on workflows. Therefore, you must create a workflow before you create a data quality monitoring node. For more information, see Create a workflow.

  • A data source is added to a specific workspace, and a table whose data quality you want to monitor is created in the data source.

    Before you run a data quality monitoring task, you must create a data source table whose data quality you want to monitor. For more information, see Add and manage data sources, Preparations before data development: Associate a data source or a cluster with DataStudio, and Task development.

  • A resource group is created.

    You can run data quality monitoring nodes only by using a serverless resource group (recommended) or an exclusive resource group for scheduling. For more information, see Resource group management.

  • (Required if you use a RAM user to develop tasks) The RAM user is added to the DataWorks workspace as a member and is assigned the Develop or Workspace Administrator role. The Workspace Administrator role has extensive permissions. We recommend that you assign the Workspace Administrator role to a user only when necessary. For more information about how to add a member and assign roles to the member, see Add workspace members and assign roles to them.

Step 1: Create a data quality monitoring node

  1. Go to the DataStudio page.

    Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and Governance > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.

  2. In the Scheduled Workflow pane of the DataStudio page, find the desired workflow and choose Create Node > Data Quality > Data Quality Monitoring.

  3. In the Create Node dialog box, configure the Name parameter and click Confirm. Then, you can use the node to develop and configure a task.

Step 2: Configure data quality monitoring rules

1. Select a table that you want to monitor

In the Monitoring Rules section of the configuration tab of the data quality monitoring node, click Select Table. In the Select Table panel, specify filter conditions to search for and select the table that you want to monitor.

2. Configure the monitored data range

  • Non-partitioned table: By default, all data in a table is monitored. You can skip this configuration.

  • Partitioned table: You must select the partition whose data you want to monitor. You can use scheduling parameters to specify the partition, as shown in the example after this list. You can click Preview to check whether the calculation result of the partition filter expression that you specified meets your expectations.
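
  The following lines are a hedged example of partition filter expressions. They assume a partition column named ds, which is a hypothetical name that you must replace with the partition column of your own table; the bracketed scheduling parameter is resolved to an actual date when the task runs.

    ds=$[yyyymmdd]

    ds=$[yyyymmdd-1]

  The first expression resolves to the scheduling date in the yyyymmdd format, and the second expression shifts that date back by one day, which is a common choice for tables whose partitions are generated one day behind.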

3. Configure data quality monitoring rules

You can create a monitoring rule or import an existing monitoring rule. By default, the configured rules are enabled.

  • Create a monitoring rule

    Click Create Rule to create a monitoring rule based on a template or a custom SQL statement.

    Method 1: Create a monitoring rule based on a built-in rule template

    DataWorks provides various built-in rule templates that you can use to create a data quality monitoring rule.

    Note

    You can also find the desired template in the built-in rule template list on the left side of the Create Rule panel and click + Use to create a monitoring rule.

    Parameters for configuring a rule based on a built-in rule template:

    • Rule Name: The name of the monitoring rule.

    • Template: The template used to create the monitoring rule. Data Quality provides a large number of built-in table-level and field-level rule templates that are ready for use. For more information, see Built-in monitoring rule templates.

      Note: You can configure field-level monitoring rules of the following types only for numeric fields: average value, sum of values, minimum value, and maximum value.

    • Rule Scope: The application scope of the rule. For a table-level monitoring rule, the application scope is the current table by default. For a field-level monitoring rule, the application scope is a specific field.

    • Comparison Method: The comparison method that is used by the rule to check whether the table data is as expected.

      • Manual Settings: You can configure the comparison method to compare the data output result with the expected result based on your business requirements. Different rule templates support different comparison methods. You can view the comparison methods that are supported by a rule template in the DataWorks console.

        • Numeric results: You can compare a numeric result with a fixed value, which is the expected value. The following comparison methods are supported: greater than, greater than or equal to, equal to, not equal to, less than, and less than or equal to. You can configure the normal data range (normal threshold) and abnormal data range (critical threshold) based on your business requirements.

        • Fluctuation results: You can compare a fluctuation result with a fluctuation range. The following comparison methods are supported: absolute value, rise, and drop. You can configure the normal data range (normal threshold) based on your business requirements. You can also define data output exceptions (warning threshold) and unexpected data outputs (critical threshold) based on the degree of abnormal deviation. A worked example is provided after this list.

      • Intelligent Dynamic Threshold: If you select this option, you do not need to manually configure the fluctuation threshold or expected value. The system automatically determines a reasonable threshold based on intelligent algorithms. If abnormal data is detected, an alert is immediately triggered or the related task is immediately blocked. When the Comparison Method parameter is set to Intelligent Dynamic Threshold, you can configure the Degree of importance parameter.

        Note: Only monitoring rules that you configure based on a custom SQL statement, a custom range, or a dynamic threshold support the intelligent dynamic threshold comparison method.

    • Monitoring Threshold:

      • If you set the Comparison Method parameter to Manual Settings, you can configure the Normal threshold and Red Threshold parameters.

        • Normal threshold: If the data quality check result meets the specified condition, the data output is as expected.

        • Red Threshold: If the data quality check result meets the specified condition, the data output is not as expected.

      • If the rule that you configure is a rule of the fluctuation type, you must also configure the warning threshold.

        • Warning Threshold: If the data quality check result meets the specified condition, the data is abnormal but your business is not affected.

    • Retain problem data: If the monitoring rule is enabled and a data quality check based on the rule fails, the system automatically creates a table to store the problematic data that is identified during the data quality check.

      Important
      • The Retain problem data parameter is available only for MaxCompute tables.
      • The Retain problem data parameter is available only for specific monitoring rules in Data Quality.
      • If you turn off Status for the monitoring rule, problematic data is not stored.

    • Status: Specifies whether to enable the rule in the production environment.

      Important: If you turn off the switch for the rule, the rule cannot be triggered for a test run or by the associated scheduling nodes.

    • Degree of importance: The strength of the rule in your business.

      • Strong rules are important rules. If you set the parameter to Strong rules and the critical threshold is exceeded, the scheduling node that is associated with the rule is blocked by default.

      • Weak rules are regular rules. If you set the parameter to Weak rules and the critical threshold is exceeded, the scheduling node that is associated with the rule is not blocked by default.

    • Description: Additional descriptions of the rule.
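
    The following worked example uses hypothetical numbers to illustrate how fluctuation thresholds are evaluated; the thresholds and row counts are for illustration only. Suppose a row-count fluctuation rule uses the drop comparison method with a warning threshold of 10% and a critical threshold of 50%, yesterday's partition contains 10,000 rows, and today's partition contains 8,000 rows. The drop is calculated as follows:

    (10,000 - 8,000) / 10,000 = 20%

    The 20% drop exceeds the warning threshold but not the critical threshold, so the check is reported as a warning rather than a critical exception.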

    Method 2: Create a monitoring rule based on a custom rule template

    Before you use this method, you must perform the following steps to create a custom rule template: Go to the Data Quality page. In the left-side navigation pane, choose Quality Assets > Rule Template Library. In the Templates section of the page that appears, click the plus icon to create a custom rule template. Then, you can create a monitoring rule based on the rule template. For more information, see Create and manage custom rule templates.

    Note

    You can also find the desired template in the custom rule template list on the left side of the Create Rule panel and click + Use to create a monitoring rule.

    Parameters for configuring a rule based on a custom rule template

    Only the parameters that are unique to rules based on custom rule templates are described in the following list. For information about other parameters, see the parameters for configuring a rule based on a built-in rule template.

    • FLAG parameter: The SET statement that you want to execute before the SQL statement in the rule is executed.

    • SQL: The SQL statement that defines the complete check logic. The returned result must be numeric and consist of one row and one column. A hedged sample is provided after this list.

      In the custom SQL statement, enclose the partition filter expression in brackets []. Example:

      select count(*) from ${tableName} where ds=$[yyyymmdd];

      Note
      • In this statement, the value of the ${tableName} variable is dynamically replaced with the name of the table for which you are configuring monitoring rules.
      • For information about how to configure a partition filter expression, see the Appendix 2: Built-in partition filter expressions section in this topic.
      • If you have created a monitor for the table, the table partition that you specify in the Data Range parameter during the monitor configuration no longer takes effect after you configure this parameter. The rule determines the table partition to be checked based on the WHERE clause in the SQL statement.
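
    The following is a minimal sketch of a template-style check that counts rows whose key field is empty in yesterday's partition. The column name id, the partition column ds, and the SET statement are assumptions for illustration; replace them with values that match your own tables and engine settings.

      -- FLAG parameter: a sample SET statement that is executed before the check SQL (assumption; adjust or omit as needed)
      SET odps.sql.allow.fullscan=true;

      -- Check SQL: returns one row and one column with a numeric value
      select count(*) from ${tableName} where id is null and ds=$[yyyymmdd-1];

    You might then configure the Red Threshold so that any result greater than 0 is treated as an unexpected data output.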

    Method 3: Create a monitoring rule based on a custom SQL statement

    This method allows you to configure custom data quality check logic for tables.


    Parameters for configuring a rule based on a custom SQL statement

    Only the parameters that are unique to rules based on a custom SQL statement are described in the following list. For information about other parameters, see the parameters for configuring a rule based on a built-in rule template.

    • FLAG parameter: The SET statement that you want to execute before the SQL statement in the rule is executed.

    • SQL: The SQL statement that defines the complete check logic. The returned result must be numeric and consist of one row and one column. A hedged sample is provided after this list.

      In the custom SQL statement, enclose the partition filter expression in brackets []. Example:

      select count(*) from <table_name> where ds=$[yyyymmdd];

      Note
      • You must replace <table_name> with the name of the table for which you are configuring monitoring rules. The SQL statement determines the table that is monitored.
      • For information about how to configure a partition filter expression, see the Appendix 2: Built-in partition filter expressions section in this topic.
      • If you have created a monitor for the table, the table partition that you specify in the Data Range parameter during the monitor configuration no longer takes effect after you configure this parameter. The rule determines the table partition to be checked based on the WHERE clause in the SQL statement.
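
    As a hedged sketch, the following custom SQL statement checks a hypothetical table named dwd_orders for rows with a negative order_amount in yesterday's partition; the table, column, and partition names are placeholders that you must replace with your own.

      -- Returns a single numeric value: the number of dirty rows in yesterday's partition
      select count(*) from dwd_orders where order_amount < 0 and ds=$[yyyymmdd-1];

    With such a statement, you could configure the comparison method so that a result greater than 0 exceeds the critical threshold and the data output is treated as unexpected.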

  • Import an existing monitoring rule

    If you have already created monitoring rules for the selected table in Data Quality, you can import the rules to clone them. If you have not created monitoring rules for the table, you can create monitoring rules for the table in Data Quality. For more information, see Configure a monitoring rule for a single table.

    Note

    You can import multiple rules at a time and configure monitoring rules for fields in a table.

    Click Import Rule. In the Batch Import panel, you can specify filter conditions, such as the rule ID or name, rule template, and association range, to search for and select the rules that you want to import. The association range specifies the range of data that you want to monitor, which can be the entire table or specific fields in the table.


4. Configure runtime resources

Select the resources required to run the rules. Specifically, select the data source on which you want to run the related data quality monitoring task. By default, the data source to which the monitored table belongs is selected.

Note

If you select another data source, make sure that the data source can access the table that you want to monitor.

Step 3: Configure a handling policy for the check result

In the Handling Policy section of the configuration tab of the data quality monitoring node, configure a handling policy and a subscription method for the exception that is identified based on the monitoring rule.

Exception categories

The following exception categories can be identified. Each category combines the rule strength with the check result:

  • Strong rule - Check failed

  • Strong rule - Critical threshold exceeded

  • Strong rule - Warning threshold exceeded

  • Weak rule - Check failed

  • Weak rule - Critical threshold exceeded

  • Weak rule - Warning threshold exceeded

The terms that are used in the category names are described as follows:

  • Rule strength: The importance of the rule. A rule is either a strong rule or a weak rule, as specified by the Degree of importance parameter.

  • Critical threshold exceeded: The value of the metric for the data quality check hits the critical threshold. In most cases, if the monitored data hits the critical threshold, the quality check result does not meet the expectation, which severely affects subsequent business operations.

  • Warning threshold exceeded: The value of the metric for the data quality check hits the warning threshold. In most cases, if the monitored data hits the warning threshold, exceptions are identified in the data but subsequent business operations are not affected.

  • Check failed: The monitoring rule fails to run. For example, the monitored partition is not generated, or the SQL statement used to monitor data fails to execute.

Handling policies for exceptions

You can configure a policy to handle the exceptions that are identified based on the monitoring rules.

  • Do not ignore: Stop the current node and set the node status to Failed when a specific exception is identified on the node. For example, you can use this policy to handle the exception that the critical threshold of a strong monitoring rule is exceeded.

    Note
    • If the current node fails to run, the nodes that depend on the current node do not run. This blocks the production link and prevents the spread of dirty data.

    • You can add multiple exception categories for detection.

    • You can use this policy when an exception has a large impact and you want to block the running of descendant nodes.

  • Ignore: Ignore the exception and continue to run the descendant nodes.

Subscription method for exceptions

You can specify a method to receive information about exceptions, such as by email. When an exception is identified, DataWorks pushes information about the exception by using the specified method. This way, the related personnel can handle the exception at the earliest opportunity.

Note

DataWorks supports multiple methods to receive information about exceptions. You can view the methods in the DataWorks console. Take note of the following items:

  • If you use the email, email and text message, or phone call method, you can specify only users within the Alibaba Cloud account to which the current account belongs as recipients. Make sure that the email address or mobile phone number of the related user is correctly configured. For more information, see Configure and view alert contacts.

  • If you use other methods, specify the webhook URL used to receive the exception information. For information about how to obtain a webhook URL, see Obtain a webhook URL.

Step 4: Configure scheduling properties for the node

If you want to periodically run a task on the created node, click Properties in the right-side navigation pane of the configuration tab of the node and configure the scheduling properties for the node based on your business requirements. For more information, see Overview.

Note

Before you commit the node, you must configure the Rerun and Parent Nodes parameters on the Properties tab.

Step 5: Debug the node

Perform the following debugging operations to check whether the node is configured as expected:

  1. Optional. Select a resource group and assign custom parameters to variables.

    • In the top toolbar of the configuration tab of the node, click the Advanced Run icon. In the Parameters dialog box, select a resource group for scheduling that you want to use to debug and run the node code.

    • If you configure scheduling parameters for the node, you can assign values to the variables for debugging. For information about the value assignment logic of scheduling parameters, see Debugging procedure.

  2. Save and run the node.

    In the top toolbar, click the Save icon to save the node, and then click the Run icon to run the node.

    After node running is complete, you can view the running result in the lower part of the configuration tab of the node. If the node fails to run, troubleshoot the issue based on the reported error.

  3. Optional. Perform smoke testing.

    You can perform smoke testing on the node in the development environment when you commit the node or after you commit the node. For more information, see Perform smoke testing.

Step 6: Commit the node

After you configure the node, commit and deploy it. After the node is committed and deployed, the system periodically runs the related task based on the scheduling configurations.

Note

When you commit and deploy the node, the monitoring rules configured in the node are also committed and deployed.

  1. In the top toolbar, click the Save icon to save the node.

  2. In the top toolbar, click the Submit icon to commit the node.

    In the Submit dialog box, configure the Change description parameter. Then, determine whether to review the node code after you commit the node based on your business requirements.

    Note
    • Before you commit the node, you must configure the Rerun and Parent Nodes parameters on the Properties tab.

    • You can use the code review feature to ensure the quality of the node code and prevent errors caused by incorrect configurations that are directly deployed. If you use the code review feature, the committed node can be deployed only after the node code passes review. For more information, see Code review.

If you use a workspace in standard mode, you must deploy the node to the production environment after you commit the node. To deploy a node, click Deploy in the upper-right corner of the configuration tab of the node. For more information, see Deploy nodes.

What to do next

  • Node O&M: After you commit and deploy the node, the node is periodically run based on the configurations. To view the scheduling status of the node, such as the node running status and the details of triggered monitoring rules, you can click O&M in the upper-right corner of the configuration tab of the node to go to Operation Center. For more information, see View and manage auto triggered nodes.

  • Data Quality: After a data quality monitoring rule is published, you can go to the Data Quality page to view the details of the rule. However, you cannot perform management operations on the rule, such as modifying or deleting the rule. For more information, see Overview.