DataWorks: Data Quality

Last Updated: Dec 05, 2024

DataWorks Data Quality helps you ensure data quality by detecting changes in source data, tracking dirty data that is generated during the extract, transform, and load (ETL) process, and automatically blocking nodes that involve dirty data to stop the dirty data from spreading to descendant nodes. This way, you can prevent nodes from producing unexpected dirty data that affects node execution and business decision-making. This also helps minimize the waste of time, money, and resources, and ensures that your business stays on the right track.

Billing

Data Quality checks data quality by using monitoring rules. The fees generated for Data Quality checks consist of the following two parts:

  • Fees included in your DataWorks bills

    You are charged by DataWorks based on the number of Data Quality checks. For more information, see Billing of Data Quality.

  • Fees not included in your DataWorks bills

    You are charged by the compute engines that are associated with your DataWorks workspace. When monitoring rules are triggered, SQL statements are generated and executed on the related compute engines, and you are charged for the computing resources that the compute engines provide. For more information, see the billing topic of each type of compute engine. For example, if you associate a pay-as-you-go MaxCompute project with your DataWorks workspace, you are charged when the generated SQL statements are executed, and the fees are included in your MaxCompute bills instead of your DataWorks bills.
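As an illustration of how a monitoring rule translates into a billable statement, a row-count check on one daily partition might generate SQL similar to the string built below. This is a hypothetical sketch: the table name, partition column, and exact SQL that Data Quality generates vary by rule type and compute engine.

```python
# Hypothetical sketch: a row-count quality check expressed as SQL
# against a single daily partition. Data Quality generates the real
# statement internally; the names here are illustrative only.
def row_count_check_sql(table: str, partition_expr: str) -> str:
    """Build a row-count check statement for one partition."""
    return f"SELECT COUNT(*) AS row_cnt FROM {table} WHERE {partition_expr};"

sql = row_count_check_sql("ods_orders", "ds = '20241204'")
print(sql)
# SELECT COUNT(*) AS row_cnt FROM ods_orders WHERE ds = '20241204';
```

Because this statement runs on the associated compute engine, its cost appears on that engine's bill, not on your DataWorks bill.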

Features

Data Quality can check the data quality of common big data storage systems, such as MaxCompute, E-MapReduce (EMR), Hologres, AnalyticDB for PostgreSQL, AnalyticDB for MySQL, and Cloudera's Distribution Including Apache Hadoop (CDH). Data Quality allows you to configure monitoring rules that cover multiple dimensions of data, such as integrity, accuracy, validity, consistency, uniqueness, and timeliness.

You can configure a monitoring rule for a specific table and associate the rule with the scheduling node that generates the table data. After the node is run, the monitoring rule is triggered to check the data generated by the node and report anomalies at the earliest opportunity. You can also configure a monitoring rule as a strong rule or a weak rule to determine whether the associated node is terminated when Data Quality detects anomalies. This way, you can prevent dirty data from spreading downstream and minimize the time and money spent on data restoration.

The following list describes the pages where you can use the features of Data Quality.

  • Dashboard

    The Dashboard page displays an overview of data quality in the current workspace, including the main data quality metrics, the trend and distribution of rule-based check instances, the top N tables that have the most data quality issues, the owners of the issues, and the coverage status of monitoring rules. This helps data quality owners understand the overall data quality of the current workspace and handle data quality issues at the earliest opportunity.

  • Quality Assets

    • Rules: On this page, you can view all configured monitoring rules.

    • Rule Template Library: On this page, you can manage a set of custom rule templates and use the templates to improve the efficiency of rule configuration.

  • Configure Rules

    • Configure by Table and Configure by Template: Data Quality allows you to configure a monitoring rule for a single table or for multiple tables based on a rule template.

  • Quality O&M

    • Monitor: On this page, you can view all monitors created in the current workspace.

    • Running Records: On this page, you can view the check results of monitors. After a monitor is run, you can view its details on this page.

  • Quality Analysis

    • Quality Reports: Data Quality allows you to create report templates and add metrics for rule configuration and running based on your business requirements. Reports are generated and sent on a regular basis based on the configured statistical period, sending time, and subscription details.

Precautions

  • The following list describes the data source types and the regions in which they are supported.

    • E-MapReduce: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Germany (Frankfurt), and US (Silicon Valley)

    • Hologres: China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), China (Hong Kong), Japan (Tokyo), Singapore, Malaysia (Kuala Lumpur), Indonesia (Jakarta), Germany (Frankfurt), US (Silicon Valley), and US (Virginia)

    • AnalyticDB for PostgreSQL: China (Hangzhou), China (Shanghai), China (Beijing), China (Shenzhen), and Japan (Tokyo)

    • AnalyticDB for MySQL: China (Shenzhen), Singapore, and US (Silicon Valley)

    • CDH: China (Shanghai), China (Beijing), China (Zhangjiakou), China (Hong Kong), and Germany (Frankfurt)

  • Before you configure monitoring rules to monitor the data quality of EMR, Hologres, AnalyticDB for PostgreSQL, AnalyticDB for MySQL, and CDH data sources, you must collect the metadata of the data sources. For more information, see Collect metadata from an EMR data source.

  • After you configure monitoring rules for EMR, Hologres, AnalyticDB for PostgreSQL, AnalyticDB for MySQL, and CDH tables, the rules can be triggered only if the scheduling node that generates the table data runs the monitoring rules by using a resource group that is connected to the related data source.

  • You can configure multiple monitoring rules for a table.

Scenarios

To allow Data Quality to check offline data, you must configure a monitoring rule for a table by performing the following steps: Configure a partition filter expression for the table, create and configure a monitoring rule that uses the partition filter expression, and associate the monitoring rule with the scheduling node that generates the table data. After the node is run, the monitoring rule is triggered to check the data identified by the partition filter expression. Note that dry-run nodes do not trigger monitoring rules. To determine whether the node is terminated when Data Quality detects anomalies, you can configure the monitoring rule as a strong rule or a weak rule. This way, you can prevent dirty data from spreading downstream. On the rule configuration page of the table, you can also specify notification methods to receive alert notifications at the earliest opportunity.
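To make the role of the partition filter expression concrete, the sketch below shows how an expression such as `dt=$[yyyymmdd-1]` could resolve to a concrete partition at trigger time. DataWorks performs this substitution internally; the parsing logic and the supported expression syntax shown here are simplifying assumptions for illustration.

```python
from datetime import date, timedelta
import re

# Hypothetical sketch of resolving a partition filter expression such as
# "dt=$[yyyymmdd-1]" against the business date when a rule is triggered.
# DataWorks does this internally; this parser is illustrative only.
def resolve_partition(expr: str, run_date: date) -> str:
    def repl(m: re.Match) -> str:
        # An omitted offset (e.g. "$[yyyymmdd]") means the run date itself.
        offset = int(m.group(1) or 0)
        return (run_date + timedelta(days=offset)).strftime("%Y%m%d")
    return re.sub(r"\$\[yyyymmdd(-?\d+)?\]", repl, expr)

print(resolve_partition("dt=$[yyyymmdd-1]", date(2024, 12, 5)))
# dt=20241204
```

The resolved string identifies exactly which partition the triggered rule checks after the scheduling node finishes.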

Configure a monitoring rule

  • Create a monitoring rule: You can create a monitoring rule for a specific table. You can also create a monitoring rule for multiple tables based on a built-in template at the same time. For more information, see Configure monitoring rules for a single table and Configure a monitoring rule for multiple tables based on a template.

  • Subscribe to a monitoring rule: After a monitoring rule is created, you can subscribe to the monitoring rule to receive alert notifications of Data Quality checks by using the notification methods that you configure for the monitoring rule. The notification methods include Email, Email and SMS, DingTalk Chatbot, DingTalk Chatbot @ALL, Lark Group Chatbot, Enterprise WeChat Chatbot, and Custom Webhook.

    Note

    The Custom Webhook notification method is supported only in DataWorks Enterprise Edition.
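For the Custom Webhook method, a minimal sketch of what a notification sender could look like follows. The payload fields and the endpoint are illustrative assumptions, not the actual schema that DataWorks sends to your webhook.

```python
import json
import urllib.request

# Hypothetical sketch of a custom webhook alert. Field names are
# assumptions for illustration; the real DataWorks payload differs.
def build_alert_payload(table: str, rule: str, status: str) -> dict:
    """Assemble the alert body for one check result."""
    return {"table": table, "rule": rule, "status": status}

def send_alert(webhook_url: str, payload: dict) -> None:
    """POST the alert as JSON to the configured webhook endpoint."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # add timeouts and retries in practice
```

A receiving service would parse the JSON body and route the alert to the rule subscribers.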

Trigger the monitoring rule

After the code of the scheduling node that generates the table data is run in Operation Center, the monitor and its associated rules are triggered to check the data quality. An SQL statement is generated and executed on the related compute engine. DataWorks determines whether to terminate the node based on the strength of the rules configured for the table that the node generates and on the check results of the rules. Terminating the node blocks its descendant nodes from running and prevents dirty data from spreading downstream.
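The strong-rule and weak-rule semantics described above can be sketched as a small decision function. This is an illustrative model of the behavior, not DataWorks code; the names are assumptions.

```python
# Hypothetical sketch of strong-rule vs. weak-rule handling: a failed
# strong rule terminates the node and blocks descendants, while a
# failed weak rule only raises an alert. Names are illustrative.
def decide_action(rule_strength: str, check_passed: bool) -> str:
    if check_passed:
        return "continue"   # data meets expectations; downstream nodes run
    if rule_strength == "strong":
        return "block"      # terminate the node and block descendant nodes
    return "alert"          # weak rule: notify subscribers but do not block

print(decide_action("strong", False))  # block
print(decide_action("weak", False))    # alert
```

This is why only strong rules stop dirty data from propagating: a weak rule surfaces the anomaly but lets descendants run.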

View the check result

You can view the check result on the node execution details page in Operation Center or on the Running Records page in Data Quality.

  • View the check result in Operation Center

    1. View the value of the Instance Status parameter. An instance generated for a node can fail a quality check even if the node code is successfully executed, because the data generated by the node may not meet expectations. If the instance fails the check of a strong monitoring rule, the instance fails and its descendant instances are blocked.

    2. Click DQC Log in the lower part of the Runtime Log tab to view the data quality check result. For more information, see View auto triggered instances.

  • View the check result on the Running Records page in Data Quality

    On the Running Records page, search for the check details of monitors by table name or node name. For more information, see View the details of a monitor.