
DataWorks:Configure rules to monitor data quality

Last Updated: Nov 18, 2024

This topic describes how to use Data Quality to monitor the data quality of table data.

Prerequisites

Data is synchronized and processed.

  • The basic user information in the ApsaraDB RDS for MySQL table ods_user_info_d is synchronized to the MaxCompute table ods_user_info_d by using Data Integration.

  • The website access logs of users in user_log.txt in Object Storage Service (OSS) are synchronized to the MaxCompute table ods_raw_log_d by using Data Integration.

  • The collected data is processed into basic user profile data in DataStudio.

Background information

Data Quality is an end-to-end platform that allows you to check the data quality of heterogeneous data sources, configure alert notifications, and manage data sources. Data Quality monitors data in datasets. You can use Data Quality to monitor MaxCompute tables. When offline MaxCompute data changes, Data Quality checks the data and blocks nodes that use the data. This prevents downstream data from being affected by dirty data. In addition, Data Quality allows you to manage the check result history so that you can analyze and evaluate the data quality.

In this example, Data Quality is used to detect changes to source data in the user profile analysis case and dirty data that is generated when the extract, transform, and load (ETL) operations are performed on the source data at the earliest opportunity. The following list describes the monitoring requirements for the analysis and processing procedure of user profile data.

  • ods_raw_log_d: Configure a rule that monitors whether the number of rows synchronized to the raw log data table is 0 on a daily basis. This rule helps prevent invalid data processing.

  • ods_user_info_d: Configure a strong rule that monitors whether the number of rows synchronized to the user information table is 0 on a daily basis, and a weak rule that monitors whether the business primary key in the table is unique on a daily basis. These rules help prevent invalid data processing.

  • dwd_log_info_di: No separate rule is required.

  • dws_user_info_all_di: No separate rule is required.

  • ads_user_info_1d: Configure a rule that monitors the fluctuation of the number of rows in the user information table on a daily basis. The rule is used to observe the fluctuation of daily unique visitors (UVs) and helps you learn the application status at the earliest opportunity.

Go to the rule configuration page

  1. Go to the Data Quality page.

    Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and Governance > Data Quality. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Quality.

  2. Go to the rule configuration page.

    In the left-side navigation pane of the Data Quality page, choose Configure Rules > Configure by Table. On the Configure by Table page, find the desired table based on the following parameters:

    • Connection: Set it to MaxCompute.

    • Database: Set it to workshop2024_01.

    • On the right side of the Configure by Table page, specify filter conditions to find the ods_raw_log_d, ods_user_info_d, and ads_user_info_1d tables in turn.

  3. Find the desired table in the search results and click Configure Monitoring Rule in the Actions column. The Table Quality Details page of the table appears. The following section describes how to configure monitoring rules for each table.

Configure monitoring rules

Configure monitoring rules for the ods_raw_log_d table

The ods_raw_log_d table is used to store website access logs of users synchronized from OSS. Based on the business properties of the table, you can configure a monitoring rule that monitors whether the number of rows in the table is 0. Then, you can associate the monitoring rule with a monitor to trigger quality checks for the table.

1. Configure a monitor

You can use a monitor to check whether the quality of data in the specified range (partition) of a table meets your expectations.

In this step, you must set the Data Range parameter of the monitor to dt=$[yyyymmdd-1]. When the monitor is run, the monitor searches for the data partitions that match the parameter value and checks whether the quality of the data meets your expectations.

In this case, each time the scheduling node that is used to write data to the ods_raw_log_d table is run, the monitor is triggered and the rules that are associated with the monitor are used to check whether the quality of data in the specified range meets your expectations.
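
The following Python sketch illustrates how the dt=$[yyyymmdd-1] expression resolves to a concrete partition: it takes the scheduling time and returns the partition for the previous day. The resolve_data_range function is a hypothetical helper for illustration only and is not part of DataWorks.

    from datetime import datetime, timedelta

    def resolve_data_range(scheduling_time):
        # Hypothetical helper: mimic how dt=$[yyyymmdd-1] resolves to the
        # partition whose data timestamp is the day before the scheduling time.
        partition_day = scheduling_time - timedelta(days=1)
        return "dt=" + partition_day.strftime("%Y%m%d")

    # If the monitor is triggered on June 21, 2024, the checked partition is dt=20240620.
    print(resolve_data_range(datetime(2024, 6, 21)))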

You need to perform the following steps:

  1. On the Monitor tab, click Create Monitor.

  2. Configure the parameters of the monitor.

    image

    The following list describes the key parameters.

    • Data Range: Set this parameter to dt=$[yyyymmdd-1].

    • Trigger Method: Set this parameter to Triggered by Node Scheduling in Production Environment and select the ods_raw_log_d node that is created during data synchronization.

    • Monitoring Rule: You do not need to configure this parameter. The monitoring rules are configured in the Configure monitoring rules section.

    Note

    For more information about how to configure a monitor, see Configure a monitoring rule for a single table.

In this example, the monitor checks whether the table data that is generated by the scheduling node each day meets your expectations. The node always generates data whose data timestamp is the previous day. Therefore, in the Add Partition dialog box, if the value of the Scheduling Time parameter is the current day and the value of the Result parameter is the previous day, the data range is resolved as expected.

2. Configure monitoring rules

The ods_raw_log_d table is used to store website access logs of users synchronized from OSS. The table is used as a source table in a user profile analysis scenario. To prevent invalid data processing and data quality issues, you need to create and configure a strong rule that monitors whether the number of rows in the table is greater than 0. This rule helps you determine whether synchronization tasks wrote data to the related partitions in the table.

If the number of rows in the related partitions of the ods_raw_log_d table is 0, an alert is triggered, the ods_raw_log_d node fails and exits, and the descendant nodes of the ods_raw_log_d node are blocked from running.
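
Conceptually, the check that this rule performs is equivalent to counting the rows in the monitored partition and treating a count of 0 as a critical exception. The following Python sketch only illustrates that logic; the run_maxcompute_query helper is a hypothetical function that executes SQL and returns a single value, and is not part of Data Quality.

    def check_table_not_empty(run_maxcompute_query, table, partition):
        # Illustrative sketch of the "Table is not empty" strong rule:
        # count the rows in the monitored partition and fail if the count is 0.
        sql = "SELECT COUNT(*) FROM {} WHERE {};".format(table, partition)
        row_count = run_maxcompute_query(sql)
        if row_count == 0:
            # A strong rule blocks descendant nodes: the node fails and exits.
            raise RuntimeError("{} ({}) is empty".format(table, partition))

    # Example: check yesterday's partition of ods_raw_log_d.
    # check_table_not_empty(run_maxcompute_query, "ods_raw_log_d", "dt=20240620")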

You need to perform the following steps:

  1. In the Monitor Perspective section of the Rule Management tab, select a monitor. In this example, the raw_log_number_of_table_rows_not_0 monitor is selected. Then, click Create Rule on the right side of the tab. The Create Rule panel appears.

    image

  2. On the System Template tab of the Create Rule panel, find the Table is not empty rule and click Use. On the right side of the panel, set the Degree of Importance parameter to Strong Rule.

    Note

    In this example, the rule is defined as a strong rule. This indicates that when the number of rows in the ods_raw_log_d table is found to be 0, an alert is triggered and the descendant nodes are blocked from running.

    image

  3. Click OK.

    Note

    For information about other parameters configured for a monitoring rule, see Configure a monitoring rule for a single table.

3. Perform a test run on the monitor

You can perform a test run to verify whether the configurations of the monitoring rules that are associated with the monitor work as expected. To ensure that the configurations of the rules are correct and meet your expectations, perform a test run on the monitor after you create the rules that are associated with the monitor.

image

  1. Click Test Run. The Test Run dialog box appears.

  2. In the Test Run dialog box, configure the Scheduling Time parameter and click Test Run.

  3. After the test run is complete, click View Details to view the test result.

    image

4. Subscribe to the monitor

Data Quality provides the monitoring and alerting feature. You can subscribe to monitors to receive alert notifications about data quality issues. This way, you can resolve the data quality issues at the earliest opportunity and ensure data security, data stability, and the timeliness of data generation.

image

After the subscription configuration is complete, choose Quality O&M > Monitor in the left-side navigation pane. Then, click My Subscriptions on the Monitor page to view and modify the subscribed monitors.

Configure monitoring rules for the ods_user_info_d table

The ods_user_info_d table is used to store basic user information synchronized from ApsaraDB RDS for MySQL. Based on the business properties of the table, you can configure a rule that monitors whether the number of rows in the table is 0 and a rule that monitors whether the primary key values are unique. Then, you can associate the rules with a monitor to trigger quality checks for the table.

1. Configure a monitor

You can use a monitor to check whether the quality of data in the specified range (partition) of a table meets your expectations.

In this step, you must set the Data Range parameter of the monitor to dt=$[yyyymmdd-1]. When the monitor is run, the monitor searches for the data partitions that match the parameter value and checks whether the quality of the data meets your expectations.

In this case, each time the scheduling node that is used to write data to the table ods_user_info_d is run, the monitor is triggered and the rules that are associated with the monitor are used to check whether the quality of data in the specified range meets your expectations.

You need to perform the following steps:

  1. On the Monitor tab, click Create Monitor.

  2. Configure the parameters of the monitor.

    image

    The following list describes the key parameters.

    • Data Range: Set this parameter to dt=$[yyyymmdd-1].

    • Trigger Method: Set this parameter to Triggered by Node Scheduling in Production Environment and select the ods_user_info_d node that is created during data synchronization.

    • Monitoring Rule: You do not need to configure this parameter. The monitoring rules are configured in the Configure monitoring rules section.

    Note

    For more information about how to configure a monitor, see Configure a monitoring rule for a single table.

2. Configure monitoring rules

The ods_user_info_d table is used to store basic user information synchronized from ApsaraDB RDS for MySQL. The table is used as a source table in a user profile analysis scenario. To prevent invalid data processing and data quality issues, you need to create and configure a strong rule that monitors whether the number of rows in the table is greater than 0. This rule helps you determine whether synchronization tasks wrote data to the related partitions in the table.

After the monitoring rules take effect, if the number of rows in the related partitions of the ods_user_info_d table is 0, an alert is triggered, the ods_user_info_d node fails and exits, and the descendant nodes of the ods_user_info_d node are blocked from running.

You need to perform the following steps:

  1. In the Monitor Perspective section of the Rule Management tab, select a monitor. In this example, the user_info_quality_control monitor is selected. Then, click Create Rule on the right side of the tab. The Create Rule panel appears.

    image

  2. On the System Template tab of the Create Rule panel, find the Table is not empty rule and click Use. On the right side of the panel, set the Degree of Importance parameter to Strong Rule.

    Note

    In this example, the rule is defined as a strong rule. This indicates that when the number of rows in the ods_user_info_d table is found to be 0, an alert is triggered and the descendant nodes are blocked from running.

    image

  3. On the System Template tab of the Create Rule panel, find the Unique value, fixed value rule and click Use. On the right side of the panel, configure the Rule Effective Scope, Monitoring Threshold, and Degree of Importance parameters as shown in the following figure. A sketch of the logic behind this uniqueness check is provided after these steps.

    image

  4. Click OK.

    Note

    For information about other parameters configured for a monitoring rule, see Configure a monitoring rule for a single table.
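
The uniqueness rule in step 3 checks whether the business primary key is duplicated in the monitored partition. The following Python sketch only illustrates the idea behind that check; the key column name uid and the run_maxcompute_query helper are assumptions for illustration and are not part of Data Quality.

    def check_primary_key_unique(run_maxcompute_query, table, partition, key_column="uid"):
        # Illustrative sketch of a primary key uniqueness rule: compare the total
        # row count with the number of distinct key values in the monitored partition.
        sql = ("SELECT COUNT(*) - COUNT(DISTINCT {}) FROM {} WHERE {};"
               .format(key_column, table, partition))
        duplicate_rows = run_maxcompute_query(sql)
        if duplicate_rows > 0:
            # A weak rule reports an alert but does not block descendant nodes.
            print("Alert: duplicate {} values found in {} ({})".format(key_column, table, partition))

    # Example: check yesterday's partition of ods_user_info_d.
    # check_primary_key_unique(run_maxcompute_query, "ods_user_info_d", "dt=20240620")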

3. Other configurations

The operations to perform a test run on the monitor and subscribe to the monitor are the same as the operations that are described in the Configure monitoring rules for the ods_raw_log_d table section.

Configure monitoring rules for the ads_user_info_1d table

The ads_user_info_1d table is the final result table. Based on the business properties of the table, you can configure a rule that monitors the fluctuation of the number of rows in the table and a rule that monitors whether the primary key values are unique. This way, you can observe the fluctuation of daily UVs and learn about online traffic fluctuations at the earliest opportunity. Then, you can associate the rules with a monitor to trigger quality checks for the table.

1. Configure a monitor

You can use a monitor to check whether the quality of data in the specified range (partition) of a table meets your expectations.

In this step, you must set the Data Range parameter of the monitor to dt=$[yyyymmdd-1]. When the monitor is run, the monitor searches for the data partitions that match the parameter value and checks whether the quality of the data meets your expectations.

In this case, each time the scheduling node that is used to write data to the table ads_user_info_1d is run, the monitor is triggered and the rules that are associated with the monitor are used to check whether the quality of data in the specified range meets your expectations.

You need to perform the following steps:

  1. On the Monitor tab, click Create Monitor.

  2. Configure the parameters of the monitor.

    image

    The following list describes the key parameters.

    • Data Range: Set this parameter to dt=$[yyyymmdd-1].

    • Trigger Method: Set this parameter to Triggered by Node Scheduling in Production Environment and select the ads_user_info_1d node that is created during data processing.

    • Monitoring Rule: You do not need to configure this parameter. The monitoring rules are configured in the Configure monitoring rules section.

    Note

    For more information about how to configure a monitor, see Configure a monitoring rule for a single table.

2. Configure monitoring rules

The ads_user_info_1d table is used for user profile analysis. To detect the fluctuation of daily UVs, you need to create and configure a rule that monitors the fluctuation of the number of rows in the aggregate data in the table and a rule that monitors whether the primary key values are unique for the table. This helps you observe the fluctuation of daily UVs and learn the online traffic fluctuation at the earliest opportunity.

After the monitoring rules take effect, if repeated primary keys exist in the ads_user_info_1d table, an alert is triggered. If the fluctuation rate of the number of rows in the ads_user_info_1d table within seven days is greater than 10% and less than 50%, a warning alert is triggered. If the fluctuation rate of the number of rows in the ads_user_info_1d table within seven days is greater than or equal to 50%, a critical alert is triggered.
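
The following Python sketch restates the volatility thresholds described above: the latest row count is compared with the average of the previous seven days, and the fluctuation rate is mapped to a warning or critical alert. The function and the sample numbers are illustrative only and are not the Data Quality implementation.

    def classify_7day_volatility(latest_count, previous_counts):
        # Illustrative sketch: fluctuation rate of the latest row count against the
        # average of the previous seven days, mapped to the alert levels above.
        baseline = sum(previous_counts) / len(previous_counts)
        fluctuation = abs(latest_count - baseline) / baseline
        if fluctuation >= 0.5:
            return "critical"   # critical alert; a strong rule would also block descendants
        if fluctuation > 0.1:
            return "warning"    # warning alert only
        return "normal"

    # Example: daily UVs of about 1,000 with a spike to 1,800 (80% fluctuation).
    print(classify_7day_volatility(1800, [1000, 980, 1020, 1010, 990, 1005, 995]))  # critical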

Note

A handling policy is configured in the monitor.

  • If a rule is defined as a strong rule and the critical threshold is exceeded, the handling policy is block. This indicates that if a data quality issue is detected in the table, the scheduling node in the production environment that is used to write data to the table is identified, and the system sets the running status of the node to Failed. In this case, the descendant nodes of the node cannot be run, which blocks the production link and prevents the spread of dirty data.

  • If other exceptions are detected, the handling policy is alert. This indicates that if a data quality issue is detected in the table, the system sends alert notifications to the alert recipient by using the notification method configured in the monitor.

Take note of the following items when you configure monitoring rules:

  • If a rule is defined as a strong rule and the critical threshold is exceeded, critical alerts are reported and descendant nodes are blocked. If other exceptions occur, alerts are reported but descendant nodes are not blocked.

  • If a rule is defined as a weak rule and the critical threshold is exceeded, critical alerts are reported but descendant nodes are not blocked. If other exceptions occur, alerts are reported but descendant nodes are not blocked.
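
The two items above can be summarized as a small decision rule: only a strong rule whose critical threshold is exceeded blocks descendant nodes; every other exception only triggers an alert. The following Python sketch restates that logic; the function and its argument names are illustrative only.

    def handling_policy(rule_strength, critical_threshold_exceeded):
        # Illustrative restatement of the handling policy described above.
        if rule_strength == "strong" and critical_threshold_exceeded:
            return "block"  # the node is set to Failed and descendant nodes cannot run
        return "alert"      # alert notifications are sent; descendant nodes keep running

    print(handling_policy("strong", True))   # block
    print(handling_policy("weak", True))     # alert
    print(handling_policy("strong", False))  # alert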

You need to perform the following steps:

  1. In the Monitor Perspective section of the Rule Management tab, select a monitor. In this example, the ads_user_info_quality_control monitor is selected. Then, click Create Rule on the right side of the tab. The Create Rule panel appears.

    image

  2. On the System Template tab of the Create Rule panel, find the Number of rows, 7-day volatility rule and click Use. On the right side of the panel, configure the Monitoring Scope and Degree of Importance parameters as shown in the following figure.

    image

  3. On the System Template tab of the Create Rule panel, find the Table is not empty rule and click Use. On the right side of the panel, set the Degree of Importance parameter to Strong Rule.

    image

  4. Click OK.

    Note

    For information about other parameters configured for a monitoring rule, see Configure a monitoring rule for a single table.

3. Other configurations

The operations to perform a test run on the monitor and subscribe to the monitor are the same as the operations that are described in the Configure monitoring rules for the ods_raw_log_d table section.

What to do next

After the data is processed, you can use DataAnalysis to visualize the data. For more information, see Visualize data on a dashboard.