Monitoring rules created based on built-in templates of Data Quality in DataWorks

Data Quality provides various built-in rule templates. This topic describes the check logic of Data Quality and the built-in rule templates.

Description of calculation formula

You can calculate the fluctuation by using the following formula: Fluctuation = (Sample value - Baseline)/Baseline.

Sample value
The sample value for the current day. For example, if you want to check the fluctuation in the number of table rows on an SQL node within a day, the sample value is the number of table rows in partitions on that day.
Baseline
The comparison value collected from the previous N days. Examples:
- If you want to check the fluctuation in the number of table rows on an SQL node based on the statistics seven days ago, the baseline is the number of table rows in partitions seven days before the current day. In other words, the fluctuation is calculated by comparing the sample value collected on the current day with that collected seven days before the current day.
- If you want to check the fluctuation in the number of table rows on an SQL node in the last seven days, the baseline is the average number of table rows in the last seven days. In other words, the baseline is calculated by dividing the total number of table rows in the last seven days by seven.
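The formula and the two kinds of baseline can be sketched in Python as follows. This is a minimal illustration, not the Data Quality implementation: the `fluctuation` function, the `row_counts` list, and its numbers are hypothetical. The sketch also assumes that "the last seven days" means the seven days before the current day, and that a zero baseline is rejected, which the description above does not specify.

```python
from statistics import mean


def fluctuation(sample_value: float, baseline: float) -> float:
    """Fluctuation = (Sample value - Baseline) / Baseline."""
    if baseline == 0:
        # Assumption: handling of a zero baseline is not described in this topic.
        raise ValueError("baseline must be non-zero")
    return (sample_value - baseline) / baseline


# Hypothetical daily row counts, oldest first; the last entry is the current day.
row_counts = [1000, 1020, 980, 1010, 990, 1005, 995, 1100]
today = row_counts[-1]

# Baseline taken from seven days before the current day.
seven_days_ago = row_counts[-8]
print(fluctuation(today, seven_days_ago))

# Baseline taken as the seven-day average: total of the seven values divided by seven.
seven_day_average = mean(row_counts[-8:-1])
print(fluctuation(today, seven_day_average))
```

With these made-up numbers, both baselines equal 1,000 rows, so both calls report a fluctuation of about 10%.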

Check logic

Data Quality supports three verification methods: comparison with a fixed value, comparison with thresholds, and comparison with a dynamic threshold.

Verification method	Check logic
Comparison with a fixed value	Data Quality evaluates the verification expression and returns a Boolean value. The following comparison operators are supported: `>`, `<`, `>=`, `<=`, and `!=`. If the result is true, the data is considered normal. If the result is false, an error alert is reported.
Comparison with thresholds	You can compare the rise range, the drop range, or the fluctuation range (absolute value) with thresholds. The fluctuation range (absolute value) is used as an example in this topic. If the absolute value of the fluctuation does not exceed the warning threshold, the data is considered normal. If the absolute value of the fluctuation exceeds the warning threshold but does not exceed the error threshold, a warning alert is reported. If the absolute value of the fluctuation exceeds the error threshold, an error alert is reported.
Comparison with a dynamic threshold	You do not need to set thresholds. The system automatically checks the metrics in real time based on algorithm models. If the value of a metric falls outside a reasonable range, an alert is reported.
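The first two verification methods can be sketched in Python as follows. The function names and the returned strings `normal`, `warning`, and `error` are hypothetical and chosen only for illustration. The dynamic threshold method is omitted because its algorithm models are not described in this topic.

```python
import operator

# Comparison operators listed above for comparison with a fixed value.
OPERATORS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "!=": operator.ne,
}


def check_fixed_value(sample_value: float, op: str, expected: float) -> str:
    """Return "normal" if the expression is true; otherwise return "error"."""
    return "normal" if OPERATORS[op](sample_value, expected) else "error"


def check_thresholds(fluctuation: float, warning: float, error: float) -> str:
    """Classify a fluctuation by its absolute value."""
    magnitude = abs(fluctuation)
    if magnitude > error:
        return "error"
    if magnitude > warning:
        return "warning"
    return "normal"


print(check_fixed_value(120, ">", 0))        # normal
print(check_thresholds(0.03, 0.05, 0.10))    # normal
print(check_thresholds(0.08, 0.05, 0.10))    # warning
print(check_thresholds(-0.12, 0.05, 0.10))   # error
```

In this sketch, a fluctuation that is exactly equal to a threshold does not exceed it. A fluctuation of exactly 5% is therefore normal when the warning threshold is 5%.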

Built-in monitoring rule templates

You can use a built-in rule template to quickly create a monitoring rule for a single table or multiple tables. For more information, see Configure a monitoring rule for a single table and Configure a monitoring rule for multiple tables based on a template.

Template category	Template	Description
Table Count	Number of rows. fixed value	Data Quality compares the number of table rows collected on the current day with a fixed value.
	Table is not empty	Checks whether the number of table rows is greater than 0.
	Number of rows. 1 day difference	Data Quality compares the number of table rows collected on the current day with that in partitions generated on the previous day to obtain the fluctuation. Then, Data Quality compares the obtained fluctuation with thresholds. Note: The baseline is the number of table rows in partitions generated on the previous day.
	Number of table rows. upper cycle difference	Data Quality compares the number of table rows collected on the current day with that in partitions generated in the last cycle to obtain the fluctuation. Then, Data Quality compares the obtained fluctuation with thresholds.
	Number of rows. 1. 7. 30 days. 1st of this month. volatility	Data Quality compares the number of table rows collected on the current day with that on the previous day, seven days ago, 30 days ago, and that on the first day of the current month to obtain the fluctuations. Then, Data Quality compares the obtained fluctuations with thresholds. If a fluctuation exceeds a threshold, Data Quality reports an alert.
	Table row number. 1. 7. 30 day volatility	Data Quality compares the number of table rows collected on the current day with that on the previous day, seven days ago, and 30 days ago to obtain the fluctuations. Then, Data Quality compares the obtained fluctuations with thresholds. If a fluctuation exceeds a threshold, Data Quality reports an alert.
	Table row number. 1 day volatility	Data Quality compares the number of table rows collected on the current day with that on the previous day to obtain the fluctuation. Then, Data Quality compares the obtained fluctuation with thresholds. If the fluctuation exceeds a threshold, Data Quality reports an alert.
	Table row number. 30-day volatility	Data Quality compares the number of table rows collected on the current day with that 30 days before the current day to obtain the fluctuation. Then, Data Quality compares the obtained fluctuation with thresholds. If the fluctuation exceeds a threshold, Data Quality reports an alert.
	Number of rows. 7-day volatility	Data Quality compares the number of table rows collected on the current day with that seven days before the current day to obtain the fluctuation. Then, Data Quality compares the obtained fluctuation with thresholds. If the fluctuation exceeds a threshold, Data Quality reports an alert.
	Table Rows	If you set the Comparison Method parameter to Intelligent Dynamic Threshold, you do not need to manually configure the fluctuation thresholds or the expected value. The system determines the thresholds by using intelligent algorithms. If data exceptions are detected, the system triggers alerts or blocks at the earliest opportunity.
	Table row number. 30-day average volatility	Data Quality compares the number of table rows collected on the current day with the average number of table rows in the last 30 days to obtain the fluctuation. The baseline is the average number of table rows in the last 30 days. In other words, the baseline is calculated by dividing the total number of table rows in the last 30 days by 30.
	Table row number. 7-day average volatility	Data Quality compares the number of table rows collected on the current day with the average number of table rows in the last seven days to obtain the fluctuation. The baseline is the average number of table rows in the last seven days. In other words, the baseline is calculated by dividing the total number of table rows in the last seven days by seven.
	Number of table rows. upper cycle volatility	Data Quality compares the number of table rows collected on the current day with that in partitions generated in the last cycle to obtain the fluctuation. Then, Data Quality compares the obtained fluctuation with thresholds.
	Table row count with user defined condition	You can specify the comparison method and the comparison threshold range for the number of table rows based on your business requirements.
Percent on Condition	Row count matched user defined condition	You can specify the comparison method and the comparison threshold range for the matching rate of filter conditions based on your business requirements.
Table Size	Table size. fixed value	Data Quality compares the size of a table in bytes on the current day with a fixed value.
	Table size. upper period difference	Data Quality compares the size of a table in bytes on the current day with that on the previous day to obtain the fluctuation. Then, Data Quality compares the obtained fluctuation with thresholds.
	Table size. upper period difference	Data Quality compares the size of a table in bytes on the current day with that in the last cycle to obtain the fluctuation. Then, Data Quality compares the obtained fluctuation with thresholds.
	Table size. 1 day volatility	Data Quality compares the size of a table on the current day with that on the previous day to obtain the fluctuation. Then, Data Quality compares the obtained fluctuation with thresholds. If the fluctuation exceeds a threshold, Data Quality reports an alert. For example, you can set the warning threshold to 5% and the error threshold to 10%. If the fluctuation is greater than 5% and less than or equal to 10%, a warning alert is reported. If the fluctuation is greater than 10%, an error alert is reported.
	Table size. 30-day volatility (to be determined)	Data Quality compares the size of a table on the current day with that 30 days ago to obtain the fluctuation. Then, Data Quality compares the obtained fluctuation with thresholds. If the fluctuation exceeds a threshold, Data Quality reports an alert.
	Table size. 7-day volatility	Data Quality compares the size of a table on the current day with that seven days ago to obtain the fluctuation. Then, Data Quality compares the obtained fluctuation with thresholds. If the fluctuation exceeds a threshold, Data Quality reports an alert.
	Table Size	If you set the Comparison Method parameter to Intelligent Dynamic Threshold, you do not need to manually configure the fluctuation thresholds or the expected value. The system determines the thresholds by using intelligent algorithms. If data exceptions are detected, the system triggers alerts or blocks at the earliest opportunity.
Null Value Count	Number of null values. fixed value	Data Quality compares the number of null values of a field with a fixed value. Note: The `IS NULL` expression is used in the SQL statements to check whether the value of a field is null.
Null Value Count	No null value on single field	Checks whether the number of null values of a field is 0.
Null Value Count / Table Count	Number of nulls / total number of rows. fixed value	Data Quality compares the ratio of the number of null values of a field to the total number of rows with a fixed value. Note: The fixed value is a decimal.
Duplicated Value Count	Repeated value. fixed value	Data Quality subtracts the number of values of a field after deduplication from the total number of rows to obtain the number of duplicate values of the field. Then, Data Quality compares the number of duplicate values with a fixed value.
Duplicated Value Count	No duplicated value on single field	Checks whether the number of duplicate values of a field is 0.
Distinct Count on Multiple Fields	No duplicated value on multiple fields	Checks whether the number of duplicate values of multiple fields is 0.
Duplicated Value Count / Table Count	Repeated number of values / total number of rows. fixed value	Data Quality compares the ratio of the number of duplicate values of a field to the total number of rows with a fixed value.
Distinct Count	Unique value. fixed value	Data Quality compares the number of unique values of a field after deduplication with a fixed value.
	The number of unique values. 1. 7. 30 volatility	Data Quality compares the number of unique values of a field after deduplication on the current day with that on the previous day, seven days ago, and 30 days ago to obtain the fluctuations. Then, Data Quality compares the obtained fluctuations with thresholds.
	Unique value	If you set the Comparison Method parameter to Intelligent Dynamic Threshold, you do not need to manually configure the fluctuation thresholds or the expected value. The system determines the thresholds by using intelligent algorithms. If data exceptions are detected, the system triggers alerts or blocks at the earliest opportunity.
Distinct Count / Table Count	Unique value/total number of rows. fixed value	Data Quality compares the ratio of the number of unique values of a field to the total number of rows with a fixed value.
Min	Minimum. 1. 7. 30-day volatility	Data Quality compares the minimum value of a field on the current day with that on the previous day, seven days ago, and 30 days ago to obtain the fluctuations. Then, Data Quality compares the obtained fluctuations with thresholds. If a fluctuation exceeds a threshold, Data Quality reports an alert.
	Min Value	If you set the Comparison Method parameter to Intelligent Dynamic Threshold, you do not need to manually configure the fluctuation thresholds or the expected value. The system determines the thresholds by using intelligent algorithms. If data exceptions are detected, the system triggers alerts or blocks at the earliest opportunity.
	Minimum value. 1 day volatility	Data Quality compares the minimum value of a field on the current day with that on the previous day to obtain the fluctuation. Then, Data Quality compares the obtained fluctuation with thresholds.
	Minimum period	Data Quality compares the minimum value of a field on the current day with that in the last cycle to obtain the fluctuation. Then, Data Quality compares the obtained fluctuation with thresholds.
	Minimum value with user defined condition	You can specify the comparison method and the comparison threshold range for the minimum value of a field based on your business requirements.
Max	Maximum. 1. 7. 30-day volatility	Data Quality compares the maximum value of a field on the current day with that on the previous day, seven days ago, and 30 days ago to obtain the fluctuations. Then, Data Quality compares the obtained fluctuations with thresholds. If a fluctuation exceeds a threshold, Data Quality reports an alert.
	Maximum	If you set the Comparison Method parameter to Intelligent Dynamic Threshold, you do not need to manually configure the fluctuation thresholds or the expected value. The system determines the thresholds by using intelligent algorithms. If data exceptions are detected, the system triggers alerts or blocks at the earliest opportunity.
	Maximum. 1 day volatility	Data Quality compares the maximum value of a field on the current day with that on the previous day to obtain the fluctuation. Then, Data Quality compares the obtained fluctuation with thresholds.
	Maximum period	Data Quality compares the maximum value of a field on the current day with that in the last cycle to obtain the fluctuation. Then, Data Quality compares the obtained fluctuation with thresholds.
	Maximum value with user defined condition	You can specify the comparison method and the comparison threshold range for the maximum value of a field based on your business requirements.
Average	Average. 1. 7. 30-day volatility	Data Quality compares the average value of a field calculated on the current day with that on the previous day, seven days ago, and 30 days ago to obtain the fluctuations. Then, Data Quality compares the obtained fluctuations with thresholds. If a fluctuation exceeds a threshold, Data Quality reports an alert.
	Average	If you set the Comparison Method parameter to Intelligent Dynamic Threshold, you do not need to manually configure the fluctuation thresholds or the expected value. The system determines the thresholds by using intelligent algorithms. If data exceptions are detected, the system triggers alerts or blocks at the earliest opportunity.
	Average. 1 day volatility	Data Quality compares the average value of a field on the current day with that on the previous day to obtain the fluctuation. Then, Data Quality compares the obtained fluctuation with thresholds.
	Average value with user defined condition	You can specify the comparison method and the comparison threshold range for the average value of a field based on your business requirements.
Sum	Summary value. 1. 7. 30-day volatility	Data Quality compares the sum of the values of a field on the current day with that on the previous day, seven days ago, and 30 days ago to obtain the fluctuations. Then, Data Quality compares the obtained fluctuations with thresholds. If a fluctuation exceeds a threshold, Data Quality reports an alert.
	Summary value	If you set the Comparison Method parameter to Intelligent Dynamic Threshold, you do not need to manually configure the fluctuation thresholds or the expected value. The system determines the thresholds by using intelligent algorithms. If data exceptions are detected, the system triggers alerts or blocks at the earliest opportunity.
	Summary value. 1 day volatility	Data Quality compares the value sum of a field on the current day with that on the previous day to obtain the fluctuation. Then, Data Quality compares the obtained fluctuation with thresholds.
	Summary value. upper period volatility	Data Quality compares the value sum of a field on the current day with that in the last cycle to obtain the fluctuation. Then, Data Quality compares the obtained fluctuation with thresholds.
	Sum value with user defined condition	You can specify the comparison method and the comparison threshold range for the aggregated value of a field based on your business requirements.
Discrete Values	Discrete value (status value). fixed value	Data Quality compares the number of values in each group of a field with a fixed value.
	Discrete value (number of groups). fixed value	Data Quality compares the number of groups of a field with a fixed value.
	Discrete value (number of groups)	If you set the Comparison Method parameter to Intelligent Dynamic Threshold, you do not need to manually configure the fluctuation thresholds or the expected value. The system determines the thresholds by using intelligent algorithms. If data exceptions are detected, the system triggers alerts or blocks at the earliest opportunity.
	Discrete value (status value)	If you set the Comparison Method parameter to Intelligent Dynamic Threshold, you do not need to manually configure the fluctuation thresholds or the expected value. The system determines the thresholds by using intelligent algorithms. If data exceptions are detected, the system triggers alerts or blocks at the earliest opportunity.
	Discrete value (number of groups). 1 day volatility	Data Quality compares the number of groups of a field on the current day with that on the previous day to obtain the fluctuation. Then, Data Quality compares the obtained fluctuation with thresholds.
	Discrete values (number of groups and status values). 1. 7. 30-day volatility	Data Quality compares the number of groups and the number of values in each group of a field on the current day with those on the previous day, seven days ago, and 30 days ago to obtain the fluctuations. Then, Data Quality compares the obtained fluctuations with thresholds.
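The field-level counts and ratios used by the null value, duplicate value, and distinct count templates can be sketched in Python as follows. The `field_metrics` function and the sample values are hypothetical. The sketch assumes that `None` stands in for an SQL null and that a null counts as one value during deduplication; the actual SQL statements that Data Quality generates may treat nulls differently.

```python
from typing import Dict, Optional, Sequence


def field_metrics(values: Sequence[Optional[str]]) -> Dict[str, Optional[float]]:
    """Compute the counts and ratios that the field-level templates compare."""
    total_rows = len(values)
    null_count = sum(1 for v in values if v is None)
    # Number of values after deduplication (assumption: None counts as one value).
    distinct_count = len(set(values))
    # Duplicates = total number of rows - number of values after deduplication.
    duplicate_count = total_rows - distinct_count

    def ratio(count: int) -> Optional[float]:
        # A ratio is undefined for an empty table, so None is returned.
        return count / total_rows if total_rows else None

    return {
        "total_rows": total_rows,
        "null_count": null_count,
        "distinct_count": distinct_count,
        "duplicate_count": duplicate_count,
        "null_ratio": ratio(null_count),
        "duplicate_ratio": ratio(duplicate_count),
        "distinct_ratio": ratio(distinct_count),
    }


metrics = field_metrics(["a", "b", "b", None, "c", "a"])
print(metrics["null_count"])       # 1
print(metrics["distinct_count"])   # 4
print(metrics["duplicate_count"])  # 2
```

Each resulting value can then be compared with a fixed value. For example, the "No duplicated value on single field" template corresponds to checking that `duplicate_count` is 0.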

Note

You cannot configure table-size-based monitoring rules for E-MapReduce (EMR) tables.

Appendix 1: Description of the last cycle

In some of the preceding built-in templates, the instance of the last cycle is used as the baseline. For a daily- or hourly-scheduled task, the logic of determining the instance of the last cycle is to first exclude all instances with the current data timestamp and then sort the other instances by data timestamp in reverse chronological order. If multiple instances have the same data timestamp, further sort these instances by running time in reverse chronological order. The first instance in the obtained sequence is the instance of the last cycle and is used as the baseline. The following table describes how to determine the baseline.

Scheduling scenario	Data timestamp	Baseline	FAQ
Daily scheduling	Historical data timestamps: 2024-06-01, 2024-06-02, 2024-06-03, 2024-06-04, 2024-06-05	When the instance whose data timestamp is June 6, 2024 starts to be checked based on monitoring rules, the instance whose data timestamp is June 5, 2024 is used as the baseline.	Historical data backfilling scenario. Background: The scheduling node is run as expected from June 1, 2024 to June 5, 2024. After the instance whose data timestamp is June 5, 2024 finishes running, a data backfill operation is performed to backfill the data on July 1, 2024 to the scheduling node. In this case, what is the baseline that can be used for comparison when the instance whose data timestamp is June 6, 2024 starts to be checked based on monitoring rules? Conclusion: The instance whose data timestamp is June 6, 2024 uses the instance whose data timestamp is July 1, 2024 as the baseline. The instance whose data timestamp is July 1, 2024 is used as the baseline before the instance whose data timestamp is July 2, 2024 finishes running.
Hourly scheduling	Historical data timestamps: 2024-06-01, 2024-06-02, 2024-06-03. A scheduling node is scheduled by hour and run three times a day.	When an instance whose data timestamp is June 4, 2024 starts to be checked based on monitoring rules, the last instance whose data timestamp is June 3, 2024 is used as the baseline.	Hourly scheduling scenario. Background: Three instances are generated for each day from June 1, 2024 to June 3, 2024 and run as expected, and the first instance whose data timestamp is June 4, 2024 is also run as expected. In this case, what is the baseline that can be used for comparison when the second instance whose data timestamp is June 4, 2024 starts to be checked based on monitoring rules? Conclusion: The first instance whose data timestamp is June 4, 2024 is excluded. The last instance whose data timestamp is June 3, 2024 is used as the baseline.
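The selection logic for the instance of the last cycle can be sketched in Python as follows. The `Instance` class, the `last_cycle_baseline` function, and all running times are hypothetical and serve only to reproduce the two scenarios described in this appendix. This is not the Data Quality implementation.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import List, Optional


@dataclass(frozen=True)
class Instance:
    data_timestamp: date  # date of the data that the instance processes
    run_time: datetime    # time at which the instance actually ran


def last_cycle_baseline(instances: List[Instance], current: date) -> Optional[Instance]:
    """Pick the instance of the last cycle for the current data timestamp."""
    # Exclude all instances with the current data timestamp.
    candidates = [i for i in instances if i.data_timestamp != current]
    # Sort by data timestamp, then by running time, both newest first.
    candidates.sort(key=lambda i: (i.data_timestamp, i.run_time), reverse=True)
    return candidates[0] if candidates else None


# Daily scheduling with a backfill for July 1, 2024 (running times are made up).
daily = [Instance(date(2024, 6, d), datetime(2024, 6, d + 1, 1)) for d in range(1, 6)]
daily.append(Instance(date(2024, 7, 1), datetime(2024, 6, 6, 0, 30)))
print(last_cycle_baseline(daily, date(2024, 6, 6)).data_timestamp)  # 2024-07-01

# Hourly scheduling: three runs per day, plus the first run for June 4, 2024.
hourly = [
    Instance(date(2024, 6, d), datetime(2024, 6, d + 1, h))
    for d in range(1, 4)
    for h in (0, 8, 16)
]
hourly.append(Instance(date(2024, 6, 4), datetime(2024, 6, 5, 0)))
baseline = last_cycle_baseline(hourly, date(2024, 6, 4))
print(baseline.data_timestamp, baseline.run_time)  # 2024-06-03 2024-06-04 16:00:00
```

Because the sort key puts the data timestamp first, the backfilled July 1, 2024 instance outranks the June 5, 2024 instance, which matches the conclusion of the backfilling scenario.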

Appendix 2: Description of obtaining a sample value from the output data of an hourly-scheduled task on the date that is N days before the current date

When you extract a sample value from the output data of an hourly-scheduled task on the date that is N days before the current date, the instances of the task are sorted by running time (different from scheduling time) on the date that is N days before the current date in reverse chronological order. The output data of the first instance in the obtained sequence is used as the baseline by default and is compared with the output data of an instance generated for the task on the current date to obtain the fluctuation. The following table describes how to obtain the fluctuation.

Scheduling scenario	Data timestamp	Sample value	FAQ
Hourly scheduling	Historical data timestamps: 2024-06-01, 2024-06-02, ..., 2024-06-08. A task is scheduled by hour and run three times a day.	If you want to obtain a seven-day fluctuation, a sample value is extracted from the output data of the last instance whose running time is June 1, 2024 when an instance whose running time is June 8, 2024 starts to be checked based on monitoring rules.	Hourly scheduling scenario. Background: Three instances are generated for each day from June 1, 2024 to June 8, 2024. In this case, what is the sample value that can be used for comparison to obtain a seven-day fluctuation when the second instance whose running time is June 8, 2024 starts to be checked based on monitoring rules? Conclusion: When the second instance whose running time is June 8, 2024 starts to be checked based on monitoring rules, the output data of the last instance whose running time is June 1, 2024 is used as the sample value for comparison to obtain a seven-day fluctuation.
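The selection described in this appendix can be sketched in Python as follows. The `value_n_days_before` function, the `runs` list, and its metric values are hypothetical. The sketch assumes that each run is represented by its running time and one metric value of its output data.

```python
from datetime import date, datetime, timedelta
from typing import List, Optional, Tuple

Run = Tuple[datetime, float]  # (running time, metric value of the output data)


def value_n_days_before(runs: List[Run], current_date: date, n_days: int) -> Optional[float]:
    """Return the value from the last run on the date N days before current_date."""
    target_date = current_date - timedelta(days=n_days)
    # Keep only the runs whose running time falls on the target date.
    on_target = [r for r in runs if r[0].date() == target_date]
    # Sort by running time in reverse chronological order and take the first run.
    on_target.sort(key=lambda r: r[0], reverse=True)
    return on_target[0][1] if on_target else None


# Made-up row counts for three runs on June 1, 2024 and two runs on June 8, 2024.
runs = [
    (datetime(2024, 6, 1, 0), 100.0),
    (datetime(2024, 6, 1, 8), 110.0),
    (datetime(2024, 6, 1, 16), 120.0),
    (datetime(2024, 6, 8, 0), 130.0),
    (datetime(2024, 6, 8, 8), 132.0),
]
reference = value_n_days_before(runs, date(2024, 6, 8), 7)
print(reference)  # 120.0

# Seven-day fluctuation for the second run on June 8, 2024.
current = runs[-1][1]
print((current - reference) / reference)
```

In this example, the second run on June 8, 2024 is compared with the 16:00 run on June 1, 2024, which is the last run on that date.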