DataWorks allows you to create a data quality monitoring node and add monitoring rules to the node to monitor the data quality of a specific table of a data source. For example, you can use the data quality monitoring node to check whether dirty data exists. You can also configure a custom scheduling policy for the data quality monitoring node to periodically run a data quality monitoring task to check data. This topic describes how to create and use a data quality monitoring node to monitor the data quality of tables.
Background information
To ensure data quality, DataWorks Data Quality detects changes in source data and tracks dirty data that is generated during the extract, transform, load (ETL) process at the earliest opportunity. DataWorks Data Quality automatically blocks the running of tasks that involve dirty data to effectively stop the spread of dirty data to descendant tasks. This way, you can prevent tasks from producing unexpected dirty data that affects the smooth running of tasks and business decision-making. This also helps you reduce the time for troubleshooting issues and prevents the waste of resources caused by rerunning tasks. For more information, see Data Quality overview.
Limits
Supported data source types: MaxCompute, E-MapReduce (EMR), Hologres, Cloudera's Distribution Including Apache Hadoop (CDH) Hive, AnalyticDB for PostgreSQL, AnalyticDB for MySQL, and StarRocks.
Scope of tables that can be monitored:
You can monitor only the tables of a data source that is added to the workspace to which the current data quality monitoring node belongs.
Each data quality monitoring node can monitor the data quality of only one table. However, you can add multiple monitoring rules to a data quality monitoring node. The monitoring scope varies based on the type of a table.
Non-partitioned table: By default, all data in the table is monitored.
Partitioned table: You must specify a partition filter expression to determine the partition whose data you want to monitor. A sketch of how such an expression resolves is shown after this note.
Note: If you want to monitor the data quality of multiple tables, create a separate data quality monitoring node for each table.
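For reference, a partition filter expression typically binds the partition column to a scheduling parameter, for example dt=$[yyyymmdd-1] for yesterday's partition of a table partitioned by dt. The following minimal Python sketch only illustrates what such an expression resolves to; DataWorks resolves the real expression for you at run time, and the column name and date format here are assumptions:

```python
from datetime import date, timedelta

# Illustrative only: mimic the resolution of a partition filter expression
# such as dt=$[yyyymmdd-1] for a table partitioned by a dt column in
# yyyymmdd format. DataWorks performs this resolution at run time.
def resolve_partition_filter(column: str = "dt", days_back: int = 1) -> str:
    value = (date.today() - timedelta(days=days_back)).strftime("%Y%m%d")
    return f"{column}={value}"

print(resolve_partition_filter())  # e.g. dt=20240101 when run on 20240102
```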
Supported operations:
After you create data quality monitoring rules in DataStudio, you can run, modify, publish, and otherwise manage the rules only in DataStudio. In DataWorks Data Quality, you can view the rules but cannot run or manage them.
If you modify the monitoring rules configured in a data quality monitoring node and deploy the node, the original monitoring rules are replaced.
Prerequisites
A workflow is created.
In DataStudio, development operations are performed on different types of data sources based on workflows. Therefore, you must create a workflow before you create a data quality monitoring node. For more information, see Create a workflow.
A data source is added to a specific workspace, and a table whose data quality you want to monitor is created in the data source.
Before you run a data quality monitoring task, you must create a data source table whose data quality you want to monitor. For more information, see Add and manage data sources, Preparations before data development: Associate a data source or a cluster with DataStudio, and Task development.
A resource group is created.
You can run data quality monitoring nodes only by using a serverless resource group (recommended) or an exclusive resource group for scheduling. For more information, see Resource group management.
(Required if you use a RAM user to develop tasks) The RAM user is added to the DataWorks workspace as a member and is assigned the Develop or Workspace Administrator role. The Workspace Administrator role has extensive permissions. We recommend that you assign the Workspace Administrator role to a user only when necessary. For more information about how to add a member and assign roles to the member, see Add workspace members and assign roles to them.
Step 1: Create a data quality monitoring node
Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Development and Governance > Data Development. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
In the Scheduled Workflow pane of the DataStudio page, find the desired workflow, right-click its name, and choose Create Node > Data Quality Monitoring.
In the Create Node dialog box, configure the Name parameter and click Confirm. Then, you can use the node to develop and configure a task.
Step 2: Configure data quality monitoring rules
On the configuration tab of the data quality monitoring node, perform the following operations in sequence:
1. Select a table that you want to monitor
2. Configure the monitored data range
3. Configure data quality monitoring rules (a conceptual sketch of what a rule evaluates follows this list)
4. Configure runtime resources
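To make the effect of a monitoring rule concrete, the following self-contained Python sketch mimics what a dirty-data check conceptually evaluates: count the rows in the monitored data range that violate an expectation and compare the count with a threshold. The table, column, and threshold are assumptions for illustration; in DataWorks you configure the equivalent check in the rule settings instead of writing code.

```python
import sqlite3

# Illustrative only: a local stand-in for a dirty-data check. A NULL user_id
# is treated as dirty data in the monitored partition dt=20240101.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, user_id TEXT, dt TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "u1", "20240101"), (2, None, "20240101"), (3, "u3", "20240101")],
)

THRESHOLD = 0  # a strict rule: any dirty row is a violation
dirty = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE dt = '20240101' AND user_id IS NULL"
).fetchone()[0]

print("rule passed" if dirty <= THRESHOLD else f"rule failed: {dirty} dirty rows")
```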
Step 3: Configure a handling policy for the check result
In the Handling Policy section of the configuration tab of the data quality monitoring node, configure a handling policy and a subscription method for the exceptions that are identified based on the monitoring rules. The configuration covers the following items (a conceptual sketch follows this list):
Exception categories
Handling policies for exceptions
Subscription methods for exceptions
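Conceptually, the handling policy determines what happens after a check: a blocking policy fails the node and stops descendant tasks, whereas a non-blocking policy only raises an alert. The following sketch is illustrative only; the policy names are assumptions, not the product's API.

```python
# Illustrative only: map a check result and an assumed handling policy to an
# action. In DataWorks you select the behavior in the Handling Policy section.
def handle_check_result(check_passed: bool, policy: str) -> str:
    if check_passed:
        return "continue scheduling descendant tasks"
    if policy == "block":  # assumed name for a blocking policy
        return "set the node to failed and block descendant tasks"
    return "send an alert but let scheduling continue"

print(handle_check_result(check_passed=False, policy="block"))
```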
Step 4: Configure scheduling properties for the node
If you want to periodically run a task on the created node, click Properties in the right-side navigation pane of the configuration tab of the node and configure the scheduling properties for the node based on your business requirements. For more information, see Overview.
Before you commit the node, you must configure the Rerun and Parent Nodes parameters on the Properties tab.
Step 5: Debug the node
Perform the following debugging operations to check whether the node is configured as expected:
Optional. Select a resource group and assign custom parameters to variables.
In the top toolbar of the configuration tab of the node, click the Run with Parameters icon. In the Parameters dialog box, select the resource group for scheduling that you want to use to debug and run the node code.
If you configure scheduling parameters for the node, you can assign values to the variables for debugging. For information about the value assignment logic of scheduling parameters, see Debugging procedure.
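As an illustration of the value assignment, the following Python sketch mimics how an assignment such as bizdate=$bizdate resolves during debugging. For a daily task, $bizdate defaults to the data timestamp, which is the day before the scheduled run, in yyyymmdd format; DataWorks performs this resolution for you.

```python
from datetime import datetime, timedelta

# Illustrative only: mimic the default resolution of $bizdate for a daily
# task, where the business date is the day before the scheduled run.
def resolve_bizdate(scheduled_run: datetime) -> str:
    return (scheduled_run - timedelta(days=1)).strftime("%Y%m%d")

print(resolve_bizdate(datetime(2024, 1, 2)))  # -> 20240101
```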
Save and run the node.
In the top toolbar, click the Save icon to save the node, and then click the Run icon to run the node.
After node running is complete, you can view the running result in the lower part of the configuration tab of the node. If the node fails to run, troubleshoot the issue based on the reported error.
Optional. Perform smoke testing.
You can perform smoke testing on the node in the development environment when you commit the node or after you commit the node. For more information, see Perform smoke testing.
Step 6: Commit the node
After you configure the node, commit and deploy it. After the node is committed and deployed, the system periodically runs the related task based on the scheduling configurations.
When you commit and deploy the node, the monitoring rules configured in the node are also committed and deployed.
In the top toolbar, click the Save icon to save the node.
In the top toolbar, click the Submit icon to commit the node.
In the Submit dialog box, configure the Change description parameter. Then, determine based on your business requirements whether to have the node code reviewed after you commit the node.
Note: Before you commit the node, you must configure the Rerun and Parent Nodes parameters on the Properties tab.
You can use the code review feature to ensure the quality of the node code and prevent errors caused by incorrect configurations that are directly deployed. If you use the code review feature, the committed node can be deployed only after the node code passes review. For more information, see Code review.
If you use a workspace in standard mode, you must deploy the node to the production environment after you commit the node. To deploy a node, click Deploy in the upper-right corner of the configuration tab of the node. For more information, see Deploy nodes.
What to do next
Node O&M: After you commit and deploy the node, the node is periodically run based on the configurations. To view the scheduling status of the node, such as the node running status and the details of triggered monitoring rules, you can click O&M in the upper-right corner of the configuration tab of the node to go to Operation Center. For more information, see View and manage auto triggered nodes.
Data Quality: After a data quality monitoring rule is published, you can go to the Data Quality page to view the details of the rule. However, you cannot perform management operations on the rule, such as modifying or deleting the rule. For more information, see Overview.