This topic describes some common scenarios in which data quality issues occur. You can refer to the instructions provided in this topic to quickly troubleshoot issues related to data quality in batch synchronization.
Background information
This topic also describes the principles and mechanisms of data synchronization to help you understand the data synchronization process and evaluate the results of data synchronization. You can evaluate the results based on items such as the volume of data and the number of data records that are synchronized to a destination.

Principle of data synchronization
A data synchronization node of Data Integration is run as a job. To improve data synchronization efficiency, the system splits the job into multiple tasks when the node is run and uses a single thread or parallel threads to run the tasks. If parallel threads are used, each task reads data from a specific range. Together, the tasks complete the entire data synchronization process.
Note During batch synchronization of data, readers are used to read data, and writers are used to write data.
- A reader reads data from a source and transfers the data to an internal buffer queue. A writer consumes the data from the internal buffer queue and writes the data to a destination.
- A reader and a writer collaborate with each other to complete the data synchronization process. Both follow the data read and write protocols and rules of the data sources, including the constraints that the data sources impose on data. Therefore, these protocols and rules affect the result of data synchronization.
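For example, when a job reads a relational database table and is split on an integer primary key, each task may issue a range query similar to the following statements. This is a minimal sketch: the table and column names are hypothetical, and the actual split statements depend on the reader type and the node configuration.

```sql
-- Hypothetical range queries issued by parallel tasks after a job is
-- split on the primary key column id. Each task reads one range, and
-- the ranges together cover the whole table.
SELECT id, name, gmt_modified FROM orders WHERE id >= 1     AND id < 10000;  -- task 1
SELECT id, name, gmt_modified FROM orders WHERE id >= 10000 AND id < 20000;  -- task 2
SELECT id, name, gmt_modified FROM orders WHERE id >= 20000 AND id < 30000;  -- task 3
-- Each statement runs in its own transaction, so each task sees only the
-- snapshot of its range at the time its query is executed.
```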
Check data consistency in a destination
A writer writes data that is read from a source to a destination. Data Integration provides a writer for each type of destination. A writer writes data to a destination over a Java Database Connectivity (JDBC) connection or by using the SDK of the destination based on the write mode and conflict processing policy configured for the related data synchronization node.
Note The result of writing data to a destination is related to the data content, the write mode, and the constraints on the destination tables.
If the quality of data that is synchronized by your data synchronization node does not meet your business requirements, you can troubleshoot issues in the destination based on the instructions described in the following table. You can evaluate the data quality from the following dimensions: the number of data records and the data content.
Cause | Problem description | Solution
---|---|---
An unsuitable write mode is configured. | A writer replays the source data on the destination based on the write mode that is configured for the data synchronization node. If constraint conflicts occur between the source data and the schemas of the destination tables, the source data may fail to be inserted into the destination tables and be treated as dirty data, may be ignored, or may replace the existing data in the destination tables. | Configure a suitable write mode for the data synchronization node. Check whether constraint conflicts occur between the source data and the schemas of the destination tables, and whether the constraints are reasonable. The supported write modes vary based on the destination type. Common write modes for relational databases are described in the following table, and example statements are provided after that table.
Write mode | Description
---|---
insert into | Writes data to the destination by using an INSERT INTO statement. If a data constraint conflict, such as a primary key conflict, a unique key conflict, or a foreign key conflict, occurs between the source data and the existing data in the destination, the source data is treated as dirty data and is not written to the destination. You can view the logs that are generated for the data synchronization node to obtain the number and content of the dirty data records that are generated during data synchronization.
replace into | Writes data to the destination by using a REPLACE INTO statement. If a data constraint conflict occurs between the source data and the existing data in the destination, the conflicting rows in the destination are deleted and the source data is inserted, and no dirty data is generated. If no conflict occurs, the source data is inserted into the destination.
insert into on duplicate key update | Writes data to the destination by using an INSERT INTO ... ON DUPLICATE KEY UPDATE statement. If a data constraint conflict occurs between the source data and the existing data in the destination, the conflicting rows in the destination are updated with the source data, and no dirty data is generated. If no conflict occurs, the source data is inserted into the destination.
insert ignore into | Writes data to the destination by using an INSERT IGNORE INTO statement. If a data constraint conflict occurs between the source data and the existing data in the destination, the source data is discarded, and no dirty data is generated. If no conflict occurs, the source data is inserted into the destination.
insert on conflict do nothing | A write mode supported by the PostgreSQL protocol family. This write mode is similar to the insert ignore into write mode that is supported by the MySQL protocol: conflicting source data is discarded, and no dirty data is generated.
insert on conflict do update | A write mode supported by the PostgreSQL protocol family. This write mode is similar to the insert into on duplicate key update write mode that is supported by the MySQL protocol: the conflicting rows in the destination are updated with the source data, and no dirty data is generated.
copy on conflict do nothing | A write mode supported by the PostgreSQL protocol family. Writes data to the destination by using a COPY statement. If a data constraint conflict occurs between the source data and the existing data in the destination, the source data is discarded, and no dirty data is generated. If no conflict occurs, the source data is inserted into the destination.
copy on conflict do update | A write mode supported by the PostgreSQL protocol family. Writes data to the destination by using a COPY statement. If a data constraint conflict occurs between the source data and the existing data in the destination, the existing data in the destination is replaced with the source data, and no dirty data is generated. If no conflict occurs, the source data is inserted into the destination.
merge into | This write mode is not supported.
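The following statements are minimal sketches of these write modes, assuming a hypothetical table t with a primary key column id and a value column val in both MySQL and PostgreSQL. The writers generate statements of this general shape; the exact SQL depends on the destination type and the node configuration.

```sql
-- MySQL protocol family (hypothetical table t with primary key id):
INSERT INTO t (id, val) VALUES (1, 'a');                 -- insert into: conflicting rows become dirty data
REPLACE INTO t (id, val) VALUES (1, 'a');                -- replace into: conflicting row is deleted, then inserted
INSERT INTO t (id, val) VALUES (1, 'a')
  ON DUPLICATE KEY UPDATE val = VALUES(val);             -- insert into on duplicate key update: conflicting row is updated
INSERT IGNORE INTO t (id, val) VALUES (1, 'a');          -- insert ignore into: conflicting source row is discarded

-- PostgreSQL protocol family:
INSERT INTO t (id, val) VALUES (1, 'a')
  ON CONFLICT (id) DO NOTHING;                           -- insert on conflict do nothing
INSERT INTO t (id, val) VALUES (1, 'a')
  ON CONFLICT (id) DO UPDATE SET val = EXCLUDED.val;     -- insert on conflict do update
-- The copy on conflict modes load data with a COPY statement and apply
-- the same do nothing or do update conflict policies.
```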
The following table describes other common causes of data quality issues in a destination.

Cause | Problem description | Solution
---|---|---
Dirty data is generated during data synchronization. | Data in the source fails to be written to the destination and is treated as dirty data. As a result, the number of data records that are synchronized to the destination is inconsistent with the number of data records in the source. | Determine why the dirty data is generated and resolve the issue. Note: If you want to prohibit the generation of dirty data for your data synchronization node, set the maximum number of dirty data records that are allowed during data synchronization to 0 when you configure channel control policies for the node in the codeless user interface (UI). For more information about the parameter configuration, see Configure a batch synchronization node by using the codeless UI. For more information about dirty data, see Terms.
A data query is performed before the data synchronization node finishes running. | Some types of writers, such as Hive Writer and MaxCompute Writer, allow the data that has been synchronized to the destination to be queried before the related data synchronization node finishes running. For MaxCompute Writer, you can enable or disable this feature. If you query the data before the node finishes running, the data in the destination is inconsistent with the data in the source. | Query the data after the data synchronization node finishes running.
Scheduling dependencies between nodes are not configured. | A data analytics node needs to use the data that is generated by a data synchronization node, but no scheduling dependencies are configured between the two nodes. For example, a data analytics node uses the max_pt function to identify the partition that stores the largest volume of data in a destination MaxCompute table and reads data from that partition. If the data synchronization node has not written data to the partition when the data analytics node runs, the data analytics node cannot obtain the required data. | Do not use methods such as max_pt to establish weak data dependencies between nodes. Instead, configure scheduling dependencies between the nodes. This ensures that the descendant node can obtain the required data from the ancestor node. For an illustration of the weak dependency that max_pt creates, see the sketch after this table.
Multiple data synchronization nodes or multiple instances of the same data synchronization node write data to the same destination table or the same partition. | Multiple data synchronization nodes or multiple instances of the same data synchronization node run in parallel and write data to the same destination table or partition. As a result, the data in the destination may be duplicated or overwritten. | Do not configure multiple nodes or node instances to write data to the same destination table or partition at the same time.
The results of the data synchronization node are not idempotent. | In theory, if you run a data synchronization node multiple times with the same configurations, the data synchronization results are the same. However, if the configurations of the node are not idempotent and the node is rerun multiple times upon success or failure, the data that is synchronized to the destination may be duplicated or overwritten. | Do not run a data synchronization node multiple times. Note: If you do not allow the node to rerun in your business scenario but you want to ensure the timeliness of the output of the node, we recommend that you configure monitoring and alerting settings for the node. This way, the specified alert recipients can receive alert notifications at the earliest opportunity after an error occurs and can troubleshoot the issue efficiently. For information about how to configure monitoring and alerting settings for a node, see Overview.
Scheduling parameters are not correctly configured or are not replaced with the expected values. | If you run a data synchronization node that synchronizes data to a partitioned MaxCompute table, you can specify the destination partition by using scheduling parameters such as $bizdate. If the scheduling parameters are not correctly configured or are not replaced with the expected values at runtime, data may be written to an unexpected partition. | Check whether the configured scheduling parameters meet your business requirements and whether the parameters are replaced with the expected values when the node runs.
The data type, data range, or time zone of the destination is different from that of the source. | If the data type or data range of a destination column is incompatible with the source data, the data may fail to be written to the destination and is treated as dirty data. If the source and the destination use different time zones, the time values in the destination may differ from those in the source. | Make sure that the data types, data ranges, and time zone settings of the destination are compatible with those of the source.
Data that is synchronized to the destination is changed. | Other system programs may access and update the data that is synchronized to the destination. As a result, the data in the destination is different from the data in the source. | In most cases, this situation occurs as expected and does not need to be handled.
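The following sketch illustrates the weak data dependency that the max_pt function creates. The project and table names are hypothetical. max_pt returns the partition with the largest value at the moment the query runs, regardless of whether the upstream synchronization node has finished writing to that partition, so configure a scheduling dependency instead of relying on this pattern.

```sql
-- Hypothetical MaxCompute query that creates a weak data dependency.
-- max_pt('my_project.ods_orders') returns the largest partition value of
-- the table when the query runs. If the synchronization node has not yet
-- written data to that partition, this query reads incomplete data.
SELECT *
FROM my_project.ods_orders
WHERE pt = max_pt('my_project.ods_orders');
```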
Check data consistency in a source
A reader connects to a source, reads data from the source, and then transfers the data to a writer. Data Integration provides a reader for each type of source. A reader reads data from a source over a JDBC connection or by using the SDK of the source based on the data read settings that are configured for the related data synchronization node. The data read settings include the data filter condition, tables, partitions, and columns.
Note The result of reading data from a source depends on various factors, such as the data synchronization mechanism, data changes in the source, and node configurations.
If the quality of data that is synchronized by your data synchronization node does not meet your business requirements, you can troubleshoot issues in the source based on the instructions described in the following table. You can evaluate the data quality from the following dimensions: the number of data records and the data content.
Cause | Problem description | Solution
---|---|---
Data in the source is continuously changed. | The source data that you want to synchronize may change while the node runs. As a result, the data that is synchronized to the destination may be inconsistent with the latest data in the source. In essence, a data synchronization node reads data from the source by executing query statements on the source. If you specify a maximum number of parallel threads for a data synchronization node that synchronizes data from a relational database, the node executes multiple query statements at the same time to read data in parallel. Due to the transaction consistency characteristics of a relational database, each query statement returns only the data snapshot that is generated when the statement is executed. The query statements do not run in the context of the same transaction and cannot return data changes that occur after the statements are executed. | In most cases, this issue does not affect your business. If this issue occurs, you can run the data synchronization node multiple times. The number of data records that are synchronized during each run may differ.
Scheduling parameters are not correctly configured or are not replaced with the expected values. | For example, if you use scheduling parameters such as $bizdate to specify the source partitions or a data filter condition and the parameters are not replaced with the expected values at runtime, the node reads data from an unexpected data range. For an example, see the sketch after this table. | Check whether the configured scheduling parameters meet your business requirements and whether the parameters are replaced with the expected values when the node runs.
Dirty data is generated during data synchronization. | Some data in the source fails to be read and is treated as dirty data. As a result, the number of data records that are written to the destination is inconsistent with the number of data records in the source. | Determine why the dirty data is generated and resolve the issue. Note: Dirty data is rarely generated in sources. In most cases, dirty data is generated in a source when a semi-structured data source, such as an Object Storage Service (OSS), File Transfer Protocol (FTP), or Hadoop Distributed File System (HDFS) data source, is used as the source.
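The following sketch shows how a scheduling parameter affects the data range that a reader queries. The table name, column names, and parameter value are hypothetical; the $bizdate parameter is assumed to be replaced with the data timestamp of the node instance, such as 20240101.

```sql
-- Hypothetical filter condition configured for a reader. Before the node
-- runs, the scheduler replaces ${bizdate} with the data timestamp of the
-- node instance:
SELECT id, gmt_modified, amount
FROM orders
WHERE ds = '${bizdate}';   -- becomes: WHERE ds = '20240101'
-- If the parameter is misconfigured or replaced with an unexpected value,
-- the reader queries a data range that is different from the one you intended.
```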
Check environment information
Cause | Solution
---|---
The data source, table, or partition from which you query data is incorrect or incomplete. | Check whether the data source, table, and partition that you query are the same as those that are configured for the data synchronization node. For example, make sure that you do not query a table in the development environment when the node writes data to the production environment.
The data output has not been generated. | If you want to query data that is periodically generated, such as data generated by a data synchronization node that is periodically scheduled or data generated by a merge node that is used to periodically merge full data and incremental data, check whether the related node has finished running and generated the data. For an example of such a check, see the sketch after this table.
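For example, if the destination is a partitioned MaxCompute table, you can check whether the expected partition exists and contains data before you query it. This is a minimal sketch; the table name and partition value are hypothetical.

```sql
-- Hypothetical checks on a partitioned MaxCompute destination table:
SHOW PARTITIONS ods_orders;                             -- list the partitions that exist
SELECT COUNT(*) FROM ods_orders WHERE ds = '20240101';  -- check whether the partition contains data
```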
Note When you troubleshoot data quality issues of your data synchronization node, you can run the node multiple times and compare the results. You can also change the source or destination of the node and compare the results across different sources or destinations. This narrows the troubleshooting scope and facilitates troubleshooting.