This topic describes some common scenarios in which data quality issues occur. You can refer to the instructions provided in this topic to quickly troubleshoot issues related to data quality in batch synchronization.
Background information
This topic also describes the principles and mechanisms of data synchronization to help you understand the data synchronization process and evaluate the results of data synchronization. You can evaluate the results based on items such as the volume of data and the number of data records that are synchronized to a destination.

Principle of data synchronization
A data synchronization node of Data Integration is run as a job. To improve data synchronization efficiency, the system splits the job into multiple tasks when the node is run and uses a single thread or parallel threads to run the tasks. If parallel threads are used, each task reads data from a specific range. Together, the tasks complete the entire data synchronization process.
Note During batch synchronization of data, readers are used to read data, and writers are used to write data.
- A reader reads data from a source and transfers the data to an internal buffer queue. A writer consumes the data from the internal buffer queue and writes the data to a destination.
- A reader and a writer collaborate with each other to complete the data synchronization process. Both follow the data read and write protocols and rules of the data sources, including the constraints that the data sources impose on data. Therefore, these protocols and rules affect the result of data synchronization.
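For example, when a job reads a relational database table and is split on an integer primary key, each task may issue a range query similar to the following statements. This is a minimal sketch: the table and column names are hypothetical, and the actual split statements depend on the reader type and the node configuration.

```sql
-- Hypothetical range queries issued by parallel tasks after a job is
-- split on the primary key column id. Each task reads one range, and
-- the ranges together cover the whole table.
SELECT id, name, gmt_modified FROM orders WHERE id >= 1     AND id < 10000;  -- task 1
SELECT id, name, gmt_modified FROM orders WHERE id >= 10000 AND id < 20000;  -- task 2
SELECT id, name, gmt_modified FROM orders WHERE id >= 20000 AND id < 30000;  -- task 3
-- Each statement runs in its own transaction, so each task sees only the
-- snapshot of its range at the time its query is executed.
```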
Check data consistency in a destination
A writer writes data that is read from a source to a destination. Data Integration provides a writer for each type of destination. A writer writes data to a destination over a Java Database Connectivity (JDBC) connection or by using the SDK of the destination based on the write mode and conflict processing policy configured for the related data synchronization node.
Note The result of writing data to a destination is related to the data content, the write mode, and the constraints on the destination tables.
If the quality of data that is synchronized by your data synchronization node does not meet your business requirements, you can troubleshoot issues in the destination based on the instructions described in the following table. You can evaluate the data quality from the following dimensions: the number of data records and the data content.
Cause | Problem description | Solution
---|---|---
An unsuitable write mode is configured. | A writer replays the source data on the destination based on the write mode that is configured for the data synchronization node. If constraint conflicts occur between the source data and the schemas of the destination tables, the source data may fail to be inserted into the destination tables and be treated as dirty data, may be ignored, or may replace the existing data in the destination tables. | Configure a suitable write mode for the data synchronization node. Check whether constraint conflicts occur between the source data and the schemas of the destination tables, and whether the constraints are reasonable. The supported write modes vary based on the destination type. Common write modes for relational databases are described in the following table, and example statements are provided after that table.
Write mode | Description
---|---
insert into | Writes data to the destination by using an INSERT INTO statement. If a data constraint conflict, such as a primary key conflict, a unique key conflict, or a foreign key conflict, occurs between the source data and the existing data in the destination, the source data is treated as dirty data and is not written to the destination. You can view the logs that are generated for the data synchronization node to obtain the number and content of the dirty data records that are generated during data synchronization.
replace into | Writes data to the destination by using a REPLACE INTO statement. If a data constraint conflict occurs between the source data and the existing data in the destination, the conflicting rows in the destination are deleted and the source data is inserted, and no dirty data is generated. If no conflict occurs, the source data is inserted into the destination.
insert into on duplicate key update | Writes data to the destination by using an INSERT INTO ... ON DUPLICATE KEY UPDATE statement. If a data constraint conflict occurs between the source data and the existing data in the destination, the conflicting rows in the destination are updated with the source data, and no dirty data is generated. If no conflict occurs, the source data is inserted into the destination.
insert ignore into | Writes data to the destination by using an INSERT IGNORE INTO statement. If a data constraint conflict occurs between the source data and the existing data in the destination, the source data is discarded, and no dirty data is generated. If no conflict occurs, the source data is inserted into the destination.
insert on conflict do nothing | A write mode supported by the PostgreSQL protocol family. This write mode is similar to the insert ignore into write mode that is supported by the MySQL protocol: conflicting source data is discarded, and no dirty data is generated.
insert on conflict do update | A write mode supported by the PostgreSQL protocol family. This write mode is similar to the insert into on duplicate key update write mode that is supported by the MySQL protocol: the conflicting rows in the destination are updated with the source data, and no dirty data is generated.
copy on conflict do nothing | A write mode supported by the PostgreSQL protocol family. Writes data to the destination by using a COPY statement. If a data constraint conflict occurs between the source data and the existing data in the destination, the source data is discarded, and no dirty data is generated. If no conflict occurs, the source data is inserted into the destination.
copy on conflict do update | A write mode supported by the PostgreSQL protocol family. Writes data to the destination by using a COPY statement. If a data constraint conflict occurs between the source data and the existing data in the destination, the existing data in the destination is replaced with the source data, and no dirty data is generated. If no conflict occurs, the source data is inserted into the destination.
merge into | This write mode is not supported.
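The following statements are minimal sketches of these write modes, assuming a hypothetical table t with a primary key column id and a value column val in both MySQL and PostgreSQL. The writers generate statements of this general shape; the exact SQL depends on the destination type and the node configuration.

```sql
-- MySQL protocol family (hypothetical table t with primary key id):
INSERT INTO t (id, val) VALUES (1, 'a');                 -- insert into: conflicting rows become dirty data
REPLACE INTO t (id, val) VALUES (1, 'a');                -- replace into: conflicting row is deleted, then inserted
INSERT INTO t (id, val) VALUES (1, 'a')
  ON DUPLICATE KEY UPDATE val = VALUES(val);             -- insert into on duplicate key update: conflicting row is updated
INSERT IGNORE INTO t (id, val) VALUES (1, 'a');          -- insert ignore into: conflicting source row is discarded

-- PostgreSQL protocol family:
INSERT INTO t (id, val) VALUES (1, 'a')
  ON CONFLICT (id) DO NOTHING;                           -- insert on conflict do nothing
INSERT INTO t (id, val) VALUES (1, 'a')
  ON CONFLICT (id) DO UPDATE SET val = EXCLUDED.val;     -- insert on conflict do update
-- The copy on conflict modes load data with a COPY statement and apply
-- the same do nothing or do update conflict policies.
```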
The following table describes other common causes of data quality issues in a destination.

Cause | Problem description | Solution
---|---|---
Dirty data is generated during data synchronization. | Data in the source fails to be written to the destination and is treated as dirty data. As a result, the number of data records that are synchronized to the destination is inconsistent with the number of data records in the source. | Determine why the dirty data is generated and resolve the issue. Note: If you want to prohibit the generation of dirty data for your data synchronization node, set the maximum number of dirty data records that are allowed during data synchronization to 0 when you configure channel control policies for the node in the codeless user interface (UI). For more information about the parameter configuration, see Configure a batch synchronization node by using the codeless UI. For more information about dirty data, see Terms.
A data query is performed before the data synchronization node finishes running. | Some types of writers, such as Hive Writer and MaxCompute Writer, allow the data that has been synchronized to the destination to be queried before the related data synchronization node finishes running. For MaxCompute Writer, you can enable or disable this feature. If you query the data before the node finishes running, the data in the destination is inconsistent with the data in the source. | Query the data after the data synchronization node finishes running.
Scheduling dependencies between nodes are not configured. | A data analytics node needs to use the data that is generated by a data synchronization node, but no scheduling dependencies are configured between the two nodes. For example, a data analytics node uses the max_pt function to identify the partition that stores the largest volume of data in a destination MaxCompute table and reads data from that partition. If the data synchronization node has not written data to the partition when the data analytics node runs, the data analytics node cannot obtain the required data. | Do not use methods such as max_pt to establish weak data dependencies between nodes. Instead, configure scheduling dependencies between the nodes. This ensures that the descendant node can obtain the required data from the ancestor node. For an illustration of the weak dependency that max_pt creates, see the sketch after this table.
Multiple data synchronization nodes or multiple instances of the same data synchronization node write data to the same destination table or the same partition. | Multiple data synchronization nodes or multiple instances of the same data synchronization node run in parallel and write data to the same destination table or partition. As a result, the data in the destination may be duplicated or overwritten. | Do not configure multiple nodes or node instances to write data to the same destination table or partition at the same time.
The results of the data synchronization node are not idempotent. | In theory, if you run a data synchronization node multiple times with the same configurations, the data synchronization results are the same. However, if the configurations of the node are not idempotent and the node is rerun multiple times upon success or failure, the data that is synchronized to the destination may be duplicated or overwritten. | Do not run a data synchronization node multiple times. Note: If you do not allow the node to rerun in your business scenario but you want to ensure the timeliness of the output of the node, we recommend that you configure monitoring and alerting settings for the node. This way, the specified alert recipients can receive alert notifications at the earliest opportunity after an error occurs and can troubleshoot the issue efficiently. For information about how to configure monitoring and alerting settings for a node, see Overview.
Scheduling parameters are not correctly configured or are not replaced with the expected values. | If you run a data synchronization node that synchronizes data to a partitioned MaxCompute table, you can specify the destination partition by using scheduling parameters such as $bizdate. If the scheduling parameters are not correctly configured or are not replaced with the expected values at runtime, data may be written to an unexpected partition. | Check whether the configured scheduling parameters meet your business requirements and whether the parameters are replaced with the expected values when the node runs.
The data type, data range, or time zone of the destination is different from that of the source. | If the data type or data range of a destination column is incompatible with the source data, the data may fail to be written to the destination and is treated as dirty data. If the source and the destination use different time zones, the time values in the destination may differ from those in the source. | Make sure that the data types, data ranges, and time zone settings of the destination are compatible with those of the source.
Data that is synchronized to the destination is changed. | Other system programs may access and update the data that is synchronized to the destination. As a result, the data in the destination is different from the data in the source. | In most cases, this situation occurs as expected and does not need to be handled.
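The following sketch illustrates the weak data dependency that the max_pt function creates. The project and table names are hypothetical. max_pt returns the partition with the largest value at the moment the query runs, regardless of whether the upstream synchronization node has finished writing to that partition, so configure a scheduling dependency instead of relying on this pattern.

```sql
-- Hypothetical MaxCompute query that creates a weak data dependency.
-- max_pt('my_project.ods_orders') returns the largest partition value of
-- the table when the query runs. If the synchronization node has not yet
-- written data to that partition, this query reads incomplete data.
SELECT *
FROM my_project.ods_orders
WHERE pt = max_pt('my_project.ods_orders');
```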
Check data consistency in a source
A reader connects to a source, reads data from the source, and then transfers the data to a writer. Data Integration provides a reader for each type of source. A reader reads data from a source over a JDBC connection or by using the SDK of the source based on the data read settings that are configured for the related data synchronization node. The data read settings include the data filter condition, tables, partitions, and columns.
Note The result of reading data from a source depends on various factors, such as the data synchronization mechanism, data changes in the source, and node configurations.
If the quality of data that is synchronized by your data synchronization node does not meet your business requirements, you can troubleshoot issues in the source based on the instructions described in the following table. You can evaluate the data quality from the following dimensions: the number of data records and the data content.
Cause | Problem description | Solution
---|---|---
Data in the source is continuously changed. | The source data that you want to synchronize may change while the node runs. As a result, the data that is synchronized to the destination may be inconsistent with the latest data in the source. In essence, a data synchronization node reads data from the source by executing query statements on the source. If you specify a maximum number of parallel threads for a data synchronization node that synchronizes data from a relational database, the node executes multiple query statements at the same time to read data in parallel. Due to the transaction consistency characteristics of a relational database, each query statement returns only the data snapshot that is generated when the statement is executed. The query statements do not run in the context of the same transaction and cannot return data changes that occur after the statements are executed. | In most cases, this issue does not affect your business. If this issue occurs, you can run the data synchronization node multiple times. The number of data records that are synchronized during each run may differ.
Scheduling parameters are not correctly configured or are not replaced with the expected values. | For example, if you use scheduling parameters such as $bizdate to specify the source partitions or a data filter condition and the parameters are not replaced with the expected values at runtime, the node reads data from an unexpected data range. For an example, see the sketch after this table. | Check whether the configured scheduling parameters meet your business requirements and whether the parameters are replaced with the expected values when the node runs.
Dirty data is generated during data synchronization. | Some data in the source fails to be read and is treated as dirty data. As a result, the number of data records that are written to the destination is inconsistent with the number of data records in the source. | Determine why the dirty data is generated and resolve the issue. Note: Dirty data is rarely generated in sources. In most cases, dirty data is generated in a source when a semi-structured data source, such as an Object Storage Service (OSS), File Transfer Protocol (FTP), or Hadoop Distributed File System (HDFS) data source, is used as the source.
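The following sketch shows how a scheduling parameter affects the data range that a reader queries. The table name, column names, and parameter value are hypothetical; the $bizdate parameter is assumed to be replaced with the data timestamp of the node instance, such as 20240101.

```sql
-- Hypothetical filter condition configured for a reader. Before the node
-- runs, the scheduler replaces ${bizdate} with the data timestamp of the
-- node instance:
SELECT id, gmt_modified, amount
FROM orders
WHERE ds = '${bizdate}';   -- becomes: WHERE ds = '20240101'
-- If the parameter is misconfigured or replaced with an unexpected value,
-- the reader queries a data range that is different from the one you intended.
```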
Check environment information
Cause | Solution
---|---
The data source, table, or partition from which you query data is incorrect or incomplete. | Check whether the data source, table, and partition that you query are the same as those that are configured for the data synchronization node. For example, make sure that you do not query a table in the development environment when the node writes data to the production environment.
The data output has not been generated. | If you want to query data that is periodically generated, such as data generated by a data synchronization node that is periodically scheduled or data generated by a merge node that is used to periodically merge full data and incremental data, check whether the related node has finished running and generated the data. For an example of such a check, see the sketch after this table.
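For example, if the destination is a partitioned MaxCompute table, you can check whether the expected partition exists and contains data before you query it. This is a minimal sketch; the table name and partition value are hypothetical.

```sql
-- Hypothetical checks on a partitioned MaxCompute destination table:
SHOW PARTITIONS ods_orders;                             -- list the partitions that exist
SELECT COUNT(*) FROM ods_orders WHERE ds = '20240101';  -- check whether the partition contains data
```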
Note When you troubleshoot data quality issues of your data synchronization node, you can run the node multiple times and compare the results. You can also change the source or destination of the node and compare the results across different sources or destinations. This narrows the troubleshooting scope and facilitates troubleshooting.