Channel control settings for batch synchronization

This topic describes the channel control settings that you can configure for batch synchronization. The information provided in this topic can help you understand and correctly configure channel control parameters for batch synchronization and reduce your need for technical support.

Channel control settings

Parallelism

This section provides answers to the following questions:

Question 1: How do I configure parallelism for a data synchronization node?
Question 2: Why does my data synchronization node run at a slow speed? Is this issue caused by insufficient parallel threads?
Question 3: Why does my data synchronization node run at a slow speed even if I configure a large number of parallel threads for the node? Why does my data synchronization node always wait for resources of an exclusive resource group?

Parallelism refers to the maximum number of parallel threads that can be used to read data from a source or write data to a destination when a data synchronization node is run. To improve data synchronization efficiency, you can adjust the parallelism configured for your data synchronization node. The following figure shows the entry point for configuring parallelism for a data synchronization node in the codeless user interface (UI).

The Expected Maximum Concurrency parameter shown in the figure specifies the maximum number of parallel threads that can be configured for a data synchronization node. Due to the performance of the resource group for Data Integration that you use and the characteristics of the data synchronization node, the actual number of parallel threads that are run for the node may be less than or equal to the configured maximum. If you use the shared resource group for Data Integration (debugging) to run a data synchronization node and configure parallelism for the node, you are charged fees based on the number of parallel threads that are run.

Data Integration tries to keep the actual number of parallel threads that are run for a data synchronization node consistent with the maximum number of parallel threads that is configured for the node. In the following scenarios, the actual number of parallel threads that are run may be less than the configured maximum:

A data synchronization node is used to synchronize data from a relational database, such as a MySQL, PolarDB, SQL Server, PostgreSQL, or Oracle database, but no shard key or an invalid shard key is configured for the node. In this case, the node cannot shard data in the database, and parallel threads cannot be used to read data. You can specify a field of an integer data type as a shard key. For Oracle, you can also specify a field of a time data type as a shard key.
Data is synchronized from a PolarDB for Xscale (PolarDB-X) data source. When you run a data synchronization node to synchronize data from PolarDB-X, the node shards data based on the physical topology of logical tables and reads the sharded data. As a result, the actual number of parallel threads cannot exceed the number of physical table shards, which may be less than the maximum number of parallel threads configured for the node.
Data is synchronized from files stored in an Object Storage Service (OSS), FTP, Hadoop Distributed File System (HDFS), or AWS S3 data source. In this case, data is read at file granularity. As a result, the actual number of parallel threads cannot exceed the number of files to read, which may be less than the maximum number of parallel threads configured for the node.
If data distribution in a source is extremely uneven, an extended period of time may be required to read data from some shards after data read from the other shards is complete. In the later stages of the running of the related data synchronization node, the actual number of parallel threads that are run for the node may be less than the maximum number of parallel threads that is configured for the node.
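
The scenarios above can be pictured with a small sketch. It is illustrative only and is not how Data Integration is implemented: the function names `split_key_range` and `effective_parallelism` are hypothetical, and the sketch assumes an integer shard key whose minimum and maximum values are known. It shows how a key range might be divided among parallel threads and how the number of splittable units (shards or files) caps the threads that can do useful work.

```python
from typing import List, Optional, Tuple


def split_key_range(min_key: int, max_key: int, parallelism: int) -> List[Tuple[int, int]]:
    """Split the inclusive range [min_key, max_key] into at most
    `parallelism` contiguous, non-overlapping inclusive sub-ranges."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    if min_key > max_key:
        return []
    total = max_key - min_key + 1
    shards = min(parallelism, total)  # never create empty shards
    base, extra = divmod(total, shards)
    ranges = []
    start = min_key
    for i in range(shards):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def effective_parallelism(configured: int, splittable_units: Optional[int]) -> int:
    """Upper bound on the threads that can actually read in parallel.

    splittable_units is the number of shards or files the source can be
    divided into. None models a relational source without a valid shard
    key (assumption: such a source is read by a single thread).
    """
    if splittable_units is None:
        return 1
    return max(1, min(configured, splittable_units))


print(split_key_range(1, 10, 3))
print(effective_parallelism(8, 3))
print(effective_parallelism(8, None))
```

Even splitting by key range does not help when the data itself is skewed: if most rows fall into one sub-range, that thread keeps running after the others finish, which is the last scenario in the list.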

Best practices for configuring parallelism for a data synchronization node:

The more parallel threads are run for a data synchronization node, the more resources are preempted by the data synchronization node. Resource allocation of a resource group used for data synchronization conforms to the first in, first out (FIFO) rule. This indicates that the earlier a data synchronization node is committed, the earlier the data synchronization node can preempt resources of a resource group. We recommend that you configure parallelism for your data synchronization node based on your business requirements. This can help prevent a large number of parallel threads from increasing the running duration of the data synchronization node and prevent resource occupation of the data synchronization node from blocking the running of other nodes.
If you want to synchronize only a small amount of data from a source, we recommend that you configure a small number of parallel threads for the related data synchronization node. Low parallelism requires only a small amount of resources. This helps the node quickly obtain fragmented resources to start running and keeps the running duration of the node within a proper range.
If you configure multiple data synchronization nodes for the same data source, we recommend that you do not run the data synchronization nodes in parallel. This can balance the resource usage of resource groups and reduce parallel read workloads on the data source.

Data transmission rate

This section provides answers to the following questions:

Question 1: How do I specify a data transmission rate for a data synchronization node? What are the impacts of throttling that is enabled or disabled for a data synchronization node?
Question 2: Why do the settings not take effect in some cases after I enable throttling and specify a maximum transmission rate?
Question 3: Why does a big gap exist between the actual data transmission rate and the specified maximum transmission rate in some cases?

Data transmission rate: The data transmission rate and maximum number of parallel threads are closely related to each other. If you configure these two settings for a data synchronization node at the same time, read workloads on the source and write workloads on the destination can be controlled within a proper range. This helps ensure the stability of the source and destination.

If you do not enable throttling for a data synchronization node, the data synchronization node runs the maximum number of parallel threads that you configured to synchronize data, and the data transmission rate for each slice of the node is not limited. For example, if the actual number of parallel threads that are run for a data synchronization node is ActualConcurrent and the data transmission rate for each slice of the node is Speed, the actual data transmission rate of the data synchronization node is calculated based on the following formula: ActualConcurrent × Speed. If you do not enable throttling for a data synchronization node, data is transmitted at the maximum transmission rate that is supported by the hardware and configurations of the node. The configurations of the node include the maximum number of parallel threads and the memory size that you configure for the data synchronization node. The configurations of hardware include the specifications of the data source and the network configurations.

If you enable throttling for a data synchronization node, Data Integration runs the node based on the maximum transmission rate and the maximum number of parallel threads that you configured for the node. When Data Integration develops an execution plan for the node, the data transmission rate for each slice of the node is calculated by dividing the maximum transmission rate by the maximum number of parallel threads and rounding up the quotient. The minimum data transmission rate for each slice is 1 MB/s. The actual maximum transmission rate of the node is calculated based on the following formula: Actual number of parallel threads that are run × Data transmission rate for each slice of the node. The following examples show how throttling works:

If the maximum number of parallel threads that is configured for a data synchronization node is 5 and the maximum transmission rate is 5 MB/s, the data synchronization node is split into 5 slices for parallel running, and the maximum transmission rate for each slice of the node is 1 MB/s.
- If the actual number of parallel threads that are run is 5, the actual maximum transmission rate of the data synchronization node is 5 MB/s, which is the same as the maximum transmission rate that is configured.
- If the actual number of parallel threads that can be run is limited by the performance of a data source, the actual number of parallel threads that are run may be less than the maximum number of parallel threads that is configured. If the actual number of parallel threads that are run is 1, the actual maximum transmission rate of the data synchronization node is 1 MB/s, which is less than the maximum transmission rate that is configured.
If the maximum number of parallel threads that is configured for a data synchronization node is 5 and the maximum transmission rate is 3 MB/s, the data synchronization node is split into 5 slices for parallel running, and the maximum transmission rate for each slice of the node is 1 MB/s, which is calculated by rounding up the quotient obtained after 3 is divided by 5.
- If the actual number of parallel threads that are run is 5, the actual maximum transmission rate of the data synchronization node is 5 MB/s, which is greater than the maximum transmission rate that is configured.
- If the actual number of parallel threads that are run is 1, the actual maximum transmission rate of the data synchronization node is 1 MB/s, which is less than the maximum transmission rate that is configured.
If the maximum number of parallel threads that is configured for a data synchronization node is 5 and the maximum transmission rate is 10 MB/s, the data synchronization node is split into 5 slices for parallel running, and the maximum transmission rate for each slice of the node is 2 MB/s, which is the quotient obtained after 10 is divided by 5.
- If the actual number of parallel threads that are run is 5, the actual maximum transmission rate of the data synchronization node is 10 MB/s, which is the same as the maximum transmission rate that is configured.
- If the actual number of parallel threads that are run is 1, the actual maximum transmission rate of the data synchronization node is 2 MB/s, which is less than the maximum transmission rate that is configured.
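
The arithmetic in the examples above can be reproduced with a short sketch. The function names are hypothetical and the code only restates the rule described in this section (divide, round up, minimum of 1 MB/s per slice). It is not the code that Data Integration runs.

```python
import math


def per_slice_rate_mbps(max_rate_mbps: float, max_threads: int) -> int:
    """Rate limit of one slice: the configured maximum rate divided by the
    configured maximum number of parallel threads, rounded up, and never
    less than 1 MB/s."""
    if max_threads < 1:
        raise ValueError("max_threads must be at least 1")
    return max(1, math.ceil(max_rate_mbps / max_threads))


def actual_max_rate_mbps(max_rate_mbps: float, max_threads: int, actual_threads: int) -> int:
    """Upper bound on the node's rate when `actual_threads` slices run at once."""
    return actual_threads * per_slice_rate_mbps(max_rate_mbps, max_threads)


# The three configurations used in the examples: (maximum rate in MB/s, maximum threads)
for max_rate, max_threads in [(5, 5), (3, 5), (10, 5)]:
    for actual in (5, 1):
        print(max_rate, max_threads, actual,
              actual_max_rate_mbps(max_rate, max_threads, actual))
```

Because of the rounding and the 1 MB/s floor, a configured maximum rate that is smaller than the number of parallel threads (3 MB/s with 5 threads) can be exceeded, and a node that runs fewer threads than configured stays below the configured rate.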

Distributed execution

This section provides answers to the following questions:

Question 1: In which scenarios do I need to enable the distributed execution mode for a data synchronization node?
Question 2: Why does my data synchronization node run at a slow speed even if the node is run in distributed execution mode?

If you do not enable the distributed execution mode for a data synchronization node, all configured parallel threads run on a single Elastic Compute Service (ECS) instance. If you enable the distributed execution mode for the node, the system splits the node into slices and distributes them to multiple ECS instances for parallel running. In this case, the more ECS instances are available, the higher the data synchronization speed. If you have a high requirement for data synchronization performance, you can run your data synchronization node in distributed execution mode. In distributed execution mode, fragmented resources of ECS instances can also be utilized, which improves resource utilization.

Limits and best practices:

If you use a large number of parallel threads to run your data synchronization node in distributed execution mode, excessive access requests are sent to the data sources. Therefore, before you use the distributed execution mode, you must evaluate the access loads on the data sources.
If your exclusive resource group contains only one ECS instance, we recommend that you do not run your data synchronization node in distributed execution mode. This is because the execution process is distributed on only one machine and no more ECS resources are available.
If you want to run a data synchronization node to synchronize only a small amount of data, we recommend that you configure a small number of parallel threads and an exclusive resource group that contains a single ECS instance for the node. We recommend that you do not enable the distributed execution mode for the data synchronization node.
The distributed execution mode can be enabled only if the maximum number of parallel threads that you configured is greater than or equal to 8.
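
The limits above can be summarized as a simple pre-check. This is a sketch under stated assumptions: the function names are hypothetical, and the check covers only the two conditions that this section quantifies (at least 8 parallel threads, and more than one ECS instance in the resource group). The access load on the data sources and the data volume still need to be evaluated separately.

```python
MIN_THREADS_FOR_DISTRIBUTED = 8  # threshold stated in this section


def can_enable_distributed(max_threads: int) -> bool:
    """Hard limit: the mode is available only with 8 or more parallel threads."""
    return max_threads >= MIN_THREADS_FOR_DISTRIBUTED


def is_distributed_recommended(max_threads: int, ecs_instances: int) -> bool:
    """Recommendation: the mode only helps if slices can be spread over
    more than one ECS instance."""
    return can_enable_distributed(max_threads) and ecs_instances > 1


print(can_enable_distributed(8))
print(is_distributed_recommended(16, 1))
print(is_distributed_recommended(16, 4))
```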

Maximum number of dirty data records allowed during data synchronization

This section provides answers to the following questions:

Question 1: What is dirty data?
Question 2: How do I configure the maximum number of dirty data records that are allowed during data synchronization?
Question 3: What is the relationship between the maximum transmission rate and dirty data?

Limits on dirty data control the behavior of a data synchronization node when dirty data is generated during the running of the node. If an exception occurs when a single data record is written to the destination, the data record is considered dirty data. Heterogeneous systems process data in different and complex ways, so any data record that fails to be written to a destination is treated as dirty data regardless of the cause. In some data synchronization scenarios, dirty data reduces data synchronization efficiency. For example, when you run a data synchronization node to synchronize data to a relational database, data records are written to the database in batches by default. If dirty data is generated during data synchronization, the write mode changes: data records are written to the database one by one to identify the dirty data records and ensure that the remaining data records are written as expected. Writing records one by one is slower than writing in batches. As a result, if a large number of dirty data records are generated during data synchronization, the overall data synchronization efficiency is greatly reduced.
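
The switch from batch writes to record-by-record writes can be illustrated with a minimal sketch. It uses the `sqlite3` module from the Python standard library as a stand-in for a relational destination. The table `target`, the function `write_with_fallback`, and the sample rows are all hypothetical, and the real writers in Data Integration may behave differently.

```python
import sqlite3
from typing import List, Optional, Tuple

Row = Tuple[int, Optional[str]]
INSERT_SQL = "INSERT INTO target (id, name) VALUES (?, ?)"


def write_with_fallback(conn: sqlite3.Connection, rows: List[Row]) -> List[Row]:
    """Write `rows` as one batch. If the batch fails, roll it back, write the
    rows one by one, and return the rows that could not be written."""
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.executemany(INSERT_SQL, rows)
        return []
    except sqlite3.Error:
        pass  # at least one record is dirty; fall back to single-row writes

    dirty: List[Row] = []
    for row in rows:
        try:
            with conn:
                conn.execute(INSERT_SQL, row)
        except sqlite3.Error:
            dirty.append(row)  # this record is dirty data
    return dirty


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# The second row violates the NOT NULL constraint, so the batch write fails.
dirty_rows = write_with_fallback(conn, [(1, "a"), (2, None), (3, "c")])
written = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
print(dirty_rows, written)
```

In this sketch a single bad record turns one batch statement into one statement per record, which is why many dirty records slow down the whole node.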

Data Integration allows you to configure settings related to dirty data records for most types of data sources. The following settings are supported:

If you do not configure the maximum number of dirty data records that are allowed during data synchronization, your data synchronization node can continue to run if dirty data records are generated. To apply this setting to the data synchronization node, you can leave the parameter that specifies the maximum number of dirty data records allowed empty when you configure the node.
If you set the maximum number of dirty data records that are allowed during data synchronization to 0, no dirty data records are allowed during data synchronization. If dirty data records are generated during data synchronization, the data synchronization node fails.
If you set the maximum number of dirty data records that are allowed during data synchronization to a positive integer N, a maximum of N dirty data records are allowed during data synchronization. If the number of dirty data records that are generated during data synchronization exceeds N, the data synchronization node fails.
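
The three settings above amount to one comparison. The following sketch restates them. The function name `node_fails` is hypothetical, and `None` is used here to represent a parameter that is left empty.

```python
from typing import Optional


def node_fails(dirty_count: int, max_dirty_records: Optional[int]) -> bool:
    """Return True if the node fails for the given number of dirty records.

    max_dirty_records:
      None -> parameter left empty, dirty data never fails the node
      0    -> any dirty record fails the node
      N    -> the node fails once more than N dirty records are generated
    """
    if max_dirty_records is None:
        return False
    return dirty_count > max_dirty_records


print(node_fails(1000, None))
print(node_fails(1, 0))
print(node_fails(10, 10))
print(node_fails(11, 10))
```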

Best practices:

If you want to run a data synchronization node to synchronize data to a data source that has high requirements for data quality, such as a relational database or a Hologres, ClickHouse, or AnalyticDB for MySQL data source, we recommend that you set the maximum number of dirty data records allowed during data synchronization to 0. This can help you identify data quality risks at the earliest opportunity. Relational databases include MySQL, SQL Server, PostgreSQL, Oracle, PolarDB, and PolarDB-X databases.
If you want to run a data synchronization node to synchronize data to a data source that does not have high requirements for data quality, you can leave the maximum number of dirty data records allowed during data synchronization unconfigured, or set it to a value that meets your business requirements. This can reduce the O&M costs that are required to handle dirty data records.
You can configure an alert rule whose trigger condition is a node failure or delay for a key node. This can help you identify issues that occur on the node at the earliest opportunity.
If your data synchronization node can be rerun, we recommend that you configure settings to enable the node to be rerun after an error occurs on the node. This helps prevent occasional environment issues from blocking the running of the node.

Quota for the number of connections supported for data sources

This section provides answers to the following questions:

Question 1: What is the quota for the number of connections supported for data sources? How do I appropriately configure the quota?
Question 2: Why do the batch synchronization nodes generated by a data synchronization solution run at a slow speed and get stuck in the Submit state?

The quota for the number of connections supported for data sources includes the following items:

Maximum number of parallel threads supported for data write to a destination: specifies the maximum number of parallel threads that can be used to write data to a destination in a real-time synchronization node. You must specify an appropriate number based on the specifications of the resource group that you use and the data write capabilities of the destination. The maximum value is 32, and the default value is 3.
Maximum number of connections supported for data read from a source: When the batch synchronization nodes generated by a data synchronization solution are run to synchronize full data from the source, Java Database Connectivity (JDBC) connections to the source are established to read full historical data. The maximum number specifies the upper limit for the number of JDBC connections that are allowed for the source. This setting prevents the parallel running of a large number of batch synchronization nodes from reaching the maximum number of connections allowed by the connection pool of the source and ensures the stability of the source. You must configure this setting based on the resources of the source. The default value is 15. If the maximum number of connections that you configure for your data synchronization solution does not meet your business requirements, the batch synchronization nodes generated by the data synchronization solution may be stuck in the Submit state. To resolve this issue, you can change the scheduling time of other data synchronization nodes that use the same data source or appropriately increase the maximum number of connections.
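
The effect of the connection quota on batch synchronization nodes can be pictured with a simplified model. It is not the actual scheduler: the function `admit_nodes` is hypothetical, and the sketch assumes that each node needs a known number of JDBC connections and that nodes are considered in the order in which they are submitted.

```python
from typing import List, Tuple


def admit_nodes(requests: List[Tuple[str, int]],
                max_connections: int = 15) -> Tuple[List[str], List[str]]:
    """Split nodes into those that can run now and those that must wait.

    requests: (node name, JDBC connections needed) in submission order.
    max_connections: quota for the source; 15 is the default in this section.
    """
    running: List[str] = []
    waiting: List[str] = []
    in_use = 0
    for name, needed in requests:
        if in_use + needed <= max_connections:
            running.append(name)
            in_use += needed
        else:
            waiting.append(name)  # modeled as stuck in the Submit state
    return running, waiting


print(admit_nodes([("node_a", 8), ("node_b", 6), ("node_c", 4)]))
print(admit_nodes([("node_a", 8), ("node_b", 6), ("node_c", 4)], max_connections=20))
```

In this model, a waiting node can start either when the quota is increased or when other nodes that use the same source are scheduled at a different time, which matches the two remedies described above.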