Speed up or slow down the batch synchronization process

This topic describes the factors that affect the speed of data synchronization and how to adjust the parallelism for synchronization tasks to maximize the synchronization speed. This topic also describes bandwidth throttling settings and slow data synchronization scenarios.

Overview

The speed of data synchronization is affected by many factors such as the synchronization task configurations, database performance, and network. For more information, see Factors that affect the speed of data synchronization.
The slow speed of a data synchronization task may occur at different phases of the synchronization process. This topic describes the possible problems in different phases and the solutions to these problems. For more information, see Scenarios and solutions to slow data synchronization.
If the data synchronization performance of a database is limited, faster data synchronization is not necessarily good for the database and may overstress the database and affect the production. Data Integration provides the Bandwidth Throttling parameter, which can be used to control the data synchronization speed. You can configure this parameter based on your business requirements. For more information, see Limit the synchronization speed.

Factors that affect the speed of data synchronization

The speed of data synchronization is affected by many factors such as the source database, destination database, and synchronization task configurations. The factors that you can adjust and optimize are the source database performance, destination database performance, database loads, and network.

The following table describes the factors that affect the speed of data synchronization.

Factor	Description

Factor	Description
Source	Database performance: the performance of the CPU, memory, SSD, network, and hard disk. Parallelism: A high parallelism results in a heavy database load. A database that has better performance can support more parallel threads. In this case, you can specify a large parallelism value for a synchronization task. Network: the bandwidth (throughput) and speed of the network.
Resource group for scheduling used to run a synchronization task	A resource group for scheduling is used to issue a synchronization task to a resource group for Data Integration for running. The usage of the resource group for scheduling affects the overall synchronization efficiency of the synchronization task. For more information about the task issuing mechanism, see Mechanism for issuing tasks.
Configurations of a synchronization task	Synchronization speed: whether an upper limit is specified for the synchronization speed. Parallelism: the maximum number of parallel threads that are used to read data from the source or write data to the destination. Whether the current task is waiting for resources. Bandwidth throttling: The bandwidth of a single thread is 1,048,576 Byte/s. If your business is sensitive to network speed, timeout occurs. We recommend that you specify a small bandwidth limit. Whether an index is created for query statements.
Destination	Performance: the performance of the CPU, memory, SSD, network, and hard disk. Load: Excessive loads on the destination database affect the write efficiency of the synchronization task. Network: the bandwidth (throughput) and speed of the network.

Scenarios and solutions to slow data synchronization

Note

For more information about logs of batch synchronization tasks, see Analyze run logs generated for a batch synchronization task.

Scenario	Problem description	Possible cause	Solution

Scenario	Problem description	Possible cause	Solution
Wait for scheduling resources	Problem description 1: The logs of a synchronization task show that the task is waiting for gateway resources. Problem description 2: The General tab of the directed acyclic graph (DAG) of an instance generated for an auto triggered task shows that the instance waits for resources for a long period of time.	A resource group for scheduling is used to issue the synchronization task to a compute engine instance for running. Therefore, if the number of synchronization tasks that are run in parallel on the resource group for scheduling reaches the upper limit, the current task must wait until the synchronization tasks finish running and the resources used by the tasks are released.	On the Intelligent Diagnosis page, you can view the information about the tasks that are using the resources of the resource group while the current task is waiting for resources. Note If you use the shared resource group for scheduling, we recommend that you migrate the task to an exclusive resource group for scheduling for running.
Wait for resources in a resource group for Data Integration	The logs of a synchronization task show that the task is in the WAIT state.	The remaining resources in the resource group for Data Integration that you want to use to run the current synchronization task are insufficient to run the task. For example, a resource group for Data Integration supports a maximum of eight parallel threads. Three synchronization tasks are configured to run on the resource group for Data Integration. Three parallel threads are configured for each of the synchronization tasks. If two of the tasks are run in parallel on the resource group, the resource group for Data Integration can support two more parallel threads. In this case, the remaining task has to wait for resources in the resource group due to insufficient resources, and the logs of the task show that the task is in the WAIT state.	You can use one of the following methods to resolve the issue that other synchronization tasks in the resource group for Data Integration are using a large number of resources: Note On the Intelligent Diagnosis page, you can view the resource usage and the information about the tasks that are using the resources of the resource group while the current task is waiting for resources. The maximum number of parallel threads supported by a resource group for Data Integration varies based on the specifications of the resource group. For more information, see Overview of an exclusive resource group for Data Integration. Check whether the tasks that are using the resources in the resource group for Data Integration are blocked or significantly slow down. If the tasks have any issues, you must resolve the issues or suspend the tasks. If the tasks are not blocked, wait until the tasks finish running and the resources used by the tasks are released, and then run the current task. Find the tasks that are using the resources in the resource group for Data Integration and ask the owners of the tasks to reduce the parallel threads for the tasks. Reduce the parallel threads that you specified for the current batch synchronization task. Then, commit and deploy the task again. Scale out the resource group for Data Integration. For more information, see Scale out or in a resource group.
Slow data synchronization speed	The logs of a synchronization task show that the task is in the RUN state but the data synchronization speed is 0. If the synchronization task is running but remains in such a state for an extended period of time, we recommend that you click the link to the right of Detail log url to view the execution details of the task. If the value of the WaitReaderTime parameter is large in the task logs, it takes a long time to wait for data to be returned from the source.	An inappropriate shard key is specified. The SQL statements that are generated based on the shard key to read data from the source database are executed for an extended period of time. The SQL statements that are used to read data from the source database take an extended period of time to execute. For some Readers, you need to configure the parameters such as where and querySql. The parameters slow down the execution of the SQL statements. Scenario: A full table scan slows down the data synchronization because no index is added in the WHERE clause. Database loads are high when the synchronization task is run. Network issues such as the bandwidth (throughput) and network speed issues exist. Note If you synchronize data over the Internet, the data synchronization speed cannot be ensured.	Resolve the issue of the long execution time of SQL statements: Take note of the following items when you configure the SQL statements that you want to execute before or after synchronization: Make sure that an index is added for the fields that you use for data filtering. This prevents the synchronization task from scanning a full table. Prevent or reduce complex processing such as function-based processing. If complex processing is necessary, try to perform operations in the source or destination database before data synchronization. Check whether the volume of data in source tables is large. If it is, use multiple synchronization tasks to synchronize the data. Check the database logs to find the SQL statements that are blocked and contact the database administrator to resolve the issue. Check the loads on the source or destination database at the time when the synchronization task is run.
	The logs of a synchronization task show that the task is in the RUN state but the data synchronization speed is 0. If the synchronization task is running but remains in such a state for an extended period of time, we recommend that you click the link to the right of Detail log url to view the execution details of the task. If the value of the WaitWriterTime parameter is large in the task logs, it takes a long time to write data to the destination.	The SQL statements that you configure to execute before and after data is written to the destination database take an extended period of time to execute. For some Writers, you need to configure the parameters such as preSql and postSql. The parameters slow down the execution of the SQL statements. Database loads are high when the synchronization task is run. Network issues such as the bandwidth (throughput) and network speed issues exist. Note If you synchronize data over the Internet, the data synchronization speed cannot be ensured.
	The logs of a synchronization task show that the task is in the RUN state but the data synchronization speed is slow.	An inappropriate shard key is specified for a synchronization task that is used to synchronize data from a relational database. In this case, the parallelism settings become invalid for the task. The task runs a single thread to synchronize data. The maximum number of parallel threads is set to a small value. A large amount of dirty data is generated during data synchronization, which affects the synchronization speed. Database performance issues exist. Note A database that has better performance can support more parallel threads. In this case, you can specify a large parallelism value for a synchronization task. Network issues such as the bandwidth (throughput) and network speed issues exist. Note If you synchronize data over the Internet, the data synchronization speed cannot be ensured.	Specify an appropriate shard key. For more information about how to specify a shard key for a synchronization task, see View run logs generated for a batch synchronization task. Specify the maximum number of parallel threads for each synchronization task that is run on a resource group for Data Integration based on the total number of parallel threads that are supported by the resource group. You can increase the number of parallel threads for the current synchronization task based on your business requirements. You can configure the parallelism for a synchronization task on the codeless UI. The following example shows how to configure the parallelism in the code editor: Note The maximum number of parallel threads supported by a resource group for Data Integration varies based on the specifications of the resource group. For more information, see Overview of an exclusive resource group for Data Integration. Process dirty data. For more information about the definition of dirty data, see Terms. If synchronization tasks are run in distributed execution mode, make sure that the value obtained after the number of parallel threads for the tasks is divided by the number of ECS instances in a resource group for Data Integration is less than or equal to the maximum number of parallel threads that are supported by a single ECS instance in the resource group for Data Integration. If you want to synchronize data across clouds or regions, we recommend that you synchronize data over an internal network after network connections are established. For information about network connectivity solutions, see Establish a network connection between a resource group and a data source. Check database loads.

Limit the synchronization speed

By default, bandwidth throttling is disabled. In a synchronization task, data is synchronized at the maximum transmission rate given the parallelism that is configured for the task. Excessively fast synchronization may overstress the database and thus affect the production. Therefore, Data Integration allows you to limit the synchronization speed and optimize the configuration based on your business requirements. If bandwidth throttling is enabled, we recommend that you limit the maximum transmission rate to 30 MB/s. The following example shows how to configure an upper limit for the synchronization speed in the code editor, in which the transmission rate is 1 MB/s:

"setting": {
      "speed": {
         "throttle": true // Indicates that bandwidth throttling is enabled. 
        "mbps": 1, // The synchronization speed. 
      }
    }

The valid values of the throttle parameter are true and false.
- If you set the throttle parameter to true, bandwidth throttling is enabled. In this case, you must configure the mbps parameter. If you do not configure the mbps parameter, an error is returned when the synchronization task is run or data is synchronized at an abnormal speed.
- If you set the throttle parameter to false, bandwidth throttling is disabled, and you do not need to configure the mbps parameter.
The bandwidth value is a Data Integration metric and does not represent the actual elastic network interface (ENI) traffic. In most cases, the ENI traffic is one to two times the channel traffic. The actual ENI traffic varies based on the serialization of the data storage system.
A semi-structured file does not have shard keys. If multiple files exist, you can configure the maximum transmission rate for a task to increase the synchronization speed. However, the maximum transmission rate is limited by the number of files.
For example, the maximum transmission rate for n files can be set to n MB/s.
- If you set the maximum transmission rate to (n + 1) MB/s, the files are still synchronized at a speed of n MB/s.
- If you set the maximum transmission rate to (n - 1) MB/s, the files are synchronized at a speed of (n - 1) MB/s.
A table in a relational database can be split based on the maximum transmission rate only after you specify the maximum transmission rate and shard key. In most cases, relational databases support only numeric-type shard keys. However, Oracle databases support numeric-type and string-type shard keys.

FAQ

What do I do if a full scan for a MaxCompute table slows down data synchronization because no index is added in the WHERE clause?
The BatchSize or maxfilesize parameter determines the number of data records to synchronize at a time. We recommend that you specify an appropriate parameter value to reduce the interactions between Data Integration and databases and increase throughput. If you set the BatchSize or maxfilesize parameter to an excessively large value, an out of memory (OOM) error may occur during data synchronization. To resolve the issue, see Batch synchronization.

Appendix: Check the actual parallelism

On the log details page of a batch synchronization task, find the entry in the JobContainer - Job set Channel-Number to X channels format. In this example, the value of X is 2, which indicates that two parallel threads are used by the batch synchronization task. 查看实际并发

Appendix: Relationship between the parallelism and resource usage

This section describes the relationships between the parallelism and CPU utilization and between the parallelism and memory usage of exclusive resource groups.

Relationship between the parallelism and CPU utilization
For exclusive resource groups, the ratio of the parallelism to CPU utilization is 1:0.5. For example, if the exclusive resource group that you purchase uses an ECS instance with the specifications of 4 vCPUs and 8 GiB of memory, the parallelism of the exclusive resource group is 8. You can run a maximum of eight batch synchronization tasks with a parallelism of 1 or four batch synchronization tasks with a parallelism of 2 at a time.
If the task that you want to run on the exclusive resource group requires more threads than the threads available in the exclusive resource group, the task needs to wait until the tasks that are using the resources of the resource group finish running and sufficient threads are available for the task.
Note
If the task that you want to run on the exclusive resource group requires more threads than the maximum number of threads that can be provided by the exclusive resource group, the task enters the state of waiting for resources. For example, if you want to run a task that requires 10 parallel threads on the exclusive resource group that uses an ECS instance with the specifications of 4 vCPUs and 8 GiB of memory, the task will permanently wait for resources. The exclusive resource group allocates resources to tasks based on the sequence in which the tasks are committed. Therefore, DataWorks cannot run tasks that are committed later than this task.
Relationship between the parallelism and memory usage
For exclusive resource groups, the minimum memory size that can be allocated to a synchronization task is calculated by using the following formula: 768 MB + (Parallelism - 1) × 256 MB. The maximum memory size that can be allocated to a synchronization task is 8,029 MB. If you specify a memory size when you configure the synchronization task, the specified memory size overrides the default settings of the exclusive resource group. If you configure the synchronization task in the code editor, you can specify a memory size for the task by configuring the jvmOption parameter in the $.setting.jvmOption path in the JSON configurations.
To ensure that all tasks that are running on an exclusive resource group can be run as expected, we recommend that you set the total memory size of the tasks to at least 1 GB less than the total memory size of all ECS instances that are deployed in the exclusive resource group. If this condition is not met, the Linux OOM killer forcefully stops the tasks that are running.
Note
If you do not configure the required memory size in the code editor, you need to consider only the parallelism limits when you commit synchronization tasks.

Appendix: Synchronization speed

The read and write speeds vary based on data sources. This section describes the average speed for a single thread to read data from or write data to each type of data source.

Average speed for a single thread to write data to each type of data source

Writer	Average write speed for a single thread (KB/s)

Writer	Average write speed for a single thread (KB/s)
AnalyticDB for PostgreSQL	147.8
AnalyticDB for MySQL	181.3
ClickHouse	5259.3
DataHub	45.8
DRDS	93.1
Elasticsearch	74.0
FTP	565.6
GDB	17.1
HBase	2395.0
hbase20xsql	37.8
HDFS	1301.3
Hive	1960.4
HybridDB for MySQL	323.0
HybridDB for PostgreSQL	116.0
Kafka	0.9
LogHub	788.5
MongoDB	51.6
MySQL	54.9
ODPS	660.6
Oracle	66.7
OSS	3718.4
OTS	138.5
PolarDB	45.6
PostgreSQL	168.4
Redis	7846.7
SQLServer	8.3
Stream	116.1
TSDB	2.3
Vertica	272.0

Average speed for a single thread to read data from each type of data source

Reader	Average read speed for a single thread (KB/s)

Reader	Average read speed for a single thread (KB/s)
AnalyticDB for PostgreSQL	220.3
AnalyticDB for MySQL	248.6
DRDS	146.4
Elasticsearch	215.8
FTP	279.4
HBase	1605.6
hbase20xsql	465.3
HDFS	2202.9
Hologres	741.0
HybridDB for MySQL	111.3
HybridDB for PostgreSQL	496.9
Kafka	3117.2
LogHub	1014.1
MongoDB	361.3
MySQL	459.5
ODPS	207.2
Oracle	133.5
OSS	665.3
OTS	229.3
OTSStream	661.7
PolarDB	238.2
PostgreSQL	165.6
RDBMS	845.6
SQLServer	143.7
Stream	85.0
Vertica	454.3

Overview

Factors that affect the speed of data synchronization

Scenarios and solutions to slow data synchronization

Limit the synchronization speed

FAQ

Appendix: Check the actual parallelism

Appendix: Relationship between the parallelism and resource usage

Appendix: Synchronization speed

Sales Support

Technical Support

Connect & Report Abuse

About Alibaba Cloud

Our Global Network

Quick Start

Global Offices

Olympic Games Paris 2024 New

Stade Roland Garros – Glitz from the Past New

Place de la Concorde – “Breaking” the Barriers New

Vaires-sur-Marne Nautical Stadium – Sports with Sustainability New

International Broadcast Center – Images, Sounds, and Data that Captivate Billions New

Customer Success Stories New

Trust Center

Security & Compliance Center

Cloud Compliance Resources

Security Compliance FAQs

Product & Feature Update New

Cloud Forward

Press Room

Alibaba Cloud e-Magazine New

Alibaba Cloud in Analyst Research

Notice

Go Global Service New

Go Global Alliance with Alibaba Cloud

China Gateway Hot

Information Compliance

China Gateway - MLPS 2.0 Compliance New

China Gateway - Networking

China Gateway - Global Application Acceleration New

China Gateway - Security

China Gateway - Data Security New

ICP Support Hot

China Gateway - Omnichannel Data Mid-End New

China Gateway - Organizational Data Mid-End New

China Gateway - Business Mid-End New

China Gateway - AI Service for Conversational Chatbots New

China Gateway - Online Education

China Gateway - Domain Registration

Work at Alibaba Cloud

Experienced Professionals

Students and Graduates

Free Trial

Pricing

Promo Center

Price Reduction

Pay Less and Deploy More

FinOps

Elastic Compute Service (ECS)

Simple Application Server (SAS)

Elastic GPU Service

Elastic Desktop Service (EDS)

Object Storage Service (OSS)

Cloud Enterprise Network (CEN)

Web Application Firewall (WAF)

Domain Names

Container Compute Service (ACS)

Secure Access Service Edge (SASE)

Intelligent Media Services(IMS)

Edge Security Acceleration (ESA)(Original DCDN)

Intelligent Media Management

DingTalk Enterprise

YiDA

Alibaba Cloud Model Studio

Apsara Prime - For Easy Cloud Product Selection

Alibaba Cloud ECS - Cater All Your Cloud Hosting Needs

1TB CDN—Get Free 1 TB Outbound Traffic Plan Now

Security—Under Attack? Get Free Security Support

Short Message Service - Free Testing is Available

Elastic Compute Service (ECS) Hot

CloudBox

Compute Nest

Dedicated Host Hot

ECS Bare Metal Instance

Elastic Desktop Service (EDS) Featured

Elastic GPU Service Featured

Simple Application Server (SAS) Hot

Auto Scaling

Cloud Phone Beta

Elastic Desktop Service (EDS) Featured

Batch Compute

Elastic High Performance Computing (E-HPC)

Super Computing Cluster (SCC)