By Chushi
In the current trend of enterprise digital transformation, there is an increasing need for online statistical analysis of extensive business data, which traditional databases often fail to provide in terms of performance and real-time analysis. With the growth of the ClickHouse community, more developers are looking to speed up analysis by syncing data from MySQL to ClickHouse. After testing the MaterializedMySQL engine released by ClickHouse, I encountered several limitations:
To overcome these restrictions, Data Transmission Service (DTS) team has developed a synchronization link from MySQL to ClickHouse. I will guide you through how to use DTS to smoothly and reliably sync data from MySQL to ClickHouse.
To address the problems of using the MaterializedMySQL engine, Alibaba Cloud DTS has developed a synchronization link between MySQL and ClickHouse. Compared with MaterializedMySQL, DTS for MySQL-to-ClickHouse synchronization has the following advantages:
When using Alibaba Cloud DTS to synchronize data from MySQL to ClickHouse, beginners may encounter challenges during preparation, task creation, and ongoing maintenance. This article aims to provide best practices and address common pitfalls when using Alibaba Cloud DTS for migrating data from MySQL to ClickHouse, ensuring a worry-free data migration experience and efficient data analysis with ClickHouse.
When you are using Alibaba Cloud DTS to synchronize data from MySQL to ClickHouse, you may encounter a series of issues during the preparation, task creation, and O&M. This article compiles best practices and common errors to assist you in using Alibaba Cloud DTS for data migration with ease and to facilitate an efficient data analysis experience with ClickHouse.
When handling ClickHouse link tickets, insufficient permissions of the database account may cause DTS tasks to fail. During the precheck, DTS verifies the permissions of the database account used to configure the task. If the verification fails, the task cannot be initiated. Before configuring a task, you need to grant the following permissions to the source and destination database accounts:
Database | Schema Synchronization | Full Synchronization | Incremental Synchronization |
MySQL | The SELECT permission | The SELECT permission | The SELECT permission on the objects to be synchronized REPLICATION CLIENT, REPLICATION SLAVE, and SHOW VIEW The permissions to create databases and tables, which allow DTS to create a database named dts to record heartbeat data during data synchronization. |
ClickHouse | The read and write permissions. |
DTS MySQL-to-ClickHouse links have certain requirements for dblist. If you use the DTS console to configure tasks, you will get a standardized dblist. If you need to use DTS OpenAPIs to configure tasks for business reasons, you must specify dblist in the following format:
dbList={
"source_db_name": {
"name": "target_db_name",
"all": false,
"state": "open",
"Table": {
"source_table_name": {
"name": "target_table_name",
"all": true,
"primary_key": "id",
"message_key": "id",
"partition_key": "sipHash64(id)",
"part_key": "id",
"type": "partition"
}
},
"num": 1
}
}
Among them,
The schema of ClickHouse has a significant impact on the data writing performance and stability. When you use DTS to migrate data from MySQL to ClickHouse, you must use the schema migration feature provided by DTS to avoid errors caused by schemas that are not supported by DTS.
If you need to create a custom schema for business reasons, it must meet the following requirements of DTS:
Column | Data Type | Default Value | Description |
_sign | Int8 | 1 | Record the type of DML operations: Insert: The value is 1. Delete: The value is -1. Update: The value is 1. |
_version | UInt64 | 1 | The timestamp when data is written to ClickHouse. |
CK Type | Cluster Type | Whether the Source Has a Primary Key | Destination CK Engine |
Community Edition | Single Replica | Yes | MergeTree |
No | ReplacingMergeTree | ||
Multiple Replicas | Yes | ReplicatedMergeTree | |
No | ReplicatedReplacingMergeTree |
When you configure a MySQL-to-ClickHouse migration task, you can specify a column as the partition key to separate data. ClickHouse creates a new file directory for each key value. This is a data management trick that allows users to logically separate data in a table. For example, separate data by day. DROP PARTITION allows you to quickly delete a data subset. However, if you select an inappropriate partition key, the number of parts created by ClickHouse may exceed the limits of the parts_to_throw_insert and max_parts_in_total parameters. In this case, the "Too many inactive parts (N). Parts cleaning are processing significantly slower than inserts" error may occur.
To avoid the preceding issue, the DTS schema migration has the following optimizations:
If you specify a partition key, the specified column is used as the partition key.
If you do not use DTS to migrate the tables you created, you must select an appropriate partition key to avoid the "Too many inactive parts" error during data writing to DTS.
During data writing, the distributed table distributes data to the local table of each host node based on the rules of the shard key. The shard key is required to return a value of integer type, which can be a specific integer column field. If the shard key value is not of integer type, an error is reported when a table is created for schema migration. For example, if you use the String type as the shard key, a message Sharding expression has type String, but should be one of integer type. is reported.
To avoid the preceding issue, the DTS schema migration is optimized as follows:
If you specify the shard key,
ClickHouse does not require a unique primary key. This means that even if multiple rows of data have the same primary key, ClickHouse allows the data to be written. DTS may introduce duplicate data in the following two scenarios:
Based on the preceding principles, we can use the following methods to remove duplicate data:
select * from table_name final where _sign>0;
The time types of ClickHouse (Date, Date32, DateTime, and DateTime64) are based on Unix timestamps. The following are the time ranges of the DateTime, DateTime64, Date32, and Date data types in ApsaraDB for ClickHouse. If the time in ApsaraDB RDS for MySQL is not within these ranges, it will be converted to these ranges. Therefore, the time in ClickHouse may be inconsistent with that in MySQL.
ClickHouse Data Type | Minimum Value | Maximum Value |
DateTime64 | 1925-01-01 08:00:00 | 2283-11-12 07:59:59 |
DateTime | 1970-01-01 08:00:00 | 2106-02-07 14:28:15 |
Date32 | 1925-01-01 | 2283-11-11 |
Date | 1970-01-01 | 2149-06-06 |
Recommended Spec & Free Trial for ApsaraDB for ClickHouse:
https://www.alibabacloud.com/product/clickhouse
ApsaraDB - November 21, 2023
ApsaraDB - October 23, 2023
Alibaba Cloud Indonesia - January 20, 2023
Alibaba EMR - January 10, 2023
Alibaba Cloud Data Intelligence - August 8, 2024
Alibaba Clouder - May 27, 2019
Secure and easy solutions for moving you workloads to the cloud
Learn MoreSupports data migration and data synchronization between data engines, such as relational database, NoSQL and OLAP
Learn MoreTair is a Redis-compatible in-memory database service that provides a variety of data structures and enterprise-level capabilities.
Learn MoreMigrate your legacy Oracle databases to Alibaba Cloud to save on long-term costs and take advantage of improved scalability, reliability, robust security, high performance, and cloud-native features.
Learn MoreMore Posts by ApsaraDB