Self-managed ClickHouse clusters often involve high stability risks, difficult O&M caused by poor cluster scalability and challenging version updates, and weak disaster recovery capabilities. As a result, an increasing number of users want to migrate their data from self-managed ClickHouse clusters to a platform as a service (PaaS) offering in the cloud. This topic describes how to migrate data from a self-managed ClickHouse cluster to an ApsaraDB for ClickHouse cluster that runs Community-compatible Edition.
Prerequisites
Destination cluster:
The cluster runs Community-compatible Edition.
A database account is created for the cluster. For more information, see Manage database accounts in a Community-compatible Edition cluster.
The account is a privileged account. For more information, see Modify the permissions of a database account in the ApsaraDB for ClickHouse console or by using SQL statements.
Self-managed cluster:
A database account is created for the cluster.
The account has the read permissions on databases and tables and the permissions to run the SYSTEM command.
The destination cluster is connected to the self-managed cluster.
If the self-managed cluster and the destination cluster are in the same virtual private cloud (VPC), you must add the IP addresses of all nodes of the destination cluster and the IPv4 CIDR block of the vSwitch of the destination cluster to the whitelist of the self-managed cluster.
For more information about how to configure a whitelist for an ApsaraDB for ClickHouse cluster, see Configure a whitelist.
For more information about how to configure a whitelist for a self-managed cluster, see the related documentation.
You can execute the SELECT * FROM system.clusters; statement to query the IP addresses of all nodes of an ApsaraDB for ClickHouse cluster.
To obtain the IPv4 CIDR block of the vSwitch of an ApsaraDB for ClickHouse cluster, perform the following operations:
In the ApsaraDB for ClickHouse console, find the cluster that you want to manage and click the cluster name. In the Network Information section on the Cluster Information page, obtain the vSwitch ID.
Go to the vSwitch page, search for the vSwitch by using the vSwitch ID, and obtain the IPv4 CIDR block of the vSwitch.
If the self-managed cluster and the destination cluster are in different VPCs or the self-managed cluster is in a data center or from a third-party service provider, you must establish a connection between the clusters. For more information, see What do I do if a connection fails to be established between the destination cluster and the data source?
Verify business compatibility
Before migration, we recommend that you verify the business compatibility and system performance in a test environment. After you complete the verification, you can migrate data in the production environment. The verification step is critical because it helps you identify and resolve potential issues in advance, ensure a smooth migration, and avoid adverse impacts on your production environment.
For more information about the migration procedure, see the following sections.
For more information about the business compatibility and performance bottleneck analysis, see Compatibility and performance bottleneck analysis and solutions for self-managed ClickHouse migration to ApsaraDB for ClickHouse.
Select a method
| Migration method | Advantage | Disadvantage | Scenario |
| --- | --- | --- | --- |
| In the ApsaraDB for ClickHouse console | Allows you to migrate data in a visualized manner and frees you from manually migrating metadata. | Migrates only full or incremental data of a cluster. You cannot migrate specific databases, tables, or historical data. | An entire cluster needs to be migrated. |
| Manual migration | Allows you to specify the databases and tables to be migrated. | Requires complex operations and manual migration of metadata. | Only specific databases or tables need to be migrated. |
Procedure
In the ApsaraDB for ClickHouse console
Limits
The version of the destination cluster must be 21.8 or later.
Usage notes
During migration:
The merging of databases and tables in the destination cluster is suspended during the migration process, but continues in the self-managed cluster.
Note: If the migration takes too long, excessive metadata accumulates in the destination cluster. We recommend that you keep the duration of a migration task to no more than five days.
The cluster name of the destination cluster must be default. If the cluster name of the self-managed cluster is not default, the system automatically changes the cluster definition in the distributed tables to default during migration.
Migration content:
You can migrate the following objects from the self-managed cluster: databases, tables, data dictionaries, and materialized views.
You cannot migrate Kafka or RabbitMQ tables.
Important: To ensure that Kafka and RabbitMQ data is not split between the two clusters, you must delete the Kafka and RabbitMQ tables in the self-managed cluster before you create the corresponding tables in the destination cluster, or use different consumer groups.
You can migrate only the schemas of non-MergeTree tables such as external tables and log tables.
Note: If the self-managed cluster contains non-MergeTree tables, the non-MergeTree tables in the destination cluster have only a table schema and no business data after migration. You can use the remote function to migrate the business data. For more information, see Use the remote function to migrate data.
Cold data issues: The migration speed of cold data is relatively slow. We recommend that you clean up the cold data in the self-managed cluster so that the total amount of cold data does not exceed 1 TB. Otherwise, the migration may take too long and fail. You can use the query below to estimate the per-table data size.
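To estimate how much data each table holds before migration, including data on cold storage, you can run a query similar to the following sketch on the self-managed cluster. The query uses the standard system.parts table; the disk_name value that identifies cold storage depends on your storage policy, so adjust the grouping or add a filter to match your environment.
-- Estimate the on-disk size of each table, broken down by disk.
SELECT database, table, disk_name, formatReadableSize(sum(bytes_on_disk)) AS size_on_disk
FROM system.parts
WHERE active
GROUP BY database, table, disk_name
ORDER BY sum(bytes_on_disk) DESC;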
Impact on clusters
Self-managed cluster:
When you read data from the self-managed cluster, the CPU utilization and memory usage of the cluster increase.
DDL operations are not allowed.
Destination cluster:
When you write data to the destination cluster, the CPU utilization and memory usage of the cluster increase.
DDL operations are not allowed on the databases and tables that you want to migrate. This limit is not imposed on the databases and tables that do not need to be migrated.
The merging of databases and tables is suspended during the migration process. This limit is not imposed on the databases and tables that do not need to be migrated.
The destination cluster is restarted before the migration begins and after the migration is complete.
Procedure
Step 1: Enable the system tables for the self-managed cluster
Before migration, you must modify the config.xml file of the self-managed cluster to enable the system.part_log and system.query_log system tables, which are required for incremental migration. The following code provides an example of how to modify the file:
system.part_log
<part_log>
<database>system</database>
<table>part_log</table>
<partition_by>event_date</partition_by>
<ttl>event_date + INTERVAL 15 DAY DELETE</ttl>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
</part_log>
system.query_log
<query_log>
<database>system</database>
    <table>query_log</table>
<partition_by>event_date</partition_by>
<ttl>event_date + INTERVAL 15 DAY DELETE</ttl>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
</query_log>
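After you modify the config.xml file and restart the self-managed cluster, you can optionally confirm that the system tables exist and are being populated. The following statements are a minimal check; note that the part_log and query_log tables are created only after the first part or query events are written.
SELECT name FROM system.tables WHERE database = 'system' AND name IN ('part_log', 'query_log');
SELECT count() FROM system.query_log;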
Step 2: Configure the destination cluster to be compatible with the self-managed cluster version
Make sure that the destination cluster is compatible with the self-managed cluster. This minimizes the need for business modifications after the migration.
Query the version numbers of the destination cluster and the self-managed cluster and check whether the version numbers are the same.
Log on to the clusters separately and execute the following statement to query the version numbers of the clusters. For more information, see Database connectivity.
SELECT version();
If the version numbers are different, log on to the destination cluster and set the compatibility parameter to the version number of the self-managed cluster. Make sure that the destination cluster is feature-compatible with the self-managed cluster. Sample code:
SET GLOBAL compatibility = '22.8';
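To confirm that the setting has taken effect, you can query its current value on the destination cluster. This is a quick check that assumes your destination version exposes the compatibility setting in the system.settings table.
SELECT name, value FROM system.settings WHERE name = 'compatibility';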
(Optional) Step 3: Enable the MaterializedMySQL engine for the destination cluster
If the self-managed cluster contains tables that use the MaterializedMySQL engine, execute the following statement to enable the MaterializedMySQL engine for the destination cluster:
SET GLOBAL allow_experimental_database_materialized_mysql = 1;
The ClickHouse community no longer maintains the MaterializedMySQL engine. We recommend that you use Data Transmission Service (DTS) to synchronize MySQL data to the ApsaraDB for ClickHouse cluster after the migration. During the synchronization, DTS replaces the MaterializedMySQL tables with ReplacingMergeTree tables. For more information, see MaterializedMySQL compatibility.
For more information about how to use DTS to migrate data from a MySQL database to an ApsaraDB for ClickHouse cluster, see the related DTS topics.
Step 4: Create a migration task
Log on to the ApsaraDB for ClickHouse console.
On the Clusters page, click the Clusters of Community-compatible Edition tab and click the ID of the cluster that you want to manage.
In the left-side navigation pane, choose
In the upper-left corner of the page that appears, click Create Migration Task.
Configure the source and destination clusters.
Configure the required parameters in the Source Instance Information and the Destination Instance Information sections and click Test Connectivity and Proceed.
Note: In this example, set the Source Access Method parameter to Express Connect, VPN Gateway, Smart Access Gateway, or Self-managed ClickHouse on an ECS Instance and configure other parameters as prompted.
After the connection test succeeds, proceed to the Migration Content step. If the connection test fails, configure the source and destination clusters again as prompted.
Confirm the migration content.
On the page that appears, read the information about the data migration content and click Next: Pre-detect and Start Synchronization.
The system performs prechecks on the migration configuration and then starts the migration task in the background after the prechecks pass.
The system performs the following prechecks on the source and destination clusters: Instance Status Detection, Storage Space Detection, and Local Table and Distributed Table Detection.
If the prechecks are successful, perform the following operations:
Read the information about the impacts of data migration on clusters.
Click Completed.
Important: After you complete the configurations, the task is created and enters the Running state. You can view the task in the task list.
After the migration task is created, you must monitor the migration task. When the migration is nearing completion, you must stop writing data to the self-managed cluster and migrate the remaining database and table schemas. For more information, see Monitor a migration task and stop writing data to a self-managed cluster.
If the prechecks fail, follow the on-screen instructions to resolve the issues and then configure the migration task parameters again. The following table describes the precheck items and requirements. For more information about errors and solutions, see Query and troubleshoot the errors that are reported in the migration check.
| Item | Requirement |
| --- | --- |
| Instance Status Detection | Before you migrate data, make sure that no management operations, such as scale-out, upgrade, or downgrade operations, are being performed on the self-managed or destination cluster. If management operations are in progress, the system cannot start a migration task. |
| Storage Space Detection | Before a migration task is started, the system checks the storage space of the self-managed and destination clusters. Make sure that the storage capacity of the destination cluster is at least 1.2 times that of the self-managed cluster. |
| Local Table and Distributed Table Detection | If no distributed table is created for a local table of the self-managed cluster, or if multiple distributed tables are created for the same local table, the prechecks fail. You must delete the redundant distributed tables or create a unique distributed table. The query after this table can help you list the distributed tables in the self-managed cluster. |
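The following query is a minimal sketch that can help you prepare for the Local Table and Distributed Table Detection check. Run it on the self-managed cluster to list all distributed tables together with the engine definitions that reference their local tables, and then verify that each local table is referenced by exactly one distributed table.
-- engine_full contains the cluster, database, and local table that each distributed table points to.
SELECT database, name, engine_full
FROM system.tables
WHERE engine = 'Distributed'
ORDER BY database, name;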
Step 5: Monitor the migration task and stop writing data to the self-managed cluster
Log on to the ApsaraDB for ClickHouse console.
On the Clusters page, click the Clusters of Community-compatible Edition tab and click the ID of the cluster that you want to manage.
In the left-side navigation pane, choose
On the page that appears, monitor the value in the Running Information column.
Important: Stop writing data to the self-managed cluster based on the estimated remaining time displayed in the Running Information column. Then, perform the next step to complete the migration task.
Step 6: Complete the migration task
In this step, you must stop writing data to the self-managed cluster, migrate the remaining data, perform a data volume check, and migrate the remaining database and table schemas.
If the migration task fails to pass the data volume check, the migration task remains in the data volume check phase. In this case, we recommend that you cancel the migration and recreate the migration task. For more information about how to cancel a migration task, see Step 7: Perform other operations.
If the migration takes too long, excessive metadata accumulates in the destination cluster. We recommend that you complete this operation within five days after the migration task is created.
Log on to the ApsaraDB for ClickHouse console.
On the Clusters page, click the Clusters of Community-compatible Edition tab and click the ID of the cluster that you want to manage.
In the left-side navigation pane, choose
Find the migration task that you want to complete and click Complete Migration in the Actions column.
In the Complete Migration message, click OK.
Step 7: Perform other operations
After the migration task is complete, the Migration Status of the task changes to Completed. However, the task list is not immediately updated. We recommend that you refresh the task list periodically to view the latest task status.
| Operation | Description | Impact | Scenario |
| --- | --- | --- | --- |
| Cancel a migration task | Forcibly cancels the task, skips the data volume check, and does not migrate the remaining database or table schemas. | | The migration task affects the self-managed cluster. You want to stop the migration task and write data to the cluster at the earliest opportunity. |
| Stop a migration task | Immediately stops migrating data, skips the data volume check, and migrates the remaining database and table schemas. | The destination cluster is restarted. | You want to migrate a certain amount of data and run a test. You do not want to stop writing data to the self-managed cluster. |
Stop a migration task
Log on to the ApsaraDB for ClickHouse console.
On the Clusters page, click the Clusters of Community-compatible Edition tab and click the ID of the cluster that you want to manage.
In the left-side navigation pane, choose
Find the migration task that you want to stop and click Stop Migration in the Actions column.
In the Stop Migration message, click OK.
Cancel a migration task
Log on to the ApsaraDB for ClickHouse console.
On the Clusters page, click the Clusters of Community-compatible Edition tab and click the ID of the cluster that you want to manage.
In the left-side navigation pane, choose
Find the migration task that you want to cancel and click Cancel Migration in the Actions column.
In the Cancel Migration message, click OK.
Manual migration
Step 1: Migrate metadata (DDL statements for table creation)
The migration of ClickHouse metadata primarily involves migrating the DDL statements that are used to create tables.
If you need to install clickhouse-client, make sure that the version of clickhouse-client is consistent with the version of the destination ApsaraDB for ClickHouse cluster. For more information about how to download clickhouse-client, see clickhouse-client.
Query the databases in the self-managed cluster.
clickhouse-client --host="<old host>" --port="<old port>" --user="<old user name>" --password="<old password>" --query="SHOW databases" > database.list
Parameters
| Parameter | Description |
| --- | --- |
| old host | The endpoint of the self-managed cluster. |
| old port | The port of the self-managed cluster. |
| old user name | The username of the account that is used to log on to the self-managed cluster. The account must have permissions to execute DML statements (including read, write, and setting permissions) and permissions to execute DDL statements. |
| old password | The password of the account that is used to log on to the self-managed cluster. |
Note: system specifies the system database, which does not need to be migrated. You can skip this database during migration.
Query the tables in the self-managed cluster.
clickhouse-client --host="<old host>" --port="<old port>" --user="<old user name>" --password="<old password>" --query="SHOW tables from <database_name>" > table.list
Parameters
| Parameter | Description |
| --- | --- |
| database_name | The name of the database. |
You can use the system table to query the names of all databases and tables in the cluster.
SELECT DISTINCT database, name FROM system.tables WHERE database != 'system';
Note: The tables whose names start with .inner. are the internal representations of materialized views and do not need to be migrated. You can skip these tables during migration.
Export the DDL statements that are used to create the tables in the specified database of the self-managed cluster.
clickhouse-client --host="<old host>" --port="<old port>" --user="<old user name>" --password="<old password>" --query="SELECT concat(create_table_query, ';') FROM system.tables WHERE database='<database_name>' FORMAT TabSeparatedRaw" > tables.sql
Import the DDL statements to the destination cluster.
Note: Before you import the DDL statements, make sure that a database is created in the destination cluster to host the tables defined by the DDL statements.
clickhouse-client --host="<new host>" --port="<new port>" --user="<new user name>" --password="<new password>" -d '<database_name>' --multiquery < tables.sql
Parameters
| Parameter | Description |
| --- | --- |
| new host | The endpoint of the destination cluster. |
| new port | The port of the destination cluster. |
| new user name | The username of the account that is used to log on to the destination cluster. The account must have permissions to execute DML statements (including read, write, and setting permissions) and permissions to execute DDL statements. |
| new password | The password of the account that is used to log on to the destination cluster. |
Step 2: Migrate data
Use the remote function to migrate data
Execute the following SQL statement in the destination ApsaraDB for ClickHouse cluster to migrate data:
INSERT INTO <new_database>.<new_table> SELECT * FROM remote('<old_endpoint>', <old_database>.<old_table>, '<username>', '<password>') [WHERE _partition_id = '<partition_id>'] SETTINGS max_execution_time = 0, max_bytes_to_read = 0, log_query_threads = 0;
Note: For ApsaraDB for ClickHouse V20.8, we recommend that you use the remoteRaw function to migrate data. If data fails to be migrated, you can apply for a minor version update.
INSERT INTO <new_database>.<new_table> SELECT * FROM remoteRaw('<old_endpoint>', <old_database>.<old_table>, '<username>', '<password>') [WHERE _partition_id = '<partition_id>'] SETTINGS max_execution_time = 0, max_bytes_to_read = 0, log_query_threads = 0;
Parameters
Important: You can use the _partition_id parameter to filter data and reduce resource usage. We recommend that you use this parameter. For an example of how to obtain partition IDs, see the query after the following table.
| Parameter | Description |
| --- | --- |
| new_database | The name of the database in the destination cluster. |
| new_table | The name of the table in the destination cluster. |
| old_endpoint | The endpoint of the source cluster. For a self-managed cluster, the endpoint is in the IP address of the node:port format, where the port is a TCP port. If the source is an ApsaraDB for ClickHouse cluster, use the VPC endpoint instead of the public endpoint. Port 3306 is the default port of Community-compatible Edition clusters, and port 9000 is the default port of Enterprise Edition clusters. A Community-compatible Edition endpoint is in the VPC endpoint:3306 format, for example, cc-2zeqhh5v7y6q*****.clickhouse.ads.aliyuncs.com:3306. An Enterprise Edition endpoint is in the VPC endpoint:9000 format, for example, cc-bp1anv7jo84ta*****clickhouse.clickhouseserver.rds.aliyuncs.com:9000. |
| old_database | The name of the database in the self-managed cluster. |
| old_table | The name of the table in the self-managed cluster. |
| username | The username of the account of the self-managed cluster. |
| password | The password of the account of the self-managed cluster. |
| max_execution_time | The maximum query execution time. The value 0 indicates that no limit is imposed. |
| max_bytes_to_read | The maximum number of bytes that can be read from the source data during the query. The value 0 indicates that no limit is imposed. |
| log_query_threads | Specifies whether to log information about query threads. The value 0 indicates that query thread information is not logged. |
| _partition_id | The ID of the data partition. |
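If you want to migrate data partition by partition, you can obtain the partition IDs to use in the _partition_id filter from the system.parts table of the self-managed cluster. The following query is a sketch; replace the placeholder database and table names with your own.
SELECT DISTINCT partition_id
FROM system.parts
WHERE database = '<old_database>' AND table = '<old_table>' AND active;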
Export and import a file to migrate data
You can export a file from the self-managed cluster and import the file to the destination cluster to migrate data.
Export and import a CSV file
Export data to a CSV file from the self-managed cluster.
clickhouse-client --host="<old host>" --port="<old port>" --user="<old user name>" --password="<old password>" --query="select * from <database_name>.<table_name> FORMAT CSV" > table.csv
Import the CSV file to the destination cluster.
clickhouse-client --host="<new host>" --port="<new port>" --user="<new user name>" --password="<new password>" --query="insert into <database_name>.<table_name> FORMAT CSV" < table.csv
Use a pipe in Linux to export and import streaming data
clickhouse-client --host="<old host>" --port="<old port>" --user="<user name>" --password="<password>" --query="select * from <database_name>.<table_name> FORMAT CSV" | clickhouse-client --host="<new host>" --port="<new port>" --user="<user name>" --password="<password>" --query="INSERT INTO <database_name>.<table_name> FORMAT CSV"
Query and troubleshoot the errors that are reported in the migration check
| Error message | Description | Solution |
| --- | --- | --- |
| Missing unique distributed table or sharding_key not set. | The unique distributed table that is associated with the local table of the self-managed cluster does not exist. | Before migration, create a distributed table for the local table of the self-managed cluster. |
| The corresponding distribution table is not unique. | The local table of the self-managed cluster is associated with multiple distributed tables. | Delete the redundant distributed tables and retain only one distributed table. |
| MergeTree table on multiple replica cluster. | The self-managed cluster is a multi-replica cluster and contains non-replicated tables. In this case, data is inconsistent between replicas and cannot be migrated. | For more information, see Why are non-replicated tables not allowed when I scale up or down or migrate a multi-replica cluster? |
| Data reserved table on destination cluster. | The destination table already contains data. | Delete the destination table that contains data. |
| Columns of distributed table and local table conflict. | The columns of the distributed table are inconsistent with the columns of the local table. | Rebuild the distributed table in the self-managed cluster and make sure that its columns are consistent with the columns of the local table. |
| Storage is not enough. | The storage capacity of the destination cluster is insufficient. | Expand the storage capacity of the destination cluster to at least 1.2 times the storage capacity of the self-managed cluster. For more information, see Modify the configurations of an ApsaraDB for ClickHouse Community-compatible Edition cluster. |
| Missing system table. | A required system table of the self-managed cluster is missing. | Modify the config.xml file of the self-managed cluster to create the required system tables. For more information, see Step 1: Enable the system tables for the self-managed cluster. |
FAQ
What do I do if the "Too many partitions for single INSERT block (more than 100)" error message appears?
The number of partitions in a single INSERT operation exceeds the value of the max_partitions_per_insert_block parameter. The default value of the max_partitions_per_insert_block parameter is 100. ApsaraDB for ClickHouse generates a data part for each write operation. A partition may contain one or more data parts. If a single INSERT operation inserts an excessively large number of partitions, a large number of data parts are generated, which can impose a heavy burden on merge and query operations. To prevent a large number of data parts from being generated, ApsaraDB for ClickHouse imposes certain restrictions.
Solution: Perform the following operations to adjust the number of partitions or the value of the max_partitions_per_insert_block parameter:
Adjust the table schema and the partitioning mode so that the number of partitions that are inserted at a time does not exceed the upper limit.
If you cannot reduce the number of partitions that are inserted at a time, execute the following statement to increase the value of the max_partitions_per_insert_block parameter based on the data volume:
SET GLOBAL ON CLUSTER default max_partitions_per_insert_block = XXX;
Note: The default value of max_partitions_per_insert_block in ClickHouse is 100. Do not set the parameter to an excessively large value. Otherwise, performance may be affected. After you complete the batch data import, you can reset the parameter to its default value.
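You can check the current value of the parameter before and after you change it. The following query is a minimal check that reads the value from the system.settings table.
SELECT name, value FROM system.settings WHERE name = 'max_partitions_per_insert_block';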
Why does a destination ApsaraDB for ClickHouse cluster fail to connect to a self-managed ClickHouse cluster?
The self-managed ClickHouse cluster may have been configured with a firewall or a whitelist. You must add the IPv4 CIDR block of the vSwitch of the ApsaraDB for ClickHouse cluster to the whitelist of the self-managed ClickHouse cluster. For more information, see View IPv4 CIDR blocks.
Why are non-replicated tables not allowed when I scale up or down or migrate a multi-replica cluster? What do I do if the non-replicated tables exist?
The following section describes the causes of the issue and the related solutions:
Cause analysis: In a multi-replica cluster, replicated tables are used to synchronize data across replicas. The migration tool randomly selects one of the replicas as the data source and migrates its data to the destination cluster.
If a non-replicated table exists, its data is not synchronized across replicas, so each replica may contain different data. Because the migration tool migrates data from only one replica, data loss occurs. For example, if the MergeTree table in the replica named r0 contains records 1, 2, and 3, and the MergeTree table in the replica named r1 contains records 4 and 5, only records 1, 2, and 3 are migrated to the destination cluster.
Solution: If non-replicated tables in the source cluster can be deleted, we recommend that you delete the tables. Otherwise, you must replace the non-replicated tables in the source cluster with replicated tables. Procedure:
Log on to the source cluster.
Create replicated tables. Make sure that the table schema remains identical, except for the table engine.
Manually migrate data from non-replicated tables to new replicated tables. The following code provides an example of a statement.
Important: All data from the non-replicated tables on each replica must be migrated to the corresponding replicated tables. In this example, you must execute the sample statement on both replicas r0 and r1. You can use the row-count check after this procedure to confirm that all data is copied before you exchange the table names.
You can run the SELECT * FROM system.clusters; statement to query the value of the Node IP address parameter in the following sample statement.
INSERT INTO <Destination database>.<New replicated table> SELECT * FROM remote('<Node IP address>:3003', '<Source database>', '<Non-replicated table to be replaced>', '<username>', '<password>') [WHERE _partition_id = '<partition_id>'] SETTINGS max_execution_time = 0, max_bytes_to_read = 0, log_query_threads = 0;
Exchange the names of the non-replicated table and the replicated table.
EXCHANGE TABLES <Source database>.<non-replicated table to be replaced> AND <Destination database>.<new replicated table> ON CLUSTER default;
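As mentioned in the preceding procedure, you can compare the row counts of a non-replicated table and the corresponding new replicated table on each replica to confirm that all data has been copied before you exchange the table names. This is a simple check that uses the placeholder names from the procedure; replace them with your own names.
SELECT count() FROM <Source database>.<Non-replicated table to be replaced>;
SELECT count() FROM <Destination database>.<New replicated table>;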