This topic provides you with a practical guide on how to migrate data from a self-managed ClickHouse cluster to an ApsaraDB for ClickHouse cluster to improve the scalability and reliability of your business. It also offers guidance on how to handle potential issues that you may encounter during the migration process. The migration policy in this topic helps you complete data migration efficiently and securely and maximize the benefits of your cloud environment.
For security purposes, you cannot connect to an ApsaraDB for ClickHouse cluster over the Internet. The backend servers of an ApsaraDB for ClickHouse cluster can connect only to the servers in the same virtual private cloud (VPC). Before you migrate data, verify that the self-managed ClickHouse cluster and the destination ApsaraDB for ClickHouse cluster can communicate with each other.
Overview
The data migration policy involves two main steps: First, migrate metadata to ensure that all table schemas are correctly migrated to the ApsaraDB for ClickHouse cluster. Then, use the remote
function of ApsaraDB for ClickHouse to directly migrate data, or export the data and import it to the ApsaraDB for ClickHouse cluster to complete the migration.
| Network relationship between the self-managed ClickHouse cluster and the destination ApsaraDB for ClickHouse cluster | Migration method |
| --- | --- |
| The self-managed ClickHouse cluster is deployed on an Alibaba Cloud Elastic Compute Service (ECS) instance, and the ECS instance and the ApsaraDB for ClickHouse cluster are deployed in the same VPC. | Use the remote function to migrate data. |
| The self-managed ClickHouse cluster is deployed on an Alibaba Cloud ECS instance, and the ECS instance and the ApsaraDB for ClickHouse cluster are deployed in different VPCs. | Connect the two VPCs so that they can communicate with each other, and then use the remote function to migrate data. |
| The self-managed ClickHouse cluster is not deployed on an Alibaba Cloud ECS instance. For example, the self-managed ClickHouse cluster is deployed in an on-premises data center. | Connect the on-premises network to the VPC in which the ApsaraDB for ClickHouse cluster is deployed, and then use the remote function to migrate data. |
| The self-managed ClickHouse cluster cannot connect to the destination ApsaraDB for ClickHouse cluster. | Export the data from the self-managed ClickHouse cluster to a file and import the file to the destination ApsaraDB for ClickHouse cluster. |
| The self-managed ClickHouse cluster cannot connect to the destination ApsaraDB for ClickHouse cluster, but infrastructure such as Spark or Flink is available. | Create Spark or Flink jobs to read data from the self-managed ClickHouse cluster and then write the data to the destination ApsaraDB for ClickHouse cluster. |
Step 1: Migrate metadata
To migrate metadata from the self-managed ClickHouse cluster, import the DDL statements that are executed to create tables in the self-managed ClickHouse cluster to the destination ApsaraDB for ClickHouse cluster.
If you need to install clickhouse-client, make sure that its version is consistent with the version of the destination ApsaraDB for ClickHouse cluster. For the download link, see clickhouse-client.
Query the databases in the self-managed ClickHouse cluster.
clickhouse-client --host="<old host>" --port="<old port>" --user="<old user name>" --password="<old password>" --query="SHOW databases" > database.list
The following table describes the parameters in the preceding syntax.

| Parameter | Description |
| --- | --- |
| old host | The endpoint of the self-managed ClickHouse cluster. |
| old port | The TCP port that is used by the self-managed ClickHouse cluster. |
| old user name | The username that is used to log on to the self-managed ClickHouse cluster. The user must have DML permissions (read, write, and set) and can optionally be granted DDL permissions. |
| old password | The password of the user that is used to connect to the self-managed ClickHouse cluster. |
Note: system is the built-in system database. You do not need to migrate this database.
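Because the exported database.list also contains the built-in databases, it can help to filter them out before iterating over the remaining entries. The following is a minimal sketch; the file names and the placeholder contents are assumptions for the example:

```shell
# Placeholder contents standing in for the real "SHOW databases" output:
printf 'INFORMATION_SCHEMA\ndefault\ninformation_schema\nsystem\ntutorial\n' > database.list

# Drop the built-in databases; only the remaining ones need to be migrated.
grep -vxE 'system|INFORMATION_SCHEMA|information_schema' database.list > user_databases.list
cat user_databases.list
```

In a real migration, replace the printf line with the clickhouse-client command shown above.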
Query the tables in the self-managed ClickHouse cluster.
clickhouse-client --host="<old host>" --port="<old port>" --user="<old user name>" --password="<old password>" --query="SHOW tables from <database_name>" > table.list
The following table describes a parameter in the preceding syntax.

| Parameter | Description |
| --- | --- |
| database_name | The database name. |
You can also use system tables to query the names of all databases and tables in the cluster.
SELECT DISTINCT database, name FROM system.tables WHERE database != 'system';
Note: Table names that start with .inner. are the internal representations of materialized views. You can skip these tables during migration.
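Following the note above, the internal materialized-view tables can be dropped from the exported list before migration. A small sketch (the file names and sample contents are assumptions):

```shell
# Placeholder contents standing in for the real "SHOW tables" output:
printf '.inner.daily_mv\nhits\nvisits\n' > table.list

# Skip internal materialized-view tables; they are recreated automatically
# when the materialized views are created on the destination cluster.
grep -v '^\.inner\.' table.list > tables_to_migrate.list
cat tables_to_migrate.list
```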
Export the DDL statements that are executed to create tables in the specified database of the self-managed ClickHouse cluster.
clickhouse-client --host="<old host>" --port="<old port>" --user="<old user name>" --password="<old password>" --query="SELECT concat(create_table_query, ';') FROM system.tables WHERE database='<database_name>' FORMAT TabSeparatedRaw" > tables.sql
Import the DDL statements that are executed to create tables in the self-managed ClickHouse cluster to the destination ApsaraDB for ClickHouse cluster.
Note: Before you import the DDL statements that are used to create tables, you must create the database to which the tables will be imported in the ApsaraDB for ClickHouse cluster.
clickhouse-client --host="<new host>" --port="<new port>" --user="<new user name>" --password="<new password>" -d '<database_name>' --multiquery < tables.sql
The following table describes the parameters in the preceding syntax.

| Parameter | Description |
| --- | --- |
| new host | The endpoint of the destination ApsaraDB for ClickHouse cluster. |
| new port | The port that is used by the destination ApsaraDB for ClickHouse cluster. |
| new user name | The username that is used to log on to the destination ApsaraDB for ClickHouse cluster. The user must have DML permissions (read, write, and set) and can optionally be granted DDL permissions. |
| new password | The password of the user that is used to log on to the destination ApsaraDB for ClickHouse cluster. |
Step 2: Migrate data
Use the remote function to migrate data
Execute the following SQL statement in the destination ApsaraDB for ClickHouse cluster to migrate data:
INSERT INTO <new_database>.<new_table>
SELECT *
FROM remote('<old_endpoint>', <old_database>.<old_table>, '<username>', '<password>')
[WHERE _partition_id = '<partition_id>']
SETTINGS max_execution_time = 0, max_bytes_to_read = 0, log_query_threads = 0;
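For illustration, here is the same statement with the placeholders filled in. The database, table, endpoint, credentials, and partition ID below are all invented for the example:

```sql
INSERT INTO new_db.hits
SELECT *
FROM remote('192.168.0.10:9000', old_db.hits, 'migrate_user', 'your_password')
WHERE _partition_id = '202401'
SETTINGS max_execution_time = 0, max_bytes_to_read = 0, log_query_threads = 0;
```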
If your cluster runs ApsaraDB for ClickHouse V20.8, we recommend that you use the remoteRaw function to migrate data. If the migration fails, you can apply for a minor version update.
INSERT INTO <new_database>.<new_table>
SELECT *
FROM remoteRaw('<old_endpoint>', <old_database>.<old_table>, '<username>', '<password>')
[WHERE _partition_id = '<partition_id>']
SETTINGS max_execution_time = 0, max_bytes_to_read = 0, log_query_threads = 0;
The following table describes the parameters in the preceding statements.
Note: The `_partition_id` filter can be used to migrate data one partition at a time, which reduces resource usage. We recommend that you use it.
| Parameter | Description |
| --- | --- |
| new_database | The name of the database in the destination ApsaraDB for ClickHouse cluster. |
| new_table | The name of the table in the destination ApsaraDB for ClickHouse cluster. |
| old_endpoint | The endpoint of the source cluster.<br>Self-managed ClickHouse cluster: the endpoint is in the `host:port` format. Important: the port is a TCP port.<br>ApsaraDB for ClickHouse cluster: use the VPC endpoint, not the public endpoint. Important: port 3306 is the default port of ApsaraDB for ClickHouse Community-compatible Edition clusters, and port 9000 is the default port of ApsaraDB for ClickHouse Enterprise Edition clusters. |
| old_database | The name of the database in the self-managed ClickHouse cluster. |
| old_table | The name of the table in the self-managed ClickHouse cluster. |
| username | The username that is used to log on to the self-managed ClickHouse cluster. |
| password | The password that is used to log on to the self-managed ClickHouse cluster. |
| max_execution_time | The maximum query execution time. The value 0 indicates that no limit is imposed. |
| max_bytes_to_read | The maximum number of bytes that can be read from the source during the query. The value 0 indicates that no limit is imposed. |
| log_query_threads | Specifies whether to log information about query threads. The value 0 indicates that query thread information is not logged. |
| _partition_id | The ID of the data partition. |
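If you migrate partition by partition, you can list the valid `_partition_id` values by querying the system.parts table on the source cluster. The database and table names below are placeholders:

```sql
SELECT DISTINCT partition_id
FROM system.parts
WHERE database = 'old_db' AND table = 'hits' AND active;
```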
Export and import a file to migrate data
You can export a file from the self-managed ClickHouse cluster and import the file to the destination ApsaraDB for ClickHouse cluster to migrate data.
Export and import a CSV file
Export data to a CSV file from the self-managed ClickHouse cluster.
clickhouse-client --host="<old host>" --port="<old port>" --user="<old user name>" --password="<old password>" --query="SELECT * FROM <database_name>.<table_name> FORMAT CSV" > table.csv
Import the CSV file to the destination ApsaraDB for ClickHouse cluster.
clickhouse-client --host="<new host>" --port="<new port>" --user="<new user name>" --password="<new password>" --query="INSERT INTO <database_name>.<table_name> FORMAT CSV" < table.csv
Use a pipe in Linux to export and import streaming data
clickhouse-client --host="<old host>" --port="<old port>" --user="<user name>" --password="<password>" --query="SELECT * FROM <database_name>.<table_name> FORMAT CSV" | clickhouse-client --host="<new host>" --port="<new port>" --user="<user name>" --password="<password>" --query="INSERT INTO <database_name>.<table_name> FORMAT CSV"
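Whichever of the two file-based methods you use, a quick sanity check after the import is to compare the row counts of the source and destination tables; the two values should match if no data was written to the source table during the migration. A sketch that reuses the connection placeholders above:

```shell
# Row count on the source cluster:
clickhouse-client --host="<old host>" --port="<old port>" --user="<old user name>" --password="<old password>" --query="SELECT count() FROM <database_name>.<table_name>"

# Row count on the destination cluster; compare the two values:
clickhouse-client --host="<new host>" --port="<new port>" --user="<new user name>" --password="<new password>" --query="SELECT count() FROM <database_name>.<table_name>"
```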
FAQ
Q: What do I do if the error message "Too many partitions for single INSERT block (more than 100)" appears?
A: A single INSERT operation wrote data to more partitions than max_partitions_per_insert_block allows. The default value of max_partitions_per_insert_block is 100. ApsaraDB for ClickHouse generates a data part for each write operation, and a partition may contain one or more data parts. If a single INSERT operation inserts data into an excessively large number of partitions, a large number of data parts are generated, which imposes a heavy burden on subsequent merge and query operations. To prevent this, ApsaraDB for ClickHouse imposes this restriction.
Solution: Use one of the following methods to stay within the limit or raise it:
- Adjust the table schema or the partitioning expression, or split the data, so that the number of partitions written by a single INSERT operation does not exceed the upper limit.
- Execute the following statement to increase the value of max_partitions_per_insert_block based on the data volume:
SET GLOBAL ON cluster default max_partitions_per_insert_block = XXX;
Note: The default value of max_partitions_per_insert_block is 100. Do not set max_partitions_per_insert_block to an excessively large value. Otherwise, performance may be affected. After you finish importing data in batches, change the value back to the default.
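For example, to raise the limit for the duration of the import and then restore the default (the value 1000 is an arbitrary example; choose a value that matches your data volume):

```sql
-- Raise the limit before the bulk import:
SET GLOBAL ON cluster default max_partitions_per_insert_block = 1000;

-- ... run the INSERT statements that migrate the data ...

-- Restore the default value after the import finishes:
SET GLOBAL ON cluster default max_partitions_per_insert_block = 100;
```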
Q: Why does a destination ApsaraDB for ClickHouse cluster fail to connect to a self-managed ClickHouse cluster?
A: The self-managed ClickHouse cluster may have been configured with a firewall or a whitelist. You can view the VPC to which the destination ApsaraDB for ClickHouse cluster belongs in the ApsaraDB for ClickHouse console and add the entire CIDR block of the VPC to the whitelist of the self-managed ClickHouse cluster. If you want to control the CIDR block range in the whitelist to avoid potential security risks, execute the following SQL statement to query the IP addresses of the backend servers of the destination ApsaraDB for ClickHouse cluster and add only these IP addresses to the whitelist of the self-managed ClickHouse cluster.
SELECT * FROM system.clusters;
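The host_address column in the result contains the backend server IP addresses to add to the whitelist. To list only those addresses, you can narrow the query (this assumes the standard system.clusters schema):

```sql
SELECT DISTINCT host_address FROM system.clusters;
```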