
ApsaraDB for ClickHouse:Migrate data from a self-managed ClickHouse cluster to an ApsaraDB for ClickHouse cluster

Last Updated:Aug 30, 2024

This topic provides you with a practical guide on how to migrate data from a self-managed ClickHouse cluster to an ApsaraDB for ClickHouse cluster to improve the scalability and reliability of your business. It also offers guidance on how to handle potential issues that you may encounter during the migration process. The migration policy in this topic helps you complete data migration efficiently and securely and maximize the benefits of your cloud environment.

Note

For security purposes, you cannot connect to an ApsaraDB for ClickHouse cluster over the Internet. The backend servers of an ApsaraDB for ClickHouse cluster can connect only to the servers in the same virtual private cloud (VPC). Before you migrate data, verify that the self-managed ClickHouse cluster and the destination ApsaraDB for ClickHouse cluster can communicate with each other.

Overview

The data migration policy involves two main steps: First, migrate metadata to ensure that all table schemas are correctly migrated to the ApsaraDB for ClickHouse cluster. Then, use the remote function of ApsaraDB for ClickHouse to directly migrate data, or export the data and import it to the ApsaraDB for ClickHouse cluster to complete the migration.

Migration methods

Select a migration method based on the network relationship between the self-managed ClickHouse cluster and the destination ApsaraDB for ClickHouse cluster.

  • The self-managed ClickHouse cluster is deployed on an Alibaba Cloud Elastic Compute Service (ECS) instance, and the ECS instance and the ApsaraDB for ClickHouse cluster are deployed in the same VPC.

    1. Perform operations in Step 1: Migrate metadata.

    2. Use the remote function to migrate data.

  • The self-managed ClickHouse cluster is deployed on an Alibaba Cloud ECS instance, and the ECS instance and the ApsaraDB for ClickHouse cluster are deployed in different VPCs.

    1. Connect the two VPCs. For more information, see Overview of VPC connections.

    2. Perform operations in Step 1: Migrate metadata.

    3. Use the remote function to migrate data.

  • The self-managed ClickHouse cluster is not deployed on an Alibaba Cloud ECS instance. For example, the cluster is deployed in an on-premises data center.

    1. Connect the on-premises data center to a VPC. For more information, see Connect a data center to a VPC.

    2. Perform operations in Step 1: Migrate metadata.

    3. Use the remote function to migrate data.

  • The self-managed ClickHouse cluster cannot connect to the destination ApsaraDB for ClickHouse cluster.

    1. Perform operations in Step 1: Migrate metadata.

    2. Export and import a file to migrate data.

  • The self-managed ClickHouse cluster cannot connect to the destination ApsaraDB for ClickHouse cluster, but infrastructure such as Spark or Flink is available.

    Create Spark or Flink jobs that read data from the self-managed ClickHouse cluster and write the data to the destination ApsaraDB for ClickHouse cluster.

Step 1: Migrate metadata

To migrate metadata from the self-managed ClickHouse cluster, import the DDL statements that are executed to create tables in the self-managed ClickHouse cluster to the destination ApsaraDB for ClickHouse cluster.

If you need to install clickhouse-client, make sure that its version is consistent with the version of the destination ApsaraDB for ClickHouse cluster. For the download link, see clickhouse-client.

  1. Query the databases in the self-managed ClickHouse cluster.

    clickhouse-client --host="<old host>" --port="<old port>" --user="<old user name>" --password="<old password>" --query="SHOW databases"  > database.list

    The following parameters are used in the preceding syntax:

    • old host: The endpoint of the self-managed ClickHouse cluster.

    • old port: The port that is used by the self-managed ClickHouse cluster.

    • old user name: The username that is used to log on to the self-managed ClickHouse cluster. The user must have DML permissions, including the read, write, and set permissions, and can also be granted DDL permissions.

    • old password: The password of the user that is used to connect to the self-managed ClickHouse cluster.

    Note

    system stands for the system database. You do not need to migrate this database.
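The system database noted above can be filtered out of the exported list with standard shell tools. The following is a minimal sketch: the database names written to database.list are sample values, not output from a real cluster.

```shell
# Sample contents standing in for the output of the SHOW databases query.
printf 'default\nsystem\nanalytics\n' > database.list

# Drop the system database (and the INFORMATION_SCHEMA aliases, which some
# ClickHouse versions also report) so that only user databases remain.
grep -vE '^(system|INFORMATION_SCHEMA|information_schema)$' database.list > database.migrate.list

cat database.migrate.list   # prints "default" and "analytics"
```

The filtered database.migrate.list can then drive the per-database steps that follow.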

  2. Query the tables in the self-managed ClickHouse cluster.

    clickhouse-client --host="<old host>" --port="<old port>" --user="<old user name>" --password="<old password>" --query="SHOW tables from <database_name>"  > table.list

    The database_name parameter in the preceding syntax specifies the name of the database to query.

    You can also query the system tables to list the names of all databases and tables in the cluster:

    SELECT DISTINCT database, name FROM system.tables WHERE database != 'system';

    Note

    Table names that start with .inner. are the internal representations of materialized views. You can skip these tables during migration.
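The .inner. tables mentioned in the note can be skipped with a simple filter. A minimal sketch, using sample table names rather than real SHOW tables output:

```shell
# Sample contents standing in for the output of the SHOW tables query.
printf '.inner.mv_daily_hits\nhits\nvisits\n' > table.list

# Exclude internal materialized-view tables, whose names start with ".inner."
grep -v '^\.inner\.' table.list > table.migrate.list

cat table.migrate.list   # prints "hits" and "visits"
```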

  3. Export the DDL statements that are executed to create tables in the specified database of the self-managed ClickHouse cluster.

    clickhouse-client --host="<old host>" --port="<old port>" --user="<old user name>" --password="<old password>" --query="SELECT concat(create_table_query, ';') FROM system.tables WHERE database='<database_name>' FORMAT TabSeparatedRaw" > tables.sql
  4. Import the DDL statements that are executed to create tables in the self-managed ClickHouse cluster to the destination ApsaraDB for ClickHouse cluster.

    Note

    Before you import the DDL statements, you must create the destination database in the ApsaraDB for ClickHouse cluster.

    clickhouse-client --host="<new host>" --port="<new port>" --user="<new user name>" --password="<new password>"  -d '<database_name>'  --multiquery < tables.sql

    The following parameters are used in the preceding syntax:

    • new host: The endpoint of the destination ApsaraDB for ClickHouse cluster.

    • new port: The port that is used by the destination ApsaraDB for ClickHouse cluster.

    • new user name: The username that is used to log on to the destination ApsaraDB for ClickHouse cluster. The user must have DML permissions, including the read, write, and set permissions, and can also be granted DDL permissions.

    • new password: The password of the user that is used to log on to the destination ApsaraDB for ClickHouse cluster.
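After the DDL statements are imported, you can verify that the destination reports the same table names as the source. This is a hedged sketch: the two list files are filled with sample data here, but in practice each would be generated by running the SHOW tables query against the respective cluster.

```shell
# In practice, generate these files with clickhouse-client against the
# source and destination clusters; sample data is used here.
printf 'hits\nvisits\n' | sort > source_tables.list
printf 'visits\nhits\n' | sort > dest_tables.list

# diff exits 0 only if the sorted lists match exactly.
if diff -q source_tables.list dest_tables.list >/dev/null; then
  echo "metadata migrated: table lists match"
else
  echo "mismatch: inspect with: diff source_tables.list dest_tables.list"
fi
```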

Step 2: Migrate data

Use the remote function to migrate data

  1. (Optional) If you need to limit network traffic consumption when you migrate data to an ApsaraDB for ClickHouse cluster, you can adjust the network_compression_method parameter in the destination cluster to select an appropriate compression algorithm.

    • Temporarily change the value of the network_compression_method parameter in the destination ApsaraDB for ClickHouse cluster. The following statement provides an example:

      SET network_compression_method = 'ZSTD';

    • View the current value of the network_compression_method parameter in the destination ApsaraDB for ClickHouse cluster. The following statement provides an example:

      SELECT * FROM system.settings WHERE name = 'network_compression_method';
  2. Execute the following SQL statement in the destination ApsaraDB for ClickHouse cluster to migrate data:

INSERT INTO <new_database>.<new_table> 
SELECT * 
FROM remote('<old_endpoint>', <old_database>.<old_table>, '<username>', '<password>') 
[WHERE _partition_id = '<partition_id>']
SETTINGS max_execution_time = 0, max_bytes_to_read = 0, log_query_threads = 0;
Note

For ApsaraDB for ClickHouse V20.8, we recommend that you use the remoteRaw function to migrate data. If data fails to be migrated, you can apply for a minor version update.

INSERT INTO <new_database>.<new_table> 
SELECT * 
FROM remoteRaw('<old_endpoint>', <old_database>.<old_table>, '<username>', '<password>')
[WHERE _partition_id = '<partition_id>']
SETTINGS max_execution_time = 0, max_bytes_to_read = 0, log_query_threads = 0;

The following parameters are used in the preceding statement.

Important

The _partition_id filter can be used to migrate data partition by partition, which reduces resource usage. We recommend that you use this filter.

(Optional) You can execute the following SQL statement to query the partition IDs and the number of parts in each partition from system.parts:

SELECT partition_id, count(*) AS part_count FROM clusterAllReplicas(default, system, parts) WHERE `database` = '<old_database>' AND `table` = '<old_table>' GROUP BY partition_id;

  • new_database: The name of the database in the destination ApsaraDB for ClickHouse cluster.

  • new_table: The name of the table in the destination ApsaraDB for ClickHouse cluster.

  • old_endpoint: The endpoint of the source cluster.

    • Self-managed ClickHouse cluster: The endpoint is in the <IP address of the source cluster node>:<port> format.

      Important

      In this case, the port is a TCP port.

    • ApsaraDB for ClickHouse cluster: The endpoint is a VPC endpoint, not a public endpoint.

      Important

      Port 3306 is the default port of ApsaraDB for ClickHouse Community-compatible Edition clusters, and port 9000 is the default port of ApsaraDB for ClickHouse Enterprise Edition clusters.

      • Community-compatible Edition cluster: The endpoint is in the <VPC endpoint>:3306 format. Example: cc-2zeqhh5v7y6q*****.clickhouse.ads.aliyuncs.com:3306.

      • Enterprise Edition cluster: The endpoint is in the <VPC endpoint>:9000 format. Example: cc-bp1anv7jo84ta*****clickhouse.clickhouseserver.rds.aliyuncs.com:9000.

  • old_database: The name of the database in the self-managed ClickHouse cluster.

  • old_table: The name of the table in the self-managed ClickHouse cluster.

  • username: The username that is used to log on to the self-managed ClickHouse cluster.

  • password: The password that is used to log on to the self-managed ClickHouse cluster.

  • max_execution_time: The maximum query execution time. The value 0 indicates that no limit is imposed.

  • max_bytes_to_read: The maximum number of bytes that can be read from the source during the query. The value 0 indicates that no limit is imposed.

  • log_query_threads: Specifies whether to log information about query threads. The value 0 indicates that query thread information is not logged.

  • _partition_id: The ID of the data partition.
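To act on the recommendation to filter by _partition_id, you can generate one INSERT statement per partition from the output of the system.parts query. A minimal sketch; the partition IDs, endpoint, database, table, and credentials below are placeholders, not real values:

```shell
# partitions.list would normally hold the partition_id column returned by
# the system.parts query; these IDs are illustrative placeholders.
printf '202401\n202402\n202403\n' > partitions.list

# Emit one INSERT ... remote(...) statement per partition.
while read -r pid; do
  cat <<EOF
INSERT INTO new_db.new_table
SELECT * FROM remote('old-host:9000', old_db.old_table, 'user', 'password')
WHERE _partition_id = '$pid'
SETTINGS max_execution_time = 0, max_bytes_to_read = 0, log_query_threads = 0;
EOF
done < partitions.list > migrate.sql

grep -c '^INSERT INTO' migrate.sql   # prints 3, one statement per partition
```

The generated migrate.sql can then be fed to clickhouse-client with the --multiquery option, so that a partition whose migration fails can be retried individually.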

Export and import a file to migrate data

You can export a file from the self-managed ClickHouse cluster and import the file to the destination ApsaraDB for ClickHouse cluster to migrate data.

  • Export and import a CSV file

    1. Export data to a CSV file from the self-managed ClickHouse cluster.

      clickhouse-client --host="<old host>" --port="<old port>" --user="<old user name>" --password="<old password>"  --query="select * from <database_name>.<table_name> FORMAT CSV"  > table.csv
    2. Import the CSV file to the destination ApsaraDB for ClickHouse cluster.

      clickhouse-client --host="<new host>" --port="<new port>" --user="<new user name>" --password="<new password>"  --query="insert into <database_name>.<table_name> FORMAT CSV"  < table.csv
  • Use a pipe in Linux to export and import streaming data

    clickhouse-client --host="<old host>" --port="<old port>" --user="<user name>" --password="<password>"  --query="select * from <database_name>.<table_name> FORMAT CSV" | 
    clickhouse-client --host="<new host>" --port="<new port>" --user="<user name>" --password="<password>"   --query="INSERT INTO <database_name>.<table_name> FORMAT CSV"
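When many tables must be moved, the pipe pattern above can be generated per table instead of typed by hand. A hedged sketch: the host, port, user, password, and database placeholders match the preceding commands, and the table names are sample values.

```shell
# tables.txt would normally come from the SHOW tables step; sample values here.
printf 'hits\nvisits\n' > tables.txt

# Write one export-import pipeline per table into a runnable script.
while read -r tbl; do
  echo "clickhouse-client --host=\"<old host>\" --port=\"<old port>\" --user=\"<user name>\" --password=\"<password>\" --query=\"SELECT * FROM <database_name>.$tbl FORMAT CSV\" | clickhouse-client --host=\"<new host>\" --port=\"<new port>\" --user=\"<user name>\" --password=\"<password>\" --query=\"INSERT INTO <database_name>.$tbl FORMAT CSV\""
done < tables.txt > migrate_tables.sh

grep -c '^clickhouse-client' migrate_tables.sh   # prints 2, one pipeline per table
```

After you replace the placeholders with real values, migrate_tables.sh can be executed to migrate the tables one by one.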

FAQ

  • Q: What do I do if the error message "Too many partitions for single INSERT block (more than 100)" appears?

    A: The number of partitions written by a single INSERT operation exceeds the value of max_partitions_per_insert_block, which defaults to 100. ApsaraDB for ClickHouse generates a data part for each write operation, and a partition may contain one or more data parts. If a single INSERT operation writes to too many partitions, it generates a large number of data parts, which imposes a heavy burden on merge and query operations. To prevent this, ApsaraDB for ClickHouse limits the number of partitions that a single INSERT operation can write to.

    Solution: Perform the following operations to adjust the number of partitions or the value of max_partitions_per_insert_block:

    • Adjust the table schema and the partitioning mode or make sure that the number of partitions inserted at a time does not exceed the upper limit.

    • To prevent the number of partitions inserted at a time from exceeding the upper limit, execute the following statement to change the value of max_partitions_per_insert_block based on the data volume to increase the upper limit on the number of partitions that can be inserted at a time:

      SET GLOBAL ON CLUSTER default max_partitions_per_insert_block = XXX;
      Note

      In ClickHouse, the default value of max_partitions_per_insert_block is 100. Do not set max_partitions_per_insert_block to an excessively large value; otherwise, performance may be affected. After you import data in batches, you can change the value of max_partitions_per_insert_block back to the default.
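One way to import in batches, as suggested above, is to split the exported CSV file into smaller chunks and import them one at a time, so that each INSERT touches fewer partitions. A minimal sketch with generated stand-in data; the file names and the chunk size are arbitrary examples:

```shell
# Stand-in for an exported CSV file with 250 rows.
seq 1 250 > table.csv

# Split into chunks of at most 100 rows each: part_aa, part_ab, part_ac.
split -l 100 table.csv part_

ls part_*   # lists the three chunk files
# Each chunk would then be imported separately, for example:
#   clickhouse-client ... --query="INSERT INTO <database_name>.<table_name> FORMAT CSV" < part_aa
```

Choose the chunk size so that the rows in one chunk span fewer partitions than the max_partitions_per_insert_block limit.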

  • Q: Why does a destination ApsaraDB for ClickHouse cluster fail to connect to a self-managed ClickHouse cluster?

    A: The self-managed ClickHouse cluster may be configured with a firewall or an IP address whitelist. In the ApsaraDB for ClickHouse console, you can view the VPC to which the destination cluster belongs and add the entire CIDR block of that VPC to the whitelist of the self-managed ClickHouse cluster. If you want to narrow the whitelisted CIDR block range to reduce security risks, execute the following SQL statement to query the IP addresses of the backend servers of the destination ApsaraDB for ClickHouse cluster, and add only these IP addresses to the whitelist of the self-managed ClickHouse cluster.

    SELECT * FROM system.clusters;