This topic describes how to use the clickhouse-client tool to import data from a local file into ApsaraDB for ClickHouse.
Prerequisites
Create a database account. For more information, see Manage accounts of Community-Compatible Edition clusters and Manage accounts of Enterprise Edition clusters.
Install clickhouse-client. The client version must be the same as or later than your cluster's version.
Note: Importing data over the public network is slow. For large datasets or performance testing, we recommend importing data over the internal network.
To import data over an internal network:
The ECS instance should be in the same Virtual Private Cloud (VPC) as the ApsaraDB for ClickHouse cluster.
Use the VPC endpoint to connect to the ClickHouse cluster.
Enable public network access on the ECS instance to install the clickhouse-client tool:
New instance: Check Assign Public IPv4 Address when purchasing. For purchase details, see Create an instance using the wizard.
Existing instance without public access: Associate an Elastic IP Address (EIP).
Add the IP address of the server that hosts the clickhouse-client tool to the whitelist of ApsaraDB for ClickHouse. For details, see Set Whitelist.
The source file is in a supported format.
Procedure
This topic provides an example of how to import data from a CSV file into the distributed table in an ApsaraDB for ClickHouse cluster. In practice, modify the parameters to suit your business requirements. This example uses the following environment:
Destination database: default
Destination table: test_tbl_distributed
Source data file: testData.csv
Step 1: Prepare the data
In the clickhouse-client installation directory, create a file named testData.csv and add the following data to it.
1,yang,32,shanghai,http://example.com
2,wang,22,beijing,http://example.com
3,xiao,23,shenzhen,http://example.com
4,jess,45,hangzhou,http://example.com
5,jack,14,shanghai,http://example.com
6,tomy,25,hangzhou,http://example.com
7,lucy,45,shanghai,http://example.com
8,tengyin,26,shanghai,http://example.com
9,wangli,27,shenzhen,http://example.com
10,xiaohua,37,shanghai,http://example.com
Step 2: Create a table
Connect to the database.
You can use DMS to connect to a ClickHouse cluster.
If you are using other clients, see Connect to a database.
Create a table based on your cluster's edition.
Important: Data import can fail if the column order and data types in your destination table do not exactly match the data in your source file. Always verify that the table schema corresponds to the file's structure before importing.
For Enterprise Edition clusters, you only need to create a local table. For Community-Compatible Edition clusters, you may also need to create a distributed table, depending on your environment and requirements. The following are example statements. For more information about the table creation syntax, see CREATE TABLE.
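Before you import, a quick pre-flight check can catch schema mismatches early. The following sketch verifies that every row of the source file has the same number of columns as the destination table (5 in this topic's example); the file name and column count come from this example, and the naive field count assumes no quoted field contains a comma.

```shell
# Hypothetical pre-flight check: confirm that every row of the source file
# has the same number of columns as the destination table (5 in this example).
# Caveat: this naive field count assumes no quoted field contains a comma.
cat > testData.csv <<'EOF'
1,yang,32,shanghai,http://example.com
2,wang,22,beijing,http://example.com
EOF
awk -F',' 'NF != 5 { bad++ } END { print bad + 0 }' testData.csv   # prints 0 when all rows are well-formed
```

If the count is not 0, fix the malformed rows before running the import.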
Enterprise Edition
CREATE TABLE test_tbl_local ON cluster default
(
    id UInt8,
    user_name String,
    age UInt16,
    city String,
    access_url String
)
ENGINE = MergeTree()
ORDER BY id;
When you create a table in an Enterprise Edition cluster, you may get the error ON CLUSTER is not allowed for Replicated database. This is a known issue in older minor versions. To fix it, upgrade the minor engine version of your cluster.
Community-Compatible Edition
The engines for single-replica and two-replica clusters differ. Select an engine based on your replica type.
Important: When you create tables in a Double-replica Edition cluster, make sure that the tables use Replicated table engines from the MergeTree family. If the tables use non-Replicated table engines, the data on the tables is not replicated across replicas. This can lead to data inconsistency.
Single-replica
Create a local table.
CREATE TABLE test_tbl_local ON cluster default
(
    id UInt8,
    user_name String,
    age UInt16,
    city String,
    access_url String
)
ENGINE = MergeTree()
ORDER BY id;
(Optional) Create a distributed table.
If you need to import data only to a local table, you can skip this step.
If you have a multi-node cluster, we recommend that you create a distributed table.
CREATE TABLE test_tbl_distributed ON cluster default
(
    id UInt8,
    user_name String,
    age UInt16,
    city String,
    access_url String
)
ENGINE = Distributed(default, default, test_tbl_local, rand());
Two-replica
Create a local table.
CREATE TABLE test_tbl_local ON cluster default
(
    id UInt8,
    user_name String,
    age UInt16,
    city String,
    access_url String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{database}/{table}/{shard}', '{replica}')
ORDER BY id;
(Optional) Create a distributed table.
If you need to import data only to a local table, you can skip this step.
If you have a multi-node cluster, we recommend that you create a distributed table.
CREATE TABLE test_tbl_distributed ON cluster default
(
    id UInt8,
    user_name String,
    age UInt16,
    city String,
    access_url String
)
ENGINE = Distributed(default, default, test_tbl_local, rand());
Step 3: Import the data
In the clickhouse-client installation directory, run the following command.
To accelerate the data import, you can split the source file into multiple parts and run multiple client processes to import the parts in parallel.
If you have a multi-node cluster, we recommend that you import the data into a distributed table.
cat <file_name> | ./clickhouse-client --host=<host> --port=<port> --user=<user> --password=<password> --query="INSERT INTO <table_name> FORMAT <file_type>";
Parameter | Description
file_name | The path of the source file.
host | The public endpoint or VPC endpoint of the cluster, which is displayed on the Cluster Information page. If the clickhouse-client tool runs on an ECS instance in the same VPC as the cluster, use the VPC endpoint; otherwise, use the public endpoint. For more information about how to apply for a public endpoint, see Apply for or release a public endpoint.
port | The TCP port number of the cluster, which is displayed on the Cluster Information page.
user | The database account.
password | The password of the database account.
table_name | The name of the destination table. If you created a distributed table, specify the name of the distributed table.
file_type | The format of the source data file, for example, CSV.
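The parallel-import tip above can be sketched as follows. The connection parameters are placeholders, so the clickhouse-client invocation is only echoed here and the sketch runs without a live cluster; replace echo with the real command in your environment.

```shell
# Sketch of the parallel-import tip: split the source file into chunks and
# import each chunk with a separate client process. Connection parameters are
# placeholders; the import command is only echoed so the sketch runs offline.
seq 1 10 | awk '{ printf "%d,user_%d,%d,city,http://example.com\n", $1, $1, 20 + $1 }' > big.csv
split -l 3 -d big.csv part_        # GNU split: creates part_00 .. part_03
for f in part_*; do
  echo "cat $f | ./clickhouse-client --host=<host> --port=<port> --user=<user> --password=<password> --query=\"INSERT INTO test_tbl_distributed FORMAT CSV\" &"
done
# After starting all imports in the background, run "wait" to block until done.
```

Splitting by line count keeps each chunk a valid CSV fragment, so every chunk can be imported independently.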
Step 4: Verify the result
Connect to the cluster.
Run a query statement.
Important: On a multi-node Community-Compatible Edition cluster, always query the distributed table. Querying a local table directly returns results from only a single node, which means your data will be incomplete.
SELECT * FROM test_tbl_distributed;
The following result is returned.
+----+-----------+-----+----------+--------------------+
| id | user_name | age | city     | access_url         |
+----+-----------+-----+----------+--------------------+
|  1 | yang      |  32 | shanghai | http://example.com |
|  2 | wang      |  22 | beijing  | http://example.com |
|  3 | xiao      |  23 | shenzhen | http://example.com |
|  4 | jess      |  45 | hangzhou | http://example.com |
|  5 | jack      |  14 | shanghai | http://example.com |
|  6 | tomy      |  25 | hangzhou | http://example.com |
|  7 | lucy      |  45 | shanghai | http://example.com |
|  8 | tengyin   |  26 | shanghai | http://example.com |
|  9 | wangli    |  27 | shenzhen | http://example.com |
| 10 | xiaohua   |  37 | shanghai | http://example.com |
+----+-----------+-----+----------+--------------------+
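As an additional sanity check, you can compare the number of data lines in the source file with the table's row count. The following is a sketch; the SELECT is shown as a comment because it requires a live cluster connection, and the sample file is recreated inline so the sketch is self-contained.

```shell
# Sketch: the imported row count should equal the number of data lines in the
# source file. The SELECT is commented out because it needs a live cluster.
cat > testData.csv <<'EOF'
1,yang,32,shanghai,http://example.com
2,wang,22,beijing,http://example.com
3,xiao,23,shenzhen,http://example.com
EOF
wc -l < testData.csv               # expected row count in the table
# ./clickhouse-client --host=<host> --port=<port> --user=<user> --password=<password> \
#   --query="SELECT count() FROM test_tbl_distributed"
```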
Supported file formats
The following table describes common supported file formats.
Each row in the file corresponds to one row in the table, and each column in the file corresponds to one column in the table.
For the following formats, header rows that contain column names or data types are ignored during parsing. Data is imported based on the column order of the destination table, not the headers in the source file.
Format | Description
TabSeparated | Values are separated by tab characters, and rows are separated by line breaks. Values are written without surrounding quotation marks.
TabSeparatedWithNames | Similar to TabSeparated, except that the first row contains column names. The header row is ignored during parsing.
TabSeparatedWithNamesAndTypes | Similar to TabSeparated, except that the first row contains column names and the second row contains data types. Both rows are ignored during parsing.
CSV | Values are separated by commas, and rows are separated by line breaks. String values can be enclosed in double quotation marks.
CSVWithNames | Similar to CSV, except that the first row contains column names. The first row is ignored during parsing.
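To illustrate the difference between the CSV and TabSeparated formats, the following sketch converts a small CSV sample to tab-separated form. This naive tr-based conversion only works when no field contains a comma or quoted string; for files with quoting, use a proper CSV parser instead.

```shell
# Sketch: the same rows in CSV and TabSeparated form. The tr-based conversion
# below assumes no field contains a comma or quotation marks.
printf '1,yang,32\n2,wang,22\n' > rows.csv
tr ',' '\t' < rows.csv > rows.tsv
cat rows.tsv
```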
For more information about file formats, see Formats for Input and Output Data.
Related documents
For more information about data migration and synchronization solutions, see Data migration and synchronization.