You can use clickhouse-client to import data from a local file to ApsaraDB for ClickHouse. This topic describes how to import the data.
Prerequisites
A database account is created. For more information, see Manage database accounts in a Community-compatible Edition cluster and Manage the database accounts in an Enterprise Edition cluster.
clickhouse-client is installed, and the version of clickhouse-client is the same as or later than the version of the cluster. For more information, see Install clickhouse-client.
Note: Data import over the Internet is slow. If you want to import a large amount of data or test import performance, import the data over the internal network.
To import data over an internal network, install clickhouse-client on an Elastic Compute Service (ECS) instance that resides in the same virtual private cloud (VPC) as the ApsaraDB for ClickHouse cluster. Then, you can use the VPC endpoint to connect to the cluster.
When you install clickhouse-client on an ECS instance, you must enable Internet access for the ECS instance.
If you have not purchased an ECS instance, we recommend that you select the checkbox in the Public IP Address field when you purchase an ECS instance. For more information about how to purchase an ECS instance, see Create an instance on the Custom Launch tab.
If you have purchased an ECS instance but Internet access is disabled, you must associate an elastic IP address (EIP) with the ECS instance. For more information, see Associate an EIP with an ECS instance.
The IP address of the server on which clickhouse-client is installed is added to the whitelist of the ApsaraDB for ClickHouse cluster. For more information about how to configure a whitelist, see Configure a whitelist.
The format of the file to be imported is supported by ApsaraDB for ClickHouse. For more information about the supported file formats and requirements, see the Supported file formats section of this topic.
Procedure
This example shows how to import a CSV file into a distributed table named test_tbl_distributed in the default database of an ApsaraDB for ClickHouse cluster. You can modify the parameters based on your business requirements. Sample environment:
Destination database: default
Destination table: test_tbl_distributed
Data source: the testData.csv file
Step 1: Prepare data
In the directory in which clickhouse-client is installed, create a CSV file named testData.csv and write the following data to the file:
1,yang,32,shanghai,http://example.com
2,wang,22,beijing,http://example.com
3,xiao,23,shenzhen,http://example.com
4,jess,45,hangzhou,http://example.com
5,jack,14,shanghai,http://example.com
6,tomy,25,hangzhou,http://example.com
7,lucy,45,shanghai,http://example.com
8,tengyin,26,shanghai,http://example.com
9,wangli,27,shenzhen,http://example.com
10,xiaohua,37,shanghai,http://example.com
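If you work in a terminal, the file can be created directly from the shell. The following is a minimal sketch, assuming a POSIX shell in the clickhouse-client installation directory:

```shell
# Create testData.csv and write the 10 sample rows shown above to it.
cat > testData.csv <<'EOF'
1,yang,32,shanghai,http://example.com
2,wang,22,beijing,http://example.com
3,xiao,23,shenzhen,http://example.com
4,jess,45,hangzhou,http://example.com
5,jack,14,shanghai,http://example.com
6,tomy,25,hangzhou,http://example.com
7,lucy,45,shanghai,http://example.com
8,tengyin,26,shanghai,http://example.com
9,wangli,27,shenzhen,http://example.com
10,xiaohua,37,shanghai,http://example.com
EOF

# Sanity check: the file should contain 10 rows.
wc -l testData.csv
```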
Step 2: Create a table
Connect to the database.
ApsaraDB for ClickHouse is integrated with Data Management (DMS), which facilitates connection to a cluster. For more information about how to connect to a cluster by using DMS, see Use DMS to connect to an ApsaraDB for ClickHouse cluster.
If you use another client, see Database connectivity.
Create a table based on the cluster edition.
Important: Make sure that the column order and data types of the fields in the destination table exactly match the column order and data types in the source text. Otherwise, errors may occur during data import.
For an Enterprise Edition cluster, you only need to create local tables. For a Community-compatible Edition cluster, you may also need to create distributed tables based on your business requirements. The following sample statements show how to create tables. For more information, see CREATE TABLE.
Enterprise Edition
CREATE TABLE test_tbl_local ON cluster default ( id UInt8, user_name String, age UInt16, city String, access_url String ) ENGINE = MergeTree() ORDER BY id;
If the "ON CLUSTER is not allowed for Replicated database" error message appears when you execute the preceding statement, you can update the minor engine version of the cluster to resolve the issue. For more information, see Update the minor engine version.
Community-compatible Edition
The table engines that are used by Single-replica Edition and Dual-replica Edition are different. Select a table engine based on the replica type you use.
Important: When you create tables in a Dual-replica Edition cluster, make sure that the tables use Replicated table engines from the MergeTree family. If the tables use non-Replicated table engines, data in the tables is not replicated across replicas, which can lead to data inconsistency.
Single-replica Edition
Create a local table.
CREATE TABLE test_tbl_local ON cluster default ( id UInt8, user_name String, age UInt16, city String, access_url String ) ENGINE = MergeTree() ORDER BY id;
Optional. Create a distributed table.
If you want to import data only to the local table, skip this step.
If your cluster is a multi-node cluster, we recommend that you create a distributed table.
CREATE TABLE test_tbl_distributed ON cluster default ( id UInt8, user_name String, age UInt16, city String, access_url String ) ENGINE = Distributed(default, default, test_tbl_local, rand());
Dual-replica Edition
Create a local table.
CREATE TABLE test_tbl_local ON cluster default ( id UInt8, user_name String, age UInt16, city String, access_url String ) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{database}/{table}/{shard}', '{replica}') ORDER BY id;
Optional. Create a distributed table.
If you want to import data only to the local table, skip this step.
If your cluster is a multi-node cluster, we recommend that you create a distributed table.
CREATE TABLE test_tbl_distributed ON cluster default ( id UInt8, user_name String, age UInt16, city String, access_url String ) ENGINE = Distributed(default, default, test_tbl_local, rand());
Step 3: Import sample data
Run the following command in the directory in which clickhouse-client is installed.
If you want to accelerate data import, you can split the original file into equal segments, start multiple client processes, and concurrently import the segments.
If your cluster is a multi-node cluster, we recommend that you import data to a distributed table.
cat <file_name> | ./clickhouse-client --host=<host> --port=<port> --user=<user> --password=<password> --query="INSERT INTO <table_name> FORMAT <file_type>";
| Parameter | Description |
| --- | --- |
| file_name | The path to the source file that you want to import. |
| host | The public endpoint or VPC endpoint of the cluster. You can view the endpoints on the Cluster Information page. Select the endpoint based on the server on which clickhouse-client is installed: use the VPC endpoint if the server resides in the same VPC as the cluster, and use the public endpoint otherwise. For more information about how to apply for a public endpoint for a cluster, see Apply for and release a public endpoint. |
| port | The TCP port number. You can view the TCP port number on the Cluster Information page. |
| user | The username of the database account. |
| password | The password of the database account. |
| table_name | The name of the destination table. If you created a distributed table, set this parameter to the name of the distributed table. |
| file_type | The format of the source file, such as CSV. For more information, see the Supported file formats section of this topic. |
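The accelerated import described above, which splits the file and runs multiple client processes concurrently, can be sketched as follows. This is a minimal sketch: a short stand-in file is created so the example is self-contained, the connection values (YOUR_HOST, YOUR_PORT, and so on) are placeholders, and the leading `echo` makes the loop a dry run. Remove the `echo` and fill in your real endpoint to actually import.

```shell
# Stand-in for the source file from Step 1 (replace with your real data file).
printf '%s\n' \
  '1,yang,32,shanghai,http://example.com' \
  '2,wang,22,beijing,http://example.com' \
  '3,xiao,23,shenzhen,http://example.com' \
  '4,jess,45,hangzhou,http://example.com' > testData.csv

# Split the file into 2 roughly equal chunks: chunk_aa and chunk_ab.
split -n l/2 testData.csv chunk_

# Start one client process per chunk and import the chunks concurrently.
# The "echo" prints each command instead of running it (dry run).
for chunk in chunk_*; do
  echo "./clickhouse-client --host=YOUR_HOST --port=YOUR_PORT \
    --user=YOUR_USER --password=YOUR_PASSWORD \
    --query=\"INSERT INTO test_tbl_distributed FORMAT CSV\" < $chunk" &
done
wait
```

Each chunk is a complete set of rows because `split -n l/2` splits on line boundaries, so every client process inserts whole records.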
Step 4: View the import results
Connect to the cluster.
Use DMS to connect to the cluster. For more information, see Use DMS to connect to an ApsaraDB for ClickHouse cluster.
Execute the following query statement.
Important: If the cluster runs Community-compatible Edition and contains multiple nodes, query the distributed table to retrieve the complete data from all nodes. If you query a non-distributed table, you retrieve only the data that resides on a single node, which is only part of the data you imported.
SELECT * FROM test_tbl_distributed;
If you imported data into a local table only, for example, in an Enterprise Edition cluster, query test_tbl_local instead.
The following query results are returned:
+----+-----------+-----+----------+--------------------+
| id | user_name | age | city     | access_url         |
+----+-----------+-----+----------+--------------------+
|  1 | yang      |  32 | shanghai | http://example.com |
|  2 | wang      |  22 | beijing  | http://example.com |
|  3 | xiao      |  23 | shenzhen | http://example.com |
|  4 | jess      |  45 | hangzhou | http://example.com |
|  5 | jack      |  14 | shanghai | http://example.com |
|  6 | tomy      |  25 | hangzhou | http://example.com |
|  7 | lucy      |  45 | shanghai | http://example.com |
|  8 | tengyin   |  26 | shanghai | http://example.com |
|  9 | wangli    |  27 | shenzhen | http://example.com |
| 10 | xiaohua   |  37 | shanghai | http://example.com |
+----+-----------+-----+----------+--------------------+
Supported file formats
The following table describes the common file formats that can be imported.
In a text file, each row of data corresponds to a single record in the table, and each column corresponds to a field that was specified when the table was created.
If the first or second row contains column names or column types, those rows are ignored during parsing. Only the raw data that you want to import is processed. The column names and data types described in the file do not determine the position or data types of the columns in the destination table.
| File format | Text requirement |
| --- | --- |
| TabSeparated | Columns are separated by tab characters, and rows are separated by line feeds. |
| TabSeparatedWithNames | This format is similar to the TabSeparated format, except that the first row in the text contains column names. The first row is ignored during parsing. |
| TabSeparatedWithNamesAndTypes | This format is similar to the TabSeparated format, except that the values in the first row are column names and the values in the second row are data types. The first and second rows are ignored during parsing. |
| CSV | Columns are separated by commas, and rows are separated by line feeds. String values can be enclosed in double quotation marks. |
| CSVWithNames | This format is similar to the CSV format, except that the values in the first row are column names. The first row is ignored during parsing. |
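As an illustration of the WithNames variants described above, the following sketch creates a CSV file whose first row contains column names. When such a file is imported with FORMAT CSVWithNames, the header row is skipped during parsing. The connection parameters in the commented command are placeholders:

```shell
# Create a CSV file whose first row contains column names.
cat > testDataWithNames.csv <<'EOF'
id,user_name,age,city,access_url
1,yang,32,shanghai,http://example.com
2,wang,22,beijing,http://example.com
EOF

# Import it with the CSVWithNames format so that the header row is
# ignored and only the two data rows are inserted:
# cat testDataWithNames.csv | ./clickhouse-client --host=<host> --port=<port> \
#   --user=<user> --password=<password> \
#   --query="INSERT INTO test_tbl_distributed FORMAT CSVWithNames"
```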
For more information about other supported file formats, see Formats for Input and Output Data.
References
For more information about migration solutions, see Data migration and synchronization.