You can use clickhouse-client to import data from a local file to ApsaraDB for ClickHouse. This topic describes how to import the data.
Prerequisites
A database account is created. For more information, see Manage database accounts in a Community-compatible Edition cluster and Manage the database accounts in an Enterprise Edition cluster.
clickhouse-client is installed, and the version of clickhouse-client is the same as or later than the version of the cluster. For more information, see Install clickhouse-client.
Note: Data import over the Internet is slow. If you want to import a large amount of data or test import performance, import the data over the internal network.
To import data over an internal network, install clickhouse-client on an Elastic Compute Service (ECS) instance that resides in the same virtual private cloud (VPC) as the ApsaraDB for ClickHouse cluster. Then, you can use the VPC endpoint to connect to the cluster.
When you install clickhouse-client on an ECS instance, you must enable Internet access for the ECS instance.
If you have not purchased an ECS instance, we recommend that you select the checkbox in the Public IP Address field when you purchase an ECS instance. For more information about how to purchase an ECS instance, see Create an instance on the Custom Launch tab.
If you have purchased an ECS instance but Internet access is disabled, you must associate an elastic IP address (EIP) with the ECS instance. For more information, see Associate an EIP with an ECS instance.
The IP address of the server on which clickhouse-client is installed is added to the whitelist of the ApsaraDB for ClickHouse cluster. For more information about how to configure a whitelist, see Configure a whitelist.
The format of the file to be imported is supported by ApsaraDB for ClickHouse. For more information about the supported file formats and requirements, see the Supported file formats section of this topic.
Procedure
This example shows how to import a CSV file into a distributed table named test_tbl_distributed in the default database of an ApsaraDB for ClickHouse cluster. You can modify the parameters based on your business requirements. Sample environment:
Destination database: default
Destination table: test_tbl_distributed
Data source: the testData.csv file
Step 1: Prepare data
In the directory in which clickhouse-client is installed, create a CSV file named testData.csv and write the following data to the file:
1,yang,32,shanghai,http://example.com
2,wang,22,beijing,http://example.com
3,xiao,23,shenzhen,http://example.com
4,jess,45,hangzhou,http://example.com
5,jack,14,shanghai,http://example.com
6,tomy,25,hangzhou,http://example.com
7,lucy,45,shanghai,http://example.com
8,tengyin,26,shanghai,http://example.com
9,wangli,27,shenzhen,http://example.com
10,xiaohua,37,shanghai,http://example.com
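If you work in a terminal, the file can be created directly from the shell. The following is a minimal sketch, assuming a POSIX shell in the clickhouse-client installation directory:

```shell
# Create testData.csv and write the 10 sample rows shown above to it.
cat > testData.csv <<'EOF'
1,yang,32,shanghai,http://example.com
2,wang,22,beijing,http://example.com
3,xiao,23,shenzhen,http://example.com
4,jess,45,hangzhou,http://example.com
5,jack,14,shanghai,http://example.com
6,tomy,25,hangzhou,http://example.com
7,lucy,45,shanghai,http://example.com
8,tengyin,26,shanghai,http://example.com
9,wangli,27,shenzhen,http://example.com
10,xiaohua,37,shanghai,http://example.com
EOF

# Sanity check: the file should contain 10 rows.
wc -l testData.csv
```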
Step 2: Create a table
Connect to the database.
ApsaraDB for ClickHouse is integrated with Data Management (DMS), which facilitates connection to a cluster. For more information about how to connect to a cluster by using DMS, see Use DMS to connect to an ApsaraDB for ClickHouse cluster.
If you use another client, see Database connectivity.
Create a table based on the cluster edition.
Important: Make sure that the column order and data types of the fields in the destination table exactly match the column order and data types in the source text. Otherwise, errors may occur during data import.
For an Enterprise Edition cluster, you only need to create local tables. For a Community-compatible Edition cluster, you may also need to create distributed tables based on your business requirements. The following sample statements show how to create tables. For more information, see CREATE TABLE.
Enterprise Edition
CREATE TABLE test_tbl_local ON cluster default ( id UInt8, user_name String, age UInt16, city String, access_url String ) ENGINE = MergeTree() ORDER BY id;
If the "ON CLUSTER is not allowed for Replicated database" error message appears when you execute the preceding statement, you can update the minor engine version of the cluster to resolve the issue. For more information, see Update the minor engine version.
Community-compatible Edition
The table engines that are used by Single-replica Edition and Dual-replica Edition are different. Select a table engine based on the replica type you use.
Important: When you create tables in a Dual-replica Edition cluster, make sure that the tables use Replicated table engines from the MergeTree family. If the tables use non-Replicated table engines, data in the tables is not replicated across replicas, which can lead to data inconsistency.
Single-replica Edition
Create a local table.
CREATE TABLE test_tbl_local ON cluster default ( id UInt8, user_name String, age UInt16, city String, access_url String ) ENGINE = MergeTree() ORDER BY id;
Optional. Create a distributed table.
If you want to import data only to the local table, skip this step.
If your cluster is a multi-node cluster, we recommend that you create a distributed table.
CREATE TABLE test_tbl_distributed ON cluster default ( id UInt8, user_name String, age UInt16, city String, access_url String ) ENGINE = Distributed(default, default, test_tbl_local, rand());
Dual-replica Edition
Create a local table.
CREATE TABLE test_tbl_local ON cluster default ( id UInt8, user_name String, age UInt16, city String, access_url String ) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{database}/{table}/{shard}', '{replica}') ORDER BY id;
Optional. Create a distributed table.
If you want to import data only to the local table, skip this step.
If your cluster is a multi-node cluster, we recommend that you create a distributed table.
CREATE TABLE test_tbl_distributed ON cluster default ( id UInt8, user_name String, age UInt16, city String, access_url String ) ENGINE = Distributed(default, default, test_tbl_local, rand());
Step 3: Import sample data
Run the following command in the directory in which clickhouse-client is installed.
If you want to accelerate data import, you can split the original file into equal segments, start multiple client processes, and concurrently import the segments.
If your cluster is a multi-node cluster, we recommend that you import data to a distributed table.
cat <file_name> | ./clickhouse-client --host=<host> --port=<port> --user=<user> --password=<password> --query="INSERT INTO <table_name> FORMAT <file_type>";
| Parameter | Description |
| --- | --- |
| file_name | The path to the source file that you want to import. |
| host | The public endpoint or VPC endpoint of the cluster. You can view the endpoints on the Cluster Information page. Select the endpoint based on the server on which clickhouse-client is installed: use the VPC endpoint if the server resides in the same VPC as the cluster, and use the public endpoint otherwise. For more information about how to apply for a public endpoint for a cluster, see Apply for and release a public endpoint. |
| port | The TCP port number. You can view the TCP port number on the Cluster Information page. |
| user | The username of the database account. |
| password | The password of the database account. |
| table_name | The name of the destination table. If you created a distributed table, set this parameter to the name of the distributed table. |
| file_type | The format of the source file, such as CSV. For more information, see the Supported file formats section of this topic. |
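The accelerated import described above, which splits the file and runs multiple client processes concurrently, can be sketched as follows. This is a minimal sketch: a short stand-in file is created so the example is self-contained, the connection values (YOUR_HOST, YOUR_PORT, and so on) are placeholders, and the leading `echo` makes the loop a dry run. Remove the `echo` and fill in your real endpoint to actually import.

```shell
# Stand-in for the source file from Step 1 (replace with your real data file).
printf '%s\n' \
  '1,yang,32,shanghai,http://example.com' \
  '2,wang,22,beijing,http://example.com' \
  '3,xiao,23,shenzhen,http://example.com' \
  '4,jess,45,hangzhou,http://example.com' > testData.csv

# Split the file into 2 roughly equal chunks: chunk_aa and chunk_ab.
split -n l/2 testData.csv chunk_

# Start one client process per chunk and import the chunks concurrently.
# The "echo" prints each command instead of running it (dry run).
for chunk in chunk_*; do
  echo "./clickhouse-client --host=YOUR_HOST --port=YOUR_PORT \
    --user=YOUR_USER --password=YOUR_PASSWORD \
    --query=\"INSERT INTO test_tbl_distributed FORMAT CSV\" < $chunk" &
done
wait
```

Each chunk is a complete set of rows because `split -n l/2` splits on line boundaries, so every client process inserts whole records.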
Step 4: View the import results
Connect to the cluster.
Use DMS to connect to the cluster. For more information, see Use DMS to connect to an ApsaraDB for ClickHouse cluster.
Execute the following query statement.
Important: If the cluster runs Community-compatible Edition and contains multiple nodes, query the distributed table to retrieve the complete data from all nodes. If you query a non-distributed table, you retrieve only the data that resides on a single node, which is only part of the data you imported.
SELECT * FROM test_tbl_distributed;
If you imported data into a local table only, for example, in an Enterprise Edition cluster, query test_tbl_local instead.
The following query results are returned:
+----+-----------+-----+----------+--------------------+
| id | user_name | age | city     | access_url         |
+----+-----------+-----+----------+--------------------+
|  1 | yang      |  32 | shanghai | http://example.com |
|  2 | wang      |  22 | beijing  | http://example.com |
|  3 | xiao      |  23 | shenzhen | http://example.com |
|  4 | jess      |  45 | hangzhou | http://example.com |
|  5 | jack      |  14 | shanghai | http://example.com |
|  6 | tomy      |  25 | hangzhou | http://example.com |
|  7 | lucy      |  45 | shanghai | http://example.com |
|  8 | tengyin   |  26 | shanghai | http://example.com |
|  9 | wangli    |  27 | shenzhen | http://example.com |
| 10 | xiaohua   |  37 | shanghai | http://example.com |
+----+-----------+-----+----------+--------------------+
Supported file formats
The following table describes the common file formats that can be imported.
In a text file, each row of data corresponds to a single record in the table, and each column corresponds to a field that was specified when the table was created.
If the first or second row contains column names or column types, those rows are ignored during parsing. Only the raw data that you want to import is processed. The column names and data types described in the file do not determine the position or data types of the columns in the destination table.
| File format | Text requirement |
| --- | --- |
| TabSeparated | Columns are separated by tab characters, and rows are separated by line feeds. |
| TabSeparatedWithNames | This format is similar to the TabSeparated format, except that the first row in the text contains column names. The first row is ignored during parsing. |
| TabSeparatedWithNamesAndTypes | This format is similar to the TabSeparated format, except that the values in the first row are column names and the values in the second row are data types. The first and second rows are ignored during parsing. |
| CSV | Columns are separated by commas, and rows are separated by line feeds. String values can be enclosed in double quotation marks. |
| CSVWithNames | This format is similar to the CSV format, except that the values in the first row are column names. The first row is ignored during parsing. |
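As an illustration of the WithNames variants described above, the following sketch creates a CSV file whose first row contains column names. When such a file is imported with FORMAT CSVWithNames, the header row is skipped during parsing. The connection parameters in the commented command are placeholders:

```shell
# Create a CSV file whose first row contains column names.
cat > testDataWithNames.csv <<'EOF'
id,user_name,age,city,access_url
1,yang,32,shanghai,http://example.com
2,wang,22,beijing,http://example.com
EOF

# Import it with the CSVWithNames format so that the header row is
# ignored and only the two data rows are inserted:
# cat testDataWithNames.csv | ./clickhouse-client --host=<host> --port=<port> \
#   --user=<user> --password=<password> \
#   --query="INSERT INTO test_tbl_distributed FORMAT CSVWithNames"
```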
For more information about other supported file formats, see Formats for Input and Output Data.
References
For more information about migration solutions, see Data migration and synchronization.