ClickHouse data sources provide bidirectional channels to read from and write to ClickHouse. This topic describes the data synchronization capabilities that DataWorks supports for ClickHouse.
Supported versions
The following table describes the compatibility between Alibaba Cloud ClickHouse versions and JDBC driver versions.
| JDBC driver version | Alibaba Cloud ClickHouse version |
| --- | --- |
| 0.2.4 | 20.8, 21.8 |
| 0.4.0, 0.4.2 | 22.8, 23.8 |
Limits
ClickHouse data sources support only offline read and write operations. The following limits apply:
- You can use serverless resource groups (recommended) or exclusive resource groups for Data Integration.
- You can connect to ClickHouse over JDBC and read data only by using JDBC statements.
- You can read a subset of columns and read the columns in an order that differs from the table schema.
- To avoid overloading ClickHouse, you must limit the system throughput to a maximum of 1,000 transactions per second (TPS) when ClickHouse Writer uses the INSERT mode, as illustrated in the sketch after this list.
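The following fragment sketches one way to stay under that TPS ceiling by enabling throttling and keeping concurrency low. It is a minimal sketch, not a prescribed configuration: the concurrent and mbps values are illustrative assumptions that you should tune based on the observed ClickHouse load.

```json
"setting": {
    "speed": {
        "throttle": true, // Enable throttling so that the mbps cap takes effect.
        "concurrent": 1,  // Illustrative: a single concurrent channel keeps the INSERT rate low.
        "mbps": "1"       // Illustrative cap (1 MB/s); raise it gradually while watching the load.
    }
}
```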
Supported data types
The common Alibaba Cloud ClickHouse data types are supported. For a complete list of Alibaba Cloud ClickHouse data types, see Data types. Other data types that are described in the open source ClickHouse documentation are not supported. For a full list of open source ClickHouse data types, see ClickHouse Doc.
| Data type | ClickHouse Reader | ClickHouse Writer |
| --- | --- | --- |
| Int8 | Supported | Supported |
| Int16 | Supported | Supported |
| Int32 | Supported | Supported |
| Int64 | Supported | Supported |
| UInt8 | Supported | Supported |
| UInt16 | Supported | Supported |
| UInt32 | Supported | Supported |
| UInt64 | Supported | Supported |
| Float32 | Supported | Supported |
| Float64 | Supported | Supported |
| Decimal | Supported | Supported |
| String | Supported | Supported |
| FixedString | Supported | Supported |
| Date | Supported | Supported |
| DateTime | Supported | Supported |
| DateTime64 | Supported | Supported |
| Boolean | Supported. Note: ClickHouse does not have a dedicated Boolean type. Use UInt8 or Int8 instead (see the sketch after this table). | Supported |
| Array | Partially supported. Supported only when array elements are integers, floating-point numbers, strings, or DateTime64 values with millisecond precision. | Supported |
| Tuple | Supported | Supported |
| Domain (IPv4, IPv6) | Supported | Supported |
| Enum8 | Supported | Supported |
| Enum16 | Supported | Supported |
| Nullable | Supported | Supported |
| Nested | Partially supported. Supported only when nested fields are integers, floating-point numbers, strings, or DateTime64 values with millisecond precision. | Supported |
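As an example of the Boolean workaround above, the following writer fragment writes Boolean values into an integer column. It is a minimal sketch: the table and column names (user_table, is_active) are hypothetical, and is_active is assumed to be declared as UInt8 in ClickHouse.

```json
{
    "stepType": "clickhouse",
    "parameter": {
        "datasource": "example",
        "table": "user_table", // Hypothetical destination table.
        "column": [
            "id",
            "is_active"        // Hypothetical UInt8 column that stores Boolean values as 0 or 1.
        ],
        "writeMode": "insert"
    },
    "name": "Writer",
    "category": "writer"
}
```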
Add a data source
Before you develop a synchronization task in DataWorks, add the required ClickHouse data source by following the instructions in Data Source Management. When you add the data source, you can view the parameter descriptions in the DataWorks console to understand what each parameter means.
Data synchronization task development
For information about the entry point and the procedure for configuring a synchronization task, see the following configuration guides.
Configure a single-table offline sync task
For more information about the procedure, see Configure a task in the codeless UI and Configure a task in the code editor.
For more information about all parameters and script examples in code editor mode, see Appendix: Script demo and parameter descriptions.
Configure full-database offline read sync
For more information about the procedure, see Configure a real-time full database synchronization task.
Appendix: Script demo and parameter descriptions
Configure a batch synchronization task by using the code editor
If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configure a task in the code editor. The following sections describe the parameters that you must configure for ClickHouse data sources in the code editor.
Reader script demo
```json
{
    "type": "job",
    "version": "2.0",
    "steps": [
        {
            "stepType": "clickhouse", // Plug-in name.
            "parameter": {
                "fetchSize": 1024, // Number of records fetched per batch from the database server.
                "datasource": "example",
                "column": [ // Column names.
                    "id",
                    "name"
                ],
                "where": "", // Filter condition.
                "splitPk": "", // Shard key.
                "table": "" // Table name.
            },
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "clickhouse",
            "parameter": {
                "postSql": [
                    "update @table set db_modify_time = now() where db_id = 1"
                ],
                "datasource": "example", // Data source.
                "batchByteSize": "67108864",
                "column": [
                    "id",
                    "name"
                ],
                "writeMode": "insert",
                "encoding": "UTF-8",
                "batchSize": 1024,
                "table": "ClickHouse_table",
                "preSql": [
                    "delete from @table where db_id = -1"
                ]
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "executeMode": null,
        "errorLimit": {
            "record": "0" // Maximum number of error records allowed during sync.
        },
        "speed": {
            "throttle": true, // When throttle is false, mbps is ignored (no throttling). When throttle is true, throttling is enabled.
            "concurrent": 1, // Job concurrency.
            "mbps": "12" // Throttling rate. 1 mbps = 1 MB/s.
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}
```

Reader script parameters
| Parameter | Description | Required | Default value |
| --- | --- | --- | --- |
| datasource | The name of the data source. In script mode, you can add a data source. The value must exactly match the name of the added data source. | Yes | None |
| table | The table to synchronize, described in JSON format. Note: The table parameter must be included in the connection configuration block. | Yes | None |
| fetchSize | The number of records fetched per batch from the database server. This value determines how often the sync system communicates with the server and affects extraction performance. Note: An excessively large fetchSize may cause out-of-memory (OOM) errors. Increase this value gradually based on the ClickHouse load. | No | 1,024 |
| column | The columns to read from ClickHouse. Separate column names with commas, for example, ["id","name"]. Note: The column parameter must be specified and cannot be empty. | Yes | None |
| jdbcUrl | The JDBC connection URL of the source database. Note: The jdbcUrl parameter must be included in the connection configuration block. | Yes | None |
| username | The username for the data source. | Yes | None |
| password | The password for the specified username. | Yes | None |
| splitPk | The field used to shard data when data is extracted from ClickHouse. If you specify splitPk, the data is sharded based on the specified field so that sync tasks can run concurrently, which improves efficiency (see the sketch after this table). Note: When splitPk is configured, fetchSize is required. | No | None |
| where | The filter condition. In practice, you often synchronize only the data of the current day by specifying a condition on a date or time column as the where clause. The where condition enables effective incremental synchronization. If you do not specify a where clause, which includes not providing a key or value for where, a full synchronization is performed. | No | None |
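The following reader fragment shows how splitPk, fetchSize, and where work together for a sharded, incremental read. It is a minimal sketch: the table name orders, the shard key id, and the condition on the gmt_create column are hypothetical, and the ${bizdate} variable is assumed to be defined in the task's scheduling parameters.

```json
{
    "stepType": "clickhouse",
    "parameter": {
        "datasource": "example",
        "table": "orders", // Hypothetical source table.
        "column": ["id", "name"],
        "splitPk": "id", // Hypothetical shard key; enables concurrent reads.
        "fetchSize": 1024, // Required when splitPk is configured.
        "where": "gmt_create >= '${bizdate}'" // Hypothetical incremental condition on a date column.
    },
    "name": "Reader",
    "category": "reader"
}
```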
Writer script demo
```json
{
    "type": "job",
    "version": "2.0", // Version number.
    "steps": [
        {
            "stepType": "stream",
            "parameter": {},
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "clickhouse", // Plug-in name.
            "parameter": {
                "username": "",
                "password": "",
                "column": [ // Fields.
                    "id",
                    "name"
                ],
                "connection": [
                    {
                        "table": [ // Table name.
                            "ClickHouse_table"
                        ],
                        "jdbcUrl": "jdbc:clickhouse://ip:port/database"
                    }
                ],
                "preSql": [ // SQL statement executed before the sync task runs.
                    "TRUNCATE TABLE IF EXISTS tablename"
                ],
                "postSql": [ // SQL statement executed after the sync task runs.
                    "ALTER TABLE tablename UPDATE col1 = 1 WHERE col2 = 2"
                ],
                "batchSize": "1024",
                "batchByteSize": "67108864",
                "writeMode": "insert"
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "errorLimit": {
            "record": "0" // Number of error records.
        },
        "speed": {
            "throttle": true, // When throttle is false, mbps is ignored (no throttling). When throttle is true, throttling is enabled.
            "concurrent": 1, // Job concurrency.
            "mbps": "12" // Throttling rate. 1 mbps = 1 MB/s.
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}
```

Writer script parameters
| Parameter | Description | Required | Default value |
| --- | --- | --- | --- |
| jdbcUrl | The JDBC connection URL of the destination database. Note: The jdbcUrl parameter must be included in the connection configuration block. | Yes | None |
| username | The username for the data source. | Yes | None |
| password | The password for the specified username. | Yes | None |
| table | The name of the destination table, described as a JSON array. Note: The table parameter must be included in the connection configuration block. | Yes | None |
| column | The fields in the destination table to which data is written. Separate field names with commas, for example, ["id","name"]. Note: The column parameter must be specified and cannot be empty. | Yes | None |
| preSql | The SQL statement executed before data is written to the destination table (see the sketch after this table). | No | None |
| postSql | The SQL statement executed after data is written to the destination table. | No | None |
| batchSize | The number of records submitted per batch. A proper value can greatly reduce the number of network interactions between the sync system and ClickHouse and improve overall throughput. An excessively large value may cause out-of-memory (OOM) errors. | No | 1,024 |
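To tie these parameters together, the following writer fragment sketches a pre-clean, write, post-update flow. It is a minimal sketch, not a definitive configuration: the preSql and postSql statements mirror the demos above (the @table placeholder is assumed to expand to the destination table name), and the batch values are illustrative.

```json
{
    "stepType": "clickhouse",
    "parameter": {
        "connection": [
            {
                "table": ["ClickHouse_table"],
                "jdbcUrl": "jdbc:clickhouse://ip:port/database"
            }
        ],
        "column": ["id", "name"],
        "preSql": [
            "delete from @table where db_id = -1" // As in the demo above; cleans up stale rows before the write.
        ],
        "postSql": [
            "update @table set db_modify_time = now() where db_id = 1" // Stamps the written rows after the task completes.
        ],
        "writeMode": "insert",
        "batchSize": 1024, // Illustrative; increase cautiously to avoid OOM errors.
        "batchByteSize": "67108864" // 64 MB per batch, matching the demos above.
    },
    "name": "Writer",
    "category": "writer"
}
```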