DataWorks provides StarRocks Reader and StarRocks Writer for you to read data from and write data to StarRocks data sources. This topic describes the capabilities of synchronizing data from or to StarRocks data sources.
Supported versions
StarRocks Community Edition. For more information, visit the StarRocks official website.
Note: StarRocks Community Edition is highly open. If you encounter a compatibility issue when you use a StarRocks data source, submit a ticket.
Data type mappings
Most StarRocks data types, including numeric, string, and date data types, are supported.
Preparations before data synchronization
To ensure the network connectivity of an exclusive resource group that you want to use, you must add the IP address or CIDR block of the resource group to the internal IP address whitelist of the desired EMR Serverless StarRocks instance in advance. In addition, you must allow the CIDR block to access ports 9030, 8030, and 8040.
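The whitelist and port requirements above can be spot-checked with a short connectivity probe. The following Python snippet is a minimal sketch, not a DataWorks tool: it assumes you can run it on a host in the same network as the resource group, and the function names are hypothetical. The port roles in the comments are StarRocks defaults and may differ on your instance.

```python
import socket

# Ports named in this topic. By default in StarRocks, 9030 is the FE query
# port, 8030 is the FE HTTP port, and 8040 is the BE HTTP port.
STARROCKS_PORTS = (9030, 8030, 8040)


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_ports(host: str, ports=STARROCKS_PORTS, timeout: float = 3.0) -> dict:
    """Map each port to whether it accepted a TCP connection."""
    return {port: port_is_open(host, port, timeout) for port in ports}


# Example call (the host name is a placeholder, replace it with your FE address):
# print(check_ports("your-fe-host.example.com"))
```

A port that reports False usually points to a missing whitelist entry or a blocked port, but the probe only proves TCP reachability from the machine on which it runs.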
To obtain the IP address or CIDR block of each type of exclusive resource group in DataWorks, see the following topics:
Develop a data synchronization task
For information about the entry point for and the procedure of configuring a data synchronization task, see the following sections. For information about the parameter settings, view the infotip of each parameter on the configuration tab of the task.
Add a data source
Before you configure a data synchronization task to synchronize data from or to a specific data source, you must add the data source to DataWorks. For more information, see Add and manage data sources. The following content describes the configuration of the Java Database Connectivity (JDBC) URL when you add a StarRocks data source:
If you add an EMR Serverless StarRocks instance as a data source, the JDBC URL is specified in the following format:
jdbc:mysql://<URL of the FE node>:<Query port of the FE node>/<Database name>
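As an illustration of the format above, the following Python sketch assembles a JDBC URL from its three parts. The function name and the sample host are hypothetical, and the default of 9030 is the StarRocks default FE query port, which your instance may override.

```python
def build_starrocks_jdbc_url(fe_host: str, database: str, query_port: int = 9030) -> str:
    """Build a JDBC URL in the form jdbc:mysql://<FE host>:<query port>/<database>."""
    if not fe_host or not database:
        raise ValueError("fe_host and database must be non-empty")
    if not 0 < query_port < 65536:
        raise ValueError("query_port must be a valid TCP port")
    return f"jdbc:mysql://{fe_host}:{query_port}/{database}"


# Placeholder host name; "didb1" is the database used in the samples in this topic.
print(build_starrocks_jdbc_url("fe.example.internal", "didb1"))
```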
Configure a batch synchronization task to synchronize data of a single table
For more information about the configuration procedure, see Configure a batch synchronization task by using the codeless UI and Configure a batch synchronization task by using the code editor.
For information about all parameters that are configured and the code that is run when you use the code editor to configure a batch synchronization task, see Appendix: Code and parameters.
Appendix: Code and parameters
Appendix: Configure a batch synchronization task by using the code editor
If you use the code editor to configure a batch synchronization task, you must configure parameters for the reader and writer of the related data source based on the format requirements in the code editor. For more information about the format requirements, see Configure a batch synchronization task by using the code editor. The following information describes the configuration details of parameters for the reader and writer in the code editor.
Code for StarRocks Reader
{
    "stepType": "starrocks",
    "parameter": {
        "selectedDatabase": "didb1",
        "datasource": "starrocks_datasource",
        "column": [
            "id",
            "name"
        ],
        "where": "id>100",
        "table": "table1",
        "splitPk": "id"
    },
    "name": "Reader",
    "category": "reader"
}
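To make the reader configuration easier to interpret, the following Python sketch renders the query that the sample parameters roughly correspond to. This is an assumption for illustration only: this topic does not state the exact SQL that StarRocks Reader issues, and the function name is hypothetical.

```python
def reader_to_sql(parameter: dict) -> str:
    """Render an approximate SELECT statement from StarRocks Reader parameters."""
    columns = ", ".join(parameter["column"])
    table = parameter["table"]
    # selectedDatabase is optional; without it, the data source's database applies.
    database = parameter.get("selectedDatabase")
    target = f"{database}.{table}" if database else table
    sql = f"SELECT {columns} FROM {target}"
    where = parameter.get("where")
    if where:
        sql += f" WHERE {where}"
    return sql


sample_parameter = {
    "selectedDatabase": "didb1",
    "datasource": "starrocks_datasource",
    "column": ["id", "name"],
    "where": "id>100",
    "table": "table1",
    "splitPk": "id",
}
print(reader_to_sql(sample_parameter))
```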
Parameters in code for StarRocks Reader
Parameter | Description | Required | Default value |
datasource | The name of the StarRocks data source. | Yes | No default value |
selectedDatabase | The name of the StarRocks database. | No | The name of the database that is configured in the StarRocks data source |
column | The names of the columns from which you want to read data. | Yes | No default value |
where | The WHERE clause that filters the data to read. For example, the sample code for StarRocks Reader sets this parameter to id>100. | No | No default value |
table | The name of the table from which you want to read data. | Yes | No default value |
splitPk | The field that is used for data sharding when StarRocks Reader reads data. If you specify this parameter, data sharding is performed based on the value of this parameter, and parallel threads can be used to read data. This improves data synchronization efficiency. We recommend that you set the splitPk parameter to the name of the primary key column of the table. Data can be evenly distributed to different shards based on the primary key column, instead of being intensively distributed only to specific shards. | No | No default value |
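The sharding idea behind splitPk can be sketched as follows. This Python example is illustrative only and does not reproduce the reader's actual splitting logic: it assumes an integer primary key whose minimum and maximum values are already known, and the function name is hypothetical.

```python
from typing import List, Tuple


def split_ranges(min_value: int, max_value: int, shards: int) -> List[Tuple[int, int]]:
    """Split the inclusive key range [min_value, max_value] into contiguous,
    near-equal inclusive ranges, one per shard."""
    if shards < 1:
        raise ValueError("shards must be at least 1")
    if min_value > max_value:
        raise ValueError("min_value must not exceed max_value")
    total = max_value - min_value + 1
    shards = min(shards, total)  # never create empty shards
    base, extra = divmod(total, shards)
    ranges = []
    start = min_value
    for index in range(shards):
        # The first `extra` shards take one additional key each.
        size = base + (1 if index < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


# Each range could be read by its own thread with a condition such as
# "id >= start AND id <= end".
print(split_ranges(101, 200, 4))
```

Ranges of equal width hold similar row counts only when key values are spread evenly, which is why a primary key column is the recommended choice.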
Code for StarRocks Writer
{
    "stepType": "starrocks",
    "parameter": {
        "selectedDatabase": "didb1",
        "loadProps": {
            "row_delimiter": "\\x02",
            "column_separator": "\\x01"
        },
        "datasource": "starrocks_public",
        "column": [
            "id",
            "name"
        ],
        "loadUrl": [
            "1.1.1.1:8030"
        ],
        "table": "table1",
        "preSql": [
            "truncate table table1"
        ],
        "postSql": []
    },
    "name": "Writer",
    "category": "writer"
}
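The doubled backslashes in loadProps are JSON escaping. The following Python sketch shows what the delimiter values look like after the JSON is parsed. The helper decode_hex_delimiter is hypothetical, and it assumes that the \x01 notation denotes a single byte given in hexadecimal.

```python
import json

# Raw string so that the backslashes appear exactly as in the writer sample.
writer_fragment = r'''
{
    "loadProps": {
        "row_delimiter": "\\x02",
        "column_separator": "\\x01"
    }
}
'''

load_props = json.loads(writer_fragment)["loadProps"]
column_separator = load_props["column_separator"]
row_delimiter = load_props["row_delimiter"]

# After parsing, each value is the four-character text \x01 or \x02,
# not a single control character.
print(repr(column_separator), len(column_separator))


def decode_hex_delimiter(value: str) -> bytes:
    """Convert text such as \\x01 into the byte it denotes (assumed hex notation)."""
    if value.startswith("\\x"):
        return bytes.fromhex(value[2:])
    return value.encode("utf-8")
```

Non-printable delimiters such as these are a common choice because they are unlikely to appear in the data being written.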
Parameters in code for StarRocks Writer
Parameter | Description | Required | Default value |
datasource | The name of the StarRocks data source. | Yes | No default value |
selectedDatabase | The name of the StarRocks database. | No | The name of the database that is configured in the StarRocks data source |
loadProps | The request parameters for the StarRocks Stream Load import method. If you want to import data as CSV files by using the Stream Load import method, you can configure request parameters such as column_separator and row_delimiter, as shown in the sample code for StarRocks Writer. If you have no special requirements, set the parameter to {}. | Yes | No default value |
column | The names of the columns to which you want to write data. | Yes | No default value |
loadUrl | The URL of a StarRocks frontend node. The URL consists of the IP address of the frontend node and the HTTP port number. The default HTTP port number is 8030. If you specify URLs for multiple frontend nodes, separate them with commas (,). | Yes | No default value |
table | The name of the table to which you want to write data. | Yes | No default value |
preSql | The SQL statement that you want to execute before the synchronization task is run. For example, you can set this parameter to the TRUNCATE TABLE tablename statement to delete outdated data. | No | No default value |
postSql | The SQL statement that you want to execute after the synchronization task is run. | No | No default value |
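To show how the writer parameters relate to a Stream Load call, the following Python sketch builds (but does not send) an HTTP PUT request for the StarRocks Stream Load endpoint. It is an approximation under stated assumptions: this topic does not describe how StarRocks Writer issues requests internally, the function name and credentials are hypothetical, and selectedDatabase is treated as required here for simplicity.

```python
import base64
import urllib.request


def build_stream_load_request(parameter: dict, user: str, password: str,
                              body: bytes) -> urllib.request.Request:
    """Build a Stream Load PUT request from StarRocks Writer parameters."""
    fe_address = parameter["loadUrl"][0]  # "<FE IP address>:<HTTP port>"
    database = parameter["selectedDatabase"]
    table = parameter["table"]
    url = f"http://{fe_address}/api/{database}/{table}/_stream_load"

    credentials = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Expect": "100-continue",
        "Authorization": f"Basic {credentials}",
        # Target columns, in the order in which they appear in the body.
        "columns": ",".join(parameter["column"]),
    }
    # loadProps entries are passed through as Stream Load request headers.
    headers.update(parameter.get("loadProps", {}))
    return urllib.request.Request(url, data=body, headers=headers, method="PUT")


writer_parameter = {
    "selectedDatabase": "didb1",
    "loadProps": {"row_delimiter": "\\x02", "column_separator": "\\x01"},
    "column": ["id", "name"],
    "loadUrl": ["1.1.1.1:8030"],
    "table": "table1",
}
# One row (id=1, name=alice) that uses the delimiters configured above.
request = build_stream_load_request(writer_parameter, "user", "password",
                                    b"1\x01alice\x02")
print(request.get_method(), request.full_url)
```

The sketch stops at constructing the request because sending it requires a reachable frontend node and an HTTP client that handles the Expect header and any redirect to a backend node.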