DataWorks Data Integration allows you to use Doris Writer to write table data to Doris. This topic describes the capabilities of synchronizing data to Doris data sources.
Supported Doris versions
Doris Writer uses the MySQL driver 5.1.47. The following table describes the Doris kernel versions supported by the driver. For more information about the driver capabilities, see the official Doris documentation.
| Doris kernel version | Supported? |
| --- | --- |
| 0.x.x | Yes |
| 1.1.x | Yes |
| 1.2.x | Yes |
| 2.x | Yes |
Limits
You can use Data Integration to synchronize data to Doris only in offline mode.
Supported data types
Different Doris versions support different data types and aggregation models. For information about all data types that are supported in each Doris version, see the official Doris documentation. The following table lists the main supported data types.

| Data type | Supported data models | Supported Doris versions | Batch write by Doris Writer |
| --- | --- | --- | --- |
| SMALLINT | Aggregate, Unique, Duplicate | 0.x.x, 1.1.x, 1.2.x, 2.x | Yes |
| INT | Aggregate, Unique, Duplicate | 0.x.x, 1.1.x, 1.2.x, 2.x | Yes |
| BIGINT | Aggregate, Unique, Duplicate | 0.x.x, 1.1.x, 1.2.x, 2.x | Yes |
| LARGEINT | Aggregate, Unique, Duplicate | 0.x.x, 1.1.x, 1.2.x, 2.x | Yes |
| FLOAT | Aggregate, Unique, Duplicate | 0.x.x, 1.1.x, 1.2.x, 2.x | Yes |
| DOUBLE | Aggregate, Unique, Duplicate | 0.x.x, 1.1.x, 1.2.x, 2.x | Yes |
| DECIMAL | Aggregate, Unique, Duplicate | 0.x.x, 1.1.x, 1.2.x, 2.x | Yes |
| DECIMALV3 | Aggregate, Unique, Duplicate | Versions later than 1.2.1, 2.x | Yes |
| DATE | Aggregate, Unique, Duplicate | 0.x.x, 1.1.x, 1.2.x, 2.x | Yes |
| DATETIME | Aggregate, Unique, Duplicate | 0.x.x, 1.1.x, 1.2.x, 2.x | Yes |
| DATEV2 | Aggregate, Unique, Duplicate | 1.2.x, 2.x | Yes |
| DATETIMEV2 | Aggregate, Unique, Duplicate | 1.2.x, 2.x | Yes |
| CHAR | Aggregate, Unique, Duplicate | 0.x.x, 1.1.x, 1.2.x, 2.x | Yes |
| VARCHAR | Aggregate, Unique, Duplicate | 0.x.x, 1.1.x, 1.2.x, 2.x | Yes |
| STRING | Aggregate, Unique, Duplicate | 0.x.x, 1.1.x, 1.2.x, 2.x | Yes |
| VARCHAR | Aggregate, Unique, Duplicate | 1.1.x, 1.2.x, 2.x | Yes |
| ARRAY | Duplicate | 1.2.x, 2.x | Yes |
| JSONB | Aggregate, Unique, Duplicate | 1.2.x, 2.x | Yes |
| HLL | Aggregate | 0.x.x, 1.1.x, 1.2.x, 2.x | Yes |
| BITMAP | Aggregate | 0.x.x, 1.1.x, 1.2.x, 2.x | Yes |
| QUANTILE_STATE | Aggregate | 1.2.x, 2.x | Yes |
How it works
Doris Writer writes data by using the native StreamLoad method of Doris. Doris Writer caches the data that is read by a reader in memory, concatenates the data into text, and then writes the text to the Doris database in a single batch. For more information, see the official Doris documentation.
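The following Python snippet is a minimal sketch of what a StreamLoad request looks like. It is not how DataWorks invokes StreamLoad internally; the host, port, database, table, and credentials are placeholder values, and the column_separator header corresponds to the column_separator setting in loadProps:
# A minimal sketch of a StreamLoad request. All endpoint values and credentials are placeholders.
import uuid
import requests

fe_host = "192.168.0.10"   # Placeholder: private IP address of an FE node.
fe_http_port = 8030        # Placeholder: default FE HTTP port.
db, table = "example_db", "example_table"

# Two rows of CSV data that use a comma as the column delimiter.
rows = "1,2023-01-01,beijing\n2,2023-01-02,shanghai\n"

resp = requests.put(
    f"http://{fe_host}:{fe_http_port}/api/{db}/{table}/_stream_load",
    data=rows.encode("utf-8"),
    headers={
        "label": "datax_doris_writer_" + uuid.uuid4().hex,  # A globally unique label for this batch.
        "column_separator": ",",
        "format": "csv",
        "Expect": "100-continue",
    },
    auth=("doris_user", "doris_password"),
)
# The FE node may redirect (307) the request to a BE node, which is why the private
# addresses of the cluster must be reachable. See the network preparation below.
print(resp.json())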
Prepare a Doris environment before data synchronization
Before you use DataWorks to synchronize data to a Doris data source, you must prepare a Doris environment. This ensures that a data synchronization task can be configured and can synchronize data to the Doris data source as expected. The following information describes how to prepare a Doris environment for data synchronization to a Doris data source.
Preparation 1: Check the version of your Doris database
Data Integration supports only specific Doris versions. Refer to the Supported Doris versions section in this topic to check whether the version of your Doris database meets the requirements. If needed, you can download a supported Doris version from the official Doris website and install it.
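For example, you can connect to the FE node by using a MySQL client and execute a statement similar to the following one to check the kernel version. The exact variable name and output may vary across Doris versions:
-- The value of version_comment typically contains a string such as "Doris version x.x.x".
SHOW VARIABLES LIKE "version_comment";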
Preparation 2: Prepare an account that has the required permissions
You must prepare an account that can be used to log on to the Doris database and specify a password for the account for subsequent connections to the Doris database. If you want to use the default root user of Doris, take note that the root user does not have a password by default. You can execute the following SQL statement in Doris to specify a password for the root user:
SET PASSWORD FOR 'root' = PASSWORD('Password')
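If you do not want to use the root user, you can create a dedicated account and grant it the privileges that are required to query and write data. The following statements are a minimal sketch: the account name dataworks_writer, the password, and the database name example_db are placeholder values, and you can adjust the privilege list based on your security requirements:
-- Create a dedicated account. All names and the password are placeholders.
CREATE USER 'dataworks_writer'@'%' IDENTIFIED BY 'YourPassword';
-- Grant the privileges that are required to query data and load data into the destination database.
GRANT SELECT_PRIV, LOAD_PRIV ON example_db.* TO 'dataworks_writer'@'%';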
Preparation 3: Establish a network connection between the Doris database and a resource group
To use the StreamLoad method to write data, you need to access the private IP address of an FE node. If you access the public IP address of the FE node, you are redirected to the private IP address of a BE node. For more information about the redirection, see Data operation issues. To allow access to the FE node through the private IP address, you must establish a network connection between the Doris database and an exclusive resource group for Data Integration. For more information about how to establish a network connection between the Doris database and a resource group, see Network connectivity solutions.
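Before you run a synchronization task, you can verify that the private IP address and HTTP port of the FE node are reachable from the resource group. The following Python snippet is a minimal sketch of such a check; the IP address 192.168.0.10 and port 8030 are placeholder values for the private IP address and HTTP port of your FE node:
# A minimal reachability check. Run it from a machine in the same network as the
# exclusive resource group. The IP address and port are placeholder values.
import socket

fe_private_ip = "192.168.0.10"  # Placeholder: private IP address of the FE node.
fe_http_port = 8030             # Placeholder: default FE HTTP port; adjust to your cluster.

with socket.create_connection((fe_private_ip, fe_http_port), timeout=5) as conn:
    print("FE HTTP port is reachable from", conn.getsockname())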
Develop a data synchronization task
Add a data source
Before you configure a data synchronization task to synchronize data to a specific data source, you must add the data source to DataWorks. For more information, see Add and manage data sources. Take note of the configuration requirements for the following configuration items of the Doris data source:
JdbcUrl: a Java Database Connectivity (JDBC) connection string that consists of an IP address, a port number, a database name, and connection parameters. Public and private IP addresses are supported. If you use a public IP address, make sure that the resource group for Data Integration can access the host on which the Doris database resides.
FE endpoint: the IP address and port number of the FE node. If your cluster contains multiple FE nodes, you can configure multiple IP address and port number pairs. Separate the pairs with commas (,), such as ip1:port1,ip2:port2. A network connectivity test is performed for all configured FE endpoints. For sample values, see the example after this list.
Username: the username that is used to log on to the Doris database.
Password: the password that is used to log on to the Doris database.
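The following values are an example of the JdbcUrl and FE endpoint configuration items. The IP addresses and the database name are placeholders. In this example, 9030 is the default FE query (MySQL protocol) port that is used in the JDBC URL, and 8030 is the default FE HTTP port that is used in the FE endpoint; your cluster may use different ports:
JdbcUrl: jdbc:mysql://192.168.0.10:9030/example_db
FE endpoint: 192.168.0.10:8030,192.168.0.11:8030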
Configure a batch synchronization task to synchronize data of a single table
For more information about the configuration procedure, see Configure a batch synchronization task by using the codeless UI and Configure a batch synchronization task by using the code editor.
For information about all parameters that are configured and the code that is run when you use the code editor to configure a batch synchronization task, see Appendix: Code and parameters.
Appendix: Code and parameters
If you use the code editor to configure a batch synchronization task, you must configure parameters for the writer of the related data source based on the format requirements in the code editor. For more information about the format requirements, see Configure a batch synchronization task by using the code editor. The following information describes the parameters of Doris Writer in the code editor.
Code for Doris Writer
{
    "stepType": "doris", // The plug-in name.
    "parameter": {
        "postSql": [], // The SQL statements that you want to execute after the synchronization task is run.
        "preSql": [],  // The SQL statements that you want to execute before the synchronization task is run.
        "datasource": "doris_datasource", // The name of the data source.
        "table": "doris_table_name", // The name of the table.
        "column": [
            "id",
            "table_id",
            "table_no",
            "table_name",
            "table_status"
        ],
        "loadProps": {
            "column_separator": "\\x01", // The column delimiter for data in the CSV format.
            "line_delimiter": "\\x02"    // The row delimiter for data in the CSV format.
        }
    },
    "name": "Writer",
    "category": "writer"
}
Parameters in code for Doris Writer
| Parameter | Description | Required | Default value |
| --- | --- | --- | --- |
| datasource | The name of the data source. It must be the same as the name of the added data source. You can add data sources by using the code editor. | Yes | No default value |
| table | The name of the table to which you want to write data. | Yes | No default value |
| column | The names of the columns to which you want to write data. Separate the names with commas (,), such as "id","name". | Yes | No default value |
| preSql | The SQL statement that you want to execute before the synchronization task is run. For example, you can set this parameter to the SQL statement that is used to delete outdated data. You can execute only one SQL statement in the codeless UI and multiple SQL statements in the code editor. | No | No default value |
| postSql | The SQL statement that you want to execute after the synchronization task is run. For example, you can set this parameter to the SQL statement that is used to add a timestamp. You can execute only one SQL statement in the codeless UI and multiple SQL statements in the code editor. | No | No default value |
| maxBatchRows | The maximum number of rows that can be written to the destination table at a time. This parameter and the batchSize parameter together determine the size of each batch. Each time the cached data reaches the value of either parameter, the writer starts to write the data to the destination table. | No | 500000 |
| batchSize | The maximum amount of data, in bytes, that can be written to the destination table at a time. This parameter and the maxBatchRows parameter together determine the size of each batch. Each time the cached data reaches the value of either parameter, the writer starts to write the data to the destination table. | No | 104857600 |
| maxRetries | The maximum number of retries that are allowed after a batch fails to be written to the destination table. | No | 3 |
| labelPrefix | The prefix of the label of each batch that is uploaded. The final label consists of this prefix and a generated unique identifier, which ensures that the label is globally unique. | No | datax_doris_writer_ |
| loadProps | The request parameters of the StreamLoad method, which specify the format of the data to be written. By default, data is written in the CSV format. If loadProps is left unspecified, the default CSV format with the default column and row delimiters is used. To write data in the JSON format, set the format in loadProps to json (see the example after this table). | No | No default value |
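The following loadProps settings are a minimal sketch of the two common cases. The CSV settings match the defaults in the preceding sample code. For JSON-format writes, format is set to json; strip_outer_array is a StreamLoad parameter that is commonly enabled when each request body is a JSON array, and you should verify it against your Doris version:
// Write data in the default CSV format.
"loadProps": {
    "format": "csv",
    "column_separator": "\\x01",
    "line_delimiter": "\\x02"
}
// Write data in the JSON format. strip_outer_array is an assumption based on the StreamLoad parameters; verify it for your Doris version.
"loadProps": {
    "format": "json",
    "strip_outer_array": true
}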
Code for writing data to columns of aggregation types supported by Doris Writer
You can use Doris Writer to write data to columns of specific aggregation types. In this case, you must configure additional parameters.
For example, in the following Doris table, the uuid column is of the BITMAP type with the BITMAP_UNION aggregate function, and the sex column is of the HLL type with the HLL_UNION aggregate function.
CREATE TABLE `example_table_1` (
    `user_id` int(11) NULL,
    `date` varchar(10) NULL DEFAULT "10.5",
    `city` varchar(10) NULL,
    `uuid` bitmap BITMAP_UNION NULL, -- A column of an aggregation type.
    `sex` HLL HLL_UNION              -- A column of an aggregation type.
) ENGINE=OLAP
AGGREGATE KEY(`user_id`, `date`, `city`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`user_id`) BUCKETS 32;
The raw data that is to be written to the table is as follows:
user_id,date,city,uuid,sex
0,T0S4Pb,abc,43,'54'
1,T0S4Pd,fsd,34,'54'
2,T0S4Pb,fa3,53,'64'
4,T0S4Pb,fwe,87,'64'
5,T0S4Pb,gbr,90,'56'
2,iY3GiHkLF,234,100,'54'
When you use Doris Writer to write data to a column of an aggregation type, you must specify the column in writer.parameter.column and configure an aggregate function in writer.parameter.loadProps.columns. For example, you must use the aggregate function bitmap_hash for the uuid column and use the aggregate function hll_hash for the sex column.
Sample code:
{
    "stepType": "doris", // The plug-in name.
    "writer": {
        "parameter": {
            "column": [
                "user_id",
                "date",
                "city",
                "uuid", // The aggregation type is BITMAP.
                "sex"   // The aggregation type is HLL.
            ],
            "loadProps": {
                "format": "csv",
                "column_separator": "\\x01",
                "line_delimiter": "\\x02",
                "columns": "user_id,date,city,k1,uuid=bitmap_hash(k1),k2,sex=hll_hash(k2)" // Specify the aggregate functions.
            },
            "postSql": [
                "select count(1) from example_table_1"
            ],
            "preSql": [],
            "datasource": "doris_datasource", // The name of the data source.
            "table": "doris_table_name" // The name of the table.
        },
        "name": "Writer",
        "category": "writer"
    }
}