Doris data source

DataWorks Data Integration allows you to use Doris Writer to write table data to Doris. This topic describes the capabilities of synchronizing data to Doris data sources.

Supported Doris versions

Doris Writer uses the MySQL driver 5.1.47. The following table describes the Doris kernel versions supported by the driver. For more information about the driver capabilities, see the official Doris documentation.

Doris kernel version	Supported or not

Doris kernel version	Supported or not
0.x.x	Yes
1.1.x	Yes
1.2.x	Yes
2.x	Yes

Limits

You can use Data Integration to synchronize data to Doris only in offline mode.

Supported data types

Different Doris versions support different data types and aggregation models. For information about all data types supported in each Doris version, see the official Doris documentation. The following table describes the mainly supported data types.

Data type	Supported model	Doris version	Doris Writer for batch data write

Data type	Supported model	Doris version	Doris Writer for batch data write
SMALLINT	Aggregate,Unique,Duplicate	0.x.x, 1.1.x, 1.2.x, 2.x	Yes
INT	Aggregate,Unique,Duplicate	0.x.x, 1.1.x, 1.2.x, 2.x	Yes
BIGINT	Aggregate,Unique,Duplicate	0.x.x, 1.1.x, 1.2.x, 2.x	Yes
LARGEINT	Aggregate,Unique,Duplicate	0.x.x, 1.1.x, 1.2.x, 2.x	Yes
FLOAT	Aggregate,Unique,Duplicate	0.x.x, 1.1.x, 1.2.x, 2.x	Yes
DOUBLE	Aggregate,Unique,Duplicate	0.x.x, 1.1.x, 1.2.x, 2.x	Yes
DECIMAL	Aggregate,Unique,Duplicate	0.x.x, 1.1.x, 1.2.x, 2.x	Yes
DECIMALV3	Aggregate,Unique,Duplicate	Versions later than 1.2.1, 2.x	Yes
DATE	Aggregate,Unique,Duplicate	0.x.x, 1.1.x, 1.2.x, 2.x	Yes
DATETIME	Aggregate,Unique,Duplicate	0.x.x, 1.1.x, 1.2.x, 2.x	Yes
DATEV2	Aggregate,Unique,Duplicate	1.2.x, 2.x	Yes
DATATIMEV2	Aggregate,Unique,Duplicate	1.2.x, 2.x	Yes
CHAR	Aggregate,Unique,Duplicate	0.x.x, 1.1.x, 1.2.x, 2.x	Yes
VARCHAR	Aggregate,Unique,Duplicate	0.x.x, 1.1.x, 1.2.x, 2.x	Yes
STRING	Aggregate,Unique,Duplicate	0.x.x, 1.1.x, 1.2.x, 2.x	Yes
VARCHAR	Aggregate,Unique,Duplicate	1.1.x, 1.2.x, 2.x	Yes
ARRAY	Duplicate	1.2.x, 2.x	Yes
JSONB	Aggregate,Unique,Duplicate	1.2.x, 2.x	Yes
HLL	Aggregate	0.x.x, 1.1.x, 1.2.x, 2.x	Yes
BITMAP	Aggregate	0.x.x, 1.1.x, 1.2.x, 2.x	Yes
QUANTILE_STATE	Aggregate	1.2.x, 2.x	Yes

How it works

Doris Writer writes data by using the native StreamLoad method. Doris Writer caches the data that is read by a reader in the memory, concatenates the data into texts, and then writes the texts to a Doris database at a time. For more information, see the official Doris documentation.

Prepare an ApsaraDB for OceanBase environment before data synchronization

Before you use DataWorks to synchronize data to a Doris data source, you must prepare a Doris environment. This ensures that a data synchronization task can be configured and can synchronize data to the Doris data source as expected. The following information describes how to prepare a Doris environment for data synchronization to a Doris data source.

Preparation 1: Check the version of your Doris database

Data Integration supports only specific Doris versions. You can refer to the Supported Doris versions section in this topic to check whether the version of your Doris database meets the requirements. You can download the related Doris version from the official Doris website and install the Doris version.

Preparation 2: Prepare an account that has the required permissions

You must create an account that is used to log on to the Doris database for subsequent operations. You must specify a password for the account for subsequent connections to the Doris database. If you want to use the default root user of Doris to log on to the Doris database, you must specify a password for the root user. By default, the root user does not have a password. You can execute an SQL statement in Doris to specify the password:

SET PASSWORD FOR 'root' = PASSWORD('Password')

Preparation 3: Establish a network connection between the Doris database and a resource group

To use the StreamLoad method to write data, you need to access the private IP address of an FE node. If you access the public IP address of the FE node, you are redirected to the private IP address of a BE node. For more information about the redirection, see Data operation issues. In this case, you must establish network connections between your data source and a serverless resource group or an exclusive resource group for Data Integration to enable the resource group to access the data source over an internal network. For more information about how to establish a network connection between the Doris database and a resource group, see Network connectivity solutions.

Add a data source

Before you develop a synchronization task in DataWorks, you must add the required data source to DataWorks by following the instructions in Add and manage data sources. You can view the infotips of parameters in the DataWorks console to understand the meanings of the parameters when you add a data source.

Take note of the configuration requirements for the following configuration items of the Doris data source:

JdbcUrl: a Java Database Connectivity (JDBC) connection string that consists of an IP address, a port number, a database name, and connection parameters. Public and private IP addresses are supported. If you use a public IP address, make sure that the resource group for Data Integration can access the host on which the Doris database resides.
FE endpoint: the IP address and port number of the FE node. If your cluster contains multiple FE nodes, you can configure the IP addresses and port numbers of multiple FE nodes. Separate the IP address and port number pairs with a comma (,), such as ip1:port1,ip2:port2. A network connectivity test will be performed for all FE endpoints.
Username: the username that is used to log on to the Doris database.
Password: the password that is used to log on to the Doris database.

Develop a data synchronization task

For information about the entry point for and the procedure of configuring a synchronization task, see the following configuration guides.

Configure a batch synchronization task to synchronize data of a single table

For more information about the configuration procedure, see Configure a batch synchronization task by using the codeless UI and Configure a batch synchronization task by using the code editor.
For information about all parameters that are configured and the code that is run when you use the code editor to configure a batch synchronization task, see Appendix: Code and parameters.

Appendix: Code and parameters

Configure a batch synchronization task by using the code editor

If you want to configure a batch synchronization task by using the code editor, you must configure the related parameters in the script based on the unified script format requirements. For more information, see Configure a batch synchronization task by using the code editor. The following information describes the parameters that you must configure for data sources when you configure a batch synchronization task by using the code editor.

Code for ApsaraDB for Doris Reader

{
  "type": "job",
  "version": "2.0",// The version number. 
  "steps": [
    {
      "stepType": "doris",// The plug-in name. 
      "parameter": {
        "column": [// The names of the columns. 
          "id"
        ],
        "connection": [
          {
            "querySql": [
              "select a,b from join1 c join join2 d on c.id = d.id;"
            ],
            "datasource": ""// The name of the data source. 
          }
        ],
        "where": "",// The WHERE clause. 
        "splitPk": "",// The shard key. 
        "encoding": "UTF-8"// The encoding format. 
      },
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "stream",
      "parameter": {},
      "name": "Writer",
      "category": "writer"
    }
  ],
  "setting": {
    "errorLimit": {
      "record": "0"// The maximum number of dirty data records allowed. 
    },
    "speed": {
      "throttle": true,// Specifies whether to enable throttling. The value false indicates that throttling is disabled, and the value true indicates that throttling is enabled. The mbps parameter takes effect only when the throttle parameter is set to true. 
      "concurrent": 1,// The maximum number of parallel threads. 
      "mbps": "12"// The maximum transmission rate. Unit: MB/s. 
    }
  },
  "order": {
    "hops": [
      {
        "from": "Reader",
        "to": "Writer"
      }
    ]
  }
}

Parameters in code for Doris Reader

Parameter	Description	Required	Default value
datasource	The name of the data source. It must be the same as the name of the added data source. You can add data sources by using the code editor.	Yes	No default value
table	The name of the table from which you want to read data. Each synchronization task can be used to synchronize data from only one table. For a sharded table, you can use the table parameter to specify the partitions from which you want to read data. Examples: Set the table parameter to `'table_[0-99]'`. This value indicates that Doris Reader reads data from the partitions `'table_0'` to `'table_99'` of the sharded table. Set the table parameter to `'"table": ["table_00[0-9]", "table_0[10-99]", "table_[100-999]"]'`. This value indicates that Doris Reader reads data from the partitions `'table_000'` to `'table_999'` of the sharded table. You can use this method only if the numerical suffixes of all the partition names are of the same length. Note Doris Reader reads data from the columns that are specified by the column parameter in the partitions that are specified by the table parameter. If a specified partition or column does not exist, the synchronization task fails.	Yes	No default value
column	The names of the columns from which you want to read data. Specify the names in a JSON array. The default value is `[]`, which indicates all the columns in the source table. You can select specific columns to read. The column order can be changed. This indicates that you can specify columns in an order different from the order specified by the schema of the source table. Constants are supported. The column names must be arranged in compliance with the SQL syntax supported by Doris Reader, such as `["id", "table","1","'mingya.wmy'","'null'", "to_char(a+1)","2.3","true"]`. id: a column name. table: the name of a column that contains reserved keywords. 1: an integer constant. 'mingya.wmy': a string constant, which is enclosed in single quotation marks ('). null: `" "` indicates an empty string. `null` indicates a null value. `'null'` indicates the string null. to_char(a+1): a function expression that is used to calculate the length of a string. 2.3: a floating-point constant. true: a Boolean value. The column* parameter must explicitly specify all the columns from which you want to read data. This parameter cannot be left empty.	Yes	No default value
splitPk	The field that is used for data sharding when Doris Reader reads data. If you configure this parameter, data sharding is performed based on the value of this parameter, and parallel threads can be used to read data. This improves data synchronization efficiency. We recommend that you set the splitPk parameter to the name of the primary key column of the table. Data can be evenly distributed to different shards based on the primary key column, instead of being intensively distributed only to specific shards. The splitPk parameter supports sharding for data only of integer data types. If you set the splitPk parameter to a field of an unsupported data type, such as a string, floating point, or date data type, the setting of this parameter is ignored, and a single thread is used to read data. If the splitPk parameter is not provided or is left empty, a single thread is used to read data.	No	No default value
where	The WHERE clause. For example, you can set this parameter to `gmt_create > $bizdate` to read the data that is generated on the current day. You can use the WHERE clause to read incremental data. If the where parameter is not provided or is left empty, Doris Reader reads all data. Do not set the where parameter to limit 10. This value does not conform to the constraints of Doris on the SQL WHERE clause.	No	No default value
querySql (advanced parameter, which is available only in the code editor)	The SQL statement that is used for refined data filtering. If you configure this parameter, data is filtered based only on the value of this parameter. For example, if you want to join multiple tables for data synchronization, configure `select a,b from table_a join table_b on table_a.id = table_b.id`. The priority of the querySql parameter is higher than the priorities of the table, column, where, and splitPk parameters. If you configure the querySql parameter, Doris Reader ignores the settings of the table, column, where, and splitPk parameters. The system parses the information, such as the username and password, of the data source specified by the datasource parameter from the querySql parameter. Note The name of the querySql parameter is case-sensitive. For example, querysql does not take effect.	No	No default value

Code for Doris Writer

{
  "stepType": "doris",// The plug-in name. 
  "parameter":
  {
    "postSql":// The SQL statement that you want to execute after the synchronization task is run. 
    [],
    "preSql":
    [],// The SQL statement that you want to execute before the synchronization task is run. 
    "datasource":"doris_datasource",// The name of the data source. 
    "table": "doris_table_name",// The name of the table. 
    "column":
    [
      "id",
      "table_id",
      "table_no",
      "table_name",
      "table_status"
    ],
    "loadProps":{
      "column_separator": "\\x01",// The column delimiter of data in the CSV format.
      "line_delimiter": "\\x02"// The row delimiter of data in the CSV format.
    }
  },
  "name": "Writer",
  "category": "writer"
}

Parameters in code for Doris Writer

Parameter	Description	Required	Default value
datasource	The name of the data source. It must be the same as the name of the added data source. You can add data sources by using the code editor.	Yes	No default value
table	The name of the table from which you want to read data.	Yes	No default value
column	The names of the columns to which you want to write data. Separate the names with commas (,), such as `"column":["id","name","age"]`. If you want to write data to all the columns in the destination table, set this parameter to an asterisk (), such as "column":[""].	Yes	No default value
preSql	The SQL statement that you want to execute before the synchronization task is run. For example, you can set this parameter to the SQL statement that is used to delete outdated data. You can execute only one SQL statement in the codeless UI and multiple SQL statements in the code editor.	No	No default value
postSql	The SQL statement that you want to execute after the synchronization task is run. For example, you can set this parameter to the SQL statement that is used to add a timestamp. You can execute only one SQL statement on the codeless UI and multiple SQL statements in the code editor.	No	No default value
maxBatchRows	The maximum number of rows that you can write to the destination table at a time. Both this parameter and the batchSize parameter determine the number of data records that you can write to the destination table at a time. Each time the cached data reaches the value of either parameter, the writer starts to write the data to the destination table.	No	500000
batchSize	The maximum amount of data that you can write to the destination table at a time. Both this parameter and the maxBatchRows parameter determine the number of data records that you can write to the destination table at a time. Each time the cached data reaches the value of either parameter, the writer starts to write the data to the destination table.	No	104857600
maxRetries	The maximum number of retries allowed after you failed to write multiple data records to the destination table at a time.	No	3
labelPrefix	The prefix of the labels of the files that you want to upload at a time. The final label is a globally unique label that consists of `label prefix and UUIDs` to ensure that data is not repeatedly written.	No	datax_doris_writer_
loadProps	The request parameter of the StreamLoad method, which is used to specify the format of data that you want to write. By default, data is written in the CSV format. If the loadProps parameter is left unspecified, the default CSV format is used, with `\t` as the column delimiter and `\n` as the row delimiter, as shown in the following sample settings: `"loadProps": { "format":"csv", "column_separator": "\t", "line_delimiter": "\n" }` If you want to write data in the JSON format, use the following settings: `"loadProps": { "format": "json" }`	No	No default value

Code for writing data to columns of specific aggregation types (aggregation types supported by Doris Writer)

You can use Doris Writer to write data to columns of specific aggregation types. When you use Doris Writer to write data to columns of specific aggregation types, you must configure the additional parameters.

For example, in the following Doris table, uuid is of the bitmap type (aggregation type) and sex is of the HLL type (aggregation type).

CREATE TABLE `example_table_1` (
  `user_id` int(11) NULL,
  `date` varchar(10) NULL DEFAULT "10.5",
  `city` varchar(10) NULL,
  `uuid` bitmap BITMAP_UNION NULL, -- Aggregation type
  `sex` HLL HLL_UNION  -- Aggregation type
) ENGINE=OLAP AGGREGATE KEY(`user_id`, `date`,`city`)
COMMENT 'OLAP' DISTRIBUTED BY HASH(`user_id`) BUCKETS 32

Insert raw data into the table:

user_id,date,city,uuid,sex
0,T0S4Pb,abc,43,'54'
1,T0S4Pd,fsd,34,'54'
2,T0S4Pb,fa3,53,'64'
4,T0S4Pb,fwe,87,'64'
5,T0S4Pb,gbr,90,'56'
2,iY3GiHkLF,234,100,'54'

When you use Doris Writer to write data to a column of an aggregation type, you must specify the column in writer.parameter.column and configure an aggregate function in writer.parameter.loadProps.columns. For example, you must use the aggregate function bitmap_hash for the uuid column and use the aggregate function hll_hash for the sex column.

Sample code:

{
    "stepType": "doris",// The plug-in name. 
    "writer":
    {
        "parameter":
        {
            "column":
            [
                "user_id",
                "date",
                "city",
                "uuid",// The aggregation type is bitmap.
                "sex"// The aggregation type is HLL.
            ],
            "loadProps":
            {
                "format": "csv",
                "column_separator": "\\x01",
                "line_delimiter": "\\x02",
                "columns": "user_id,date,city,k1,uuid=bitmap_hash(k1),k2,sex=hll_hash(k2)"// You must specify the aggregate functions.
            },
            "postSql":
            [
                "select count(1) from example_tbl_3"
            ],
            "preSql":
            [],
            "datasource":"doris_datasource",// The name of the data source. 
                    "table": "doris_table_name",// The name of the table. 
        }
          "name": "Writer",
              "category": "writer"
    }
}

Supported Doris versions

Limits

Supported data types

How it works

Prepare an ApsaraDB for OceanBase environment before data synchronization

Preparation 1: Check the version of your Doris database

Preparation 2: Prepare an account that has the required permissions

Preparation 3: Establish a network connection between the Doris database and a resource group

Add a data source

Develop a data synchronization task

Configure a batch synchronization task to synchronize data of a single table

Appendix: Code and parameters

Configure a batch synchronization task by using the code editor

Code for ApsaraDB for Doris Reader

Parameters in code for Doris Reader

Code for Doris Writer

Parameters in code for Doris Writer

Code for writing data to columns of specific aggregation types (aggregation types supported by Doris Writer)

Sales Support

Technical Support

Connect & Report Abuse

About Alibaba Cloud

Our Global Network

Quick Start

Global Offices

Olympic Games Paris 2024 New

Stade Roland Garros – Glitz from the Past New

Place de la Concorde – “Breaking” the Barriers New

Vaires-sur-Marne Nautical Stadium – Sports with Sustainability New

International Broadcast Center – Images, Sounds, and Data that Captivate Billions New

Customer Success Stories New

Trust Center

Security & Compliance Center

Cloud Compliance Resources

Security Compliance FAQs

Product & Feature Update New

Cloud Forward

Press Room

Alibaba Cloud e-Magazine New

Alibaba Cloud in Analyst Research

Notice

Go Global Service New

Go Global Alliance with Alibaba Cloud

China Gateway Hot

Information Compliance

China Gateway - MLPS 2.0 Compliance New

China Gateway - Networking

China Gateway - Global Application Acceleration New

China Gateway - Security

China Gateway - Data Security New

ICP Support Hot

China Gateway - Omnichannel Data Mid-End New

China Gateway - Organizational Data Mid-End New

China Gateway - Business Mid-End New

China Gateway - AI Service for Conversational Chatbots New

China Gateway - Online Education

China Gateway - Domain Registration

Work at Alibaba Cloud

Experienced Professionals

Students and Graduates

Free Trial

Pricing

Promo Center

Price Reduction

Pay Less and Deploy More

FinOps

Elastic Compute Service (ECS)

Simple Application Server (SAS)

Elastic GPU Service

Elastic Desktop Service (EDS)

Object Storage Service (OSS)

Cloud Enterprise Network (CEN)

Web Application Firewall (WAF)

Domain Names

Container Compute Service (ACS)

Secure Access Service Edge (SASE)

Intelligent Media Services(IMS)

Edge Security Acceleration (ESA)(Original DCDN)

Intelligent Media Management

DingTalk Enterprise

YiDA

Alibaba Cloud Model Studio

Apsara Prime - For Easy Cloud Product Selection

Alibaba Cloud ECS - Cater All Your Cloud Hosting Needs

1TB CDN—Get Free 1 TB Outbound Traffic Plan Now

Security—Under Attack? Get Free Security Support

Short Message Service - Free Testing is Available

Elastic Compute Service (ECS) Hot

CloudBox

Compute Nest

Dedicated Host Hot

ECS Bare Metal Instance

Elastic Desktop Service (EDS) Featured

Cloud Phone Beta

Elastic GPU Service Featured

Simple Application Server (SAS) Hot

Auto Scaling

Batch Compute

Elastic High Performance Computing (E-HPC)

Super Computing Cluster (SCC)

Function Compute (FC)