
AnalyticDB:Migrate data from Doris to AnalyticDB for PostgreSQL

Last Updated:Sep 29, 2024

This topic describes how to migrate data from Doris to AnalyticDB for PostgreSQL.

Prerequisites

  • An AnalyticDB for PostgreSQL instance is created.
  • An Object Storage Service (OSS) bucket is created. To use an internal endpoint, the bucket must reside in the same region as the AnalyticDB for PostgreSQL instance.

Procedure

Step 1: Create a destination table to store data

Create a destination table in the AnalyticDB for PostgreSQL instance to store Doris data. The destination table must have the same schema as the source table. For information about the syntax, see the "Table creation statements" section of this topic.

Step 2: Import data from Doris to OSS

Doris can export data to object storage services over the S3 protocol, and OSS is compatible with the S3 protocol, so you can export the required data directly to OSS. The export capabilities of Doris vary by version.

  • In Doris 2.0 and later, you can execute the EXPORT TABLE statement to export data in the CSV, TEXT, ORC, or Parquet format. The Parquet export format of Doris is incompatible with the mainstream Parquet format. Data that contains special characters such as line feeds cannot be exported in the CSV or TEXT format. We recommend that you export data in the ORC format because the ORC format offers a faster export speed.

  • In Doris 1.2, you can execute the EXPORT TABLE statement to export data only in the CSV format. We recommend that you execute the SELECT INTO OUTFILE statement to export data in the ORC format. This export method offers a slightly slower export speed than the EXPORT TABLE statement but allows you to use the WHERE clause to filter data.

Example

Execute the following statements to export data from Doris 2.0 and Doris 1.2 to OSS:

-- Doris 2.0
EXPORT TABLE s3_test TO "s3://bucket/dir/"
PROPERTIES (
    "format" = "orc"
)
WITH s3 (
    "AWS_ENDPOINT" = "oss-cn-shanghai-internal.aliyuncs.com",
    "AWS_ACCESS_KEY" = "LTA****",
    "AWS_SECRET_KEY" = "lQEI1TSJIY0******",
    "AWS_REGION" = "shanghai"
);
-- Doris 1.2 (the c column is of the DATETIME type)
SELECT a, b, CAST(c AS string) AS c FROM s3_test INTO OUTFILE "s3://bucket/dir/"
FORMAT AS orc
PROPERTIES
(
    "AWS_ENDPOINT" = "oss-cn-shanghai-internal.aliyuncs.com",
    "AWS_ACCESS_KEY" = "LTA****",
    "AWS_SECRET_KEY" = "lQEI1TSJIY0******",
    "AWS_REGION" = "shanghai"
);
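
Because SELECT INTO OUTFILE accepts a WHERE clause, you can also export only a subset of rows. The following is a minimal sketch based on the preceding Doris 1.2 example; the filter condition is illustrative:

```sql
-- Doris 1.2: export only the rows whose c value is on or after a given date.
SELECT a, b, CAST(c AS string) AS c FROM s3_test
WHERE c >= '2024-01-01 00:00:00'
INTO OUTFILE "s3://bucket/dir/"
FORMAT AS orc
PROPERTIES
(
    "AWS_ENDPOINT" = "oss-cn-shanghai-internal.aliyuncs.com",
    "AWS_ACCESS_KEY" = "LTA****",
    "AWS_SECRET_KEY" = "lQEI1TSJIY0******",
    "AWS_REGION" = "shanghai"
);
```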
Note

If the source table contains columns of the DATETIME type, you can use only the SELECT INTO OUTFILE statement. This is because DATETIME data exported from Doris is incompatible with mainstream products and must first be cast to the STRING type.

Step 3: Import data from OSS to AnalyticDB for PostgreSQL

You can use the COPY statement or an OSS foreign table to import data from OSS to AnalyticDB for PostgreSQL.

Example

Execute the following COPY statement to import OSS data:

COPY test1 FROM 'oss://bucket/dir/'
ACCESS_KEY_ID 'LTAI5t********'
SECRET_ACCESS_KEY 'lQEI1T*******'
FORMAT AS orc
ENDPOINT 'oss-*****-internal.aliyuncs.com'
FDW 'oss_fdw';
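
Alternatively, you can load the data through an OSS foreign table. The following is a minimal sketch, assuming that the oss_fdw extension is enabled on the instance; the server, user mapping, and foreign table option names shown here are illustrative and may differ across instance versions, so check the oss_fdw documentation of your instance for the exact options:

```sql
-- Sketch: expose the exported ORC files in OSS as a foreign table,
-- then load them into the destination table with INSERT ... SELECT.
CREATE SERVER oss_server FOREIGN DATA WRAPPER oss_fdw
OPTIONS (endpoint 'oss-cn-shanghai-internal.aliyuncs.com', bucket 'bucket');

CREATE USER MAPPING FOR CURRENT_USER SERVER oss_server
OPTIONS (id 'LTAI5t********', key 'lQEI1T*******');

-- The column list must match the destination table test1;
-- the columns below are placeholders.
CREATE FOREIGN TABLE test1_oss (
    a INT,
    b INT,
    c TEXT
)
SERVER oss_server
OPTIONS (dir 'dir/', format 'orc');

INSERT INTO test1 SELECT * FROM test1_oss;
```

After the import, comparing the results of SELECT COUNT(*) on the source table in Doris and on the destination table is a quick sanity check.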

Syntax conversion

Data types

| Doris | AnalyticDB for PostgreSQL | Description |
| ----- | ------------------------- | ----------- |
| BOOLEAN | BOOLEAN | None. |
| TINYINT | SMALLINT | The TINYINT type is not supported in AnalyticDB for PostgreSQL. |
| SMALLINT | SMALLINT | None. |
| INT | INT | None. |
| BIGINT | BIGINT | None. |
| LARGEINT | DECIMAL | None. |
| FLOAT | FLOAT | None. |
| DOUBLE | DOUBLE | None. |
| DECIMAL | DECIMAL | None. |
| DATE | DATE | None. |
| DATETIME | TIMESTAMP or TIMESTAMPTZ | None. |
| CHAR | CHAR | None. |
| VARCHAR | VARCHAR | None. |
| STRING | TEXT | None. |
| HyperLogLog (HLL) | / | HLL provides an approximate count of distinct elements. For large amounts of data, HLL offers higher performance than COUNT(DISTINCT), with an error rate of approximately 1% and occasionally up to 2%. An HLL column cannot be used as a key column. The HLL_UNION aggregation type is used when a Doris table of the aggregate key model is created. You do not need to specify a length or default value for HLL columns; the length is determined by the system based on the degree of data aggregation. HLL columns can be queried or used only by the hll_union_agg(), hll_raw_agg(), hll_cardinality(), and hll_hash() functions. |
| BITMAP | / | BITMAP columns can be used in Doris tables of the aggregate or unique key model. |
| QUANTILE_STATE | / | None. |
| ARRAY | [ ] | None. |
| MAP | A custom composite type | None. |
| STRUCT | A custom composite type | None. |
| JSON | JSON | None. |
| AGG_STATE | / | None. |
| VARIANT | A custom composite type | None. |

Table creation statements

You can create tables of the following models:

Model 1: Detail model

In the detail model, no primary key or aggregate key is specified. The DUPLICATE KEY clause in the table creation statement specifies the columns by which the underlying data is sorted. A table of the detail model corresponds to an append-optimized column-oriented (AOCO) or Beam table in AnalyticDB for PostgreSQL. You can use the ORDER BY clause to specify the sort key and enable the auto-merge feature to sort data on a regular basis.

Example
-- Doris
CREATE TABLE IF NOT EXISTS example_tbl_by_default
(
    `timestamp` DATETIME NOT NULL COMMENT "The time when the log was generated",
    `type` INT NOT NULL COMMENT "The type of the log",
    `error_code` INT COMMENT "The error code",
    `error_msg` VARCHAR(1024) COMMENT "The error message",
    `op_id` BIGINT COMMENT "The owner ID",
    `op_time` DATETIME COMMENT "The time when the error was resolved"
)
DUPLICATE KEY (`timestamp`,`type`,`error_code`)
DISTRIBUTED BY HASH(`type`) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);

-- AnalyticDB for PostgreSQL
CREATE TABLE IF NOT EXISTS example_tbl_by_default
(
    "timestamp" TIMESTAMP NOT NULL ,
    "type" INT NOT NULL ,
    error_code INT ,
    error_msg VARCHAR(1024),
    op_id BIGINT,
    op_time TIMESTAMP
)
WITH(appendonly = true, orientation = column)
DISTRIBUTED BY("type")
ORDER BY("timestamp","type",error_code);

COMMENT ON COLUMN example_tbl_by_default.timestamp IS 'The time when the log was generated';

Model 2: Unique key model

In the unique key model, the UNIQUE KEY clause specifies a unique constraint that ensures the uniqueness of the key columns. A table of the unique key model corresponds to a heap table in AnalyticDB for PostgreSQL. You can use the PRIMARY KEY clause to specify the unique key.

Example
-- Doris
CREATE TABLE IF NOT EXISTS example_tbl_unique
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID",
    `username` VARCHAR(50) NOT NULL COMMENT "The nickname of the user",
    `city` VARCHAR(20) COMMENT "The city in which the user resides",
    `age` SMALLINT COMMENT "The age of the user",
    `sex` TINYINT COMMENT "The gender of the user",
    `phone` LARGEINT COMMENT "The phone number of the user",
    `address` VARCHAR(500) COMMENT "The address of the user",
    `register_time` DATETIME COMMENT "The time when the user registered"
)
UNIQUE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);

-- AnalyticDB for PostgreSQL
CREATE TABLE IF NOT EXISTS example_tbl_unique
(
    user_id BIGINT NOT NULL,
    username VARCHAR(50) NOT NULL,
    city VARCHAR(20),
    age SMALLINT,
    sex SMALLINT,
    phone BIGINT,
    address VARCHAR(500),
    register_time TIMESTAMP,
    PRIMARY KEY (user_id, username)
)
DISTRIBUTED BY (user_id);

COMMENT ON COLUMN example_tbl_unique.user_id IS 'The user ID';

Model 3: Aggregate key model

When you import data into a table of the aggregate key model, rows that have the same values in the aggregate key columns are merged into one row based on the specified aggregation types. A table of the aggregate key model corresponds to a heap table in AnalyticDB for PostgreSQL. You can create a unique constraint on the aggregate key columns and use the INSERT ON CONFLICT (UPSERT) statement to aggregate data at write time. For more information, see Use INSERT ON CONFLICT to overwrite data.

Example
-- Doris
CREATE TABLE IF NOT EXISTS example_tbl_agg1
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID",
    `date` DATE NOT NULL COMMENT "The date on which data is imported to the table",
    `city` VARCHAR(20) COMMENT "The city in which the user resides",
    `age` SMALLINT COMMENT "The age of the user",
    `sex` TINYINT COMMENT "The gender of the user",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "The most recent time when the user visited",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "The amount of money that the user spends",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "The maximum dwell time of the user",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "The minimum dwell time of the user"
)
AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);

-- AnalyticDB for PostgreSQL does not support automatic pre-aggregation.
CREATE TABLE IF NOT EXISTS example_tbl_agg1
(
    user_id BIGINT NOT NULL,
    "date" DATE NOT NULL,
    city VARCHAR(20),
    age SMALLINT,
    sex SMALLINT,
    last_visit_date TIMESTAMP DEFAULT '1970-01-01 00:00:00',
    cost BIGINT DEFAULT 0,
    max_dwell_time INT DEFAULT 0,
    min_dwell_time INT DEFAULT 99999,
    UNIQUE (user_id, "date", city, age, sex)
)
DISTRIBUTED BY(user_id);

INSERT INTO example_tbl_agg1
VALUES (10000, '2024-08-22', 'beijing', 18, 0, '2024-08-22 12:00:00', 20, 1000, 1000)
ON CONFLICT (user_id, "date", city, age, sex)
DO UPDATE SET
    last_visit_date = excluded.last_visit_date,
    cost = example_tbl_agg1.cost + excluded.cost,
    max_dwell_time = GREATEST(example_tbl_agg1.max_dwell_time, excluded.max_dwell_time),
    min_dwell_time = LEAST(example_tbl_agg1.min_dwell_time, excluded.min_dwell_time);

Partitioning and bucketing

In Doris, you can use the PARTITION BY clause to partition data and the DISTRIBUTED BY clause to further divide data into buckets. You can use the BUCKETS parameter to specify the number of buckets. In AnalyticDB for PostgreSQL, you can use the PARTITION BY clause to specify the partition key and the DISTRIBUTED BY clause to specify the distribution key.

Example
-- Doris
CREATE TABLE IF NOT EXISTS example_range_tbl
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID",
    `date` DATE NOT NULL COMMENT "The date on which data is imported to the table",
    `timestamp` DATETIME NOT NULL COMMENT "The time when data is imported to the table",
    `city` VARCHAR(20) COMMENT "The city in which the user resides",
    `age` SMALLINT COMMENT "The age of the user",
    `sex` TINYINT COMMENT "The gender of the user",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "The most recent time when the user visited",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "The amount of money that the user spends",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "The maximum dwell time of the user",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "The minimum dwell time of the user"
)
ENGINE=OLAP
AGGREGATE KEY(`user_id`, `date`, `timestamp`, `city`, `age`, `sex`)
PARTITION BY RANGE(`date`)
(
    PARTITION `p201701` VALUES LESS THAN ("2017-02-01"),
    PARTITION `p201702` VALUES LESS THAN ("2017-03-01"),
    PARTITION `p201703` VALUES LESS THAN ("2017-04-01"),
    PARTITION `p2018` VALUES [("2018-01-01"), ("2019-01-01"))
)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16
PROPERTIES
(
    "replication_num" = "1"
);

-- AnalyticDB for PostgreSQL
CREATE TABLE IF NOT EXISTS example_range_tbl
(
    user_id BIGINT NOT NULL,
    "date" DATE NOT NULL,
    city VARCHAR(20),
    age SMALLINT,
    sex SMALLINT,
    visit_date TIMESTAMP DEFAULT '1970-01-01 00:00:00',
    a_cost BIGINT DEFAULT 0,
    dwell_time INT DEFAULT 0
)
PARTITION BY RANGE("date")
(
    PARTITION p201701 START ('2017-01-01') INCLUSIVE END ('2017-02-01') EXCLUSIVE,
    PARTITION p201702 START ('2017-02-01') INCLUSIVE END ('2017-03-01') EXCLUSIVE,
    PARTITION p201703 START ('2017-03-01') INCLUSIVE END ('2017-04-01') EXCLUSIVE,
    PARTITION p2018 START ('2018-01-01') INCLUSIVE END ('2019-01-01') EXCLUSIVE
)
DISTRIBUTED BY (user_id);