This topic describes how to migrate data from Doris to AnalyticDB for PostgreSQL.
Prerequisites
An Express Connect circuit is created for an on-premises data center. For more information, see Connect an on-premises data center to a VPC by using an Express Connect circuit.
Alibaba Cloud Object Storage Service (OSS) is activated. For more information, see What is OSS?
An OSS bucket is created. For more information, see Create a bucket.
An AnalyticDB for PostgreSQL instance is created. For more information, see Create an instance.
Procedure
Step 1: Create a destination table to store data
Create a destination table in the AnalyticDB for PostgreSQL instance to store Doris data. The destination table must have the same schema as the source table. For information about the syntax, see the "Table creation statements" section of this topic.
Step 2: Import data from Doris to OSS
Doris can export data to object storage services over the S3 protocol. Because OSS is S3-compatible, you can upload the required data directly to OSS. The export capabilities of Doris vary based on the Doris version.
In Doris 2.0 and later, you can execute the EXPORT TABLE statement to export data in the CSV, TEXT, ORC, or Parquet format. However, the Parquet files exported by Doris are incompatible with the mainstream Parquet format, and data that contains special characters such as line feeds cannot be correctly exported in the CSV or TEXT format. We recommend that you export data in the ORC format because the ORC format offers a faster export speed.
In Doris 1.2, the EXPORT TABLE statement can export data only in the CSV format. We recommend that you execute the SELECT INTO OUTFILE statement to export data in the ORC format instead. This method is slightly slower than the EXPORT TABLE statement but allows you to use a WHERE clause to filter data.
Example
Execute the following statements to export data from Doris 2.0 and Doris 1.2 to OSS:
---- Doris 2.0
EXPORT TABLE s3_test TO "s3://bucket/dir/"
PROPERTIES (
"format"="orc"
)
WITH s3 (
"AWS_ENDPOINT" = "oss-cn-shanghai-internal.aliyuncs.com",
"AWS_ACCESS_KEY" = "LTA****",
"AWS_SECRET_KEY" = "lQEI1TSJIY0******",
"AWS_REGION" = "shanghai"
);
---- Doris 1.2 (the c column is of the DATETIME type)
SELECT a, b, CAST(c AS string) AS c FROM s3_test INTO OUTFILE "s3://bucket/dir/"
FORMAT AS orc
PROPERTIES
(
"AWS_ENDPOINT" = "oss-cn-shanghai-internal.aliyuncs.com",
"AWS_ACCESS_KEY" = "LTA****",
"AWS_SECRET_KEY" = "lQEI1TSJIY0******",
"AWS_REGION" = "shanghai"
);
If the source table contains columns of the DATETIME type, you must use the SELECT INTO OUTFILE statement. This is because the DATETIME data exported by Doris is incompatible with mainstream formats and must be cast to the STRING type first.
Step 3: Import data from OSS to AnalyticDB for PostgreSQL
You can use the COPY statement or an OSS foreign table to import data from OSS to AnalyticDB for PostgreSQL.
For information about how to use the COPY statement to import OSS data, see Use the COPY or UNLOAD statement to import or export data between OSS foreign tables and AnalyticDB for PostgreSQL tables.
For information about how to use an OSS foreign table to import OSS data, see Use OSS foreign tables for data lake analysis.
Example
Execute the following COPY statement to import OSS data:
COPY test1 FROM 'oss://bucket/dir/'
ACCESS_KEY_ID 'LTAI5t********'
SECRET_ACCESS_KEY 'lQEI1T*******'
FORMAT AS orc
ENDPOINT 'oss-*****-internal.aliyuncs.com'
FDW 'oss_fdw';
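Alternatively, you can load the same files through an OSS foreign table. The following statements are a minimal sketch of that method: the names oss_serv and oss_ft are hypothetical, the destination table test1 is assumed to have columns a, b, and c that match the exported files, and the OPTIONS keys are assumptions that may vary by version. For the exact oss_fdw syntax, see Use OSS foreign tables for data lake analysis.
-- Create a server and user mapping for OSS access. The option names shown
-- here are assumptions; check the oss_fdw topic for the exact keys.
CREATE SERVER oss_serv
    FOREIGN DATA WRAPPER oss_fdw
    OPTIONS (endpoint 'oss-cn-shanghai-internal.aliyuncs.com', bucket 'bucket');
CREATE USER MAPPING FOR CURRENT_USER SERVER oss_serv
    OPTIONS (id 'LTAI5t********', key 'lQEI1T*******');
-- The foreign table schema must match the exported files.
CREATE FOREIGN TABLE oss_ft (a INT, b TEXT, c TEXT)
    SERVER oss_serv
    OPTIONS (dir 'dir/', format 'orc');
-- Load the data from OSS into the destination table.
INSERT INTO test1 SELECT * FROM oss_ft;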
Syntax conversion
Data types
Doris | AnalyticDB for PostgreSQL | Description |
BOOLEAN | BOOLEAN | None. |
TINYINT | SMALLINT | The TINYINT type is not supported in AnalyticDB for PostgreSQL. |
SMALLINT | SMALLINT | None. |
INT | INT | None. |
BIGINT | BIGINT | None. |
LARGEINT | DECIMAL | None. |
FLOAT | FLOAT | None. |
DOUBLE | DOUBLE | None. |
DECIMAL | DECIMAL | None. |
DATE | DATE | None. |
DATETIME | TIMESTAMP/TIMESTAMPTZ | None. |
CHAR | CHAR | None. |
VARCHAR | VARCHAR | None. |
STRING | TEXT | None. |
HLL (HyperLogLog) | / | None. |
BITMAP | / | BITMAP columns can be used in Doris tables of the aggregate or unique key model. |
QUANTILE_STATE | / | None. |
ARRAY | [ ] | None. |
MAP | A custom composite type | None. |
STRUCT | A custom composite type | None. |
JSON | JSON | None. |
AGG_STATE | / | None. |
VARIANT | A custom composite type | None. |
Table creation statements
You can create tables of the following models:
Model 1: Detail model
In the detail model, no primary key or aggregate key is specified. The DUPLICATE KEY clause in the table creation statement specifies the columns by which the underlying data is sorted. A table of the detail model corresponds to an append-optimized column-oriented (AOCO) or Beam table in AnalyticDB for PostgreSQL, where you can use the ORDER BY clause to specify the sort key and enable the auto-merge feature to sort data on a regular basis.
Example
CREATE TABLE IF NOT EXISTS example_tbl_by_default
(
`timestamp` DATETIME NOT NULL COMMENT "The time when the log was generated",
`type` INT NOT NULL COMMENT "The type of the log",
`error_code` INT COMMENT "The error code",
`error_msg` VARCHAR(1024) COMMENT "The error message",
`op_id` BIGINT COMMENT "The owner ID",
`op_time` DATETIME COMMENT "The time when the error was resolved"
)
DUPLICATE KEY (`timestamp`,`type`,`error_code`)
DISTRIBUTED BY HASH(`type`) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);
---- AnalyticDB for PostgreSQL
CREATE TABLE IF NOT EXISTS example_tbl_by_default
(
"timestamp" TIMESTAMP NOT NULL ,
"type" INT NOT NULL ,
error_code INT ,
error_msg VARCHAR(1024),
op_id BIGINT,
op_time TIMESTAMP
)
WITH(appendonly = true, orientation = column)
DISTRIBUTED BY("type")
ORDER BY("timestamp","type",error_code);
COMMENT ON COLUMN example_tbl_by_default.timestamp IS 'The time when the log was generated';
Model 2: Unique key model
In the unique key model, the UNIQUE KEY clause is used to specify the unique constraint and ensure the uniqueness of the primary key. A table of the unique key model corresponds to a heap table in AnalyticDB for PostgreSQL. You can use the PRIMARY KEY clause to specify the unique key.
Example
CREATE TABLE IF NOT EXISTS example_tbl_unique
(
`user_id` LARGEINT NOT NULL COMMENT "The user ID",
`username` VARCHAR(50) NOT NULL COMMENT "The nickname of the user",
`city` VARCHAR(20) COMMENT "The city in which the user resides",
`age` SMALLINT COMMENT "The age of the user",
`sex` TINYINT COMMENT "The gender of the user",
`phone` LARGEINT COMMENT "The phone number of the user",
`address` VARCHAR(500) COMMENT "The address of the user",
`register_time` DATETIME COMMENT "The time when the user registered"
)
UNIQUE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);
---- AnalyticDB for PostgreSQL
CREATE TABLE IF NOT EXISTS example_tbl_unique
(
user_id BIGINT NOT NULL,
username VARCHAR(50) NOT NULL,
city VARCHAR(20),
age SMALLINT,
sex SMALLINT,
phone BIGINT,
address VARCHAR(500),
register_time TIMESTAMP,
PRIMARY KEY (user_id, username)
)
DISTRIBUTED BY (user_id);
COMMENT ON COLUMN example_tbl_unique.user_id IS 'The user ID';
Model 3: Aggregate key model
When you import data into a table of the aggregate key model, rows that have the same aggregate key column values are aggregated into one row based on the specified aggregate fields. A table of the aggregate key model corresponds to a heap table in AnalyticDB for PostgreSQL. You can create a unique index or constraint on the aggregate key columns and use the INSERT ON CONFLICT (UPSERT) statement to insert data. For more information, see Use INSERT ON CONFLICT to overwrite data.
Example
CREATE TABLE IF NOT EXISTS example_tbl_agg1
(
`user_id` LARGEINT NOT NULL COMMENT "The user ID",
`date` DATE NOT NULL COMMENT "The date on which data is imported to the table",
`city` VARCHAR(20) COMMENT "The city in which the user resides",
`age` SMALLINT COMMENT "The age of the user",
`sex` TINYINT COMMENT "The gender of the user",
`last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "The most recent time when the user visited",
`cost` BIGINT SUM DEFAULT "0" COMMENT "The amount of money that the user spends",
`max_dwell_time` INT MAX DEFAULT "0" COMMENT "The maximum dwell time of the user",
`min_dwell_time` INT MIN DEFAULT "99999" COMMENT "The minimum dwell time of the user"
)
AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);
---- AnalyticDB for PostgreSQL does not support automatic pre-aggregation.
CREATE TABLE IF NOT EXISTS example_tbl_agg1
(
user_id BIGINT NOT NULL,
"date" DATE NOT NULL,
city VARCHAR(20),
age SMALLINT,
sex SMALLINT,
last_visit_date TIMESTAMP DEFAULT '1970-01-01 00:00:00',
cost BIGINT DEFAULT 0,
max_dwell_time INT DEFAULT 0,
min_dwell_time INT DEFAULT 99999,
UNIQUE (user_id, "date", city, age, sex)
)
DISTRIBUTED BY (user_id);
INSERT INTO example_tbl_agg1
VALUES (10000, '2024-08-22', 'beijing', 18, 0, '2024-08-22 12:00:00', 20, 1000, 1000)
ON CONFLICT (user_id, "date", city, age, sex)
DO UPDATE SET
    last_visit_date = excluded.last_visit_date,
    cost = example_tbl_agg1.cost + excluded.cost,
    max_dwell_time = GREATEST(example_tbl_agg1.max_dwell_time, excluded.max_dwell_time),
    min_dwell_time = LEAST(example_tbl_agg1.min_dwell_time, excluded.min_dwell_time);
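If you insert rows with the same aggregate key values again, the ON CONFLICT clause merges them into the existing row instead of creating duplicates, which mirrors the SUM, MAX, MIN, and REPLACE behavior of the Doris aggregate key model. The following quick check, based on the table and statement above, inserts a second row for the same key and verifies the result:
-- Insert the same key again with a cost of 30 and dwell times of 2000 and 500.
INSERT INTO example_tbl_agg1
VALUES (10000, '2024-08-22', 'beijing', 18, 0, '2024-08-22 13:00:00', 30, 2000, 500)
ON CONFLICT (user_id, "date", city, age, sex)
DO UPDATE SET
    last_visit_date = excluded.last_visit_date,
    cost = example_tbl_agg1.cost + excluded.cost,
    max_dwell_time = GREATEST(example_tbl_agg1.max_dwell_time, excluded.max_dwell_time),
    min_dwell_time = LEAST(example_tbl_agg1.min_dwell_time, excluded.min_dwell_time);
-- Expected: one row with cost = 50, max_dwell_time = 2000, min_dwell_time = 500.
SELECT cost, max_dwell_time, min_dwell_time FROM example_tbl_agg1 WHERE user_id = 10000;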
Partitioning and bucketing
In Doris, you can use the PARTITION BY clause to partition data and the DISTRIBUTED BY clause to further divide data into buckets. You can use the BUCKETS parameter to specify the number of buckets. In AnalyticDB for PostgreSQL, you can use the PARTITION BY clause to specify the partition key and the DISTRIBUTED BY clause to specify the distribution key.
Example
CREATE TABLE IF NOT EXISTS example_range_tbl
(
`user_id` LARGEINT NOT NULL COMMENT "The user ID",
`date` DATE NOT NULL COMMENT "The date on which data is imported to the table",
`timestamp` DATETIME NOT NULL COMMENT "The time when data is imported to the table",
`city` VARCHAR(20) COMMENT "The city in which the user resides",
`age` SMALLINT COMMENT "The age of the user",
`sex` TINYINT COMMENT "The gender of the user",
`last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "The most recent time when the user visited",
`cost` BIGINT SUM DEFAULT "0" COMMENT "The amount of money that the user spends",
`max_dwell_time` INT MAX DEFAULT "0" COMMENT "The maximum dwell time of the user",
`min_dwell_time` INT MIN DEFAULT "99999" COMMENT "The minimum dwell time of the user"
)
ENGINE=OLAP
AGGREGATE KEY(`user_id`, `date`, `timestamp`, `city`, `age`, `sex`)
PARTITION BY RANGE(`date`)
(
PARTITION `p201701` VALUES LESS THAN ("2017-02-01"),
PARTITION `p201702` VALUES LESS THAN ("2017-03-01"),
PARTITION `p201703` VALUES LESS THAN ("2017-04-01"),
PARTITION `p2018` VALUES [("2018-01-01"), ("2019-01-01"))
)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16
PROPERTIES
(
"replication_num" = "1"
);
---- AnalyticDB for PostgreSQL
CREATE TABLE IF NOT EXISTS example_range_tbl
(
user_id BIGINT NOT NULL,
"date" DATE NOT NULL,
city VARCHAR(20),
age SMALLINT,
sex SMALLINT,
visit_date TIMESTAMP DEFAULT '1970-01-01 00:00:00',
a_cost BIGINT DEFAULT 0,
dwell_time INT DEFAULT 0
)
PARTITION BY RANGE("date")
(
PARTITION p201701 VALUES START ("2017-02-01") INCLUSIVE,
PARTITION p201702 VALUES START ("2017-03-01") INCLUSIVE,
PARTITION p201703 VALUES START ("2017-04-01") INCLUSIVE,
PARTITION p2018 VALUES START ("2018-01-01") INCLUSIVE END ("2019-01-01") EXCLUSIVE
)
DISTRIBUTED BY (user_id);
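After migration, you can add partitions for new date ranges by using the ALTER TABLE statement. The following statement is a minimal sketch that assumes the converted example_range_tbl above; the partition name and bounds are illustrative:
-- Add a range partition for 2019 data (the name and bounds are illustrative).
ALTER TABLE example_range_tbl
ADD PARTITION p2019 START ('2019-01-01') INCLUSIVE END ('2020-01-01') EXCLUSIVE;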