If you want to use ApsaraDB for SelectDB in different business scenarios, you must understand the data models supported by ApsaraDB for SelectDB. This helps you design solutions for your business. This topic describes the data models supported by ApsaraDB for SelectDB to help you design an optimal storage solution.
Background information
In ApsaraDB for SelectDB, data is logically described in tables. A table consists of rows and columns. A row indicates a row of data in a table. A column is used to describe a field in a row.
Columns can be divided into two types: key columns and value columns. From the business perspective, a key column corresponds to a dimension column and a value column corresponds to a metric column. Key columns and value columns are specified in the CREATE TABLE statement. The columns after the DUPLICATE KEY, AGGREGATE KEY, or UNIQUE KEY keyword in the CREATE TABLE statement are key columns, and the other columns are value columns.
The preceding keywords correspond to the following three data models supported by ApsaraDB for SelectDB:
Aggregate key model
Unique key model
Duplicate key model
Aggregate key model
ApsaraDB for SelectDB processes the rows that contain the same data in the key columns of a table based on the data model that you select to create the table. The Aggregate key model aggregates all the rows that contain the same data in the key columns. The data of these rows in the value columns is pre-aggregated based on the aggregation type that is specified for each value column in the CREATE TABLE statement. After aggregation, only one row of data is retained.
This way, the Aggregate key model pre-aggregates rows to reduce the amount of data to be scanned and the amount of computing resources required for queries. This model is suitable for report-based statistical analysis scenarios. However, this model is not optimal for COUNT(*) queries, which are the queries that contain the COUNT(*) function. In addition, an aggregation type is specified for each value column. When you perform aggregate queries of other types, you must consider the semantic correctness of the query statements.
The following examples describe the Aggregate key model and show you how to use the Aggregate key model.
Example 1: Import data for aggregation
The following table describes the schema of the example_tbl1 table.
Column | Type | Aggregation type | Comment |
user_id | LARGEINT | N/A | The user ID. |
date | DATE | N/A | The date on which data is written to the table. |
city | VARCHAR(20) | N/A | The city in which the user resides. |
age | SMALLINT | N/A | The age of the user. |
sex | TINYINT | N/A | The gender of the user. |
last_visit_date | DATETIME | REPLACE | The last time when the user paid a visit. |
cost | BIGINT | SUM | The amount of money that the user spends. |
max_dwell_time | INT | MAX | The maximum dwell time of the user. |
min_dwell_time | INT | MIN | The minimum dwell time of the user. |
The following sample code provides an example on how to create the example_tbl1 table. The partition and distribution information is omitted.
CREATE TABLE IF NOT EXISTS test.example_tbl1
(
`user_id` LARGEINT NOT NULL COMMENT "The user ID",
`date` DATE NOT NULL COMMENT "The date on which data is written to the table",
`city` VARCHAR(20) COMMENT "The city in which the user resides",
`age` SMALLINT COMMENT "The age of the user",
`sex` TINYINT COMMENT "The gender of the user",
`last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "The last time when the user paid a visit",
`cost` BIGINT SUM DEFAULT "0" COMMENT "The amount of money that the user spends",
`max_dwell_time` INT MAX DEFAULT "0" COMMENT "The maximum dwell time of the user",
`min_dwell_time` INT MIN DEFAULT "99999" COMMENT "The minimum dwell time of the user"
)
AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;
The preceding table is a typical fact table that records user information and visit information. In star models, user information is stored in dimension tables and visit information is stored in fact tables. In this example, for the convenience of explanation, user information and visit information are stored in the same table.
The columns in the table are divided into key columns and value columns, which correspond to dimension columns and metric columns, based on whether an aggregation type is specified. Key columns are the columns for which no aggregation type is specified, such as user_id, date, and age. Value columns are the columns for which an aggregation type is specified. When data is imported to the table, all the rows that contain the same data in the key columns are aggregated, and the data of these rows in the value columns is aggregated based on the aggregation type that is specified for each value column. After aggregation, only one row of data is retained.
The following table describes the aggregation types.
Aggregation type | Description |
SUM | Calculates the sum of the values in multiple rows. This type is applicable to numeric values. |
MIN | Calculates the minimum value. This type is applicable to numeric values. |
MAX | Calculates the maximum value. This type is applicable to numeric values. |
REPLACE | Replaces the previous values with the newly imported values. For the rows that contain the same data in dimension columns, the values in metric columns are replaced with the newly imported values based on the order in which values are imported. |
REPLACE_IF_NOT_NULL | Replaces the previous values with the newly imported values only if the newly imported values are not null. Different from the REPLACE type, this type does not replace previous values with null values. When you use this type, make sure that the default value of the column is null instead of an empty string. An empty string is not a null value and therefore replaces the previous value. |
HLL_UNION | Aggregates columns of the HyperLogLog (HLL) type by using the HLL algorithm. |
BITMAP_UNION | Aggregates columns of the BITMAP type, which performs a union aggregation of bitmaps. |
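The HLL_UNION and BITMAP_UNION aggregation types are typically used to count distinct values, such as unique visitors. The following sample code is a minimal sketch that shows how such columns might be declared and queried. The test.example_uv table and its columns are hypothetical and are not part of the examples in this topic.
CREATE TABLE IF NOT EXISTS test.example_uv
(
    `dt` DATE NOT NULL COMMENT "The date on which data is written to the table",
    `page` VARCHAR(100) NOT NULL COMMENT "The page that is visited",
    `visitor_bitmap` BITMAP BITMAP_UNION NOT NULL COMMENT "The distinct visitors, stored as a bitmap for exact counting",
    `visitor_hll` HLL HLL_UNION NOT NULL COMMENT "The distinct visitors, stored as an HLL sketch for approximate counting"
)
AGGREGATE KEY(`dt`, `page`)
DISTRIBUTED BY HASH(`dt`) BUCKETS 16;

-- Convert raw user IDs when you import data.
INSERT INTO test.example_uv VALUES
("2017-10-01", "home", TO_BITMAP(10000), HLL_HASH(10000)),
("2017-10-01", "home", TO_BITMAP(10001), HLL_HASH(10001));

-- Query the number of distinct visitors per day.
SELECT dt, BITMAP_UNION_COUNT(visitor_bitmap), HLL_UNION_AGG(visitor_hll)
FROM test.example_uv
GROUP BY dt;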
Write the data described in the following table to the example_tbl1 table.
user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
The following sample code provides an example on how to execute an SQL statement to import the data:
INSERT INTO test.example_tbl1 VALUES
(10000,"2017-10-01","Beijing",20,0,"2017-10-01 06:00:00",20,10,10),
(10000,"2017-10-01","Beijing",20,0,"2017-10-01 07:00:00",15,2,2),
(10001,"2017-10-01","Beijing",30,1,"2017-10-01 17:05:45",2,22,22),
(10002,"2017-10-02","Shanghai",20,1,"2017-10-02 12:59:12",200,5,5),
(10003,"2017-10-02","Guangzhou",32,0,"2017-10-02 11:20:00",30,11,11),
(10004,"2017-10-01","Shenzhen",35,0,"2017-10-01 10:00:15",100,3,3),
(10004,"2017-10-03","Shenzhen",35,0,"2017-10-03 10:20:22",11,6,6);
The following table describes how the data is stored after the data is written to an ApsaraDB for SelectDB instance.
user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
In the preceding table, only one row of aggregated data is retained for the user whose ID is 10000. The data of other users is consistent with the raw data. The following list analyzes how the data of the user whose ID is 10000 is aggregated:
The first five columns remain unchanged. Data starts to change from the sixth column last_visit_date.
2017-10-01 07:00:00: The aggregation type of the last_visit_date column is REPLACE. Therefore, 2017-10-01 06:00:00 is replaced with 2017-10-01 07:00:00.
Note: If the REPLACE type is used to aggregate the data imported in the same batch, the order in which the values are replaced is not fixed. In this example, the value that is stored can also be 2017-10-01 06:00:00. However, if data is imported in different batches, the data imported in the new batch replaces the data imported in the previous batch.
35: The aggregation type of the cost column is SUM. Therefore, the stored value 35 is the result of 20 plus 15.
10: The aggregation type of the max_dwell_time column is MAX. Therefore, 10 is stored as the maximum value between 10 and 2.
2: The aggregation type of the min_dwell_time column is MIN. Therefore, 2 is stored as the minimum value between 10 and 2.
After aggregation, only the aggregated data is stored in ApsaraDB for SelectDB. The detailed raw data is no longer available.
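For example, the following query is based on the example_tbl1 table and data above. It is expected to return only the single aggregated row of the user whose ID is 10000, not the two imported raw rows.
SELECT * FROM test.example_tbl1 WHERE user_id = 10000;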
Example 2: Aggregate imported data with existing data
The following sample code provides an example on how to create the example_tbl2 table. The partition and distribution information is omitted.
CREATE TABLE IF NOT EXISTS test.example_tbl2
(
`user_id` LARGEINT NOT NULL COMMENT "The user ID",
`date` DATE NOT NULL COMMENT "The date on which data is written to the table",
`city` VARCHAR(20) COMMENT "The city in which the user resides",
`age` SMALLINT COMMENT "The age of the user",
`sex` TINYINT COMMENT "The gender of the user",
`last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "The last time when the user paid a visit",
`cost` BIGINT SUM DEFAULT "0" COMMENT "The amount of money that the user spends",
`max_dwell_time` INT MAX DEFAULT "0" COMMENT "The maximum dwell time of the user",
`min_dwell_time` INT MIN DEFAULT "99999" COMMENT "The minimum dwell time of the user"
)
AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;
Write the data described in the following table to the example_tbl2 table.
user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
The following sample code provides an example on how to execute an SQL statement to import the data:
INSERT INTO test.example_tbl2 VALUES
(10000,"2017-10-01","Beijing",20,0,"2017-10-01 06:00:00",20,10,10),
(10000,"2017-10-01","Beijing",20,0,"2017-10-01 07:00:00",15,2,2),
(10001,"2017-10-01","Beijing",30,1,"2017-10-01 17:05:45",2,22,22),
(10002,"2017-10-02","Shanghai",20,1,"2017-10-02 12:59:12",200,5,5),
(10003,"2017-10-02","Guangzhou",32,0,"2017-10-02 11:20:00",30,11,11),
(10004,"2017-10-01","Shenzhen",35,0,"2017-10-01 10:00:15",100,3,3),
(10004,"2017-10-03","Shenzhen",35,0,"2017-10-03 10:20:22",11,6,6);
Then, write the data described in the following table to the example_tbl2 table.
user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 11:22:00 | 44 | 19 | 19 |
10005 | 2017-10-03 | Changsha | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 |
The following sample code provides an example on how to execute an SQL statement to import the data:
INSERT INTO test.example_tbl2 VALUES
(10004,"2017-10-03","Shenzhen",35,0,"2017-10-03 11:22:00",44,19,19),
(10005,"2017-10-03","Changsha",29,1,"2017-10-03 18:11:02",3,1,1);
The following table describes how the data is stored after the data is written to an ApsaraDB for SelectDB instance.
user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 11:22:00 | 55 | 19 | 6 |
10005 | 2017-10-03 | Changsha | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 |
In the preceding table, the existing data of the user whose ID is 10004 is aggregated with the newly imported data. The data of the user whose ID is 10005 is imported to the table.
Data is aggregated in the following three stages in ApsaraDB for SelectDB:
The extract, transform, and load (ETL) stage in each batch of imported data. In this stage, each batch of imported data is internally aggregated.
The data compaction stage in the compute cluster. In this stage, the compute cluster aggregates the data that is imported in different batches.
The data query stage. The data involved in each query is aggregated.
In different stages, data is aggregated to different degrees. For example, if a batch of data is imported, it may not be aggregated with the existing data. However, the data that you can query is aggregated. The degree to which data is aggregated is transparent to you. You need to only assume that the queried data is completely aggregated, without the need to know the aggregation stages.
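For example, the following query is based on the example_tbl2 data above. It is expected to return 155 (100 + 44 + 11) for the user whose ID is 10004, regardless of whether the underlying data has already been compacted.
SELECT SUM(cost) FROM test.example_tbl2 WHERE user_id = 10004;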
Example 3: Retain detailed data
The following table describes the schema of the example_tbl3 table.
Column | Type | Aggregation type | Comment |
user_id | LARGEINT | N/A | The user ID. |
date | DATE | N/A | The date on which data is written to the table. |
timestamp | DATETIME | N/A | The time when data is written to the table, which is accurate to seconds. |
city | VARCHAR(20) | N/A | The city in which the user resides. |
age | SMALLINT | N/A | The age of the user. |
sex | TINYINT | N/A | The gender of the user. |
last_visit_date | DATETIME | REPLACE | The last time when the user paid a visit. |
cost | BIGINT | SUM | The amount of money that the user spends. |
max_dwell_time | INT | MAX | The maximum dwell time of the user. |
min_dwell_time | INT | MIN | The minimum dwell time of the user. |
The timestamp column is added to record the time when data is written to the table, which is accurate to seconds. In addition, the AGGREGATE KEY keyword is used to specify the following aggregate key columns: AGGREGATE KEY(user_id, date, timestamp, city, age, sex).
The following sample code provides an example on how to create the example_tbl3 table. The partition and distribution information is omitted.
CREATE TABLE IF NOT EXISTS test.example_tbl3
(
`user_id` LARGEINT NOT NULL COMMENT "The user ID",
`date` DATE NOT NULL COMMENT "The date on which data is written to the table",
`timestamp` DATETIME NOT NULL COMMENT "The time when data is written to the table, which is accurate to seconds",
`city` VARCHAR(20) COMMENT "The city in which the user resides",
`age` SMALLINT COMMENT "The age of the user",
`sex` TINYINT COMMENT "The gender of the user",
`last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "The last time when the user paid a visit",
`cost` BIGINT SUM DEFAULT "0" COMMENT "The amount of money that the user spends",
`max_dwell_time` INT MAX DEFAULT "0" COMMENT "The maximum dwell time of the user",
`min_dwell_time` INT MIN DEFAULT "99999" COMMENT "The minimum dwell time of the user"
)
AGGREGATE KEY(`user_id`, `date`, `timestamp`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;
Write the data described in the following table to the example_tbl3 table.
user_id | date | timestamp | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
10000 | 2017-10-01 | 2017-10-01 08:00:05 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
10000 | 2017-10-01 | 2017-10-01 09:00:05 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
10001 | 2017-10-01 | 2017-10-01 18:12:10 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
10002 | 2017-10-02 | 2017-10-02 13:10:00 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
10003 | 2017-10-02 | 2017-10-02 13:15:00 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
10004 | 2017-10-01 | 2017-10-01 12:12:48 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
10004 | 2017-10-03 | 2017-10-03 12:38:20 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
The following sample code provides an example on how to execute an SQL statement to import the data:
INSERT INTO test.example_tbl3 VALUES
(10000,"2017-10-01","2017-10-01 08:00:05","Beijing",20,0,"2017-10-01 06:00:00",20,10,10),
(10000,"2017-10-01","2017-10-01 09:00:05","Beijing",20,0,"2017-10-01 07:00:00",15,2,2),
(10001,"2017-10-01","2017-10-01 18:12:10","Beijing",30,1,"2017-10-01 17:05:45",2,22,22),
(10002,"2017-10-02","2017-10-02 13:10:00","Shanghai",20,1,"2017-10-02 12:59:12",200,5,5),
(10003,"2017-10-02","2017-10-02 13:15:00","Guangzhou",32,0,"2017-10-02 11:20:00",30,11,11),
(10004,"2017-10-01","2017-10-01 12:12:48","Shenzhen",35,0,"2017-10-01 10:00:15",100,3,3),
(10004,"2017-10-03","2017-10-03 12:38:20","Shenzhen",35,0,"2017-10-03 10:20:22",11,6,6);
The following table describes how the data is stored after the data is written to an ApsaraDB for SelectDB instance.
user_id | date | timestamp | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
10000 | 2017-10-01 | 2017-10-01 08:00:05 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
10000 | 2017-10-01 | 2017-10-01 09:00:05 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
10001 | 2017-10-01 | 2017-10-01 18:12:10 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
10002 | 2017-10-02 | 2017-10-02 13:10:00 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
10003 | 2017-10-02 | 2017-10-02 13:15:00 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
10004 | 2017-10-01 | 2017-10-01 12:12:48 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
10004 | 2017-10-03 | 2017-10-03 12:38:20 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
The stored data is exactly the same as the imported data. No data is aggregated. After the timestamp column is added to this batch of data, the rows do not contain the same data in the key columns. In this case, the Aggregate key model does not aggregate data and retains complete detailed data.
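For example, the following query is based on the example_tbl3 data above. It is expected to return both detailed rows of the user whose ID is 10000 because the timestamp values of the two rows differ.
SELECT * FROM test.example_tbl3 WHERE user_id = 10000 ORDER BY `timestamp`;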
Unique key model
When you analyze multiple dimensions of data, you may need to ensure the uniqueness of keys, such as the uniqueness constraints on primary keys in relational databases. To meet this requirement, ApsaraDB for SelectDB provides the Unique key model. In earlier versions, the Unique key model was essentially a special case of the Aggregate key model and a simplified way to describe table schemas. The Aggregate key model uses the Merge on Read (MoR) implementation method, which does not perform well for some aggregate queries. In ApsaraDB for SelectDB V3.0, the Unique key model provides the Merge on Write (MoW) implementation method, which delivers better query performance by removing duplicates during data writes.
The Unique key model overwrites multiple rows that contain the same data in the specified key columns, and retains only the latest imported rows. This provides uniqueness constraints that are similar to those in relational databases.
The Unique key model ensures the uniqueness of primary keys for scenarios that require uniqueness constraints. This model can be used for the analysis of relational data, such as orders. If you have high-performance requirements for aggregate queries, we recommend that you use the MoW implementation method that is provided in the latest version of ApsaraDB for SelectDB. However, the Unique key model cannot take advantage of pre-aggregation such as ROLLUP in queries.
If you use the Unique key model, we recommend that you use the MoW implementation method.
The following examples describe the two different implementation methods.
MoW
The MoW implementation method of the Unique key model is completely different from the Aggregate key model. The query performance of the MoW implementation method is closer to that of the Duplicate key model described in the following section. In scenarios in which primary key constraints are required, the MoW implementation method of the Unique key model delivers better query performance than the Aggregate key model, especially in aggregate queries and queries that require indexes to filter a large amount of data.
By default, the MoW feature is disabled in ApsaraDB for SelectDB V3.0. You can enable the MoW feature by configuring the following property:
"enable_unique_key_merge_on_write" = "true"
The following table describes the schema of the example_tbl6 table.
Column | Type | Aggregation type | Comment |
user_id | BIGINT | N/A | The user ID. |
username | VARCHAR(50) | N/A | The nickname of the user. |
city | VARCHAR(20) | NONE | The city in which the user resides. |
age | SMALLINT | NONE | The age of the user. |
sex | TINYINT | NONE | The gender of the user. |
phone | LARGEINT | NONE | The phone number of the user. |
address | VARCHAR(500) | NONE | The address of the user. |
register_time | DATETIME | NONE | The time when the user is registered. |
The following sample code provides an example on how to create the example_tbl6 table and configure the table property. The partition and distribution information is omitted.
CREATE TABLE IF NOT EXISTS test.example_tbl6
(
`user_id` LARGEINT NOT NULL COMMENT "The user ID",
`username` VARCHAR(50) NOT NULL COMMENT "The nickname of the user",
`city` VARCHAR(20) COMMENT "The city in which the user resides",
`age` SMALLINT COMMENT "The age of the user",
`sex` TINYINT COMMENT "The gender of the user",
`phone` LARGEINT COMMENT "The phone number of the user",
`address` VARCHAR(500) COMMENT "The address of the user",
`register_time` DATETIME COMMENT "The time when the user is registered"
)
UNIQUE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16
PROPERTIES (
"enable_unique_key_merge_on_write" = "true"
);
The table schema created by using the preceding CREATE TABLE statement is completely different from that created by using the Aggregate key model.
In a table that uses the Unique key model and for which the MoW feature is enabled, the data to be overwritten and updated is marked for deletion in the data import stage, and the latest data is written. When you query the table, all data marked for deletion is filtered out and the latest data is read. This eliminates the data aggregation process of the MoR implementation method and supports multiple types of predicate pushdown in most cases. Therefore, this can improve performance in multiple scenarios, especially in aggregate queries.
By default, the MoW feature is disabled in ApsaraDB for SelectDB V3.0. You can enable the MoW feature by configuring the specified property when you create a table.
The MoR implementation method cannot be seamlessly upgraded to the MoW implementation method because they organize data in different ways. If you want to switch to the MoW implementation method, you must create a new table and execute a statement such as INSERT INTO unique-mow-table SELECT * FROM source_table to import the data into the new table. The delete sign and sequence column features of the Unique key model can still be used with the MoW implementation method.
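The following sample code is a minimal migration sketch. The test.source_table and test.unique_mow_table names are hypothetical, and the column list is shortened. Replace them with your own table names and columns.
-- Create a new table that uses the MoW implementation method.
CREATE TABLE IF NOT EXISTS test.unique_mow_table
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID",
    `username` VARCHAR(50) NOT NULL COMMENT "The nickname of the user",
    `city` VARCHAR(20) COMMENT "The city in which the user resides"
)
UNIQUE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16
PROPERTIES ("enable_unique_key_merge_on_write" = "true");

-- Import the data from the existing MoR table that has the same columns.
INSERT INTO test.unique_mow_table SELECT * FROM test.source_table;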
MoR
The following table describes the schema of the example_tbl4 table.
Column | Type | Primary key | Comment |
user_id | BIGINT | Yes | The user ID. |
username | VARCHAR(50) | Yes | The nickname of the user. |
city | VARCHAR(20) | No | The city in which the user resides. |
age | SMALLINT | No | The age of the user. |
sex | TINYINT | No | The gender of the user. |
phone | LARGEINT | No | The phone number of the user. |
address | VARCHAR(500) | No | The address of the user. |
register_time | DATETIME | No | The time when the user is registered. |
This is a typical table that records the basic user information. This type of data does not need to be aggregated. You need to only ensure the uniqueness of the primary key. In this example, the user_id and username columns are used as the primary key.
The following sample code provides an example on how to create the example_tbl4 table. The partition and distribution information is omitted.
CREATE TABLE IF NOT EXISTS test.example_tbl4
(
`user_id` LARGEINT NOT NULL COMMENT "The user ID",
`username` VARCHAR(50) NOT NULL COMMENT "The nickname of the user",
`city` VARCHAR(20) COMMENT "The city in which the user resides",
`age` SMALLINT COMMENT "The age of the user",
`sex` TINYINT COMMENT "The gender of the user",
`phone` LARGEINT COMMENT "The phone number of the user",
`address` VARCHAR(500) COMMENT "The address of the user",
`register_time` DATETIME COMMENT "The time when the user is registered"
)
UNIQUE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;
The following table describes the schema of the example_tbl5 table.
Column | Type | Aggregation type | Comment |
user_id | BIGINT | N/A | The user ID. |
username | VARCHAR(50) | N/A | The nickname of the user. |
city | VARCHAR(20) | REPLACE | The city in which the user resides. |
age | SMALLINT | REPLACE | The age of the user. |
sex | TINYINT | REPLACE | The gender of the user. |
phone | LARGEINT | REPLACE | The phone number of the user. |
address | VARCHAR(500) | REPLACE | The address of the user. |
register_time | DATETIME | REPLACE | The time when the user is registered. |
The following sample code provides an example on how to create the example_tbl5 table. The partition and distribution information is omitted.
CREATE TABLE IF NOT EXISTS test.example_tbl5
(
`user_id` LARGEINT NOT NULL COMMENT "The user ID",
`username` VARCHAR(50) NOT NULL COMMENT "The nickname of the user",
`city` VARCHAR(20) REPLACE COMMENT "The city in which the user resides",
`age` SMALLINT REPLACE COMMENT "The age of the user",
`sex` TINYINT REPLACE COMMENT "The gender of the user",
`phone` LARGEINT REPLACE COMMENT "The phone number of the user",
`address` VARCHAR(500) REPLACE COMMENT "The address of the user",
`register_time` DATETIME REPLACE COMMENT "The time when the user is registered"
)
AGGREGATE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;
The schema of the preceding example_tbl4 table, which uses the Unique key model, is equivalent to the schema of the example_tbl5 table, which uses the Aggregate key model. The MoR implementation method of the Unique key model can be completely replaced by the REPLACE aggregation type of the Aggregate key model. Their internal implementation and data storage methods are exactly the same.
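For example, the following statements are a sketch based on the example_tbl4 table above, with hypothetical user values. Because the second batch is imported after the first batch, the query is expected to return the newly imported city and age values.
INSERT INTO test.example_tbl4 VALUES
(10001, "user_a", "Beijing", 20, 0, 13800000001, "Haidian District", "2017-10-01 08:00:00");

INSERT INTO test.example_tbl4 VALUES
(10001, "user_a", "Shanghai", 21, 0, 13800000001, "Pudong New Area", "2017-10-01 08:00:00");

SELECT city, age FROM test.example_tbl4 WHERE user_id = 10001;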
Duplicate key model
When you analyze multiple dimensions of data, you may not need to use primary keys or aggregate data. In this case, you can use the Duplicate key model to meet your business requirements.
For the Duplicate key model, multiple rows that contain the same data in the specified key columns can be stored in the system without affecting each other. The data of these rows does not need to be pre-aggregated and no uniqueness constraints are required.
The Duplicate key model is suitable for ad hoc queries based on a dimension and can be used to analyze detailed data such as logs. However, this model cannot improve query performance by pre-aggregating data, and cannot implement automatic updates based on uniqueness constraints.
The following example describes the implementation method of the Duplicate key model.
The following table describes the schema of the example_tbl7 table.
Column | Type | Sort key | Comment |
timestamp | DATETIME | Yes | The time when the log was generated. |
type | INT | Yes | The type of the log. |
error_code | INT | Yes | The error code. |
error_msg | VARCHAR(1024) | No | The error message. |
op_id | BIGINT | No | The owner ID. |
op_time | DATETIME | No | The time when the error was handled. |
The following sample code provides an example on how to create the example_tbl7 table. The partition and distribution information is omitted.
CREATE TABLE IF NOT EXISTS test.example_tbl7
(
`timestamp` DATETIME NOT NULL COMMENT "The time when the log was generated",
`type` INT NOT NULL COMMENT "The type of the log",
`error_code` INT COMMENT "The error code",
`error_msg` VARCHAR(1024) COMMENT "The error message",
`op_id` BIGINT COMMENT "The owner ID",
`op_time` DATETIME COMMENT "The time when the error was handled"
)
DUPLICATE KEY(`timestamp`, `type`, `error_code`)
DISTRIBUTED BY HASH(`type`) BUCKETS 16;
The Duplicate key model is different from the Aggregate and Unique key models. Raw data is imported and stored as-is without being aggregated. Rows are retained even if they contain the same data. The DUPLICATE KEY specified in the CREATE TABLE statement only indicates the columns based on which the data is sorted in the underlying storage. We recommend that you select the first two to four columns as the duplicate key.
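For example, the following statements are a sketch based on the example_tbl7 table above, with hypothetical log values. Both rows are stored even though they are identical, so the query is expected to return 2.
INSERT INTO test.example_tbl7 VALUES
("2017-10-01 08:00:05", 1, 404, "The requested page is not found", 101, "2017-10-01 08:05:00"),
("2017-10-01 08:00:05", 1, 404, "The requested page is not found", 101, "2017-10-01 08:05:00");

SELECT COUNT(*) FROM test.example_tbl7;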
Differences among data models
Limits of the Aggregate key model
The Aggregate key model guarantees that the data presented to queries is completely aggregated. Data that has not yet been aggregated in the underlying storage, such as the data imported in two different batches, must still be presented in a consistent way. The following example describes how the Aggregate key model consistently presents data.
The following table describes the schema of the example_tbl8 table.
Column | Type | Aggregation type | Comment |
user_id | LARGEINT | N/A | The user ID. |
date | DATE | N/A | The date on which data is written to the table. |
cost | BIGINT | SUM | The amount of money that the user spends. |
Write the data described in the following tables to the example_tbl8 table in two batches.
The following table describes the data imported in the first batch.
user_id | date | cost |
10001 | 2017-11-20 | 50 |
10002 | 2017-11-21 | 39 |
The following table describes the data imported in the second batch.
user_id | date | cost |
10001 | 2017-11-20 | 1 |
10001 | 2017-11-21 | 5 |
10003 | 2017-11-22 | 22 |
Three rows are imported in the two batches for the user whose ID is 10001. Before the data is internally aggregated in ApsaraDB for SelectDB, the preceding five rows are stored in the underlying storage. To ensure that only the aggregated data can be queried, the system automatically adds an aggregate operator to the query engine. The following table describes the aggregated data that can be queried.
user_id | date | cost |
10001 | 2017-11-20 | 51 |
10001 | 2017-11-21 | 5 |
10002 | 2017-11-21 | 39 |
10003 | 2017-11-22 | 22 |
For example, if you execute the following SQL statement, the returned result is 5 instead of 1:
SELECT MIN(cost) FROM example_tbl8;
This way, data is consistently presented. However, query efficiency is reduced for some queries. In the following example, a COUNT(*) query is used.
SELECT COUNT(*) FROM example_tbl8;
In other databases, results are quickly returned for such a query. In actual implementation, a model can obtain the query results by counting rows and saving the count value when data is imported, or by scanning only one column to obtain the count value when data is queried, which requires a small amount of overhead. However, the Aggregate key model requires a large amount of overhead for such a query.
For example, you query the preceding data.
After you execute the SELECT COUNT(*) FROM example_tbl8; statement, the expected result is 4. However, if only the user_id column is scanned and the data is aggregated during the query, 3 is returned. If the data is not aggregated during the query, 5 is returned because five rows are imported in the two batches. Both results are wrong. To obtain the correct result 4, the model must read both the user_id and date columns and aggregate the data during the query. In other words, during a COUNT(*) query, the model must scan all aggregate key columns and aggregate the data to obtain the semantically correct result. In this example, the aggregate key columns are the user_id and date columns. Therefore, if a table contains a large number of aggregate key columns, a COUNT(*) query needs to scan a large amount of data.
If you need to frequently perform COUNT(*) queries, we recommend that you add a column whose value is fixed to 1 and whose aggregation type is SUM to simulate the COUNT(*) function. For example, the count column is added to the schema of the example_tbl8 table. The following table describes the modified schema of the table.
Column | Type | Aggregation type | Comment |
user_id | BIGINT | N/A | The user ID. |
date | DATE | N/A | The date on which data is written to the table. |
cost | BIGINT | SUM | The amount of money that the user spends. |
count | BIGINT | SUM | The column that is used to count the total number of rows. |
The value of the count column is fixed to 1. In this case, the SELECT COUNT(*) FROM table; statement is equivalent to the SELECT SUM(count) FROM table; statement, and the query efficiency of the latter is much higher than that of the former. However, this method has a limit: you cannot repeatedly import rows that contain the same data in the aggregate key columns. Otherwise, the SELECT SUM(count) FROM table; statement returns only the total number of originally imported rows, which no longer represents the semantics of the SELECT COUNT(*) FROM table; statement.
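No CREATE TABLE statement is provided for the example_tbl8 table in this topic. The following sample code is a minimal sketch that assumes the modified schema described in the preceding table and shows the equivalent query.
CREATE TABLE IF NOT EXISTS test.example_tbl8
(
    `user_id` BIGINT NOT NULL COMMENT "The user ID",
    `date` DATE NOT NULL COMMENT "The date on which data is written to the table",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "The amount of money that the user spends",
    `count` BIGINT SUM DEFAULT "1" COMMENT "The column that is used to count the total number of rows. Write 1 to this column each time you import a row"
)
AGGREGATE KEY(`user_id`, `date`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;

-- Equivalent to SELECT COUNT(*) FROM test.example_tbl8; but much cheaper,
-- provided that rows with the same data in the aggregate key columns are not repeatedly imported.
SELECT SUM(`count`) FROM test.example_tbl8;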
Alternatively, you can add a count column whose value is fixed to 1 and whose aggregation type is REPLACE. In this case, the SELECT SUM(count) FROM table; and SELECT COUNT(*) FROM table; statements return the same results. In addition, this method allows you to repeatedly import rows that contain the same data in the aggregate key columns.
MoW of the Unique key model
The MoW implementation method of the Unique key model does not have the same limits as the Aggregate key model. If you use MoW, the Unique key model adds a delete bitmap for each batch of imported rows to mark the data to be overwritten or deleted. In the preceding example, after the data of the first batch is imported, the data is stored in the following way.
The following table describes the results after the data of the first batch is imported.
user_id | date | cost | delete bit |
10001 | 2017-11-20 | 50 | false |
10002 | 2017-11-21 | 39 | false |
After the data of the second batch is imported, the duplicate rows in the first batch of data are marked as deleted. The following tables describe how the data is stored after the data of the two batches is imported.
The following table describes the results after the data imported in the first batch is marked.
user_id | date | cost | delete bit |
10001 | 2017-11-20 | 50 | true |
10002 | 2017-11-21 | 39 | false |
The following table describes the results after the data of the second batch is imported.
user_id | date | cost | delete bit |
10001 | 2017-11-20 | 1 | false |
10001 | 2017-11-21 | 5 | false |
10003 | 2017-11-22 | 22 | false |
During queries, the model does not read the data marked true in the delete bitmap and no data is aggregated. In this example, four valid rows exist in the preceding data and 4 is expected to be returned. This enables minimum overhead because only one column is scanned.
In the test environment, COUNT(*) queries that use the MoW implementation method of the Unique key model deliver 10 times higher performance than those that use the Aggregate key model.
Limits of the Duplicate key model
The Duplicate key model does not have the same limits as the Aggregate key model. This is because the Duplicate key model does not involve aggregation semantics. The model can scan a column to obtain the semantically correct results for COUNT(*) queries.
Key columns
When you use the Duplicate, Aggregate, or Unique key model, the key columns are specified in the CREATE TABLE statement. However, the key columns vary based on the data model.
For the Duplicate key model, the key columns of a table can be regarded as sort columns instead of unique identifier columns.
For the Aggregate or Unique key model, the key columns of a table are both sort columns and unique identifier columns.
Suggestions for selecting a data model
The data model is determined when a table is created and cannot be changed. Therefore, it is important to select an appropriate data model.
The Aggregate key model can reduce the amount of data to be scanned and the amount of computing resources required for queries by pre-aggregating data, which can accelerate queries. Therefore, the Aggregate key model is suitable for report-based query scenarios with fixed patterns or report-based aggregate query scenarios. However, this model is not optimal for COUNT(*) queries. In addition, an aggregation type is specified for each value column. When you perform aggregate queries of other types, you must consider the semantic correctness of query statements.
The Unique key model ensures the uniqueness of primary keys for scenarios that require uniqueness constraints. The Unique key model is suitable for the analysis of relational data, such as orders and transaction data. However, the Unique key model cannot exploit the advantage brought by pre-aggregation such as ROLLUP in queries. If you have high-performance requirements for aggregate queries, we recommend that you use the MoW implementation method that is available in ApsaraDB for SelectDB V3.0 or later.
Although the Duplicate key model cannot take advantage of pre-aggregation, this model does not have the limits of the Aggregate key model. This model can exploit the advantage of column-oriented storage, which reads data only from the required columns without the need to read data from all key columns. The Duplicate key model is suitable for ad hoc queries based on a dimension and can be used to analyze detailed data such as logs.
If you need to update only some of the columns in a row, see Partial Update for relevant suggestions.