If you want to use ApsaraDB for SelectDB in different business scenarios, you must understand the data models supported by ApsaraDB for SelectDB. This helps you design solutions for your business. This topic describes the data models supported by ApsaraDB for SelectDB to help you design an optimal storage solution.
Background information
In ApsaraDB for SelectDB, data is logically described in tables. A table consists of rows and columns. A row indicates a row of data in a table. A column is used to describe a field in a row.
Columns can be divided into two types: key columns and value columns. From the business perspective, a key column corresponds to a dimension column and a value column corresponds to a metric column. Key columns and value columns are specified in the CREATE TABLE statement. The columns after the DUPLICATE KEY, AGGREGATE KEY, or UNIQUE KEY keyword in the CREATE TABLE statement are key columns, and the other columns are value columns.
The preceding keywords correspond to the following three data models supported by ApsaraDB for SelectDB:
Aggregate key model
Unique key model
Duplicate key model
Aggregate key model
ApsaraDB for SelectDB processes the rows that contain the same data in the key columns of a table based on the data model that you select to create the table. The Aggregate key model aggregates all the rows that contain the same data in the key columns. The data of these rows in the value columns is pre-aggregated based on the aggregation type that is specified for each value column in the CREATE TABLE statement. After aggregation, only one row of data is retained.
This way, the Aggregate key model pre-aggregates rows to reduce the amount of data to be scanned and the amount of computing resources required for queries. This model is suitable for report-based statistical analysis scenarios. However, this model is not optimal for COUNT(*) queries, which are the queries that contain the COUNT(*) function. In addition, an aggregation type is specified for each value column. When you perform aggregate queries of other types, you must consider the semantic correctness of the query statements.
The following examples describe the Aggregate key model and show you how to use the Aggregate key model.
Example 1: Import data for aggregation
The following table describes the schema of the example_tbl1 table.
Column | Type | Aggregation type | Comment |
user_id | LARGEINT | N/A | The user ID. |
date | DATE | N/A | The date on which data is written to the table. |
city | VARCHAR(20) | N/A | The city in which the user resides. |
age | SMALLINT | N/A | The age of the user. |
sex | TINYINT | N/A | The gender of the user. |
last_visit_date | DATETIME | REPLACE | The last time when the user paid a visit. |
cost | BIGINT | SUM | The amount of money that the user spends. |
max_dwell_time | INT | MAX | The maximum dwell time of the user. |
min_dwell_time | INT | MIN | The minimum dwell time of the user. |
The following sample code provides an example on how to create the example_tbl1 table. The partition and distribution information is omitted.
CREATE TABLE IF NOT EXISTS test.example_tbl1
(
`user_id` LARGEINT NOT NULL COMMENT "The user ID",
`date` DATE NOT NULL COMMENT "The date on which data is written to the table",
`city` VARCHAR(20) COMMENT "The city in which the user resides",
`age` SMALLINT COMMENT "The age of the user",
`sex` TINYINT COMMENT "The gender of the user",
`last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "The last time when the user paid a visit",
`cost` BIGINT SUM DEFAULT "0" COMMENT "The amount of money that the user spends",
`max_dwell_time` INT MAX DEFAULT "0" COMMENT "The maximum dwell time of the user",
`min_dwell_time` INT MIN DEFAULT "99999" COMMENT "The minimum dwell time of the user"
)
AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;
The preceding table is a typical fact table that records user information and visit information. In star models, user information is stored in dimension tables and visit information is stored in fact tables. In this example, for the convenience of explanation, user information and visit information are stored in the same table.
The columns in the table are divided into key columns and value columns, which correspond to dimension columns and metric columns, based on whether an aggregation type is specified. Key columns are the columns for which no aggregation type is specified, such as user_id, date, and age. Value columns are the columns for which an aggregation type is specified. When data is imported to the table, all the rows that contain the same data in the key columns are aggregated, and the data of these rows in the value columns is aggregated based on the aggregation type that is specified for each value column. After aggregation, only one row of data is retained.
The following table describes the aggregation types.
Aggregation type | Description |
SUM | Calculates the sum of the values in multiple rows. This type is applicable to numeric values. |
MIN | Calculates the minimum value. This type is applicable to numeric values. |
MAX | Calculates the maximum value. This type is applicable to numeric values. |
REPLACE | Replaces the previous values with the newly imported values. For the rows that contain the same data in dimension columns, the values in metric columns are replaced with the newly imported values based on the order in which values are imported. |
REPLACE_IF_NOT_NULL | Replaces the previous values with the newly imported values only if the newly imported values are not null. Different from the REPLACE type, this type does not replace previous values with null values. When you use this type, make sure that the default value of the column is null instead of an empty string. An empty string is not a null value and therefore replaces the previous value. |
HLL_UNION | Aggregates columns of the HyperLogLog (HLL) type by using the HLL algorithm. |
BITMAP_UNION | Aggregates columns of the BITMAP type, which performs a union aggregation of bitmaps. |
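The HLL_UNION and BITMAP_UNION aggregation types are typically used to count distinct values, such as unique visitors. The following sample code is a minimal sketch that shows how such columns might be declared and queried. The test.example_uv table and its columns are hypothetical and are not part of the examples in this topic.
CREATE TABLE IF NOT EXISTS test.example_uv
(
    `dt` DATE NOT NULL COMMENT "The date on which data is written to the table",
    `page` VARCHAR(100) NOT NULL COMMENT "The page that is visited",
    `visitor_bitmap` BITMAP BITMAP_UNION NOT NULL COMMENT "The distinct visitors, stored as a bitmap for exact counting",
    `visitor_hll` HLL HLL_UNION NOT NULL COMMENT "The distinct visitors, stored as an HLL sketch for approximate counting"
)
AGGREGATE KEY(`dt`, `page`)
DISTRIBUTED BY HASH(`dt`) BUCKETS 16;

-- Convert raw user IDs when you import data.
INSERT INTO test.example_uv VALUES
("2017-10-01", "home", TO_BITMAP(10000), HLL_HASH(10000)),
("2017-10-01", "home", TO_BITMAP(10001), HLL_HASH(10001));

-- Query the number of distinct visitors per day.
SELECT dt, BITMAP_UNION_COUNT(visitor_bitmap), HLL_UNION_AGG(visitor_hll)
FROM test.example_uv
GROUP BY dt;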
Write the data described in the following table to the example_tbl1 table.
user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
The following sample code provides an example on how to execute an SQL statement to import the data:
INSERT INTO test.example_tbl1 VALUES
(10000,"2017-10-01","Beijing",20,0,"2017-10-01 06:00:00",20,10,10),
(10000,"2017-10-01","Beijing",20,0,"2017-10-01 07:00:00",15,2,2),
(10001,"2017-10-01","Beijing",30,1,"2017-10-01 17:05:45",2,22,22),
(10002,"2017-10-02","Shanghai",20,1,"2017-10-02 12:59:12",200,5,5),
(10003,"2017-10-02","Guangzhou",32,0,"2017-10-02 11:20:00",30,11,11),
(10004,"2017-10-01","Shenzhen",35,0,"2017-10-01 10:00:15",100,3,3),
(10004,"2017-10-03","Shenzhen",35,0,"2017-10-03 10:20:22",11,6,6);
The following table describes how the data is stored after the data is written to an ApsaraDB for SelectDB instance.
user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
In the preceding table, only one row of aggregated data is retained for the user whose ID is 10000. The data of other users is consistent with the raw data. The following list analyzes how the data of the user whose ID is 10000 is aggregated:
The first five columns remain unchanged. Data starts to change from the sixth column last_visit_date.
2017-10-01 07:00:00: The aggregation type of the last_visit_date column is REPLACE. Therefore, 2017-10-01 06:00:00 is replaced with 2017-10-01 07:00:00.
Note: If the REPLACE type is used to aggregate the data imported in the same batch, the order in which the values are replaced is not fixed. In this example, the value that is stored can also be 2017-10-01 06:00:00. However, if data is imported in different batches, the data imported in the new batch replaces the data imported in the previous batch.
35: The aggregation type of the cost column is SUM. Therefore, the stored value 35 is the result of 20 plus 15.
10: The aggregation type of the max_dwell_time column is MAX. Therefore, 10 is stored as the maximum value between 10 and 2.
2: The aggregation type of the min_dwell_time column is MIN. Therefore, 2 is stored as the minimum value between 10 and 2.
After aggregation, only the aggregated data is stored in ApsaraDB for SelectDB. The detailed raw data is no longer available.
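For example, the following query is based on the example_tbl1 table and data above. It is expected to return only the single aggregated row of the user whose ID is 10000, not the two imported raw rows.
SELECT * FROM test.example_tbl1 WHERE user_id = 10000;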
Example 2: Aggregate imported data with existing data
The following sample code provides an example on how to create the example_tbl2 table. The partition and distribution information is omitted.
CREATE TABLE IF NOT EXISTS test.example_tbl2
(
`user_id` LARGEINT NOT NULL COMMENT "The user ID",
`date` DATE NOT NULL COMMENT "The date on which data is written to the table",
`city` VARCHAR(20) COMMENT "The city in which the user resides",
`age` SMALLINT COMMENT "The age of the user",
`sex` TINYINT COMMENT "The gender of the user",
`last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "The last time when the user paid a visit",
`cost` BIGINT SUM DEFAULT "0" COMMENT "The amount of money that the user spends",
`max_dwell_time` INT MAX DEFAULT "0" COMMENT "The maximum dwell time of the user",
`min_dwell_time` INT MIN DEFAULT "99999" COMMENT "The minimum dwell time of the user"
)
AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;
Write the data described in the following table to the example_tbl2 table.
user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
The following sample code provides an example on how to execute an SQL statement to import the data:
INSERT INTO test.example_tbl2 VALUES
(10000,"2017-10-01","Beijing",20,0,"2017-10-01 06:00:00",20,10,10),
(10000,"2017-10-01","Beijing",20,0,"2017-10-01 07:00:00",15,2,2),
(10001,"2017-10-01","Beijing",30,1,"2017-10-01 17:05:45",2,22,22),
(10002,"2017-10-02","Shanghai",20,1,"2017-10-02 12:59:12",200,5,5),
(10003,"2017-10-02","Guangzhou",32,0,"2017-10-02 11:20:00",30,11,11),
(10004,"2017-10-01","Shenzhen",35,0,"2017-10-01 10:00:15",100,3,3),
(10004,"2017-10-03","Shenzhen",35,0,"2017-10-03 10:20:22",11,6,6);
Then, write the data described in the following table to the example_tbl2 table.
user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 11:22:00 | 44 | 19 | 19 |
10005 | 2017-10-03 | Changsha | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 |
The following sample code provides an example on how to execute an SQL statement to import the data:
INSERT INTO test.example_tbl2 VALUES
(10004,"2017-10-03","Shenzhen",35,0,"2017-10-03 11:22:00",44,19,19),
(10005,"2017-10-03","Changsha",29,1,"2017-10-03 18:11:02",3,1,1);
The following table describes how the data is stored after the data is written to an ApsaraDB for SelectDB instance.
user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 |
10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 11:22:00 | 55 | 19 | 6 |
10005 | 2017-10-03 | Changsha | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 |
In the preceding table, the existing data of the user whose ID is 10004 is aggregated with the newly imported data. The data of the user whose ID is 10005 is imported to the table.
Data is aggregated in the following three stages in ApsaraDB for SelectDB:
The extract, transform, and load (ETL) stage in each batch of imported data. In this stage, each batch of imported data is internally aggregated.
The data compaction stage in the compute cluster. In this stage, the compute cluster aggregates the data that is imported in different batches.
The data query stage. The data involved in each query is aggregated.
In different stages, data is aggregated to different degrees. For example, if a batch of data is imported, it may not be aggregated with the existing data. However, the data that you can query is aggregated. The degree to which data is aggregated is transparent to you. You need to only assume that the queried data is completely aggregated, without the need to know the aggregation stages.
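For example, the following query is based on the example_tbl2 data above. It is expected to return 155 (100 + 44 + 11) for the user whose ID is 10004, regardless of whether the underlying data has already been compacted.
SELECT SUM(cost) FROM test.example_tbl2 WHERE user_id = 10004;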
Example 3: Retain detailed data
The following table describes the schema of the example_tbl3 table.
Column | Type | Aggregation type | Comment |
user_id | LARGEINT | N/A | The user ID. |
date | DATE | N/A | The date on which data is written to the table. |
timestamp | DATETIME | N/A | The time when data is written to the table, which is accurate to seconds. |
city | VARCHAR(20) | N/A | The city in which the user resides. |
age | SMALLINT | N/A | The age of the user. |
sex | TINYINT | N/A | The gender of the user. |
last_visit_date | DATETIME | REPLACE | The last time when the user paid a visit. |
cost | BIGINT | SUM | The amount of money that the user spends. |
max_dwell_time | INT | MAX | The maximum dwell time of the user. |
min_dwell_time | INT | MIN | The minimum dwell time of the user. |
The timestamp column is added to record the time when data is written to the table, which is accurate to seconds. In addition, the AGGREGATE KEY keyword is used to specify the following aggregate key columns: AGGREGATE KEY(user_id, date, timestamp, city, age, sex).
The following sample code provides an example on how to create the example_tbl3 table. The partition and distribution information is omitted.
CREATE TABLE IF NOT EXISTS test.example_tbl3
(
`user_id` LARGEINT NOT NULL COMMENT "The user ID",
`date` DATE NOT NULL COMMENT "The date on which data is written to the table",
`timestamp` DATETIME NOT NULL COMMENT "The time when data is written to the table, which is accurate to seconds",
`city` VARCHAR(20) COMMENT "The city in which the user resides",
`age` SMALLINT COMMENT "The age of the user",
`sex` TINYINT COMMENT "The gender of the user",
`last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "The last time when the user paid a visit",
`cost` BIGINT SUM DEFAULT "0" COMMENT "The amount of money that the user spends",
`max_dwell_time` INT MAX DEFAULT "0" COMMENT "The maximum dwell time of the user",
`min_dwell_time` INT MIN DEFAULT "99999" COMMENT "The minimum dwell time of the user"
)
AGGREGATE KEY(`user_id`, `date`, `timestamp`, `city`, `age`, `sex`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;
Write the data described in the following table to the example_tbl3 table.
user_id | date | timestamp | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
10000 | 2017-10-01 | 2017-10-01 08:00:05 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
10000 | 2017-10-01 | 2017-10-01 09:00:05 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
10001 | 2017-10-01 | 2017-10-01 18:12:10 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
10002 | 2017-10-02 | 2017-10-02 13:10:00 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
10003 | 2017-10-02 | 2017-10-02 13:15:00 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
10004 | 2017-10-01 | 2017-10-01 12:12:48 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
10004 | 2017-10-03 | 2017-10-03 12:38:20 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
The following sample code provides an example on how to execute an SQL statement to import the data:
INSERT INTO test.example_tbl3 VALUES
(10000,"2017-10-01","2017-10-01 08:00:05","Beijing",20,0,"2017-10-01 06:00:00",20,10,10),
(10000,"2017-10-01","2017-10-01 09:00:05","Beijing",20,0,"2017-10-01 07:00:00",15,2,2),
(10001,"2017-10-01","2017-10-01 18:12:10","Beijing",30,1,"2017-10-01 17:05:45",2,22,22),
(10002,"2017-10-02","2017-10-02 13:10:00","Shanghai",20,1,"2017-10-02 12:59:12",200,5,5),
(10003,"2017-10-02","2017-10-02 13:15:00","Guangzhou",32,0,"2017-10-02 11:20:00",30,11,11),
(10004,"2017-10-01","2017-10-01 12:12:48","Shenzhen",35,0,"2017-10-01 10:00:15",100,3,3),
(10004,"2017-10-03","2017-10-03 12:38:20","Shenzhen",35,0,"2017-10-03 10:20:22",11,6,6);
The following table describes how the data is stored after the data is written to an ApsaraDB for SelectDB instance.
user_id | date | timestamp | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time |
10000 | 2017-10-01 | 2017-10-01 08:00:05 | Beijing | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 |
10000 | 2017-10-01 | 2017-10-01 09:00:05 | Beijing | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 |
10001 | 2017-10-01 | 2017-10-01 18:12:10 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 |
10002 | 2017-10-02 | 2017-10-02 13:10:00 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 |
10003 | 2017-10-02 | 2017-10-02 13:15:00 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 |
10004 | 2017-10-01 | 2017-10-01 12:12:48 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 |
10004 | 2017-10-03 | 2017-10-03 12:38:20 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 |
The stored data is exactly the same as the imported data. No data is aggregated. After the timestamp column is added to this batch of data, the rows do not contain the same data in the key columns. In this case, the Aggregate key model does not aggregate data and retains complete detailed data.
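For example, the following query is based on the example_tbl3 data above. It is expected to return both detailed rows of the user whose ID is 10000 because the timestamp values of the two rows differ.
SELECT * FROM test.example_tbl3 WHERE user_id = 10000 ORDER BY `timestamp`;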
Unique key model
When you analyze multiple dimensions of data, you may need to ensure the uniqueness of keys, such as the uniqueness constraints on primary keys in relational databases. To meet this requirement, ApsaraDB for SelectDB provides the Unique key model. In earlier versions, the Unique key model was essentially a special case of the Aggregate key model and a simplified way to describe table schemas. The Aggregate key model uses the Merge on Read (MoR) implementation method, which does not perform well for some aggregate queries. In ApsaraDB for SelectDB V3.0, the Unique key model provides the Merge on Write (MoW) implementation method, which delivers better query performance by removing duplicates during data writes.
The Unique key model overwrites multiple rows that contain the same data in the specified key columns, and retains only the latest imported rows. This provides uniqueness constraints that are similar to those in relational databases.
The Unique key model ensures the uniqueness of primary keys for scenarios that require uniqueness constraints. This model can be used for the analysis of relational data, such as orders. If you have high-performance requirements for aggregate queries, we recommend that you use the MoW implementation method that is provided in the latest version of ApsaraDB for SelectDB. However, the Unique key model cannot take advantage of pre-aggregation such as ROLLUP in queries.
If you use the Unique key model, we recommend that you use the MoW implementation method.
The following examples describe the two different implementation methods.
MoW
The MoW implementation method of the Unique key model is completely different from the Aggregate key model. The query performance of the MoW implementation method is closer to that of the Duplicate key model described in the following section. In scenarios in which primary key constraints are required, the MoW implementation method of the Unique key model delivers better query performance than the Aggregate key model, especially in aggregate queries and queries that require indexes to filter a large amount of data.
By default, the MoW feature is disabled in ApsaraDB for SelectDB V3.0. You can enable the MoW feature by configuring the following property:
"enable_unique_key_merge_on_write" = "true"
The following table describes the schema of the example_tbl6 table.
Column | Type | Aggregation type | Comment |
user_id | BIGINT | N/A | The user ID. |
username | VARCHAR(50) | N/A | The nickname of the user. |
city | VARCHAR(20) | NONE | The city in which the user resides. |
age | SMALLINT | NONE | The age of the user. |
sex | TINYINT | NONE | The gender of the user. |
phone | LARGEINT | NONE | The phone number of the user. |
address | VARCHAR(500) | NONE | The address of the user. |
register_time | DATETIME | NONE | The time when the user is registered. |
The following sample code provides an example on how to create the example_tbl6 table and configure the table property. The partition and distribution information is omitted.
CREATE TABLE IF NOT EXISTS test.example_tbl6
(
`user_id` LARGEINT NOT NULL COMMENT "The user ID",
`username` VARCHAR(50) NOT NULL COMMENT "The nickname of the user",
`city` VARCHAR(20) COMMENT "The city in which the user resides",
`age` SMALLINT COMMENT "The age of the user",
`sex` TINYINT COMMENT "The gender of the user",
`phone` LARGEINT COMMENT "The phone number of the user",
`address` VARCHAR(500) COMMENT "The address of the user",
`register_time` DATETIME COMMENT "The time when the user is registered"
)
UNIQUE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16
PROPERTIES (
"enable_unique_key_merge_on_write" = "true"
);
The table schema created by using the preceding CREATE TABLE statement is completely different from that created by using the Aggregate key model.
In a table that uses the Unique key model and for which the MoW feature is enabled, the data to be overwritten and updated is marked for deletion in the data import stage, and the latest data is written. When you query the table, all data marked for deletion is filtered out and the latest data is read. This eliminates the data aggregation process of the MoR implementation method and supports multiple types of predicate pushdown in most cases. Therefore, this can improve performance in multiple scenarios, especially in aggregate queries.
By default, the MoW feature is disabled in ApsaraDB for SelectDB V3.0. You can enable the MoW feature by configuring the specified property when you create a table.
The MoR implementation method cannot be seamlessly upgraded to the MoW implementation method because they organize data in different ways. If you want to switch to the MoW implementation method, you must create a new table and execute a statement such as INSERT INTO unique-mow-table SELECT * FROM source_table to import the data into the new table. The delete sign and sequence column features of the Unique key model can still be used with the MoW implementation method.
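The following sample code is a minimal migration sketch. The test.source_table and test.unique_mow_table names are hypothetical, and the column list is shortened. Replace them with your own table names and columns.
-- Create a new table that uses the MoW implementation method.
CREATE TABLE IF NOT EXISTS test.unique_mow_table
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID",
    `username` VARCHAR(50) NOT NULL COMMENT "The nickname of the user",
    `city` VARCHAR(20) COMMENT "The city in which the user resides"
)
UNIQUE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16
PROPERTIES ("enable_unique_key_merge_on_write" = "true");

-- Import the data from the existing MoR table that has the same columns.
INSERT INTO test.unique_mow_table SELECT * FROM test.source_table;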
MoR
The following table describes the schema of the example_tbl4 table.
Column | Type | Primary key | Comment |
user_id | BIGINT | Yes | The user ID. |
username | VARCHAR(50) | Yes | The nickname of the user. |
city | VARCHAR(20) | No | The city in which the user resides. |
age | SMALLINT | No | The age of the user. |
sex | TINYINT | No | The gender of the user. |
phone | LARGEINT | No | The phone number of the user. |
address | VARCHAR(500) | No | The address of the user. |
register_time | DATETIME | No | The time when the user is registered. |
This is a typical table that records the basic user information. This type of data does not need to be aggregated. You need to only ensure the uniqueness of the primary key. In this example, the user_id and username columns are used as the primary key.
The following sample code provides an example on how to create the example_tbl4 table. The partition and distribution information is omitted.
CREATE TABLE IF NOT EXISTS test.example_tbl4
(
`user_id` LARGEINT NOT NULL COMMENT "The user ID",
`username` VARCHAR(50) NOT NULL COMMENT "The nickname of the user",
`city` VARCHAR(20) COMMENT "The city in which the user resides",
`age` SMALLINT COMMENT "The age of the user",
`sex` TINYINT COMMENT "The gender of the user",
`phone` LARGEINT COMMENT "The phone number of the user",
`address` VARCHAR(500) COMMENT "The address of the user",
`register_time` DATETIME COMMENT "The time when the user is registered"
)
UNIQUE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;
The following table describes the schema of the example_tbl5 table.
Column | Type | Aggregation type | Comment |
user_id | BIGINT | N/A | The user ID. |
username | VARCHAR(50) | N/A | The nickname of the user. |
city | VARCHAR(20) | REPLACE | The city in which the user resides. |
age | SMALLINT | REPLACE | The age of the user. |
sex | TINYINT | REPLACE | The gender of the user. |
phone | LARGEINT | REPLACE | The phone number of the user. |
address | VARCHAR(500) | REPLACE | The address of the user. |
register_time | DATETIME | REPLACE | The time when the user is registered. |
The following sample code provides an example on how to create the example_tbl5 table. The partition and distribution information is omitted.
CREATE TABLE IF NOT EXISTS test.example_tbl5
(
`user_id` LARGEINT NOT NULL COMMENT "The user ID",
`username` VARCHAR(50) NOT NULL COMMENT "The nickname of the user",
`city` VARCHAR(20) REPLACE COMMENT "The city in which the user resides",
`age` SMALLINT REPLACE COMMENT "The age of the user",
`sex` TINYINT REPLACE COMMENT "The gender of the user",
`phone` LARGEINT REPLACE COMMENT "The phone number of the user",
`address` VARCHAR(500) REPLACE COMMENT "The address of the user",
`register_time` DATETIME REPLACE COMMENT "The time when the user is registered"
)
AGGREGATE KEY(`user_id`, `username`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;
The schema of the preceding example_tbl4 table, which uses the Unique key model, is equivalent to the schema of the example_tbl5 table, which uses the Aggregate key model. The MoR implementation method of the Unique key model can be completely replaced by the REPLACE aggregation type of the Aggregate key model. Their internal implementation and data storage methods are exactly the same.
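For example, the following statements are a sketch based on the example_tbl4 table above, with hypothetical user values. Because the second batch is imported after the first batch, the query is expected to return the newly imported city and age values.
INSERT INTO test.example_tbl4 VALUES
(10001, "user_a", "Beijing", 20, 0, 13800000001, "Haidian District", "2017-10-01 08:00:00");

INSERT INTO test.example_tbl4 VALUES
(10001, "user_a", "Shanghai", 21, 0, 13800000001, "Pudong New Area", "2017-10-01 08:00:00");

SELECT city, age FROM test.example_tbl4 WHERE user_id = 10001;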
Duplicate key model
When you analyze multiple dimensions of data, you may not need to use primary keys or aggregate data. In this case, you can use the Duplicate key model to meet your business requirements.
For the Duplicate key model, multiple rows that contain the same data in the specified key columns can be stored in the system without affecting each other. The data of these rows does not need to be pre-aggregated and no uniqueness constraints are required.
The Duplicate key model is suitable for ad hoc queries based on a dimension and can be used to analyze detailed data such as logs. However, this model cannot improve query performance by pre-aggregating data, and cannot implement automatic updates based on uniqueness constraints.
The following example describes the implementation method of the Duplicate key model.
The following table describes the schema of the example_tbl7 table.
Column | Type | Sort key | Comment |
timestamp | DATETIME | Yes | The time when the log was generated. |
type | INT | Yes | The type of the log. |
error_code | INT | Yes | The error code. |
error_msg | VARCHAR(1024) | No | The error message. |
op_id | BIGINT | No | The owner ID. |
op_time | DATETIME | No | The time when the error was handled. |
The following sample code provides an example on how to create the example_tbl7 table. The partition and distribution information is omitted.
CREATE TABLE IF NOT EXISTS test.example_tbl7
(
`timestamp` DATETIME NOT NULL COMMENT "The time when the log was generated",
`type` INT NOT NULL COMMENT "The type of the log",
`error_code` INT COMMENT "The error code",
`error_msg` VARCHAR(1024) COMMENT "The error message",
`op_id` BIGINT COMMENT "The owner ID",
`op_time` DATETIME COMMENT "The time when the error was handled"
)
DUPLICATE KEY(`timestamp`, `type`, `error_code`)
DISTRIBUTED BY HASH(`type`) BUCKETS 16;
The Duplicate key model is different from the Aggregate and Unique key models. Raw data is imported and stored as-is without being aggregated. Rows are retained even if they contain the same data. The DUPLICATE KEY specified in the CREATE TABLE statement only indicates the columns based on which the data is sorted in the underlying storage. We recommend that you select the first two to four columns as the duplicate key.
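For example, the following statements are a sketch based on the example_tbl7 table above, with hypothetical log values. Both rows are stored even though they are identical, so the query is expected to return 2.
INSERT INTO test.example_tbl7 VALUES
("2017-10-01 08:00:05", 1, 404, "The requested page is not found", 101, "2017-10-01 08:05:00"),
("2017-10-01 08:00:05", 1, 404, "The requested page is not found", 101, "2017-10-01 08:05:00");

SELECT COUNT(*) FROM test.example_tbl7;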
Differences among data models
Limits of the Aggregate key model
The Aggregate key model guarantees that the data presented to queries is completely aggregated. Data that has not yet been aggregated in the underlying storage, such as the data imported in two different batches, must still be presented in a consistent way. The following example describes how the Aggregate key model consistently presents data.
The following table describes the schema of the example_tbl8 table.
Column | Type | Aggregation type | Comment |
user_id | LARGEINT | N/A | The user ID. |
date | DATE | N/A | The date on which data is written to the table. |
cost | BIGINT | SUM | The amount of money that the user spends. |
Write the data described in the following tables to the example_tbl8 table in two batches.
The following table describes the data imported in the first batch.
user_id | date | cost |
10001 | 2017-11-20 | 50 |
10002 | 2017-11-21 | 39 |
The following table describes the data imported in the second batch.
user_id | date | cost |
10001 | 2017-11-20 | 1 |
10001 | 2017-11-21 | 5 |
10003 | 2017-11-22 | 22 |
Three rows are imported in the two batches for the user whose ID is 10001. Before the data is internally aggregated in ApsaraDB for SelectDB, the preceding five rows are stored in the underlying storage. To ensure that only the aggregated data can be queried, the system automatically adds an aggregate operator to the query engine. The following table describes the aggregated data that can be queried.
user_id | date | cost |
10001 | 2017-11-20 | 51 |
10001 | 2017-11-21 | 5 |
10002 | 2017-11-21 | 39 |
10003 | 2017-11-22 | 22 |
For example, if you execute the following SQL statement, the returned result is 5 instead of 1:
SELECT MIN(cost) FROM example_tbl8;
This way, data is consistently presented. However, query efficiency is reduced for some queries. In the following example, a COUNT(*) query is used.
SELECT COUNT(*) FROM example_tbl8;
In other databases, results are quickly returned for such a query. In actual implementation, a model can obtain the query results by counting rows and saving the count value when data is imported, or by scanning only one column to obtain the count value when data is queried, which requires a small amount of overhead. However, the Aggregate key model requires a large amount of overhead for such a query.
For example, you query the preceding data.
After you execute the SELECT COUNT(*) FROM example_tbl8; statement, the expected result is 4. However, if only the user_id column is scanned and the data is aggregated during the query, 3 is returned. If the data is not aggregated during the query, 5 is returned because five rows are imported in the two batches. Both results are wrong. To obtain the correct result 4, the model must read both the user_id and date columns and aggregate the data during the query. In other words, during a COUNT(*) query, the model must scan all aggregate key columns and aggregate the data to obtain the semantically correct result. In this example, the aggregate key columns are the user_id and date columns. Therefore, if a table contains a large number of aggregate key columns, a COUNT(*) query needs to scan a large amount of data.
If you need to frequently perform COUNT(*) queries, we recommend that you add a column whose value is fixed to 1 and whose aggregation type is SUM to simulate the COUNT(*) function. For example, the count column is added to the schema of the example_tbl8 table. The following table describes the modified schema of the table.
Column | Type | Aggregation type | Comment |
user_id | BIGINT | N/A | The user ID. |
date | DATE | N/A | The date on which data is written to the table. |
cost | BIGINT | SUM | The amount of money that the user spends. |
count | BIGINT | SUM | The column that is used to count the total number of rows. |
The value of the count column is fixed to 1. In this case, the SELECT COUNT(*) FROM table; statement is equivalent to the SELECT SUM(count) FROM table; statement, and the query efficiency of the latter is much higher than that of the former. However, this method has a limit: you cannot repeatedly import rows that contain the same data in the aggregate key columns. Otherwise, the SELECT SUM(count) FROM table; statement returns only the total number of originally imported rows, which no longer represents the semantics of the SELECT COUNT(*) FROM table; statement.
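No CREATE TABLE statement is provided for the example_tbl8 table in this topic. The following sample code is a minimal sketch that assumes the modified schema described in the preceding table and shows the equivalent query.
CREATE TABLE IF NOT EXISTS test.example_tbl8
(
    `user_id` BIGINT NOT NULL COMMENT "The user ID",
    `date` DATE NOT NULL COMMENT "The date on which data is written to the table",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "The amount of money that the user spends",
    `count` BIGINT SUM DEFAULT "1" COMMENT "The column that is used to count the total number of rows. Write 1 to this column each time you import a row"
)
AGGREGATE KEY(`user_id`, `date`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;

-- Equivalent to SELECT COUNT(*) FROM test.example_tbl8; but much cheaper,
-- provided that rows with the same data in the aggregate key columns are not repeatedly imported.
SELECT SUM(`count`) FROM test.example_tbl8;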
Alternatively, you can add a count column whose value is fixed to 1 and whose aggregation type is REPLACE. In this case, the SELECT SUM(count) FROM table; and SELECT COUNT(*) FROM table; statements return the same results. In addition, this method allows you to repeatedly import rows that contain the same data in the aggregate key columns.
MoW of the Unique key model
The MoW implementation method of the Unique key model does not have the same limits as the Aggregate key model. If you use MoW, the Unique key model adds a delete bitmap for each batch of imported rows to mark the data to be overwritten or deleted. In the preceding example, after the data of the first batch is imported, the data is stored in the following way.
The following table describes the results after the data of the first batch is imported.
user_id | date | cost | delete bit |
10001 | 2017-11-20 | 50 | false |
10002 | 2017-11-21 | 39 | false |
After the data of the second batch is imported, the duplicate rows in the first batch of data are marked as deleted. The following tables describe how the data is stored after the data of the two batches is imported.
The following table describes the results after the data imported in the first batch is marked.
user_id | date | cost | delete bit |
10001 | 2017-11-20 | 50 | true |
10002 | 2017-11-21 | 39 | false |
The following table describes the results after the data of the second batch is imported.
user_id | date | cost | delete bit |
10001 | 2017-11-20 | 1 | false |
10001 | 2017-11-21 | 5 | false |
10003 | 2017-11-22 | 22 | false |
During queries, the model does not read the data marked true in the delete bitmap and no data is aggregated. In this example, four valid rows exist in the preceding data and 4 is expected to be returned. This enables minimum overhead because only one column is scanned.
In the test environment, COUNT(*) queries that use the MoW implementation method of the Unique key model deliver 10 times higher performance than those that use the Aggregate key model.
Limits of the Duplicate key model
The Duplicate key model does not have the same limits as the Aggregate key model. This is because the Duplicate key model does not involve aggregation semantics. The model can scan a column to obtain the semantically correct results for COUNT(*) queries.
Key columns
When you use the Duplicate, Aggregate, or Unique key model, the key columns are specified in the CREATE TABLE statement. However, the key columns vary based on the data model.
For the Duplicate key model, the key columns of a table can be regarded as sort columns instead of unique identifier columns.
For the Aggregate or Unique key model, the key columns of a table are both sort columns and unique identifier columns.
Suggestions for selecting a data model
The data model is determined when a table is created and cannot be changed. Therefore, it is important to select an appropriate data model.
The Aggregate key model can reduce the amount of data to be scanned and the amount of computing resources required for queries by pre-aggregating data, which can accelerate queries. Therefore, the Aggregate key model is suitable for report-based query scenarios with fixed patterns or report-based aggregate query scenarios. However, this model is not optimal for COUNT(*) queries. In addition, an aggregation type is specified for each value column. When you perform aggregate queries of other types, you must consider the semantic correctness of query statements.
The Unique key model ensures the uniqueness of primary keys for scenarios that require uniqueness constraints. The Unique key model is suitable for the analysis of relational data, such as orders and transaction data. However, the Unique key model cannot exploit the advantage brought by pre-aggregation such as ROLLUP in queries. If you have high-performance requirements for aggregate queries, we recommend that you use the MoW implementation method that is available in ApsaraDB for SelectDB V3.0 or later.
Although the Duplicate key model cannot take advantage of pre-aggregation, this model does not have the limits of the Aggregate key model. This model can exploit the advantage of column-oriented storage, which reads data only from the required columns without the need to read data from all key columns. The Duplicate key model is suitable for ad hoc queries based on a dimension and can be used to analyze detailed data such as logs.
If you need to update only some of the columns in a row, see Partial Update for relevant suggestions.