Partitioning
Partitions are used to divide data into different ranges. This is similar to dividing a table into multiple child tables. This way, you can manage data in partitions. When you use partitions, take note of the following items:
You can specify one or more columns as partition key columns. Partition key columns must be key columns.
You must enclose partition key values in double quotation marks (") regardless of the partition key column type.
The number of partitions that you can create is theoretically unlimited.
If you create a table without specifying partitions, the system automatically creates a partition whose name is the same as the table and that contains the full data of the table. This partition is invisible to you and cannot be deleted or modified.
When you create a partition, the range of the partition cannot overlap with that of another partition.
Range partitioning
In most cases, time columns are used as partition key columns during range partitioning to facilitate the management of new and historical data. Range partitions allow you to specify only the upper limit by executing the VALUES LESS THAN (...)
statement. The system uses the upper limit of the previous partition as the lower limit of the current partition to create a partition whose range is left-closed and right-open. You can also specify the upper and lower limits by executing the VALUES [...
statement to create a partition whose range is left-closed and right-open.
Single-column partitioning
This section describes how the partition ranges of a table change if you execute the VALUES LESS THAN (...)
statement to create or delete partitions for the table. The following sample code provides an example:
Create a table named test_table
.
CREATE TABLE IF NOT EXISTS test_db.test_table
(
`user_id` LARGEINT NOT NULL COMMENT "The user ID",
`date` DATE NOT NULL COMMENT "The date on which data is imported to the table",
`timestamp` DATETIME NOT NULL COMMENT "The time when data is imported to the table",
`city` VARCHAR(20) COMMENT "The city in which the user resides",
`age` SMALLINT COMMENT "The age of the user",
`sex` TINYINT COMMENT "The gender of the user",
`last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "The last time when the user paid a visit",
`cost` BIGINT SUM DEFAULT "0" COMMENT "The amount of money that the user spends",
`max_dwell_time` INT MAX DEFAULT "0" COMMENT "The maximum dwell time of the user",
`min_dwell_time` INT MIN DEFAULT "99999" COMMENT "The minimum dwell time of the user"
)ENGINE=OLAP
AGGREGATE KEY(`user_id`, `date`, `timestamp`, `city`, `age`, `sex`)
PARTITION BY RANGE(`date`)
(
PARTITION `p201701` VALUES LESS THAN ("2017-02-01"),
PARTITION `p201702` VALUES LESS THAN ("2017-03-01"),
PARTITION `p201703` VALUES LESS THAN ("2017-04-01")
)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;
After the test_table
table is created, the following three partitions are automatically created:
p201701: [MIN_VALUE, 2017-02-01)
p201702: [2017-02-01, 2017-03-01)
p201703: [2017-03-01, 2017-04-01)
Execute the ALTER TABLE test_db.test_table ADD PARTITION p201705 VALUES LESS THAN ("2017-06-01");
statement to create a partition named p201705
. The following sample code shows the partitioning results:
p201701: [MIN_VALUE, 2017-02-01)
p201702: [2017-02-01, 2017-03-01)
p201703: [2017-03-01, 2017-04-01)
p201705: [2017-04-01, 2017-06-01)
Execute the ALTER TABLE test_db.test_table DROP PARTITION p201703;
statement to delete the p201703
partition. The following sample code shows the partitioning results:
p201701: [MIN_VALUE, 2017-02-01)
p201702: [2017-02-01, 2017-03-01)
p201705: [2017-04-01, 2017-06-01)
Important
In the preceding example, after the p201703
partition is deleted, the ranges of the p201702
and p201705
partitions remain unchanged. However, the range [2017-03-01,2017-04-01) between the two ranges is vacant. The existing data in this range is also deleted. In this case, if the data to be imported is within the vacant range, the data cannot be imported.
Delete the p201702
partition. The following sample code shows the partitioning results:
p201701: [MIN_VALUE, 2017-02-01)
p201705: [2017-04-01, 2017-06-01)
The vacant range becomes [2017-02-01,2017-04-01).
Execute the `p201702new` VALUES LESS THAN ("2017-03-01")
statement to create a partition. The following sample code shows the partitioning results:
p201701: [MIN_VALUE, 2017-02-01)
p201702new: [2017-02-01, 2017-03-01)
p201705: [2017-04-01, 2017-06-01)
The vacant range becomes [2017-03-01,2017-04-01).
Delete the p201701 partition and execute the `p201612` VALUES LESS THAN ("2017-01-01")
statement to create a partition. The following sample code shows the partitioning results:
p201612: [MIN_VALUE, 2017-01-01)
p201702new: [2017-02-01, 2017-03-01)
p201705: [2017-04-01, 2017-06-01)
The vacant ranges become [2017-01-01,2017-02-01) and [2017-03-01,2017-04-01).
The preceding example shows that the ranges of existing partitions remain unchanged after you delete partitions. However, vacant ranges may occur. If you execute the VALUES LESS THAN (...)
statement to create a partition, the lower limit of the partition must be adjacent to the upper limit of the previous partition.
Multiple-column partitioning
When you create partitions for a table, you can partition data based on multiple columns. The following sample code provides an example:
PARTITION BY RANGE(`date`, `id`)
(
PARTITION `p201701_1000` VALUES LESS THAN ("2017-02-01", "1000"),
PARTITION `p201702_2000` VALUES LESS THAN ("2017-03-01", "2000"),
PARTITION `p201703_all` VALUES LESS THAN ("2017-04-01")
)
In this example, the date
and id
columns are specified as partition key columns. The date column is of the DATE type, and the id column is of the INT type. The following sample code shows the partitioning results:
* p201701_1000: [(MIN_VALUE, MIN_VALUE), ("2017-02-01", "1000") )
* p201702_2000: [("2017-02-01", "1000"), ("2017-03-01", "2000") )
* p201703_all: [("2017-03-01", "2000"), ("2017-04-01", MIN_VALUE))
In the last partition, only the value of the date
column is specified. By default, MIN_VALUE
is used as the value of the id
column. When you insert data, the system compares the data with the specified partition key values in sequence to determine the partition into which data is inserted. The following sample code provides an example:
* Data --> Partition
* 2017-01-01, 200 --> p201701_1000
* 2017-01-01, 2000 --> p201701_1000
* 2017-02-01, 100 --> p201701_1000
* 2017-02-01, 2000 --> p201702_2000
* 2017-02-15, 5000 --> p201702_2000
* 2017-03-01, 2000 --> p201703_all
* 2017-03-10, 1 --> p201703_all
* 2017-04-01, 1000 --> Failed to be imported.
* 2017-05-01, 1000 --> Failed to be imported.
List partitioning
List partitioning supports partition key columns of the following data types: BOOLEAN, TINYINT, SMALLINT, INT, BIGINT, LARGEINT, DATE, DATETIME, CHAR, and VARCHAR.
Partition key values are enumeration values. The data to be imported hits a partition only if the data contains one of the enumeration values of the partition.
You can specify the enumeration values contained in each partition by executing the VALUES IN (...) statement.
Single-column partitioning
This section describes how the partitions of a table change if you execute the VALUES IN (...)
statement to create or delete partitions for the table. The following sample code provides an example:
Create a table named test_table1
.
CREATE TABLE IF NOT EXISTS test_db.example_list_tbl1
(
`user_id` LARGEINT NOT NULL COMMENT "The user ID",
`date` DATE NOT NULL COMMENT "The date on which data is imported to the table",
`timestamp` DATETIME NOT NULL COMMENT "The time when data is imported to the table",
`city` VARCHAR(20) NOT NULL COMMENT "The city in which the user resides",
`age` SMALLINT COMMENT "The age of the user",
`sex` TINYINT COMMENT "The gender of the user",
`last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "The last time when the user paid a visit",
`cost` BIGINT SUM DEFAULT "0" COMMENT "The amount of money that the user spends",
`max_dwell_time` INT MAX DEFAULT "0" COMMENT "The maximum dwell time of the user",
`min_dwell_time` INT MIN DEFAULT "99999" COMMENT "The minimum dwell time of the user"
)
ENGINE=olap
AGGREGATE KEY(`user_id`, `date`, `timestamp`, `city`, `age`, `sex`)
PARTITION BY LIST(`city`)
(
PARTITION `p_cn` VALUES IN ("Beijing", "Shanghai", "Hong Kong"),
PARTITION `p_usa` VALUES IN ("New York", "San Francisco"),
PARTITION `p_jp` VALUES IN ("Tokyo")
)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16;
After the test_table1
table is created, the following three partitions are automatically created:
p_cn: ("Beijing", "Shanghai", "Hong Kong")
p_usa: ("New York", "San Francisco")
p_jp: ("Tokyo")
Execute the `p_uk` VALUES IN ("London")
statement to create a partition. The following sample code shows the partitioning results:
p_cn: ("Beijing", "Shanghai", "Hong Kong")
p_usa: ("New York", "San Francisco")
p_jp: ("Tokyo")
p_uk: ("London")
Delete the p_jp
partition. The following sample code shows the partitioning results:
p_cn: ("Beijing", "Shanghai", "Hong Kong")
p_usa: ("New York", "San Francisco")
p_uk: ("London")
Multiple-column partitioning
When you create partitions for a table, you can partition data based on multiple columns. The following sample code provides an example:
PARTITION BY LIST(`id`, `city`)
(
PARTITION `p1_city` VALUES IN (("1", "Beijing"), ("1", "Shanghai")),
PARTITION `p2_city` VALUES IN (("2", "Beijing"), ("2", "Shanghai")),
PARTITION `p3_city` VALUES IN (("3", "Beijing"), ("3", "Shanghai"))
)
In this example, the id
and city
columns are specified as partition key columns. The id column is of the INT type, and the city column is of the VARCHAR type. The following sample code shows the partitioning results:
* p1_city: [("1", "Beijing"), ("1", "Shanghai")]
* p2_city: [("2", "Beijing"), ("2", "Shanghai")]
* p3_city: [("3", "Beijing"), ("3", "Shanghai")]
When you insert data, the system compares the data with the specified partition key values in sequence to determine the partition into which data is inserted. The following sample code provides an example:
* Data ---> Partition
* 1, Beijing ---> p1_city
* 1, Shanghai ---> p1_city
* 2, Shanghai ---> p2_city
* 3, Beijing ---> p3_city
* 1, Tianjin ---> Failed to be imported.
* 4, Beijing ---> Failed to be imported.