This topic describes how to design the schema of an AnalyticDB for MySQL table to optimize the table performance. The schema includes the table type, distribution key, partition key, primary key, and clustered index of the AnalyticDB for MySQL table.
Select a table type
AnalyticDB for MySQL supports replicated tables and standard tables. When you select a table type, take note of the following items:
A replicated table stores a data replica on each node of the AnalyticDB for MySQL cluster to which the table belongs. We recommend that you limit the volume of data stored in each replicated table to a maximum of 20,000 rows.
A standard table is a partitioned table and can make full use of the query capability of a distributed system to improve query efficiency. Each standard table can store up to hundreds of billions of data entries.
Select a distribution key
If you want to import incremental data, you can specify a distribution key and a partition key when you create a standard table. Use the DISTRIBUTED BY HASH(column_name,...)
clause to specify the distribution key. Then, the table is divided into shards based on the hash values of the column_name
field. For more information, see CREATE TABLE.
Syntax
DISTRIBUTED BY HASH(column_name,...)
Usage notes
Select fields whose values are evenly distributed as the distribution key, such as the transaction ID, device ID, user ID, or auto-increment columns.
NoteDo not select fields of the date, time, and timestamp types as the distribution key. The preceding fields are prone to data skew during data writing and may degrade write performance. Most queries are limited to a period of time such as a day or a month. In this case, the data that you want to query may exist on only a single node, and queries cannot leverage the processing capabilities of all nodes in a distributed database system. We recommend that you select fields of the date and time types as the partition key. For more information, see the "Select a partition key" section of this topic.
To reduce data shuffles, select fields that can be used to join tables as the distribution key. For example, if you want to query historical orders based on the customer, you can select the
customer_id
field as the distribution key.To filter data based on a distribution key, select the frequently used query condition fields as the distribution key.
Each table can contain only a single distribution key. Each distribution key can contain one or more fields. Select fewer fields to make the distribution key more suitable for a variety of complex queries.
If you do not specify a distribution key when you create a table, the following rules apply:
If the table has a primary key, AnalyticDB for MySQL uses the primary key as the distribution key.
If the table does not have a primary key, AnalyticDB for MySQL adds the
__adb_auto_id__
field to the table and uses the field as the primary key and the distribution key.
Select a partition key
If a single shard contains a large amount of data after you specify a distribution key for a table, you can divide the shard into partitions to improve data access performance. When you create a table, use the PARTITION BY
clause to specify a partition key to partition the table. For more information, see CREATE TABLE.
Syntax
Partition the table by using the value of the
column_name
field.PARTITION BY VALUE(column_name)
Partition the table by using the value of the
column_name
field that is converted into the%Y%m%d
format, such as20210101
.PARTITION BY VALUE(DATE_FORMAT(column_name, '%Y%m%d'))
Partition the table by using the value of the
column_name
field that is converted into the%Y%m
format, such as202101
.PARTITION BY VALUE(DATE_FORMAT(column_name, '%Y%m'))
Partition the table by using the value of the
column_name
field that is converted into the%Y
format, such as2021
.PARTITION BY VALUE(DATE_FORMAT(column_name, '%Y'))
Usage notes
When a table contains a large amount of data, specify an appropriate partition key for the table. If the table does not contain partitions or is improperly partitioned, the performance of the AnalyticDB for MySQL cluster may degrade. For information about how to perform diagnostics on partition field reasonability, see the Diagnostics on partition field reasonability section of the Data modeling diagnostics topic.
You can partition a table by year, month, day, or the original value. To ensure the query and write performance or stability of the AnalyticDB for MySQL cluster, specify a partition key to assign an appropriate amount of data to each partition.
We recommend that you keep data at rest in partitions. If multiple existing partitions are frequently updated each day, check whether the partition key is appropriate.
You can use the
LIFECYCLE N
parameter to manage the lifecycle of a table. Partitions are sorted and onlyN
partitions are retained.ImportantEach table supports a limited number of partitions. The data in a partitioned table cannot be permanently retained. For more information, see Limits.
Select a primary key
A primary key is used as a unique identifier of each record. When you create a table, you can use the PRIMARY KEY
clause to specify a primary key. For more information, see CREATE TABLE.
Syntax
PRIMARY KEY (column_name,...)
Usage notes
Only tables that have a primary key support operations such as DELETE and UPDATE.
The primary key of an AnalyticDB for MySQL table can be a single field or a combination of multiple fields. To ensure high performance, we recommend that you select one or a small number of numeric fields as the primary key.
A primary key must include a distribution key and a partition key. We recommend that you place a distribution key and a partition key in the front section of a composite primary key.
Select a clustered index
In a clustered index, the logical order of key values determines the physical order of corresponding rows in a table. When you select a clustered index, take note of the following items:
Each table supports a single clustered index. For information about how to create a clustered index, see CREATE TABLE.
We recommend that you select a field that is always included in data queries as the clustered index. For example, you want to allow students to query only their own final exam grades in an educational administration system. In this case, you can specify the student ID as the clustered index to ensure data locality and improve data query performance.
A clustered index sorts the entire table, which results in lower data write performance and high CPU utilization. We recommend that you do not use clustered indexes.
Example
Create a table named customer
that meets the following requirements:
The table data is partitioned by using the value of the
login_time
field that is converted into the%Y%m%d
format.Only data in the last 30 partitions is retained, which indicates that the LIFECYCLE N parameter is set to LIFECYCLE 30.
The table data is distributed by using the
customer_id
field.The primary key is set to
login_time, customer_id, phone_num
.
Execute the following statement to create the customer table:
CREATE TABLE customer (
customer_id bigint NOT NULL COMMENT 'Customer ID',
customer_name varchar NOT NULL COMMENT 'Customer name',
phone_num bigint NOT NULL COMMENT 'Phone number',
city_name varchar NOT NULL COMMENT 'City',
sex int NOT NULL COMMENT 'Gender',
id_number varchar NOT NULL COMMENT 'ID card number',
home_address varchar NOT NULL COMMENT 'Home address',
office_address varchar NOT NULL COMMENT 'Office address',
age int NOT NULL COMMENT 'Age',
login_time timestamp NOT NULL COMMENT 'Logon time',
PRIMARY KEY (login_time, customer_id, phone_num)
)
DISTRIBUTED BY HASH(customer_id)
PARTITION BY VALUE(DATE_FORMAT(login_time, '%Y%m%d')) LIFECYCLE 30
COMMENT 'Customer information table';
FAQ
Q: How do I query all partitions of a table and the statistics about the partitions after I divide the table into partitions?
A: Execute the following statements to query all partitions of the table and the statistics about the partitions:
SELECT partition_id, -- The name of the partition. row_count, -- The total number of rows in the partition. local_data_size, -- The local data storage of the partition. index_size, -- The index size of the partition. pk_size, -- The size of the primary key index of the partition. remote_data_size -- The remote data storage of the partition. FROM information_schema.kepler_partitions WHERE schema_name = '$DB' AND table_name ='$TABLE' AND partition_id > 0;
ImportantYou cannot query the partitions for which compaction is not triggered in incremental data. To query all partitions in real time, execute the
SELECT DISTINCT $partition_column from $db.$table;
statement.Q: What factors affect the number of shards? Can I manually change the number of shards?
A: The number of shards is automatically calculated based on the specifications of the AnalyticDB for MySQL cluster that you specify when you create the cluster. You cannot manually change the number of shards.
Q: Is the number of shards affected when I change the configurations of an AnalyticDB for MySQL cluster?
A: No, the number of shards is not affected when you change the configurations of the cluster.
Q: Can I change the distribution key or partition key of an AnalyticDB for MySQL table?
A: No, you cannot change the distribution key or partition key of an AnalyticDB for MySQL table. For information about how to use a different distribution key or partition key, see the "Can I change the distribution key or partition key of a table?" section of the ALTER TABLE topic.