Step 3: Learn the key points in database and table design

Updated at: 2024-12-25 01:44

A well-designed table schema supports rich features and significantly improves the performance, maintainability, and scalability of a database system. Database and table schema design is therefore of vital importance. This topic describes the table properties that you must focus on when you design a table schema in ApsaraDB for SelectDB, so that you can design tables that better meet your business requirements.

Important table properties

When you store your business data in SelectDB, it is of vital importance to design important table properties based on your business requirements. This helps you create a high-performance and easy-to-maintain table schema. The following table describes the important table properties of SelectDB.

| Table property | Required | Description | References |
| --- | --- | --- | --- |
| Data model | Yes | Different data models are suitable for different business scenarios: the Unique model supports uniqueness constraints on the primary key and meets flexible, efficient data update requirements; the Duplicate model uses the append data write mode and suits high-performance analysis of detailed data; the Aggregate model supports data pre-aggregation and suits data aggregation and statistics scenarios. | Data models |
| Tablet | Yes | Tablets distribute data across different nodes in a cluster so that large amounts of data can be managed and queried by leveraging the capabilities of a distributed system. | Tablets |
| Partition | No | Partitioning divides a raw table into multiple child tables based on specified fields, such as time and region. Partitioning facilitates data management and accelerates queries. | Partitions |
| Index | No | Indexes let you quickly filter or locate data, which greatly improves query performance. | Indexes |

Data models

You can select an appropriate data model based on the functional and performance requirements of your data analysis scenarios. Different data models are suitable for different business scenarios. This section briefly describes the data models to help you understand and select a data model based on your business requirements. For more information, see Data models.

Basics

In SelectDB, data is organized and managed in the form of tables at the logical layer. Each table consists of rows and columns. A row represents a single record in the table, and a column describes a field of that record.

Columns can be divided into the following types:

  • Key column: The columns specified in the UNIQUE KEY, AGGREGATE KEY, or DUPLICATE KEY clause of the CREATE TABLE statement are key columns.

  • Value column: All the other columns are value columns.

Select a model

In SelectDB, three types of data models are available for tables: Unique, Duplicate, and Aggregate.

Important
  • The data model is determined during table creation and cannot be modified.

  • If no data model is specified during table creation, the Duplicate model is used by default and the first three columns are automatically selected as the key columns.

  • In the Unique, Duplicate, and Aggregate models, data is sorted based on key columns.

| Data model | Characteristic | Scenario | Shortcoming |
| --- | --- | --- | --- |
| Unique | The value of the key columns in each row is unique. If multiple rows have the same key value, the row written to the table later overwrites the previous one. | Suitable for scenarios that require unique primary keys or efficient updates, such as analysis of e-commerce orders and user attribute data. | You cannot pre-aggregate the data based on a synchronous materialized view to accelerate queries. |
| Duplicate | Multiple rows can have the same key value, and all of them are stored in the system at the same time. | Offers high data write and query efficiency, and suits scenarios in which all raw data records are retained, such as log analysis and bill analysis. | You cannot pre-aggregate the data based on a synchronous materialized view to accelerate queries. |
| Aggregate | The value of the key columns in each row is unique. If multiple rows have the same key value, their value columns are pre-aggregated based on the aggregation type specified during table creation. | Similar to the Cube model of traditional data warehouses; suits aggregate statistics scenarios that improve query performance through pre-aggregation, such as website traffic analysis and custom reports. | Only limited support for the COUNT(*) statement is provided, and the aggregation type of the value columns is fixed. |

Use a model

Use the Unique model

In the Unique model, if a key column has the same value in multiple rows, the row written to the table later overwrites the previous one. The Unique model provides two implementation methods: Merge on Read (MoR) and Merge on Write (MoW).

The MoW method is mature and stable and provides excellent query performance. Therefore, we recommend that you use the MoW method in the Unique model. The following example describes how to use the MoW method to implement the Unique model. For more information about the MoR method, see the MoR section of the "Data models" topic.

Usage notes

If you select the Unique model and want to use the MoW method, take note of the following items when you create a table:

  • Use the UNIQUE KEY keyword to specify a unique field as the primary key.

  • Enable MoW in the PROPERTIES section.

    "enable_unique_key_merge_on_write" = "true"
Example

The following sample code shows the SQL statement for creating the orders table. In this example, the Unique model is selected for the orders table, the order_id and order_time fields are used as the composite primary key, and the MoW method is enabled.

CREATE TABLE IF NOT EXISTS orders
(
    `order_id` LARGEINT NOT NULL COMMENT "The order ID.",
    `order_time` DATETIME NOT NULL COMMENT "The order time.",
    `customer_id` LARGEINT NOT NULL COMMENT "The user ID.",
    `total_amount` DOUBLE COMMENT "The total amount of the order.",
    `status` VARCHAR(20) COMMENT "The order status.",
    `payment_method` VARCHAR(20) COMMENT "The payment method.",
    `shipping_method` VARCHAR(20) COMMENT "The shipping method.",
    `customer_city` VARCHAR(20) COMMENT "The city in which the user resides.",
    `customer_address` VARCHAR(500) COMMENT "The address of the user."
)
UNIQUE KEY(`order_id`, `order_time`)
PARTITION BY RANGE(`order_time`) ()
DISTRIBUTED BY HASH(`order_id`)
PROPERTIES (
    "enable_unique_key_merge_on_write" = "true",
    "dynamic_partition.enable" = "true",
    "dynamic_partition.time_unit" = "DAY",
    "dynamic_partition.start" = "-7",
    "dynamic_partition.end" = "3",
    "dynamic_partition.prefix" = "p",
    "dynamic_partition.create_history_partition" = "true",
    "dynamic_partition.buckets" = "16"
);
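To see the overwrite semantics of the Unique model in action, you can write two rows with the same primary key and query the table; only the row written later survives. The following is a minimal sketch, assuming the orders table above has been created and that the order time falls within an existing partition (with the dynamic partitioning rules shown above, only recent dates have partitions):

```sql
INSERT INTO orders (order_id, order_time, customer_id, total_amount, status)
VALUES (1001, '2024-02-02 10:00:00', 1, 99.9, 'pending');

-- Same primary key (order_id, order_time): this row overwrites the previous one.
INSERT INTO orders (order_id, order_time, customer_id, total_amount, status)
VALUES (1001, '2024-02-02 10:00:00', 1, 99.9, 'paid');

-- Returns a single row whose status is 'paid'.
SELECT order_id, status FROM orders WHERE order_id = 1001;
```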

Use the Duplicate model

In the Duplicate model, multiple rows with the same value for a key column are stored in the system at the same time. This model does not support pre-aggregation or require unique primary keys.

For example, you can use this model to record and analyze log data generated by a business system and sort the data by log time, log type, and error code. The following sample code shows the SQL statement for creating the log table. In this example, the Duplicate model is selected for the log table, and data is sorted by the log_time, log_type, and error_code fields.

CREATE TABLE IF NOT EXISTS log
(
    `log_time` DATETIME NOT NULL COMMENT "The time when the log was generated.",
    `log_type` INT NOT NULL COMMENT "The type of the log.",
    `error_code` INT COMMENT "The error code.",
    `error_msg` VARCHAR(1024) COMMENT "The error message.",
    `op_id` BIGINT COMMENT "The owner ID.",
    `op_time` DATETIME COMMENT "The time when the error was handled."
)
DUPLICATE KEY(`log_time`, `log_type`, `error_code`)
PARTITION BY RANGE(`log_time`) ()
DISTRIBUTED BY HASH(`log_type`)
PROPERTIES (
    "dynamic_partition.enable" = "true",
    "dynamic_partition.time_unit" = "DAY",
    "dynamic_partition.start" = "-7",
    "dynamic_partition.end" = "3",
    "dynamic_partition.prefix" = "p",
    "dynamic_partition.create_history_partition" = "true",
    "dynamic_partition.buckets" = "16"
);
Use the Aggregate model

Usage notes

In the Aggregate model, if multiple rows have the same value for a key column, the value columns in the rows are pre-aggregated based on the aggregation type specified during table creation. You must take note of the following items when you create a table of the Aggregate model:

  • Use the AGGREGATE KEY keyword to specify one or more key columns. The rows with the same value for the key columns are aggregated.

  • Specify an aggregation type for the value columns. The following table describes the aggregation types.

    | Aggregation type | Description |
    | --- | --- |
    | SUM | Calculates the sum of the values across multiple rows. Applicable to numeric values. |
    | MIN | Retains the minimum value. Applicable to numeric values. |
    | MAX | Retains the maximum value. Applicable to numeric values. |
    | REPLACE | Replaces the previous value with the newly imported value. For rows that contain the same data in the key columns, the value columns are replaced with the newly imported values in import order. |
    | REPLACE_IF_NOT_NULL | Replaces the previous value with the newly imported value unless the imported value is null. Unlike REPLACE, this type never overwrites an existing value with a null value. When you use this type, specify NULL rather than an empty string as the default value for the column, because an empty string is treated as a regular value and will overwrite existing data. |
    | HLL_UNION | Aggregates columns of the HyperLogLog (HLL) type by using the HLL algorithm. |
    | BITMAP_UNION | Aggregates columns of the BITMAP type by computing the union of the bitmaps. |

Example

For example, you can use the Aggregate model to perform statistical analysis on user behavior and record the following information: last visit time, total consumption, maximum dwell time, and minimum dwell time. The following sample code shows the SQL statement for creating the user_behavior table. In this example, the value columns are pre-aggregated if the following key columns have the same value in multiple rows: user_id, date, city, age, and sex. The data is aggregated based on the following rules:

  • Last visit time of the user: Use the maximum value of the last_visit_date field.

  • Total consumption: Calculate the sum of multiple data records.

  • Maximum dwell time: Use the maximum value of the max_dwell_time field.

  • Minimum dwell time: Use the minimum value of the min_dwell_time field.

CREATE TABLE IF NOT EXISTS user_behavior
(
    `user_id` LARGEINT NOT NULL COMMENT "The user ID.",
    `date` DATE NOT NULL COMMENT "The date on which data is written to the table.",
    `city` VARCHAR(20) COMMENT "The city in which the user resides.",
    `age` SMALLINT COMMENT "The age of the user.",
    `sex` TINYINT COMMENT "The gender of the user.",
    `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "The last time when the user paid a visit.",
    `cost` BIGINT SUM DEFAULT "0" COMMENT "The amount of money that the user spends.",
    `max_dwell_time` INT MAX DEFAULT "0" COMMENT "The maximum dwell time of the user.",
    `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "The minimum dwell time of the user."
)
AGGREGATE KEY(`user_id`, `date`, `city`, `age`, `sex`)
PARTITION BY RANGE(`date`) ()
DISTRIBUTED BY HASH(`user_id`)
PROPERTIES (
    "dynamic_partition.enable" = "true",
    "dynamic_partition.time_unit" = "DAY",
    "dynamic_partition.start" = "-7",
    "dynamic_partition.end" = "3",
    "dynamic_partition.prefix" = "p",
    "dynamic_partition.create_history_partition" = "true",
    "dynamic_partition.buckets" = "16"
);
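To see pre-aggregation in action, you can write two rows with the same key values; the Aggregate model merges them according to the aggregation types above. The following is a minimal sketch, assuming the user_behavior table above has been created and that the date falls within an existing partition:

```sql
INSERT INTO user_behavior VALUES
(10000, '2024-02-02', 'Beijing', 20, 0, '2024-02-02 06:00:00', 20, 10, 10),
(10000, '2024-02-02', 'Beijing', 20, 0, '2024-02-02 07:00:00', 15, 2, 2);

-- The two rows share the same key columns, so they are merged into one:
-- cost is summed (35), max_dwell_time takes the maximum (10),
-- min_dwell_time takes the minimum (2), and last_visit_date keeps
-- the most recently imported value ('2024-02-02 07:00:00').
SELECT * FROM user_behavior WHERE user_id = 10000;
```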

Data division

SelectDB supports two layers of data division. A table is logically divided into partitions in the first layer. A partition is the smallest unit for data management. A table is physically divided into tablets in the second layer. A tablet is the smallest unit for data operations, such as data distribution and data migration.

Association between partitions and tablets
  • A tablet belongs to only one partition, whereas a partition contains multiple tablets.

  • If partitioning is enabled when you create a table, the table is divided into partitions based on the partitioning rules and then into tablets based on the tableting rules. If partitioning is not enabled, the table is directly divided into tablets based on the tableting rules.

  • During data writing, data is first written to a partition, and then written to different tablets within the partition based on the tableting rules. Tableting is further division of partitioned data to distribute data more evenly and improve query efficiency.

Partitions

In the storage engine of SelectDB, partitioning is a data organization method that divides data in a table into multiple independent parts based on custom rules. Partitioning implements logical division of data. This improves query efficiency and makes data management more flexible and convenient. This section briefly describes the partitions to help you understand and select a partitioning mode based on your business requirements. For more information, see the Partitioning section of the "Partitioning and bucketing" topic and the Dynamic partitioning topic.

Select a partitioning mode

SelectDB supports two partitioning modes: range partitioning and list partitioning. ApsaraDB for SelectDB also provides the easy-to-use dynamic partitioning feature to implement automated partition management. The partitioning modes are suitable for different business scenarios.

| Partitioning mode | Supported column data types | Method used to specify partition information | Scenario |
| --- | --- | --- | --- |
| Range | DATE, DATETIME, TINYINT, SMALLINT, INT, BIGINT, and LARGEINT | (1) VALUES [...): creates a partition whose range is left-closed and right-open. (2) VALUES LESS THAN (...): creates a partition with only an upper limit; the lower limit is the upper limit of the previous partition. (3) BATCH RANGE: creates multiple left-closed, right-open partitions of the numeric and time types with a preset step. (4) MULTI RANGE: creates multiple partitions whose ranges are left-closed and right-open. | Range partitioning is suitable for managing data by value range. A typical scenario is time-based partitioning. |
| List | BOOLEAN, TINYINT, SMALLINT, INT, BIGINT, LARGEINT, DATE, DATETIME, CHAR, and VARCHAR | VALUES IN (...): specifies the enumeration values contained in each partition. | List partitioning is suitable for data management based on existing categories or fixed characteristics; partition key columns usually contain enumeration values. For example, you can divide and manage data based on the regions to which users belong. |
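As an illustration of the VALUES LESS THAN method described above, each partition takes the upper limit of the previous partition as its lower limit. The following sketch uses illustrative table and partition names; the other clauses follow the examples in this topic:

```sql
CREATE TABLE IF NOT EXISTS log_by_month
(
    `log_time` DATETIME NOT NULL COMMENT "The time when the log was generated.",
    `log_type` INT NOT NULL COMMENT "The type of the log."
)
DUPLICATE KEY(`log_time`, `log_type`)
PARTITION BY RANGE(`log_time`)
(
    PARTITION `p202401` VALUES LESS THAN ("2024-02-01"),  -- everything before 2024-02-01
    PARTITION `p202402` VALUES LESS THAN ("2024-03-01"),  -- [2024-02-01, 2024-03-01)
    PARTITION `p202403` VALUES LESS THAN ("2024-04-01")   -- [2024-03-01, 2024-04-01)
)
DISTRIBUTED BY HASH(`log_type`) BUCKETS 16
PROPERTIES ();
```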

Usage notes

  • In SelectDB, tables are classified into partitioned tables and non-partitioned tables. You can determine whether to enable partitioning when you create a table. This property is optional and cannot be modified once determined. Specifically, you can create or delete partitions for a partitioned table. You cannot create or delete partitions for a non-partitioned table.

  • You can specify one or more columns as partition key columns. Partition key columns must be key columns.

  • You must enclose partition key values in double quotation marks (") regardless of the data type of the partition key column.

  • Theoretically, the number of partitions is not limited.

  • When you create partitions, make sure that the ranges of the partitions do not overlap.

Use partitioning

Use range partitioning

Range partitioning is the most commonly used partitioning method to manage data based on the range of fields. In a typical scenario, a large amount of time series data is partitioned by time to facilitate management and optimize queries.

The ultimate goal of partitioning and tableting is to divide data reasonably. Follow these guidelines when you specify partitioning rules:

  • The amount of data in each tablet should be in the range of 1 GB to 10 GB.

  • The partition granularity should be determined based on the amount of data that you want to manage. For example, if you want to delete historical log data by day, a partition granularity of one day is appropriate.

The following sample code shows how to create a partitioned table whose data is queried by time range and from which historical log data is deleted by time. In this example, the log_time field is used as the partition key column.

CREATE TABLE IF NOT EXISTS log
(
    `log_time` DATETIME NOT NULL COMMENT "The time when the log was generated.",
    `log_type` INT NOT NULL COMMENT "The type of the log.",
    `error_code` INT COMMENT "The error code.",
    `error_msg` VARCHAR(1024) COMMENT "The error message.",
    `op_id` BIGINT COMMENT "The owner ID.",
    `op_time` DATETIME COMMENT "The time when the error was handled."
)
DUPLICATE KEY(`log_time`, `log_type`, `error_code`)
PARTITION BY RANGE(`log_time`)
(
    PARTITION `p20240201` VALUES [("2024-02-01"), ("2024-02-02")),
    PARTITION `p20240202` VALUES [("2024-02-02"), ("2024-02-03")),
    PARTITION `p20240203` VALUES [("2024-02-03"), ("2024-02-04"))
)
DISTRIBUTED BY HASH(`log_type`)
PROPERTIES ();

After the table is created, you can execute the following SQL statement to view the partition information of the table:

SHOW PARTITIONS FROM log;
-- p20240201: [("2024-02-01"), ("2024-02-02"))
-- p20240202: [("2024-02-02"), ("2024-02-03"))
-- p20240203: [("2024-02-03"), ("2024-02-04"))

When you execute the following statement to query data, the partition p20240202: [("2024-02-02"), ("2024-02-03")) is hit. The system does not scan the data in the other two partitions. This accelerates data queries.

SELECT * FROM log WHERE log_time = '2024-02-02';

Use list partitioning

List partitioning divides and manages data based on the enumeration values of partition key columns. When you query data from a list partitioned table, you can prune partitions based on filter conditions to improve query performance.

You can select list partition key columns based on the fields that are commonly used to manage your business data. Data must be evenly distributed to the partitions to prevent serious data skew.

For example, a large amount of order data exists in e-commerce scenarios. In some scenarios, you want to query and analyze the order data based on the cities to which users belong. To ease data management and query, you can specify the customer_city field as a partition key column. In this example, the order data is distributed in the following cities:

  • Beijing, Shanghai, and Hong Kong: 6 GB

  • New York and San Francisco: 5 GB

  • Tokyo: 5 GB

The following sample code shows how to create a list partitioned table for the order data:

CREATE TABLE IF NOT EXISTS orders
(
    `order_id` LARGEINT NOT NULL COMMENT "The order ID.",
    `order_time` DATETIME NOT NULL COMMENT "The order time.",
    `customer_city` VARCHAR(20) COMMENT "The city in which the user resides.",
    `customer_id` LARGEINT NOT NULL COMMENT "The user ID.",
    `total_amount` DOUBLE COMMENT "The total amount of the order.",
    `status` VARCHAR(20) COMMENT "The order status.",
    `payment_method` VARCHAR(20) COMMENT "The payment method.",
    `shipping_method` VARCHAR(20) COMMENT "The shipping method.",
    `customer_address` VARCHAR(500) COMMENT "The address of the user."
)
UNIQUE KEY(`order_id`, `order_time`, `customer_city`)
PARTITION BY LIST(`customer_city`)
(
    PARTITION `p_cn` VALUES IN ("Beijing", "Shanghai", "Hong Kong"),
    PARTITION `p_usa` VALUES IN ("New York", "San Francisco"),
    PARTITION `p_jp` VALUES IN ("Tokyo")
)
DISTRIBUTED BY HASH(`order_id`) BUCKETS 16
PROPERTIES (
    "enable_unique_key_merge_on_write" = "true"
);

After the table is created, you can execute the following SQL statement to view the partition information of the table. Three partitions are automatically generated for the table:

SHOW PARTITIONS FROM orders;
-- p_cn: ("Beijing", "Shanghai", "Hong Kong")
-- p_usa: ("New York", "San Francisco")
-- p_jp: ("Tokyo")

When you execute the following statement to query data, the partition p_jp: ("Tokyo") is hit. The system does not scan the data in the other two partitions. This accelerates data queries.

SELECT * FROM orders WHERE customer_city = 'Tokyo';

Use dynamic partitioning

In a production environment, a data table may have a large number of partitions, which makes manual partition management a heavy workload and brings additional maintenance costs for the database administrator. SelectDB allows you to configure dynamic partitioning rules during table creation to automate partition management.

For example, in e-commerce scenarios, you often need to query data in order information tables by time range and dump historical orders for archiving. You can specify the order_time field as a partition key column and enable dynamic partitioning in the PROPERTIES section. The following sample code shows how to create a dynamically partitioned table. In this example, the dynamic_partition.time_unit, dynamic_partition.start, and dynamic_partition.end parameters are specified in the PROPERTIES section to partition data by day, retain only the partitions of the last 180 days, and create partitions for the next three days in advance.

Important

The empty parentheses () at the end of the PARTITION BY RANGE(`order_time`) () clause are not a syntax error. They are required if you want to use dynamic partitioning.

CREATE TABLE IF NOT EXISTS orders
(
    `order_id` LARGEINT NOT NULL COMMENT "The order ID.",
    `order_time` DATETIME NOT NULL COMMENT "The order time.",
    `customer_id` LARGEINT NOT NULL COMMENT "The user ID.",
    `total_amount` DOUBLE COMMENT "The total amount of the order.",
    `status` VARCHAR(20) COMMENT "The order status.",
    `payment_method` VARCHAR(20) COMMENT "The payment method.",
    `shipping_method` VARCHAR(20) COMMENT "The shipping method.",
    `customer_city` VARCHAR(20) COMMENT "The city in which the user resides.",
    `customer_address` VARCHAR(500) COMMENT "The address of the user."
)
UNIQUE KEY(`order_id`, `order_time`)
PARTITION BY RANGE(`order_time`) ()
DISTRIBUTED BY HASH(`order_id`)
PROPERTIES (
    "enable_unique_key_merge_on_write" = "true",
    "dynamic_partition.enable" = "true",
    "dynamic_partition.time_unit" = "DAY",
    "dynamic_partition.start" = "-180",
    "dynamic_partition.end" = "3",
    "dynamic_partition.prefix" = "p",
    "dynamic_partition.create_history_partition" = "true",
    "dynamic_partition.buckets" = "16"
);

If a table is expected to have a large number of partitions, we strongly recommend that you learn dynamic partitioning. For more information, see Dynamic partitioning.

Tablets

The storage engine of SelectDB divides data into different tablets based on the hash values of specified columns. Tablets are managed by different nodes in a cluster, which leverages the capabilities of a distributed system to manage and query large amounts of data. When you create a table, use the DISTRIBUTED BY HASH(`<tablet key column>`) BUCKETS <number of tablets> clause to configure the tablets. For more information, see the Bucketing section of the "Partitioning and bucketing" topic.

Usage notes

  • If partitioning is enabled, the DISTRIBUTED... clause describes the rules for dividing data in each partition. If partitioning is not enabled, this clause describes the rules for dividing the full data of a table.

  • You can specify multiple columns as tablet key columns.

    In the Aggregate and Unique models, the tablet key columns must be key columns. In the Duplicate model, the tablet key columns can be key or value columns.

    We recommend that you select columns with high cardinality as tablet key columns to distribute data and prevent data skew.

  • Theoretically, the number of tablets is not limited.

    Theoretically, the amount of data that can be stored in a single tablet is not limited. However, we recommend that you store 1 GB to 10 GB of data in a tablet.

    If each tablet contains a small amount of data, the number of tablets is large, which increases the metadata management overhead.

    If each tablet contains a large amount of data, replica migration is adversely affected and the capabilities of the distributed cluster cannot be fully leveraged. This also increases the cost of retrying failed operations, such as schema changes or index creation, because these operations are performed per tablet.

Select a tablet key column

The tablet key columns that you select when you design a table have an important impact on query performance and query concurrency. The following table describes the rules for selecting tablet key columns. If your business has multiple query requirements, multiple sets of tablet key columns may seem appropriate. In this case, preferentially select the tablet key columns based on your major query requirements.

| Rule | Benefit |
| --- | --- |
| Select columns with high cardinality, or a combination of multiple columns, to preferentially guarantee even data distribution. | Data is distributed more evenly across cluster nodes, so the resources of the distributed system can be fully leveraged. This improves performance for queries that have poor filtering effects and must scan a large amount of data. |
| Select columns that are frequently used in filter conditions to balance data pruning and query acceleration. | Rows with the same tablet key value are stored together. Data pruning is accelerated and query concurrency is improved for point queries that filter on the tablet key columns. |

Note

Point queries accurately locate and retrieve a small amount of data that meets specific conditions, such as filtering by the primary key or by a high-cardinality column.

Example

In an e-commerce scenario, you need to query data by order in most cases. Sometimes, you need to perform statistical analysis based on full order data. In this case, you can select the high-cardinality column order_id from the key columns of the order information table as a tablet key column to ensure that data is evenly distributed to multiple tablets. Data in the order_id column is aggregated to offer the performance required by the preceding queries. The following sample code shows how to create a table for such a scenario:

CREATE TABLE IF NOT EXISTS orders
(
    `order_id` LARGEINT NOT NULL COMMENT "The order ID.",
    `order_time` DATETIME NOT NULL COMMENT "The order time.",
    `customer_id` LARGEINT NOT NULL COMMENT "The user ID.",
    `total_amount` DOUBLE COMMENT "The total amount of the order.",
    `status` VARCHAR(20) COMMENT "The order status.",
    `payment_method` VARCHAR(20) COMMENT "The payment method.",
    `shipping_method` VARCHAR(20) COMMENT "The shipping method.",
    `customer_city` VARCHAR(20) COMMENT "The city in which the user resides.",
    `customer_address` VARCHAR(500) COMMENT "The address of the user."
)
UNIQUE KEY(`order_id`, `order_time`)
PARTITION BY RANGE(`order_time`) ()
DISTRIBUTED BY HASH(`order_id`)
PROPERTIES (
    "enable_unique_key_merge_on_write" = "true",
    "dynamic_partition.enable" = "true",
    "dynamic_partition.time_unit" = "DAY",
    "dynamic_partition.start" = "-7",
    "dynamic_partition.end" = "3",
    "dynamic_partition.prefix" = "p",
    "dynamic_partition.create_history_partition" = "true",
    "dynamic_partition.buckets" = "16"
);

Indexes

Indexes are critical in database design, and proper indexes can greatly improve query performance. However, indexes occupy additional storage space and may reduce write performance. This section briefly describes the commonly used indexes to help you understand and select indexes based on your business requirements. For more information, see Index-based acceleration.

Rules for creating indexes

  • In most cases, a prefix index is automatically created on the key columns and provides the optimal filtering effect. However, a table can have only one prefix index. We recommend that you select the key that is most frequently used as a filter condition.

  • Inverted indexes are preferred for other accelerated filtering requirements. Inverted indexes have a wide application scope and support combinations of multiple columns as filter conditions. For string matching, you can use lightweight Bloom filter indexes to accelerate equivalence queries and NGram Bloom filter indexes to accelerate LIKE queries.
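As a hedged sketch of how these two index types are typically declared (the syntax below follows the open source Apache Doris conventions that SelectDB is based on; verify the exact property names against the SelectDB documentation before use), a Bloom filter index is usually declared through the bloom_filter_columns table property, and an NGram Bloom filter index through an INDEX clause:

```sql
CREATE TABLE IF NOT EXISTS user_events
(
    `user_id` BIGINT NOT NULL COMMENT "The user ID.",
    `city`    VARCHAR(20) COMMENT "The city in which the user resides.",
    `remark`  VARCHAR(200) COMMENT "A free-text remark.",
    -- NGram Bloom filter index to accelerate LIKE queries on `remark`.
    -- The gram_size and bf_size values here are illustrative assumptions.
    INDEX idx_remark (`remark`) USING NGRAM_BF PROPERTIES("gram_size" = "3", "bf_size" = "256")
)
DUPLICATE KEY(`user_id`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 16
PROPERTIES (
    -- Bloom filter index to accelerate equivalence queries on `city`.
    "bloom_filter_columns" = "city"
);
```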

Select an index

In SelectDB, tables can use built-in indexes or custom indexes. Built-in indexes are automatically created by the system. You can create custom indexes when or after you create a table based on your business requirements.

| Method | Index type | Query types supported | Query types not supported | Advantage | Disadvantage |
| --- | --- | --- | --- | --- | --- |
| Built-in | Prefix index | Equivalence and non-equivalence queries; range queries | LIKE queries; match queries by keyword or phrase | Prefix indexes occupy a relatively small amount of storage space and can be fully cached in memory. This allows the system to quickly locate data blocks and significantly improves query efficiency. | A table can have only one prefix index. |
| Custom | Inverted index (recommended) | Equivalence, non-equivalence, and range queries on strings, numbers, and date and time values; string match queries by keyword or phrase; full-text search | N/A | Rich query types are supported. You can create indexes when or after you create a table, and you can also delete indexes. | These indexes occupy a large amount of storage space. |
| Custom | Bloom filter index | Equivalence queries | Non-equivalence queries; range queries; LIKE queries; match queries by keyword or phrase | These indexes consume few computing and storage resources. | Only equivalence queries are supported. |
| Custom | NGram Bloom filter index | LIKE queries | Equivalence and non-equivalence queries; range queries; match queries by keyword or phrase | These indexes accelerate LIKE queries and consume few computing and storage resources. | Only LIKE queries can be accelerated. |

Use indexes

Use an inverted index

SelectDB supports inverted indexes. You can use inverted indexes to perform full-text searches on data of the TEXT type and equivalent or range queries on data of ordinary fields. This way, you can quickly retrieve data that meets specific conditions from a large amount of data. This section describes how to create an inverted index. For more information, see Inverted indexes.

Create an inverted index when you create a table

In an e-commerce scenario, you need to frequently query order information based on keywords such as the user ID and user address. In this case, you can create an inverted index on the customer_id and customer_address fields to accelerate the queries. The following sample code shows how to create a table for such a scenario:

CREATE TABLE IF NOT EXISTS orders
(
    `order_id` LARGEINT NOT NULL COMMENT "The order ID.",
    `order_time` DATETIME NOT NULL COMMENT "The order time.",
    `customer_id` LARGEINT NOT NULL COMMENT "The user ID.",
    `total_amount` DOUBLE COMMENT "The total amount of the order.",
    `status` VARCHAR(20) COMMENT "The order status.",
    `payment_method` VARCHAR(20) COMMENT "The payment method.",
    `shipping_method` VARCHAR(20) COMMENT "The shipping method.",
    `customer_city` VARCHAR(20) COMMENT "The city in which the user resides.",
    `customer_address` VARCHAR(500) COMMENT "The address of the user.",
    INDEX idx_customer_id (`customer_id`) USING INVERTED,
    INDEX idx_customer_address (`customer_address`) USING INVERTED PROPERTIES("parser" = "chinese")
)
UNIQUE KEY(`order_id`, `order_time`)
PARTITION BY RANGE(`order_time`) ()
DISTRIBUTED BY HASH(`order_id`)
PROPERTIES (
    "enable_unique_key_merge_on_write" = "true",
    "dynamic_partition.enable" = "true",
    "dynamic_partition.time_unit" = "DAY",
    "dynamic_partition.start" = "-7",
    "dynamic_partition.end" = "3",
    "dynamic_partition.prefix" = "p",
    "dynamic_partition.create_history_partition" = "true",
    "dynamic_partition.buckets" = "16"
);
Create an inverted index on a column in an existing table

In an e-commerce scenario, you need to frequently query order information based on the user ID. However, no inverted index is created on the customer_id field when you create the orders table. In this case, you can execute the following statement to add an index for the table:

ALTER TABLE orders ADD INDEX idx_customer_id (`customer_id`) USING INVERTED;
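After the inverted indexes exist, queries such as the following can use them. This is a sketch: the filter values are hypothetical, and MATCH_ANY is the keyword-match predicate for inverted indexes in SelectDB (Apache Doris-compatible) syntax:

```sql
-- Equivalent query on customer_id can use the idx_customer_id inverted index.
SELECT * FROM orders WHERE customer_id = 1829239;

-- Keyword match on the parsed customer_address field can use idx_customer_address.
SELECT * FROM orders WHERE customer_address MATCH_ANY 'Hangzhou';
```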

Use a prefix index
A prefix index is built on one or more key columns of the underlying data, which is sorted by those key columns. Essentially, a prefix index enables binary search over the sorted data. Prefix indexes are built-in indexes that SelectDB automatically creates when a table is created.

No dedicated syntax is available to define a prefix index. The system selects the first one or more key column fields of a table as the prefix index. The total length of the prefix index cannot exceed 36 bytes. In addition, a VARCHAR-type field truncates the prefix index: the key column fields after the first VARCHAR-type field are not included in the prefix index.

The order of fields in a table is particularly important because it determines which fields are used in the prefix index. We strongly recommend that you determine the order of key column fields based on the following rules:

  • The key column fields with high cardinality that are frequently used as filter conditions are placed before other fields. For example, in the Use the Duplicate model section, the log_time field is placed before the error_code field.

  • The key column fields that are used as equivalent filter conditions are placed before the key column fields that are used as range filter conditions. For example, in the Use an inverted index section, the order_time field for range filtering is placed after the order_id field.

  • Ordinary-type fields are placed before VARCHAR-type fields. For example, the key column fields of the INT type are placed before the key column fields of the VARCHAR type.
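The three rules above can be sketched in a single DDL statement. The table and field names are hypothetical; only the order of the key columns matters here:

```sql
-- Key column order follows the rules above:
-- 1. tenant_id: high-cardinality, frequent equivalent filter -> placed first.
-- 2. log_time: range filter -> placed after the equivalent-filter column.
-- 3. module: VARCHAR -> placed last, because key column fields after the
--    first VARCHAR-type field are excluded from the prefix index.
CREATE TABLE IF NOT EXISTS app_logs
(
    `tenant_id` INT NOT NULL COMMENT "The tenant ID.",
    `log_time` DATETIME NOT NULL COMMENT "The log time.",
    `module` VARCHAR(20) COMMENT "The module name.",
    `message` STRING COMMENT "The log message."
)
DUPLICATE KEY(`tenant_id`, `log_time`, `module`)
DISTRIBUTED BY HASH(`tenant_id`) BUCKETS 16;
```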

Examples

In the Use an inverted index section, the order information table uses the prefix index order_id+order_time. If the query condition contains the order_id field, or both the order_id and order_time fields, the query speed is greatly improved. Therefore, Example 1 runs faster than Example 2, whose condition contains only the order_time field and cannot use the prefix index.

Example 1

SELECT * FROM orders WHERE order_id = 1829239 and order_time = '2024-02-01';

Example 2

SELECT * FROM orders WHERE order_time = '2024-02-01';

What to do next

After you complete the first three steps of this tutorial, you have a basic understanding of SelectDB and can design database tables that meet your business requirements. Next, you can learn the detailed operations that you can perform, such as migrating data, querying data from external data sources, and updating the kernel version. For more information, see What to do next.
