This topic describes the principles, features, and types of global secondary indexes (GSIs).
Overview
GSIs are an important feature of PolarDB-X. Compared with local secondary indexes, the data of GSIs are distributed to data nodes based on specified partitioning rules. GSIs allow you to query data in methods that are different from the primary key and enforce unique constraints across the entire dataset.
Principles and features
In a partitioned table of a distributed database, data is organized and stored based on the partition key that is specified when the table is created. Therefore, queries based on the partition key can be quickly routed to the relevant partition. Queries that do not use the partition key require a full partition scan. In a distributed database, a full partition scan may increase the number of slow queries, reduce the system throughput, and cause the system to lose the linear scaling capability. To maintain optimal performance, we recommend that you implement measures to prevent all partitions from being scanned.
A query that includes a partition key is directed to a specific partition. In a system that has N data nodes, the average load of a query on a single node is 1/N of the query load. If the query does not include the partition key, a full scan across all partitions is triggered. The average load of a single query on each data node becomes 1. The upper limit of the performance of a single data node determines the upper limit of the performance of the distributed database. As a result, the system loses the linear scaling capability.
PolarDB-X provides the GSI feature to address the preceding issue. In PolarDB-X, you can use a GSI as a special partitioned table, which duplicates the data of specific columns of the primary table. Similar to a regular partitioned table, a GSI is horizontally split into several partitions based on user-defined partitioning rules and the partitions are distributed to different data nodes. If a query does not contain the partition key of the primary table but contains the partition key of the GSI, PolarDB-X first retrieves data from the relevant partition of the GSI and then accesses the primary table to obtain additional relevant information. This prevents a full scan across all partitions.
PolarDB-X uses distributed transactions to maintain strong data consistency between the primary table and GSI.
GSIs also support the following features:
You can modify, create, and delete GSIs without the need to lock tables.
You can specify custom covering columns to reduce the overhead of lookup operations on the primary table.
Invisible indexes are supported.
Types
GSIs
A GSI can provide a different partitioning method from the primary table. If the query conditions of an SQL statement do not contain the partition key of the primary table but contain the partition key of the GSI, the query does not scan all partitions.
For example, if you want to query the user_tbl table based on the user_id and name columns, you can create a GSI named g_i_name. This prevents all partitions from being scanned when you query the table based on the name column.
CREATE TABLE user(
user_id bigint,
name varchar(10),
addr varchar(30),
GLOBAL INDEX `g_i_name` (name) PARTITION BY HASH(name),
PRIMARY KEY(user_id)
) PARTITION BY KEY(user_id);
UGSIs
A unique global secondary index (UGSI) is a special GSI that has the properties of a regular GSI and implements global UNIQUE constraints.
For example, if you want the mobile number of a user to be globally unique in the user2 table, you can create a UGSI on the phone field as the index key.
CREATE TABLE user2(
user_id bigint,
phone varchar(20),
addr varchar(30),
UNIQUE GLOBAL INDEX `g_i_phone`(phone) PARTITION BY HASH(phone),
PRIMARY KEY(user_id)
) PARTITION BY KEY(user_id);
Clustered GSIs
A clustered global secondary index (clustered GSI) is a special GSI. By default, a clustered GSI contains the data of all columns of the primary table. The disk space occupied by a clustered GSI is equal to the disk space occupied by the primary table. You can use clustered GSIs to prevent all partitions from being scanned and the overhead of lookup operations.
For example, if you want to query the order_tbl table based on the user_id or order_id column, and you want to prevent lookup operations on the table when you query orders based on the user_id column, you can create a clustered GSI named cg_i_user on the user_id column as the index key. When you query orders based on the user_id column, PolarDB-X routes the query to a specific partition of cg_i_user. Because cg_i_user contains all data of the primary table, PolarDB-X does not need to look up the primary table.
CREATE TABLE order_tbl(
order_id bigint,
user_id bigint,
addr varchar(30),
info text,
create_time datetime,
CLUSTERED INDEX `cg_i_user`(user_id) PARTITION BY HASH(user_id),
PRIMARY KEY(order_id)
) PARTITION BY KEY(order_id);
Performance
The impact of GSIs on the read and write performance varies based on business scenarios. In most cases, GSIs degrade the write performance and significantly improve the read performance. In the following examples, the Sysbench scenario is used to show the impact of GSIs on the read and write throughput.
Read performance
Table | Threads | Sysbench seIect_random_ranges scenario | Sysbench seIect_random_points scenario | ||||
QPS | Average latency | 95% latency | QPS | Average latency | 95% latency | ||
Partitioned table | 128 | 2769.17 | 46.21 | 99.33 | 5226.99 | 24.48 | 42.61 |
256 | 3415.64 | 144.97 | 144.97 | 5476.76 | 46.73 | 82.96 | |
512 | 3272.46 | 156.31 | 257.95 | 5290.67 | 96.72 | 179.94 | |
1024 | 2453.16 | 416.12 | 539.71 | 5165.31 | 198.07 | 404.61 | |
Partitioned table + GSI | 128 | 9662.11 | 13.24 | 25.28 | 22584.89 | 5.66 | 9.73 |
256 | 10431.73 | 24.52 | 51.02 | 25558.26 | 10.01 | 17.95 | |
512 | 15634.51 | 32.72 | 73.13 | 27116.56 | 18.86 | 39.65 | |
1024 | 229448.76 | 44.53 | 108.68 | 32509.87 | 31.43 | 73.13 |
After a GSI is created:
In range queries, the QPS is increased from 3415.64 to 22948.76, which indicates that the performance of range queries is improved by 571%.
In point queries, the QPS is increased from 5476 to 32509.87, which indicates that the performance of point queries is improved by 493%.
Conclusion: GSIs can improve the query performance of Sysbench on the index column.
Data write performance
Table | Threads | Sysbench seIect_random_ranges scenario | Sysbench seIect_random_points scenario | ||||
QPS | Average latency | 95% latency | QPS | Average latency | 95% latency | ||
Partitioned table | 128 | 86548.12 | 8.87 | 10.27 | 113655.28 | 22.52 | 26.2 |
256 | 115774.71 | 13.26 | 19.29 | 149677.52 | 34.19 | 44.17 | |
512 | 143928.94 | 20.51 | 34.95 | 14555.16 | 70.28 | 112.67 | |
1024 | 153501.7 | 39.53 | 70.55 | 132150.69 | 131.58 | 287.38 | |
Partitioned table + GSI | 128 | 52069.22 | 14.25 | 18.28 | 90074.59 | 28.41 | 33.72 |
256 | 66250.79 | 23.17 | 32.53 | 114420.32 | 44.73 | 57.87 | |
512 | 75700.74 | 39.1 | 59.99 | 111093.61 | 92.09 | 142.39 | |
1024 | 76557.94 | 80.14 | 134.9 | 101828.32 | 182.51 | 350.33 |
After a GSI is created:
In write-only operations, the QPS is decreased from 153501.7 to 76557.94, which indicates that the write QPS is reduced by 50%.
In read and write operations, the QPS is decreased from 149677.52 to 114420.32, which indicates that the write and read QPS is reduced by 23%.
Conclusion: If a GSI is created, the write performance of Sysbench is degraded.