GSIs

This topic describes the principles, features, and types of global secondary indexes (GSIs).

Overview

GSIs are an important feature of PolarDB-X. Compared with local secondary indexes, the data of GSIs are distributed to data nodes based on specified partitioning rules. GSIs allow you to query data in methods that are different from the primary key and enforce unique constraints across the entire dataset.

Principles and features

In a partitioned table of a distributed database, data is organized and stored based on the partition key that is specified when the table is created. Therefore, queries based on the partition key can be quickly routed to the relevant partition. Queries that do not use the partition key require a full partition scan. In a distributed database, a full partition scan may increase the number of slow queries, reduce the system throughput, and cause the system to lose the linear scaling capability. To maintain optimal performance, we recommend that you implement measures to prevent all partitions from being scanned.

A query that includes a partition key is directed to a specific partition. In a system that has N data nodes, the average load of a query on a single node is 1/N of the query load. If the query does not include the partition key, a full scan across all partitions is triggered. The average load of a single query on each data node becomes 1. The upper limit of the performance of a single data node determines the upper limit of the performance of the distributed database. As a result, the system loses the linear scaling capability.

PolarDB-X provides the GSI feature to address the preceding issue. In PolarDB-X, you can use a GSI as a special partitioned table, which duplicates the data of specific columns of the primary table. Similar to a regular partitioned table, a GSI is horizontally split into several partitions based on user-defined partitioning rules and the partitions are distributed to different data nodes. If a query does not contain the partition key of the primary table but contains the partition key of the GSI, PolarDB-X first retrieves data from the relevant partition of the GSI and then accesses the primary table to obtain additional relevant information. This prevents a full scan across all partitions.

PolarDB-X uses distributed transactions to maintain strong data consistency between the primary table and GSI.

GSIs also support the following features:

You can modify, create, and delete GSIs without the need to lock tables.
You can specify custom covering columns to reduce the overhead of lookup operations on the primary table.
Invisible indexes are supported.

Types

A GSI can provide a different partitioning method from the primary table. If the query conditions of an SQL statement do not contain the partition key of the primary table but contain the partition key of the GSI, the query does not scan all partitions.

For example, if you want to query the user_tbl table based on the user_id and name columns, you can create a GSI named g_i_name. This prevents all partitions from being scanned when you query the table based on the name column.

CREATE TABLE user(
 user_id bigint,
 name varchar(10),
 addr varchar(30),
 GLOBAL INDEX `g_i_name` (name) PARTITION BY HASH(name),
 PRIMARY KEY(user_id)
) PARTITION BY KEY(user_id);

UGSIs

A unique global secondary index (UGSI) is a special GSI that has the properties of a regular GSI and implements global UNIQUE constraints.

For example, if you want the mobile number of a user to be globally unique in the user2 table, you can create a UGSI on the phone field as the index key.

CREATE TABLE user2(
 user_id bigint,
 phone varchar(20),
 addr varchar(30),
 UNIQUE GLOBAL INDEX `g_i_phone`(phone)  PARTITION BY HASH(phone), 
 PRIMARY KEY(user_id)
) PARTITION BY KEY(user_id);

Clustered GSIs

A clustered global secondary index (clustered GSI) is a special GSI. By default, a clustered GSI contains the data of all columns of the primary table. The disk space occupied by a clustered GSI is equal to the disk space occupied by the primary table. You can use clustered GSIs to prevent all partitions from being scanned and the overhead of lookup operations.

For example, if you want to query the order_tbl table based on the user_id or order_id column, and you want to prevent lookup operations on the table when you query orders based on the user_id column, you can create a clustered GSI named cg_i_user on the user_id column as the index key. When you query orders based on the user_id column, PolarDB-X routes the query to a specific partition of cg_i_user. Because cg_i_user contains all data of the primary table, PolarDB-X does not need to look up the primary table.

CREATE TABLE order_tbl(
 order_id bigint,
 user_id bigint,
 addr varchar(30),
 info text,
 create_time datetime,
 CLUSTERED INDEX `cg_i_user`(user_id) PARTITION BY HASH(user_id), 
 PRIMARY KEY(order_id)
) PARTITION BY KEY(order_id);

Performance

The impact of GSIs on the read and write performance varies based on business scenarios. In most cases, GSIs degrade the write performance and significantly improve the read performance. In the following examples, the Sysbench scenario is used to show the impact of GSIs on the read and write throughput.

Read performance

Table	Threads	Sysbench seIect_random_ranges scenario			Sysbench seIect_random_points scenario
Table	Threads	QPS	Average latency	95% latency	QPS	Average latency	95% latency
Partitioned table	128	2769.17	46.21	99.33	5226.99	24.48	42.61
	256	3415.64	144.97	144.97	5476.76	46.73	82.96
	512	3272.46	156.31	257.95	5290.67	96.72	179.94
	1024	2453.16	416.12	539.71	5165.31	198.07	404.61
Partitioned table + GSI	128	9662.11	13.24	25.28	22584.89	5.66	9.73
	256	10431.73	24.52	51.02	25558.26	10.01	17.95
	512	15634.51	32.72	73.13	27116.56	18.86	39.65
	1024	229448.76	44.53	108.68	32509.87	31.43	73.13

After a GSI is created:

In range queries, the QPS is increased from 3415.64 to 22948.76, which indicates that the performance of range queries is improved by 571%.
In point queries, the QPS is increased from 5476 to 32509.87, which indicates that the performance of point queries is improved by 493%.

Conclusion: GSIs can improve the query performance of Sysbench on the index column.

Data write performance

Table	Threads	Sysbench seIect_random_ranges scenario			Sysbench seIect_random_points scenario
Table	Threads	QPS	Average latency	95% latency	QPS	Average latency	95% latency
Partitioned table	128	86548.12	8.87	10.27	113655.28	22.52	26.2
	256	115774.71	13.26	19.29	149677.52	34.19	44.17
	512	143928.94	20.51	34.95	14555.16	70.28	112.67
	1024	153501.7	39.53	70.55	132150.69	131.58	287.38
Partitioned table + GSI	128	52069.22	14.25	18.28	90074.59	28.41	33.72
	256	66250.79	23.17	32.53	114420.32	44.73	57.87
	512	75700.74	39.1	59.99	111093.61	92.09	142.39
	1024	76557.94	80.14	134.9	101828.32	182.51	350.33

After a GSI is created:

In write-only operations, the QPS is decreased from 153501.7 to 76557.94, which indicates that the write QPS is reduced by 50%.
In read and write operations, the QPS is decreased from 149677.52 to 114420.32, which indicates that the write and read QPS is reduced by 23%.

Conclusion: If a GSI is created, the write performance of Sysbench is degraded.

References

How to use global secondary indexes

Clustered indexes