How to identify point queries and how to optimize queries as point queries - PolarDB

This topic explains how to identify a point query and how to optimize a query to become a point query.

A point query is a common access pattern for applications in Online Transaction Processing (OLTP) databases. It returns results by scanning only a small amount of data. For example, viewing order or product information on Taobao corresponds to a point query in the database.PolarDB-X is optimized for the response time (RT) and resource usage of point queries. This feature supports high throughput and is well-suited for high-concurrency read scenarios.

What is a point query?

A point query, as the name suggests, scans only a small amount of data. It is important to note that this refers to scanning a small amount of data, not just returning a small amount of data. For example, the statement select * from t1 order by c1 limit 1 returns only one row. However, if column c1 does not have an index, the database must scan and sort all rows in table t1 to return the result. This operation does not qualify as a point query.

In single-node databases, the most common point query retrieves data by primary key (PK). It uses the primary key index to quickly retrieve results, scanning on average only log(n) rows, where n is the total number of rows in the table. If you query using other conditions, you can add a local secondary index (LSI). The LSI is first scanned to obtain the primary key, and then a lookup is performed to retrieve the full record. In special cases, if the LSI includes all columns required by the query, the lookup step can be skipped.

PolarDB-X is a distributed database. To distribute data across multiple data nodes (DNs), it uses partitioned tables. Data is split into multiple partitions, and each partition is mapped to a DN. To split the data, you can select one or more columns as the partitioning dimension. These columns are called partitioning keys. In a distributed database, query performance is not only linearly dependent on the amount of data scanned but also increases with the number of shards scanned. Therefore, the definition of a point query must also include scanning a small number of partitions.

PolarDB-X provides transparent distributed capabilities. By default, it uses the primary key as the partition key. When you query by primary key, PolarDB-X first locates the partition where the data resides and then uses the primary key index on that partition to retrieve the results. This process achieves optimal performance. For queries that use other conditions, you can create a global secondary index (GSI). Using a GSI to optimize queries is similar to using a local secondary index (LSI). The primary key is retrieved first, and then the complete records are retrieved by accessing the table based on the index row IDs. The main difference is that a GSI is also a partitioned table, and its data is stored on different DNs from the primary table. In most cases, accessing the table based on index row IDs requires network transfers, which makes this operation more costly than in a standalone database. Therefore, PolarDB-X supports creating clustered indexes to eliminate the need to access the table based on index row IDs and achieve performance equivalent to that of primary key queries.

Important notes

Both LSIs and GSIs improve query performance at the cost of extra storage space and write overhead. You should evaluate their impact on write performance before you use them. Because the data distributions of LSIs and GSIs differ from that of the primary table, all writes that involve a GSI run in distributed transactions by default. This ensures strong consistency between the GSI and the primary table. Compared with scenarios that do not use a GSI, the write RT increases by two to three times. In addition, because the index and primary tables are written to in parallel, the chance of distributed deadlocks increases during high-concurrency writes. We recommend that you create no more than three GSIs per logical table.

How to Detect Point Queries

As previously described, a point query in a distributed database scans only a small number of shards and a small amount of data. You can check the execution plan to confirm how many shards a query scans. For more information about execution plans, see Optimization principles and execution plans. The following is an example of a point query:

Run the following statement to view the execution plan:

explain select c_custkey, c_name, c_address from customer where c_custkey = 42;

The following is the output:

+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| LOGICAL EXECUTIONPLAN                                                                                                                                            |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| LogicalView(tables="TEST1_000002_GROUP.customer_IVgG_10", sql="SELECT `c_custkey`, `c_name`, `c_address` FROM `customer` AS `customer` WHERE (`c_custkey` = ?)") |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Run the following statement to show the execution plan on the DNs and check whether the query hits the correct index on each DN. The DNs use MySQL, so their execution plans match those of MySQL. For more information, see the MySQL official documentation.

explain execute select c_custkey, c_name, c_address from customer where c_custkey = 42;

The following is the output:

+----+-------------+----------+------------+-------+---------------+---------+---------+-------+------+----------+-----------------+
| id | select_type | table    | partitions | type  | possible_keys | key     | key_len | ref   | rows | filtered | Extra           |
+----+-------------+----------+------------+-------+---------------+---------+---------+-------+------+----------+-----------------+
| 1  | SIMPLE      | customer | NULL       | const | PRIMARY       | PRIMARY | 4       | const | 1    | 100      | Using pk access |
+----+-------------+----------+------------+-------+---------------+---------+---------+-------+------+----------+-----------------+

How to optimize queries as point queries

Not all queries can be optimized to become point queries. Examples include full-table scans such as select * from t1, inefficient paged queries such as select * from t1 where c1 = 1 limit 100000, 10, and IN queries with a growing number of parameters. Queries that can be optimized to become point queries fall into two categories:

Queries with fixed-range scans: These include queries with equality conditions (or conditions that can be reduced to equality), small-range BETWEEN AND conditions, or IN conditions with a fixed number of parameters.
TopN queries with a fixed number of result rows, for example, select * from t1 where c1 > 42 limit 10 and select * from t1 order by c1 limit 10.

To optimize these types of queries, you can add suitable indexes to convert full-table scans into index scans. The following is an example:

Create the `customer` table.

CREATE TABLE `customer` (
  `c_custkey` int(11) NOT NULL,
  `c_name` varchar(25) NOT NULL,
  `c_address` varchar(40) NOT NULL,
  `c_nationkey` int(11) NOT NULL,
  `c_phone` varchar(15) NOT NULL,
  `c_acctbal` decimal(15,2) NOT NULL,
  `c_mktsegment` varchar(10) NOT NULL,
  `c_comment` varchar(117) NOT NULL,
  PRIMARY KEY (`c_custkey`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 dbpartition by hash(`c_custkey`) tbpartition by hash(`c_custkey`) tbpartitions 4;

Check the execution plan for select * from customer where c_phone = "11";.

explain select * from customer where c_phone = "11";

The following is the output:

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| LOGICAL EXECUTIONPLAN                                                                                                                                                                                                                            |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Gather(concurrent=true)                                                                                                                                                                                                                          |
|   LogicalView(tables="[000000-000003].customer_[00-15]", shardCount=16, sql="SELECT `c_custkey`, `c_name`, `c_address`, `c_nationkey`, `c_phone`, `c_acctbal`, `c_mktsegment`, `c_comment` FROM `customer` AS `customer` WHERE (`c_phone` = ?)") |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

The `customer` table has only a primary key index. Even though an equality condition is used for `c_phone`, the query still scans all shards. You can add a GSI to optimize the query.
```
create global index g_i_phone on customer(c_phone) dbpartition by hash(c_phone);
```

After you add the GSI, check the execution plan.

explain select * from customer where c_phone = "11";

The following is the output:

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| LOGICAL EXECUTIONPLAN                                                                                                                                                                                                                                                  |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Project(c_custkey="c_custkey", c_name="c_name", c_address="c_address", c_nationkey="c_nationkey", c_phone="c_phone", c_acctbal="c_acctbal", c_mktsegment="c_mktsegment", c_comment="c_comment")                                                                        |
|   BKAJoin(condition="c_custkey = c_custkey", type="inner")                                                                                                                                                                                                             |
|     IndexScan(tables="TEST1_000000_GROUP.g_i_phone_2CSp", sql="SELECT `c_custkey`, `c_phone` FROM `g_i_phone` AS `g_i_phone` WHERE (`c_phone` = ?)")                                                                                                                   |
|     Gather(concurrent=true)                                                                                                                                                                                                                                            |
|       LogicalView(tables="[000000-000003].customer_[00-15]", shardCount=16, sql="SELECT `c_custkey`, `c_name`, `c_address`, `c_nationkey`, `c_acctbal`, `c_mktsegment`, `c_comment` FROM `customer` AS `customer` WHERE ((`c_phone` = ?) AND (`c_custkey` IN (...)))") |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

After the GSI is added, the query becomes a point query on the index table that is followed by a lookup. The lookup accesses only one shard. (The execution plan shows a full-table scan on the primary table because the exact shard to scan depends on the result from the index table. This cannot be determined during the EXPLAIN phase.)

Replace the GSI with a clustered index.

drop index g_i_phone on customer;

create clustered index g_i_phone on customer(c_phone) dbpartition by hash(c_phone);

After you create the clustered index, check the execution plan.

explain select * from customer where c_phone = "11";

The following is the output:

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| LOGICAL EXECUTIONPLAN                                                                                                                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| IndexScan(tables="TEST1_000000_GROUP.g_i_phone_fHmZ", sql="SELECT `c_custkey`, `c_name`, `c_address`, `c_nationkey`, `c_phone`, `c_acctbal`, `c_mktsegment`, `c_comment` FROM `g_i_phone` AS `g_i_phone` WHERE (`c_phone` = ?)") |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

After the GSI is replaced with a clustered index, the index table contains all columns of the primary table. A lookup is no longer needed. The execution plan shows a point query on the index table.

The preceding example describes the general process of using indexes to optimize point query performance. The key step is to choose the right column for the index based on the query pattern.