By Weiqi
Indexes are commonly used components in databases that accelerate queries by storing data redundantly in a reorganized form. In general, proper indexes improve the query performance of the system. However, there are some misconceptions about indexes, such as the belief that more indexes always improve query performance, as shown in the example below:
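The original example table is not reproduced here. The following is a hypothetical reconstruction of the warehouse table, with index names inferred from the answer given at the end of the article; column types and any additional columns are assumptions:

```sql
-- Hypothetical reconstruction; real column types and extra columns are unknown.
CREATE TABLE warehouse (
  id         BIGINT NOT NULL,
  order_id   BIGINT NOT NULL,
  order_name VARCHAR(64),
  item_id    BIGINT,
  deleted    TINYINT,
  PRIMARY KEY (id),
  KEY idx_id (id),
  KEY idx_id_order_name (id, order_name),
  KEY idx_order_id_order_name (order_id, order_name),
  KEY idx_order_id_order_name_item_id (order_id, order_name, item_id),
  KEY idx_deleted_order_id (deleted, order_id)
);
```

Keep this table in mind: several of its indexes cost more than they return, as the rest of the article explains.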
In fact, we cannot only focus on the benefits without considering costs. Distributed database systems generally support two types of indexes: the global indexes maintained by distributed global transactions, and the local indexes maintained by local transactions. Both types of indexes affect the write performance of the system to varying degrees. The following figure shows the impact on the write performance of the system when different numbers of indexes are created.
It can be seen that creating one global index can reduce the write performance of the system to about 30% of its original level. Even in standalone MySQL, creating eight local indexes can reduce write performance to about 85% of the original.
Therefore, while we enjoy the query acceleration that indexes bring, we must also pay attention to the maintenance overhead they introduce. When an index does not deliver the expected benefit, or its overhead far exceeds the query acceleration it provides, it becomes a burden. Such indexes are referred to as inefficient indexes, and we should avoid them to help the database achieve better write performance.
Can you tell how many inefficient indexes there are in the warehouse table mentioned at the beginning of the article? Let's first discuss the common inefficient indexes in applications and then reveal the answer at the end of the article.
When a newly created index is not used by the database optimizer for its intended purpose, it is an inefficient index: it lurks within the database, consuming write performance while providing no query benefit. It is therefore crucial to identify and clean up such indexes promptly. Additionally, some indexes were heavily used for a period but are no longer utilized due to changes in business operations; indexes that linger in the database this way are also inefficient.
Alibaba Cloud PolarDB for Xscale (PolarDB-X) provides the INFORMATION_SCHEMA.GLOBAL_INDEXES view for querying the usage of global indexes on a table, so you can see which global indexes are working and which are rarely accessed or have long been unused.
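A minimal usage sketch follows; the view's exact output columns depend on your PolarDB-X version, so check the documentation for your release:

```sql
-- List global indexes and their usage statistics; add a WHERE clause to filter by table.
SELECT * FROM INFORMATION_SCHEMA.GLOBAL_INDEXES;
```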
Index selectivity is the ratio of the number of unique values in an index (also known as its cardinality) to the total number of records in the table (#T). By definition, its value ranges from 1/#T to 1. The higher the selectivity of an index, the more efficient queries tend to be, because a highly selective index helps the database filter out more irrelevant rows.
A good example of an efficient index is the primary key index: since primary keys are unique, their selectivity is the maximum value, 1, which lets the database retrieve data efficiently when querying by primary key. A typical inefficient index is one built on a low-cardinality attribute such as gender or isDelete. How do we find these low-selectivity indexes? The most direct way is to manually check the meaning of each indexed column and flag indexes on attributes such as "gender" or "delete flag". Additionally, for global indexes, PolarDB-X supports viewing the cardinality and total record count of a global index through the INFORMATION_SCHEMA.GLOBAL_INDEXES view; from these two metrics, we can calculate the index's selectivity.
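As a sketch, selectivity can also be estimated directly with a query (the table name user_info and column name gender here are hypothetical):

```sql
-- Selectivity = cardinality / #T; a result close to 1/#T indicates a poor index column.
SELECT COUNT(DISTINCT gender) / COUNT(*) AS selectivity
FROM user_info;
```

For a gender column with two values in a million-row table, this yields roughly 0.000002, far from 1, so an index on it filters out very little.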
Duplicate indexes are indexes of the same type created on the same columns in the same order. PolarDB-X does not prevent users from creating multiple duplicate indexes. Because indexes must be maintained synchronously when data is written, each duplicate index is maintained separately by the database. Additionally, when optimizing a query, the optimizer must consider each duplicate index individually, which can also hurt performance. Deliberately introducing duplicate indexes is uncommon, but accidental introduction is easy. The following SQL statement creates a single table in PolarDB-X:
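The original statement is not shown; an illustrative reconstruction (table, column, and index names are assumed) of the kind of DDL that produces this problem is:

```sql
-- Hypothetical example: three indexes end up covering the same column `id`.
CREATE TABLE tb1 (
  id   BIGINT NOT NULL,
  name VARCHAR(64),
  PRIMARY KEY (id),       -- creates a unique index on (id)
  UNIQUE KEY uk_id (id),  -- duplicate: another unique index on (id)
  KEY idx_id (id)         -- duplicate: a third index on (id)
);
```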
Users might intend to create a primary key, add a unique constraint to it, and then create an index for querying. However, this approach inadvertently creates three duplicate indexes that overlap in functionality, which is unnecessary. Some indexes may be defined as non-duplicate indexes, but in practice, they may have overlapping functions, such as the following statements to create a table:
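The original statements are not shown; the table might look like the following sketch (table and column names are assumed):

```sql
-- Hypothetical example: idx_name_id looks distinct from idx_name but overlaps with it.
CREATE TABLE tb2 (
  id   BIGINT NOT NULL,
  name VARCHAR(64),
  PRIMARY KEY (id),
  KEY idx_name (name),        -- secondary index; entries implicitly carry the primary key id
  KEY idx_name_id (name, id)  -- effectively the same as idx_name
);
```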
Some users create an index such as idx_name_id that includes all the columns used in the WHERE clause of their query SQL statements. However, when building a secondary index, the database usually appends the primary key to each index entry so that it can look up rows in the primary table. Therefore, the index idx_name already contains the primary key id and serves the same function as idx_name_id. Do not build such an index.
Redundant indexes are different from duplicate indexes. If you create an index (A, B) and then create an index (A), the latter becomes a redundant index because (A) is a prefix index of (A, B). The optimizer complies with the "leftmost prefix matching principle" when using indexes, that is, the left column in the index is preferentially used for matching, so the index (A, B) can be used as the index (A).
Redundant indexes often appear when users add new indexes to a data table. Some users prefer adding new indexes to expanding existing ones, but we should prioritize extending an existing index over adding a new one arbitrarily. If a new index is truly necessary, make sure it is neither redundant itself nor makes existing indexes redundant. Of course, blindly extending existing indexes is not advisable either: overly long index entries can hurt other SQL queries that use the index, so this is a trade-off. Besides the "leftmost prefix matching principle", we also need to pay attention to unique constraints: with a unique constraint, an index that appears redundant may not actually be redundant.
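The original table definition is not shown; a sketch of such a case (table and column names are assumed) might be:

```sql
-- Hypothetical example: idx_id looks like a redundant prefix of idx_id_name, but is not.
CREATE TABLE tb3 (
  id   BIGINT NOT NULL,
  name VARCHAR(64),
  UNIQUE KEY idx_id (id),            -- also enforces that id values are unique
  UNIQUE KEY idx_id_name (id, name)  -- only enforces uniqueness of (id, name) pairs
);
```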
Here, the index idx_id_name cannot completely replace the index idx_id: idx_id not only speeds up queries by id but also enforces that id values are unique, whereas idx_id_name only guarantees that (id, name) combinations are unique.
Distributed databases with a shared-nothing architecture, such as PolarDB-X, typically introduce the concept of "partitioning". When creating a table, users specify one or more columns as the partition key. The database routes data based on the partition key, storing it across different data nodes (DNs). If a query's WHERE clause includes the partition key, the optimizer can quickly locate a specific partition and retrieve data from it. If it does not, the query must scan all partitions, which is similar to a full table scan in standalone MySQL and imposes significant overhead on a distributed database. In practical production use, partitioning by a single dimension is often not flexible enough, and requiring every query's WHERE clause to include the partition column is equally restrictive.
Distributed databases generally support global indexes, which store a redundant copy of some of the primary table's data and use a partition key different from the primary table's. During a query, the database locates a partition based on the global index's partition key, finds the primary table's partition key and primary key there, and finally scans the primary table to obtain the complete row. Global indexes free query statements from the requirement that the WHERE clause include the primary table's partition column, avoiding the cost of full partition scans. It follows that the prerequisite for using global indexes well is to design a good partitioning method for them; in particular, a global index should not repeat the primary table's partitioning method. For example, in the following table structure, the partitioning method of the global index g_id is exactly the same as that of the primary table tb4, so g_id incurs a write cost without benefiting query performance.
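The original DDL is not shown; an approximate PolarDB-X sketch (exact syntax may vary by version, and names other than tb4 and g_id are assumed) is:

```sql
-- Hypothetical example: the global index partitions by the same key as the table itself.
CREATE TABLE tb4 (
  id   BIGINT NOT NULL,
  name VARCHAR(64),
  PRIMARY KEY (id),
  GLOBAL INDEX g_id (id) PARTITION BY HASH (id)  -- same partition key as the primary table
) PARTITION BY HASH (id);
```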
You must specify a partition key for a global index. Its data is stored on different DNs of PolarDB-X based on partitioning rules. If the partitioning rules for global indexes are not well-designed, it can lead to uneven partitioning. Some DNs would store a large amount of data and bear heavy read-write loads, while other DNs remain idle. This results in a waste of resources and can cause the database system to reach a performance bottleneck too early.
As shown in the following figure, suppose a business system has established a seller_order table. The business system is characterized by the fact that most orders come from a few big sellers. We only focus on the global index g_seller_id on the seller_order table, which uses the seller_id as the partition key. We assume that a big seller accounts for half of the total order volume of the system, and its data on the global index g_seller_id is routed to the P5 partition. We can see that the P5 partition will bear several times the load of other partitions.
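The table described above might be defined roughly as follows (an approximate PolarDB-X sketch; columns other than seller_id are assumed):

```sql
-- Hypothetical example: g_seller_id routes all of a big seller's rows to one partition.
CREATE TABLE seller_order (
  order_id  BIGINT NOT NULL,
  seller_id BIGINT NOT NULL,
  PRIMARY KEY (order_id),
  GLOBAL INDEX g_seller_id (seller_id) PARTITION BY HASH (seller_id)
) PARTITION BY HASH (order_id);
```

Because hash routing sends every row with the same seller_id to the same partition, one dominant seller concentrates its entire load on a single partition.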
A good global index should ensure that data is evenly distributed among different partitions as much as possible.
When using range partitioning as the partition strategy for global indexes in PolarDB-X, try to avoid using the time column as a partition column.
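The statement discussed below is not reproduced in this copy of the article; an approximate reconstruction (partition bounds other than '2023-07-01 11:00:00', which the text mentions, are assumed) is:

```sql
-- Hypothetical example: range partitioning on a time column with a catch-all partition.
CREATE TABLE tb5 (
  id BIGINT NOT NULL,
  tm DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (id),
  GLOBAL INDEX g_tm (tm) PARTITION BY RANGE COLUMNS (tm) (
    PARTITION p1 VALUES LESS THAN ('2023-04-01 00:00:00'),
    PARTITION p2 VALUES LESS THAN ('2023-05-01 00:00:00'),
    PARTITION p3 VALUES LESS THAN ('2023-06-01 00:00:00'),
    PARTITION p4 VALUES LESS THAN ('2023-07-01 11:00:00'),
    PARTITION p5 VALUES LESS THAN (MAXVALUE)  -- catch-all partition
  )
) PARTITION BY KEY (id);
```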
As shown in the statement above, the global index g_tm uses tm as the partition column for range partitioning, with its default value set to the current time. Here, we only consider the global index g_tm, and its partition P5 is a catch-all partition. After the '2023-07-01 11:00:00' time point, all new data to be inserted will be routed to the P5 partition (this is determined by the value of the tm column of the new data and the routing rule of the global index g_tm), so the P5 partition will become the bottleneck of data writing. The amount of data on the DN where the P5 partition is located will also keep accumulating. In the future, PolarDB-X will be optimized for this scenario. However, the usage in this example is still not recommended.
Let's first answer the question raised at the beginning of the article. There are four inefficient indexes in the warehouse table:
- Duplicate index idx_id (duplicates the primary key)
- Duplicate index idx_id_order_name (same function as the primary key)
- Redundant index idx_order_id_order_name (can be replaced by the index idx_order_id_order_name_item_id)
- Low-selectivity index idx_deleted_order_id
This article has summarized some common inefficient indexes and the causes of their inefficiency. Cleaning up these inefficient indexes regularly can effectively improve the write performance of the database. Some readers may ask: what if there are so many tables and indexes that checking them one by one is impractical? PolarDB-X provides an index inspection feature that diagnoses inefficient indexes automatically with one click and supplies reasons and optimization suggestions; it can identify all the inefficient indexes mentioned in this article. You can therefore easily check for inefficient indexes and optimize your database. Welcome to try it out.
For more information, see: https://www.alibabacloud.com/help/en/polardb/polardb-for-xscale/index-diagnostics