Guidelines for creating CCIs - PolarDB - Alibaba Cloud Documentation Center

This topic describes the guidelines for creating clustered columnar indexes (CCIs) for tables in specific scenarios to improve query performance.

Scenarios

Important

CCIs are asynchronously built based on Change Data Capture (CDC) nodes to help ensure consistent query results. However, second-level latency occurs when data is synchronized from the primary instance to column store read-only instances. Therefore, we recommend that you do not use CCIs in scenarios in which real-time data access is critical.

Scenarios that involve different query workloads or require acceleration for complex queries
Note
CCIs can significantly improve the performance of complex analytical processing (AP) queries.
Cold data archiving scenarios
Note
Column store read-only nodes (compute nodes) can access metadata by using CCI data stored in Object Storage Service (OSS). Therefore, you can use columnar indexing to store cold data in OSS to reduce storage costs. For more information, see TTL.
Scenarios in which historical snapshots are saved and queried
Note
A CCI can serve as a non-expiring replica of historical data. It allows you to retain and query historical snapshots. This feature is commonly used for audit and backup business.
ETL scenarios
Note
A CCI can serve as a data replica of the primary instance. You can connect to a column store read-only instance that uses a CCI to perform data extract, transform, load (ETL) and transfer the data to another system.

Partitioned tables

Common partitioning methods

Partitioning is a method used to divide large tables into smaller, more manageable pieces, which improves query performance and data management efficiency for large datasets. Syntax for partitioning CCIs:

PARTITION BY
HASH({column_name | partition_func(column_name)})
| KEY(column_list)
| RANGE({column_name | partition_func(column_name)})
| RANGE COLUMNS(column_list)
| LIST({column_name | partition_func(column_name)}) 
| LIST COLUMNS(column_list)} }

Note

Range partitioning (RANGE): partitions data based on the value ranges of a column. For example, if a table stores historical sales data, you can partition the table by sales year to place all records from the same year in the same partition.
List partitioning (LIST): partitions data based on whether values in a column fall within a predefined set of values. This partitioning method is commonly used in software as a service (SaaS) scenarios. For example, you can partition a user data table by country to place the user data of the same country in the same partition.
Hash partitioning (HASH/KEY): partitions data based on the hash value of a column, which ensures a more even distribution of data across partitions. Hash partitioning is suitable for scenarios in which data distribution is unpredictable or even data distribution is required.
For more information, see Partitioning types.

Guidelines for configuring the number of partitions

Theoretically, the number of partitions is determined based on the number of tables and the specifications of the column store read-only instance. By default, the number of partitions is 16. We recommend that you do not use the default number of partitions. A common guideline is to use the following formula to calculate the number of partitions: Number of computing nodes × Number of cores on a compute node. You can specify a number higher than the number calculated by using the preceding formula for potential data growth.

Note

If you perform operations that involve joining multiple tables, we recommend that you keep the number of partitions consistent across the involved tables within the same instance.

Guidelines for choosing a partitioning method

CCIs are primarily designed to improve the performance of AP queries, which involve aggregations and join operations in most cases. To fully leverage the parallel scanning and querying benefits of columnar storage, we recommend that you use hash partitioning.
For time-related queries, use the involved time column to create subpartitions. For other scenarios in which the query does not heavily rely on time-related conditions, do not create subpartitions for CCIs unless necessary. For more information, see Subpartitioning.
If the data that you want to query is not based on specific ranges, we recommend that you do not use range partitioning. If the data that you want to query is not based on a predefined list of values, we recommend that you do not use list partitioning. Even if the data that you want to query is suitable for range or list partitioning, we recommend that you prioritize row-oriented storage to meet your query requirements.
If your business archives cold data based on a CCI, we recommend that you use range partitioning based on a time column.

Guidelines for selecting a partition key

Select a column that has evenly distributed values as the partition key. For example, you can select the transaction ID column, device ID column, user ID column, or an auto-increment column.
Note
We recommend that you do not use the date, time, or timestamp column as the partition key. If most of your writes are concentrated in the same day or time period, data may accumulate on the same partition when you use time-based partition key. This can lead to degraded performance. Most queries focus on specific time ranges, such as records from the previous day or month. If you use a time-based key, you may access data on a single node that stores data for that specific date or time range. As a result, the distributed nature of the database is not effectively utilized. If the date or time column is required for your queries, you can use the date or time column as the subpartition key.
Select the column that is frequently involved in JOIN or GROUP BY operations as the partition key. This way, you can minimize data redistribution overhead. For example, if you want to analyze historical order data by customer, you can use the customer ID column as a partition key.
Select the column that is frequently used in non-range query conditions as the partition key to enable data pruning based on the key.
Each table can have only one partition key, which can consist of one or more fields. Fewer fields in the partition key can help improve the adaptability of the key in complex query scenarios.

Important

If you do not specify a partition key when you create a table, the system uses the primary key as the partition key. If no explicit primary key is available, the system uses the implicit primary key as the partition key.
After you create a CCI for a table, you can execute the check columnar partition db_name.tbl_name statement to check the data distribution across the partitions. This way, you can check whether the partition key is appropriate and identify potential data skew.

Sort key

Introduction

The sort key of a CCI determines how the data is sorted within the CCI file. The metadata of each column data block contains the minimum and maximum values of all data in the column data block. When you query data, all column data blocks involved in the query are scanned. If you enable Pruner, Pruner classifies the column data blocks into three categories based on the query conditions and metadata: relevant, possibly relevant, and irrelevant. Only relevant and possibly relevant column data blocks are scanned for queries. Column data blocks can be sorted in different order, which leads to different combinations of data. The filtering of Pruner varies based on the data arrangement. You can change the query conditions to change how data blocks are sorted. This helps improve the query performance. For more information about the Pruner feature, see Configure a filter algorithm for IMCI-based queries.

Guidelines for selecting a sort key

In scenarios in which a table is frequently queried based on the value range of a specific column, we recommend that you use the column as the sort key.
In scenarios in which data is retrieved in pages, we recommend that you use the column specified in the ORDER BY clause as the sort key.
In other scenarios, we recommend that you use the partition key as the sort key.

Dictionary encoding

Introduction

Dictionary encoding converts string values into numerical values, which can significantly accelerate operations such as GROUP BY and FILTER, improve data compression ratio, and reduce storage costs. In PolarDB-X, you can dictionary-encode specific fields when you create a CCI to build a dictionary mapping for the values of the fields. Examples:

# Specify columns col1 and col2 for dictionary encoding.
DICTIONARY_COLUMNS='col1,col2';
# Explicitly create a CCI on a table and specify columns for dictionary encoding.
CREATE CLUSTERED COLUMNAR INDEX `cc_i_seller` ON t_order (`seller_id`) partition by hash(`order_id`) partitions 16
dictionary_columns='order_id,seller_id';

Guidelines for selecting columns for dictionary encoding

We recommend that you select columns that have a low cardinality for dictionary encoding such as gender or region columns that have few distinct values.
We recommend that you do not apply dictionary encoding to all character columns to prevent additional encoding and decoding overheads.

Note

Dictionary-encoding columns that have a small cardinality can compress data to save storage space and improve query performance. However, queries in a distributed database involve parsing and merging of the dictionary. This can cause additional overheads. Therefore, dictionary-based queries are automatically disabled during the query process. To enable dictionary-based queries, set the ENABLE_COLUMNAR_SLICE_DICT parameter to TRUE.

FAQ

Is the number of partitions affected when I change the specifications of a cluster?
No.
Can I change the partition key, sort key, number of partitions, and dictionary-encoded columns for a CCI?
No. To change the partition key, sort key, number of partitions, and dictionary-encoded columns for a CCI, delete the CCI and then recreate the CCI.
Do I need to purchase a column store read-only instance to create CCIs?
You can create CCIs on the primary instance. To query CCI data, we recommend that you purchase a column store read-only instance.