By Siyun
PolarDB-X V2.4 supports clustered columnar index (CCI)-based query acceleration. It builds a mature column-oriented SQL engine on top of low-cost OSS storage to deliver superior performance, better cost-effectiveness, and faster analytical queries.
CCI-based query acceleration is built on an architecture that separates storage and compute. The columnar engine builds CCIs by consuming the binlogs of distributed transactions in real time, synchronizing the changes into CCIs, and storing them in Object Storage Service (OSS). The compute nodes (CNs) manage CCI-based queries and transactions. For a CCI-based query, a CN pulls the corresponding data from remote OSS storage and processes it with massively parallel processing (MPP).
While we benefit from the flexibility and scalability of this compute-storage separation architecture, we also face new challenges, one of which is the transaction consistency and compatibility of queries. To address this, extensive work has been done to ensure the consistency, freshness, and compatibility of CCI-based queries.
Before introducing how column-oriented consistent snapshot reads work, we first need to understand the CCI storage model.
PolarDB-X CCIs use a Delta + Main + Delete-and-Insert storage structure. The constructed data is stored in OSS as CSV + ORC + Delete bitmap files.
CSV files and Delete bitmaps support append writes. An Update in the incremental data is split into a Delete and an Insert. In each group commit cycle, inserted rows are appended to the end of the CSV file; for deletes, the columnar engine locates the deleted rows through the primary key index, generates a bitmap, and appends it to the end of the Delete bitmap. When the incremental data exceeds a threshold, the columnar engine asynchronously performs compaction to convert incremental files into query-friendly ORC files.
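The following minimal Java sketch illustrates one such group commit cycle. All class and method names (GroupCommitSketch, PrimaryKeyIndex, and so on) are hypothetical stand-ins, not actual PolarDB-X code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch of one group-commit cycle: inserts are appended to the
// CSV tail, deletes are resolved to (file, row) positions via the primary-key
// index and appended to the Delete bitmap. Names are hypothetical.
final class GroupCommitSketch {

    record DeletePosition(long fileId, int rowIndex) {}

    interface PrimaryKeyIndex {
        // Returns the physical position (file, row) currently holding this key.
        DeletePosition locate(String primaryKey);
    }

    void commitBatch(List<String[]> insertedRows,
                     List<String> deletedKeys,
                     PrimaryKeyIndex pkIndex,
                     StringBuilder csvFileTail,                   // stands in for the appendable CSV object
                     Map<Long, List<Integer>> deleteBitmapTail) { // fileId -> newly deleted row indexes
        // Updates were already split into delete + insert upstream.
        // 1. Append inserted rows to the end of the CSV file.
        for (String[] row : insertedRows) {
            csvFileTail.append(String.join(",", row)).append('\n');
        }
        // 2. Resolve deletes through the primary-key index and append them,
        //    grouped by file, to the Delete bitmap.
        for (String key : deletedKeys) {
            DeletePosition pos = pkIndex.locate(key);
            deleteBitmapTail.computeIfAbsent(pos.fileId(), f -> new ArrayList<>())
                            .add(pos.rowIndex());
        }
        // 3. Once the incremental data grows past a threshold, an asynchronous
        //    compaction job rewrites it into ORC (not shown here).
    }
}
```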
Compared with Copy-on-Write and Merge-on-Read approaches, this model achieves low-latency synchronization from row-oriented to column-oriented data without introducing excessive overhead on the query side.
For more information about how to create a CCI, please refer to A New Feature of PolarDB-X HTAP: Clustered Columnar Index.
Each group commit append write and each compaction by the columnar engine makes a new version of the column-store snapshot externally visible. Each snapshot version is uniquely identified by a TSO (a timestamp issued by the Timestamp Oracle). Every time a column-store snapshot is generated, the columnar engine writes the snapshot's metadata to GMS, including the TSO, the locations and lengths of the appended files, and the files added and removed during compaction. These column-store snapshots logically form a CCI version chain, as shown in the following figure.
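Conceptually, a node on this version chain can be pictured as a record like the following; the field names are illustrative and the actual GMS catalog schema may differ.

```java
import java.util.List;

// Hypothetical shape of one node on the CCI version chain as recorded in GMS.
record ColumnarSnapshotVersion(
        long tso,                       // globally unique timestamp identifying the snapshot
        List<String> csvFilesAppended,  // incremental CSV files appended at this version
        List<Long> csvVisibleLengths,   // visible length (byte offset) of each appended CSV file
        List<String> deleteBitmapFilesAppended,
        List<Long> deleteBitmapVisibleLengths,
        List<String> filesAddedByCompaction,   // e.g. newly produced ORC files
        List<String> filesRemovedByCompaction, // incremental files retired by compaction
        long previousTso                // link to the previous node in the chain
) {}
```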
PolarDB-X CCI-based queries also require multi-version concurrency control (MVCC). Unlike row-oriented storage, CCI write requests come only from the internal columnar node, and only read requests come directly from users. The problem can therefore be simplified into the following model.
The group commit and compaction of the columnar engine continuously extend the version chain through a two-phase commit protocol. In the prepare phase, a new version is allocated on the version chain; after the incremental data has been written to OSS, the commit makes this node on the version chain externally visible. In addition, to prevent the version chain from growing indefinitely, the columnar engine periodically purges versions that are too old, freeing space in GMS and OSS.
The columnar query engine needs to serve concurrent queries. The leader CN in the MPP cluster actively detects new versions on the version chain and broadcasts them to the other CNs, so that the column-store snapshot of a version can be queried no matter which CN receives the traffic. Each CCI-based query from a user is randomly routed to a CN, which acts as the coordinator of the MPP query: it finds the latest column-store version on the version chain and builds a column-store ReadView. A column-store ReadView contains the TSO, the list of files visible at that version, and the visible offset of each incremental file. The coordinator then constructs column-oriented scan operators based on the ReadView and distributes the file scan tasks to the compute nodes (workers) according to certain rules. By scheduling tasks by file hash or partition hash, we preserve local cache affinity on each compute node; this is described in the next section.
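As a rough illustration, building a column-store ReadView from the version chain might look like the following sketch. All names are hypothetical, and the sketch assumes each version node already carries the full set of visible files, whereas the real system derives that set from the per-version deltas recorded in GMS.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;

// Illustrative sketch: the coordinator pins the newest committed version on the
// chain and freezes it into a column-store ReadView.
final class ReadViewSketch {

    record VersionNode(long tso,
                       List<String> visibleOrcFiles,
                       Map<String, Long> csvVisibleOffsets) {}

    record ColumnarReadView(long tso,
                            List<String> visibleOrcFiles,
                            Map<String, Long> csvVisibleOffsets) {}

    static ColumnarReadView buildLatest(NavigableMap<Long, VersionNode> versionChain) {
        // The chain is keyed by TSO; the last committed entry is the newest snapshot.
        VersionNode newest = versionChain.lastEntry().getValue();
        return new ColumnarReadView(newest.tso(),
                List.copyOf(newest.visibleOrcFiles()),
                new HashMap<>(newest.csvVisibleOffsets()));
    }
}
```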
To prevent query failures caused by snapshots being cleaned up prematurely, the leader CN periodically collects all column-store ReadViews active on the CNs, derives the low watermark of active column-store transactions, and notifies the columnar engine that only versions below this watermark may be purged.
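The watermark rule itself is simple; here is a sketch under the assumption that each active ReadView is represented only by its TSO.

```java
import java.util.Collection;

// Sketch of the low-watermark rule: only versions strictly below the smallest
// TSO still referenced by an active ReadView may be purged. Names are illustrative.
final class PurgeWatermarkSketch {
    static long lowWatermark(Collection<Long> activeReadViewTsos, long latestCommittedTso) {
        // If no query is in flight, purge may advance up to the latest committed version.
        return activeReadViewTsos.stream()
                                 .mapToLong(Long::longValue)
                                 .min()
                                 .orElse(latestCommittedTso);
    }
}
```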
Based on the CCI version chain, we also provide consistent row-column hybrid snapshots. Because of the group commit mechanism, one column-store append write corresponds to multiple row-store transactions. Therefore, in hybrid row-column queries, to keep the row-store and column-store snapshots consistent, we take the TSO of the latest column-store snapshot and use InnoDB's MVCC mechanism on the row store to read the row-store snapshot at that TSO.
Unlike systems with integrated compute and storage, which can query incremental data in place through a write cache, PolarDB-X compute nodes still need to pull incremental data from remote OSS. This calls for a more carefully designed cache system to reduce the latency of CCI-based queries and improve user experience. To speed up both the construction and the querying of column-oriented consistent snapshots, we introduce affinity-aware caching along with several other optimizations.
The columnar query engine uses a multi-level local cache spanning both memory and local disk.
For an MPP column-oriented query, the coordinator uses the column-store ReadView to construct the file scan tasks for each scan operator. To maximize use of each compute node's local cache and minimize cache misses, the cache must have node affinity: across all CCI-based queries, scan tasks for the same file should be scheduled to the same compute node whenever possible.
In addition, for incremental data we designed a multi-version cache structure tailored to its characteristics to reduce read overhead. The overall cache design is shown in the following figure.
When scheduling scan tasks, we use two scheduling algorithms for different scenarios: scheduling by file hash and scheduling by partition hash.
Every scan operator of an MPP query is scheduled by one of these two algorithms. It follows that, if the cache space is large enough, any piece of data causes at most two cache misses. Thanks to this scheduling mechanism, PolarDB-X CCI-based queries can adaptively adjust their scheduling under different workloads while still guaranteeing a degree of cache node affinity.
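A minimal sketch of such cache-affine scheduling is shown below; the hash function and the exact mapping rule are illustrative assumptions rather than the engine's actual implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Sketch of cache-affine scan scheduling: the same file (or partition) always
// hashes to the same worker, so repeated queries hit that worker's local cache.
final class ScanSchedulerSketch {

    enum Mode { FILE_HASH, PARTITION_HASH }

    static int assignWorker(Mode mode, String fileName, int partitionId, int workerCount) {
        long key;
        if (mode == Mode.FILE_HASH) {
            // Deterministic hash of the file name keeps a file pinned to one worker.
            CRC32 crc = new CRC32();
            crc.update(fileName.getBytes(StandardCharsets.UTF_8));
            key = crc.getValue();
        } else {
            // Partition-hash mode keeps all files of a partition on the same worker.
            key = partitionId;
        }
        return (int) Long.remainderUnsigned(key, workerCount);
    }
}
```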
To balance write performance and query freshness, incremental data is stored in append-only CSV files. Because CSV is row-oriented, its query performance is naturally inferior to ORC. On the other hand, CSV files hold the most recently written incremental data, so their query performance directly shapes the user experience of CCI-based queries: if reading incremental files becomes a bottleneck, users see query performance jitter whenever data is written. We therefore designed the CSV cache to reuse data as much as possible and trade space for time, converting the row-oriented CSV into column-oriented chunk caches on the compute nodes to reduce the cost of reading incremental data.
Given the append-only nature of CSV, consider a file that contains 2,700 rows at snapshot T0, with 300 more rows appended at snapshot T1 and another 200 rows appended at snapshot T2. For queries at T1 and T2, the cached first 2,700 rows and first 3,000 rows can be reused, respectively. We therefore adopt a segmented multi-version cache to reuse cached data as much as possible. The overall design is shown in the following figure.
The CSV cache uses a skip list to maintain the cache nodes of multiple versions; each leaf node points to the chunks appended by the corresponding version. In the query engine, data and intermediate results are represented as chunks, each holding 1,000 rows by default and organized by column in memory. The CSV cache retains these chunks directly, so the executor can use the cached data without converting CSV rows to columns again. Based on the cache structure in the preceding figure, we walk through three processes: querying, loading a new version, and cache purge.
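A simplified sketch of this segmented multi-version cache for a single CSV file follows. The skip list is modeled with ConcurrentSkipListMap, and the Chunk type is a placeholder for the executor's columnar chunk; all names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

// Sketch of the segmented multi-version CSV cache: entries are keyed by version
// TSO, and each entry holds only the chunks appended by that version. A reader
// at version T concatenates all entries with TSO <= T, so segments cached for
// older versions are reused verbatim.
final class CsvChunkCacheSketch {

    // Placeholder for the executor's columnar chunk: one array per column,
    // roughly 1,000 values each.
    record Chunk(Object[][] columnVectors) {}

    private final ConcurrentSkipListMap<Long, List<Chunk>> chunksByVersion =
            new ConcurrentSkipListMap<>();

    // Loading a new version: only the newly appended CSV rows are parsed and
    // converted into chunks; older segments stay untouched.
    void loadVersion(long tso, List<Chunk> appendedChunks) {
        chunksByVersion.put(tso, appendedChunks);
    }

    // Querying with a ReadView at 'tso': return every chunk visible at that
    // version, reusing all segments cached for earlier versions.
    List<Chunk> snapshotChunks(long tso) {
        List<Chunk> visible = new ArrayList<>();
        chunksByVersion.headMap(tso, true).values().forEach(visible::addAll);
        return visible;
    }

    // Cache purge (simplified): once compaction has rewritten this CSV file into
    // ORC and its versions fall below the purge watermark, the whole per-file
    // cache can be dropped; newer queries read the ORC file instead.
    void evict() {
        chunksByVersion.clear();
    }
}
```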
For a delete operation, the columnar engine uses the primary key index to find the positions of all rows affected by the batch, that is, which rows in which files are deleted, and aggregates them by file into bitmaps. We chose RoaringBitmap as the storage structure because it offers a high compression ratio and performed well in our benchmarks. After generating a RoaringBitmap, the columnar engine serializes it and appends it in a fixed format to the Delete bitmap file for the query engine to pull. Like the CSV cache, the Delete bitmap cache also uses a multi-version design, shown in the following figure.
For the multi-version Delete bitmap cache, as with the CSV cache, we load and parse the RoaringBitmap appended by each version together with its file (File ID) and version (TSO), and organize them in memory as linked lists or skip lists. However, to materialize the bitmap of a given version, we do not merge all incremental bitmaps up to that version. Instead, we keep the latest bitmap T materialized in memory and subtract (via XOR) the delta bitmaps between the requested version and the latest one. Since users mostly query recent versions, this optimization pays off noticeably when the version chain is long.
When executing CCI-based queries, we merge the ORC or CSV data with the Delete bitmap and filter out the deleted rows to obtain the correct snapshot. As shown in the figure, to obtain the snapshot of the file with File ID 2 at time T1, we subtract Δ3 and Δ2 from bitmap T to get the T1 version of the bitmap, and then apply it to the chunks to produce the snapshot at T1.
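The following sketch models the multi-version Delete bitmap cache for one file, assuming the widely used org.roaringbitmap Java library; the cache structure and names are illustrative, not the actual PolarDB-X classes.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

import org.roaringbitmap.RoaringBitmap;

// Sketch for one file: the bitmap of the latest version T is kept materialized,
// and an older version is derived by XOR-ing back the per-version deltas,
// matching the "subtract Δ3 and Δ2 from bitmap T" example above.
final class DeleteBitmapCacheSketch {

    private final NavigableMap<Long, RoaringBitmap> deltasByTso = new TreeMap<>();
    private RoaringBitmap latest = new RoaringBitmap();

    // Load a newly appended delta (rows deleted at this version) for this file.
    void applyDelta(long tso, RoaringBitmap delta) {
        deltasByTso.put(tso, delta);
        latest = RoaringBitmap.or(latest, delta);
    }

    // Reconstruct the delete bitmap as of 'tso' by removing the deltas newer
    // than that version from the latest bitmap. A row is deleted at most once,
    // so the deltas never overlap and removal can be expressed as XOR.
    RoaringBitmap bitmapAt(long tso) {
        RoaringBitmap result = latest.clone();
        for (RoaringBitmap delta : deltasByTso.tailMap(tso, false).values()) {
            result.xor(delta);
        }
        return result;
    }
}
```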
All of these caches are placed under a unified memory management mechanism to prevent memory-hungry large queries and long-lived resident caches from dragging down overall performance. When memory runs short, the compute node spills caches to disk and releases the memory.
The caches of the columnar query engine trade space for time: they reduce the cost of pulling data from remote storage and accelerate both the generation and the querying of column-store snapshots. However, when new data is written and users query this new version, a cache miss is inevitable because that version has not yet been loaded into the cache, and the data must be fetched from remote OSS. This causes performance jitter in CCI-based queries: whenever new data is written, queries on the corresponding CCI stall on cache misses, which hurts user experience. We therefore introduce a data preheating mechanism that proactively pulls metadata and incremental data from OSS into the local disk and memory caches of the compute nodes.
When processing incremental data, to reduce network I/O and improve upload efficiency, the columnar engine first batches the data on its local disk and then uploads it to remote OSS. For relatively new incremental data, therefore, the columnar engine keeps a copy on its local disk in addition to persisting it in OSS. Appendable objects in OSS also have higher latency than normal objects when newly written content is pulled for the first time. To accelerate incremental data queries, we introduce a bypass loading mechanism: when a compute node needs data that still exists on the columnar engine's local disk, it requests that data directly over the internal network instead of from OSS, as indicated by the red dashed line in the diagram below.
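A sketch of how such preheating with bypass loading might be wired on a compute node is shown below; the interfaces and the decision logic are illustrative assumptions, not actual PolarDB-X APIs.

```java
// Sketch of the warm-up / bypass-loading path for newly appended incremental
// data: when a new version is broadcast, each worker prefetches the byte range
// appended since the previous version, preferring the columnar node's local
// copy (internal network, the red dashed line) over remote OSS.
final class IncrementalPreheatSketch {

    interface ByteSource {
        byte[] read(String file, long offset, long length);
    }

    private final ByteSource columnarNodeLocalDisk; // internal-network bypass
    private final ByteSource ossObjectStore;        // remote OSS fallback

    IncrementalPreheatSketch(ByteSource columnarNodeLocalDisk, ByteSource ossObjectStore) {
        this.columnarNodeLocalDisk = columnarNodeLocalDisk;
        this.ossObjectStore = ossObjectStore;
    }

    // Called on a new-version notification for each appended range
    // [previousVisibleLength, newVisibleLength) of a CSV or Delete bitmap file.
    // The returned bytes are placed into the worker's memory/disk cache so the
    // first query on the new version does not hit a cache miss.
    byte[] preheat(String file, long previousVisibleLength, long newVisibleLength,
                   boolean stillOnColumnarNodeDisk) {
        ByteSource source = stillOnColumnarNodeDisk ? columnarNodeLocalDisk : ossObjectStore;
        return source.read(file, previousVisibleLength, newVisibleLength - previousVisibleLength);
    }
}
```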
Building on the column-store version chain and efficient snapshot generation, we can support flashback queries on CCIs. By default, users can only query data within the recent retention window, that is, snapshots not yet cleaned up by the columnar engine. Users who need flashback queries can enable the column-store snapshot saving feature to manually or automatically save column-store snapshots at specified points in time so that those versions are not deleted. In a flashback query, the user specifies a time or a TSO; the columnar query engine finds the closest version on the version chain at or before that point and builds a column-store snapshot for the query. Because flashback queries differ from regular business traffic and lack temporal locality, the columnar query engine bypasses the caches on the main query path and pulls data directly from remote storage to avoid cache pollution. This feature will be described in detail in subsequent articles. Please stay tuned.
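Resolving the flashback version amounts to a floor lookup on the version chain, as in this small sketch (names and the example TSOs are illustrative):

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch of flashback version resolution: given a user-specified TSO (or a time
// already converted to a TSO), pick the newest retained version that is not
// newer than it.
final class FlashbackVersionSketch {

    static Long resolveFlashbackVersion(NavigableMap<Long, ?> retainedVersionsByTso, long requestedTso) {
        Map.Entry<Long, ?> entry = retainedVersionsByTso.floorEntry(requestedTso);
        if (entry == null) {
            throw new IllegalArgumentException(
                    "Requested snapshot is older than every retained column-store version");
        }
        return entry.getKey();
    }

    public static void main(String[] args) {
        NavigableMap<Long, String> chain = new TreeMap<>();
        chain.put(100L, "v100");
        chain.put(200L, "v200");
        chain.put(300L, "v300");
        // A flashback request at TSO 250 is served by the snapshot committed at TSO 200.
        System.out.println(resolveFlashbackVersion(chain, 250L)); // prints 200
    }
}
```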
PolarDB-X CCIs support multi-version schemas. For more information, please refer to Full Support of PolarDB-X Columnar Index for Multi Schema.
When DDL (Data Definition Language) operations are performed, CCIs maintain a new schema version instead of immediately converting the data types in existing physical files; newly written data is stored in the format of the new schema. The query engine maintains the mapping between schema versions and the specific ORC and CSV files. During a query, data written under the latest schema is parsed and returned directly, while data written under an old schema goes through column mapping, data type conversion, and default value filling as it is parsed. Based on this mechanism, we can later provide schema time travel, that is, restoring the schema as of a specified time or querying data through the schema of a specified time.
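The per-row handling of old-schema data can be pictured as the following sketch: columns are remapped, types are cast, and missing columns are filled with defaults. The schema model and the cast helper are illustrative placeholders, not the engine's actual code.

```java
import java.util.List;

// Sketch of projecting a row written under an old schema onto the latest schema.
final class SchemaEvolutionReadSketch {

    record ColumnDef(String name, String type, Object defaultValue) {}
    record Schema(long versionTso, List<ColumnDef> columns) {}

    static Object[] projectToLatest(Object[] oldRow, Schema fileSchema, Schema latestSchema) {
        Object[] out = new Object[latestSchema.columns().size()];
        for (int i = 0; i < out.length; i++) {
            ColumnDef target = latestSchema.columns().get(i);
            int oldIndex = indexOf(fileSchema, target.name());
            if (oldIndex < 0) {
                out[i] = target.defaultValue();                   // column added after this file was written
            } else {
                out[i] = castTo(oldRow[oldIndex], target.type()); // type may have changed via DDL
            }
        }
        return out;
    }

    private static int indexOf(Schema schema, String columnName) {
        for (int i = 0; i < schema.columns().size(); i++) {
            if (schema.columns().get(i).name().equals(columnName)) return i;
        }
        return -1;
    }

    // Placeholder: the real engine follows MySQL casting rules here.
    private static Object castTo(Object value, String targetType) {
        return value;
    }
}
```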
DDL is common practice in databases, and frequent DDL operations are inevitable in fast-iterating business systems. With the asynchronous DDL approach described above, the cost of CCI-based queries clearly grows with the number of DDL operations; in particular, if the type of the same column is changed repeatedly, the oldest data would need to be converted through multiple steps, which is unacceptable. Therefore, the column-store compaction task also converts old-version data, rewriting all old schema versions to the latest one. This ensures that a single snapshot version does not contain multiple schema versions, reducing the overhead of subsequent CCI-based queries.
Built on the MySQL ecosystem, PolarDB-X CCI-based queries provide the same type compatibility as row-oriented queries. For CSV files, we use a storage format compatible with MySQL's native binlog Write_rows_log_event and parse it at query time. For ORC files, since the native ORC format is not directly compatible with the MySQL type system, we made extensive modifications and optimizations, especially for types such as decimal, timestamp, and string, improving the compression ratio while ensuring compatibility with MySQL types.
After a DDL operation, the column-oriented query engine needs to convert types at query time. To provide the same MySQL compatibility as row-oriented queries, we follow MySQL's conversion logic and carefully handle issues such as numerical precision and data truncation.