
PolarDB-X Column-oriented Storage Query Engine

This article introduces, from the perspective of the SQL engine executor, the new features that accelerate analytical queries on PolarDB-X column-oriented storage.

By Junqi

PolarDB-X V5.4.19 supports column-oriented storage query acceleration. It builds a mature column-oriented SQL engine on a low-cost OSS storage base to provide query analysis acceleration with better performance and higher cost-effectiveness. We verified the performance of this new query system in analytical acceleration scenarios with the TPC-H benchmark. The following table shows the results for the TPC-H 100GB data set on six different public cloud instance configurations:

| Instance Type | 8c32g * 2 | 8c32g * 4 | 16c64g * 2 | 16c64g * 3 | 16c64g * 4 | 16c64g * 6 |
| --- | --- | --- | --- | --- | --- | --- |
| Total Time (Seconds) | 99.82 | 48.55 | 50.31 | 33.00 | 25.54 | 20.56 |

The following table shows the results for the TPC-H 1TB data set on three different public cloud instance configurations:

| Instance Type | 3*16C128GB | 4*16C128GB | 6*16C128GB |
| --- | --- | --- | --- |
| Total Time (Seconds) | 508.76 | 408.45 | 233.93 |

In this article, we focus on the SQL engine executor and introduce the new features that accelerate analytical queries on PolarDB-X column-oriented storage.

Overall Introduction to the Column-oriented Storage Executor

Column-oriented Storage

PolarDB-X column-oriented storage keeps columnar data in OSS in the ORC file format. Depending on the data type and compression algorithm, the data typically achieves a 3x to 10x compression ratio on OSS, keeping storage costs extremely low.

During query execution, the executor reads columnar data from OSS through the column-oriented storage Scan module and returns it to the upper-layer operators in column-oriented chunks. The Scan module also caches the data hierarchically in memory and on local disk according to configurable cache policies and compression/decompression modes, exploits the principle of locality to shorten the query path, and manages the cache dynamically with algorithms such as LFU, improving the resource efficiency and performance of the overall system.

From the moment the column-oriented storage Scan module reads the data, the data of a columnar table is processed in column form throughout the operator tree, the pipelines, and the time slices. We have designed data structures, operator execution logic, and scheduling algorithms that are better suited to column-oriented storage to maximize its performance advantages in analytical query scenarios.

Pipeline Framework

After receiving SQL statements, the database forms an executable SQL plan through the parser and the optimizer and then hands it over to the executor to complete the specific execution logic.

The PolarDB-X column-oriented storage executor adopts an MPP execution system based on a pipeline framework and a time-slice scheduling mechanism. A tree-structured SQL plan composed of operators could, in principle, be executed serially: one operator finishes before the next one starts.

In the pipeline framework, however, the operator tree is split into pipelines (operator chains). Each pipeline is a chain fragment composed of operators in which every operator is streaming except the last one, which is blocking.

Once a pipeline is given a degree of parallelism, it is instantiated as execution units called Drivers, all with identical execution logic. Under the MPP time-slice mechanism, assigning each Driver the data it should process yields Tasks, which are handed to the time-slice scheduler for execution. In the MPP scheduling scenario, different pipelines form a fragment, which is the basic unit scheduled onto physical nodes. The pipeline processing described above is repeated inside each node, and shuffle/exchange is used to transmit data between fragments.

The philosophy of pipeline execution is inspired by CPU instruction pipelining. In this execution mode, a downstream operator can start executing without waiting for the upstream operator to process all of its data. This greatly improves resource utilization and the degree of parallelism of the entire query, allowing different query phases to be processed simultaneously on multiple cores or processors.
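To make the split concrete, the following is a minimal sketch (in Java, with illustrative names rather than actual PolarDB-X classes) of how a bottom-up operator chain could be cut into pipelines at blocking operators:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal sketch: a linear, bottom-up operator chain is cut after every blocking
// operator, so each resulting pipeline is a run of streaming operators terminated
// by a blocking one (or by the plan root).
public class PipelineSplitter {

    record Op(String name, boolean blocking) {}

    static List<List<Op>> split(List<Op> bottomUpChain) {
        List<List<Op>> pipelines = new ArrayList<>();
        List<Op> current = new ArrayList<>();
        for (Op op : bottomUpChain) {
            current.add(op);
            if (op.blocking()) {            // a blocking operator closes the pipeline
                pipelines.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) pipelines.add(current);
        return pipelines;
    }

    public static void main(String[] args) {
        // e.g. Scan -> Filter -> partial HashAgg (blocking) -> Exchange -> final HashAgg (blocking)
        List<Op> chain = Arrays.asList(
                new Op("ColumnarScan", false),
                new Op("Filter", false),
                new Op("PartialHashAgg", true),
                new Op("LocalExchange", false),
                new Op("FinalHashAgg", true));
        split(chain).forEach(p -> System.out.println(p.stream().map(Op::name).toList()));
        // prints [ColumnarScan, Filter, PartialHashAgg] and [LocalExchange, FinalHashAgg]
    }
}
```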

For example, for Q5 in the TPC-H 1TB scenario, the PolarDB-X column-oriented storage executor produces the following fragment and pipeline division, which is then scheduled by the MPP framework.

[Figure 1: Fragment and pipeline division of TPC-H 1TB Q5]

Time Slice Mechanism

Time-slice scheduling in the MPP execution framework is the cornerstone of the PolarDB-X column-oriented storage executor. It simulates, at the application level, the time-slice scheduling mechanism of an operating system and manages the tasks described above at a fine granularity of computing resources.

Both among the parallel Drivers within a single query and among multiple queries running concurrently, the time-slice scheduling mechanism precisely controls the fair allocation of system resources (such as CPU time, memory, and I/O), maximizing the throughput of the system and reducing SQL response time.

The core idea of time-slice scheduling is to divide the available MPP threads into a series of fixed-length time slices and, together with dedicated scheduling queues, allocate them in turn to the tasks awaiting execution.

When a task uses up its time slice without completing, it is suspended and the scheduler selects another task to run. This process repeats until all tasks are completed.
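The following is a minimal, single-threaded sketch of this cooperative time-slice idea, assuming a hypothetical Task interface; the real scheduler additionally tracks blocked tasks and runs on a pool of MPP threads:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal sketch of cooperative time-slice scheduling (illustrative only, not the
// PolarDB-X implementation). Each task runs for at most one quantum, then yields;
// unfinished tasks go back to the pending queue until all work is done.
public class TimeSliceScheduler {

    interface Task {
        // Process work for up to `quantumNanos`; return true when the task is finished.
        boolean runFor(long quantumNanos);
    }

    private final Deque<Task> pending = new ArrayDeque<>();
    private final long quantumNanos;

    TimeSliceScheduler(long quantumNanos) { this.quantumNanos = quantumNanos; }

    void submit(Task t) { pending.addLast(t); }

    void runToCompletion() {
        while (!pending.isEmpty()) {
            Task t = pending.pollFirst();       // pick the next ready task
            boolean finished = t.runFor(quantumNanos);
            if (!finished) pending.addLast(t);  // suspend and re-queue; a fuller model
                                                // would also park blocked tasks separately
        }
    }

    public static void main(String[] args) {
        TimeSliceScheduler scheduler = new TimeSliceScheduler(1_000_000L); // 1 ms quantum
        for (int i = 0; i < 3; i++) {
            final int id = i;
            final int[] remaining = {3};        // each task needs 3 quanta of work
            scheduler.submit(quantum -> {
                System.out.println("task " + id + " ran one slice");
                return --remaining[0] == 0;
            });
        }
        scheduler.runToCompletion();
    }
}
```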

Operator Design

PolarDB-X optimizes the performance of operators such as Join, Agg, and GroupJoin to better adapt to column-oriented storage HTAP scenarios. These improvements leverage type-specific operations, vectorized processing, and optimized data structures to improve execution efficiency.

Following a bottom-up design principle, operator design focuses on optimizing the core execution logic of operators, such as array access and hash table probing, and then builds the outer logic that supports these core parts. This design method provides the best vectorized algorithm implementation for each specific scenario based on the data types and algorithms involved.

For the Join operator, we optimize hash probing, for example by comparing join key values through type-specific arrays. For the aggregation operator, we redesign the specialized interface of the aggregation functions and optimize it based on the distribution characteristics of group IDs. The hash table is refactored to reduce the average number of memory accesses per row: type-specific arrays store the key values of the different group keys, and int arrays of corresponding lengths store the group IDs, eliminating unnecessary memory accesses. For the build side of HashJoin, we add type-specific lists and optimize the locking mechanism to improve efficiency.
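As a rough illustration of this type-specific probe path, the sketch below (hypothetical names; long join keys and unique build keys assumed) stores keys in a primitive long[] and build positions in an int[], so probing compares primitives directly without boxing:

```java
import java.util.Arrays;

// Illustrative sketch of a type-specific probe path: join keys of a long column are
// compared directly against a primitive long[] key table, and matched build-side
// positions are collected into int[] selection vectors. Assumes unique build keys.
public class LongHashTableProbe {

    private final long[] keys;       // open-addressing key slots
    private final int[] positions;   // build-side row position per slot, -1 = empty
    private final int mask;

    LongHashTableProbe(int capacityPow2) {
        keys = new long[capacityPow2];
        positions = new int[capacityPow2];
        Arrays.fill(positions, -1);
        mask = capacityPow2 - 1;
    }

    void insert(long key, int buildPosition) {
        int slot = hash(key) & mask;
        while (positions[slot] != -1) slot = (slot + 1) & mask;  // linear probing
        keys[slot] = key;
        positions[slot] = buildPosition;
    }

    // Probes a whole chunk of keys at once; the output arrays must be at least
    // probeKeys.length long. Returns the number of matches found.
    int probeChunk(long[] probeKeys, int[] matchedProbeRows, int[] matchedBuildPositions) {
        int matches = 0;
        for (int row = 0; row < probeKeys.length; row++) {
            int slot = hash(probeKeys[row]) & mask;
            while (positions[slot] != -1) {
                if (keys[slot] == probeKeys[row]) {      // primitive comparison, no boxing
                    matchedProbeRows[matches] = row;
                    matchedBuildPositions[matches] = positions[slot];
                    matches++;
                    break;
                }
                slot = (slot + 1) & mask;
            }
        }
        return matches;
    }

    private static int hash(long key) {
        long h = key * 0x9E3779B97F4A7C15L;  // simple multiplicative mixing
        return (int) (h ^ (h >>> 32));
    }
}
```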

For the GroupJoin operator, we optimize its performance by collecting matching position information and constructing rows in batches when outputting Join results. The materialization of the intermediate results of the join-agg step is avoided, and the repeated construction of the hash table and the memory usage are reduced. In general, the new operator implementation focuses on:

  1. Reduce data transmission between the storage layer and the computing layer.
  2. Use data type characteristics and vectorization algorithms to write type-specific vectorization implementations for some key logic.
  3. Introduce more efficient data structures and algorithm variants to facilitate optimal selection in different contexts.
  4. Optimize the execution path of aggregation functions, especially for batch processing and vectorized execution.
  5. Implement join matching and result set construction that takes full advantage of batch processing and hardware features.
  6. Provide special optimization for specific execution modes (such as group join combined with left outer join and group by).

Based on the preceding description of the PolarDB-X column-oriented storage query executor, we take the Q19 query in the TPC-H 1TB scenario as an example. After Q19 is parsed and optimized, an SQL plan tree is obtained. The plan is then split into fragments and pipelines, and the scheduling strategy produces multiple groups of Drivers and Tasks with the corresponding degree of parallelism, which are placed in the pending queue to wait for a time slice to execute.

Under the time slice scheduling mechanism, the status of all tasks switches between pending, running, and blocked until all tasks are executed. Different nodes exchange data through shuffle and obtain the calculation result of the previous round of fragment as the input for the next round of calculation.

[Figure 2: Fragment, pipeline, and task scheduling for TPC-H 1TB Q19 under the time-slice mechanism]

Column-oriented Storage Scan Execution Mechanism

Column-oriented Storage Scan Operator

PolarDB-X uses ORC as the file storage format. In the Scan operator of the executor, files are read from OSS and parsed, so the design of the Scan operator and the ORC parsing process are particularly important to overall query performance. The column-oriented store is built mainly on the ORC file format, and the data structures along the path from the ORC file to the executor's computing layer can be divided into three levels: the physical storage structure, the logical storage structure, and the compute structure. Note that:

• Physical storage is the file structure, a pure sequence of bytes encapsulated by the file's inputStream. At this level, the read process is unaware of the meaning of each byte region in the file. We designed a cacheable file system on top of OSS to manage this data.

• Logical storage corresponds to the seven-level hierarchy of the ORC file: File - Stripe - Column - Stream - RowGroup - RL (run-length) Compression - General Compression. For this part, the PolarDB-X query engine introduces a new ORC parsing mechanism. Based on the analysis and design of this data structure hierarchy, a new ORC read path is built that precisely controls the reading of each column and each row group, bypassing the official Apache ORC Reader.

• Compute structure is used to manage runtime data structures that are directly available to the executor.

[Figure 3: Physical storage, logical storage, and compute structure levels of the column-oriented Scan]

In the latest PolarDB-X column-oriented storage access optimization, several improvements have been made to improve efficiency.

First, the I/O execution mode has been significantly improved. Reducing the per-file I/O overhead improves I/O performance and pushes throughput closer to the bandwidth limit. The improved mode enables more efficient I/O task processing: I/O is first executed on the filter columns, the filtering computation is performed, I/O is then executed on the remaining project columns, and adjacent I/O fragments are merged, which greatly reduces I/O scheduling overhead.

In addition, filtering is optimized across all scenarios in combination with the deletion bitmap of the column-oriented store, significantly improving the performance of vectorized filtering. The column-oriented storage Scan also greatly reduces data parsing overhead: the scan only executes I/O without parsing data immediately, delaying parsing until the block is first accessed and thereby reducing the parsing cost inside the filter and join operators.

All block implementations have been replaced by the more efficient LazyBlock. The handling of generated columns is also improved: one memory copy is removed, the decompressed byte data is converted into blocks directly, and all data types are optimized. The main optimizations include contiguous memory copies and dictionary optimization to speed up parsing.
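The idea behind LazyBlock can be sketched as follows (illustrative code, not the actual PolarDB-X class): the block holds only a decoder over the raw bytes and materializes a typed vector the first time it is read:

```java
import java.util.function.Supplier;

// Minimal sketch of the deferred-parsing idea behind a LazyBlock: the Scan attaches
// only a decoder, and the raw ORC bytes are parsed into a typed vector the first
// time the block is actually accessed.
public class LazyLongBlock {

    private final Supplier<long[]> decoder;  // stands in for decompress + decode of the ORC stream
    private long[] values;                   // materialized vector, null until first access

    LazyLongBlock(Supplier<long[]> decoder) { this.decoder = decoder; }

    long get(int position) {
        if (values == null) {
            values = decoder.get();          // parsing happens here, only if someone reads the block
        }
        return values[position];
    }

    boolean isMaterialized() { return values != null; }
}
```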

The column-oriented storage Scan module adds a further filtering layer that enables deeper data filtering than before, extending from the row-group level to the block level and thereby reducing unnecessary data read overhead. The caching feature is also enhanced: caches are now maintained at the block level within columns, so data can be targeted flexibly by arbitrary predicates. This contrasts with the past, where entire data blocks could only be cached in whole-file units with no retrieval capability. The new approach greatly reduces data redundancy and memory usage.

On the TPC-H 100G test set, we ran a comparative test of the scan queries select count(*) from lineitem and select * from lineitem where l_comment like 'sleep%express%packages%final' with the degree of parallelism set to 16 and 64, respectively, to evaluate the performance difference between the old and the new scan.

The test examined three data caching methods: direct access to OSS (Object Storage Service), local disk, and block cache. The analysis of the execution results shows that the scan method of the column-oriented storage query executor avoids parsing the primary key column data by adopting techniques such as late materialization, asynchronous I/O execution, and column parsing optimization, thereby achieving higher performance. The chart below is a scatter plot of ten executions in five different execution modes, together with statistics including the median, the upper and lower quartiles, and the maximum and minimum values.

[Figure 4: Scatter plot of scan query performance under different caching methods and execution modes (TPC-H 100G)]

Late Materialization Mechanism

Late Materialization is a technology used in modern HTAP database systems to improve the access performance of column-oriented storage. Its basic idea is to optimize the loading and processing of data during query execution to reduce the amount of data processed and unnecessary computations and I/O operations, thereby improving query efficiency. Late materialization defers the actual materialization of data to as late a phase as possible through finer-grained data selection and refined cache hit strategies.

Without the late materialization mechanism, traditional scan operations based on formats such as Apache ORC crop data in a coarse-grained manner to select a batch of (column, rowGroup) units to be processed, then perform batch I/O operations on these units and load them into the memory, and finally, decompress the original byte data and convert it into a column-oriented structure for subsequent computations. In contrast, the PolarDB-X process for late materialization consists of the following key steps:

  1. More Precise Cropping: The columnar index cropping module performs a more refined data selection based on the specific query conditions, locating a smaller matrix of (column, rowGroup) units to be processed.
  2. Cache Hit and I/O Execution: For filtered columns, the system first checks whether some data already exists in the BlockCacheManager and can be directly obtained. Only the part that does not hit the cache performs I/O operations to load the data for further data parsing, and at the same time performs push-down filtering to obtain a set of data locations that meet the conditions. This step will further narrow the scope of the data that needs to be processed.
  3. Cache Check and I/O Loading: For projection columns, the system checks whether the corresponding data unit already exists in the cache based on the valid location set obtained from the preceding filtering operation. For uncached data units, necessary I/O operations are performed to load the data.
  4. On-demand Data Materialization: The original byte data is parsed into final data blocks only when the data needs to be accessed. This means that most data may avoid unnecessary materialization for various reasons (such as failing filter conditions, cache hits, and join mismatches).

Late materialization significantly reduces the overhead of single-row processing by reducing the number of rows processed and intermediate data format conversions. Especially in join queries (such as R inner join S), late materialization can significantly improve the efficiency of data processing. Materialization and projection are performed only when data is actually required. Most of the data does not go through a complete loading and processing process after preliminary filtering. The refined data processing and intelligent cache utilization policies effectively improve the query efficiency of column-oriented storage.
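A simplified sketch of this flow, with hypothetical loader and cache types, might look like the following: the filter column is read (from the block cache when possible) and evaluated first, and project columns are loaded and materialized only for the positions that survive:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.LongPredicate;

// Illustrative sketch of the late-materialization flow (hypothetical names): the
// filter column is loaded (cache first) and evaluated, and project columns are
// loaded and decoded only for the row groups and positions that survive the filter.
public class LateMaterializationSketch {

    record RowGroupId(String column, int rowGroup) {}

    private final Map<RowGroupId, long[]> blockCache = new HashMap<>();

    interface ColumnLoader { long[] load(int rowGroup); }  // stands in for OSS I/O + decode

    List<long[]> scan(int rowGroupCount, String filterColumn, ColumnLoader filterLoader,
                      String projectColumn, ColumnLoader projectLoader, LongPredicate predicate) {
        List<long[]> output = new ArrayList<>();
        for (int rg = 0; rg < rowGroupCount; rg++) {
            // 1. Load the filter column, hitting the block cache first.
            long[] filterValues = blockCache.computeIfAbsent(
                    new RowGroupId(filterColumn, rg), k -> filterLoader.load(k.rowGroup()));
            // 2. Evaluate the predicate to get a selection of surviving positions.
            List<Integer> selected = new ArrayList<>();
            for (int pos = 0; pos < filterValues.length; pos++) {
                if (predicate.test(filterValues[pos])) selected.add(pos);
            }
            if (selected.isEmpty()) continue;  // whole row group skipped: no project I/O at all
            // 3. Only now load the project column for this row group (cache first).
            long[] projectValues = blockCache.computeIfAbsent(
                    new RowGroupId(projectColumn, rg), k -> projectLoader.load(k.rowGroup()));
            // 4. Materialize only the selected positions.
            long[] materialized = new long[selected.size()];
            for (int i = 0; i < selected.size(); i++) materialized[i] = projectValues[selected.get(i)];
            output.add(materialized);
        }
        return output;
    }
}
```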

[Figures 5 and 6: Late materialization execution flow]

Columnar Index Cropping

The PolarDB-X column-oriented storage query executor designs an independent columnar index cropping module, which supports indexes such as ZoneMap, SortKey, BitMap, and dictionary to improve the query efficiency of column-oriented data.

Sort Key Index:

Sort Key Index uses pre-calculated sort keys to quickly locate and query data.

When creating a columnar index, you can select one or more columns as the sort key. The columns selected as the sort key are stored in order in each ORC file. Such sorting attributes greatly accelerate query operations, especially range queries.

Supported expressions: EQ, NE, GT, GE, LT, LE, IN, NOT_IN, AND, OR
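Because the sort-key column is ordered within each file, range predicates can be answered by binary search over per-row-group statistics rather than by scanning every row group. A minimal sketch of this idea, assuming long sort keys and per-row-group minimum values:

```java
// Illustrative sketch: the sort-key column is stored in order inside each ORC file,
// so a predicate such as `key <= bound` can be answered with a binary search over
// per-row-group minimum values instead of scanning every row group.
public class SortKeyPruning {

    // rowGroupMin[i] is the smallest sort-key value in row group i (groups are ordered).
    static int lastRowGroupToScan(long[] rowGroupMin, long upperBound) {
        int lo = 0, hi = rowGroupMin.length - 1, answer = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (rowGroupMin[mid] <= upperBound) { answer = mid; lo = mid + 1; }
            else hi = mid - 1;
        }
        return answer;  // row groups after this index cannot satisfy `key <= upperBound`
    }

    public static void main(String[] args) {
        long[] mins = {0, 100, 200, 300, 400};
        System.out.println(lastRowGroupToScan(mins, 250));  // prints 2
    }
}
```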

ZoneMap Index:

ZoneMap Index stores statistics such as the maximum value, minimum value, and row count for each row group in ORC. With these statistics, the ZoneMap index can quickly determine at query time which data blocks may contain the data required by the query conditions, eliminating unnecessary data blocks before the search operation.

Supported expressions: EQ, NE, GT, GE, LT, LE, IN, NOT_IN, AND, OR
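The pruning check itself is simple; below is a minimal sketch with assumed long-typed min/max statistics per row group:

```java
// Illustrative sketch of ZoneMap pruning: per-row-group min/max statistics decide
// whether a row group can possibly contain rows matching a range predicate; row
// groups whose zone does not overlap the predicate are skipped before any I/O.
public class ZoneMapPruning {

    record Zone(long min, long max) {}

    // Returns true if the row group may contain values in [lower, upper] and must be read.
    static boolean mayMatch(Zone zone, long lower, long upper) {
        return zone.max() >= lower && zone.min() <= upper;
    }

    public static void main(String[] args) {
        Zone[] zones = { new Zone(1, 50), new Zone(60, 90), new Zone(85, 200) };
        long lower = 80, upper = 100;   // e.g. a BETWEEN 80 AND 100 predicate
        for (int rg = 0; rg < zones.length; rg++) {
            System.out.println("row group " + rg + (mayMatch(zones[rg], lower, upper)
                    ? " must be scanned" : " pruned"));
        }
    }
}
```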

BitMap Index:

BitMap Index is suitable for retrieval in low-cardinality scenarios. For each column with a BitMap index, a bitmap is created for each distinct value in the column; inside the bitmap, each bit maps to a row in the table. If a row's column value equals the value the bitmap represents, the corresponding bit is set to 1; otherwise, it is set to 0.

Supported expressions: EQ, NE, GT, GE, LT, LE, IN, NOT_IN, AND, OR
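A minimal sketch of the bitmap-index idea, using java.util.BitSet and assuming an in-memory string column:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of a bitmap index for a low-cardinality column: one BitSet per
// distinct value, with bit i set when row i holds that value. Equality and IN
// predicates become cheap bitwise operations over the bitmaps.
public class BitmapIndexSketch {

    private final Map<String, BitSet> bitmaps = new HashMap<>();
    private final int rowCount;

    BitmapIndexSketch(String[] columnValues) {
        rowCount = columnValues.length;
        for (int row = 0; row < columnValues.length; row++) {
            bitmaps.computeIfAbsent(columnValues[row], v -> new BitSet(rowCount)).set(row);
        }
    }

    // Rows matching `column = value`.
    BitSet eq(String value) {
        BitSet bits = bitmaps.get(value);
        return bits == null ? new BitSet(rowCount) : (BitSet) bits.clone();
    }

    // Rows matching `column IN (values...)`: OR the per-value bitmaps together.
    BitSet in(String... values) {
        BitSet result = new BitSet(rowCount);
        for (String v : values) result.or(eq(v));
        return result;
    }

    public static void main(String[] args) {
        BitmapIndexSketch index = new BitmapIndexSketch(
                new String[]{"BUILDING", "MACHINERY", "BUILDING", "AUTOMOBILE"});
        System.out.println(index.in("BUILDING", "AUTOMOBILE"));  // {0, 2, 3}
    }
}
```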

In the TPC-H 100G test set, we select several queries to show the effect of index cropping:

| Query | Condition | Row Filterability (selected/total rows) | Index | Row-group Filterability (selected/total row groups) |
| --- | --- | --- | --- | --- |
| Q1 | l_shipdate <= date '1998-09-01' | 591,411,581/600,037,902 | Sort Key | 59141/60003 |
| Q2 | p_size=15 | 399,557/20,000,000 | BitMapIndex | 1034/2000 |
| Q3 | o_orderdate < '1995-03-15' | 72,874,383/150,000,000 | Sort Key | 7288/15000 |
| Q3 | l_shipdate > '1995-03-15' | 323,494,519/600,037,902 | Sort Key | 32349/60003 |
| Q3 | c_mktsegment='BUILDING' | 3,000,188/15,000,000 | BitMapIndex | 1351/1500 |
| Q4 | o_orderdate >= '1993-07-01' and o_orderdate < '1993-10-01' | 5,733,776/150,000,000 | Sort Key | 575/15000 |
| Q5 | o_orderdate >= '1994-01-01' and o_orderdate < date '1995-01-01' | 22,760,819/150,000,000 | Sort Key | 230/15000 |
| Q6 | l_shipdate < date '1995-01-01' and l_shipdate >= '1994-01-01' | 91,030,658/600,037,902 | Sort Key | 9103/60003 |
| Q6 | l_discount between 0.05 and 0.07 | 163,640,806/600,037,902 | ZoneMap, BitMapIndex | 60003/60003 |
| Q6 | l_quantity < 24 | 276,029,630/600,037,902 | ZoneMap, BitMapIndex | 60003/60003 |

Runtime Filter

The PolarDB-X column-oriented storage query executor designs a new runtime filter mechanism. Based on constructing data digests, this mechanism filters unnecessary data as early as possible and avoids unnecessary computations and I/O operations to improve query efficiency.

The data digest (filter) is generated on the build side of the HashJoin operator and applied on the probe side. This reduces the amount of data that the scan must process, the volume of data exchange, and the computing overhead of each operator in the pipeline, speeding up HashJoin processing. However, these gains must outweigh the overhead of constructing, passing, and evaluating the runtime filter.

Design features and strategies for the runtime filter include:

  1. Scheduling Management: The new runtime filter mechanism avoids reliance on optimizers and complex plan changes. Instead, it dynamically identifies and determines the construction location and filtering strategy of the runtime filter during the executor scheduling phase. PolarDB-X uses an adaptive switch mechanism based on the runtime filter performance gain model to determine whether to enable the runtime filter.
  2. Internal Computing: A Blocked Bloom Filter algorithm replaces the traditional Bloom Filter implementation to improve memory access performance.
  3. Parallel Construction: A new memory layout is adopted based on the actual number of HashJoin rows and the CPU cache size to make the construction of the runtime filter more cache-friendly and adapt to various Join scenarios (such as broadcast join and partition-wise join).
  4. Transmission Mechanism: The filter is transmitted through lightweight inter-thread memory communication to avoid the high overhead caused by the server-client mode.
  5. Filtering Efficiency: The runtime filter is pushed down to the storage layer (such as scan) and integrated with the compressed cache mode of column-oriented storage to reduce the downstream computing pressure by filtering redundant rows. For scenarios with poor filtering effects, a batch sampling mechanism is implemented to evaluate the current filtering effect. The runtime filter execution of the current batch is stopped when the filtering effect is not good.
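The blocked Bloom filter mentioned in item 2 above can be sketched as follows (illustrative parameters only: 512-bit blocks and two probe bits per key; the production filter chooses its layout based on the actual row count and CPU cache size):

```java
// Illustrative sketch of a blocked Bloom filter: each key maps to a single 64-byte
// block (eight longs, roughly one cache line), and all of its probe bits are set
// inside that block, so insert/contains touch one cache line instead of scattered ones.
public class BlockedBloomFilter {

    private static final int LONGS_PER_BLOCK = 8;   // 8 * 64 bits = 512 bits per block
    private final long[] bits;
    private final int blockCount;

    BlockedBloomFilter(int blockCount) {
        this.blockCount = blockCount;
        this.bits = new long[blockCount * LONGS_PER_BLOCK];
    }

    void put(long key) {
        long h = mix(key);
        int block = (int) ((h >>> 32) % blockCount) * LONGS_PER_BLOCK;
        int bit1 = (int) (h & 0x1FF), bit2 = (int) ((h >>> 9) & 0x1FF); // two in-block probes
        bits[block + (bit1 >>> 6)] |= 1L << (bit1 & 63);
        bits[block + (bit2 >>> 6)] |= 1L << (bit2 & 63);
    }

    boolean mightContain(long key) {
        long h = mix(key);
        int block = (int) ((h >>> 32) % blockCount) * LONGS_PER_BLOCK;
        int bit1 = (int) (h & 0x1FF), bit2 = (int) ((h >>> 9) & 0x1FF);
        return (bits[block + (bit1 >>> 6)] & (1L << (bit1 & 63))) != 0
            && (bits[block + (bit2 >>> 6)] & (1L << (bit2 & 63))) != 0;
    }

    private static long mix(long key) {
        long h = key * 0x9E3779B97F4A7C15L;
        h ^= h >>> 29; h *= 0xBF58476D1CE4E5B9L; h ^= h >>> 32;
        return h;
    }

    public static void main(String[] args) {
        BlockedBloomFilter filter = new BlockedBloomFilter(1024);
        for (long buildKey : new long[]{42, 7, 1001}) filter.put(buildKey);  // build side
        System.out.println(filter.mightContain(42));   // true
        System.out.println(filter.mightContain(999));  // false with high probability
    }
}
```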

Enabling the new runtime filter achieves significant performance improvements in the TPC-H 1TB test. However, the performance improvement of the runtime filter is not always proportional to its filtering rate, because the data compression in column-oriented storage may lead to read amplification.

[Figure 7]

In the TPC-H 1TB scenario, the following figure shows a box plot built from eight execution runs, comparing the performance of SQL statements before and after the new runtime filter is enabled:

[Figure 8: Box plot of query performance before and after enabling the new runtime filter (TPC-H 1TB)]

The following figure shows the optimization efficiency of related SQL statements before and after enabling the new runtime filter:

[Figure 9: Optimization efficiency of related SQL statements with the new runtime filter]

Scheduling Optimization

Partition-wise Policy

PolarDB-X introduces a partition-wise join scheduling policy to improve the execution efficiency of JOIN and Agg operators in cross-node parallel scenarios. Its basic idea is to use the existing partition information in the storage layer to reduce the amount of data transmission and the overhead of local shuffle.

Partition-wise join scheduling optimization reduces the amount of data in exchange/shuffle and replans inter-node and intra-node scheduling policies, making full use of partition information throughout the multi-layer scheduling process. An SQL plan is divided into multiple fragments that are executed on different nodes. During fragment scheduling, partition alignment is taken into account, and aligned data shards are allocated to the same node as much as possible, reducing the extra overhead introduced by cross-node data transmission.

On this basis, partition-wise join optimization takes pipeline splitting and local data shuffle into account, as well as the adaptation of the join operator and its build side, to preserve the data partitioning characteristics of the probe and build sides and thus ensure query efficiency and correctness. In parallel execution, the degree of parallelism does not depend on the number of partitions, so the system adjusts automatically and executes efficiently even when the degree of parallelism of the join operator is greater than the number of partitions.

The new design significantly improves performance and provides a solution to the misalignment issue between the storage layer and the computing layer. By quickly collecting a small amount of information during scheduling between nodes, the data pulling and transmission overheads of operators are greatly optimized.
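A minimal sketch of the alignment idea, with hypothetical names: both join sides share one partition-to-node placement, so aligned partitions always land on the same node and the join runs without cross-node shuffle of base data:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of partition-wise fragment scheduling: when both join sides are
// partitioned by the join key in the same way, partition i of the left table and
// partition i of the right table are scheduled onto the same node.
public class PartitionWiseScheduler {

    static Map<Integer, String> assign(int partitionCount, List<String> nodes) {
        Map<Integer, String> placement = new TreeMap<>();
        for (int p = 0; p < partitionCount; p++) {
            // Both sides of the join use this same placement, keeping partitions aligned.
            placement.put(p, nodes.get(p % nodes.size()));
        }
        return placement;
    }

    public static void main(String[] args) {
        Map<Integer, String> placement = assign(8, List.of("node-0", "node-1", "node-2"));
        placement.forEach((p, node) ->
                System.out.println("partition " + p + " of both join sides -> " + node));
    }
}
```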

Consistent Route Scheduling for Storage and Computing

In PolarDB-X row-oriented storage mode, the storage layer and the SQL plan use different hash algorithms, so a large amount of data movement during data partitioning and shuffle cannot be avoided, which greatly affects performance. Ideally, we want to minimize the shuffle between data nodes; for example, when executing join queries, we want to schedule aligned data shards to the same machine as much as possible.

This can reduce the amount of data transmission while ensuring the accuracy of the results. However, when the storage layer and the shuffle layer are not aligned, the data in the storage layer must be rearranged to match the data partitioning effect of the shuffle layer, which greatly affects the query performance. To solve this problem, we propose a routing algorithm that unifies the computing layer and the storage layer to reduce data transmission and improve query efficiency.

This means that the routing algorithms used for shuffle, for the optimizer's shard computation, and in the storage layer must be unified. PolarDB-X chooses the higher-performance XxHash as the unified shuffle partitioning algorithm and adjusts the storage layer and the optimizer to align with it; for each type of partition key, an optimal xxHash interface is re-implemented to support this capability.
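The essence can be sketched as follows; the mixer below is a simple stand-in for illustration only, whereas PolarDB-X uses xxHash as the unified algorithm:

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch of consistent routing: the storage layer, the optimizer's shard
// calculation, and the executor's shuffle all call the same hash-to-partition function,
// so rows read from a columnar shard already sit in the shuffle partition the plan
// expects and no re-shuffle is needed.
public class ConsistentRouting {

    static int partitionOf(byte[] partitionKey, int partitionCount) {
        return (int) Math.floorMod(hash64(partitionKey), (long) partitionCount);
    }

    // Placeholder 64-bit hash standing in for the unified xxHash implementation.
    private static long hash64(byte[] data) {
        long h = 0x9E3779B97F4A7C15L;
        for (byte b : data) {
            h ^= b;
            h *= 0x100000001B3L;
            h ^= h >>> 31;
        }
        return h;
    }

    public static void main(String[] args) {
        byte[] key = "customer-42".getBytes(StandardCharsets.UTF_8);
        int partitions = 16;
        int storageShard = partitionOf(key, partitions);      // used when writing columnar data
        int shufflePartition = partitionOf(key, partitions);  // used by exchange at query time
        System.out.println(storageShard == shufflePartition); // always true: no re-shuffle
    }
}
```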

Data Type Optimization

In PolarDB-X, a data representation method called Decimal64 is introduced to improve the processing efficiency of the Decimal data type in medium and low precision scenarios. Decimal64 uses an integer value of the long type plus a scale representing the number of decimal places to represent Decimal data. For example, a numeric value of 12.34 can be represented as a long value of 1,234 and a scale value of 2, with a maximum supported precision (the maximum number of total digits to be stored) of 18 digits. The relevant changes to Decimal64 in PolarDB-X include:

  1. A computing-time vector structure based on decimal64 is added, which is compatible with the decimal representation in MySQL.
  2. All vectorized expressions involving the Decimal type are rewritten using the Decimal64 feature in medium and low precision scenarios.
  3. The aggregation computing process uses decimal64 to rewrite the accumulation and min/max aggregation functions for medium and low precision scenarios.
  4. In the process of operator execution, for the overflow of addition, subtraction, and multiplication operations, Decimal64 is automatically rolled back to normal Decimal.
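A minimal sketch of the Decimal64 representation and its overflow fallback follows (illustrative only; the real implementation operates on whole vectors rather than single values):

```java
import java.math.BigDecimal;

// Illustrative sketch of the Decimal64 idea: a decimal is kept as a long unscaled value
// plus a scale (12.34 -> unscaled 1234, scale 2). Addition of same-scale values is a
// plain 64-bit add; on overflow the computation falls back to a full-precision decimal.
public class Decimal64Sketch {

    final long unscaled;
    final int scale;

    Decimal64Sketch(long unscaled, int scale) { this.unscaled = unscaled; this.scale = scale; }

    // Adds two Decimal64 values with the same scale; returns a BigDecimal so the caller
    // keeps the result even when the long representation overflows.
    static BigDecimal add(Decimal64Sketch a, Decimal64Sketch b) {
        if (a.scale != b.scale) throw new IllegalArgumentException("rescale first");
        try {
            long sum = Math.addExact(a.unscaled, b.unscaled);  // fast path: 64-bit add
            return BigDecimal.valueOf(sum, a.scale);
        } catch (ArithmeticException overflow) {
            // Fallback: roll back to the normal (full-precision) decimal representation.
            return BigDecimal.valueOf(a.unscaled, a.scale)
                    .add(BigDecimal.valueOf(b.unscaled, b.scale));
        }
    }

    public static void main(String[] args) {
        Decimal64Sketch x = new Decimal64Sketch(1234, 2);  // 12.34
        Decimal64Sketch y = new Decimal64Sketch(566, 2);   // 5.66
        System.out.println(add(x, y));                     // 18.00
    }
}
```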

When introducing Decimal64, the PolarDB-X team kept compatibility with the ORC file format: Decimal64 support is based on custom variables stored as metadata at the end of the ORC file rather than on the ORC file version number; a long integer plus a scale is used to encode Decimal64 values, and the columnar index cropping logic is adapted accordingly. Decimal64 provides a more efficient way to process numerical data, especially in complex query scenarios that involve a large number of low-precision Decimal computations, while maintaining backward compatibility with existing systems.

Summary

In this article, we introduced the basic form of the PolarDB-X column-oriented storage query engine and the many new features built on it. They embody both decades of theory and best practice developed in the database field and the continuous experimentation and innovation of the PolarDB-X team while iterating the product. We believe the column-oriented storage query engine makes up for the weakness of analytical queries in row-oriented storage scenarios and provides customers with a better user experience.

In storage-computing separation and distributed scenarios, some other features, such as caching and version management, also play a crucial role in column-oriented storage queries. We will cover them in detail in the next issue.
