By Junqi
In an era of rapid data growth, efficiently querying massive datasets has become a significant challenge for enterprises and technical practitioners. Although columnar storage formats such as ORC have clear advantages in specific scenarios, query speed still hits bottlenecks on large-scale datasets. This article delves into the cache hierarchy of the PolarDB-X columnar query engine and its key role in improving the performance of ORC columnar queries. We analyze its design principles, implementation details, and applicable scenarios to demonstrate its broad applicability, efficiency, and reliability in big data queries. The article also covers the storage structure of ORC files, data compression and decompression techniques, and the backpressure strategy for managing the executor's intermediate results, illustrating how the cache hierarchy and backpressure mechanisms further enhance query performance. Through this content, readers will gain an understanding of the practical effects and technical advantages of the PolarDB-X cache hierarchy in improving the efficiency of columnar queries.
In PolarDB-X columnar query engine, multi-level cache management is a core strategy for improving query performance. By setting caches at different levels, the system can achieve efficient data loading and query response according to data access frequency and characteristics. This cache hierarchy architecture optimizes the data reading path and effectively reduces the dependence on the underlying storage system, thus significantly improving the query speed.
The ORC (Optimized Row Columnar) format is a columnar storage format that is widely used in big data processing. Its internal structure is ingeniously designed, and through a multi-level organization, it achieves efficient data compression and fast column-level queries.
• Stripe: By default, an ORC file forms a stripe every 64 MB. Multiple columns are stored in a stripe.
• Column: In a stripe, each column is stored in a contiguous file area that contains multiple RowGroups.
• RowGroup: In a column, every 10,000 rows of data are divided into a RowGroup. The RowGroup is the basic unit for ORC to perform compression, decompression, encoding, and decoding. If the total number of rows stored inside the column is not an integer multiple of 10,000, the last RowGroup may contain fewer than 10,000 rows.
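Because the RowGroup stride is fixed, locating the RowGroup that holds a given row inside a stripe's column is simple arithmetic. The following is a minimal sketch of that calculation, assuming the default 10,000-row stride described above; the class and method names are illustrative, not part of PolarDB-X.

// Minimal sketch: locate the RowGroup that holds a given row inside one stripe's
// column, assuming the default RowGroup stride of 10,000 rows.
public class RowGroupLocator {
    static final int ROW_GROUP_STRIDE = 10_000; // default rows per RowGroup

    // rowInStripe is the 0-based row offset within the stripe's column
    static int rowGroupId(long rowInStripe) {
        return (int) (rowInStripe / ROW_GROUP_STRIDE);
    }

    static int offsetInRowGroup(long rowInStripe) {
        return (int) (rowInStripe % ROW_GROUP_STRIDE);
    }

    public static void main(String[] args) {
        long row = 25_431;
        // row 25,431 falls into RowGroup 2, at offset 5,431 inside it
        System.out.println(rowGroupId(row) + ", " + offsetInRowGroup(row));
    }
}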
In addition, two types of structures, the stripe footer and index data, store location information, making it easy to quickly locate data blocks.
In an ORC file, the SMA (Statistics Minimum and Maximum) index stores the minimum and maximum statistics for each stripe and RowGroup. This information can quickly filter out irrelevant data blocks during queries, reducing unnecessary data reading and improving query efficiency.
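To make the pruning idea concrete, the sketch below shows how min/max statistics can rule out a RowGroup for a range predicate such as "col BETWEEN lo AND hi". It is a simplified illustration, not the actual PolarDB-X code; the MinMax type and method names are assumptions.

// Minimal sketch: use per-RowGroup min/max statistics to decide whether a
// RowGroup can be skipped for a predicate like "col BETWEEN lo AND hi".
public class SmaPruning {
    static final class MinMax {
        final long min;
        final long max;
        MinMax(long min, long max) { this.min = min; this.max = max; }
    }

    // The RowGroup can be skipped when its [min, max] range does not
    // intersect the predicate range [lo, hi].
    static boolean canSkip(MinMax stats, long lo, long hi) {
        return stats.max < lo || stats.min > hi;
    }

    public static void main(String[] args) {
        MinMax rg0 = new MinMax(1, 900);      // RowGroup 0: values 1..900
        MinMax rg1 = new MinMax(901, 2_000);  // RowGroup 1: values 901..2000
        long lo = 1_500, hi = 1_800;          // predicate: col BETWEEN 1500 AND 1800
        System.out.println(canSkip(rg0, lo, hi)); // true  -> RowGroup 0 is filtered out
        System.out.println(canSkip(rg1, lo, hi)); // false -> RowGroup 1 must be read
    }
}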
To accurately locate a data block (such as a 1,000-row data unit) in an ORC file, a unique logical address is required, which contains the following 5 fields:
Through these fields, the system can accurately locate the specific data location in the ORC file, enabling efficient data reading.
In the columnar query process, the executor is the core component of the data query operation. It formulates an execution plan tree based on SQL statements, operators (such as scans and joins), and predicate conditions, and executes corresponding query operations.
The execution plan tree generated by the executor contains a Scan operator, which defines how to perform scan operations on ORC files, that is, how to read data from specific columns. These operations are usually column-centric and are performed directly on ORC-format files, which greatly improves query efficiency.
During the execution of query operations, the Scan operator is equipped with predicate conditions for data filtering. The SMA index in ORC format allows the executor to efficiently determine which stripe or RowGroup contains eligible data without reading all the data. By combining predicate conditions with SMA indexes, the system can reduce the workload of data loading and scanning, further improving query performance.
ORC files typically use multiple compression techniques, such as LZ4 general compression and RunLength numeric compression, to reduce storage space and improve I/O efficiency. The executor will automatically perform the decompression operation when reading data. The compression ratio ranges from 2 to 4 times, which saves storage space and improves the efficiency of data reading.
The PolarDB-X columnar query engine uses a multi-level caching mechanism, including level-1, level-2, and level-3 caches, as well as the underlying OSS storage base. Each level of cache has its unique design and function. Through hierarchical lookup and caching, the system can efficiently access and manage massive data.
The overall process can be summarized as follows: the system obtains the logical address {StripeId, ColumnId, RowGroupId} through predicate conditions and index search, and obtains the specific physical location {filePath, offset, len} through the meta and footer information of the ORC file, and then searches for and reads the data in the caches at each level.
The level-1 cache is designed to respond quickly to data requests that are accessed at high frequency. When a query arrives, the key of the level-1 cache, {StripeId, ColumnId, RowGroupId}, is derived from the predicate conditions, and the corresponding block list is looked up in the level-1 cache. If a hit occurs, these blocks are used directly for computation. Since the blocks are sequentially decompressed from RowGroups, and each RowGroup contains 10,000 rows of data, one block represents 1/10 of a RowGroup, that is, 1,000 rows.
To control the space usage of the level-1 cache, when the cache space exceeds a preset threshold (usually 20% of the runtime memory), the LFU (Least Frequently Used) cache replacement policy is triggered to eliminate the least accessed cache items.
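As a rough sketch of the idea, a weight-bounded in-memory block cache with frequency-based eviction can be expressed with the Caffeine library, whose TinyLFU policy is also frequency-based; this stands in for the real PolarDB-X implementation, and the BlockCacheKey and Block types are hypothetical.

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

import java.util.List;

// Rough sketch only: a weight-bounded in-memory block cache. The threshold
// (e.g. roughly 20% of runtime memory) is passed in as maxBytes; eviction is
// handled by the library's frequency-based policy.
public class Level1CacheSketch {
    record BlockCacheKey(int stripeId, int columnId, int rowGroupId) {}
    record Block(long[] values) {}

    private final Cache<BlockCacheKey, List<Block>> cache;

    Level1CacheSketch(long maxBytes) {
        this.cache = Caffeine.newBuilder()
                .maximumWeight(maxBytes)                       // e.g. 20% of runtime memory
                .weigher((BlockCacheKey k, List<Block> v) ->
                        v.stream().mapToInt(b -> b.values().length * Long.BYTES).sum())
                .build();
    }

    List<Block> get(BlockCacheKey key) {
        return cache.getIfPresent(key);                        // null means a level-1 miss
    }

    void put(BlockCacheKey key, List<Block> blocks) {
        cache.put(key, blocks);                                // eviction happens automatically
    }
}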
When the level-1 cache misses, the system derives the address {filePath, offset, len} from {StripeId, ColumnId, RowGroupId} and uses it as the key of the level-2 cache, then searches for the corresponding compressed byte array in the level-2 cache. If a hit occurs, the data is decompressed and the decompressed blocks are written into the level-1 cache. Similarly, when the space usage of the level-2 cache reaches its threshold, the LFU mechanism evicts the least-used data.
The level-3 cache is located at the disk layer and mainly stores compressed file fragments obtained from the OSS storage base. The level-3 cache stores the byte array that was previously in the level-2 cache as a file. When adjacent segments are read, they are merged, and the old files are deleted asynchronously. Since write operations on files are time-consuming, these operations are completed by asynchronous threads.
If all cache levels miss, the system maps {StripeId, ColumnId, RowGroupId} to a specific file path, locates the corresponding ORC file in the OSS storage, and downloads only the required file fragments for decompression and reading. Although OSS access incurs network latency, its nearly unlimited storage space provides a powerful backing store for the system.
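The cascade described above can be summarized in code. The following is a simplified sketch of the hierarchical lookup; all interfaces, type names, and method names are illustrative assumptions rather than the real PolarDB-X API.

import java.util.List;

// Simplified sketch of the hierarchical lookup: level 1 holds decompressed
// blocks keyed by {StripeId, ColumnId, RowGroupId}; level 2 holds compressed
// byte arrays keyed by {filePath, offset, len}; level 3 holds file fragments
// on local disk; OSS is the final fallback.
public class HierarchicalLookupSketch {
    record LogicalKey(int stripeId, int columnId, int rowGroupId) {}
    record PhysicalKey(String filePath, long offset, int len) {}

    interface Level1Cache { List<long[]> get(LogicalKey k); void put(LogicalKey k, List<long[]> blocks); }
    interface Level2Cache { byte[] get(PhysicalKey k); void put(PhysicalKey k, byte[] compressed); }
    interface Level3Cache { byte[] get(PhysicalKey k); void put(PhysicalKey k, byte[] compressed); }
    interface OssReader   { byte[] downloadRange(PhysicalKey k); }

    private final Level1Cache l1;
    private final Level2Cache l2;
    private final Level3Cache l3;
    private final OssReader oss;

    HierarchicalLookupSketch(Level1Cache l1, Level2Cache l2, Level3Cache l3, OssReader oss) {
        this.l1 = l1; this.l2 = l2; this.l3 = l3; this.oss = oss;
    }

    List<long[]> read(LogicalKey logical, PhysicalKey physical) {
        List<long[]> blocks = l1.get(logical);            // level 1: decompressed blocks
        if (blocks != null) return blocks;

        byte[] compressed = l2.get(physical);             // level 2: compressed bytes in memory
        if (compressed == null) {
            compressed = l3.get(physical);                // level 3: file fragment on local disk
            if (compressed == null) {
                compressed = oss.downloadRange(physical); // fallback: fetch the byte range from OSS
                l3.put(physical, compressed);             // persisted asynchronously in practice
            }
            l2.put(physical, compressed);
        }
        blocks = decompressToBlocks(compressed);          // decompress into 1,000-row blocks
        l1.put(logical, blocks);
        return blocks;
    }

    private List<long[]> decompressToBlocks(byte[] compressed) {
        // Placeholder: real code would run LZ4 + RunLength decoding here.
        return List.of(new long[1000]);
    }
}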
The PolarDB-X columnar storage format is mainly composed of ORC files. The data structures required by the executor's computing layer can be divided into three levels from the ORC file:
The physical storage structure is the raw, fundamental storage form of an ORC file, purely composed of byte sequences without any semantic information. By using the file read interface (such as InputStream), the system can sequentially or randomly read the byte data in the file without understanding its specific meaning.
The logical storage structure focuses on the organization and compression of data, consisting of the following levels:
File: It consists of the file footer, postscript, and multiple stripes.
• Postscript: Located at the very end of the file, it can locate the byte positions of the main internal areas of the file.
• File Footer: It stores file-level min-max statistical information, compression and encryption algorithm information, version information, and others.
Stripe: Every 64 MB of data forms a stripe, which consists of a StripeFooter and a set of columns.
• StripeFooter: It records the location of each stream within the stripe and the encoding information of each column.
Column: Each column consists of several streams, which are contiguous in the storage structure.
• Index Stream: It stores the row index of the column, which consists of ColumnStatistics and positions.
Stream: Each stream represents a contiguous byte storage area that contains a particular type of data stream information. Each stream consists of several RowGroups, with each RowGroup storing 10,000 rows of data by default.
RowGroup: On each stream, every 10,000 rows of data form a RowGroup. The data of each RowGroup undergoes lightweight compression and general compression, and its location information is recorded in the RowIndex, which describes the starting byte position of the RowGroup, its size after compression, its original size, and the position of its starting element.
The default general compression method for ORC is Zlib, which offers a high compression ratio but slower decompression. PolarDB-X columnar storage uses LZ4 compression instead: although its compression ratio is moderate, its decompression is faster, making it suitable for scenarios that demand high performance.
RL (RunLength) compression selects a flexible encoding based on the characteristics of the data to improve compression efficiency. There are four main modes:
• SHORT_REPEAT: a short run of a single repeated value.
• DIRECT: values are bit-packed and stored directly.
• PATCHED_BASE: values are stored relative to a base value, with outliers patched separately.
• DELTA: sequences that change by a fixed or slowly varying step are stored as deltas.
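To give a feel for how an encoder picks among such modes, the following toy sketch classifies a small window of integers; it does not reproduce the actual ORC RLE v2 bit layout, and PATCHED_BASE is omitted for brevity.

// Toy illustration only (not the ORC RLE v2 bit layout): how an encoder might
// choose an encoding mode for a small window of integers.
public class RunLengthModeSketch {
    enum Mode { SHORT_REPEAT, DELTA, DIRECT }

    static Mode chooseMode(long[] window) {
        boolean allEqual = true;
        boolean fixedDelta = true;
        long delta = window.length > 1 ? window[1] - window[0] : 0;
        for (int i = 1; i < window.length; i++) {
            if (window[i] != window[0]) allEqual = false;
            if (window[i] - window[i - 1] != delta) fixedDelta = false;
        }
        if (allEqual) return Mode.SHORT_REPEAT;  // a short run of one repeated value
        if (fixedDelta) return Mode.DELTA;       // values change by a fixed step
        return Mode.DIRECT;                      // store values directly, bit-packed
    }

    public static void main(String[] args) {
        System.out.println(chooseMode(new long[]{7, 7, 7, 7}));     // SHORT_REPEAT
        System.out.println(chooseMode(new long[]{10, 20, 30, 40})); // DELTA
        System.out.println(chooseMode(new long[]{3, 9, 4, 1}));     // DIRECT
    }
}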
Specific types of blocks can be obtained based on the file, stripe, column, and RowGroup. For example, if the RowGroup stride is 10,000 rows and the chunk limit is 1,000, then {file=xxx.orc, StripeId=0, ColumnId=1, rowGroupId={1,3}} retrieves 20 IntegerBlock objects (two RowGroups × 10 chunks of 1,000 rows each).
The BlockCacheManager is responsible for caching these objects, with main interfaces including:
• Cache all chunks of the associated RowGroup based on RowGroup access characteristics.
• Manage the insertion, querying, and elimination of the cache.
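Based on the responsibilities listed above, a block cache manager interface might look like the sketch below; the method names and parameter types are assumptions, not the actual PolarDB-X interface.

import java.util.List;

// Sketch of a block cache manager interface; names and signatures are assumed.
public interface BlockCacheManagerSketch<BLOCK> {
    // Return the cached chunks of one RowGroup of one column, or null on a miss.
    List<BLOCK> getCachedChunks(String filePath, int stripeId, int columnId, int rowGroupId);

    // Cache all chunks decompressed from one RowGroup.
    void putChunks(String filePath, int stripeId, int columnId, int rowGroupId, List<BLOCK> chunks);

    // Evict entries until memory usage falls below the configured threshold.
    void evictIfNeeded();
}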
Based on the in-depth analysis and design of the ORC data structure hierarchy, the new ORC read path aims to achieve precise control over each column and each row group while optimizing every aspect of data reading and parsing.
The entire reading process is driven by two core interfaces: StripeLoader and ColumnReader. Each interface takes on different responsibilities to ensure efficient data reading and parsing.
The StripeLoader interface manages the stream information in ORC files and coordinates the execution of I/O operations. The byte areas loaded for each stream are organized into a linked-list structure:
The entire linked list structure covers the linked list pointers, byte area information, and the actual physical byte data. This information is managed by StreamInformation to ensure a clear and controllable relationship between streams and buffer nodes.
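The sketch below illustrates what such a buffer-node linked list could look like; the field and class names are assumptions modeled on the description above, not the real PolarDB-X types.

// Illustrative sketch of the buffer-node linked list: each node covers one
// contiguous byte area of a stream and points to the next area of the same stream.
public class StreamBufferSketch {
    static final class BufferNode {
        final long offsetInFile;   // byte position of this area in the ORC file
        final int length;          // byte length of the area
        final byte[] bytes;        // the physical bytes once loaded
        BufferNode next;           // linked-list pointer to the next area

        BufferNode(long offsetInFile, int length, byte[] bytes) {
            this.offsetInFile = offsetInFile;
            this.length = length;
            this.bytes = bytes;
        }
    }

    static final class StreamInformation {
        final int columnId;
        final String streamKind;   // e.g. "DATA", "PRESENT", "ROW_INDEX"
        BufferNode head;           // first buffer node of this stream

        StreamInformation(int columnId, String streamKind) {
            this.columnId = columnId;
            this.streamKind = streamKind;
        }
    }
}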
The ColumnReader interface is responsible for parsing the raw ORC byte data obtained by StripeLoader into the block objects required by the executor. It exposes three core methods:
• Initialization method: This method initializes the data parsing process. You can pass in the callback that is returned when the StripeLoader is loaded or directly trigger the loading process internally and wait for the results.
• Pointer positioning method: This method moves the internal read pointer to the position of the specified row group and element to enable random access and efficiently locate and read specific data.
• Data reading method: This method reads a specified number of data elements from the current pointer position and writes them into a randomly accessed block. This method avoids additional overheads such as dynamic array expansion and boundary checks, enhancing data writing efficiency.
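A possible shape for these three methods is sketched below; the signatures are assumptions modeled on the text, not the actual PolarDB-X interface.

import java.io.IOException;
import java.util.concurrent.CompletableFuture;

// Sketch of the three core methods of a column reader; signatures are assumed.
public interface ColumnReaderSketch<BLOCK> {
    // Initialization: either consume the load result returned by StripeLoader,
    // or trigger the loading internally and wait for it.
    void open(CompletableFuture<?> stripeLoadingResult) throws IOException;

    // Pointer positioning: move the internal read pointer to a given row group
    // and a given element inside it, enabling random access.
    void startAt(int rowGroupId, int elementPosition) throws IOException;

    // Data reading: read `count` elements from the current pointer and write
    // them into a pre-allocated, randomly accessible block.
    void next(BLOCK block, int count) throws IOException;
}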
AbstractLongColumnReader is an abstract class suitable for handling all data types represented by integers or long integers, such as int, bigint, decimal64, date, datetime, and so on. The main steps are as follows:
1. Parser initialization: A parser is built based on the input stream and stream information. For example, for the common bigint type, compression and decompression parsers are built based on the data and presence streams respectively. The data stream stores the compressed bigint value, while the presence stream stores the compressed null value information.
2. Multi-layer pointer management: The parser uses a three-layer pointer management mechanism.
3. Specified location positioning: You can use the positioning method to calculate the byte starting position and compression size based on the row group ID and element position and adjust the pointer of each layer to correctly locate the data to be read.
4. Data copy and block generation: You can use the reading method to copy memory data from the current pointer to randomly accessed blocks as needed. According to the different data types, allocate the corresponding type of block objects, parse the presence stream and data stream in the loop, and write the final data into the block to complete the block construction.
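The read loop for an integer-like column, following the steps above, can be sketched as follows: the presence decoder tells whether a position is null, and the data decoder supplies the next non-null value. The decoder and block types here are simplified assumptions, not the real AbstractLongColumnReader internals.

// Simplified read loop for an integer-like column.
public class LongColumnReadLoopSketch {
    interface BooleanDecoder { boolean next(); }   // decoded presence stream
    interface LongDecoder { long next(); }         // decoded (RunLength + LZ4) data stream

    static final class LongBlock {
        final long[] values;
        final boolean[] isNull;
        LongBlock(int capacity) { values = new long[capacity]; isNull = new boolean[capacity]; }
    }

    static LongBlock readBlock(BooleanDecoder presence, LongDecoder data, int count) {
        LongBlock block = new LongBlock(count);
        for (int i = 0; i < count; i++) {
            if (presence != null && !presence.next()) {
                block.isNull[i] = true;            // null position: no value in the data stream
            } else {
                block.values[i] = data.next();     // non-null: take the next decoded value
            }
        }
        return block;
    }
}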
Through the above design, the new ORC read path realizes precise control over each column and each row group, optimizing every aspect of data reading and parsing. The StripeLoader interface efficiently manages stream information and organizes I/O operations, while the ColumnReader interface is used to efficiently parse raw byte data and generate the block objects required by the executor. Combined with the abstract design of AbstractLongColumnReader, the entire reading process achieves efficiency and flexibility and adapts to the processing requirements of multiple data types. This customized ORC read path design provides strong support for large-scale data processing and significantly improves the performance and reliability of data reading.
The management of the level-3 cache system covers several key parts to ensure efficient data access and optimization of system performance. First, the input/output (I/O) process of the file system is the foundation of the system, which includes both network I/O and local persistent caching. The network I/O is responsible for the transmission of remote data, while the local persistent caching ensures the reliability and quick access of data stored locally. Second, file metadata management is responsible for maintaining the table files and index files stored on the Object Storage Service (OSS) instances to ensure the orderliness of the data structure and the efficiency of retrieval. Third, data source management is responsible for configuring and loading data sources under different engines to support diversified data input channels. Finally, file access scheduling is carried out through a Massively Parallel Processing (MPP) framework to allocate and schedule file access before execution reasonably, optimizing resource utilization and access efficiency.
In the Hadoop ecosystem, the Hadoop file system provides a unified file system abstraction that encapsulates a variety of data sources by implementing its core abstract classes. This design enables different types of data sources to be accessed through a unified interface, greatly simplifying the complexity of file operations. The Hadoop file system enables file reading and writing by initializing connections, opening file streams, and creating file streams, thus supporting distributed data storage and efficient data processing.
To further improve the efficiency of file access, PolarDB-X has designed and implemented a cache file system, which can support file access to both local persistent caching and remote OSS instances. The cache file system consists of two main components: one is responsible for handling file access on remote OSS instances, and the other is responsible for managing local caches, including data persistence and maintenance of cached files. Through this two-layer caching mechanism, the system can greatly speed up file access and improve the overall system performance while ensuring data consistency.
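The delegation between the two components can be illustrated with a much-simplified sketch. It deliberately does not extend the real org.apache.hadoop.fs.FileSystem class, and the OssFileReader interface and class names are assumptions used only to show the idea of serving local cache hits before falling back to OSS.

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Much-simplified illustration of a cache file system: serve the file from the
// local persistent cache when present, otherwise fall back to remote OSS.
public class CacheFileSystemSketch {
    interface OssFileReader {
        InputStream open(String remotePath) throws IOException;   // remote OSS access
    }

    private final OssFileReader ossReader;
    private final Path localCacheDir;

    CacheFileSystemSketch(OssFileReader ossReader, Path localCacheDir) {
        this.ossReader = ossReader;
        this.localCacheDir = localCacheDir;
    }

    InputStream open(String remotePath) throws IOException {
        Path cached = localCacheDir.resolve(Integer.toHexString(remotePath.hashCode()));
        if (Files.exists(cached)) {
            return Files.newInputStream(cached);                   // local persistent cache hit
        }
        return ossReader.open(remotePath);                         // fall back to remote OSS
    }
}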
The well-designed reading process of the level-3 cache system ensures efficient data acquisition and cache management. First, the system obtains the input stream of the target file through the cache file system, preparing for data reading. Next, the system reads a specific byte range of the file as needed. This process can flexibly access any position in the file, meeting various data requirements. The read data is then processed by the cache manager. The asynchronous disk flushing thread inside the cache manager regularly persists the data retrieved from remote locations to the local disk. It also maintains a range mapping that records the remote path, offset, and length information for each cache block. When adjacent or overlapping data blocks exist in the cache, the system triggers a file merging operation to merge multiple small files into a continuous file to optimize storage and access efficiency. At the same time, the cache statistics thread and file cleanup thread in the background regularly evaluate the size of the cache and clean up the expired caches to ensure the healthy and efficient operation of the cache system.
Multiple thread pools are designed internally by the cache manager to coordinate different cache management tasks. The first is the asynchronous disk flushing thread pool, which is responsible for periodically persisting the data obtained by network I/O to the local disk and maintaining the scope mapping of the cache. This thread pool automatically triggers a file merging operation when it detects adjacent or overlapping cache locations. Secondly, the cache statistics thread pool regularly counts the total size of the persisted local cache files to ensure that the total cache size does not exceed the preset upper limit, thus preventing excessive resource consumption. Finally, the cache file cleanup thread pool marks and clears those expired cache files that are no longer needed according to the Least Recently Used (LRU) policy to keep the cache system efficient and clean.
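The three background responsibilities map naturally onto scheduled executors. The sketch below shows one way to wire them up; the periods and task bodies are placeholders, not the real PolarDB-X settings.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the three background maintenance tasks of the cache manager.
public class CacheMaintenanceSketch {
    private final ScheduledExecutorService flushPool = Executors.newSingleThreadScheduledExecutor();
    private final ScheduledExecutorService statsPool = Executors.newSingleThreadScheduledExecutor();
    private final ScheduledExecutorService cleanupPool = Executors.newSingleThreadScheduledExecutor();

    void start() {
        // Persist recently downloaded ranges to disk and merge adjacent fragments.
        flushPool.scheduleWithFixedDelay(this::flushAndMerge, 1, 1, TimeUnit.SECONDS);
        // Recompute the total size of persisted cache files against the upper limit.
        statsPool.scheduleWithFixedDelay(this::refreshCacheSize, 10, 10, TimeUnit.SECONDS);
        // Drop expired cache files following an LRU-style policy.
        cleanupPool.scheduleWithFixedDelay(this::cleanupExpired, 60, 60, TimeUnit.SECONDS);
    }

    private void flushAndMerge() { /* write hot ranges to disk, merge overlapping files */ }
    private void refreshCacheSize() { /* sum file sizes, trigger eviction when over the limit */ }
    private void cleanupExpired() { /* delete files not accessed recently */ }
}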
The cache manager divides the cache into two types, hot cache and normal cache, to take full advantage of data access locality. The hot cache stores data that has been accessed within the last three seconds; this data has not yet been persisted and resides entirely in memory, so the hot cache offers faster access and suits frequently accessed data. The normal cache stores cached data that has already been persisted to the local disk and is suitable for long-term storage and less frequently accessed data. Through this cache hierarchy policy, the system ensures fast data access while efficiently utilizing storage resources, significantly improving overall query performance and system response speed.
In a database system based on the time slice mechanism and the pipeline execution framework, the difference in the speed of task producers and consumers may cause insufficient memory or system overload. To solve this problem, PolarDB-X has introduced a backpressure mechanism to control the workflow of producers and consumers, preventing frequent backpressure and scheduling overhead from degrading system performance.
The pipeline execution framework is a strategy in database query processing where components or operations of the execution plan are organized and executed in a pipeline form. Pipeline execution allows the current data batch to be processed in one stage while the output of the previous stage is processed in the next stage. This improves the overall performance and parallelism of the query.
The main concepts include:
• Driver: The minimum unit of execution and scheduling in the pipeline execution framework. It is responsible for processing the execution tasks of a set of operators.
• Chunk: The data unit in pipeline execution; the data blocks produced and consumed by drivers.
• Producers and consumers: Producer drivers are responsible for generating chunks, while consumer drivers are responsible for consuming chunks.
The backpressure mechanism is mainly used to solve the problem that the producer speed is much faster than the consumer speed and prevent the system from crashing due to the rapid rise of the memory level. The specific design is as follows:
• Producer pipeline: With a degree of parallelism of m, one producer driver is created for each degree of parallelism.
• Consumer pipeline: With a degree of parallelism of n, one consumer driver is created for each degree of parallelism, each equipped with a LocalBuffer for receiving chunks from the producers.
• LocalBuffer: It exists in the form of an unbounded non-blocking queue, managing the reading and writing of chunks through a linked list structure.
Without considering the memory limit, producers and consumers are free to read and write to the LocalBuffer. However, when the producer speed far exceeds the consumer speed, the memory level will rise rapidly, causing system instability. Therefore, a backpressure mechanism is introduced for control.
• memory_ledger: Records the total number of chunks in all LocalBuffers as a metric of memory usage.
• readable and writable signals: Implemented based on SettableFuture; they are used to control the execution state of producers and consumers.
• readable: A signal indicating that the buffer is readable.
• writable: A signal indicating that the buffer is writable.
Through these two signals, the system can coordinate the execution of producers and consumers to avoid memory overload.
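A minimal sketch of this signalling, built on Guava's SettableFuture, is shown below. The callback bodies, timeout value, and the way futures are recreated per round are placeholders; the real engine moves drivers between its own scheduling queues.

import com.google.common.util.concurrent.MoreExecutors;
import com.google.common.util.concurrent.SettableFuture;

import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Minimal sketch of readable/writable signalling for backpressure.
public class BackpressureSignalsSketch {
    // In real code these futures would be recreated after each completion.
    private volatile SettableFuture<Void> readable = SettableFuture.create();
    private volatile SettableFuture<Void> writable = SettableFuture.create();

    // Producer side: after writing chunks, mark the buffer readable so that
    // waiting consumer drivers can be rescheduled.
    void onChunksProduced() {
        readable.set(null);
    }

    // Consumer side: after draining chunks, mark the buffer writable so that
    // blocked producer drivers move back to the ready queue.
    void onChunksConsumed() {
        writable.set(null);
    }

    // Register what should happen when the buffer becomes readable,
    // e.g. moving consumer drivers to the highest-priority ready queue.
    void whenReadable(Runnable wakeUpConsumers) {
        readable.addListener(wakeUpConsumers, MoreExecutors.directExecutor());
    }

    // Consumer driver waits for data within a bounded time slice; on timeout it
    // yields its time slice and re-enters the blocking queue.
    boolean awaitReadable(long timeoutMillis) throws InterruptedException, ExecutionException {
        try {
            readable.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;                      // the buffer became readable in time
        } catch (TimeoutException e) {
            return false;                     // yield and enter the blocking queue
        }
    }
}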
i) A readable signal is generated and stored in the memory_ledger, representing that all consumer drivers are waiting for the event "the buffer is readable". A callback is registered on the readable signal; the actions defined in this callback are executed once the event completes, which happens in step iv).
ii) The consumer driver obtains the writable signal from the memory_ledger and calls writable.set(), which indicates that the event "the buffer is writable" has been completed. This triggers the registered callback, which moves all producer drivers from the blocking queue to the highest-priority ready queue.
iii) The consumer driver calls readable.get(), and the thread waits in place until one of two situations occurs: 1) after a timeout, the consumer driver yields its time slice and enters the blocking queue; 2) before the timeout, the event "the buffer is readable" is completed, ending the wait and allowing the consumer driver to continue reading chunks from the buffer.
iv) After a short period (usually a few dozen nanoseconds), the producer driver is taken from the ready queue by the thread group for scheduling and execution, and starts producing chunks and writing them into the LocalBuffer. The consumer driver then ends its wait and continues to read chunks from the buffer, consuming them and freeing the chunk memory.
This mechanism ensures speed matching between producers and consumers to avoid memory overload. It also reduces frequent context switching and improves the overall performance of the system.
To avoid excessive backpressure and scheduling overhead, the system dynamically adjusts the upper limit of buffer memory (Smax) to control the frequency of backpressure. This design brings the following benefits:
• Memory stability: The memory counter and signal mechanism ensure that memory usage stays within a controllable range and prevent memory overflow.
• Improved performance: This feature reduces frequent context switching, lowers scheduling overheads, and improves overall system performance.
• Dynamic adjustment: The Smax value is dynamically adjusted based on the actual running conditions to meet different loads and query requirements.
The PolarDB-X columnar query engine uses a compute-storage separation architecture. Before executing an analytical query, the compute node first checks whether the local cache holds the required columnar index data. If it does not, the data must be fetched from the remote OSS, which increases query latency. To resolve this issue, PolarDB-X provides a cache warmup feature that allows data to be loaded into the local cache in advance. This avoids fetching data from the remote end during actual queries, thereby improving query performance and stability.
• Cache management: It is a passive filling process. Only when the upper-level cache misses, data is fetched and parsed from the lower-level cache or storage service and filled into the upper-level cache.
• Cache warmup: It is an active cache-filling process, where specified data is placed into the cache in advance. It enhances cache management and is suitable for scenarios where the data to be queried is known in advance.
PolarDB-X provides the WARMUP syntax for implementing cache warmup. This syntax takes effect only on columnar read-only instances.
WARMUP [cron_expression]
SELECT
select_expr [, select_expr] ...
[FROM table_references]
[WHERE where_condition]
[GROUP BY {col_name | expr | position}
[ASC | DESC], ... ]
[HAVING where_condition]
[ORDER BY {col_name | expr | position}
[ASC | DESC], ...]
[LIMIT {[offset,] row_count | row_count OFFSET offset}]
cron_expression:
('cron_string') | <empty>
select_expr:
column_name
| aggregate_function(column_name)
| column_name AS alias_name
| expression
Parameters:
• cron_expression: The time expression used to periodically execute warmup tasks, in a format similar to the Unix cron expressions.
• select_expr: Specify the columns or expressions to be warmed up.
1. Regular warmup of common columns
WARMUP('*/2 * * * *')
SELECT col1, col2, col3... FROM your_table;
Function: Periodically warm up the common columns of the destination table every 2 minutes.
2. Batch warmup at night
WARMUP('30 21 * * *')
SELECT col1, col2, col3 ...
FROM table1
LEFT JOIN table2 ON table1.id1 = table2.id2
LEFT JOIN table3 ON table1.id1 = table3.id3
WHERE table1.col_date > '2024-01-01';
Function: Warm up the business data of the day at 21:30 every night to ensure quick response to queries at 22:00.
3. Warmup before performance testing
WARMUP SELECT * FROM lineitem;
WARMUP SELECT * FROM orders;
WARMUP SELECT * FROM customer;
WARMUP SELECT * FROM part;
WARMUP SELECT * FROM partsupp;
WARMUP SELECT * FROM supplier;
WARMUP SELECT * FROM region;
WARMUP SELECT * FROM nation;
Function: All tables are warmed up in one batch to ensure a high cache hit ratio during performance testing and more accurate test results.
4. Scheduled batch warmup
WARMUP('*/5 1-5 * * *')
SELECT * FROM lineitem;
Function: During the O&M window from 01:00 to 05:00 every day, the data of the specified table is warmed up every 5 minutes to ensure stable query performance during O&M.
The cache warmup feature actively fills the cache to eliminate the overhead of fetching data from the remote end during queries, greatly improving query performance and stability. By properly configuring cache warmup in line with business access patterns and query requirements, PolarDB-X can provide users with a more efficient and reliable data query experience.
In the era of big data, how to efficiently store and query massive data has become a key issue. The PolarDB-X columnar query engine effectively optimizes the performance of ORC columnar queries through cache hierarchy management and refined backpressure mechanisms, solving the bottleneck of query speed in large-scale datasets. At the same time, the flexible cache warmup and the optimized ORC read path design enable PolarDB-X to provide efficient and stable query services in complex and changeable business scenarios.
PolarDB-X, through continuous technological innovation and optimization, is committed to providing users with more efficient and reliable data query solutions. In the era of big data, it helps enterprises maximize the value of their data.