By Junqi
In an era of rapid data growth, efficiently querying massive datasets has become a significant challenge for enterprises and technical practitioners. Although columnar storage formats such as ORC have clear advantages in specific scenarios, query speed still hits bottlenecks on large-scale datasets. This article delves into the cache hierarchy of the PolarDB-X columnar query engine and its key role in improving the performance of ORC columnar queries. We analyze its design principles, implementation details, and applicable scenarios to demonstrate its broad applicability, efficiency, and reliability in big data queries. The article also covers the storage structure of ORC files, data compression and decompression techniques, and the backpressure strategy for managing the executor's intermediate results, illustrating how the cache hierarchy and backpressure mechanisms further enhance query performance. Through this content, readers will gain an understanding of the practical effects and technical advantages of the PolarDB-X cache hierarchy in improving the efficiency of columnar queries.
In PolarDB-X columnar query engine, multi-level cache management is a core strategy for improving query performance. By setting caches at different levels, the system can achieve efficient data loading and query response according to data access frequency and characteristics. This cache hierarchy architecture optimizes the data reading path and effectively reduces the dependence on the underlying storage system, thus significantly improving the query speed.
The ORC (Optimized Row Columnar) format is a columnar storage format that is widely used in big data processing. Its internal structure is ingeniously designed, and through a multi-level organization, it achieves efficient data compression and fast column-level queries.
• Stripe: By default, an ORC file forms a stripe every 64 MB. Multiple columns are stored in a stripe.
• Column: In a stripe, each column is stored in a contiguous file area that contains multiple RowGroups.
• RowGroup: In a column, every 10,000 rows of data are divided into a RowGroup. The RowGroup is the basic unit for ORC to perform compression, decompression, encoding, and decoding. If the total number of rows stored inside the column is not an integer multiple of 10,000, the last RowGroup may contain fewer than 10,000 rows.
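Because the RowGroup stride is fixed, locating the RowGroup that holds a given row inside a stripe's column is simple arithmetic. The following is a minimal sketch of that calculation, assuming the default 10,000-row stride described above; the class and method names are illustrative, not part of PolarDB-X.

// Minimal sketch: locate the RowGroup that holds a given row inside one stripe's
// column, assuming the default RowGroup stride of 10,000 rows.
public class RowGroupLocator {
    static final int ROW_GROUP_STRIDE = 10_000; // default rows per RowGroup

    // rowInStripe is the 0-based row offset within the stripe's column
    static int rowGroupId(long rowInStripe) {
        return (int) (rowInStripe / ROW_GROUP_STRIDE);
    }

    static int offsetInRowGroup(long rowInStripe) {
        return (int) (rowInStripe % ROW_GROUP_STRIDE);
    }

    public static void main(String[] args) {
        long row = 25_431;
        // row 25,431 falls into RowGroup 2, at offset 5,431 inside it
        System.out.println(rowGroupId(row) + ", " + offsetInRowGroup(row));
    }
}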
In addition, two types of structures, the stripe footer and index data, store location information, making it easy to quickly locate data blocks.
In an ORC file, the SMA (Statistics Minimum and Maximum) index stores the minimum and maximum statistics for each stripe and RowGroup. This information can quickly filter out irrelevant data blocks during queries, reducing unnecessary data reading and improving query efficiency.
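To make the pruning idea concrete, the sketch below shows how min/max statistics can rule out a RowGroup for a range predicate such as "col BETWEEN lo AND hi". It is a simplified illustration, not the actual PolarDB-X code; the MinMax type and method names are assumptions.

// Minimal sketch: use per-RowGroup min/max statistics to decide whether a
// RowGroup can be skipped for a predicate like "col BETWEEN lo AND hi".
public class SmaPruning {
    static final class MinMax {
        final long min;
        final long max;
        MinMax(long min, long max) { this.min = min; this.max = max; }
    }

    // The RowGroup can be skipped when its [min, max] range does not
    // intersect the predicate range [lo, hi].
    static boolean canSkip(MinMax stats, long lo, long hi) {
        return stats.max < lo || stats.min > hi;
    }

    public static void main(String[] args) {
        MinMax rg0 = new MinMax(1, 900);      // RowGroup 0: values 1..900
        MinMax rg1 = new MinMax(901, 2_000);  // RowGroup 1: values 901..2000
        long lo = 1_500, hi = 1_800;          // predicate: col BETWEEN 1500 AND 1800
        System.out.println(canSkip(rg0, lo, hi)); // true  -> RowGroup 0 is filtered out
        System.out.println(canSkip(rg1, lo, hi)); // false -> RowGroup 1 must be read
    }
}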
To accurately locate a data block (such as a 1,000-row data unit) in an ORC file, a unique logical address is required, which contains the following 5 fields:
Through these fields, the system can accurately locate the specific data location in the ORC file, enabling efficient data reading.
In the columnar query process, the executor is the core component of the data query operation. It formulates an execution plan tree based on SQL statements, operators (such as scans and joins), and predicate conditions, and executes corresponding query operations.
The execution plan tree generated by the executor contains a Scan operator, which defines how to perform scan operations on ORC files, that is, how to read data from specific columns. These operations are usually column-centric and are performed directly on ORC-format files, which greatly improves query efficiency.
During the execution of query operations, the Scan operator is equipped with predicate conditions for data filtering. The SMA index in ORC format allows the executor to efficiently determine which stripe or RowGroup contains eligible data without reading all the data. By combining predicate conditions with SMA indexes, the system can reduce the workload of data loading and scanning, further improving query performance.
ORC files typically use multiple compression techniques, such as LZ4 general compression and RunLength numeric compression, to reduce storage space and improve I/O efficiency. The executor will automatically perform the decompression operation when reading data. The compression ratio ranges from 2 to 4 times, which saves storage space and improves the efficiency of data reading.
The PolarDB-X columnar query engine uses a multi-level caching mechanism, including level-1, level-2, and level-3 caches, as well as the underlying OSS storage base. Each level of cache has its unique design and function. Through hierarchical lookup and caching, the system can efficiently access and manage massive data.
The overall process can be summarized as follows: the system obtains the logical address {StripeId, ColumnId, RowGroupId} through predicate conditions and index search, and obtains the specific physical location {filePath, offset, len} through the meta and footer information of the ORC file, and then searches for and reads the data in the caches at each level.
The level-1 cache is designed to respond quickly to data requests that are accessed at high frequency. When a query arrives, the key of the level-1 cache, {StripeId, ColumnId, RowGroupId}, is derived from the predicate conditions, and the corresponding block list is looked up in the level-1 cache. If a hit occurs, these blocks are used directly for computation. Since the blocks are sequentially decompressed from RowGroups, and each RowGroup contains 10,000 rows of data, one block represents 1/10 of a RowGroup, that is, 1,000 rows.
To control the space usage of the level-1 cache, when the cache space exceeds a preset threshold (usually 20% of the runtime memory), the LFU (Least Frequently Used) cache replacement policy is triggered to eliminate the least accessed cache items.
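As a rough sketch of the idea, a weight-bounded in-memory block cache with frequency-based eviction can be expressed with the Caffeine library, whose TinyLFU policy is also frequency-based; this stands in for the real PolarDB-X implementation, and the BlockCacheKey and Block types are hypothetical.

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

import java.util.List;

// Rough sketch only: a weight-bounded in-memory block cache. The threshold
// (e.g. roughly 20% of runtime memory) is passed in as maxBytes; eviction is
// handled by the library's frequency-based policy.
public class Level1CacheSketch {
    record BlockCacheKey(int stripeId, int columnId, int rowGroupId) {}
    record Block(long[] values) {}

    private final Cache<BlockCacheKey, List<Block>> cache;

    Level1CacheSketch(long maxBytes) {
        this.cache = Caffeine.newBuilder()
                .maximumWeight(maxBytes)                       // e.g. 20% of runtime memory
                .weigher((BlockCacheKey k, List<Block> v) ->
                        v.stream().mapToInt(b -> b.values().length * Long.BYTES).sum())
                .build();
    }

    List<Block> get(BlockCacheKey key) {
        return cache.getIfPresent(key);                        // null means a level-1 miss
    }

    void put(BlockCacheKey key, List<Block> blocks) {
        cache.put(key, blocks);                                // eviction happens automatically
    }
}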
When the level-1 cache misses, the system derives the address {filePath, offset, len} from {StripeId, ColumnId, RowGroupId} and uses it as the key of the level-2 cache, then searches for the corresponding compressed byte array in the level-2 cache. If a hit occurs, the data is decompressed and the decompressed blocks are written into the level-1 cache. Similarly, when the space usage of the level-2 cache reaches its threshold, the LFU mechanism evicts the least-used data.
The level-3 cache is located at the disk layer and mainly stores compressed file fragments obtained from the OSS storage base. The level-3 cache stores the byte array that was previously in the level-2 cache as a file. When adjacent segments are read, they are merged, and the old files are deleted asynchronously. Since write operations on files are time-consuming, these operations are completed by asynchronous threads.
If all cache levels miss, the system maps {StripeId, ColumnId, RowGroupId} to a specific file path, locates the corresponding ORC file in the OSS storage, and downloads only the required file fragments for decompression and reading. Although OSS access incurs network latency, its nearly unlimited storage space provides a powerful backing store for the system.
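The cascade described above can be summarized in code. The following is a simplified sketch of the hierarchical lookup; all interfaces, type names, and method names are illustrative assumptions rather than the real PolarDB-X API.

import java.util.List;

// Simplified sketch of the hierarchical lookup: level 1 holds decompressed
// blocks keyed by {StripeId, ColumnId, RowGroupId}; level 2 holds compressed
// byte arrays keyed by {filePath, offset, len}; level 3 holds file fragments
// on local disk; OSS is the final fallback.
public class HierarchicalLookupSketch {
    record LogicalKey(int stripeId, int columnId, int rowGroupId) {}
    record PhysicalKey(String filePath, long offset, int len) {}

    interface Level1Cache { List<long[]> get(LogicalKey k); void put(LogicalKey k, List<long[]> blocks); }
    interface Level2Cache { byte[] get(PhysicalKey k); void put(PhysicalKey k, byte[] compressed); }
    interface Level3Cache { byte[] get(PhysicalKey k); void put(PhysicalKey k, byte[] compressed); }
    interface OssReader   { byte[] downloadRange(PhysicalKey k); }

    private final Level1Cache l1;
    private final Level2Cache l2;
    private final Level3Cache l3;
    private final OssReader oss;

    HierarchicalLookupSketch(Level1Cache l1, Level2Cache l2, Level3Cache l3, OssReader oss) {
        this.l1 = l1; this.l2 = l2; this.l3 = l3; this.oss = oss;
    }

    List<long[]> read(LogicalKey logical, PhysicalKey physical) {
        List<long[]> blocks = l1.get(logical);            // level 1: decompressed blocks
        if (blocks != null) return blocks;

        byte[] compressed = l2.get(physical);             // level 2: compressed bytes in memory
        if (compressed == null) {
            compressed = l3.get(physical);                // level 3: file fragment on local disk
            if (compressed == null) {
                compressed = oss.downloadRange(physical); // fallback: fetch the byte range from OSS
                l3.put(physical, compressed);             // persisted asynchronously in practice
            }
            l2.put(physical, compressed);
        }
        blocks = decompressToBlocks(compressed);          // decompress into 1,000-row blocks
        l1.put(logical, blocks);
        return blocks;
    }

    private List<long[]> decompressToBlocks(byte[] compressed) {
        // Placeholder: real code would run LZ4 + RunLength decoding here.
        return List.of(new long[1000]);
    }
}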
The PolarDB-X columnar storage format is mainly composed of ORC files. The data structures required by the executor's computing layer can be divided into three levels from the ORC file:
The physical storage structure is the raw, fundamental storage form of an ORC file, purely composed of byte sequences without any semantic information. By using the file read interface (such as InputStream), the system can sequentially or randomly read the byte data in the file without understanding its specific meaning.
The logical storage structure focuses on the organization and compression of data, consisting of the following levels:
File: It consists of the file footer, postscript, and multiple stripes.
• Postscript: Located at the very end of the file, it can locate the byte positions of the main internal areas of the file.
• File Footer: It stores file-level min-max statistical information, compression and encryption algorithm information, version information, and others.
Stripe: Every 64 MB of data forms a stripe, which consists of a StripeFooter and a set of columns.
• StripeFooter: It records the location of each stream within the stripe and the encoding information of each column.
Column: Each column consists of several streams, which are contiguous in the storage structure.
• Index Stream: It stores the row index of the column, which consists of ColumnStatistics and positions.
Stream: Each stream represents a contiguous byte storage area that contains a particular type of data stream information. Each stream consists of several RowGroups, with each RowGroup storing 10,000 rows of data by default.
RowGroup: On each stream, every 10,000 rows of data form a RowGroup. The data of each RowGroup undergoes lightweight compression and general compression, and its location information is recorded in the RowIndex, which describes the starting byte position of the RowGroup, its size after compression, its original size, and the position of its starting element.
The default general compression method for ORC is Zlib, which offers a high compression ratio but slower decompression. PolarDB-X columnar storage uses LZ4 compression instead: although its compression ratio is moderate, its decompression is faster, making it suitable for scenarios that demand high performance.
RL (RunLength) compression selects a flexible encoding based on the characteristics of the data to improve compression efficiency. There are four main modes:
• SHORT_REPEAT: a short run of a single repeated value.
• DIRECT: values are bit-packed and stored directly.
• PATCHED_BASE: values are stored relative to a base value, with outliers patched separately.
• DELTA: sequences that change by a fixed or slowly varying step are stored as deltas.
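To give a feel for how an encoder picks among such modes, the following toy sketch classifies a small window of integers; it does not reproduce the actual ORC RLE v2 bit layout, and PATCHED_BASE is omitted for brevity.

// Toy illustration only (not the ORC RLE v2 bit layout): how an encoder might
// choose an encoding mode for a small window of integers.
public class RunLengthModeSketch {
    enum Mode { SHORT_REPEAT, DELTA, DIRECT }

    static Mode chooseMode(long[] window) {
        boolean allEqual = true;
        boolean fixedDelta = true;
        long delta = window.length > 1 ? window[1] - window[0] : 0;
        for (int i = 1; i < window.length; i++) {
            if (window[i] != window[0]) allEqual = false;
            if (window[i] - window[i - 1] != delta) fixedDelta = false;
        }
        if (allEqual) return Mode.SHORT_REPEAT;  // a short run of one repeated value
        if (fixedDelta) return Mode.DELTA;       // values change by a fixed step
        return Mode.DIRECT;                      // store values directly, bit-packed
    }

    public static void main(String[] args) {
        System.out.println(chooseMode(new long[]{7, 7, 7, 7}));     // SHORT_REPEAT
        System.out.println(chooseMode(new long[]{10, 20, 30, 40})); // DELTA
        System.out.println(chooseMode(new long[]{3, 9, 4, 1}));     // DIRECT
    }
}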
Specific types of blocks can be obtained based on the file, stripe, column, and RowGroup. For example, if the RowGroup stride is 10,000 rows and the chunk limit is 1,000, then {file=xxx.orc, StripeId=0, ColumnId=1, rowGroupId={1,3}} retrieves 20 IntegerBlock objects (two RowGroups × 10 chunks of 1,000 rows each).
The BlockCacheManager is responsible for caching these objects, with main interfaces including:
• Cache all chunks of the associated RowGroup based on RowGroup access characteristics.
• Manage the insertion, querying, and elimination of the cache.
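Based on the responsibilities listed above, a block cache manager interface might look like the sketch below; the method names and parameter types are assumptions, not the actual PolarDB-X interface.

import java.util.List;

// Sketch of a block cache manager interface; names and signatures are assumed.
public interface BlockCacheManagerSketch<BLOCK> {
    // Return the cached chunks of one RowGroup of one column, or null on a miss.
    List<BLOCK> getCachedChunks(String filePath, int stripeId, int columnId, int rowGroupId);

    // Cache all chunks decompressed from one RowGroup.
    void putChunks(String filePath, int stripeId, int columnId, int rowGroupId, List<BLOCK> chunks);

    // Evict entries until memory usage falls below the configured threshold.
    void evictIfNeeded();
}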
Based on the in-depth analysis and design of the ORC data structure hierarchy, the new ORC read path aims to achieve precise control over each column and each row group while optimizing every aspect of data reading and parsing.
The entire reading process is driven by two core interfaces: StripeLoader and ColumnReader. Each interface takes on different responsibilities to ensure efficient data reading and parsing.
The StripeLoader interface manages the stream information in ORC files and coordinates the execution of I/O operations. The byte areas loaded for each stream are organized into a linked-list structure:
The entire linked list structure covers the linked list pointers, byte area information, and the actual physical byte data. This information is managed by StreamInformation to ensure a clear and controllable relationship between streams and buffer nodes.
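The sketch below illustrates what such a buffer-node linked list could look like; the field and class names are assumptions modeled on the description above, not the real PolarDB-X types.

// Illustrative sketch of the buffer-node linked list: each node covers one
// contiguous byte area of a stream and points to the next area of the same stream.
public class StreamBufferSketch {
    static final class BufferNode {
        final long offsetInFile;   // byte position of this area in the ORC file
        final int length;          // byte length of the area
        final byte[] bytes;        // the physical bytes once loaded
        BufferNode next;           // linked-list pointer to the next area

        BufferNode(long offsetInFile, int length, byte[] bytes) {
            this.offsetInFile = offsetInFile;
            this.length = length;
            this.bytes = bytes;
        }
    }

    static final class StreamInformation {
        final int columnId;
        final String streamKind;   // e.g. "DATA", "PRESENT", "ROW_INDEX"
        BufferNode head;           // first buffer node of this stream

        StreamInformation(int columnId, String streamKind) {
            this.columnId = columnId;
            this.streamKind = streamKind;
        }
    }
}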
The ColumnReader interface is responsible for parsing the raw ORC byte data obtained by StripeLoader into the block objects required by the executor. It exposes three core methods:
• Initialization method: This method initializes the data parsing process. You can pass in the callback that is returned when the StripeLoader is loaded or directly trigger the loading process internally and wait for the results.
• Pointer positioning method: This method moves the internal read pointer to the position of the specified row group and element to enable random access and efficiently locate and read specific data.
• Data reading method: This method reads a specified number of data elements from the current pointer position and writes them into a randomly accessed block. This method avoids additional overheads such as dynamic array expansion and boundary checks, enhancing data writing efficiency.
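A possible shape for these three methods is sketched below; the signatures are assumptions modeled on the text, not the actual PolarDB-X interface.

import java.io.IOException;
import java.util.concurrent.CompletableFuture;

// Sketch of the three core methods of a column reader; signatures are assumed.
public interface ColumnReaderSketch<BLOCK> {
    // Initialization: either consume the load result returned by StripeLoader,
    // or trigger the loading internally and wait for it.
    void open(CompletableFuture<?> stripeLoadingResult) throws IOException;

    // Pointer positioning: move the internal read pointer to a given row group
    // and a given element inside it, enabling random access.
    void startAt(int rowGroupId, int elementPosition) throws IOException;

    // Data reading: read `count` elements from the current pointer and write
    // them into a pre-allocated, randomly accessible block.
    void next(BLOCK block, int count) throws IOException;
}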
AbstractLongColumnReader is an abstract class suitable for handling all data types represented by integers or long integers, such as int, bigint, decimal64, date, datetime, and so on. The main steps are as follows:
1. Parser initialization: A parser is built based on the input stream and stream information. For example, for the common bigint type, compression and decompression parsers are built based on the data and presence streams respectively. The data stream stores the compressed bigint value, while the presence stream stores the compressed null value information.
2. Multi-layer pointer management: The parser uses a three-layer pointer management mechanism.
3. Specified location positioning: You can use the positioning method to calculate the byte starting position and compression size based on the row group ID and element position and adjust the pointer of each layer to correctly locate the data to be read.
4. Data copy and block generation: You can use the reading method to copy memory data from the current pointer to randomly accessed blocks as needed. According to the different data types, allocate the corresponding type of block objects, parse the presence stream and data stream in the loop, and write the final data into the block to complete the block construction.
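The read loop for an integer-like column, following the steps above, can be sketched as follows: the presence decoder tells whether a position is null, and the data decoder supplies the next non-null value. The decoder and block types here are simplified assumptions, not the real AbstractLongColumnReader internals.

// Simplified read loop for an integer-like column.
public class LongColumnReadLoopSketch {
    interface BooleanDecoder { boolean next(); }   // decoded presence stream
    interface LongDecoder { long next(); }         // decoded (RunLength + LZ4) data stream

    static final class LongBlock {
        final long[] values;
        final boolean[] isNull;
        LongBlock(int capacity) { values = new long[capacity]; isNull = new boolean[capacity]; }
    }

    static LongBlock readBlock(BooleanDecoder presence, LongDecoder data, int count) {
        LongBlock block = new LongBlock(count);
        for (int i = 0; i < count; i++) {
            if (presence != null && !presence.next()) {
                block.isNull[i] = true;            // null position: no value in the data stream
            } else {
                block.values[i] = data.next();     // non-null: take the next decoded value
            }
        }
        return block;
    }
}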
Through the above design, the new ORC read path realizes precise control over each column and each row group, optimizing every aspect of data reading and parsing. The StripeLoader interface efficiently manages stream information and organizes I/O operations, while the ColumnReader interface is used to efficiently parse raw byte data and generate the block objects required by the executor. Combined with the abstract design of AbstractLongColumnReader, the entire reading process achieves efficiency and flexibility and adapts to the processing requirements of multiple data types. This customized ORC read path design provides strong support for large-scale data processing and significantly improves the performance and reliability of data reading.
The management of the level-3 cache system covers several key parts to ensure efficient data access and optimization of system performance. First, the input/output (I/O) process of the file system is the foundation of the system, which includes both network I/O and local persistent caching. The network I/O is responsible for the transmission of remote data, while the local persistent caching ensures the reliability and quick access of data stored locally. Second, file metadata management is responsible for maintaining the table files and index files stored on the Object Storage Service (OSS) instances to ensure the orderliness of the data structure and the efficiency of retrieval. Third, data source management is responsible for configuring and loading data sources under different engines to support diversified data input channels. Finally, file access scheduling is carried out through a Massively Parallel Processing (MPP) framework to allocate and schedule file access before execution reasonably, optimizing resource utilization and access efficiency.
In the Hadoop ecosystem, the Hadoop file system provides a unified file system abstraction that encapsulates a variety of data sources by implementing its core abstract classes. This design enables different types of data sources to be accessed through a unified interface, greatly simplifying the complexity of file operations. The Hadoop file system enables file reading and writing by initializing connections, opening file streams, and creating file streams, thus supporting distributed data storage and efficient data processing.
To further improve the efficiency of file access, PolarDB-X has designed and implemented a cache file system, which can support file access to both local persistent caching and remote OSS instances. The cache file system consists of two main components: one is responsible for handling file access on remote OSS instances, and the other is responsible for managing local caches, including data persistence and maintenance of cached files. Through this two-layer caching mechanism, the system can greatly speed up file access and improve the overall system performance while ensuring data consistency.
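The delegation between the two components can be illustrated with a much-simplified sketch. It deliberately does not extend the real org.apache.hadoop.fs.FileSystem class, and the OssFileReader interface and class names are assumptions used only to show the idea of serving local cache hits before falling back to OSS.

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Much-simplified illustration of a cache file system: serve the file from the
// local persistent cache when present, otherwise fall back to remote OSS.
public class CacheFileSystemSketch {
    interface OssFileReader {
        InputStream open(String remotePath) throws IOException;   // remote OSS access
    }

    private final OssFileReader ossReader;
    private final Path localCacheDir;

    CacheFileSystemSketch(OssFileReader ossReader, Path localCacheDir) {
        this.ossReader = ossReader;
        this.localCacheDir = localCacheDir;
    }

    InputStream open(String remotePath) throws IOException {
        Path cached = localCacheDir.resolve(Integer.toHexString(remotePath.hashCode()));
        if (Files.exists(cached)) {
            return Files.newInputStream(cached);                   // local persistent cache hit
        }
        return ossReader.open(remotePath);                         // fall back to remote OSS
    }
}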
The well-designed reading process of the level-3 cache system ensures efficient data acquisition and cache management. First, the system obtains the input stream of the target file through the cache file system, preparing for data reading. Next, the system reads a specific byte range of the file as needed. This process can flexibly access any position in the file, meeting various data requirements. The read data is then processed by the cache manager. The asynchronous disk flushing thread inside the cache manager regularly persists the data retrieved from remote locations to the local disk. It also maintains a range mapping that records the remote path, offset, and length information for each cache block. When adjacent or overlapping data blocks exist in the cache, the system triggers a file merging operation to merge multiple small files into a continuous file to optimize storage and access efficiency. At the same time, the cache statistics thread and file cleanup thread in the background regularly evaluate the size of the cache and clean up the expired caches to ensure the healthy and efficient operation of the cache system.
Multiple thread pools are designed internally by the cache manager to coordinate different cache management tasks. The first is the asynchronous disk flushing thread pool, which is responsible for periodically persisting the data obtained by network I/O to the local disk and maintaining the scope mapping of the cache. This thread pool automatically triggers a file merging operation when it detects adjacent or overlapping cache locations. Secondly, the cache statistics thread pool regularly counts the total size of the persisted local cache files to ensure that the total cache size does not exceed the preset upper limit, thus preventing excessive resource consumption. Finally, the cache file cleanup thread pool marks and clears those expired cache files that are no longer needed according to the Least Recently Used (LRU) policy to keep the cache system efficient and clean.
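The three background responsibilities map naturally onto scheduled executors. The sketch below shows one way to wire them up; the periods and task bodies are placeholders, not the real PolarDB-X settings.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the three background maintenance tasks of the cache manager.
public class CacheMaintenanceSketch {
    private final ScheduledExecutorService flushPool = Executors.newSingleThreadScheduledExecutor();
    private final ScheduledExecutorService statsPool = Executors.newSingleThreadScheduledExecutor();
    private final ScheduledExecutorService cleanupPool = Executors.newSingleThreadScheduledExecutor();

    void start() {
        // Persist recently downloaded ranges to disk and merge adjacent fragments.
        flushPool.scheduleWithFixedDelay(this::flushAndMerge, 1, 1, TimeUnit.SECONDS);
        // Recompute the total size of persisted cache files against the upper limit.
        statsPool.scheduleWithFixedDelay(this::refreshCacheSize, 10, 10, TimeUnit.SECONDS);
        // Drop expired cache files following an LRU-style policy.
        cleanupPool.scheduleWithFixedDelay(this::cleanupExpired, 60, 60, TimeUnit.SECONDS);
    }

    private void flushAndMerge() { /* write hot ranges to disk, merge overlapping files */ }
    private void refreshCacheSize() { /* sum file sizes, trigger eviction when over the limit */ }
    private void cleanupExpired() { /* delete files not accessed recently */ }
}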
The cache manager divides the cache into two types, hot cache and normal cache, to take full advantage of data access locality. The hot cache stores data that has been accessed within the last three seconds; this data has not yet been persisted and resides entirely in memory, so the hot cache offers faster access and suits frequently accessed data. The normal cache stores cached data that has already been persisted to the local disk and is suitable for long-term storage and less frequently accessed data. Through this cache hierarchy policy, the system ensures fast data access while efficiently utilizing storage resources, significantly improving overall query performance and system response speed.
In a database system based on the time slice mechanism and the pipeline execution framework, the difference in the speed of task producers and consumers may cause insufficient memory or system overload. To solve this problem, PolarDB-X has introduced a backpressure mechanism to control the workflow of producers and consumers, preventing frequent backpressure and scheduling overhead from degrading system performance.
The pipeline execution framework is a strategy in database query processing where components or operations of the execution plan are organized and executed in a pipeline form. Pipeline execution allows the current data batch to be processed in one stage while the output of the previous stage is processed in the next stage. This improves the overall performance and parallelism of the query.
The main concepts include:
• Driver: The minimum unit of execution and scheduling in the pipeline execution framework. It is responsible for processing the execution tasks of a set of operators.
• Chunk: The data unit in pipeline execution; the data blocks produced and consumed by drivers.
• Producers and consumers: Producer drivers are responsible for generating chunks, while consumer drivers are responsible for consuming chunks.
The backpressure mechanism is mainly used to solve the problem that the producer speed is much faster than the consumer speed and prevent the system from crashing due to the rapid rise of the memory level. The specific design is as follows:
• Producer pipeline: With a degree of parallelism of m, one producer driver is created for each degree of parallelism.
• Consumer pipeline: With a degree of parallelism of n, one consumer driver is created for each degree of parallelism, each equipped with a LocalBuffer for receiving chunks from the producers.
• LocalBuffer: It exists in the form of an unbounded non-blocking queue, managing the reading and writing of chunks through a linked list structure.
Without considering the memory limit, producers and consumers are free to read and write to the LocalBuffer. However, when the producer speed far exceeds the consumer speed, the memory level will rise rapidly, causing system instability. Therefore, a backpressure mechanism is introduced for control.
• memory_ledger: Records the total number of chunks in all LocalBuffers as a metric of memory usage.
• readable and writable signals: Implemented based on SettableFuture; they are used to control the execution state of producers and consumers.
• readable: A signal indicating that the buffer is readable.
• writable: A signal indicating that the buffer is writable.
Through these two signals, the system can coordinate the execution of producers and consumers to avoid memory overload.
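A minimal sketch of this signalling, built on Guava's SettableFuture, is shown below. The callback bodies, timeout value, and the way futures are recreated per round are placeholders; the real engine moves drivers between its own scheduling queues.

import com.google.common.util.concurrent.MoreExecutors;
import com.google.common.util.concurrent.SettableFuture;

import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Minimal sketch of readable/writable signalling for backpressure.
public class BackpressureSignalsSketch {
    // In real code these futures would be recreated after each completion.
    private volatile SettableFuture<Void> readable = SettableFuture.create();
    private volatile SettableFuture<Void> writable = SettableFuture.create();

    // Producer side: after writing chunks, mark the buffer readable so that
    // waiting consumer drivers can be rescheduled.
    void onChunksProduced() {
        readable.set(null);
    }

    // Consumer side: after draining chunks, mark the buffer writable so that
    // blocked producer drivers move back to the ready queue.
    void onChunksConsumed() {
        writable.set(null);
    }

    // Register what should happen when the buffer becomes readable,
    // e.g. moving consumer drivers to the highest-priority ready queue.
    void whenReadable(Runnable wakeUpConsumers) {
        readable.addListener(wakeUpConsumers, MoreExecutors.directExecutor());
    }

    // Consumer driver waits for data within a bounded time slice; on timeout it
    // yields its time slice and re-enters the blocking queue.
    boolean awaitReadable(long timeoutMillis) throws InterruptedException, ExecutionException {
        try {
            readable.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;                      // the buffer became readable in time
        } catch (TimeoutException e) {
            return false;                     // yield and enter the blocking queue
        }
    }
}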
i) A readable signal is generated and stored in the memory_ledger, representing that all consumer drivers are waiting for the event "the buffer is readable". A callback is registered on the readable signal; the actions defined in this callback are executed once the event completes, which happens in step iv).
ii) The consumer driver obtains the writable signal from the memory_ledger and calls writable.set(), which indicates that the event "the buffer is writable" has been completed. This triggers the registered callback, which moves all producer drivers from the blocking queue to the highest-priority ready queue.
iii) The consumer driver calls readable.get(), and the thread waits in place until one of two situations occurs: 1) after a timeout, the consumer driver yields its time slice and enters the blocking queue; 2) before the timeout, the event "the buffer is readable" is completed, ending the wait and allowing the consumer driver to continue reading chunks from the buffer.
iv) After a short period (usually a few dozen nanoseconds), the producer driver is taken from the ready queue by the thread group for scheduling and execution, and starts producing chunks and writing them into the LocalBuffer. The consumer driver then ends its wait and continues to read chunks from the buffer, consuming them and freeing the chunk memory.
This mechanism ensures speed matching between producers and consumers to avoid memory overload. It also reduces frequent context switching and improves the overall performance of the system.
To avoid excessive backpressure and scheduling overhead, the system dynamically adjusts the upper limit of buffer memory (Smax) to control the frequency of backpressure. This design brings the following benefits:
• Memory stability: The memory counter and signal mechanism ensure that memory usage stays within a controllable range and prevent memory overflow.
• Improved performance: This feature reduces frequent context switching, lowers scheduling overheads, and improves overall system performance.
• Dynamic adjustment: The Smax value is dynamically adjusted based on the actual running conditions to meet different loads and query requirements.
The PolarDB-X columnar query engine uses a compute-storage separation architecture. Before executing an analytical query, the compute node first checks whether the local cache holds the required columnar index data. If it does not, the data must be fetched from the remote OSS, which increases query latency. To resolve this issue, PolarDB-X provides a cache warmup feature that allows data to be loaded into the local cache in advance. This avoids fetching data from the remote end during actual queries, thereby improving query performance and stability.
• Cache management: It is a passive filling process. Only when the upper-level cache misses, data is fetched and parsed from the lower-level cache or storage service and filled into the upper-level cache.
• Cache warmup: It is an active cache-filling process, where specified data is placed into the cache in advance. It enhances cache management and is suitable for scenarios where the data to be queried is known in advance.
PolarDB-X provides the WARMUP syntax for implementing cache warmup. This syntax takes effect only on columnar read-only instances.
WARMUP [cron_expression]
SELECT
select_expr [, select_expr] ...
[FROM table_references]
[WHERE where_condition]
[GROUP BY {col_name | expr | position}
[ASC | DESC], ... ]
[HAVING where_condition]
[ORDER BY {col_name | expr | position}
[ASC | DESC], ...]
[LIMIT {[offset,] row_count | row_count OFFSET offset}]
cron_expression:
('cron_string') | <empty>
select_expr:
column_name
| aggregate_function(column_name)
| column_name AS alias_name
| expression
Parameters:
• cron_expression: The time expression used to periodically execute warmup tasks, in a format similar to the Unix cron expressions.
• select_expr: Specify the columns or expressions to be warmed up.
1. Regular warmup of common columns
WARMUP('*/2 * * * *')
SELECT col1, col2, col3... FROM your_table;
Function: Periodically warm up the common columns of the destination table every 2 minutes.
2. Batch warmup at night
WARMUP('30 21 * * *')
SELECT col1, col2, col3 ...
FROM table1
LEFT JOIN table2 ON table1.id1 = table2.id2
LEFT JOIN table3 ON table1.id1 = table3.id3
WHERE table1.col_date > '2024-01-01';
Function: Warm up the business data of the day at 21:30 every night to ensure quick response to queries at 22:00.
3. Warmup before performance testing
WARMUP SELECT * FROM lineitem;
WARMUP SELECT * FROM orders;
WARMUP SELECT * FROM customer;
WARMUP SELECT * FROM part;
WARMUP SELECT * FROM partsupp;
WARMUP SELECT * FROM supplier;
WARMUP SELECT * FROM region;
WARMUP SELECT * FROM nation;
Function: All tables are warmed up in one batch to ensure a high cache hit ratio during performance testing and more accurate test results.
4. Scheduled batch warmup
WARMUP('*/5 1-5 * * *')
SELECT * FROM lineitem;
Function: During the O&M window from 01:00 to 05:00 every day, the data of the specified table is warmed up every 5 minutes to ensure stable query performance during O&M.
The cache warmup feature actively fills the cache to eliminate the overhead of fetching data from the remote end during queries, greatly improving query performance and stability. By properly configuring cache warmup in line with business access patterns and query requirements, PolarDB-X can provide users with a more efficient and reliable data query experience.
In the era of big data, how to efficiently store and query massive data has become a key issue. The PolarDB-X columnar query engine effectively optimizes the performance of ORC columnar queries through cache hierarchy management and refined backpressure mechanisms, solving the bottleneck of query speed in large-scale datasets. At the same time, the flexible cache warmup and the optimized ORC read path design enable PolarDB-X to provide efficient and stable query services in complex and changeable business scenarios.
PolarDB-X, through continuous technological innovation and optimization, is committed to providing users with more efficient and reliable data query solutions. In the era of big data, it helps enterprises maximize the value of their data.