During the 2020 Double 11 Global Shopping Festival, Alibaba's cloud-native real-time data warehouse was used in core data scenarios for the first time and set a new record for the big data platform. The data warehouse was built on Hologres and Realtime Compute for Apache Flink. A series of articles about the practices of the cloud-native real-time data warehouse during Double 11 will be published gradually. This article introduces the storage engine of Hologres and analyzes its implementation principles and core technical advantages in depth.
Hologres is a cloud-native Hybrid Serving/Analytical Processing (HSAP) system developed by Alibaba Cloud. It targets real-time big data serving and analytical scenarios, is fully compatible with the PostgreSQL protocol, and integrates seamlessly with the big data ecosystem. By simplifying the business architecture while enabling real-time decision-making, Hologres helps big data deliver greater business value. For more information about the system architecture, please check this link.
Compared with traditional big data and Online Analytical Processing (OLAP) systems, HSAP systems face new challenges: they must handle highly concurrent mixed workloads that combine high-QPS point queries with complex analytical queries, sustain high-throughput real-time writes, and keep data queryable as soon as it is written.
To address these challenges, Alibaba Cloud developed a new storage engine for Hologres. The storage engine is mainly responsible for managing and processing data, including creating, retrieving, updating, and deleting (CRUD) data. Its design and implementation provide the high throughput, high concurrency, low latency, elasticity, and scalability required in HSAP scenarios. Based on the business demands of Alibaba Group and cloud customers, Alibaba Cloud continues to innovate and optimize the storage engine, which now supports PB-level storage for a single table. During the 2020 Double 11 Global Shopping Festival, the Hologres storage engine smoothly supported hundreds of billions of point queries and tens of millions of real-time complex queries in core business scenarios.
The following sections describe the implementation principles and technical highlights of the Hologres storage engine.
The basic abstraction of the Hologres storage engine is the distributed table. To achieve scalability, a table is split into shards. In addition, users may need to store several related tables together to efficiently support scenarios such as JOIN and multi-table update. For this reason, Hologres introduces the concept of the table group: a group of tables with identical sharding policies constitutes a table group, and all tables in the same table group have the same number of shards. Users can specify the number of shards of a table through shard_count and the sharding columns through distribution_key. Currently, only hash sharding is supported.
Tables can be stored in either row storage or column storage format, which is specified through orientation.
The records of each table are stored in a specific order, which can be specified through clustering_key. If no sorting columns are specified, the storage engine sorts data in insertion order. Selecting appropriate sorting columns can greatly improve the performance of some queries.
Tables also support multiple types of indexes. Currently, dictionary indexes and bitmap indexes are supported. Users can specify the columns to be indexed through dictionary_encoding_columns and bitmap_columns.
Example:
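The original DDL listing is not reproduced here, so the following is a reconstructed sketch based on the description below. It uses TPC-H-style column types, trims each table to the columns mentioned in the text, and assumes a composite primary key on (L_ORDERKEY, L_LINENUMBER); the exact columns, types, and property calls in the original example may differ.

BEGIN;
CREATE TABLE LINEITEM (
    L_ORDERKEY     BIGINT NOT NULL,
    L_LINENUMBER   INT    NOT NULL,
    L_SHIPINSTRUCT TEXT,
    PRIMARY KEY (L_ORDERKEY, L_LINENUMBER)  -- the engine builds an index to enforce uniqueness
);
CALL set_table_property('LINEITEM', 'shard_count', '24');                -- 24 shards for the table group
CALL set_table_property('LINEITEM', 'distribution_key', 'L_ORDERKEY');  -- hash sharding column
CALL set_table_property('LINEITEM', 'dictionary_encoding_columns', 'L_SHIPINSTRUCT');
CALL set_table_property('LINEITEM', 'bitmap_columns', 'L_ORDERKEY,L_LINENUMBER,L_SHIPINSTRUCT');

CREATE TABLE ORDERS (
    O_ORDERKEY    BIGINT NOT NULL,
    O_CUSTKEY     BIGINT NOT NULL,
    O_ORDERSTATUS TEXT
);
CALL set_table_property('ORDERS', 'colocate_with', 'LINEITEM');          -- same table group, so also 24 shards
CALL set_table_property('ORDERS', 'distribution_key', 'O_ORDERKEY');
CALL set_table_property('ORDERS', 'dictionary_encoding_columns', 'O_ORDERSTATUS');
CALL set_table_property('ORDERS', 'bitmap_columns', 'O_ORDERKEY,O_CUSTKEY,O_ORDERSTATUS');
COMMIT;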
In this example, two tables, LINEITEM and ORDERS, are created. Because the LINEITEM table has a PRIMARY KEY specified, the storage engine automatically creates an index to ensure the uniqueness of the primary key. The colocate_with property puts the two tables into the same table group, which is divided into 24 shards in this example, as specified through shard_count. The LINEITEM table is sharded on the value of the L_ORDERKEY field, while the ORDERS table is sharded on the value of the O_ORDERKEY field. Dictionary indexes are created on the L_SHIPINSTRUCT field of the LINEITEM table and on the O_ORDERSTATUS field of the ORDERS table. Bitmap indexes are created on the L_ORDERKEY, L_LINENUMBER, and L_SHIPINSTRUCT fields of the LINEITEM table, and on the O_ORDERKEY, O_CUSTKEY, and O_ORDERSTATUS fields of the ORDERS table.
Each table group shard (“shard” for short) is a unit of storage management and recovery. The figure above shows the basic architecture of a shard. A shard consists of multiple tablets, which share a Write-Ahead Log (WAL). The storage engine uses Log-Structured Merge (LSM) technology, so all new data is inserted in an append-only fashion: data is first written to the MemTable of the corresponding tablet and then flushed to a file when the MemTable reaches a certain threshold. Once a data file is completely generated, its contents never change; new data and subsequent updates are written to new files. Compared with the B+-tree structure of traditional databases, LSM technology reduces random I/O and greatly improves write performance.
As data is continuously written, many files accumulate on each tablet. When the small files in a tablet reach a certain number, the storage engine compacts them in the background. This way, the system does not need to keep many files open at the same time, which reduces the use of system resources. More importantly, compaction reduces the number of files and improves query performance.
In terms of Data Manipulation Language (DML), the storage engine provides interfaces for CRUD operations on single records or batches of records. Query engines access stored data through these interfaces.
The important components of the storage engine are listed below:
WAL Manager manages the log files. The storage engine uses the WAL to ensure the atomicity and durability of data: when a CUD operation occurs, the storage engine writes the WAL first and then applies the content of the WAL record to the MemTable of the corresponding tablet. When the MemTable reaches a certain size or has been open for a certain period of time, it is converted into an immutable flushing MemTable, and a new MemTable is created to receive new write requests. The flushing MemTable can then write its contents to disk and generate an immutable file; once the file is generated, the data in it is considered persistent. When the system restarts after a crash, it replays the WAL to recover the data that has not yet been persisted. WAL Manager deletes a log file only after all of its corresponding data has been persisted.
Each tablet stores its data in a group of files, which reside in a distributed file system (DFS) such as Alibaba Cloud's Pangu or Apsara File Storage for HDFS. Row storage files are stored in the Sorted String Table (SST) format. Column storage files support two formats: a proprietary PAX-like format and an improved version of the Apache ORC format. Both column storage formats are optimized for file scanning scenarios.
To reduce the I/O incurred by reading data from disk, the storage engine uses a Block Cache to keep frequently and recently used data in memory, which cuts unnecessary I/O and improves read performance. All shards on the same node share one Block Cache. The Block Cache supports two replacement policies: Least Recently Used (LRU) and Least Frequently Used (LFU). As the names imply, the LRU policy evicts the block that has not been used for the longest time, while the LFU policy evicts the block that is least frequently accessed within a certain period.
Hologres supports two types of writes: single-shard writes and distributed batch writes. Both are atomic: a write either succeeds completely or is rolled back. A single-shard write updates only one shard at a time but must sustain an extremely high write frequency. A distributed batch write is used when a large amount of data is written to multiple shards as a single transaction, and it is usually executed at a much lower frequency.
As shown in the figure above, when the WAL Manager receives a single-shard write request, it first assigns a Log Sequence Number (LSN) to the request, then creates a new log entry containing the LSN and the written data and persists it in the file system; only after the log entry is completely flushed is the data applied to the MemTable of the corresponding tablet, where it becomes visible to new read requests.
For a distributed batch write, the frontend node that receives the write request distributes it to all relevant shards, and these shards use a two-phase commit mechanism to guarantee the atomicity of the write.
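As an illustration (not from the original article), the statements below show the two kinds of writes against the ORDERS table from the earlier sketch: a single-row insert of the kind that is typically routed to the one shard owning the key, and a bulk load that spans many shards and commits as one atomic batch; staging_orders is a hypothetical source table.

-- High-frequency single-row write: with hash sharding on O_ORDERKEY,
-- this row lands on the single shard that owns key 1001.
INSERT INTO ORDERS (O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS) VALUES (1001, 37, 'O');

-- Distributed batch write: touches many shards and either succeeds as a whole or is rolled back.
INSERT INTO ORDERS (O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS)
SELECT O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS
FROM staging_orders;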
Hologres supports multi-version reads of the data in a tablet. The consistency of read requests is ensured through read-your-writes semantics: a client always sees its own latest committed writes. Each read request carries a reading timestamp, which is used to form a snapshot for the read. Any row whose LSN is greater than the LSN of the snapshot is filtered out, because it was inserted into the tablet after the snapshot was taken.
The storage engine adopts a storage-compute separation architecture, in which all data files are stored in a DFS such as Alibaba Cloud's Pangu or Apsara File Storage for HDFS. When query loads grow and more computing resources are required, the computing resources can be scaled out independently; likewise, storage resources can be scaled out independently to deal with rapidly growing data volumes. This ensures that resources can be scaled out quickly without waiting for data to be copied or moved. Moreover, the multi-replica mechanism of the DFS ensures the high availability of data. This architecture simplifies O&M (operations and maintenance) and guarantees system stability.
The storage engine uses an event-triggered, non-blocking, purely asynchronous execution architecture, which fully exploits the multi-core processing capability of modern CPUs, improves throughput, and supports high-concurrency writes and queries. This architecture benefits from the HoloOS (HOS) framework, which provides efficient asynchronous execution and concurrency and automatically balances load across CPU cores to improve system utilization.
There are two main query modes in HSAP scenarios: simple point queries in data serving scenarios, and complex queries that scan massive amounts of data in analytical scenarios. (Of course, many other query modes exist for other scenarios.) The two query modes have different requirements for data storage: row storage serves point queries more efficiently, while column storage has clear advantages for queries that scan large amounts of data.
To support the various query modes, unified real-time storage is very important. The storage engine supports both the row storage and column storage formats. Depending on user requirements, a tablet can use row storage (suitable for serving scenarios) or column storage (suitable for analytical scenarios). For example, in a typical HSAP scenario, many users store their data in column storage format to facilitate large-scale scans and analysis, while the index of the data is stored in row storage format for point queries. In addition, data duplication is prevented by defining a primary key constraint, which is implemented with row storage. The read and write interfaces are the same for both formats; users only need to specify the storage format when creating a table.
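For illustration, assuming the ORDERS table from the earlier sketch, the first query below is a serving-style point lookup that row storage handles efficiently, while the second is an analytical scan and aggregation that benefits from column storage.

-- Point query (serving scenario): fetch a single order by its key.
SELECT O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS
FROM ORDERS
WHERE O_ORDERKEY = 1001;

-- Analytical query: scan a large part of the table and aggregate.
SELECT O_ORDERSTATUS, COUNT(*) AS order_cnt
FROM ORDERS
GROUP BY O_ORDERSTATUS;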
The storage engine adopts snapshot-read semantics: a read operation sees the state of the data at the time the read starts. No data locks are required, and reads are not blocked by writes. New writes, which are append-only, are not blocked by reads either. This provides excellent support for the high-concurrency hybrid workloads of HSAP.
The storage engine provides multiple types of indexes to improve query efficiency. A table supports clustered and non-clustered indexes: only one clustered index, which contains all columns of the table, is allowed, while a table can have multiple non-clustered indexes. In a non-clustered index, besides the non-clustered index key used for sorting, there is a Row Identifier (RID) used to locate the row in the clustered index. If a unique clustered index exists, its key is used as the RID; otherwise, the storage engine generates a unique RID. To improve query efficiency, additional columns can be included in a non-clustered index, so that some queries can obtain all needed column values by scanning a single index (a covering index).
Inside a data file, the storage engine supports dictionary indexes and bitmap indexes. Dictionary indexes can be used to improve the efficiency of string processing and the compression ratio of data. Bitmap indexes can be used to efficiently filter out unwanted records.
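As a hypothetical example based on the tables above, O_ORDERSTATUS was declared both dictionary-encoded and bitmap-indexed, so an equality filter on it can prune non-matching records inside each data file before the remaining columns are read.

-- The bitmap index filters records within each file; dictionary encoding
-- speeds up the string comparison and improves compression.
SELECT O_ORDERKEY, O_CUSTKEY
FROM ORDERS
WHERE O_ORDERSTATUS = 'F';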
The design and development of the storage engine focus on better supporting HSAP scenarios: it efficiently supports high-throughput real-time writes and interactive queries, and it is also optimized for offline batch writes. As ingestion workloads and storage volumes grow exponentially every year, Hologres has withstood severe challenges and successfully supported Double 11 for several years. "Hologres withstands real-time data peaks of 596 million records per second and stores up to 2.5 PB of data in a single table. Over trillions of records, Hologres provides multi-dimensional analysis and serving, and 99.99% of queries return results within 80 ms." These figures demonstrate the technical capability of the Hologres storage engine and prove the advantages of its architecture and implementation.
Hologres is still a new product, and HSAP is a new concept. "The best achievement today is only the cornerstone of progress tomorrow." Alibaba Cloud learns from customer feedback, and continues to improve Hologres. More optimizations will be conducted on stability, usability, functions, and performance in the future.
There will be more articles introducing technologies related to the HSAP storage engine and other core systems. Please stay tuned!
Please check the VLDB 2020 article to learn more: Alibaba Hologres: A Cloud-Native Service for Hybrid Serving/Analytical Processing