During the 2020 Double 11 Global Shopping Festival, Alibaba's cloud-native real-time data warehouse was used in core data scenarios for the first time and set a new record for the big data platform. The data warehouse was built on Hologres and Realtime Compute for Apache Flink. A series of articles about the practices of the cloud-native real-time data warehouse during Double 11 will be published gradually. This article introduces the storage engine of Hologres and analyzes its implementation principles and core technical advantages in depth.
Hologres is a cloud-native Hybrid Serving/Analytical Processing (HSAP) system developed by Alibaba Cloud. It targets real-time big data serving and analytical scenarios, is fully compatible with the PostgreSQL protocol, and integrates seamlessly with the big data ecosystem. By simplifying the business architecture while enabling real-time decision-making, Hologres helps big data deliver greater business value. For more information about the system architecture, please check this link.
Compared with traditional big data and Online Analytical Processing (OLAP) systems, HSAP systems face new challenges: they must handle highly concurrent mixed workloads that combine high-QPS point queries with complex analytical queries, sustain high-throughput real-time writes, and keep data queryable as soon as it is written.
To address these challenges, Alibaba Cloud developed a new storage engine for Hologres. The storage engine is mainly responsible for managing and processing data, including creating, retrieving, updating, and deleting (CRUD) data. Its design and implementation provide the high throughput, high concurrency, low latency, elasticity, and scalability required in HSAP scenarios. Based on the business demands of Alibaba Group and cloud customers, Alibaba Cloud continues to innovate and optimize the storage engine, which now supports PB-level storage for a single table. During the 2020 Double 11 Global Shopping Festival, the Hologres storage engine smoothly supported hundreds of billions of point queries and tens of millions of real-time complex queries in core business scenarios.
The following sections describe the implementation principles and technical highlights of the Hologres storage engine.
The basic abstraction of the Hologres storage engine is the distributed table. To achieve scalability, a table is split into shards. In addition, users may need to store several related tables together to efficiently support scenarios such as JOIN and multi-table update. For this reason, Hologres introduces the concept of the table group: a group of tables with identical sharding policies constitutes a table group, and all tables in the same table group have the same number of shards. Users can specify the number of shards of a table through shard_count and the sharding columns through distribution_key. Currently, only hash sharding is supported.
Tables can be stored in either row storage or column storage format, which is specified through orientation.
The records of each table are stored in a specific order, which can be specified through clustering_key. If no sorting columns are specified, the storage engine sorts data in insertion order. Selecting appropriate sorting columns can greatly improve the performance of some queries.
Tables also support multiple types of indexes. Currently, dictionary indexes and bitmap indexes are supported. Users can specify the columns to be indexed through dictionary_encoding_columns and bitmap_columns.
Example:
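The original DDL listing is not reproduced here, so the following is a reconstructed sketch based on the description below. It uses TPC-H-style column types, trims each table to the columns mentioned in the text, and assumes a composite primary key on (L_ORDERKEY, L_LINENUMBER); the exact columns, types, and property calls in the original example may differ.

BEGIN;
CREATE TABLE LINEITEM (
    L_ORDERKEY     BIGINT NOT NULL,
    L_LINENUMBER   INT    NOT NULL,
    L_SHIPINSTRUCT TEXT,
    PRIMARY KEY (L_ORDERKEY, L_LINENUMBER)  -- the engine builds an index to enforce uniqueness
);
CALL set_table_property('LINEITEM', 'shard_count', '24');                -- 24 shards for the table group
CALL set_table_property('LINEITEM', 'distribution_key', 'L_ORDERKEY');  -- hash sharding column
CALL set_table_property('LINEITEM', 'dictionary_encoding_columns', 'L_SHIPINSTRUCT');
CALL set_table_property('LINEITEM', 'bitmap_columns', 'L_ORDERKEY,L_LINENUMBER,L_SHIPINSTRUCT');

CREATE TABLE ORDERS (
    O_ORDERKEY    BIGINT NOT NULL,
    O_CUSTKEY     BIGINT NOT NULL,
    O_ORDERSTATUS TEXT
);
CALL set_table_property('ORDERS', 'colocate_with', 'LINEITEM');          -- same table group, so also 24 shards
CALL set_table_property('ORDERS', 'distribution_key', 'O_ORDERKEY');
CALL set_table_property('ORDERS', 'dictionary_encoding_columns', 'O_ORDERSTATUS');
CALL set_table_property('ORDERS', 'bitmap_columns', 'O_ORDERKEY,O_CUSTKEY,O_ORDERSTATUS');
COMMIT;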
In this example, two tables, LINEITEM and ORDERS, are created. Because the LINEITEM table has a PRIMARY KEY specified, the storage engine automatically creates an index to ensure the uniqueness of the primary key. The colocate_with property puts the two tables into the same table group, which is divided into 24 shards in this example, as specified through shard_count. The LINEITEM table is sharded on the value of the L_ORDERKEY field, while the ORDERS table is sharded on the value of the O_ORDERKEY field. Dictionary indexes are created on the L_SHIPINSTRUCT field of the LINEITEM table and on the O_ORDERSTATUS field of the ORDERS table. Bitmap indexes are created on the L_ORDERKEY, L_LINENUMBER, and L_SHIPINSTRUCT fields of the LINEITEM table, and on the O_ORDERKEY, O_CUSTKEY, and O_ORDERSTATUS fields of the ORDERS table.
Each table group shard (“shard” for short) is a unit of storage management and recovery. The figure above shows the basic architecture of a shard. A shard consists of multiple tablets, which share a Write-Ahead Log (WAL). The storage engine uses Log-Structured Merge (LSM) technology, so all new data is inserted in an append-only fashion: data is first written to the MemTable of the corresponding tablet and then flushed to a file when the MemTable reaches a certain threshold. Once a data file is completely generated, its contents never change; new data and subsequent updates are written to new files. Compared with the B+-tree structure of traditional databases, LSM technology reduces random I/O and greatly improves write performance.
As data is continuously written, many files accumulate on each tablet. When the small files in a tablet reach a certain number, the storage engine compacts them in the background. This way, the system does not need to keep many files open at the same time, which reduces the use of system resources. More importantly, compaction reduces the number of files and improves query performance.
In terms of Data Manipulation Language (DML), the storage engine provides interfaces for CRUD operations on single records or batches of records. Query engines access stored data through these interfaces.
The important components of the storage engine are listed below:
WAL Manager manages the log files. The storage engine uses the WAL to ensure the atomicity and durability of data: when a CUD operation occurs, the storage engine writes the WAL first and then applies the content of the WAL record to the MemTable of the corresponding tablet. When the MemTable reaches a certain size or has been open for a certain period of time, it is converted into an immutable flushing MemTable, and a new MemTable is created to receive new write requests. The flushing MemTable can then write its contents to disk and generate an immutable file; once the file is generated, the data in it is considered persistent. When the system restarts after a crash, it replays the WAL to recover the data that has not yet been persisted. WAL Manager deletes a log file only after all of its corresponding data has been persisted.
Each tablet stores its data in a group of files, which reside in a distributed file system (DFS) such as Alibaba Cloud's Pangu or Apsara File Storage for HDFS. Row storage files are stored in the Sorted String Table (SST) format. Column storage files support two formats: a proprietary PAX-like format and an improved version of the Apache ORC format. Both column storage formats are optimized for file scanning scenarios.
To reduce the I/O incurred by reading data from disk, the storage engine uses a Block Cache to keep frequently and recently used data in memory, which cuts unnecessary I/O and improves read performance. All shards on the same node share one Block Cache. The Block Cache supports two replacement policies: Least Recently Used (LRU) and Least Frequently Used (LFU). As the names imply, the LRU policy evicts the block that has not been used for the longest time, while the LFU policy evicts the block that is least frequently accessed within a certain period.
Hologres supports two types of writes: single-shard writes and distributed batch writes. Both are atomic: a write either succeeds completely or is rolled back. A single-shard write updates only one shard at a time but must sustain an extremely high write frequency. A distributed batch write is used when a large amount of data is written to multiple shards as a single transaction, and it is usually executed at a much lower frequency.
As shown in the figure above, when the WAL Manager receives a single-shard write request, it first assigns a Log Sequence Number (LSN) to the request, then creates a new log entry containing the LSN and the written data and persists it in the file system; only after the log entry is completely flushed is the data applied to the MemTable of the corresponding tablet, where it becomes visible to new read requests.
For a distributed batch write, the frontend node that receives the write request distributes it to all relevant shards, and these shards use a two-phase commit mechanism to guarantee the atomicity of the write.
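As an illustration (not from the original article), the statements below show the two kinds of writes against the ORDERS table from the earlier sketch: a single-row insert of the kind that is typically routed to the one shard owning the key, and a bulk load that spans many shards and commits as one atomic batch; staging_orders is a hypothetical source table.

-- High-frequency single-row write: with hash sharding on O_ORDERKEY,
-- this row lands on the single shard that owns key 1001.
INSERT INTO ORDERS (O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS) VALUES (1001, 37, 'O');

-- Distributed batch write: touches many shards and either succeeds as a whole or is rolled back.
INSERT INTO ORDERS (O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS)
SELECT O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS
FROM staging_orders;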
Hologres supports multi-version reads of the data in a tablet. The consistency of read requests is ensured through read-your-writes semantics: a client always sees its own latest committed writes. Each read request carries a reading timestamp, which is used to form a snapshot for the read. Any row whose LSN is greater than the LSN of the snapshot is filtered out, because it was inserted into the tablet after the snapshot was taken.
The storage engine adopts a storage-compute separation architecture, in which all data files are stored in a DFS such as Alibaba Cloud's Pangu or Apsara File Storage for HDFS. When query loads grow and more computing resources are required, the computing resources can be scaled out independently; likewise, storage resources can be scaled out independently to deal with rapidly growing data volumes. This ensures that resources can be scaled out quickly without waiting for data to be copied or moved. Moreover, the multi-replica mechanism of the DFS ensures the high availability of data. This architecture simplifies O&M (operations and maintenance) and guarantees system stability.
The storage engine uses an event-triggered, non-blocking, purely asynchronous execution architecture, which fully exploits the multi-core processing capability of modern CPUs, improves throughput, and supports high-concurrency writes and queries. This architecture benefits from the HoloOS (HOS) framework, which provides efficient asynchronous execution and concurrency and automatically balances load across CPU cores to improve system utilization.
There are two main query modes in HSAP scenarios: simple point queries in data serving scenarios, and complex queries that scan massive amounts of data in analytical scenarios. (Of course, many other query modes exist for other scenarios.) The two query modes have different requirements for data storage: row storage serves point queries more efficiently, while column storage has clear advantages for queries that scan large amounts of data.
To support the various query modes, unified real-time storage is very important. The storage engine supports both the row storage and column storage formats. Depending on user requirements, a tablet can use row storage (suitable for serving scenarios) or column storage (suitable for analytical scenarios). For example, in a typical HSAP scenario, many users store their data in column storage format to facilitate large-scale scans and analysis, while the index of the data is stored in row storage format for point queries. In addition, data duplication is prevented by defining a primary key constraint, which is implemented with row storage. The read and write interfaces are the same for both formats; users only need to specify the storage format when creating a table.
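For illustration, assuming the ORDERS table from the earlier sketch, the first query below is a serving-style point lookup that row storage handles efficiently, while the second is an analytical scan and aggregation that benefits from column storage.

-- Point query (serving scenario): fetch a single order by its key.
SELECT O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS
FROM ORDERS
WHERE O_ORDERKEY = 1001;

-- Analytical query: scan a large part of the table and aggregate.
SELECT O_ORDERSTATUS, COUNT(*) AS order_cnt
FROM ORDERS
GROUP BY O_ORDERSTATUS;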
The storage engine adopts snapshot-read semantics: a read operation sees the state of the data at the time the read starts. No data locks are required, and reads are not blocked by writes. New writes, which are append-only, are not blocked by reads either. This provides excellent support for the high-concurrency hybrid workloads of HSAP.
The storage engine provides multiple types of indexes to improve query efficiency. A table supports clustered and non-clustered indexes: only one clustered index, which contains all columns of the table, is allowed, while a table can have multiple non-clustered indexes. In a non-clustered index, besides the non-clustered index key used for sorting, there is a Row Identifier (RID) used to locate the row in the clustered index. If a unique clustered index exists, its key is used as the RID; otherwise, the storage engine generates a unique RID. To improve query efficiency, additional columns can be included in a non-clustered index, so that some queries can obtain all needed column values by scanning a single index (a covering index).
Inside a data file, the storage engine supports dictionary indexes and bitmap indexes. Dictionary indexes can be used to improve the efficiency of string processing and the compression ratio of data. Bitmap indexes can be used to efficiently filter out unwanted records.
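As a hypothetical example based on the tables above, O_ORDERSTATUS was declared both dictionary-encoded and bitmap-indexed, so an equality filter on it can prune non-matching records inside each data file before the remaining columns are read.

-- The bitmap index filters records within each file; dictionary encoding
-- speeds up the string comparison and improves compression.
SELECT O_ORDERKEY, O_CUSTKEY
FROM ORDERS
WHERE O_ORDERSTATUS = 'F';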
The design and development of the storage engine focus on better supporting HSAP scenarios: it efficiently supports high-throughput real-time writes and interactive queries, and it is also optimized for offline batch writes. As ingestion workloads and storage volumes grow exponentially every year, Hologres has withstood severe challenges and successfully supported Double 11 for several years. "Hologres withstands real-time data peaks of 596 million records per second and stores up to 2.5 PB of data in a single table. Over trillions of records, Hologres provides multi-dimensional analysis and serving, and 99.99% of queries return results within 80 ms." These figures demonstrate the technical capability of the Hologres storage engine and prove the advantages of its architecture and implementation.
Hologres is still a new product, and HSAP is a new concept. "The best achievement today is only the cornerstone of progress tomorrow." Alibaba Cloud learns from customer feedback, and continues to improve Hologres. More optimizations will be conducted on stability, usability, functions, and performance in the future.
There will be more articles introducing technologies related to the HSAP storage engine and other core systems. Please stay tuned!
Please check the VLDB 2020 article to learn more: Alibaba Hologres: A Cloud-Native Service for Hybrid Serving/Analytical Processing