An Introduction to the System Architecture Abstraction of the Replicated State Machine

By Yan Xiangguang (Xiangguang)

A replicated state machine lets multiple machines work together as a single, more reliable unit. It is widely used for data replication and high availability. This article starts from the replicated state machine model and, drawing on recent research in the industry, explains how to abstract the architecture of a replicated state machine system.

In a replicated state machine, multiple machines hold the same state and run the same deterministic state machine, so together they behave like a single, more reliable machine. If a few machines go down, overall availability is not affected.

The replicated state machine is a basic method for achieving fault tolerance. It is widely used for data replication and high availability and has long been a hot topic in both industry and academia. More and more systems use replicated state machines to achieve high availability, such as ZooKeeper, etcd, MySQL Group Replication, and TiDB, and research on replication protocols and system architectures keeps emerging. How can we abstract the architecture of a replicated state machine system to make it more general and easier to use? Starting from the replicated state machine model and drawing on recent research in the industry, this article summarizes the architectural abstractions of replicated state machine systems, which can serve as a reference for system architecture design.

1. Replicated State Machine

A replicated state machine is a group of machines that hold the same state and run the same deterministic state machine. Together, these machines provide a single service to the outside, and the failure of some machines does not affect overall availability. The Raft paper describes the replicated state machine architecture shown in Figure 1. The replicated state machine is implemented with a replicated log, and a consensus protocol keeps the replicated log consistent across machines.
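The role of determinism can be illustrated with a minimal sketch. It assumes a toy key-value state machine (the class name KVStateMachine and the command format are hypothetical, not taken from Raft) and skips replication entirely: the same log is simply handed to every replica.

```python
class KVStateMachine:
    """A deterministic state machine: the same log always yields the same state."""

    def __init__(self):
        self.state = {}

    def apply(self, command):
        # Each command is a hypothetical (op, key, value) tuple.
        op, key, value = command
        if op == "set":
            self.state[key] = value
        elif op == "delete":
            self.state.pop(key, None)


# The log that a consensus protocol would keep identical on every machine.
log = [("set", "x", 1), ("set", "y", 2), ("delete", "x", None), ("set", "y", 3)]

# Three replicas apply the same log in the same order.
replicas = [KVStateMachine() for _ in range(3)]
for replica in replicas:
    for command in log:
        replica.apply(command)

states = [replica.state for replica in replicas]
```

Because apply depends only on the current state and the command, all entries of states are equal. Anything non-deterministic inside apply, such as reading the local clock, would break this property.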

Figure 1: Raft's Proposed Replicated State Machine Architecture

As Figure 1 shows, the core of the replicated state machine is the replicated log, and the consensus protocol is the specific method used to implement it. Therefore, as shown in Figure 2, the replicated state machine can be further abstracted into two parts: the upper-layer business state machine and the underlying replicated log. The business state machine is responsible for the business logic and does not care about the details of the replicated log. When it needs to replicate a log entry, it writes the entry directly to the replicated log module. The replicated log module uses the consensus protocol to replicate the entry to other nodes and, once the entry is committed, notifies the business state machine to execute the operation it contains. The details of the consensus protocol are hidden inside the replicated log layer, so the business logic and the consensus protocol can evolve independently without affecting each other. The replicated log can also be built as a common module, so that different business state machines reuse the same replicated log code.
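The division of labor in Figure 2 can be sketched as follows. This is an illustration under simplifying assumptions, not the interface of any real library: ReplicatedLog, subscribe, and CounterStateMachine are hypothetical names, and the consensus round is replaced by a local list append.

```python
class ReplicatedLog:
    """Stand-in for the replicated log module linked into each node."""

    def __init__(self):
        self.entries = []
        self.listeners = []

    def subscribe(self, on_commit):
        # Each node's business state machine registers a commit callback.
        self.listeners.append(on_commit)

    def append(self, entry):
        # Assumption: a real module would run a consensus protocol here and
        # only continue once the entry is committed.
        self.entries.append(entry)
        index = len(self.entries) - 1
        for on_commit in self.listeners:
            on_commit(index, entry)
        return index


class CounterStateMachine:
    """Business logic only; it knows nothing about how entries are replicated."""

    def __init__(self, log):
        self.log = log
        self.value = 0
        log.subscribe(self.on_commit)

    def increment(self, amount):
        # Write the operation to the log instead of mutating state directly.
        return self.log.append({"op": "add", "amount": amount})

    def on_commit(self, index, entry):
        if entry["op"] == "add":
            self.value += entry["amount"]


log = ReplicatedLog()
nodes = [CounterStateMachine(log) for _ in range(3)]
nodes[0].increment(5)
nodes[1].increment(2)
values = [node.value for node in nodes]
```

State changes only inside on_commit, never inside increment. That is what allows the replicated log implementation to be swapped without touching the business logic.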

Figure 2: The replicated state machine is abstracted into the upper-layer business state machine and the underlying replicated log.

The abstraction shown in Figure 2 is already fairly common and is used by many replicated state machine systems in the industry. In this architecture, however, the replicated log module is linked into the business state machine's program as a library. The decoupling is therefore incomplete: upgrades, maintenance, and dynamic scaling are inconvenient, and the design is a poor fit for cloud-native architectures.

Can the replicated state machine system be abstracted further? The replicated log module in Figure 2 essentially uses replication to share a log among multiple nodes, which gives it the semantics of a shared log. We can therefore abstract the replicated log into a shared log. As shown in Figure 3, the business state machine writes new log entries to the underlying shared log layer and reads entries back from it for execution. The business state machines and the shared log can then be scaled, upgraded, and maintained independently.
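A minimal sketch of this read/write relationship is shown below. It assumes an in-memory SharedLog with hypothetical append and read methods. A real shared log would be a separate, replicated service reached over the network.

```python
class SharedLog:
    """In-memory stand-in for a shared log service."""

    def __init__(self):
        self._entries = []

    def append(self, entry):
        self._entries.append(entry)
        return len(self._entries) - 1  # position assigned to the entry

    def read(self, position):
        # Returns None when the position is past the current tail.
        if position < len(self._entries):
            return self._entries[position]
        return None


class Replica:
    """A business state machine that pulls entries and applies them in order."""

    def __init__(self, shared_log):
        self.shared_log = shared_log
        self.next_position = 0
        self.state = {}

    def write(self, key, value):
        return self.shared_log.append((key, value))

    def catch_up(self):
        while True:
            entry = self.shared_log.read(self.next_position)
            if entry is None:
                break
            key, value = entry
            self.state[key] = value
            self.next_position += 1


shared_log = SharedLog()
a, b = Replica(shared_log), Replica(shared_log)
a.write("color", "red")
b.write("color", "blue")
a.write("size", 3)
a.catch_up()
b.catch_up()

# A replica added later only needs to replay the log from the beginning.
c = Replica(shared_log)
c.catch_up()
```

Each replica keeps its own read cursor, so replicas can be added, restarted, or upgraded without any change to the log layer.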

Figure 3: The Replicated State Machine Abstracted into the Upper-Layer Business State Machine and the Lower-Layer Shared Log

The architecture shown in Figure 3 separates storage from compute. It turns the shared log layer into a storage system in its own right, which opens the door to the many techniques already developed for storage systems. These abstractions are described in detail in two papers from Facebook's Delos project ([1] [2]). The following sections cover the abstraction of the shared log layer (Section 2, Shared Logs) and of the business state machine layer (Section 3, Business State Machine).

2. Shared Logs

Shared logs provide log read and write services. Business state machines use the shared log to synchronize state so that their states stay consistent. The shared log must itself be highly available, because the availability of the replicated state machine depends on it. A shared log is essentially an append-only storage system, so its design can borrow from storage systems such as GFS, HDFS, and Pangu. Mature systems already exist in the industry, such as Apache BookKeeper and the virtual consensus of Facebook Delos.

2.1 Apache BookKeeper

Apache BookKeeper is a highly scalable, fault-tolerant, low-latency online log storage system. It provides durability, replication, and strong consistency, and reliable online services can be built quickly on top of it. Logs in Apache BookKeeper are called Ledgers, and they can be created and deleted dynamically.

Figure 4: Architecture of Apache BookKeeper

As shown in Figure 4, Apache BookKeeper contains three core components: Client, Metadata Store, and Bookie. The Metadata Store stores metadata about Ledgers and the cluster. A Bookie is a storage node of the system and stores the Entries of Ledgers. The Client provides the interfaces for accessing the system. The Ledger is the basic logical unit of BookKeeper and contains a series of consecutive Entries. BookKeeper ensures that Entries are written sequentially and at most once. Once an Entry is written, it cannot be modified. A Ledger is divided into multiple Fragments, each of which contains a contiguous set of Entries. Bookies store Ledgers at the granularity of Fragments. At any given time, only the last Fragment of a Ledger can be written to. When a write to that Fragment fails, a new Fragment is created and writing continues there. Each Fragment is replicated to multiple Bookies to provide fault tolerance. This group of Bookies is called an Ensemble.
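The Ledger, Fragment, and Ensemble relationship described above can be sketched with a toy model. This is not the BookKeeper client API. The classes and methods below are hypothetical, the Metadata Store is omitted, and every Entry is written to every Bookie of the Ensemble, which is a simplification.

```python
class Bookie:
    """Toy storage node that can be marked as failed."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.entries = {}  # (ledger_id, entry_id) -> data

    def store(self, ledger_id, entry_id, data):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        key = (ledger_id, entry_id)
        if key in self.entries:
            raise ValueError("an entry can be written at most once")
        self.entries[key] = data


class Ledger:
    """Toy ledger: a list of fragments, each with its own ensemble."""

    def __init__(self, ledger_id, bookies, ensemble_size):
        self.ledger_id = ledger_id
        self.bookies = bookies
        self.ensemble_size = ensemble_size
        self.next_entry_id = 0
        self.fragments = []  # list of (first_entry_id, ensemble)
        self._open_fragment()

    def _open_fragment(self):
        alive = [b for b in self.bookies if b.alive]
        if len(alive) < self.ensemble_size:
            raise RuntimeError("not enough live bookies for an ensemble")
        self.fragments.append((self.next_entry_id, alive[: self.ensemble_size]))

    def append(self, data):
        _, ensemble = self.fragments[-1]
        if not all(b.alive for b in ensemble):
            # The last fragment can no longer be written: start a new one.
            self._open_fragment()
            _, ensemble = self.fragments[-1]
        entry_id = self.next_entry_id
        for bookie in ensemble:
            bookie.store(self.ledger_id, entry_id, data)
        self.next_entry_id += 1
        return entry_id


bookies = [Bookie(f"bookie-{i}") for i in range(1, 5)]
ledger = Ledger(ledger_id=1, bookies=bookies, ensemble_size=3)
ledger.append("e0")
ledger.append("e1")
bookies[1].alive = False  # bookie-2 fails
ledger.append("e2")       # continues in a new fragment on a new ensemble
second_ensemble = [b.name for b in ledger.fragments[1][1]]
```

After the failure, the Ledger has two Fragments. The second starts at Entry 2 and is stored on an Ensemble that excludes the failed Bookie.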

2.2 Virtual Consensus in Delos

Delos introduced the concept of virtual consensus, which hides the details of consensus behind the Virtual Log abstraction. The paper won a Best Paper award at OSDI '20. A Virtual Log is an append-only log that provides APIs such as append, checkTail, and readNext. It also supports hot upgrades of the consensus protocol, which Apache BookKeeper does not offer.

With the Virtual Log abstraction, the upper layer only needs to assume that each entry in the log has been replicated and persisted on different nodes. It does not need to know which consensus protocol is used underneath, and multiple consensus protocols can even coexist. The Virtual Log is mapped onto a set of physical shared logs called Loglets, each of which holds a batch of consecutive log entries. A Loglet corresponds to a consensus protocol or to a log storage system implemented on top of one.

A Loglet provides the same interface as the Virtual Log, plus a seal interface. Once sealed, a Loglet no longer accepts appends, and writes must switch to a new Loglet. The mapping between the logical address space of the Virtual Log and the physical address space of the Loglets is stored in a separate MetaStore service, which is a versioned key-value store. To replace the consensus protocol, you only need to update the mapping in MetaStore so that it points to the new storage location. Once the new version of the mapping is in place, the Virtual Log naturally directs traffic to the new Loglet.
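The seal-and-switch mechanism can be sketched as below. This is a single-process illustration under stated assumptions, not the Delos implementation: check_tail is a Python-style rendering of checkTail, while read, put, and reconfigure are hypothetical names, and the MetaStore is reduced to one versioned value.

```python
class Loglet:
    """Toy physical log; a real Loglet wraps a consensus protocol or log store."""

    def __init__(self, name):
        self.name = name
        self.entries = []
        self.sealed = False

    def append(self, entry):
        if self.sealed:
            raise RuntimeError(f"{self.name} is sealed")
        self.entries.append(entry)
        return len(self.entries) - 1

    def seal(self):
        self.sealed = True


class MetaStore:
    """Versioned store holding the chain of (start_position, loglet) pairs."""

    def __init__(self, first_loglet):
        self.version = 0
        self.chain = [(0, first_loglet)]

    def put(self, expected_version, chain):
        # A conditional write keeps concurrent reconfigurations from clashing.
        if expected_version != self.version:
            raise RuntimeError("stale MetaStore version")
        self.chain = chain
        self.version += 1


class VirtualLog:
    def __init__(self, metastore):
        self.metastore = metastore

    def append(self, entry):
        start, loglet = self.metastore.chain[-1]
        return start + loglet.append(entry)

    def check_tail(self):
        start, loglet = self.metastore.chain[-1]
        return start + len(loglet.entries)

    def read(self, position):
        # Find the loglet whose range covers the logical position.
        for start, loglet in reversed(self.metastore.chain):
            if position >= start:
                return loglet.entries[position - start]

    def reconfigure(self, new_loglet):
        version = self.metastore.version
        _, active = self.metastore.chain[-1]
        active.seal()              # no more appends to the old loglet
        tail = self.check_tail()   # the new loglet starts at the old tail
        self.metastore.put(version, self.metastore.chain + [(tail, new_loglet)])


old_loglet = Loglet("loglet-a")
metastore = MetaStore(old_loglet)
vlog = VirtualLog(metastore)
vlog.append("a")
vlog.append("b")
vlog.reconfigure(Loglet("loglet-b"))
position = vlog.append("c")
```

Callers of vlog never see the switch. Logical positions keep increasing across the two Loglets, and only the mapping held in MetaStore changes.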

Figure 5: Virtual Consensus in Delos

With virtual consensus in place, a Loglet no longer needs to provide a complete fault tolerance mechanism. When a Loglet becomes unavailable, the Virtual Log only needs to seal it and then switch to another Loglet to continue writing. A Loglet therefore only needs to provide a highly available seal interface, which simplifies its implementation and avoids the complexity of implementing a full consensus protocol such as Paxos or Raft. Virtual consensus also helps the system evolve over the long term: new Loglets can be developed continuously to replace old ones for higher performance and lower cost. Delos initially used ZKLoglet to go into production quickly. NativeLoglet was later developed to replace ZKLoglet and improved performance tenfold.

3. Business State Machine

The business state machine implements the business logic and is therefore tightly bound to the specific business. At first glance, it seems that it cannot be abstracted any further, but that is not the case. Facebook's Delos system proposed the Log-Structured protocol at SOSP '21. It is a replicated state machine built on a shared log, with which an application's state can be replicated consistently among different nodes.

The Log-Structured protocol provides a set of interfaces through which applications interact with the protocol engine. Through the IEngine interface, the application uses the propose interface to propose an entry to the shared log. The registerUpcall interface registers an Applicator instance to receive new entries from the shared log. Whenever a new entry is written, the apply interface of the Applicator instance is called. The sync interface ensures that all entries currently in the shared log have been delivered to the application, and it returns a read-only view from which the latest state can be read. An application can save its local state in a persistent storage system (such as RocksDB), which the paper calls LocalStore.
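The interplay of propose, registerUpcall, apply, and sync can be sketched as follows. This is a simplified rendering under assumptions, not the interface definition from the paper: the method names are written in Python style, the signatures are guesses, a plain list stands in for the shared log, and a dict stands in for LocalStore.

```python
from abc import ABC, abstractmethod
from types import MappingProxyType


class Applicator(ABC):
    @abstractmethod
    def apply(self, store, entry, position):
        """Apply one log entry to the local store."""


class IEngine(ABC):
    @abstractmethod
    def propose(self, entry): ...

    @abstractmethod
    def sync(self): ...

    @abstractmethod
    def register_upcall(self, applicator): ...


class InMemoryEngine(IEngine):
    """Toy engine for one node; `shared_log` is a list shared by all nodes."""

    def __init__(self, shared_log):
        self.shared_log = shared_log
        self.local_store = {}  # stand-in for LocalStore
        self.applied = 0
        self.applicator = None

    def register_upcall(self, applicator):
        self.applicator = applicator

    def propose(self, entry):
        self.shared_log.append(entry)
        position = len(self.shared_log) - 1
        self.sync()
        return position

    def sync(self):
        # Deliver every entry not yet applied, then expose a read-only view.
        while self.applied < len(self.shared_log):
            entry = self.shared_log[self.applied]
            self.applicator.apply(self.local_store, entry, self.applied)
            self.applied += 1
        return MappingProxyType(self.local_store)


class CounterApp(Applicator):
    def apply(self, store, entry, position):
        store[entry["key"]] = store.get(entry["key"], 0) + entry["delta"]


shared_log = []
node1, node2 = InMemoryEngine(shared_log), InMemoryEngine(shared_log)
node1.register_upcall(CounterApp())
node2.register_upcall(CounterApp())
node1.propose({"key": "hits", "delta": 2})
node2.propose({"key": "hits", "delta": 3})
view1, view2 = node1.sync(), node2.sync()
```

The application never writes to its store outside apply. A read that must observe the latest state first calls sync and then reads from the returned view.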

Figure 6: Log-Structured Protocol Interface

The Log-Structured protocol is a stackable replicated state machine. In the example shown in Figure 7, each Engine acts as an application of the Engine below it. The upper Engine calls propose and sync on the lower Engine, and the lower Engine calls apply on the upper Engine. Each Engine layer implements the IEngine interface in Figure 6 and, in its implementation, calls the same interface on the layer below. Each Engine layer can also access LocalStore directly to persist the state it needs. When an entry is proposed to an Engine, the Engine adds its own header to the entry and then proposes it to the Engine below. Conversely, when the Engine below calls its apply, the Engine parses its own header from the entry, updates LocalStore, and then calls apply on the Engine above. The top layer is the application itself, which exposes application-specific interfaces to users. The bottom Engine, called BaseEngine, is the one that interacts with the shared log.
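The downward propose path and the upward apply path can be sketched with a small stack. The sketch is illustrative only: HeaderEngine, KVApp, the engine names, and the entry layout are hypothetical, BaseEngine here appends to a plain list, and a single dict stands in for LocalStore.

```python
class BaseEngine:
    """Bottom engine: the only layer that touches the shared log (a list here)."""

    def __init__(self, shared_log):
        self.shared_log = shared_log
        self.applied = 0
        self.upper = None

    def register_upcall(self, upper):
        self.upper = upper

    def propose(self, entry):
        self.shared_log.append(entry)
        self.sync()

    def sync(self):
        while self.applied < len(self.shared_log):
            self.upper.apply(self.shared_log[self.applied])
            self.applied += 1


class HeaderEngine:
    """Middle engine: adds its header on the way down, strips it on the way up."""

    def __init__(self, name, lower, local_store):
        self.name = name
        self.lower = lower
        self.local_store = local_store
        self.upper = None
        lower.register_upcall(self)

    def register_upcall(self, upper):
        self.upper = upper

    def propose(self, entry):
        self.lower.propose({"header": {"engine": self.name}, "payload": entry})

    def sync(self):
        self.lower.sync()

    def apply(self, entry):
        assert entry["header"]["engine"] == self.name
        # Persist this engine's own state, then hand the payload upward.
        counter = f"{self.name}.applied"
        self.local_store[counter] = self.local_store.get(counter, 0) + 1
        self.upper.apply(entry["payload"])


class KVApp:
    """Top of the stack: the application-specific interface."""

    def __init__(self, engine, local_store):
        self.engine = engine
        self.local_store = local_store
        engine.register_upcall(self)

    def put(self, key, value):
        self.engine.propose({"key": key, "value": value})

    def apply(self, entry):
        self.local_store[entry["key"]] = entry["value"]


shared_log, local_store = [], {}
base = BaseEngine(shared_log)
inner = HeaderEngine("inner", base, local_store)
outer = HeaderEngine("outer", inner, local_store)
app = KVApp(outer, local_store)
app.put("a", 1)
```

The entry stored in shared_log is wrapped twice, with the header of the lowest HeaderEngine outermost. Each layer unwraps exactly one header on the way back up.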

Figure 7: An Example of an Interaction between Stacked Engines

With this stacking model, a new feature is usually added by adding a new Engine to the engine stack. Common Engines can be reused across different business state machines, so new business state machines can be developed quickly. Delos implements nine types of Engines to serve databases with different requirements, and it combines these Engines to build different databases quickly, such as DelosTable, which provides MySQL semantics, Zelos, which provides ZooKeeper semantics, and DelosQ, which provides queue services.

4. Summary

This article introduced the architectural abstraction of replicated state machine systems. First, the replicated state machine can be abstracted into an upper-layer business state machine and an underlying shared log. The article then covered the architectural abstraction of the shared log and of the business state machine. Many mature shared log systems exist in the industry, and that abstraction is already fairly common, but there are still few examples of architectural abstraction for business state machines. I hope to see more of it in the future, so that code can be reused better and new business state machines can be implemented quickly.