EPaxos (Egalitarian Paxos) is a high-profile, next-generation distributed consensus algorithm with broad application prospects in the industry. However, the industry has yet to produce a project based on EPaxos, or even an article that introduces EPaxos in layman's terms. Although the theory behind EPaxos is sound, the algorithm is difficult to understand and poses many implementation challenges, so it is not yet ready for practical use.
This article aims to introduce EPaxos in a simple, easy-to-understand way, even for readers with only basic knowledge of Paxos or Raft. It approaches EPaxos by way of Paxos, describes the basic concepts and the intuition behind EPaxos, and lays the groundwork for the detailed treatment of the algorithm in two other articles. Reading this article requires some background in consensus algorithms such as Paxos or Raft.
Let's start with Paxos. Paxos can be seen as the prototype of distributed consensus algorithms: it tolerates the simultaneous failure of F replicas out of 2F+1 replicas.
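The 2F+1 bound comes from majority quorums: any two majorities of the replicas share at least one replica, so the system keeps working as long as a majority survives. The arithmetic can be sketched as follows (the function names are illustrative and not taken from any Paxos library):

```python
def majority(n: int) -> int:
    """Smallest number of replicas that forms a majority of n replicas."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Largest F such that n >= 2F + 1, so a majority still survives."""
    return (n - 1) // 2


for n in (3, 5, 7):
    print(f"replicas={n} majority={majority(n)} F={tolerated_failures(n)}")
```

With three replicas, for example, a majority is two replicas and one failure can be tolerated.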
Generally, Paxos requires at least two phases to decide a value: the preparation phase and the acceptance phase.
In the preparation phase, replicas compete for the proposal right. In the acceptance phase, the replica that holds the proposal right sends messages so that its proposal reaches consensus among the replicas.
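As a rough illustration of the two phases, here is a minimal in-memory sketch of single-value Paxos. It assumes integer ballot numbers, synchronous calls instead of network messages, and no failures; the names `Acceptor` and `propose` are made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised: int = -1                    # highest ballot promised so far
    accepted_ballot: int = -1             # ballot of the accepted value, if any
    accepted_value: Optional[str] = None

    def prepare(self, ballot: int) -> Tuple[bool, int, Optional[str]]:
        """Preparation phase: grant the proposal right to a higher ballot."""
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted_ballot, self.accepted_value
        return False, self.accepted_ballot, self.accepted_value

    def accept(self, ballot: int, value: str) -> bool:
        """Acceptance phase: accept unless a higher ballot was promised."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted_ballot = ballot
            self.accepted_value = value
            return True
        return False


def propose(acceptors: List[Acceptor], ballot: int, value: str) -> Optional[str]:
    """Run both phases; return the decided value, or None if the ballot lost."""
    quorum = len(acceptors) // 2 + 1
    replies = [a.prepare(ballot) for a in acceptors]
    granted = [(b, v) for ok, b, v in replies if ok]
    if len(granted) < quorum:
        return None
    # A value already accepted at the highest ballot must be adopted.
    _, previous_value = max(granted, key=lambda reply: reply[0])
    if previous_value is not None:
        value = previous_value
    acks = sum(a.accept(ballot, value) for a in acceptors)
    return value if acks >= quorum else None
```

Calling `propose` with a stale ballot returns `None`, which is the situation in which competing proposers must retry with a higher ballot.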
The process of using Paxos to reach a consensus on a series of values is shown in Figure 1. Three replicas are identified by different colors, and A, B, C, D are their proposed values. They compete for each instance and propose their own values:
Figure 1 Using Paxos to reach a consensus on a series of values
Paxos determines the value of each instance independently. For each instance, Paxos must run both of the above phases in full to determine the value of that instance.
Paxos therefore needs at least two network round trips to decide a value. More round trips may be required when replicas propose concurrently, and in extreme cases a livelock can form, which makes the process inefficient. Multi-Paxos was created to solve these issues.
Multi-Paxos elects a leader among the replicas to initiate all proposals. Since there is no competition, the livelock issue is solved. In addition, the preparation phase can be skipped to improve efficiency.
The process of using Multi-Paxos to reach a consensus on a series of values is shown in Figure 2. Three replicas are identified by different colors. First, the green replica is elected as the leader. Then, the four values A, B, C, and D are proposed by the leader one by one:
Figure 2 Using Multi-Paxos to reach a consensus on a series of values
The first step in Multi-Paxos is to elect a leader. After that, the proposal right for every instance belongs to the leader, without competition. Therefore, the preparation phase can be omitted and only one phase is needed. Electing a leader improves the efficiency of deciding values. However, the leader also becomes a bottleneck for performance and availability.
The leader needs to process more messages than the other replicas, so the load across replicas is unbalanced and resource utilization is low. Moreover, the system is unavailable when the leader goes down and cannot recover until a new leader is elected, which reduces availability.
In basic Paxos, every replica can propose, which supports high availability, but competition and conflicts among replicas lead to low efficiency. Multi-Paxos elects a leader to avoid conflicts and improve efficiency, but the leader may become a bottleneck, which reduces availability. EPaxos was developed to balance efficiency and availability. Unlike Multi-Paxos, which avoids conflicts, EPaxos confronts conflicts directly and resolves them in a different way.
EPaxos is a leaderless consensus algorithm that allows any replica to initiate proposals. Generally, deciding a value requires only one or two network round trips.
EPaxos has no leader election cost. When one replica becomes unavailable, the other replicas can be accessed immediately, which gives higher availability. In addition, the load is balanced across replicas with no leader bottleneck, which gives higher throughput. Clients can also choose the nearest replica to serve their requests, reducing latency in cross-availability-zone and cross-region scenarios.
Paxos numbers all instances in advance and then reaches a consensus on the value of each instance independently, one by one. Unlike Paxos, EPaxos can process multiple instances concurrently. It does not number instances in advance but dynamically determines the order of these instances at runtime.
EPaxos reaches a consensus not only on the value of each instance, but also on the relative order among instances. It treats the relative order among different instances as a consensus problem in its own right and has the replicas agree on it. In this way, each replica can concurrently initiate proposals in its own instances. Once the values and the relative order of these instances are agreed on, the instances are reordered according to that relative order to form a consistent sequence.
The process of using EPaxos to reach a consensus on a series of values is shown in Figure 3. Three replicas are identified by different colors. Each replica has its own instance space and proposes its own values. A, B, C, D are their proposed values. For each instance, the replicas reach a consensus on its value as well as on its relative order with respect to other instances.
Figure 3 Using EPaxos to reach a consensus on a series of values
The instance space of EPaxos is two-dimensional. Each replica occupies one row of the two-dimensional instance space, so there is no competition for instances. Each replica can concurrently initiate proposals in its own row while maintaining the relative order of instances. The replicas reach a consensus on both the values and the relative order of the instances. As a result, every replica deterministically reorders the instances according to that relative order, which means the replicas reach a consensus on a series of values.
EPaxos introduces the concept of dependencies (deps) as a property of instances to represent the relative order between instances. A ← B means that B depends on A, indicating that A precedes B. Each instance has its own dependency set. EPaxos maintains the deps among instances and makes each dependency set, along with the value, consistent across replicas. Each replica then reorders the instances according to the deps to obtain a consistent sequence of values. For the case in Figure 3, the consistent sequence of values on every replica is: A ← B ← C ← D.
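The two-dimensional instance space and the dependency sets can be pictured with a small data-structure sketch. Everything named here is an assumption made for illustration: instance IDs are modeled as (replica, slot) pairs, the replica names R1 to R3 are invented, and the placement of A, B, C, D on replicas is made up because the figure is not reproduced in the text. Only the dependency chain A ← B ← C ← D comes from the article.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

InstanceId = Tuple[str, int]  # (replica name, slot in that replica's row)


@dataclass(frozen=True)
class Instance:
    value: str
    deps: FrozenSet[InstanceId] = frozenset()  # instances that must come first


def depends_on(space: Dict[InstanceId, Instance],
               later: InstanceId, earlier: InstanceId) -> bool:
    """True if `earlier` is reachable from `later` by following deps."""
    stack, seen = [later], set()
    while stack:
        current = stack.pop()
        for dep in space[current].deps:
            if dep == earlier:
                return True
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return False


# Hypothetical placement of the values on three replica rows.
space: Dict[InstanceId, Instance] = {
    ("R1", 0): Instance("A"),
    ("R2", 0): Instance("B", frozenset({("R1", 0)})),
    ("R3", 0): Instance("C", frozenset({("R2", 0)})),
    ("R1", 1): Instance("D", frozenset({("R3", 0)})),
}

print(depends_on(space, ("R1", 1), ("R1", 0)))  # D transitively depends on A
```

Because each instance is keyed by its own replica's row, two replicas never contend for the same ID; what they must agree on is the `value` and `deps` stored under each key.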
In Figure 3, EPaxos instances can be regarded as vertices, and the dependency sets of instances as directed edges. Once the values and dependency sets of the instances are agreed on, the vertices and edges in Figure 3 are identical across replicas. Therefore, every replica sees the same graph as in Figure 3.
Reordering instances in EPaxos is similar to a deterministic topological sort of this graph. Note, however, that the deps of EPaxos instances may form cycles, so the graph may contain loops. In that case, the process can no longer be called a topological sort.
To deal with circular dependencies, the EPaxos algorithm for instance reordering first finds the strongly connected components of the graph. Treating each strongly connected component as a single node yields a Directed Acyclic Graph (DAG). A deterministic topological sort of the strongly connected components is then performed:
Figure 4 Process of reordering instances by EPaxos
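The reordering step can be sketched as below. This is a simplified illustration, not the reference EPaxos implementation: instances are identified by plain strings, the graph maps each instance to the set of instances it depends on, and instances inside one strongly connected component are ordered by their ID, which is a tie-break rule assumed for this sketch. Tarjan's algorithm emits a component only after every component it depends on, so appending components as they are found already yields dependencies first.

```python
from typing import Dict, List, Set


def execution_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Order instances so that dependencies come first, cycles included."""
    index: Dict[str, int] = {}
    low: Dict[str, int] = {}
    stack: List[str] = []
    on_stack: Set[str] = set()
    order: List[str] = []

    def visit(v: str) -> None:
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in sorted(deps.get(v, ())):      # sorted: same walk on every replica
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:                 # v is the root of a component
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            order.extend(sorted(component))    # assumed tie-break inside a cycle

    for v in sorted(deps):
        if v not in index:
            visit(v)
    return order


# B and C depend on each other (a cycle); D depends on C; B depends on A.
print(execution_order({"A": set(), "B": {"A", "C"}, "C": {"B"}, "D": {"C"}}))
```

Iterating over vertices and dependencies in sorted order is what makes the result deterministic: replicas holding the same graph walk it in the same way and produce the same sequence.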
Tarjan's algorithm is a recursive algorithm commonly used to find strongly connected components. However, in practical stress tests, the recursive implementation easily causes stack overflow, which poses a challenge for real-world applications.
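One common way around the recursion limit is to replace the call stack with an explicit work stack. The sketch below is one possible iterative formulation of Tarjan's algorithm, written for illustration and not taken from any EPaxos codebase; the function name and the string vertex IDs are assumptions.

```python
from typing import Dict, Iterator, List, Set, Tuple


def tarjan_iterative(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Strongly connected components, found without recursion."""
    index: Dict[str, int] = {}
    low: Dict[str, int] = {}
    scc_stack: List[str] = []
    on_stack: Set[str] = set()
    components: List[List[str]] = []

    def open_vertex(v: str) -> Tuple[str, Iterator[str]]:
        index[v] = low[v] = len(index)
        scc_stack.append(v)
        on_stack.add(v)
        return v, iter(sorted(graph.get(v, ())))

    for root in sorted(graph):
        if root in index:
            continue
        work = [open_vertex(root)]             # explicit stack replaces recursion
        while work:
            v, successors = work[-1]
            descended = False
            for w in successors:               # resumes where it left off
                if w not in index:
                    work.append(open_vertex(w))
                    descended = True
                    break
                if w in on_stack:
                    low[v] = min(low[v], index[w])
            if descended:
                continue
            work.pop()                         # all successors of v are done
            if work:
                parent = work[-1][0]
                low[parent] = min(low[parent], low[v])
            if low[v] == index[v]:
                component = []
                while True:
                    w = scc_stack.pop()
                    on_stack.discard(w)
                    component.append(w)
                    if w == v:
                        break
                components.append(sorted(component))
    return components


# A dependency chain far deeper than Python's default recursion limit.
chain = {f"n{i}": {f"n{i + 1}"} for i in range(50_000)}
print(len(tarjan_iterative(chain)))
```

The memory that the recursive version would take from the call stack is held in the `work` list instead, so the depth of the dependency chain is bounded by heap size, not by the thread's stack size.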