Disclaimer: This is a translated work of Qinxia's 漫谈分布式系统, all rights reserved to the original author.
This is the 11th post in the "Distributed Systems" series, which is planned to run to about 30 articles. At the end of each article there is a TL;DR for the lazy and related reading for the diligent. Scan the QR code at the end of the article to follow the official account, and feel free to forward it to your friends and share it with more people.
Consensus
In the last article, we briefly introduced distributed transaction implementations such as 2PC and 3PC. These algorithms basically provide strong consistency, but are powerless against network partitions.
In this article, let's see if there is an algorithm with Partition Tolerance.
This line of reasoning leads to strongly consistent algorithms that also tolerate partitions.
In turn, we can summarize what such algorithms have in common:
- The total number of nodes, n, must be odd.
- A commit is considered successful once the data has been successfully written to at least (n/2 + 1) nodes.
- After a network partition occurs, the partition that still holds a majority of nodes (at least n/2 + 1) keeps working, while the other partitions are suspended; if no partition holds a majority, the whole system is suspended.
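As a minimal illustration of this majority rule (the helper names below are my own, purely for the example), a write would only count as committed once a quorum of acknowledgements has been collected:

```python
# Minimal sketch of majority-quorum commit, not tied to any particular system.
def quorum(n: int) -> int:
    """Smallest majority of n nodes, i.e. n // 2 + 1."""
    return n // 2 + 1

def is_committed(acks: int, n: int) -> bool:
    """A write counts as committed once a majority has acknowledged it."""
    return acks >= quorum(n)

# With 5 nodes, 3 acks are enough; a 2-node minority partition can never commit.
assert quorum(5) == 3
assert is_committed(3, 5) and not is_committed(2, 5)
```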
Here, we have to introduce a new concept - Consensus.
The so-called consensus, as the name suggests, means that multiple nodes agree on something. For example, all nodes in the cluster change the value of a variable from 3 to 5.
Broadly speaking, the 2PC and 3PC mentioned in the previous article are also consensus algorithms, as are the algorithms behind Bitcoin and other cryptocurrencies. In the narrow sense, Paxos, Raft, and the like, which this article discusses, are the more typical consensus algorithms.
Now think about a question: what is the difference between consensus and data consistency?
The two are very similar and can even be used interchangeably on many occasions, but there are subtle differences:
- Data consistency is more like an outcome or a goal: it describes the desired state of the system, but does not say how that state is reached.
- Consensus is also an outcome and a goal, but it additionally includes a general method for reaching that state: voting (2PC amounts to a unanimous vote, Paxos to majority rule).
Consensus algorithms have a very wide range of applications. Once we step outside the data-replication scenario that motivates data consistency and look at things from the voting perspective, consensus is everywhere:
- Leader election
- Distributed locks
- Membership and liveness detection
- Atomic broadcast
- ......
Of course, consensus algorithms are not omnipotent. Common consensus algorithms deliberately do not solve the so-called Byzantine Generals Problem, i.e. the case where nodes cheat. Since consensus algorithms are usually applied within internally controlled systems, this premise is generally acceptable.
Next, let's take a look at a typical consensus algorithm.
Paxos
Basic knowledge
First up is Paxos, proposed by Leslie Lamport in 1989 and formally published in 1998.
Mike Burrows, author of Chubby, Google's Paxos-based lock service, once said:
There is only one consensus protocol, and that’s Paxos. All other approaches are just broken versions of Paxos.
However, Paxos has been widely criticized as complex and hard to understand, so much so that Lamport later wrote a separate paper, "Paxos Made Simple", to explain it.
Here I will not go into great depth; if you are interested, check the original papers for the details. We will only look at the key points.
By analyzing the essence of consensus, Paxos defines several roles, puts forward constraints on their behavior, and then gradually strengthens those constraints (P1 -> P1a; P2 -> P2a -> P2b -> P2c) to arrive at a theoretically sound consensus algorithm.
The defined core roles (other supporting roles omitted) are:
- Proposer, which initiates proposals.
- Acceptor, which votes on proposals.
The final constraints are (if they don't make sense yet, let them go for now):
- P1a: An acceptor can accept a proposal numbered n if and only if it has not responded to a prepare request numbered greater than n.
- P2c: If a proposal numbered n with value v is issued, then there is a majority set of acceptors such that either none of them has accepted any proposal numbered less than n, or v is the value of the highest-numbered proposal, among those numbered less than n, that they have accepted.
The algorithm is divided into two stages:
- Phase 1
  - (a) Prepare: the proposer sends a prepare request numbered n to the acceptors, where n is a globally unique and (at least locally) increasing integer.
  - (b) Promise: on receiving the prepare request, the acceptor checks whether n is greater than the highest number it has responded to before; if so, it replies with the information of the highest-numbered proposal it has already accepted (if any) and promises not to accept any proposal numbered less than n. Otherwise the request can be ignored.
- Phase 2
  - (a) Accept: after receiving positive responses from more than half of the acceptors, the proposer sends accept(n, v) to the acceptors, where v is the value of the highest-numbered proposal among the replies received in the prepare stage, or, if no such proposal exists, the value the proposer itself wants to propose.
  - (b) Accepted: on receiving the accept request, the acceptor checks n and accepts the new value as long as doing so does not violate the promise it made in the first phase.
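To make the two phases a bit more concrete, here is a minimal single-machine sketch of the acceptor side (the class and method names are my own, not from the paper; the proposer side, which counts promises and picks the value, is covered by the walkthrough below):

```python
# Minimal sketch of a Paxos acceptor; names and structure are illustrative only.
class Acceptor:
    def __init__(self, value=None):
        self.promised_n = 0      # highest prepare number responded to
        self.accepted_n = 0      # number of the highest proposal accepted so far
        self.accepted_v = value  # value of the highest proposal accepted so far

    def prepare(self, n):
        """Phase 1b: promise not to accept proposals numbered below n."""
        if n > self.promised_n:
            self.promised_n = n
            # Reply with the highest-numbered proposal already accepted (if any).
            return ("promise", self.accepted_n, self.accepted_v)
        return ("reject",)

    def accept(self, n, v):
        """Phase 2b: accept (n, v) unless it violates an earlier promise."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n, self.accepted_v = n, v
            return ("accepted", n, v)
        return ("reject",)
```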
Typical process
This is probably still hard to grasp, so let's walk through a typical example (it only illustrates the principle and is not exactly how real implementations work).
Imagine a situation where 2 proposers initiate proposals to 3 acceptors at the same time.
As the first stage begins, A1, A2, and A3 all hold a variable whose value is 1.
At this point, both P1 and P2 try to modify the variable. P1 takes n=1 (hoping to set the value to 6) and broadcasts a prepare request to all (or more than half of) the acceptors; at the same time, P2 (hoping to set the value to 8) broadcasts n=2. Due to differences in network transmission, P1's request reaches A1 and A2 first, P2's request reaches A3 first, and the remaining requests are still in flight.
Per the constraints, none of the acceptors has responded to any request yet in this round, so A1 and A2 promise prepare(n=1) and A3 promises prepare(n=2). This is only a vote and changes nothing, so the value of the variable is still 1.
Each acceptor then replies to the proposer with its response (the current n, the highest previously accepted n, and the v corresponding to that n).
P2 has received a promise only from A3, which is not a majority, so it can only keep waiting. P1 has received promises from A1 and A2, more than half, so it enters the second stage and sends an accept request to A1 and A2, asking them to set the variable to 6.
After A1 and A2 receive the accept request, they check the constraints and find that they have not promised any higher-numbered request, so they accept the proposal, set the variable to 6, and reply accepted to P1.
After a while, the three requests still in flight finally arrive. Per the constraints, each acceptor compares the received n with the largest n it has previously responded to: if the former is larger, it promises; otherwise it rejects.
So A1 and A2 return promises to P2 and tell it that they have already accepted another proposal, the highest-numbered one being (1, 6). A3 rejects P1's request (it could also simply ignore it, in which case P1 would treat the request as lost).
By now P2 has received promises from all 3 acceptors and can enter the second stage. Per the constraints, it sees that the acceptors have already accepted another proposer's value, so it adopts the value 6 and drops the value 8 it originally wanted (in practice, it could send the accept request only to A3 to improve performance).
After receiving P2's accept request, each acceptor checks the constraints and updates its own number and value to (2, 6).
At this point, the whole process is over.
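Replaying this walkthrough with the acceptor sketch above (again, purely illustrative) ends in the same state:

```python
# Replay of the walkthrough using the Acceptor sketch above.
a1, a2, a3 = Acceptor(1), Acceptor(1), Acceptor(1)

# Phase 1: P1's prepare(1) reaches A1/A2 first, P2's prepare(2) reaches A3 first.
a1.prepare(1); a2.prepare(1); a3.prepare(2)

# P1 has a majority of promises, so it sends accept(1, 6) to A1 and A2.
a1.accept(1, 6); a2.accept(1, 6)

# The delayed prepares arrive: A1/A2 promise P2 and report (1, 6); A3 rejects P1.
print(a1.prepare(2), a2.prepare(2), a3.prepare(1))
# -> ('promise', 1, 6) ('promise', 1, 6) ('reject',)

# Per P2c, P2 must adopt the already-accepted value 6 and sends accept(2, 6).
a1.accept(2, 6); a2.accept(2, 6); a3.accept(2, 6)
print(a3.accepted_n, a3.accepted_v)  # -> 2 6
```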
Some thoughts
A point worth noting: per the Paxos constraints, after winning a majority of promises, P2 did not use its own value, but adopted P1's value, which had already been accepted by a majority.
This is because, by the time P2 initiates its vote, P1's value has already gathered more votes, and adopting it directly helps the system converge on a consensus faster. This is also the philosophy behind consensus algorithms: don't be selfish, consider the overall situation, and move toward the majority as soon as possible.
As you can see, the first stage only determines whose proposal can go forward (the proposer itself does not know this yet), while the concrete value is still undecided; the formal proposal is only made in the second stage.
During this process, multiple proposals may be competing at the same time, and the winner is not simply the one with the higher number (although the number plays a big role in every constraint), but the one whose value is first accepted by more than half of the nodes.
In addition, careful readers may have noticed that Paxos also runs in two phases, much like 2PC. So what is the difference between them? Why can Paxos handle network partitions while 2PC cannot?
The key points are:
- Paxos only requires more than half of the nodes to vote, while 2PC requires all of them. When a network partition occurs, it is impossible to gather every node.
- Paxos supports multiple proposers, while 2PC has only one coordinator. When a critical node goes down, Paxos keeps functioning, while 2PC is blocked by the single point of failure.
Of course, as mentioned earlier, Paxos supports strong consistency under network partitions, but it also comes at a price -- some availability is lost, and it still cannot escape the constraints of CAP.
Raft
Although Paxos achieves distributed consensus, it is not perfect, and in particular it is not easy to use.
For example:
- The external system has to provide continuous input so that voting can proceed round by round.
- Multiple proposers may initiate votes at the same time, slowing convergence and possibly blocking progress indefinitely in a so-called livelock.
For these reasons, plus the difficulty of understanding and implementing Paxos itself, many improvements and imitators have appeared.
Among them are Multi-Paxos, Fast Paxos, Raft, ZAB, and so on. We will take Raft as an example.
Raft adopts optimizations that most of the other improved algorithms also use. The most important one is:
Introduce the roles of leader and follower to differentiate proposers: only the leader can initiate proposals, and when the leader fails, a new leader is automatically elected.
In this way, under normal circumstances, the first phase of the Paxos protocol can be skipped entirely. Since only one node proposes in the second phase, consensus is reached quickly and overall performance improves greatly.
Below, we will not cover Raft's features and processes in full; we will only use typical examples to illustrate the scenarios we care about.
Leader election
Raft nodes exchange information and probe liveness through heartbeats. Any node that does not receive a new heartbeat from the leader within the timeout can initiate a vote and try to become the new leader.
The typical election process is relatively simple and will not be repeated here. Let's see what happens when there is competition.
As shown in the figure below, by coincidence, nodes A and D request votes from the other 3 nodes at the same time, each hoping to become the new leader, while B and C are still within the previous heartbeat round and do not trigger an election.
Suppose A gets B's vote and D gets C's vote. Counting their own votes, A and D each have two votes; neither exceeds half, so neither can become leader.
But all nodes have already cast their votes, so this round of the election ends with no leader elected.
To avoid this "livelock" situation, each candidate (the role a follower switches to after it initiates an election) waits a random short period of time before starting a new round of voting.
Randomness keeps the probability of simultaneous elections low enough; even if one unlucky round runs into the situation above, a new leader can still be elected quickly.
(The copyright of all Raft animations belongs to the original author. For the complete introduction to the Raft process, click "Read the original text" at the end of this article.)
As shown in the figure above, neither A nor D has won more than half of the votes, so both wait out their timeouts before the next round. B happens to finish waiting for the heartbeat first and initiates a new vote; although D also wakes up later, B has already won more than half of the votes and becomes the new leader.
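A minimal sketch of this randomized-timeout idea (the range below follows the commonly cited 150-300 ms example; the exact constants are an implementation choice, not something this article specifies):

```python
# Illustrative sketch of randomized election timeouts; constants are examples only.
import random

ELECTION_TIMEOUT_MIN_MS = 150
ELECTION_TIMEOUT_MAX_MS = 300

def next_election_timeout_ms() -> float:
    """Each follower/candidate picks its own random timeout, so two nodes
    rarely time out and start campaigning at exactly the same moment."""
    return random.uniform(ELECTION_TIMEOUT_MIN_MS, ELECTION_TIMEOUT_MAX_MS)

# A node restarts this countdown whenever it hears a valid leader heartbeat;
# if the countdown expires first, it becomes a candidate and requests votes.
```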
Log replication
After the leader is elected, data read and write services can be provided to the outside world.
Raft calls this process log replication. Under the hood, it also piggybacks the data on heartbeat messages and then collects votes (acknowledgements).
The normal replication flow is simple, so we won't dwell on it; let's focus on what we care about most: how consensus is reached when a network partition occurs.
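Still, a minimal sketch of that normal flow may help. Here the leader counts acknowledgements and treats an entry as committed once a majority of the cluster, including itself, has stored it (the function name is my own, not Raft's RPC interface):

```python
# Illustrative leader-side commit check; not Raft's actual RPC interface.
def try_commit(follower_acks: int, cluster_size: int) -> bool:
    """An entry is committed once a majority of the cluster has stored it;
    the leader's own copy counts toward that majority."""
    copies = follower_acks + 1              # +1 for the leader itself
    return copies >= cluster_size // 2 + 1

# In a 5-node cluster, 2 follower acks (3 copies in total) are enough,
# which is why a 2-node minority partition, like the one below, cannot commit.
assert try_commit(2, 5) is True
assert try_commit(1, 5) is False
```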
As shown in the figure below, the network is divided into two partitions:
- A and B are in one partition; B was the leader before the partition occurred.
- C, D, and E are in the other partition. Having received the three votes of C, D, and E, C is also elected leader.
This is a typical network partition + split-brain situation.
Then, the two clients made requests to the two leaders respectively.
- The upper partition receives a SET 8 request, collects 3 votes, and the write succeeds. This partition remains fully functional.
- The lower partition receives a SET 3 request but collects only 2 votes, so the write cannot succeed, meaning this partition has lost availability.
This state persists: requests sent to the upper partition succeed, while requests sent to the lower partition fail.
That lasts until the network partition heals and the two sides merge again.
As shown in the figure below, B quickly learns via the Term that it is out of date, so it steps down to become a follower. The sole remaining leader, C, then quickly synchronizes SET 8 to A and B so they catch up with the consensus already reached. The earlier SET 3, which never reached consensus, is discarded (the client can retry).
Term, called Epoch in some other systems, is a globally increasing integer representing the leader's term of office. Each new round of elections increments the Term by 1.
As you can see, the split-brain problem is resolved cleanly thanks to the Term, and Term combined with majority voting handles data consistency well under network partitions.
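A minimal sketch of that step-down rule (purely illustrative; in Raft every message carries the sender's term):

```python
# Illustrative step-down check based on terms; not a full Raft implementation.
class Node:
    def __init__(self, term: int, role: str):
        self.term = term
        self.role = role  # "leader", "follower", or "candidate"

    def on_message(self, sender_term: int):
        """Seeing a higher term means this node's view is stale:
        it adopts the newer term and reverts to follower."""
        if sender_term > self.term:
            self.term = sender_term
            self.role = "follower"

# After the partition heals, old leader B (term 1) hears from leader C (term 2).
b = Node(term=1, role="leader")
b.on_message(sender_term=2)
assert b.role == "follower" and b.term == 2
```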
In essence, Raft reduces how often Paxos's first phase has to run: once a leader is elected, the rest of the term goes straight to the second phase, and only when the term ends is the complete consensus process (both phases) run again to elect a new leader.
Together with the facts that followers are forced to synchronize data promptly and that a node only votes for a candidate whose log is at least as up to date as its own (a stronger safety guarantee), plus a well-designed API, this has led to Raft being widely adopted by systems such as etcd.
Similarly, ZooKeeper implements its own consensus protocol, ZAB (ZooKeeper Atomic Broadcast), and is relied on by many big data components such as HDFS, YARN, HBase, and Kafka.
Due to space limitations, we won't go into content that is not closely related to our theme.
TL;DR
Consensus algorithms are another general class of strongly consistent algorithms, and they can provide partition tolerance
Paxos is the ancestor of consensus algorithms; its core idea is majority voting
But Paxos is complex and inefficient, which has led to many optimized variants
Raft, a typical descendant of Paxos, greatly simplifies the consensus process by introducing the leader role
__
This post introduced the concept of consensus, walked through Paxos and Raft as typical consensus algorithms, and finally arrived at the partition-tolerant strong consistency we were after.
But it is also clear that both the 2PC of the previous article and the Raft of this one, as preventive strong-consistency algorithms, sacrifice some performance because of the negotiation among multiple nodes.
At the same time, both 2PC and Raft give up availability to varying degrees in pursuit of consistency.
Starting from the next article, we will look at possibilities beyond preventive strong-consistency algorithms, and see whether weak-consistency algorithms that "pollute first, clean up later" can solve the performance and availability problems, and whether they can still meet our requirements for data consistency.
This is a carefully conceived series of 20-30 articles. I hope to give everyone a solid grasp of the core of distributed systems in a story-telling way. Stay tuned for the next one!
Learning about Distributed Systems – Part 10: An Exploration of Distributed Transactions
The Other Type of Consistency - Part 12 of About Distributed Systems