Disclaimer: This is a translated work of Qinxia's 漫谈分布式系统, all rights reserved to the original author.
Once a global clock cannot be guaranteed, the resulting deviation in how time is represented is not a big deal in a distributed system; after all, humans are not that sensitive to small time differences. What cannot be guaranteed, however, is strong data consistency, and failure detection may also produce misjudgments (affected by both the network and the clock), with potentially serious consequences.
These two major problems, unreliable networks and unreliable clocks, are the root causes of data consistency problems in distributed systems.
Addressing the Root Cause of Consistency Problems
Now that the root causes have been identified, it seems that solving these two problems would fundamentally solve the consistency problem.
Troubleshooting Unreliable Network Issues
There is no simple and fundamental solution to an unreliable network.
The network itself, whether transit devices such as switches or transmission media such as optical fiber and telephone lines, can physically fail just like a server, and such failures can never be completely avoided.
Improving hardware stability and performance, as well as operations and maintenance efficiency, can certainly improve network quality, but it cannot completely solve the problem.
Therefore, we can only probe by sending requests and infer the other party's status from the responses.
And whether you tune timeout thresholds or add multi-step verification, you can only provide a best-effort guarantee.
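To make this concrete, here is a minimal, hypothetical probe in Python: a best-effort liveness check combining a timeout with a few retries. The TCP-connect check, the timeout value, and the retry count are all illustrative assumptions, not a prescription.

```python
import socket

def probe(host: str, port: int, timeout_s: float = 1.0, retries: int = 3) -> bool:
    """Best-effort liveness check: try to connect a few times before giving up.
    A False result only means "unreachable from here, right now" -- it cannot
    distinguish a dead node from a slow or partitioned network."""
    for _ in range(retries):
        try:
            with socket.create_connection((host, port), timeout=timeout_s):
                return True
        except OSError:
            continue
    return False
```

A node that fails such a probe should only be suspected; the final verdict is better left to a quorum of observers, as discussed next.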
In the end, a node is an island; on its own, it can hardly grasp the overall picture in the ocean of a distributed system.
So nodes need to work together and collaborate. This is exactly what the various quorum algorithms mentioned in the previous article provide, so I won't repeat them here.
Fixing Unreliable Clock Issues
The first approach: since it is so hard to keep clocks globally consistent, bypass the problem and don't use a clock at all.
As mentioned in the previous section, a clock is essentially just a counter, so the worst case is that we swap it for a different kind of counter.
Think about data consistency: what we actually pursue is the first attribute of time, namely order. For that, an auto-incrementing ID can be used to identify the sequence of events, which is equivalent to a logical clock.
This method was first proposed by Leslie Lamport, the author of the famous Paxos, so this kind of logical clock is also called a Lamport timestamp.
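As a rough illustration, here is a minimal Lamport clock sketch in Python; the class and method names are mine, not from any particular library.

```python
class LamportClock:
    """A Lamport timestamp: just a counter, no wall clock involved."""

    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Local event: advance the counter."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Attach the current timestamp to an outgoing message."""
        return self.tick()

    def receive(self, remote_time: int) -> int:
        """On receipt, jump ahead of everything the sender has seen."""
        self.time = max(self.time, remote_time) + 1
        return self.time
```

If event a happened before event b, then the Lamport timestamp of a is smaller than that of b; the reverse implication does not hold, which is one reason order alone does not cover every scenario.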
However, even with auto-incrementing IDs, nodes still need to negotiate IDs with each other, just as they would synchronize clocks, or rely on a central ID-issuing service, which drags down performance.
So let's take a step back and not pursue strong consistency; causal consistency is sufficient in most application scenarios.
Following this reasoning, the answer presents itself: the version numbers and vector clocks we talked about earlier!
For details, please review the previous article, which will not be repeated here.
Although this satisfies most scenarios, some scenarios still demand stronger consistency, and failure detection, which relies on the duration attribute of time, cannot be replaced by an ordinary counter.
The second approach is therefore to face the problem head-on and produce a truly consistent time.
The most representative is Google's TrueTime API:
- Each data center has a number of time masters that serve as the clock reference within that data center.
- Each machine runs a time slave daemon that keeps its time synchronized with the masters in its data center.
- Most time masters are equipped with GPS receivers and synchronize time from satellites, avoiding the influence of land-based network equipment.
- The remaining time masters are equipped with atomic clocks, which derive time from atomic resonance frequencies, with an error of about one second in 20 million years.
- Besides synchronizing from GPS or atomic clocks, the masters also cross-check each other and compare against their own local clocks, evicting themselves if they appear abnormal.
- Every 30 seconds, each slave polls multiple masters (possibly in different data centers) and rejects liars using the Marzullo algorithm.
- Under typical settings, the local clock drifts by up to 200 μs/s; combined with the 30-second synchronization interval, this gives a theoretical maximum error of 6 ms, which, plus an average transmission overhead of about 1 ms, yields an actual error range of 0-7 ms.
The figure above shows Google's benchmark across thousands of machines in multiple data centers. The overall error is well controlled, with 99% of errors within 3 ms. On the left, after network optimization on March 31, the error decreased further and stabilized, with 99% of errors within 1 ms. The spike in the middle of the right-hand graph was caused by planned maintenance on two masters.
Traditional clocks hand you an exact time, but that is an illusion: they look error-free while actually carrying unbounded time uncertainty.
TT.now(): returns TTinterval: [earliest, latest]
TT.after(t): true if t has definitely passed, i.e., t < TT.now().earliest
TT.before(t): true if t has definitely not arrived, i.e., t > TT.now().latest
As the main TrueTime APIs above show, what TrueTime provides is a guarantee of bounded time uncertainty.
Synchronization between nodes inevitably involves varying transmission and processing overheads, so absolute consistency is impossible, but TrueTime can guarantee that the real time lies within this very small interval.
Since TrueTime returns an interval, establishing the order of two timestamps requires that the two intervals do not overlap, that is, t1.latest < t2.earliest.
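TrueTime itself is internal to Google, so the snippet below is only a toy model of the interval semantics described above, with the uncertainty bound epsilon treated as a given input rather than something measured from real masters.

```python
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float
    latest: float

def tt_now(epsilon_s: float = 0.007) -> TTInterval:
    """Toy TT.now(): wrap the local clock in an assumed +/- epsilon bound."""
    t = time.time()
    return TTInterval(t - epsilon_s, t + epsilon_s)

def tt_after(t: float, now: TTInterval) -> bool:
    """TT.after(t): t has definitely passed."""
    return t < now.earliest

def tt_before(t: float, now: TTInterval) -> bool:
    """TT.before(t): t has definitely not arrived."""
    return t > now.latest

def definitely_ordered(t1: TTInterval, t2: TTInterval) -> bool:
    """t1 is definitely before t2 only if the two intervals do not overlap."""
    return t1.latest < t2.earliest
```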
TrueTime is heavily used in Google Cloud's distributed database Spanner. In Spanner, to guarantee the serializability of two transactions, after the first transaction t1 commits, the second transaction may only commit once TT.now().earliest > t1.latest, the so-called commit wait.
Of the two approaches above, the logical clock cannot support many scenarios, while the new physical clock depends too heavily on specialized hardware. Hence a third implementation, called Hybrid Time, was born: it combines an NTP-based physical clock with an auto-incrementing logical clock to jointly determine the order of events.
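Hybrid Time's exact implementation is not shown here; the sketch below only illustrates the general idea, in the spirit of hybrid logical clocks, and the class and field names are made up for illustration.

```python
import time

class HybridClock:
    """Simplified hybrid clock: timestamps are (physical_ms, logical) pairs
    that stay monotonic even if the NTP-synced wall clock stalls or jumps back."""

    def __init__(self):
        self.physical = 0  # highest physical component seen so far, in ms
        self.logical = 0   # logical counter used as a tie-breaker

    def now(self) -> tuple:
        """Timestamp for a local event or an outgoing message."""
        wall = int(time.time() * 1000)
        if wall > self.physical:
            self.physical, self.logical = wall, 0
        else:
            self.logical += 1  # wall clock hasn't advanced: fall back to the counter
        return (self.physical, self.logical)

    def update(self, remote: tuple) -> tuple:
        """Merge a (physical, logical) timestamp received from another node."""
        wall = int(time.time() * 1000)
        r_phys, r_log = remote
        if wall > self.physical and wall > r_phys:
            self.physical, self.logical = wall, 0
        elif r_phys > self.physical:
            self.physical, self.logical = r_phys, r_log + 1
        elif self.physical > r_phys:
            self.logical += 1
        else:  # equal physical components
            self.logical = max(self.logical, r_log) + 1
        return (self.physical, self.logical)
```

Comparing such pairs lexicographically gives an order that respects causality while staying close to physical time.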
In addition, when discussing logical clocks earlier, I mentioned that a centralized ID-issuing service may drag down performance, but that is not absolute. If the service does nothing else and the network among the nodes is of good, stable quality, for example all within the same IDC, it can be regarded as a fourth clock scheme. In fact, Google Percolator already has such an implementation, called Timestamp Oracle. I won't expand on it here; interested readers can look into it on their own.
TL;DR
The first half of this article recaps the data consistency issues discussed in recent articles, so only the second half is summarized here.
- Data consistency problems that appear to be caused by server failures are in fact caused by the uncertainty inherent in distributed systems.
- Specifically, unreliable networks and unreliable clocks are the sources of consistency problems.
- For unreliable networks, consensus algorithms such as Paxos have shown a way out.
- For unreliable clocks, there are decentralized logical clocks such as Vector Clock, new physical clocks such as the TrueTime API, hybrid clocks such as Hybrid Time, and centralized logical clocks such as Timestamp Oracle.
__
At this point, the chapter on data consistency comes to an end. Of course, I have hinted at the exactly-once problem several times before and have not forgotten it; I will pick it up at a suitable point later.
Now that the consistency problem is solved, can we enjoy the scalability of distributed systems with peace of mind?
Is a distributed system really completely distributed?
In the next article, let's take a look at the centralization problem in distributed systems.

With 3 replicas, the data for key K is saved on nodes B, C, and D, which are called the preference list of key K. The first node, B, is the coordinator: it responds to the client's read and write requests and replicates the data to its clockwise successors C and D.
The advantage of consistent hashing over ordinary hashing is obvious: adding or removing a node only affects its adjacent nodes instead of redistributing data across all nodes, which makes the system much more scalable.
But native consistent hashing is not good enough:
- The randomness of node positions leads to imbalanced data and load.
- Differences in machine performance are not reflected.
Therefore, Dynamo makes some improvements on top of native consistent hashing:
- Split each physical node into a different number of virtual nodes according to its performance, and form the consistent hash ring from the virtual nodes.
- Divide the consistent hash ring into M equal-sized ranges, with each node taking an equal share of them.
These two improvements account for differences in machine performance, spread data evenly, and fully dissipate the load pressure.
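As a rough sketch of the mechanism (not Dynamo's actual code), the snippet below builds a ring from virtual nodes, with a hypothetical per-node virtual-node count standing in for performance differences, and walks the ring clockwise to produce a preference list.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes: dict):
        """nodes maps a physical node name to its number of virtual nodes
        (more virtual nodes for more powerful machines)."""
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node, vnode_count in nodes.items()
            for i in range(vnode_count)
        )
        self._keys = [h for h, _ in self._ring]

    def preference_list(self, key: str, n: int = 3) -> list:
        """Walk clockwise from the key's position and collect the first n
        distinct physical nodes; the first one is the coordinator."""
        start = bisect.bisect(self._keys, _hash(key)) % len(self._ring)
        result = []
        for i in range(len(self._ring)):
            node = self._ring[(start + i) % len(self._ring)][1]
            if node not in result:
                result.append(node)
            if len(result) == n:
                break
        return result

ring = HashRing({"A": 4, "B": 8, "C": 4, "D": 8})  # hypothetical nodes and weights
print(ring.preference_list("K"))  # three distinct nodes; the first is the coordinator
```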
Short-term Failure Recovery
Dynamo's experience in practice is that short-term machine failures are frequent while permanent failures are relatively rare, so the two are handled with different strategies.
Short-term failures are handled with gossip + hinted handoff.
Because of its decentralized architecture, Dynamo has no central place to maintain membership information.
So when nodes briefly disconnect and recover, or when capacity is expanded as planned, the gossip protocol is used to synchronize membership state.
Gossip, as the name suggests, spreads information peer to peer, just like rumors and small talk.
At regular intervals, each node picks a random peer and synchronizes membership information with it; in this way, the membership data itself also reaches eventual consistency via the gossip protocol.
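A toy version of one gossip round might look like the following; the membership-table format and the "higher version wins" merge rule are simplifying assumptions.

```python
import random

# Each node keeps a membership table: node name -> (version, status).
def merge(into: dict, other: dict) -> None:
    """Merge another table into ours; a higher version wins for each entry."""
    for node, (version, status) in other.items():
        if node not in into or into[node][0] < version:
            into[node] = (version, status)

def gossip_round(local_view: dict, peer_views: dict) -> None:
    """One gossip round: pick a random peer and exchange membership tables
    in both directions, so information spreads epidemically."""
    peer = random.choice(list(peer_views))
    merge(local_view, peer_views[peer])
    merge(peer_views[peer], local_view)
```

Repeating such rounds periodically is enough for every node's view of the membership to converge eventually.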
With membership changes taken care of, the next issue is synchronizing data while membership is changing.
When a node failure means the required number of write replicas cannot be met, Dynamo selects another node to store the data temporarily.
The chosen node keeps this data in a separate local space, along with metadata recording which failed node it was intended for, the hint. When the failed node recovers, the temporarily stored data is handed off to it.
Hinted handoff thus meets the durability requirement without breaking the consistent hashing layout, making it a good way to deal with short-term failures.
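The bookkeeping behind hinted handoff can be sketched roughly as follows; `hinted`, `write_with_hint`, and `send_to` are hypothetical names standing in for Dynamo's real storage and RPC layers.

```python
# intended_node -> list of (key, value) pairs being held on its behalf
hinted = {}

def write_with_hint(key: str, value: bytes, intended_node: str) -> None:
    """The intended replica is down: keep the data locally, tagged with a
    hint that records which node it really belongs to."""
    hinted.setdefault(intended_node, []).append((key, value))

def on_node_recovered(node: str, send_to) -> None:
    """When the failed node comes back, hand the temporarily stored data off
    to it and drop the local copy."""
    for key, value in hinted.pop(node, []):
        send_to(node, key, value)
```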
Permanent Failure Recovery
Permanent failures cannot be handled the same way as short-term ones; otherwise temporarily stored data would keep accumulating and hurt node stability.
When a new node joins, it needs to pull from other nodes the data it is now responsible for on the consistent hash ring. Since copies may be inconsistent, Dynamo applies an anti-entropy strategy to check consistency during synchronization.
Concretely, Merkle trees are used to summarize and compare the data, reducing the amount of data actually transmitted.
A Merkle tree is a multi-layer tree in which each parent node stores the hash of its children. Each Dynamo node builds one such tree for every key range it is responsible for.
During synchronization or consistency checks, the two sides only need to compare hashes layer by layer from the top down to locate the exact differing data; then only the differences need to be transmitted, and the replicas converge as quickly as possible.
This incremental transfer of differential data is naturally much faster than full transfer.
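A toy illustration of the comparison, assuming each replica has already hashed its key range into the same number of buckets (real implementations differ in how the leaves are chosen):

```python
import hashlib

def leaf_hashes(bucket_values: list) -> list:
    """Hash each bucket of the key range into a leaf."""
    return [hashlib.sha1(v).digest() for v in bucket_values]

def next_level(hashes: list) -> list:
    """Combine adjacent hashes pairwise to build the next level up."""
    if len(hashes) % 2:
        hashes = hashes + [hashes[-1]]
    return [hashlib.sha1(a + b).digest() for a, b in zip(hashes[::2], hashes[1::2])]

def merkle_root(leaves: list) -> bytes:
    level = leaves
    while len(level) > 1:
        level = next_level(level)
    return level[0]

# If the roots match, the whole key range is identical and nothing is sent;
# otherwise, recurse into the subtrees whose hashes differ.
replica_a = leaf_hashes([b"v1", b"v2", b"v3", b"v4"])
replica_b = leaf_hashes([b"v1", b"v2", b"v3-new", b"v4"])
print(merkle_root(replica_a) == merkle_root(replica_b))  # False: drill down to find the third bucket differs
```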
Read and Write Mode
In pursuit of performance and availability, Dynamo adopts a read-write mode similar to the leaderless replication mentioned in the previous article: any node can accept requests for any key.
Typically, clients have two ways to send read and write requests:
- The client sends the request to a load balancer, which forwards it to a node based on load. If that node is not among the first few entries of the key's preference list, it forwards the request on to the corresponding coordinator.
- If the client knows the data partitioning, it sends the request directly to the corresponding coordinator.
In the last article, we mentioned that Paxos uses quorums to replicate data and reach consensus. Dynamo uses a similar method, except that it does not require more than half of the nodes, which is why it is also called a partial quorum.
Let N be the number of replica nodes for a piece of data, W the number of nodes a write must reach, and R the number of nodes a read must query.
The client writes data to W nodes and then reads from R nodes. When W + R > N, the write set and the read set must overlap, so the latest data can always be read and any potential conflict can be detected.
The specific values of W and R can be tuned as needed: the larger W is, the stronger the durability guarantee but the slower the writes; the larger R is, the slower the reads.
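The overlap argument can be made concrete with a toy partial-quorum read/write over in-memory "replicas"; the versioning and node selection below are deliberately simplistic assumptions.

```python
# Toy partial quorum: N in-memory replicas, write to W, read from R.
# With W + R > N, at least one replica in every read set holds the latest write.
N, W, R = 3, 2, 2
replicas = [dict() for _ in range(N)]  # each replica: key -> (version, value)

def write(key, value, version):
    acked = 0
    for rep in replicas:
        if acked == W:
            break  # pretend the remaining replicas are slow or down
        rep[key] = (version, value)
        acked += 1

def read(key):
    # Query R replicas (here: the last R, to show the overlap still works)
    answers = [rep.get(key) for rep in replicas[-R:]]
    answers = [a for a in answers if a is not None]
    return max(answers) if answers else None  # pick the highest version seen

write("k", "hello", version=1)
print(read("k"))  # (1, 'hello'): the overlap guarantees the latest version is seen
```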
Taking the common N = 3 as an example, there are several options:
- W = 1, R = 3: fast writes, slow reads
- W = 2, R = 2: balanced reads and writes
- W = 3, R = 1: slow writes, fast reads
The figure below shows a real benchmark, where Lr is the 99.9th-percentile read latency, Lw is the 99.9th-percentile write latency, and t is the duration of data inconsistency.
Taking the YMMR column as an example and comparing the rows (R=1, W=1) and (R=2, W=1), we can see that as R increases, read latency rises from 5.58 to 32.6, while the inconsistency duration drops from 1364 to 202.
Conflict Resolution
The decentralized architecture allows multiple nodes to write, and concurrent writing makes data conflicts inevitable.
Dynamo uses the vector clocks mentioned above to resolve data conflicts, which will not be repeated here.
As mentioned earlier, vector clocks provide causal consistency but cannot resolve conflicts from concurrent writes: once concurrent writes occur, the system cannot tell which came first or which result should win.
At this point, a client query can run into two situations:
- If only one version is found, there was no concurrent write, or the versions have already converged, and the result can be used directly.
- If multiple versions are found, the concurrent results could not be fully converged, so all of them are returned to the client, which picks what it needs according to the application scenario.
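To make the sibling detection above concrete, here is a small, illustrative comparison of two vector clocks (not Dynamo's actual code): two versions are kept as siblings exactly when neither clock dominates the other.

```python
def compare(vc_a: dict, vc_b: dict) -> str:
    """Classify two vector clocks as ordered, equal, or concurrent."""
    nodes = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(n, 0) <= vc_b.get(n, 0) for n in nodes)
    b_le_a = all(vc_b.get(n, 0) <= vc_a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a<=b"        # b descends from a: keep b, discard a
    if b_le_a:
        return "b<=a"        # a descends from b: keep a, discard b
    return "concurrent"      # siblings: both versions go back to the client

print(compare({"B": 2, "C": 1}, {"B": 2, "C": 2}))  # a<=b
print(compare({"B": 3}, {"B": 2, "C": 2}))          # concurrent
```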
Dynamo has been criticized for this, but that is the price of trading away consistency, and Dynamo's extensive use in Amazon's production environments shows that the price is acceptable.
These are the aspects of Dynamo most worth our attention. The following figure lists the main problems and solutions from the Dynamo paper, which we have basically covered above.
As a typical successful example of a weakly consistent distributed system, Dynamo needs a whole series of optimizations beyond conflict resolution before it can be used widely in production. This is the gap between theory and practice that has to be bridged.
TL;DR
- The preventive, strongly consistent models offer good consistency but sacrifice performance and availability.
- The "pollute first, clean up later" weakly consistent models lean toward performance and availability, settling consistency after the fact, and are also widely used.
- Weak consistency models include client-centric consistency, eventual consistency, and causal consistency, where causal consistency can be regarded as a special case of eventual consistency.
- Weak consistency allows divergence, but to be usable, conflicts must be resolved.
- Common conflict resolution methods include Last Write Wins, version numbers, vector clocks, and CRDTs.
- Amazon's Dynamo is a classic example of a weakly consistent distributed system.
- Dynamo uses an improved consistent hashing algorithm for partitioning and data location.
- Dynamo uses gossip + hinted handoff to handle short-term failures.
- Dynamo uses Merkle trees, an anti-entropy strategy, to reduce data transfer for permanent failure recovery.
- Dynamo uses a W + R partial quorum, similar to leaderless replication, for reads and writes.
- Dynamo uses vector clocks to resolve data conflicts, providing causal consistency.
__
In this article, we introduced the second approach to data consistency: the "pollute first, clean up later" style of weak consistency. Taking Dynamo as an example, we looked at the problems a weakly consistent distributed system must solve before it can be used in production.
Recall that in article 10 of the series we covered implementing distributed transactions with strongly consistent 2PC/3PC. Although widely used in databases, their shortcomings are also obvious.
Now that we understand weakly consistent distributed algorithms, can we build distributed transactions on top of weak consistency to address problems such as performance and availability? In the next article, let's look at another possibility for distributed transactions.
This is a carefully conceived series of 20-30 articles. I hope to give everyone a core, foundational understanding of distributed systems in a story-telling way. Stay tuned for the next one!