Disclaimer: This is a translated work of Qinxia's 漫谈分布式系统, all rights reserved to the original author.
Starting from the 8th article in the series, we introduced high availability and data replication, and then went deep into the problem of data consistency. This chapter now spans 6 articles.
The content involves many concepts and theories, and it is one of the core challenges faced by distributed systems, so it is worth summarizing here first.
First, the only way to achieve high availability is data replication. Of course, data replication also brings other benefits, such as performance improvements.
For the choice of masters and slaves in data replication, we introduced 3 methods:
For the timeliness of data replication, we also introduced 3 methods:
Then we found that multi-master concurrent writes may cause data conflicts, and the replication lag introduced by asynchronous replication may make the latest data unreadable. Both of these lead to data consistency problems.
Data consistency problems make the system untrustworthy and cause many practical problems, so they must be solved.
Methods to address data consistency fall into two broad categories:
The preventive methods are subdivided into three categories:
The "pollute first, clean up later" methods can be divided into two categories:
Distributed transactions can be used to solve data consistency problems, but they are also a relatively independent and very important application in their own right. Since data consistency comes in two types, strong and weak, distributed transactions can likewise be summarized into two kinds of implementations:
At this point, we can summarize and theoretically generalize some consistency models. These models are, in essence, the promises and guarantees a distributed system makes to the outside world, so that external systems can build on those guarantees and use it with confidence, rather than holding unrealistic expectations of a complete black box.
There are so many categories and implementations of consistency models that we cannot and do not need to list them all. Only the main models covered in our previous articles are summarized here.
strong consistency models
weak consistency models
Client-centric consistency models do not require complete consistency on the server side; they only pursue consistency as observed by each individual client.
From the summary above, it is not hard to see how important and far-reaching the consistency problem is; we have tried every means to solve it.
As mentioned earlier, the consistency problem arises because we want high availability on top of scalability. A server may go down, or even fail unrecoverably, so we have to keep multiple replicas, and the data across those replicas must be synchronized to stay identical. Done poorly, this produces inconsistency.
Seen this way, server failures are the root cause of consistency problems. But is that really so?
Partially, but not entirely.
Let's step back and go back to the original stand-alone system to see if we can get to the root of the problem.
In a stand-alone system, a program that receives a specific input has only two possible results:
It returns a specific output.
It returns an error when a failure is encountered.
For a series of inputs, the stand-alone system processes them in their chronological order of arrival and produces outputs in that same order.
Therefore, the response of a stand-alone system to a specific input is deterministic. This determinism shows in two aspects:
Output content. When a failure such as disk corruption occurs, the system would rather crash than give a different output.
Output order. When a failure occurs, the system would rather halt and resume after recovery than give outputs in a different order.
This determinism is very important: it is a strong guarantee the stand-alone system provides to the outside world, and external systems can rely on it with confidence.
At the same time, this determinism is a one-to-one relationship, which in turn allows the outside world to infer the system's state from the returned results.
But in a distributed system, to ensure availability, the system as a whole is allowed to keep running after a partial failure. Yet the number, location, and duration of partial failures are all uncertain, which leaves the system in a nondeterministic state.
The same input from the outside world can no longer necessarily get the same output; it is no longer possible to judge the state of the system based on the output results.
This judgment is sound: the uncertainty caused by partial failures is the root cause of data consistency problems. But it is still a bit abstract. Digging deeper, what exactly causes this uncertainty?
Look at the simplified network topology diagram above. With role attributes removed, a distributed system is just a graph of many nodes.
These nodes, like islands in the Pacific Ocean, exist in isolation, knowing almost nothing about the outside world, and can only explore it through the single channel available: a peer-to-peer network.
If a node sends a message over the network and gets a response, it learns more about the other party; but if it gets no response, it cannot even draw the negative conclusion that the other party is offline.
Because the actual topology looks more like this:
The nodes are not directly connected; they are joined through complex network devices such as switches. These devices (and even the network links themselves) can fail, congest, and so on.
Therefore, for a node that issues a network request, there are actually two variables: the peer node and the network.
This is like a logical AND operation: if the result is 1, we can conclude that both are normal; but if the result is 0, there are three possibilities:
The peer node is abnormal but the network is normal.
The peer node is normal but the network is abnormal.
The peer node and the network are abnormal.
Thus, unlike the one-to-one relationship of the stand-alone system, we now have a one-to-many relationship, and it becomes impossible to infer the system state from the returned result.
Worse, an abnormal node can be further subdivided: it may be truly dead (crashed) or merely in suspended animation (e.g. paused by a long GC). The only way to distinguish the two is to probe with a timeout. But what should the timeout be set to? Every system and environment has its own empirical value, and none is guaranteed 100% correct. This is the so-called unbounded timeout problem.
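The ambiguity can be made concrete with a small sketch. This is a hypothetical model, not any real probing API: it simply shows that, from the requester's side, every failing combination collapses into one identical observation.

```python
def probe(peer_alive: bool, network_alive: bool) -> str:
    """Model one request: success only if both peer and network work."""
    if peer_alive and network_alive:
        return "response"
    return "timeout"  # every failure combination collapses to this

# The three failure cases listed above, as seen by the caller:
outcomes = {
    probe(peer, net)
    for peer, net in [(False, True), (True, False), (False, False)]
}
print(outcomes)  # {'timeout'} -- one observation, three possible causes
```

One observation, three indistinguishable causes: this is exactly why a timeout alone cannot tell a dead peer from a dead network.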
That's one of the big problems with distributed systems -- unreliable networks.
As for the order of a series of messages, there is no longer a single clock to determine it, as there is in a stand-alone system.
Usually a local clock guarantees local order, and a globally synchronized clock guarantees overall order. But global clock synchronization is hard to make both accurate and efficient, whether with standard NTP or a home-grown synchronization protocol.
This also makes it difficult to achieve total order accurately and efficiently in a distributed system.
This is another big problem faced by distributed systems - unreliable clocks.
More than that, think about it carefully, what is time, and what is the use? Even, does time really exist?
This question can be viewed from many angles, such as philosophy, physics and so on.
Let's look at it from a computer's point of view.
Most modern computers and programming languages adopt the Unix convention: starting from 00:00:00 on January 1, 1970, the count increments by 1 (or by 1000, depending on data type and precision) for every second that elapses. This is the so-called timestamp.
So a computer describes time as a count of timestamps; the human-readable datetime, with its year, month, and day, is just a converted display form of that count.
And since time is a count, subtracting two of them is also meaningful: that is the familiar time interval (duration).
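These three faces of the same count, the raw number, its human-readable display, and the difference between two counts, can be seen directly in Python's standard library:

```python
from datetime import datetime, timedelta, timezone

# A timestamp is just a count of seconds since 1970-01-01T00:00:00Z.
epoch = datetime.fromtimestamp(0.0, tz=timezone.utc)
print(epoch.isoformat())  # 1970-01-01T00:00:00+00:00

# Human-readable datetime is only a converted display of that count:
# 86,400 seconds of counting shows up as "one day later".
later = datetime.fromtimestamp(86_400.0, tz=timezone.utc)
print(later.date())  # 1970-01-02

# Subtracting two counts yields the third face of time: an interval.
interval = later - epoch
print(interval == timedelta(days=1))  # True
```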
So, in fact, time has three dimensions for us: order, interval, and representability.
Corresponding to the distributed system:
Order is used to determine the sequence of events, ensuring data consistency.
Intervals are used to bound expected events, such as heartbeats for liveness probing and delays in event processing.
Representability, as in other scenarios, is used by humans to map time onto the real world.
Once a global clock cannot be guaranteed, a deviation in representability is tolerable; humans are not that sensitive to time. But strong data consistency can no longer be guaranteed, and node failure detection may be misjudged (it is affected by both the network and the clock), with potentially serious consequences.
Unreliable networks and unreliable clocks, these two major problems, are the root causes of data consistency problems in distributed systems.
Now that the root causes are found, it seems that solving these two problems would fundamentally solve the consistency problem.
There is no simple and fundamental solution to an unreliable network.
The network itself, whether transit devices such as switches or transmission media such as fiber and telephone lines, can physically fail just like a server; this cannot be completely avoided.
Improving hardware stability and performance, and improving operations efficiency, can certainly raise network quality, but it cannot completely solve the problem.
Therefore, we can only explore by sending requests, and then infer the status of the other party based on the returned results.
And whether you tune the timeout for failure detection or add multi-step verification, you can only provide best-effort guarantees.
In the final analysis, a node is an island, and it is difficult to understand the overall situation by itself in the ocean of distributed systems.
So nodes need to cooperate and collaborate. This is the approach of the various quorum algorithms mentioned in previous articles, which I won't repeat here.
Now for unreliable clocks. The first method: since keeping clocks globally consistent is so difficult, bypass the problem and use no clock at all.
After all, as the previous section showed, a clock is essentially just a counter; at worst, we can swap in a different counter.
Think about data consistency: what we actually pursue is the first dimension of time, order. For that, an auto-incrementing ID can identify the sequence, which amounts to a logical clock.
The first to propose this method was Leslie Lamport, the author of the famous Paxos, so this kind of logical clock is also called a Lamport timestamp.
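As a minimal sketch of the idea (a single-process simulation; the real algorithm runs across message-passing nodes), the clock is literally just a counter that ticks on local events and merges on received messages:

```python
class LamportClock:
    """Minimal Lamport logical clock: a counter instead of a wall clock."""

    def __init__(self):
        self.time = 0

    def tick(self) -> int:  # advance on a local event
        self.time += 1
        return self.time

    def send(self) -> int:  # attach the current time to an outgoing message
        return self.tick()

    def receive(self, msg_time: int) -> int:  # merge on an incoming message
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.send()           # a advances to 1
t_recv = b.receive(t_send)  # b jumps to max(0, 1) + 1 = 2
print(t_send < t_recv)      # True -- causal order preserved by the counter
```

Note the guarantee is one-directional: if event x causally precedes y, then x's timestamp is smaller; equal or unordered timestamps say nothing about causality.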
However, even with auto-incrementing IDs, nodes must either negotiate IDs with one another, just like synchronizing clocks, or rely on a central ID-issuing service, which drags down performance.
Then take a step back and do not pursue strong consistency. Causal consistency is sufficient in most application scenarios.
Following this deduction, the answer presents itself: it is the Version Number and Vector Clock we covered earlier!
For details, please review the previous article, which will not be repeated here.
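As a quick refresher, here is a minimal sketch of how vector clocks separate causal order from true conflict (illustrative only; the node names are made up):

```python
def vc_compare(a: dict, b: dict) -> str:
    """Compare two vector clocks: 'before', 'after', 'equal', or 'concurrent'."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happened-before b: safe to overwrite
    if b_le_a:
        return "after"
    return "concurrent"      # neither causally precedes: a real conflict

print(vc_compare({"n1": 1}, {"n1": 2}))                     # before
print(vc_compare({"n1": 2, "n2": 1}, {"n1": 1, "n2": 2}))   # concurrent
```

The "concurrent" branch is exactly where causal consistency stops short of strong consistency: the system detects the conflict but must resolve it by some other means.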
Although this satisfies most scenarios, the consistency requirements of some scenarios still cannot be met, and node failure detection, which needs the interval dimension of time, cannot be replaced by an ordinary counter.
So the second way is to face the problem head-on and produce a truly consistent time.
The most representative is Google's TrueTime API:
Each data center has several time masters that serve as its clock reference.
Each machine runs a time slave daemon that keeps it synchronized with the masters in its data center.
Most time masters are equipped with GPS receivers to synchronize time from satellites, avoiding the influence of terrestrial network equipment.
The remaining time masters are equipped with atomic clocks, which derive time from atomic resonance frequencies, with an error of about one second in 20 million years.
Besides synchronizing time from GPS or atomic clocks, the masters also cross-check one another and compare against their own local clocks, evicting themselves when anomalous.
Every 30 seconds, each slave pulls time from multiple masters (possibly in different data centers) and rejects the liars using the Marzullo algorithm.
Under typical settings, the local clock drifts up to 200 us/sec. Combined with the 30-second synchronization interval, this gives a theoretical maximum error of 6 ms; adding an average transmission overhead of about 1 ms, the actual error range is 0-7 ms.
The figure above shows Google's benchmark across thousands of machines in multiple data centers. The overall error is well controlled, with 99% of errors within 3 ms. In the left plot, after a network optimization on March 31, the error dropped further and stabilized, with 99% of errors within 1 ms. The spike in the middle of the right plot was due to planned maintenance on 2 masters.
A traditional clock provides an exact time, but that exactness is an illusion: it looks error-free, yet actually carries unbounded time uncertainty.
As the main TrueTime API above shows, what TrueTime does is provide a guarantee of bounded time uncertainty.
Negotiation between nodes inevitably involves varying transmission and processing overheads, so absolute consistency is impossible; but TrueTime can guarantee that the true time lies within an extremely small interval.
Since TrueTime returns an interval, to be certain of the order of two times, the two intervals must not overlap, i.e. t1.latest < t2.earliest.
TrueTime is heavily used in Spanner, Google Cloud's distributed database. In Spanner, to guarantee the serializability of two transactions, the second transaction may only commit once TT.now() > t1.latest, where t1 is the commit time of the first transaction; this is the so-called commit wait.
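The interval comparison behind commit wait can be sketched as follows. This is a toy model, not Spanner's actual API, and the numbers are made up; it only shows why non-overlapping uncertainty intervals are what make ordering certain:

```python
from dataclasses import dataclass

@dataclass
class TTInterval:
    """TrueTime-style bounded uncertainty: real time lies in [earliest, latest]."""
    earliest: float
    latest: float

def definitely_before(t1: TTInterval, t2: TTInterval) -> bool:
    # Order is certain only when the two uncertainty intervals do not overlap.
    return t1.latest < t2.earliest

t1 = TTInterval(earliest=100.000, latest=100.007)  # ~7 ms of uncertainty
t2 = TTInterval(earliest=100.010, latest=100.017)
print(definitely_before(t1, t2))                             # True
print(definitely_before(t1, TTInterval(100.005, 100.012)))   # False: overlap
```

Commit wait is the mirror image of this check: after assigning t1, the system deliberately waits until the present moment's entire interval lies past t1.latest, so any later transaction's interval cannot overlap it.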
Of the two methods above, the logical clock cannot support many scenarios, and the physical clock depends too heavily on specific hardware. Hence a third implementation, called Hybrid Time, was born: it combines an NTP-based physical clock with an auto-incrementing logical clock to jointly determine the order of events.
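The hybrid idea can be sketched as follows. This is a simplified, single-node illustration under my own naming; a real hybrid logical clock also merges timestamps carried on incoming messages:

```python
import time

class HybridClock:
    """Sketch of a hybrid clock: physical time plus a logical counter.

    The physical part tracks the wall clock; the logical part breaks
    ties when the wall clock stalls or runs backwards due to skew.
    """

    def __init__(self, now=time.time):
        self.now = now
        self.physical = 0.0
        self.logical = 0

    def tick(self):
        wall = self.now()
        if wall > self.physical:
            self.physical, self.logical = wall, 0
        else:
            self.logical += 1  # wall clock stalled or skewed: fall back to counting
        return (self.physical, self.logical)

# Even with a frozen wall clock, ordering still advances via the logical part:
clock = HybridClock(now=lambda: 42.0)
print(clock.tick())  # (42.0, 0)
print(clock.tick())  # (42.0, 1) -- the tuples still compare strictly increasing
```

Because the timestamps are (physical, logical) tuples, they compare lexicographically: close to real time when clocks behave, and still totally ordered when they do not.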
In addition, when discussing the logical clock I mentioned that a centralized ID-issuing center may drag down performance, but that is not absolute. If the center does only this one job, and the network between it and the service nodes is good and stable, for example all within the same IDC, it can serve as a fourth clock scheme. In fact, there is already such an implementation: the Timestamp Oracle in Google Percolator. I won't expand on it here; interested readers can study it on their own.
The first half of this article summarized the data consistency issues discussed in recent articles, so there is no need to repeat it; only the second half is recapped here.
Data consistency problems appear to be caused by server failures, but are actually caused by the uncertainty of distributed systems.
Specifically, unreliable networks and unreliable clocks are the source of consistency problems.
To solve the problem of unreliable network, consensus algorithms such as Paxos have given a way out.
To solve the problem of unreliable clocks, there can be decentralized logical clocks typified by Vector Clock, new physical clocks typified by TrueTime API, hybrid clocks exemplified by Hybrid Time, and centralized logical clocks exemplified by Timestamp Oracle.
At this point, the chapter on data consistency comes to an end. Of course, I have hinted at the exactly-once problem several times before; I have not forgotten it, and will pick it up in a suitable place later.
After solving the consistency problem, can we enjoy the power of the scalability of the distributed system with peace of mind?
Is a distributed system really completely distributed?
In the next article, let's take a look at the centralization problem in distributed systems.
This is a carefully conceived series of 20-30 articles. I hope to let everyone have a basic and core grasp of the distributed system in a story-telling way. Stay Tuned for the next one!