Disclaimer: This is a translated work of Qinxia's 漫谈分布式系统, all rights reserved to the original author.
In the previous blog, we talked about why there is a distributed system and when a distributed system is needed.
In this article, I would like to talk about who built the distributed systems of the past and how they did it. More importantly, I want to ask what lessons can be drawn from that history.
Let's take a look at the birth of Hadoop.
Let's sort out the important events in chronological order:
· In October 2003, Google published the paper "The Google File System".
· In early 2004, the Apache Nutch project, led by Doug Cutting, started to develop the NDFS file system for its own needs, based on Google's GFS paper.
· In October 2004, Google published the paper "MapReduce: Simplified Data Processing on Large Clusters".
· At the end of 2004, the Apache Nutch project began to implement its own MapReduce framework based on Google's MapReduce paper.
· In January 2006, Doug Cutting, who was at Yahoo! at that time, split NDFS and MapReduce out of Apache Nutch into a separate project and named it Hadoop.
· In 2007, a Hadoop cluster with more than 1,000 nodes was deployed in a production environment at Yahoo!.
· In 2008, Hadoop became a top-level Apache project and was widely used. The number of nodes at Yahoo! reached 2,000.
· In 2008, Cloudera was established to provide Hadoop-based commercial services and launched its own Hadoop distribution CDH.
· In 2009, MapR was established to provide Hadoop-based business services.
· In 2011, Hortonworks was established to provide Hadoop-based commercial services and launched its own Hadoop distribution HDP.
· In October 2018, Cloudera acquired Hortonworks.
· In 2019, MapR announced that it would have to shut down unless it secured new investment.
As we can see, Hadoop grew out of two papers that Google published early on. A genius then implemented it following the designs in those papers, another big company, Yahoo!, adopted it widely, and it began to spread once it had been proven in large-scale production.
Hadoop's growth shows how the industry gives back to the open-source community, and how the open-source community feeds back into the industry.
Let's take a look at the birth of Spark too.
Similarly, from the perspective of time, let's look at several important early events:
· In 2009, Spark was born at the AMPLab of UC Berkeley.
· In 2010, Spark was open-sourced.
· In 2013, Spark was donated to the Apache Software Foundation. In the same year, the creators of Spark founded a commercial company, Databricks, to provide commercial services with Spark at its core.
· In February 2014, Spark graduated from incubation, became a top-level Apache project, and started to gain popularity.
· In November 2014, Spark broke the Daytona GraySort 100 TB sorting record previously held by Hadoop MapReduce.
Spark went from academia to industry, from closed source to open source to commercialization.
Finally, let's turn to the birth of Flink:
· In 2010, a project called Stratosphere was launched. This project was funded by the German Research Foundation and was jointly promoted by the Technical University of Berlin, Humboldt University of Berlin, and Hasso Plattner Institute.
· In March 2014, Stratosphere entered the Apache Incubator and was renamed Flink. In the same year, it graduated and became a top-level Apache project.
· In 2014, Flink's creators founded the company Data Artisans to provide Flink-based commercial services.
· In 2015, Alibaba began to investigate Flink, gradually applied it to its business, and made many improvements. It maintained its own internal branch, which was named Blink.
· In January 2019, Alibaba acquired Data Artisans (later renamed Ververica) and began to merge Blink into Flink's main branch.
Flink went from academia to open source, and from open source to industry and commercialization. The difference is that the academic side was more diverse, including universities and non-profit research institutions.
Through the above three sections, we have sorted out the development history of the three most mentioned frameworks in the field of big data and distributed systems.
I will be focusing on the following aspects:
· Academia and industry
· Closed source, open-source, and open-source-based commercialization
By analyzing these two dimensions, it is easy to find commonalities, raise some questions, and finally reach some conclusions.
Why was Google the first company to implement a large-scale distributed system?
The core of a search engine is to collect all the web page data in the world, index it, and, on receiving user input, find the matching results, sort them, and return them. This business model meant that Google, the world's largest search engine company, faced massive amounts of data and computation. It was the first company that had to deal with the data storage and slow computation problems mentioned in the previous blog of this series.
The company could not solve these two problems with anything that already existed, and no one had encountered a similar situation before, so there was nothing to refer to. Google therefore designed and implemented the GFS and MapReduce frameworks (there were actually more, but we won't talk about them now). Google applied these frameworks on a massive scale, so that technology did not become a hindrance to business development.
Why does Google want to disclose its results?
The primary purpose of a commercial company is to make money. Only by making money can a company survive, and only by making money can it act responsibly toward its investors.
Not to mention that Google had competitors, such as the early Yahoo!, Baidu, Bing, and so on. Sooner or later, these companies would run into the same technical problems that Google had solved. Wasn't Google helping its competitors by sharing these technologies?
We can't completely rule out that Google had its own interests in mind, such as promoting itself and attracting technical talent, steering the field in the direction it had already taken, and maintaining its lead.
Although Google has removed "Don't be evil" from the company's code of conduct, it has at least had pursuits beyond money (in fact, this can't be denied even now).
This willingness to contribute its own strength and make the world a better place, even in the face of opposition and even to the point of doubting itself and hesitating, is admirable. After all, it does make the world a better place.
It is enough to make people take their hats off.
Companies are under pressure to make money, but what about schools?
When a company encounters technical problems, it may go out of business if it does not solve them. Schools, teachers, and students are different. If they cannot solve a problem, the worst case is that they change the topic.
Money is very important, but life is not just about money.
The sense of accomplishment that comes from solving a difficult problem is hard to replace. Academia has always leaned toward theoretical research, and doctoral students and the laboratories of major universities are often the main force in solving cutting-edge problems. They already experience this sense of accomplishment almost to the fullest.
But if one's achievements can also produce huge economic and social value in the industry, this sense of accomplishment is magnified many times over. This has prompted the academic community to pay more and more attention to the industry, and to join it on the front line of solving practical technical problems.
The ultimate question is: what do we live for in a life of just a few decades?
Is it for a lifetime of pleasure and relaxation? Is it to go down in history, to be respected by later generations, and to bring glory to the family?
I think the pursuit of a sense of accomplishment is enough to drive and satisfy a person for a lifetime. This satisfaction and drive are self-sustaining and spiritual, and therefore stronger and more lasting.
Some may wonder: once the work has been given to open source, why set up a commercial company? And with everything open source, how can it be commercialized?
As mentioned in the previous section, money is important.
We pursue a sense of achievement, but in the meantime, we cannot deny other pursuits.
Only after basic material needs are met is it possible to pursue spiritual satisfaction at the top. The Chinese philosopher Guanzi once said, “When the granaries are full, the people follow appropriate rules of conduct; when there is enough to eat and wear, the people know honor and shame.”
So, since other companies can use the creators' work to help make money, why can't the creators themselves?
The hard thing is how.
Other commercial companies are not bothered by this. They have their own money-making business, and open-source software just solves the technical problems that prevent them from continuing to make money.
But what should be done when the open-source software itself is the core of the business? Where is the business model for making money?
One answer is technology; the other is service.
From a technical point of view, this kind of commercial company is either founded by the creators of the open-source software or recruits committers and PMC members from many projects, ensuring sufficient control over the core technology. The ideal approach is then to launch its own distribution. Good features are implemented first in this distribution, and bugs are found and fixed in it first. Companies like Cloudera and Hortonworks adopted this model. In this way, they differentiate themselves technically.
From a service perspective, such commercial companies tend to provide a full range of solutions. For example, they offer cloud platform support, a complete machine learning suite, and so on. Databricks is taking this path, and Cloudera is also developing in this direction. In addition, 24/7 responsive technical support lets corporate customers entrust their systems to these companies with confidence. In this way, customers' needs are met from a service perspective.
Although the idea is sound, the practice is not easy. The acquisition of Hortonworks by Cloudera and the financial troubles of MapR show that this is not an easy road. Alibaba's acquisition of Data Artisans could be considered a "good outcome" among them.
Commercial companies built around traditional open-source software, such as Oracle and Red Hat, have already worked out mature business models. But how can open-source software in the big data field survive stably? Will being acquired by a commercial company have a negative impact on the open-source software? Only time will tell.
Why is it always big companies, prestigious schools, or large institutions that lead these projects?
Hadoop came from papers published by Google and had its finest moments at Yahoo!. Then Yahoo! quietly receded and Facebook carried the banner. Hive, which we will mention later, was developed at Facebook.
Besides Spark, AMPLab also incubated Alluxio, Mesos, and other star projects, each of which is brilliant.
Even when a project ends up as open source, a non-profit organization like the Apache Software Foundation naturally forms to organize and coordinate community resources and promote project development.
Other companies in the industry, and other schools and institutions in academia, mostly play the role of followers and beneficiaries.
This is normal. As we often say, the course of history does not bend to personal will, but the role of heroes in it is also crucial.
Large companies and prestigious schools are, first of all, the most likely to be the first to face these problems, and they have enough motivation, financial resources, and reputation to ensure that they are capable of solving them.
It all stands to reason.
For an individual, if you are committed to solving these general-purpose problems rather than specific application problems, you should strive to enter these large companies and prestigious schools.
For an individual company, if you encounter technical problems and do not have the financial and technical strength to solve them yourself, the choice between embracing open source and paying for a commercial product is largely a matter of cost.
The exchange between industry and academia has combined theory and practice well, which benefits both sides.
For example, the industry has now developed the academic habit of publishing papers, whereas in the past it produced only product documentation.
What's more, a direct and strong sense of achievement, as well as possible economic benefits, helps the academic community attract more talent to practical research (of course, basic theoretical research is still important).
Open source and commercialization have also become complementary.
Open source allows the general public to participate, promotes the development of the project, and greatly expands the scale at which it is used. Only then will anyone be willing to pay for a commercial version.
Since commercialization needs to make money, the commercial version must differentiate itself from the open-source version. This leads to improvements in technical indicators such as functionality, performance, and stability, or in service indicators such as customization and responsiveness, both of which take the project to a level that open source alone would find hard to reach quickly. Afterwards, whether commercial features are actively open-sourced or the open-source version consciously catches up, the development of the open-source version is promoted in turn.
Therefore, the interaction between industry and academia, as well as the mutual support of open source and commercialization, is really a beautiful thing in the history of human development.
TL;DR
The development histories of the three most mentioned frameworks in the field of big data and distributed systems are:
· Hadoop's route is that the industry gives back to the open source community, and the open source community feeds back to the industry.
· Spark's route is from academia to industry, from closed source to open source to business.
· Flink is from academia to open source, and then from open source to business.
In conclusion, we can gain some insights from the dimensions of academia & industry and open source & business:
· As the world's biggest search engine company, Google was the first to face the storage and computation problems, so it became the pioneer in solving them.
· Google made its achievements public out of goodwill. Open source makes the world a better place.
· The pursuit of a lasting and direct sense of achievement has drawn schools and institutions into the wave of research and development of such distributed systems.
· Making money is a basic need, and commercialization is nothing to be ashamed of. The commercialization model is to differentiate in technology and service.
· Large companies and prestigious schools, like heroes in the course of history, have played a vital role. Those with lofty ideals should follow and join the leaders.
This is a carefully conceived series of 20-30 articles. I hope to give everyone a basic grasp of the core of distributed systems in a storytelling way. Stay tuned for the next one!
Learning about Distributed Systems - Part 1: Why Using Distributed Systems?
Learning about Distributed Systems - Part 3: Solving Short Storage