By Zhang Jianfeng
"Everyone has limited time, so it becomes particularly important to choose a technology worth devoting oneself to."
I have been working with data for 12 years, since 2008. During this period, I have dealt with a plethora of data challenges. Specifically, I have participated in developing underlying big data framework kernels such as Hadoop, Pig, Hive, Tez, and Spark, upper-layer data computing frameworks such as Livy and Zeppelin, and data applications covering data processing, data analysis, and machine learning. I am now an Apache Member and a Project Management Committee (PMC) member for multiple Apache projects. In 2018, I joined Alibaba Cloud's real-time computing team and specialized in Flink R&D.
Today, based on my career experience, I want to discuss how to evaluate whether a technology is worth learning. I have worked on big data technologies from Hadoop to Hadoop's ecosystem projects, then moved on to the new-generation computing engine Spark, and most recently to Flink. So, broadly speaking, the evolution of big data computing engines mirrors the stages of my career. Personally, I have been lucky to work on the popular technologies of each era. But I also have to admit that I mostly chose which technologies to work on based on my own interests and intuition. Summarizing that experience, I have distilled three dimensions for evaluating whether a technology is worth learning:
1) Technical Depth
2) Ecosystem Breadth
3) Evolutionary Ability
Technical depth refers to whether a technology's foundation is solid and hard for other technologies to replace. Generally, technical depth comes down to whether the technology solves major problems that other technologies cannot. This involves two key points:
1) No other technology can solve these problems.
2) Solving these problems produces great value.
Take Hadoop as an example. I learned Hadoop at the beginning of my career. When Hadoop was released, it was a revolutionary technology: no company in the industry had a complete solution for massive data, except Google, which had its internal GFS and MapReduce systems. At the same time, with the development of Internet technologies, data volumes grew by the day, and the ability to process massive data became imperative. The emergence of Hadoop met this urgent need.
Though Hadoop was good at processing massive data, as technologies developed, its defects became increasingly intolerable, such as poor performance and complicated MapReduce programming. At this point, Spark was released and solved the chronic problems of the Hadoop MapReduce computing engine. Spark offered computing performance far surpassing Hadoop's, along with extremely elegant and simple APIs. With these features, Spark met users' various requirements and was widely embraced by big data engineers.
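To make the contrast concrete, here is a minimal word-count sketch in PySpark, with hypothetical input and output paths; the equivalent classic MapReduce job would require separate mapper and reducer classes plus driver boilerplate.

```python
# A minimal PySpark sketch (illustrative): word count in a few lines.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

counts = (
    spark.sparkContext.textFile("hdfs:///tmp/input.txt")  # hypothetical path
         .flatMap(lambda line: line.split())              # split lines into words
         .map(lambda word: (word, 1))                     # pair each word with 1
         .reduceByKey(lambda a, b: a + b)                 # sum counts per word
)
counts.saveAsTextFile("hdfs:///tmp/output")               # hypothetical path

spark.stop()
```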
At present, I am engaged in the research and development of Flink at Alibaba Cloud, because I have seen the industry's demand for real-time processing, and in this field Flink is dominant. Previously, the biggest challenge of big data processing was the sheer scale of the data; this is why we call it "big" data in the first place. After years of effort, the industry has largely solved the data-scale problem. In the next few years, the major challenge will be speed, that is, real-time capability. Real-time capability in big data means not only fast transmission or processing of data but end-to-end real-time performance. If any step in this process is slow, the real-time performance of the entire big data system is compromised.
In Flink, however, everything is a stream. Flink adopts an architecture that is unique in the industry, using streams as its kernel. Superior performance, high scalability, and end-to-end exactly-once semantics make Flink the leader in stream processing.
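As a rough illustration of the stream-first model and the checkpointing mechanism behind Flink's exactly-once guarantee, here is a minimal PyFlink sketch; the job name and data are hypothetical.

```python
# A minimal PyFlink sketch (illustrative), assuming pyflink is installed:
# a streaming job with checkpointing enabled, the mechanism underpinning
# Flink's exactly-once guarantees.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(10_000)  # checkpoint every 10 s; exactly-once is the default mode

# Everything is a stream: even this finite collection is processed as one.
stream = env.from_collection(["flink", "spark", "storm", "flink"])
counts = (
    stream.map(lambda w: (w, 1))                          # pair each word with 1
          .key_by(lambda pair: pair[0])                   # partition the stream by word
          .reduce(lambda a, b: (a[0], a[1] + b[1]))       # running count per word
)
counts.print()

env.execute("stream-sketch")
```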
Currently, there are three mainstream stream processing engines: Flink, Storm, and Spark Streaming.
Note: In Google Trends, Spark Streaming can only be searched as a term rather than as a topic, so in theory the numbers are not directly comparable. However, since we care more about how its trend changes over time, Spark Streaming is included here as well.
According to the preceding Google Trends data, Flink is becoming increasingly popular, Storm is declining year by year, and Spark Streaming has almost come to a halt. This suggests that Flink has a deep technical foundation in stream processing, and no competitor can take its dominant position for now.
Technical depth alone, however, is not enough for a technology to hold its position. Since a single technology focuses on only one specific aspect, it must integrate with other technologies to solve complex real-world problems. This requires the technology to cover a sufficient breadth of the ecosystem. Ecosystem breadth is measured along two dimensions:
1) Upstream and downstream ecology: the systems a technology connects to upstream and downstream, viewed from the perspective of data flow.
2) Vertical ecology: integration into specific domains or application scenarios.
In the beginning, Hadoop had only two basic components, HDFS and MapReduce, which targeted massive storage and distributed computing respectively. However, as technology developed, increasingly complex problems emerged that HDFS and MapReduce alone could not solve. Given this situation, Hadoop ecosystem projects such as Pig, Hive, and HBase emerged and, as part of Hadoop's vertical ecology, solved these problems.
The same happened with Spark. Spark was designed to replace the original MapReduce computing engine. Later, however, Spark developed various language interfaces and upper-layer frameworks, such as Spark SQL, Spark Structured Streaming, MLlib, and GraphX. These interfaces and frameworks greatly enriched Spark's application scenarios and extended its vertical ecology. In addition, Spark supports various data sources, connecting the computing engine with storage. This builds Spark's powerful upstream and downstream ecosystem and lays the foundation for end-to-end solutions.
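The following minimal PySpark sketch illustrates this vertical ecology under hypothetical data and column names: the same engine and the same DataFrame abstraction serve SQL queries and MLlib feature processing alike.

```python
# A minimal sketch (illustrative) of Spark's vertical ecology: one engine
# serving batch, SQL, and machine learning. Data and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("ecosystem-sketch").getOrCreate()

df = spark.createDataFrame(
    [("2020-03-01", 10.0), ("2020-03-02", 12.5)],
    ["day", "sales"],
)
df.createOrReplaceTempView("sales")

# Spark SQL over the same data the core engine processes.
spark.sql("SELECT day, sales FROM sales WHERE sales > 11").show()

# MLlib on the same DataFrame abstraction.
features = VectorAssembler(inputCols=["sales"], outputCol="features").transform(df)
features.select("day", "features").show()

spark.stop()
```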
The Flink ecosystem I am working on is still at an early stage. However, we see in Flink not only a dominant stream processing engine but also an opportunity to build a broader ecosystem around it. Initially, I focused on the core framework layer of big data, but I have gradually moved toward the derived ecosystem projects. This shift is based on my prediction about the big data industry: the first half of the contest, concentrated on the underlying frameworks, has already come to an end. In the future, there will not be many new underlying technologies and frameworks in the big data ecosystem. Meanwhile, the stronger players in each subdivision will survive while the weaker ones are squeezed out, so each subdivision will become more mature and consolidated. The focus of the second half is moving from the lower layers to the upper layers, and finally to the ecosystem. Previously, big data innovations mostly took the form of Infrastructure as a Service (IaaS) and Platform as a Service (PaaS). In the future, you will see more products and innovations delivered as Software as a Service (SaaS).
Every time I discuss the ecology of big data, I refer to the preceding figure. It covers all the big data scenarios required to deal with everyday challenges. The workflow starts from the data producers on the far left, proceeds to data acquisition and data processing, and then to data applications such as business intelligence (BI) and artificial intelligence (AI). Interestingly, Flink can be applied at each of these steps, from big data to AI. However, though strong in stream processing, Flink is still at an early stage in the ecology of these other fields. I am working to improve Flink's end-to-end capabilities, as shown in the preceding figure.
If a technology has both technical depth and ecosystem breadth, it is at least worth learning for now. To evaluate a technology fully, though, it is crucial to consider how it changes over time. Nobody wants the technology they are working on to become obsolete soon, forcing them to learn a new one every year. Therefore, a technology worth learning must have the ability to evolve.
It has been more than 10 years since I first learned Hadoop, and it is still very popular today. Although many public cloud vendors are competing for a larger share of the Hadoop market, you still have to admit that when a company sets up a big data department, the first task is to set up a Hadoop cluster. Today, when we talk about Hadoop, we no longer mean the original Hadoop itself but use it as a general term for the whole Hadoop ecosystem. For more information, read this article by Arun C Murthy, CPO of Cloudera.
Spark is even better in terms of evolutionary ability. Spark's popularity exploded in 2014 and 2015, and the project has now entered a stable phase. However, Spark is still evolving and embracing change. Spark on Kubernetes is the best proof that Spark is embracing cloud-native computing. Meanwhile, Delta Lake and MLflow, two of the most popular projects in the Spark community, show Spark's strong evolutionary ability. Today, Spark is not merely a MapReduce replacement but a general computing engine suitable for various scenarios.
It has been almost a year and a half since I joined Alibaba in 2018, and during this time I have witnessed the evolutionary ability of Flink.
First, over several major releases, Flink has integrated most of Blink's functionality, which greatly improved its SQL capabilities.
In addition, its support for Kubernetes, Python, and AI workloads demonstrates Flink's strong ability to evolve.
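As a small illustration of those SQL capabilities, here is a hedged PyFlink Table API sketch that runs self-contained on the built-in datagen and print connectors; the table and column names are hypothetical.

```python
# A minimal PyFlink Table API sketch (illustrative) of Flink SQL over an
# unbounded stream. Table and column names are hypothetical.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# A synthetic, unbounded source using the built-in datagen connector.
t_env.execute_sql("""
    CREATE TABLE clicks (
        user_id BIGINT,
        url STRING
    ) WITH (
        'connector' = 'datagen',
        'rows-per-second' = '5'
    )
""")

# A sink that prints results to stdout.
t_env.execute_sql("""
    CREATE TABLE click_counts (
        user_id BIGINT,
        cnt BIGINT
    ) WITH ('connector' = 'print')
""")

# Standard SQL over the stream: the aggregate updates continuously.
t_env.execute_sql("""
    INSERT INTO click_counts
    SELECT user_id, COUNT(url) AS cnt FROM clicks GROUP BY user_id
""").wait()  # blocks while the streaming job runs
```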
In addition to the preceding three dimensions, I would like to share some tips for evaluating new technology.
1) Follow Google Trends. Google Trends reflects the development momentum of a technology well. The preceding Google Trends figure compares the three stream processing engines: Flink, Spark Streaming, and Storm. It is easy to conclude that Flink takes the lead in stream processing. You can also query Trends programmatically, as in the sketch after this list.
2) Check the Awesome list on GitHub. The popularity of a technology can be partly seen from its Awesome list on GitHub, including the stars the listed projects receive. Additionally, take a weekend to read through the Awesome list, which curates the essential resources around a technology. Based on this information, you can roughly assess the technology's value.
3) Check whether technical evangelists have endorsed the technology on technology websites such as Medium.com, one of my favorites. In the technical community, there is often a group of people who are enthusiastic about and quick to grasp new technologies. If a technology is genuinely good, these evangelists will endorse it voluntarily and share their experience of using it.
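For tip 1, here is a hedged sketch of querying Google Trends programmatically with pytrends, an unofficial third-party Python client (Google Trends has no official public API), so treat it as illustrative only.

```python
# A hedged sketch using pytrends, an unofficial client for Google Trends
# (assumed installed via `pip install pytrends`).
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US")
pytrends.build_payload(
    ["Apache Flink", "Spark Streaming", "Apache Storm"],  # search terms
    timeframe="today 5-y",                                # last five years
)
interest = pytrends.interest_over_time()  # pandas DataFrame, one column per term
print(interest.tail())
```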
Everyone has limited time, so it becomes particularly important to choose a technology worth devoting oneself to.
This article summarized my thoughts on how to evaluate whether a technology is worth learning, which is also a quick review of my career in terms of technology selection. I hope it is helpful in your career planning.
Zhang Jianfeng (GitHub ID @zjffdu) is a veteran of the open-source community. He is an Apache Member, has worked for Hortonworks, and is now a senior technical expert in the Alibaba Computing Platform Department. He is a PMC member of three open-source projects, Apache Tez, Livy, and Zeppelin, and a committer for Apache Pig. He has been working on big data and open source for years and hopes to contribute to big data and data science in the open-source field.