Watch the replay of the Apsara Conference 2021 at this link!
"Alibaba's evolution from data lakes to data warehouses has made us think about the importance of lake house. Lake house has organically integrated the flexibility and rich data types of data lakes with the growth and enterprise-level management of warehouses. It is a valuable asset of Alibaba's best practices and a new generation of big data architecture," said Lin Wei, Researcher of Alibaba Cloud Intelligence.
Lin Wei, Researcher of Alibaba Cloud Intelligence, Technical Head of MaxCompute and Machine Learning Platform PAI of Alibaba Cloud Intelligence
This article explains the continuous evolution of data warehouses toward offline/real-time integration and the lake house. We explain it in three parts to help you better understand the evolution of cloud-native big data platforms. Reviewing the history from data lakes to data warehouses raises two questions: why is the lake house necessary, and why do we need to build data warehouses featuring offline/real-time integration and the lake house at this stage?
We hope this article will help you better understand why lake house is necessary.
At the Ningbo Strategy Conference in 2007, we decided to establish an open, collaborative, and prosperous e-commerce ecosystem, the core of which is data. At that time, however, each business department was building its own data capabilities vertically to support its business decisions. These data middle platforms supported the development of the individual business units. At a certain stage, we wanted to correlate data across business departments to generate business value, and we ran into many difficulties because the data came from different departments. Different people provided different data sets, there was no clear data quality monitoring, and we did not know whether the data was complete. We had to spend a lot of time calibrating the data, which reduced the overall efficiency of the company.
In 2012, we integrated the data of all business departments to achieve the goal of One Data, One Service. This was a typical process of upgrading data lakes to data warehouses. However, this process was very difficult because we lacked experience building lake houses. We called this process the moon landing, a name that conveys how hard it was. During this period, each team even had to pause its daily business to collate data and migrate its existing data analysis processes to a unified data warehouse system. After 18 months, we completed building a unified big data warehouse platform in December 2015 called MaxCompute. With this unified platform, business teams, service merchants, and logistics partners can explore business opportunities more easily and quickly. Alibaba's business growth accelerated after the platform was completed, because it offered better data support and let merchants and clients make business decisions faster.
From the developer's perspective, data lakes are more flexible, and developers prefer that freedom. Any engine can read and write a data lake without constraints, and a data lake is very easy to get started with.
From the data manager's perspective, data lakes work well at an early stage. Once the data reaches a certain scale and the company starts to treat data as an asset or base business decisions on it, data warehouses are preferred.
The growth curves in the figure above trace Alibaba's own development. Data lakes were used in the beginning. Each business department developed independently, with fast start-up and strong flexibility. However, once the data reached a certain scale, it was unmanaged, and the data logic of each business unit was inconsistent, which made it difficult to share data among departments. We spent a lot of time verifying the data. As the business kept expanding, this loss kept growing, which made a unified data warehouse necessary for our company.
We went through the pain of the moon landing ourselves, and we do not want MaxCompute enterprise clients to go through the same process. Therefore, we built a development platform that integrates data lakes and data warehouses. When the company is small, you can use the data lake capabilities for customized analysis. When the company grows to a certain stage and needs better data management and governance, the lake house platform lets you upgrade seamlessly and manage data and data analysis effectively, so data management becomes more standardized. This is the core idea behind the overall design of the lake house.
We organically combined the data lake system with the data warehouse system. At first, the data lake had no metadata. We extracted metadata from the data lake as we built the data warehouse. This metadata is stored on an integrated metadata analysis platform together with the metadata of the data warehouse. Based on this unified metadata, many data management platforms for data warehouses can be built.
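The idea of keeping lake and warehouse metadata in one place can be illustrated with a minimal sketch. It is not the MaxCompute implementation: the names TableMeta, UnifiedCatalog, and infer_lake_schema are hypothetical, and the schema inference from raw records is an assumption about what extracting metadata from the lake could look like.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TableMeta:
    """Metadata for one table (hypothetical structure for illustration)."""
    name: str
    source: str              # "lake" or "warehouse"
    columns: Dict[str, str]  # column name -> type name


@dataclass
class UnifiedCatalog:
    """One catalog holding metadata from both the lake and the warehouse."""
    tables: Dict[str, TableMeta] = field(default_factory=dict)

    def register(self, meta: TableMeta) -> None:
        self.tables[meta.name] = meta

    def find_by_column(self, column: str) -> List[str]:
        # Management tools can query across both systems through one catalog.
        return sorted(n for n, m in self.tables.items() if column in m.columns)


def infer_lake_schema(rows: List[dict]) -> Dict[str, str]:
    """Derive column -> type metadata from raw lake records that carry no schema."""
    schema: Dict[str, str] = {}
    for row in rows:
        for key, value in row.items():
            schema.setdefault(key, type(value).__name__)
    return schema


catalog = UnifiedCatalog()
lake_rows = [
    {"order_id": 1, "amount": 9.5},
    {"order_id": 2, "amount": 3.0, "coupon": "X"},
]
catalog.register(TableMeta("lake_orders", "lake", infer_lake_schema(lake_rows)))
catalog.register(TableMeta("dw_orders", "warehouse", {"order_id": "int", "gmv": "float"}))

# Tables from either system that share a join key.
shared = catalog.find_by_column("order_id")
```

Because both kinds of table are registered in the same structure, a governance tool built on the catalog does not need to know where the data physically lives.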
In addition, the lake house platform supports many analysis engines. These include task-based computing engines, such as MaxCompute (batch processing), Flink (stream processing), and machine learning. There are also open-source components for data analysis and a service-oriented data engine that supports interactive queries and can display data in real time, so users can build data applications on top of it.
With these engines, we can build rich data management tools that let business units govern their data efficiently and comprehensively. This is possible because data lakes and data warehouses are integrated, which is also the core of the lake house.
As daily life becomes more convenient, clients need to make business decisions faster. We can see this in data analysis needs such as the real-time GMV dashboard during Double 11, the livestreaming dashboard of the Spring Festival Gala, and the shift in machine learning from offline models to online models. These needs have driven the development of real-time data warehouses.
Real-time data warehouses and offline data warehouses have developed in similar ways. In the early days of real-time systems, we focused on the engine, because real-time data cannot be analyzed without one. Alibaba therefore concentrated its research and development on stream computing engines such as Flink. With stream computing engines alone, however, we lacked tools to manage the analysis results. So in the second stage, we used our offline data warehouse products to manage these results and placed them in our overall data warehouse and data management system. Yet results managed by an offline data warehouse are not timely enough for real-time business decisions. Now, we are in the third stage: real-time data warehouses.
We write the analysis results of the streaming engine into the real-time data warehouse Hologres as they are produced, so BI tools can analyze them immediately. This supports clients' real-time business decisions.
This is the design of the integration of offline and online data warehouses.
In summary, before offline and online data warehouses were integrated, analysis was a very complicated process that involved many different offline and online engines. In the architecture shown in the figure above, we use the real-time engine for pre-processing. After pre-processing, we write the data to the MaxCompute offline data warehouse and the Hologres real-time data warehouse at the same time. This makes more real-time, service-oriented BI analysis possible, while the MaxCompute offline data warehouse offers lower storage costs and better throughput for large-scale offline analysis.
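One reading of the flow above, pre-process once and then write each record to both stores, can be sketched as follows. This is a minimal illustration under stated assumptions: InMemorySink is a hypothetical stand-in and does not represent the MaxCompute or Hologres APIs, and the validation rule in preprocess is invented for the example.

```python
from typing import Dict, Iterable, List, Optional

Event = Dict[str, object]


class InMemorySink:
    """Hypothetical stand-in for a warehouse table; not a real product API."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.rows: List[Event] = []

    def write(self, row: Event) -> None:
        self.rows.append(row)


def preprocess(event: Event) -> Optional[Event]:
    """Drop malformed events and normalize the amount to a float."""
    amount = event.get("amount")
    if "order_id" not in event:
        return None
    if not isinstance(amount, (int, float)) or amount < 0:
        return None
    return {"order_id": event["order_id"], "amount": float(amount)}


def run_pipeline(events: Iterable[Event], sinks: List[InMemorySink]) -> int:
    """Pre-process each event once, then write the same row to every sink."""
    written = 0
    for event in events:
        row = preprocess(event)
        if row is None:
            continue
        for sink in sinks:
            sink.write(row)
        written += 1
    return written


offline_store = InMemorySink("offline")    # plays the role of the offline warehouse
realtime_store = InMemorySink("realtime")  # plays the role of the real-time warehouse

events = [
    {"order_id": 1, "amount": 20},
    {"order_id": 2, "amount": "n/a"},  # malformed, dropped in pre-processing
    {"order_id": 3, "amount": 5.5},
]
count = run_pipeline(events, [offline_store, realtime_store])

# A dashboard-style aggregate served from the real-time copy.
realtime_total = sum(row["amount"] for row in realtime_store.rows)
```

Pre-processing in one place means both stores see the same cleaned rows, so real-time dashboards and later batch reports agree on the underlying data.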
An integrated design gives clients a well-balanced system. Clients can use batch processing where the data or business scenario calls for it, and can run offline analysis at lower cost through data compression, cold storage, and tiered storage based on access frequency.
When the real-time value of data matters more, the data can be processed with the stream computing engine. If you also want a better interactive experience and need to explore the generated reports quickly from various dimensions and perspectives, you can use the interactive engine, which offers insights into various dimensions of highly refined data.
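Storage chosen according to access frequency can be sketched with a simple rule. The tier names, the 30-day window, and the thresholds below are all assumptions made for illustration, not settings of any Alibaba Cloud product.

```python
from typing import Dict


def choose_tier(reads_last_30_days: int, hot_min: int = 100, warm_min: int = 10) -> str:
    """Map an access count to a storage tier (thresholds are hypothetical)."""
    if reads_last_30_days >= hot_min:
        return "hot"
    if reads_last_30_days >= warm_min:
        return "warm"
    return "cold"


def plan_tiers(access_counts: Dict[str, int]) -> Dict[str, str]:
    """Assign a tier to every partition based on how often it was read."""
    return {partition: choose_tier(reads) for partition, reads in access_counts.items()}


plan = plan_tiers({
    "ds=2021-11-11": 450,  # frequently queried partition
    "ds=2021-06-01": 12,   # occasionally queried partition
    "ds=2019-01-01": 0,    # archival partition
})
```

Rarely read partitions end up on the cheapest tier, which is where the cost savings for offline analysis come from.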
We hope a good balance can be achieved with the lake house platform according to the business volume, requirements, scale, and costs.
In general, we hope the lake house can support various types of analysis with different analysis engines, whether offline or online. The online service engine can perform BI analysis in real-time with low costs and customized capabilities. We hope to achieve various balances between real-time and online services. Then, clients can choose based on real-world business scenarios.
On top of the unified data warehouse platform, we can build a powerful platform for data governance and analysis. This platform is DataWorks. It offers many data modeling tools and provides data quality and standards management, lineage analysis, and a programming assistant. The integrated online and offline capabilities of the lake house provide a more intelligent way to build a platform for big data development and governance. As a result, we can share more proven and effective data governance experience with our enterprise clients.