"We build our data middle platform because we hope the accumulated data can continuously generate value. The big data product components provided by Alibaba Cloud have given small and medium-sized teams the opportunity to support big data business," said Chen Xiaoliang, Head of Hupan Network's Big Data Platform.
This article explains how Hupan Network (Wanli Niu) uses Alibaba Cloud's big data components to build a data middle platform step by step based on users' pain points and demands and the situation of the company. It focuses on sharing the experience of building a real-time data warehouse based on Hologres through two specific application examples.
Hupan Network was established in 2011 and started its business in Hupan Garden. It is the earliest SaaS ERP service provider in the industry. It has R&D centers in Hangzhou and Shanghai and serves customers in more than 400 cities in China. Wanli Niu is the brand under which the company offers its products to users.
We primarily serve corporate clients engaged in domestic e-commerce, cross-border e-commerce, and brick-and-mortar retail. Wanli Niu is committed to helping clients achieve an all-in-one connection with global e-commerce platforms and helping merchants sell products all over the world. Wanli Niu provides clients with an all-in-one product matrix consisting of core products, such as ERP, cross-border ERP, and a warehouse management system (WMS). The matrix is integrated with business intelligence (BI), new retail, ordering systems, and other products. Our data middle platform fully supports the closed loop of our products, and BI is also offered as an independent data product.
With ten years of experience in the industry, Wanli Niu has connected with more than 200 e-commerce platforms and supported Double 11 steadily for nine consecutive years. Wanli Niu has provided customers with stable and trustworthy products and services, gaining the trust of more than 300,000 merchant users.
While serving our clients, Wanli Niu found several data pain points in their businesses. First, business data is not analyzed to assist business operations. Second, manually compiled statistics make it difficult to ensure accuracy and timeliness. Third, even when data analysis is used to improve a workflow, its effect cannot be evaluated quickly.
Based on these pain points, we have concluded that users' demand for data is to make the accumulated data continuously generate value and form a circular driving force conducive to business development. In other words, big data services are expected to do more and to be more powerful. This is also the driving force behind the construction of the data middle platform.
As shown in the figure above, the data middle platform currently plays a key role in integrating the various types of internal and external data of Wanli Niu. After data modeling and analysis, the data assists business operations through data services and applications.
The first problem we faced was technology selection. The Wanli Niu Big Data Team had only five members when it built the first version of the data middle platform. We knew that we had to focus on data modeling and business development, and the big data components of Alibaba Cloud could help Wanli Niu build a big data platform. In addition, other teams in our company had successful experience using Alibaba Cloud products, so we chose Alibaba Cloud.
The data source was Alibaba Cloud ApsaraDB RDS. Data synchronization used the data integration and pull features of DataWorks. The offline data warehouse, built through dimensional modeling, was stored in MaxCompute, which also served as the computing engine for offline data. DataWorks was responsible for editing and scheduling data warehouse tasks and monitoring data quality. We synchronized some of the ADS tables to Hologres internal tables, so Hologres served as a cache for the ADS layer and also accelerated queries on MaxCompute external tables. Our data asset management system was a metric management system similar to OneData. At the top layer were data services and applications, mainly interactive data analysis, CRM, and decision data support for business systems. At that time, the data middle platform carried 50 terabytes of data, over 2,000 tables, and more than 1,000 scheduled tasks.
After the data middle platform was completed, the BI product service system was also set up. Similar to Quick BI, Wanli Niu built a series of analysis data domains that worked together with the data asset management system and visualization to provide interactive data analysis with a high degree of freedom. Wanli Niu also provided themed dashboards for client staff in different roles.
After the first version of the data middle platform was completed, it achieved data integration among various applications within the company, eliminated data silos, and built a basic offline data warehouse to realize interactive data analysis capabilities without code. We also received feedback from our clients, mainly focusing on two aspects. First, they hoped to get more time-sensitive data to meet the needs of analysis and monitoring. Second, some clients hoped we would reduce the learning costs because they had no operation experience.
Real-time data warehouse capability was added in the second version of the data middle platform to solve the timeliness issue. In the process, we mainly focused on federated queries and the operating cost of the real-time warehouse. After research, we found that Hologres, an all-in-one real-time data warehouse, could achieve the goal without increasing the complexity of the technical architecture. At the same time, its cost was only about half that of Flink. Therefore, Hologres was finally selected.
The following figure shows the technical architecture of the second version of the data middle platform, which has not changed much from the first version. For data synchronization, a Binlog channel is added to transmit data to Hologres in real time to build the real-time data warehouse. Hologres also takes on the roles of federated query engine and real-time data engine. The following figure shows the role of Hologres as an all-in-one real-time data warehouse:
Currently, the data middle platform of the second version carries 60 terabytes of data with nearly 3,000 tables and over 1,700 scheduled tasks.
The second version provides clients with real-time and quasi-real-time data analysis capabilities. At the same time, it has started providing the data query service of business systems in some scenarios and using the capabilities of Hologres to open up data access permission at the ODS layer. However, there are some imperfections. For example, the current real-time data warehouse lacks stream computing capability since it does not use Flink.
Why did we choose Hologres? During our research, we found that Hologres has several features that suit our businesses. The most important one is its real-time capability. As an all-in-one real-time data warehouse, it meets the current requirements of the Wanli Niu real-time data warehouse. It supports both OLTP and OLAP data queries, and queries on internal tables can be completed within one second. It has low maintenance costs, and it can be used both for offline acceleration and as a real-time data warehouse. In addition, its storage cost as a real-time data warehouse is about one-third that of ApsaraDB RDS. It has high resource flexibility, and its configuration can be adjusted flexibly, like MaxCompute. It is also highly compatible with Alibaba Cloud's big data components, which benefits the technical architecture.
On the OLAP side, Hologres is responsible for real-time and offline data queries and query acceleration for analysis systems, and it serves as the real-time and quasi-real-time computing engine. On the TP side, it provides Wanli Niu with a low-QPS, low-concurrency hybrid query service.
Real-time data is parsed from the Binlog into internal tables in the Hologres ODS layer. At the same time, a micro-batch task computes hourly statistics (the range is adjustable) and stores them in an hourly table in the ADS layer. Then, together with the T+1 offline data, a view of the full real-time data can be built. The data for H-0 and H-1 comes from real-time computing, and the data from H-2 to H-n comes from the quasi-real-time results computed by the micro-batch task. The data for T-2 and later comes from offline data in Hologres external tables. Currently, Hologres is applied to scenarios such as monitoring the running status of ERP job orders and counting the order quantity and order amount of a store on certain platforms.
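The layering described above can be sketched as a small routing function. This is only an illustration under stated assumptions: the function name `pick_tier` and the tier labels are hypothetical, and "T-2 and later" is read here as "two or more calendar days before today". It does not call any Hologres API; it only shows which layer would serve a given hour.

```python
from datetime import datetime, timedelta


def pick_tier(data_hour: datetime, now: datetime) -> str:
    """Return the layer assumed to serve statistics for the hour of `data_hour`.

    Hypothetical labels:
      "realtime"    - H-0 and H-1, computed directly from real-time data
      "micro_batch" - H-2 .. H-n, hourly results written by the micro-batch task
      "offline"     - two or more calendar days back, read from external tables
    """
    # Truncate both timestamps to the start of their hour.
    current_hour = now.replace(minute=0, second=0, microsecond=0)
    bucket = data_hour.replace(minute=0, second=0, microsecond=0)
    if bucket > current_hour:
        raise ValueError("data_hour is in the future")

    hours_back = (current_hour - bucket) // timedelta(hours=1)
    days_back = (current_hour.date() - bucket.date()).days

    if hours_back <= 1:
        return "realtime"
    if days_back < 2:
        return "micro_batch"
    return "offline"


if __name__ == "__main__":
    now = datetime(2021, 11, 11, 10, 30)
    for ts in (datetime(2021, 11, 11, 9, 15),
               datetime(2021, 11, 10, 23, 0),
               datetime(2021, 11, 9, 8, 0)):
        print(ts, "->", pick_tier(ts, now))
```

A view over the three layers would then combine the rows each tier is responsible for, so that a query sees one continuous time series.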
The federated query capability is obtained by combining this view structure with Hologres. Currently, for Wanli Niu, queries on external table data take less than ten seconds, and queries that involve only internal tables take less than one second. Without stream computing, however, the current dimension tables are synchronized to Hologres from the offline warehouse, so their timeliness is relatively poor. In addition, real-time and quasi-real-time statistics are computed directly on the ODS layer, so the computing efficiency is not high.
In the future, we will add Flink stream computing and build the DWD and DIM layers of the real-time data warehouse. Hologres uses LSM trees to support fine-grained real-time updates. In addition, internal tables support primary keys, which can ensure accuracy and consistency. We can use Hologres internal tables as the source for Flink stream computing and then use Hologres internal tables as the sink to store the computing results. The DWD and DIM layers will be produced by Flink stream computing.
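To illustrate why a primary key helps with accuracy and consistency when change events are written into a table, here is a minimal in-memory sketch. The event shape `(op, primary_key, row)` and the function `apply_events` are assumptions made for this example; they do not represent the Binlog format, Hologres, or Flink APIs.

```python
from typing import Dict, Iterable, Optional, Tuple

Row = Dict[str, object]
# Hypothetical change event: (operation, primary key, row or None for deletes)
Event = Tuple[str, int, Optional[Row]]


def apply_events(table: Dict[int, Row], events: Iterable[Event]) -> Dict[int, Row]:
    """Apply change events to a table keyed by primary key and return it."""
    for op, pk, row in events:
        if op in ("insert", "update"):
            if row is None:
                raise ValueError(f"{op} event for key {pk} has no row")
            # Upsert: the latest row for a key replaces any earlier one.
            table[pk] = dict(row)
        elif op == "delete":
            # Deleting a key that is already gone is a no-op.
            table.pop(pk, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return table


if __name__ == "__main__":
    events = [
        ("insert", 1, {"qty": 5}),
        ("update", 1, {"qty": 7}),
        ("insert", 2, {"qty": 1}),
        ("delete", 2, None),
    ]
    print(apply_events({}, events))
```

Because every write is keyed, replaying the same batch of events leaves the table unchanged, which is the property a keyed sink relies on when a stream job retries.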
The low-QPS, low-concurrency query service over mixed hot and cold data serves applications such as the Wanli Niu inventory details query. The data consists mainly of large volumes of logs, and the application performs almost only read operations. Putting this data in ApsaraDB RDS or Elasticsearch would be wasteful. Wanli Niu synchronizes this data to the data warehouse in real time and uses Hologres and MaxCompute to layer hot and cold data. This ensures the query efficiency of hot data and saves costs, and one set of data can be used by several applications.
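One way to picture hot and cold layering is to split a requested date range at a retention boundary, so that recent days are read from the hot layer and older days from the cold layer. The sketch below is an assumption-laden illustration: the 30-day window `HOT_DAYS`, the function `split_range`, and the labels "hot" and "cold" are hypothetical and are not taken from the Wanli Niu implementation.

```python
from datetime import date, timedelta
from typing import List, Tuple

HOT_DAYS = 30  # hypothetical number of most recent days kept in the hot layer


def split_range(start: date, end: date, today: date,
                hot_days: int = HOT_DAYS) -> List[Tuple[str, date, date]]:
    """Split the inclusive range [start, end] into cold and hot sub-ranges."""
    if start > end:
        raise ValueError("start must not be after end")

    # First day that is still held in the hot layer.
    hot_start = today - timedelta(days=hot_days - 1)

    parts: List[Tuple[str, date, date]] = []
    if start < hot_start:
        # Older part of the range: read from the cold layer.
        parts.append(("cold", start, min(end, hot_start - timedelta(days=1))))
    if end >= hot_start:
        # Recent part of the range: read from the hot layer.
        parts.append(("hot", max(start, hot_start), end))
    return parts


if __name__ == "__main__":
    today = date(2021, 12, 31)
    for part in split_range(date(2021, 11, 20), date(2021, 12, 10), today):
        print(part)
```

A query that falls entirely inside the hot window touches only the fast layer, which is where most inventory detail lookups would be expected to land.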
While using Hologres, we also found that it cannot meet our needs in some respects. For example, the StreamX synchronization task cannot flexibly configure multiple data sources. Hologres cannot isolate resources the way MaxCompute does, and users must control the number of connections themselves. It also requires a recent version of the Java API, and its cost is high.