By Chaoqun Zan, Researcher at Alibaba Cloud; Julian Zhou, Staff Product Manager at Alibaba Cloud
It is no surprise that analytics is all about data. For enterprises, discovering the value in data and turning it into business value is the core logic for analytics development.
Data-driven analytics products have a long history at Alibaba Group. They play a vital role in ensuring stable business operations and in helping Alibaba Group expand its business seamlessly. In this blog, we discuss the data analytics capabilities of Alibaba Group in detail and explore how it has successfully evolved its ecosystem to the cloud.
First, let's take a look at some interesting data products at Alibaba Group.
Launched in 2012, Taobao Time Machine was the first "to C" (consumer-facing) internet big data application in China. It helps Taobao better understand its customers. When a customer logs in, Taobao Time Machine shows the customer's preferences, behavior, and characteristics across multiple dimensions as a profile on a timeline. It helped the brand build a personal connection with customers and impressed many Taobao users in 2012. This customer-oriented data product has become a role model for new product releases and has inspired a whole host of similar products.
Alibaba Index, originally called "Taobao Index", is a market-oriented application that dates back to 2011. Its purpose is to identify trends on Taobao and Alibaba's e-commerce platform across multiple dimensions, including market information, subscribers, and categories.
More important is how analytics has empowered Alibaba's business as a data-driven engine. Data is growing exponentially, and data scientists and data analysts run algorithms every day to produce profiles and tags for every single object, including sellers, buyers, products, and orders.
This is where two key principles in analytics play an important role. The first is that tag matching should be efficient, so that we can quickly find similar objects or group them together. The second is the gradual weakening of tags: the leading tag should carry the most weight in the similarity, and each subsequent tag should carry less.
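The two principles can be illustrated with a minimal sketch. It assumes that each object carries an ordered list of unique tags (leading tag first) and that tag weights decay geometrically with rank. The function names, the decay factor of 0.5, and the weighted Jaccard measure are illustrative assumptions, not the method actually used inside Alibaba.

```python
def tag_weights(tags, decay=0.5):
    """Map each tag to a weight that shrinks with its rank (leading tag = 1.0)."""
    return {tag: decay ** rank for rank, tag in enumerate(tags)}


def weighted_similarity(tags_a, tags_b, decay=0.5):
    """Weighted Jaccard similarity between two ranked tag lists, in [0, 1]."""
    wa = tag_weights(tags_a, decay)
    wb = tag_weights(tags_b, decay)
    all_tags = set(wa) | set(wb)
    if not all_tags:
        return 0.0
    # Shared weight: a tag only counts as much as its weaker occurrence.
    overlap = sum(min(wa.get(t, 0.0), wb.get(t, 0.0)) for t in all_tags)
    # Total weight: the stronger occurrence of every tag seen on either side.
    total = sum(max(wa.get(t, 0.0), wb.get(t, 0.0)) for t in all_tags)
    return overlap / total
```

With this weighting, two objects that share their leading tag score higher than two objects that share only a secondary tag, which is the intent behind the gradual weakening of tags.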
These two principles are fundamental to the precision marketing and advertising that we have today. Internally, we use a logic called "LOOKALIKE" modeling, which finds a target crowd based on its similarity to an existing crowd. To achieve this, we set up a relational model, apply algorithms and train models on the data, and use SQL to analyze the data online and find insights.
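As a rough illustration of the lookalike idea expressed in SQL, the sketch below scores candidate users by how strongly their weighted tags overlap with the aggregated tags of a seed crowd. It uses an in-memory SQLite database only as a stand-in for an analytical database. The table names (`user_tags`, `seed`), the column layout, and the scoring formula are all assumptions made for this example, not the production model.

```python
import sqlite3


def find_lookalikes(rows, seed_ids, top_n=2):
    """Return the top_n (user_id, score) pairs most similar to the seed crowd.

    rows: iterable of (user_id, tag, weight) tuples.
    seed_ids: user ids that form the existing (seed) crowd.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE user_tags (user_id TEXT, tag TEXT, weight REAL)")
    conn.execute("CREATE TABLE seed (user_id TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO user_tags VALUES (?, ?, ?)", rows)
    conn.executemany("INSERT INTO seed VALUES (?)", [(s,) for s in seed_ids])
    query = """
        WITH seed_profile AS (
            -- Aggregate the seed crowd into one weighted tag profile.
            SELECT t.tag, SUM(t.weight) AS weight
            FROM user_tags t
            JOIN seed s ON s.user_id = t.user_id
            GROUP BY t.tag
        )
        -- Score every non-seed user by weighted overlap with that profile.
        SELECT t.user_id, SUM(t.weight * p.weight) AS score
        FROM user_tags t
        JOIN seed_profile p ON p.tag = t.tag
        WHERE t.user_id NOT IN (SELECT user_id FROM seed)
        GROUP BY t.user_id
        ORDER BY score DESC, t.user_id
        LIMIT ?
    """
    result = conn.execute(query, (top_n,)).fetchall()
    conn.close()
    return result
```

Keeping the scoring in a single SQL statement mirrors the point made above: once the relational model is in place, the target crowd can be found with an online query rather than a separate batch job.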
Based on this analytics scheme, Damopan and Alimama, Alibaba's precision marketing and advertising platforms, provide data-driven analytics services and products. These products not only support Alibaba Group itself but also help merchants in the Alibaba e-commerce ecosystem achieve high ROI every day.
If we look at the years of data and business development at Alibaba, we see that data and business are closely correlated. At the very beginning, business generates data; then comes the phase where data drives business growth; and finally data itself becomes a business that makes a profit.
From an operational perspective, the data business is about how to store, connect, and use data more efficiently. First, we move data to the cloud. Data on the cloud has many benefits, such as centralized storage with unified metadata, and the cloud infrastructure provides large-scale computing capabilities on top of the data. Many data consumers use cloud data as data assets. Then the data itself becomes a business that drives precision marketing, FinTech, and smart logistics.
With more openness, data and the analytics on top of it became data services. UMENG is one such data product today. This was also when our online analytics products and services, such as AnalyticDB, went to the public cloud.
To make the system and service architecture more agile, the business and Data Mid-End, also known as the Data Middle Platform, was born as a middle layer sitting between the business data application layers and the analytics platform. Data Mid-End drives business development and innovation quickly through two key factors: business and data. It has also driven more data intelligence, such as the Intelligence Brain described in the following section.
The image above illustrates some of the capabilities of the Intelligence Brain solution. This solution has been deployed in scenarios such as city transportation, city operation, and energy control and operation. At the backend, data and analytics empower the intelligence, making city operation more efficient. This solution has helped Alibaba and our customers tackle real-world problems in a visual manner with the power of artificial intelligence.
Another application of data intelligence is the Data Mid-End (also known as Data Middle Platform) as mentioned previously. Data Mid-End has supported the Alibaba economy, including customer analytics and marketing, helping us find and refine target audiences through data. The "LOOKALIKE" logic introduced before is used to map target audiences with the right media channels and assets.
This solution is made possible with the support of various data and analytics platforms and technologies by Alibaba Cloud, including QuickBI, Quick Audience, Dataphin, AnalyticDB, and OSS.
Let's look at the development of online analytics from a technical perspective. Since Alibaba was established in 1999, business and data consumption have grown rapidly, which drove the need for strong data analytics capabilities. Prior to 2008, Oracle RAC was used, essentially in an SMP (symmetric multi-processing) architecture with multiple cores in a single machine. For several years this handled the analytical workload well in terms of real-time performance, consistency, agility, and accuracy. (Here we define agility as the latency of multi-dimensional data analysis.) However, it was an "all-in-one" system without horizontal scale-out capability.
As Alibaba's business grew quickly, the system's limitations showed up under fast-growing query workloads with high concurrency and large data volumes.
Then in 2009, Greenplum, with its MPP architecture, was introduced to solve the problem, especially for data volume: it could handle petabytes of data while preserving agility and accuracy. But there were still some limitations, in particular regarding high concurrency for mixed query workloads and high availability, because the leader node in Greenplum is a single point of failure. Furthermore, data ingestion performance was not good enough for real-time write scenarios.
In 2011, open source big data projects were becoming increasingly popular, and many enterprises started adopting them for large-scale analytical processing. Alibaba also started to adopt HBase and Hadoop for petabyte-level batch data processing, and a sharding architecture on top of MySQL database instances with SSD storage replaced Oracle for the OLTP (online transactional processing) workload. This was a typical architecture that decoupled online from batch data processing, and transactional from analytical processing, to handle fast data growth.
But consistency between batch and online data, as well as agility, remained a headache. Dealing with these two problems usually required additional architecture design and implementation. Batch loading and streaming were therefore introduced for data synchronization, while pre-computing and additional storage for data cubes were introduced for agility. The system grew more and more complex, with a steep learning curve for new engineers joining the organization.
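To make the trade-off behind pre-computed data cubes concrete, here is a minimal sketch that pre-aggregates one measure for every combination of dimensions, so that later multi-dimensional queries become simple lookups. The function names and the row format are hypothetical, and this is a toy illustration of the general technique rather than the system described above.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Pre-aggregate `measure` for every subset of `dimensions`.

    rows: list of dicts, each holding the dimension values and the measure.
    Returns a dict keyed by a frozenset of (dimension, value) pairs.
    """
    cube = defaultdict(float)
    for row in rows:
        # Every subset of dimensions is one grouping (2 ** n groupings in total).
        for size in range(len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                key = frozenset((d, row[d]) for d in dims)
                cube[key] += row[measure]
    return dict(cube)


def query_cube(cube, **filters):
    """Answer an aggregate query with a single lookup instead of a scan."""
    return cube.get(frozenset(filters.items()), 0.0)
```

The lookup is fast, but the number of stored groupings doubles with each added dimension, which is one reason the extra storage and pipeline complexity became a burden.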
Then in 2013, AnalyticDB version 1.0 was born, solving some of those pain points, including volume, high concurrency, agility, low latency, accuracy, and high availability. At the beginning, however, batch and online consistency was still hard to achieve, because real-time data ingestion and highly concurrent analytical queries were not yet supported within a single analytical system. AnalyticDB also did not yet support the database ACID properties, which are important for an "all-in-one" analytical system architecture.
Even with these issues, thanks to its powerful online interactive query capability over big data, AnalyticDB has been the online analytics infrastructure for digital transformation in most business units of Alibaba Group since 2013.
From 2013 to 2019, Alibaba's core businesses went through a journey of migrating to a cloud-based infrastructure. AnalyticDB MySQL version 3.0, AnalyticDB PostgreSQL 6.0, and Data Lake Analytics are some of the key online analytics products that evolved during this period. These products carry the mission of helping customers grow their businesses with the online analytics capabilities developed within Alibaba Group over the past few years.
The cloud is a very open platform for the data ecosystem. Cloud native Data Lake Analytics (DLA) can connect to and process data from different data sources, including data in OLTP databases, ubiquitous log data, massive data in the Hadoop big data system, and heterogeneous data in the OSS data lake, both structured and semi-structured. DLA integrates the open source Presto and Spark engines to achieve rich multi-source upstream and downstream connection capabilities and multi-mode computing capabilities.
Cloud native data warehouse AnalyticDB focuses on two aspects,
Thousands of customers have benefited from the online analytics cloud products since they became available on Alibaba Cloud. Let's take a look at some application schemes in different scenarios.
A customer that provides a short video marketing service used Data Transmission Service (DTS) to synchronize data from a transactional database in RDS MySQL to AnalyticDB MySQL in real time. This enables the customer to support online analytics scenarios, including hot video aggregation analysis, multi-dimensional BI reporting, and statistical analysis of hosts and followers.
Another customer, which provides an internet financial service, uses AnalyticDB PostgreSQL to support analytics scenarios including real-time queries that serve its clients and BI reporting. The data ingestion channels coming from the upstream business system include,
Another customer, which provides a social media app, uses AnalyticDB MySQL to support analytics scenarios including BI reporting with Tableau and data-analytics-driven modules within its app. The data ingestion channels coming from the upstream business system include,
Data Lake Analytics has also been used by a customer that provides data services to its advertising clients. The customer leverages DLA as the centralized data processing engine in the lifecycle of advertising optimization as shown below.
And finally, one of our customers, which provides a gaming service, has built a gaming DAP (Data Analytics Platform) that supports all analytical workloads for its gaming business. All data sinks into two types of cloud storage, OSS as the data lake and AnalyticDB MySQL as the data warehouse,
Some application modules in the ECS game server write data into AnalyticDB MySQL in real time via a JDBC connection:
Online analytics systems and technology are evolving fast on the cloud in terms of cloud native, serverless, HTAP, intelligence, online/batch integration, and database/big data integration. The journey from data-driven business, data as business, and data technology to data intelligence on the cloud has helped Alibaba Group's businesses evolve and keep up with the demands of customers. It is clear that the online analytics system is, and will be, the core engine driving business growth into the future.