Over the last ten years, we have witnessed a dramatic technology shift from traditional databases to diversified, single-purpose, massive-scale big data platforms. In this wave of big data technology, Hadoop and its ecosystem (including MapReduce, HDFS, Spark, Hive, etc.) quickly spread to almost all enterprise data centers, but then gave way to managed Hadoop services on the cloud. Database technologies have also been reinvented to meet big data demands.
Alibaba has built advanced database technologies over the course of its massive "de-IOE" movement (replacing IBM, Oracle, and EMC systems) [1]. The internet giants in the USA went through a very similar process [2] to upgrade their stacks for ever-growing big data requirements.
One of the most mysterious outputs of Alibaba's big data journey is its Data Middle Office. The concept is fairly unique: it exists only in the Chinese technology world, yet it is as popular there as de-IOE. Another concept equally unique to China is Mini Programs in the mobile domain.
In September 2019, at its investor day meeting, Alibaba announced its middle office strategy as three middle offices: Business, Data, and AI. Here I focus only on the data middle office.
The middle office originated from the need for business agility. This is very similar to the middle office concept [3,4] in investment banks. Within Alibaba Group, the term middle office can refer to two different things:
So, what does middle office mean to the big data world? It includes two parts:
People have been dreaming about a single source of truth for a long time. However, it was never really accomplished in any large organization, due to both technical difficulties and the natural political struggle inside the organization.
Many companies can build a complete big data system when they have enough budget, and the data may even be collected in a single place. However, simply because the system is on-premises, the agility needed to keep up with the variety and velocity aspects of a big data business becomes a luxury. Common problems include how to get new APIs, new libraries, new server capacity, and so on. Life is short; let's not waste time on those basic, tedious things.
Alibaba offers mainstream big data tools to support all big data usage scenarios. You can use those tools to build up your own solution:
Leveraging recent hardware advances (RDMA, SSD, etc.), Alibaba has built those products on a state-of-the-art architecture [5]. As a result, Alibaba AnalyticDB recently set a new record on the TPC-DS benchmark, beating the previous record set by Alibaba's own Elastic MapReduce (EMR).
In artificial intelligence, Alibaba also set a new record on the DAWN Deep Learning Benchmark (DAWNBench), running on its own cloud. Beyond this, PAI, Alibaba Cloud's machine learning platform, offers a full set of tools covering a drag-and-drop GUI, notebooks, traditional algorithms (random forest, SVM, etc.), and deep learning frameworks.
Flink and DataV are more specialized tools, each addressing a specific problem: real-time processing for Flink, and large-screen dashboards (like the one shown during 11.11) for DataV.
Although these technologies deliver better performance than those of other cloud vendors, functionally you can still find counterparts at most of them. The overall technology stack does not yet look very different from the rest. So what is the special ingredient that differentiates the data middle office from traditional names like Data Warehouse, Data Lake, and Big Data Platform? The answer is data governance and DataOps.
Data governance is a broad topic that includes data quality, security, lineage, and more. You can find a list of vendors in the Gartner Magic Quadrant 2019. Normally, data governance tools come from third parties rather than from the data platform vendors themselves, whether traditional ones like Teradata and Oracle or, more recently, the cloud vendors.
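To make the lineage part of data governance a little more concrete, here is a minimal sketch in Python. It assumes a hand-written dependency map with made-up table names, and it is not how DataWorks or any vendor tool implements lineage. It only illustrates the idea that each table records which tables it was derived from, so the chain of dependencies can be traced in either direction.

```python
from collections import deque

# Hypothetical lineage map: each table lists the tables it is derived from.
# All table names are invented for illustration.
LINEAGE = {
    "ods_orders": [],
    "ods_users": [],
    "dwd_order_detail": ["ods_orders", "ods_users"],
    "ads_daily_sales": ["dwd_order_detail"],
}


def upstream_tables(table, lineage):
    """Return every table that `table` depends on, directly or indirectly."""
    seen = set()
    queue = deque(lineage.get(table, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(lineage.get(current, []))
    return seen


def downstream_tables(table, lineage):
    """Return every table that would be affected if `table` changed."""
    return {
        name for name in lineage
        if table in upstream_tables(name, lineage)
    }


if __name__ == "__main__":
    print(sorted(upstream_tables("ads_daily_sales", LINEAGE)))
    print(sorted(downstream_tables("ods_users", LINEAGE)))
```

With a map like this, a governance team could check which downstream reports are affected before changing a source table, which is the kind of question that is hard to answer when development and governance live in separate tools.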
When your big data development team works in one environment and your data governance team in another, it simply won't work. Under this kind of governance setup, people tend to believe their systems look like this:
In fact, under the cover, it often looks like this:
Do you see similarities between those cables and your complex data relationships?
Alibaba's answer to this problem is a fully integrated DataOps environment: DataWorks. It offers native tools to deal with typical problems in developing and operating an enterprise big data platform. To name a few:
DataWorks was born out of Alibaba's own daily big data development and operations. It is used by the "Middle Office" organizations inside Alibaba Group. On the cloud, DataWorks enables DataOps by integrating with different big data engines on Alibaba Cloud, including MaxCompute, EMR, etc.
With the DataOps framework and a full spectrum of big data tools, Alibaba Cloud can help customers build a big data platform that supports business innovation through an agile development process. By using this framework, you also adopt Alibaba's internal development best practices and avoid many of the pitfalls Alibaba encountered along its big data journey.