Continuous Definition of SaaS Cloud-Based Data Warehouses and Real-Time Analysis

This article thoroughly explains data warehouses and SaaS cloud-based data warehouses.

By Kong Liang (Lianyi)

1. An Overview of Cloud-Based Data Warehouses

A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data that supports management decision-making. It collects all of an enterprise's data and provides a centralized, standard data outlet for every department.

A data warehouse (model) is a methodology of best practices for manual data collection, storage, understanding, organization, management, and decision-making. The model is not affected by where it is used or what technology it uses. However, the logical model and the physical model are integrated closely into the final solution. Therefore, we need the business and technical capabilities provided by data warehouses.

The core features and values of data warehouses include collection, synchronization, processing, storage, modeling, governance, and query. To realize these capabilities and their value, a data warehouse must be deployed, activated, and routinely maintained in an Internet Data Center (IDC). It must also be highly available, secure, and scalable. Together, these requirements make up the total cost of ownership (TCO) of the data warehouse. TCO can be broken down in three equivalent ways: TCO = Core capability costs + Basic costs = Product costs + Service costs = Current costs + Long-term costs + Evolution costs.

MaxCompute is an enterprise-grade data warehousing service based on the Software-as-a-Service (SaaS) model. SaaS cloud-based data warehouses provide the following characteristics:

  1. Out-of-the-Box (OOTB)
  2. Large-Scale and High-Performance
  3. O&M-Free With Expert Optimization
  4. Flexible Expansion
  5. Data Services
  6. Rich and Complete Data Warehouse Capabilities
  7. High Availability and Disaster Recovery
  8. Excellent Security
  9. Low Cost
  10. Rapid Evolution of Capabilities

SaaS cloud-based data warehouses free enterprises from investing in non-core capabilities, such as infrastructure construction, maintenance, and long-term evolution.

Possible Scenarios of SaaS Cloud-Based Data Warehouses:

  • Real-time data warehousing, analysis, and decision-making
  • Business Operation Scenarios: Interactive business metric computing and query
  • Data Warehouse Construction in Each Industry: Batch and streaming integration with data lake and data warehouse integration
  • Off-premises auto scaling for big data computing and storage

Benefits of SaaS Cloud-Based Data Warehouses:

  • Excellent Cloud-Native Elasticity: Cloud-native design and serverless architecture support auto scaling in seconds and meet the requirements of large-scale elastic workloads.
  • Easy-to-Use With Multifunctional Computing: Multiple preset computing models and tunnel capabilities are ready to use.
  • Enterprise-Grade Platform Services: These services support open ecosystems, provide enterprise-grade security management capabilities, and integrate seamlessly with many Alibaba Cloud data services.
  • Security: Security control is effective in multitenancy environments.
  • High Performance and Stability: Large-scale clusters deliver high performance and comprehensive stability, which have been verified in Alibaba's Double 11 scenarios.

Recommended SaaS Cloud-Based Data Warehouse Scenarios and Product Combination:

  • Real-Time Analysis Scenarios: MaxCompute + MC-Hologres + Flink + DataWorks + Quick BI
  • Machine Learning Scenarios: MaxCompute + Machine Learning Platform for Artificial Intelligence + DataWorks

Here, we will focus on real-time analysis scenarios.

The following figure shows the user-oriented features and data flows of cloud-based data warehouses. After you activate the MaxCompute cloud-based data warehouse service, you can use all these features.

[Figure: user-oriented features and data flows of cloud-based data warehouses]

2. Real-Time Analysis Scenarios and Value

The 5 V's of Big Data

  1. Volume refers to the massive and continuously increasing amount of data. Currently, data volumes in big data scenarios are generally over 10 TB. However, the size threshold for what counts as big data will change as technology advances.
  2. Velocity refers to how fast data is generated and flows. It covers the speed at which valuable information is collected, stored, and analyzed, so data collection and analysis must also be fast.
  3. Variety indicates that big data includes many different formats and types of data. Data can be generated automatically through human-machine interaction. The variety of data sources leads to a diversity of data types. Data can be divided into structured data, unstructured data, and semi-structured data based on data models, structures, and relationships.
  4. Veracity refers to data quality and fidelity. Data with a high signal-to-noise ratio (SNR) is preferred in big data environments.
  5. Value refers to low value density: as data increases, the meaningful information in it does not increase proportionally. Value also depends on data veracity and data processing time, as shown in the following figure:

The closer analysis happens to the data source, and to the moment the data is generated, the better it supports decision-making and the more value the data yields.

[Figure: data value in relation to data veracity and data processing time]

The following two analogies are used to describe the evolution of real-time analysis scenarios:

Analogy 1: A grand hotel also has a wide range of other businesses, such as providing real-time food services, to leverage the advantages of collaboration.

Evolution 1: In a data warehouse analysis scenario, perform real-time analysis based on real-time business requirements to implement real-time channels and interactive analytics, forming a Lambda architecture.

Analogy 2: The hotel expands from real-time food services, requiring more external support and moving toward comprehensive development.

Evolution 2: In a real-time analysis scenario, create a stream architecture to extract data from data warehouses, and play it back with data sources, forming a Kappa architecture. Then, you must consider how to implement real-time data and model warehousing.

The two evolution scenarios are analyzed in detail below:

In the data warehouse analysis scenario, you can analyze data in real-time based on real-time business requirements to implement real-time channels and real-time interactive analytics, forming a Lambda architecture. For example, for Internet of Things (IoT) device monitoring and analysis, after policies are delivered to a device, data is reported and immediately analyzed. Then, you can compare the previous results for repeated analysis and optimization.
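
To make the Lambda pattern more concrete, here is a minimal Python sketch. It is not MaxCompute or Hologres code. The IoT readings and the names `batch_view`, `SpeedView`, and `query` are hypothetical. A batch layer recomputes a view from warehoused data, a speed layer updates a view as new readings arrive, and a query merges the two.

```python
from collections import defaultdict

# Hypothetical IoT readings: (device_id, temperature). All names are illustrative.
warehoused = [("dev-1", 20.0), ("dev-1", 22.0), ("dev-2", 30.0)]  # already in the warehouse
fresh = [("dev-1", 27.0), ("dev-2", 31.0)]                        # just reported, not yet batched


def batch_view(readings):
    """Batch layer: recompute (sum, count) per device from all warehoused data."""
    view = defaultdict(lambda: [0.0, 0])
    for device, temp in readings:
        view[device][0] += temp
        view[device][1] += 1
    return view


class SpeedView:
    """Speed layer: update (sum, count) per device as each reading arrives."""

    def __init__(self):
        self.view = defaultdict(lambda: [0.0, 0])

    def ingest(self, device, temp):
        self.view[device][0] += temp
        self.view[device][1] += 1


def query(device, batch, speed):
    """Serving layer: merge both views to get an up-to-date average."""
    total = batch[device][0] + speed.view[device][0]
    count = batch[device][1] + speed.view[device][1]
    return total / count if count else None


batch = batch_view(warehoused)
speed = SpeedView()
for device, temp in fresh:
    speed.ingest(device, temp)

avg_dev1 = query("dev-1", batch, speed)  # (20 + 22 + 27) / 3
```

Note that the aggregation logic appears twice, once in `batch_view` and once in `SpeedView.ingest`. Keeping such pairs in sync is the maintenance burden usually attributed to Lambda architectures.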

In the real-time analysis scenario, you can create a stream architecture to extract data from data warehouses, and play it back with data sources, forming a Kappa architecture. Then, you must consider how to implement real-time data and model warehousing. For example, for fraud monitoring, obtain the analysis result in a timely manner and associate tags for accurate identification. Finally, store the real-time data in the data warehouse to generate knowledge through incorporation with other data.

The main capability requirements of real-time analysis are listed below:

1.  Application Ecosystem:

  • Developer Ecosystem
  • Rich API Operations and SDKs
  • Seamless Integration With BI tools
  • Seamless Integration With Streaming Data Processing Tools and Distributed Message Queues

2.  Rapid Query Response:

  • Millisecond-level response speed to support complex multi-dimensional analysis of massive data
  • Tens of Millions of Queries Per Second (QPS) for Point Queries
  • Thousands of QPS for Simple Queries

3.  Real-Time Storage:

  • Hundreds of Millions of Transactions Per Second (TPS)
  • Real-Time Queries Upon Writing

4.  Data Warehouse Query Acceleration:

  • Direct Analysis
  • No Data Migration
  • No Redundant Storage
  • Unified Permissions

5.  Joint Computing:

  • Unified Modeling Methods
  • Unified Metadata
  • Unified Management and Control System
  • Evolution and Integration in the Hierarchical Domain Architecture

3. MaxCompute Cloud-Based Data Warehouses and Real-Time Analysis

The common Lambda architecture has three major problems:

1.  Inconsistency:

  • Two sets of code and logic
  • Different stream and batch semantics
  • Different data storage and transformation methods at the offline layer and real-time layer

2.  Interlocking Systems With Complex O&M and High Costs:

  • Multiple Different Systems
  • Many Synchronization Tasks
  • High Resource Consumption
  • Different Standards for Different Systems

3.  Long Development Cycle and Cumbersome Businesses:

  • Hard to diagnose and locate errors
  • Long Revision and Replenishment Cycle
  • Unable to implement self-service real-time analysis
  • Unable to promptly respond to changes
  • Slow transformation from analysis to services
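
The inconsistency problem can be illustrated with a small Python sketch. The events, field layout, and both functions are hypothetical and heavily simplified. The offline job groups by event time, while the real-time job counts each event on the day it arrives, so a single late event makes the two layers disagree on the same metric.

```python
from collections import Counter

# Hypothetical events: (event_day, arrival_day). The third one arrives a day late.
events = [
    ("2021-01-01", "2021-01-01"),
    ("2021-01-01", "2021-01-01"),
    ("2021-01-01", "2021-01-02"),  # late arrival
    ("2021-01-02", "2021-01-02"),
]


def batch_daily_counts(all_events):
    """Offline layer: runs after all data has landed and groups by event time."""
    return Counter(event_day for event_day, _ in all_events)


def stream_daily_counts(all_events):
    """Real-time layer (simplified): counts each event on the day it arrives."""
    counts = Counter()
    for _, arrival_day in all_events:
        counts[arrival_day] += 1
    return counts


batch = batch_daily_counts(events)
stream = stream_daily_counts(events)

# Days on which the two layers report different numbers for the same metric.
mismatched_days = sorted(day for day in batch if batch[day] != stream[day])
```

Real stream processors offer event-time windows and late-data handling to narrow this gap, but the two code paths still have to be kept semantically aligned by hand.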

In refined-operations scenarios such as search and recommendation, open-source solutions spread the required capabilities across separate systems. The following figure shows KVStore, Massively Parallel Processing (MPP), real-time data warehouses, and data warehouses, each of which supports only some of these capabilities. We recommend using one technical solution that integrates them into a single engine. For example, storage, real-time data warehousing, interactive analytics, point queries, and online analytical processing (OLAP) can be integrated. MaxCompute Hologres is such a solution.

[Figure: capabilities spread across KVStore, MPP, real-time data warehouses, and data warehouses]

MaxCompute Hologres makes the real-time analysis architecture simple and efficient. Hologres focuses on real-time analysis and supports real-time data writing, analysis, and queries. With a cloud-native hybrid serving/analytical processing (HSAP) architecture, MaxCompute Hologres lets the same data serve real-time analysis and online services, and unifies real-time and offline storage. It also integrates closely with MaxCompute.

In another scenario, MaxCompute Hologres can be used as an analysis and acceleration capability module and an Application Data Service (ADS) modeling capability module for MaxCompute. No data is migrated, and data analysis is highly efficient. At the ADS layer, modeling and services are integrated, and the OLAP capability is enhanced, as shown in the following figure:

[Figure: MaxCompute Hologres as the analysis acceleration and ADS modeling module for MaxCompute]

The Kappa architecture builds on a stream architecture. It requires playing back data from the data warehouse and associating it with the stream. You also need to consider how to implement real-time data and model warehousing. Open-source real-time data warehouses tend to have high costs for real-time processing, long development cycles, and inflexible service support.

The Kappa architecture is optimized based on the Lambda architecture by combining real-time analysis and streaming and replacing data storage and channels with message queues. Therefore, the Kappa architecture still focuses on stream processing. However, data is stored and modeled at the data lake layer and will be played back in message queues for offline analysis or re-computing. The Kappa architecture seems simple but is difficult to implement, especially for data playback.
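
The playback idea can be sketched as follows. This is a minimal in-memory Python illustration, not a message queue client. `EventLog`, `replay`, `count_large_payments`, and the payment amounts are hypothetical. One stream function serves both live processing and re-computation, and changing the logic means replaying the retained log from the start.

```python
class EventLog:
    """Stand-in for a message queue topic that retains every event in order."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the appended event

    def replay(self, from_offset=0):
        """Yield retained events starting at the given offset."""
        yield from self._events[from_offset:]


def count_large_payments(events, threshold):
    """The single stream job: count payments above a threshold per user."""
    counts = {}
    for user, amount in events:
        if amount > threshold:
            counts[user] = counts.get(user, 0) + 1
    return counts


log = EventLog()
for event in [("alice", 50), ("bob", 900), ("alice", 1200), ("bob", 300)]:
    log.append(event)

# Version 1 of the logic, applied to the stream.
v1 = count_large_payments(log.replay(), threshold=1000)

# The rule changes: instead of running a separate batch job, replay the same
# log through the same code with the new parameter.
v2 = count_large_payments(log.replay(), threshold=250)
```

In this toy version, replay is trivial because the whole log fits in memory. With a real message queue, retention limits, ordering, and the time needed to reprocess a long history are what make playback hard.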

MaxCompute Hologres integrates real-time, offline, analysis, and service capabilities. This allows it to support joint real-time and offline analysis and provides insight into cold, hot, and warm data, as shown in the following figure:

[Figure: joint real-time and offline analysis across cold, warm, and hot data]

4. Real-Time Analysis Cases

In common real-time analysis scenarios, MaxCompute provides a solution that integrates real-time, offline, analysis, and service capabilities by using Hologres. These capabilities have been mentioned in the preceding section, including Lambda architecture simplification, interactive query enhancement, Kappa architecture enhancement, joint real-time and offline analysis, and full insight into cold, hot, and warm data.

This solution applies to data-driven operations in Internet industries, such as e-commerce, gaming, and social networking, including but not limited to intelligent recommendations, log collection and analysis, user profiling, data governance, business dashboards, and search.

VivaVideo is a short video community app for original videos with a wide range of editing features. It provides short video editing tools, including filming, editing, and tutorials. It ranks among the top five (by income) in the Google Play store and serves more than 890 million users worldwide.

  • Tag Data Development: The customer uses MaxCompute to calculate the basic attribute data, behavior log data, and content data generated by the app each day and update tag data offline daily. This feature can be used for marketing businesses.
  • Real-Time Insights for User Profiling: The customer uses MaxCompute-Hologres to perform multi-dimensional real-time analysis on user tags that have been calculated offline based on MaxCompute. This allows the customer to understand the association between users' attribute and content tags, analyze cross-sales opportunities, and push app messages to selected users.
  • Real-Time Video Recommendation: The customer built a personalized real-time recommendation system using Flink, MaxCompute, MaxCompute-Hologres, and Machine Learning Platform for Artificial Intelligence. This helped the customer recommend personalized short videos to users in real-time based on user characteristics and real-time behavior characteristics.
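
As a rough illustration of the user-profiling step in this case, the following Python sketch selects an audience by combining tags. The tag names, user IDs, and the `select_users` helper are hypothetical. In the case described above, the tags are calculated offline in MaxCompute and analyzed interactively, not held in Python sets.

```python
# Hypothetical tag table: tag name -> set of user IDs carrying that tag.
user_tags = {
    "region:EU": {1, 2, 3, 5},
    "likes:travel_videos": {2, 3, 4},
    "subscriber": {3},
}


def select_users(tags, include, exclude=()):
    """Return users that carry every tag in `include` and none in `exclude`.

    `include` must contain at least one tag name.
    """
    selected = set.intersection(*(tags[name] for name in include))
    for name in exclude:
        selected -= tags[name]
    return selected


# Cross-sell audience: EU users who like travel videos but are not yet subscribers.
audience = select_users(
    user_tags,
    include=["region:EU", "likes:travel_videos"],
    exclude=["subscriber"],
)
```

The resulting `audience` set is the kind of user list that could then receive a targeted app message.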
