AnalyticDB: Product overview

Last Updated: Sep 20, 2024

What is AnalyticDB?

AnalyticDB is a cloud-native real-time data warehouse service developed in-house by Alibaba Cloud. It ingests data from online transaction processing (OLTP) databases and log files in real time and can analyze petabytes of data within seconds. AnalyticDB uses a cloud-native architecture that decouples storage from computing, which supports pay-as-you-go billing for storage and elastic scaling for computing. Based on resource isolation, it provides both batch processing and real-time analysis to meet enterprise requirements for data processing efficiency, cost control, and system stability. AnalyticDB is compatible with the MySQL, PostgreSQL, and Spark ecosystems.
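Because AnalyticDB for MySQL speaks the MySQL wire protocol, standard MySQL drivers can connect to a cluster without modification. The following minimal sketch uses the pymysql driver; the endpoint, credentials, database, and orders table are hypothetical placeholders, not values from this document.

```python
# Minimal sketch: querying an AnalyticDB for MySQL cluster through a standard
# MySQL client library. Endpoint, credentials, and table are placeholders.
import pymysql

conn = pymysql.connect(
    host="am-xxxxxxxx.ads.aliyuncs.com",  # hypothetical cluster endpoint
    port=3306,
    user="analytics_user",                # placeholder account
    password="your_password",
    database="demo_db",
)

try:
    with conn.cursor() as cur:
        # Any MySQL-compatible client can issue analytic SQL directly.
        cur.execute(
            "SELECT order_date, COUNT(*) AS orders "
            "FROM orders GROUP BY order_date ORDER BY order_date DESC LIMIT 7"
        )
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```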

AnalyticDB provides two engines: AnalyticDB for MySQL and AnalyticDB for PostgreSQL.

The two engines compare as follows.

Ecosystem

  • AnalyticDB for MySQL: highly compatible with both MySQL and Spark.
  • AnalyticDB for PostgreSQL: fully compatible with PostgreSQL and highly compatible with Oracle.

Architecture

  • Both engines: storage-compute decoupled architecture.

Scalability

  • Both engines: vertical scaling and horizontal scaling.
  • AnalyticDB for MySQL: uses a multi-cluster model to automatically scale resources, and a min-max model to automatically scale resources in a scheduled manner.
  • AnalyticDB for PostgreSQL: uses scheduled jobs to change configurations in a scheduled manner, and scales resources on demand in Serverless mode.

Features

  • Both engines: vector search, full-text search, batch processing, and real-time materialized views.
  • AnalyticDB for MySQL: data lake, Spark batch processing, and intelligent diagnostics and optimization of query performance.
  • AnalyticDB for PostgreSQL: Retrieval-Augmented Generation (RAG) service and spatio-temporal data analysis.

Scenarios

  • Both engines: real-time data warehouses, real-time log analysis, and business intelligence (BI) reports.
  • AnalyticDB for MySQL: precision marketing, multi-source joint analysis, big data storage and analysis, accelerated query of offline data, and migration from other data lake or data warehouse services, such as Databricks, Athena, and self-managed Spark or Presto clusters.
  • AnalyticDB for PostgreSQL: one-stop building of large language model (LLM) applications, dedicated enterprise knowledge bases, Geographic Information System (GIS)-based big data analysis, integrated batch processing and real-time analysis, and migration from other data warehouse services, such as Greenplum, Redshift, Synapse, Snowflake, and BigQuery.

Industries

  • AnalyticDB for MySQL: gaming, retail, and automobile.
  • AnalyticDB for PostgreSQL: retail, e-commerce, and education.

Cost-effectiveness

  • Both engines: data storage fees based on actual data volumes; tiered storage of hot and cold data to reduce storage costs; and scheduled auto scaling based on regular traffic fluctuations, which ensures sufficient resources during traffic peaks and prevents idle resources afterward.
  • AnalyticDB for MySQL: auto scaling based on business workloads.
  • AnalyticDB for PostgreSQL: manual instance starting or pausing based on business requirements.

Architecture of AnalyticDB for MySQL

Data Lakehouse Edition

Compared with Data Warehouse Edition, Data Lakehouse Edition provides low-cost batch processing together with high-performance real-time analysis, and significantly improves data processing capabilities across collection, storage, computing, management, and application.

The following figure shows the architecture of Data Lakehouse Edition.

[Figure: Data Lakehouse Edition architecture]

Data source

AnalyticDB Pipeline Service (APS) provides low-cost access to data sources such as databases, logs, and big data platforms.

Storage layer and compute layer

Data Lakehouse Edition provides two in-house engines: the XIHE compute engine and the XUANWU storage engine. It also supports the open source Spark compute engine and Hudi storage engine. This makes Data Lakehouse Edition suitable for a variety of data analysis scenarios, and the in-house and open source engines can access each other's data for centralized data management.
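As a brief illustration of the open source side, the following PySpark sketch reads a Hudi table stored in OSS. The bucket path, table name, and Hudi bundle version are illustrative assumptions, not values from this document.

```python
# Conceptual PySpark sketch: reading a Hudi table with the open source Spark
# engine. The OSS path and the Hudi bundle version are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-read-demo")
    # The Hudi Spark bundle must be on the classpath; version is illustrative.
    .config("spark.jars.packages",
            "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.14.0")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Load a Hudi table from a placeholder OSS path and run a simple query.
df = spark.read.format("hudi").load("oss://my-bucket/warehouse/orders_hudi")
df.createOrReplaceTempView("orders_hudi")
spark.sql(
    "SELECT COUNT(*) FROM orders_hudi WHERE order_date = '2024-09-01'"
).show()
```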

  • Storage layer: One copy of full data can be used for both batch processing and real-time analysis.

    In batch processing scenarios, data needs to be stored on low-cost storage media to reduce costs. In real-time analysis scenarios, data needs to be stored on fast storage media to improve performance. To meet the requirements of batch processing, Data Lakehouse Edition stores one copy of full data on low-cost, high-throughput storage media, which reduces data storage and I/O costs while ensuring high throughput. To meet the requirement of real-time analysis within 100 milliseconds, Data Lakehouse Edition stores real-time data on separate elastic I/O units (EIUs), which meets the timeliness requirements of row data queries, full indexing, and cache acceleration.

  • Compute layer: The system automatically selects an appropriate computing mode for the XIHE compute engine, while the open source Spark compute engine covers more complex scenarios.

    The XIHE compute engine provides two computing modes: massively parallel processing (MPP) and bulk synchronous parallel (BSP). The MPP mode uses stream computing, which delivers low latency but is not suitable for low-cost, high-throughput batch processing. The BSP mode divides the directed acyclic graph (DAG) of a job into stages and computes each stage in turn, so that large amounts of data can be processed with limited resources and intermediate results can be spilled to disk. If the MPP mode fails to complete a job within a specific period of time, the XIHE compute engine automatically switches to the BSP mode, as illustrated in the conceptual sketch after this list.

    The open source Spark compute engine is suitable for more complex batch processing and machine learning scenarios. The compute layer and storage layer are separated but interconnected, which allows you to easily create and configure Spark resource groups.
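The MPP-to-BSP fallback described above can be pictured as a deadline-based retry. The following Python sketch is purely conceptual (it is not AnalyticDB code): the deadline value and the two stand-in functions are assumptions used only to illustrate the switch.

```python
# Conceptual sketch only (not AnalyticDB source code): deadline-based fallback
# from the latency-oriented MPP mode to the throughput-oriented BSP mode.
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

MPP_DEADLINE_SECONDS = 0.5  # illustrative threshold, not a documented default

def run_mpp(query: str) -> str:
    # Stand-in for the stream-computing path; here it simulates an attempt
    # that is too slow for the deadline.
    time.sleep(2.0)
    return f"MPP result for {query!r}"

def run_bsp(query: str) -> str:
    # Stand-in for the staged DAG path, which spills intermediate data to
    # disk and trades latency for throughput.
    return f"BSP result for {query!r}"

def execute(query: str) -> str:
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(run_mpp, query)
    try:
        return future.result(timeout=MPP_DEADLINE_SECONDS)
    except TimeoutError:
        # The MPP attempt missed the deadline; rerun the query in BSP mode.
        return run_bsp(query)
    finally:
        pool.shutdown(wait=False)

print(execute("SELECT COUNT(*) FROM big_table"))  # prints the BSP result
```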

Access layer

The access layer leverages unified billing units, metadata and permissions, development languages, and transmission links to improve development efficiency.

Data Warehouse Edition

The following figure shows the architecture of Data Warehouse Edition.

[Figure: Data Warehouse Edition architecture]

Access layer

The access layer consists of linearly scalable coordinator nodes. It handles protocol-layer access, SQL parsing and optimization, real-time sharding of written data, data scheduling, and query scheduling.

Compute engine

The compute engine integrates distributed massively parallel processing (MPP) with directed acyclic graph (DAG) capabilities and leverages an intelligent optimizer to support highly concurrent and complex SQL queries. The cloud-native infrastructure allows compute nodes to be scaled within seconds, which improves resource utilization.

Storage engine

The storage engine supports real-time data writes with strong consistency and high availability based on the Raft consensus protocol. It uses data sharding and Multi-Raft to process data in parallel, tiered storage of hot and cold data to reduce costs, and hybrid row-column storage with intelligent indexing to deliver high performance.
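As a hedged example of tiered hot and cold storage, the following sketch creates a table whose recent partitions stay on hot storage while older partitions age to cold storage. The endpoint and credentials are placeholders, and the exact DDL options (for example, STORAGE_POLICY and HOT_PARTITION_COUNT) and their syntax should be verified against the DDL reference for your edition.

```python
# Illustrative sketch: creating a table with tiered hot/cold storage on an
# AnalyticDB for MySQL cluster. Endpoint, credentials, and the exact DDL
# option syntax are assumptions to be checked against the DDL reference.
import pymysql

DDL = """
CREATE TABLE orders (
    order_id   BIGINT NOT NULL,
    order_date DATE NOT NULL,
    amount     DECIMAL(18, 2)
)
DISTRIBUTED BY HASH(order_id)
PARTITION BY VALUE(DATE_FORMAT(order_date, '%Y%m%d')) LIFECYCLE 365
STORAGE_POLICY = 'MIXED'   -- mix hot and cold storage within one table
HOT_PARTITION_COUNT = 7    -- illustrative: last 7 daily partitions stay hot
"""

conn = pymysql.connect(host="am-xxxxxxxx.ads.aliyuncs.com", port=3306,
                       user="analytics_user", password="your_password",
                       database="demo_db")
try:
    with conn.cursor() as cur:
        cur.execute(DDL)
finally:
    conn.close()
```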

Data Warehouse Edition uses a three-layer architecture to support failover within seconds and implements cross-zone deployment, automatic fault detection, and replica deletion and re-creation. It stores data in three replicas and supports full and incremental backups, which meets the data reliability requirements of the finance industry. Data Warehouse Edition also provides tools to migrate, synchronize, manage, integrate, and protect your data, so that you can focus on business development.

Architecture of AnalyticDB for PostgreSQL

[Figure: AnalyticDB for PostgreSQL architecture]

AnalyticDB for PostgreSQL is available in elastic storage mode and Serverless mode. The elastic storage mode uses a shared-nothing architecture based on Elastic Compute Service (ECS) and Enhanced SSDs (ESSDs) and provides MPP capabilities. The Serverless mode uses a shared-storage architecture based on ECS, local cache, and Object Storage Service (OSS), and decouples storage from computing.

An AnalyticDB for PostgreSQL instance consists of a coordinator node and multiple compute nodes. The coordinator node is responsible for metadata management and load balancing. The compute nodes are responsible for data processing. The compute nodes integrate the Orca optimizer and the self-developed Laser execution engine and Beam storage engine to implement high-performance queries. The compute nodes also use incremental materialized views (IMVs) to build real-time materialized views. AnalyticDB for PostgreSQL stores hot data on ESSDs attached to the compute nodes and cold data in OSS. The tiered storage of hot and cold data helps improve query performance and reduce storage costs. You can separately scale the computing and storage resources of the compute nodes.
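Because AnalyticDB for PostgreSQL is fully compatible with PostgreSQL, standard drivers such as psycopg2 connect without modification. The following sketch creates a real-time (incremental) materialized view of the kind described above; the endpoint, credentials, and orders table are placeholders, and the CREATE INCREMENTAL MATERIALIZED VIEW syntax is an assumption based on the feature described here, so verify it against the SQL reference for your instance.

```python
# Illustrative sketch: connecting to an AnalyticDB for PostgreSQL instance
# with psycopg2 and creating an incremental materialized view. Endpoint,
# credentials, table, and the IMV DDL syntax are assumptions.
import psycopg2

conn = psycopg2.connect(
    host="gp-xxxxxxxx.gpdb.rds.aliyuncs.com",  # hypothetical instance endpoint
    port=5432,
    user="analytics_user",
    password="your_password",
    dbname="demo_db",
)
conn.autocommit = True

with conn.cursor() as cur:
    # An incremental materialized view keeps the aggregate fresh as base-table
    # rows arrive, instead of requiring a full REFRESH.
    cur.execute("""
        CREATE INCREMENTAL MATERIALIZED VIEW daily_sales AS
        SELECT order_date, SUM(amount) AS total_amount
        FROM orders
        GROUP BY order_date
    """)
    cur.execute("SELECT * FROM daily_sales ORDER BY order_date DESC LIMIT 7")
    print(cur.fetchall())

conn.close()
```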
