What is AnalyticDB?
AnalyticDB is a cloud-native real-time data warehouse service developed in-house by Alibaba Cloud. AnalyticDB allows you to write data from online transaction processing (OLTP) databases and log files in real time and analyze petabytes of data within seconds. AnalyticDB uses a cloud-native storage-compute decoupled architecture that supports the pay-as-you-go billing method for storage and the elastic scaling feature for computing. AnalyticDB provides batch processing and real-time analysis based on resource isolation to meet enterprise requirements for data processing efficiency, cost control, and system stability. AnalyticDB is compatible with the MySQL, PostgreSQL, and Spark ecosystems.
AnalyticDB provides two engines: AnalyticDB for MySQL and AnalyticDB for PostgreSQL.
| Item | Comparison | AnalyticDB for MySQL | AnalyticDB for PostgreSQL |
| --- | --- | --- | --- |
| Ecosystem | Differences | Highly compatible with MySQL<br>Highly compatible with Spark | Fully compatible with PostgreSQL<br>Highly compatible with Oracle |
| Architecture | Similarities | Storage-compute decoupled architecture | Storage-compute decoupled architecture |
| Scalability | Similarities | Vertical scaling<br>Horizontal scaling | Vertical scaling<br>Horizontal scaling |
| Scalability | Differences | Uses a multi-cluster model to automatically scale resources<br>Uses a min-max model to automatically scale resources in a scheduled manner | Uses scheduled jobs to change configurations in a scheduled manner<br>Scales resources on demand in Serverless mode |
| Features | Similarities | Vector search<br>Full-text search<br>Batch processing<br>Real-time materialized views | Vector search<br>Full-text search<br>Batch processing<br>Real-time materialized views |
| Features | Differences | Data lake<br>Spark batch processing<br>Intelligent diagnostics and optimization of query performance | Retrieval-Augmented Generation (RAG) service<br>Spatio-temporal data analysis |
| Scenarios | Similarities | Real-time data warehouses<br>Real-time log analysis<br>Business intelligence (BI) reports | Real-time data warehouses<br>Real-time log analysis<br>Business intelligence (BI) reports |
| Scenarios | Differences | Precision marketing<br>Multi-source joint analysis<br>Big data storage and analysis<br>Accelerated queries of offline data<br>Data migration from other data lake or data warehouse services, such as Databricks, Athena, and self-managed Spark or Presto clusters | One-stop building of large language model (LLM) applications<br>Dedicated enterprise knowledge bases<br>Geographic Information System (GIS)-based big data analysis<br>Integrated batch processing and real-time analysis<br>Data migration from other data warehouse services, such as Greenplum, Redshift, Synapse, Snowflake, and BigQuery |
| Industries | Differences | Gaming, retail, and automobile | Retail, e-commerce, and education |
| Cost-effectiveness | Similarities | Data storage fees based on actual data volumes<br>Tiered storage of hot and cold data to reduce storage costs<br>Scheduled auto scaling based on regular traffic fluctuations to ensure sufficient resources during peaks and prevent idle resources afterward | Data storage fees based on actual data volumes<br>Tiered storage of hot and cold data to reduce storage costs<br>Scheduled auto scaling based on regular traffic fluctuations to ensure sufficient resources during peaks and prevent idle resources afterward |
| Cost-effectiveness | Differences | Auto scaling based on business workloads | Manual instance starting or pausing based on business requirements |
Architecture of AnalyticDB for MySQL
Data Lakehouse Edition
Compared with Data Warehouse Edition, Data Lakehouse Edition provides low-cost batch processing and high-performance real-time analysis. Data Lakehouse Edition significantly improves data collection, storage, computing, management, and application capabilities.
The following figure shows the architecture of Data Lakehouse Edition.
Data source
AnalyticDB Pipeline Service (APS) is provided to implement low-cost access to data sources, such as databases, logs, and big data platforms.
Storage layer and compute layer
Data Lakehouse Edition provides two in-house engines: the XIHE compute engine and the XUANWU storage engine. Data Lakehouse Edition also supports the open source Spark compute engine and Hudi storage engine. Data Lakehouse Edition is suitable for a variety of data analysis scenarios and supports access between the in-house and open source engines to implement centralized data management.
Storage layer: One copy of full data can be used for both batch processing and real-time analysis.
In batch processing scenarios, data must be stored on low-cost storage media to reduce costs. In real-time analysis scenarios, data must be stored on fast storage media to improve performance. To meet batch processing requirements, Data Lakehouse Edition stores one copy of full data on low-cost, high-throughput storage media. This reduces data storage and I/O costs while ensuring high throughput. To meet the requirement of real-time analysis within 100 milliseconds, Data Lakehouse Edition stores real-time data on individual elastic I/O units (EIUs). This meets the timeliness requirements for row-oriented queries, full indexing, and cache acceleration.
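The routing decision behind tiered storage can be illustrated with a minimal sketch. This is not AnalyticDB's implementation; the names and the 7-day retention policy are hypothetical, chosen only to show the idea of sending recent data to fast media and older data to cheap media:

```python
from datetime import datetime, timedelta

# Hypothetical retention policy for illustration: the last 7 days stay hot.
HOT_RETENTION = timedelta(days=7)

def choose_tier(row_time: datetime, now: datetime) -> str:
    """Route a row to the hot tier (fast elastic I/O units) or the cold tier
    (low-cost, high-throughput storage) based on its age."""
    return "hot" if now - row_time <= HOT_RETENTION else "cold"

now = datetime(2024, 1, 10)
print(choose_tier(datetime(2024, 1, 9), now))   # -> hot: recent data, fast media
print(choose_tier(datetime(2023, 11, 1), now))  # -> cold: older data, cheap media
```

In a real system, the policy would typically be driven by table-level lifecycle settings rather than a fixed constant.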
Compute layer: The system automatically selects an appropriate computing mode for the XIHE compute engine. The open source Spark compute engine is suitable for various scenarios.
The XIHE compute engine provides two computing modes: massively parallel processing (MPP) and bulk synchronous parallel (BSP). The MPP mode uses stream computing, which is not suitable for low-cost, high-throughput batch processing scenarios. The BSP mode divides a job into tasks within a directed acyclic graph (DAG) and computes each task separately. This way, large amounts of data can be processed with limited resources, and intermediate results can be spilled to disk. If the MPP mode fails to process data within a specific period of time, the XIHE compute engine can automatically switch to the BSP mode.
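The MPP-to-BSP fallback can be sketched as follows. This is an illustrative toy, not XIHE's actual logic: the memory budget, batch size, and function names are all made up, and merging in-memory partials stands in for spilling intermediate stage results to disk:

```python
def mpp_aggregate(rows, memory_limit):
    """MPP-style streaming: keep all aggregation state in memory; fail when the
    working set exceeds the (tiny, illustrative) memory budget."""
    state = {}
    for key, value in rows:
        state[key] = state.get(key, 0) + value
        if len(state) > memory_limit:
            raise MemoryError("working set exceeds the MPP memory budget")
    return state

def bsp_aggregate(rows, batch_size):
    """BSP-style staged execution: process bounded batches and merge partial
    results, standing in for spilling intermediate data between DAG stages."""
    partials = []
    for i in range(0, len(rows), batch_size):
        partial = {}
        for key, value in rows[i:i + batch_size]:
            partial[key] = partial.get(key, 0) + value
        partials.append(partial)
    merged = {}
    for partial in partials:
        for key, value in partial.items():
            merged[key] = merged.get(key, 0) + value
    return merged

def aggregate_with_fallback(rows, memory_limit=2, batch_size=2):
    """Try the fast streaming mode first; fall back to staged execution."""
    try:
        return mpp_aggregate(rows, memory_limit)
    except MemoryError:
        return bsp_aggregate(rows, batch_size)
```

Both paths produce the same result; they differ only in how much data must be resident at once, which is the trade-off the two XIHE modes make.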
The open source Spark compute engine is suitable for more complex batch processing and machine learning scenarios. The compute layer and storage layer are separated but interconnected, which allows you to easily create and configure Spark resource groups.
Access layer
The access layer leverages unified billing units, metadata and permissions, development languages, and transmission links to improve development efficiency.
Data Warehouse Edition
The following figure shows the architecture of Data Warehouse Edition.
Access layer
The access layer consists of linearly scalable coordinator nodes that handle protocol-layer access, SQL parsing and optimization, real-time sharding of written data, data scheduling, and query scheduling.
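Real-time sharding of written data typically hashes a row's distribution key to pick a shard. The sketch below is a generic illustration of that idea, not AnalyticDB's routing code; the shard count and function name are hypothetical:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count for illustration

def shard_for(distribution_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Deterministically map a row's distribution key to a storage shard,
    the way a coordinator node might route written data."""
    digest = hashlib.sha256(distribution_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards
```

Because the mapping is deterministic, all rows with the same distribution key land on the same shard, which keeps co-located joins and aggregations local.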
Compute engine
The compute engine integrates distributed massively parallel processing (MPP) and directed acyclic graph (DAG) capabilities and leverages an intelligent optimizer to support high-concurrency and complex SQL queries. The cloud-native infrastructure allows compute nodes to be scaled within seconds, ensuring that resources are efficiently utilized.
Storage engine
The storage engine supports real-time data writes with strong consistency and high availability based on the Raft consensus protocol. It uses data sharding and Multi-Raft for parallel processing, tiered storage of hot and cold data to reduce costs, and hybrid row-column storage and intelligent indexing to maximize performance.
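The Multi-Raft idea, one consensus group per data shard, can be sketched with a toy model. This is not AnalyticDB's implementation; it only illustrates the Raft commit rule that a write is durable once a majority of a shard's replicas acknowledge it:

```python
class ShardReplicaGroup:
    """Toy model of one Raft group per data shard (Multi-Raft): a write to a
    shard commits once a majority of that shard's replicas acknowledge it."""

    def __init__(self, replicas: int = 3):
        self.replicas = replicas   # e.g. three-replica storage
        self.log = []              # committed entries for this shard

    def write(self, entry, acks: int) -> bool:
        # Raft commit rule: majority quorum (2 of 3 for three replicas).
        if acks >= self.replicas // 2 + 1:
            self.log.append(entry)
            return True
        return False
```

Running many independent groups like this, one per shard, is what lets writes to different shards proceed in parallel while each shard keeps strong consistency.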
Data Warehouse Edition uses a three-layer architecture that supports failover within seconds and implements cross-zone deployment, automatic fault detection, and replica deletion and re-creation. Data Warehouse Edition supports three-replica data storage and full and incremental backups, meeting the data reliability requirements of the finance industry. It also provides tools to migrate, synchronize, manage, integrate, and protect your data, which allows you to focus on business development.
Architecture of AnalyticDB for PostgreSQL
AnalyticDB for PostgreSQL is available in elastic storage mode and Serverless mode. The elastic storage mode uses a shared-nothing architecture based on Elastic Compute Service (ECS) and Enterprise SSDs (ESSDs) and provides MPP capabilities. The Serverless mode uses a shared-storage architecture based on ECS, local cache, and Object Storage Service (OSS) and provides decoupled storage and computing capabilities.
An AnalyticDB for PostgreSQL instance consists of a coordinator node and multiple compute nodes. The coordinator node is responsible for metadata management and load balancing. The compute nodes are responsible for data processing. The compute nodes integrate the Orca optimizer and the self-developed Laser execution engine and Beam storage engine to implement high-performance queries. The compute nodes also use incremental materialized views (IMVs) to build real-time materialized views. AnalyticDB for PostgreSQL stores hot data on ESSDs attached to the compute nodes and cold data in OSS. The tiered storage of hot and cold data helps improve query performance and reduce storage costs. You can separately scale the computing and storage resources of the compute nodes.
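The incremental-materialized-view idea can be shown with a minimal sketch. This is not the IMV implementation in AnalyticDB for PostgreSQL; the class and group names are hypothetical. The point is that each base-table insert is applied as a delta to pre-aggregated state instead of recomputing the view:

```python
class IncrementalMaterializedView:
    """Toy incremental materialized view: maintains SUM and COUNT per group so
    that AVG can be served without rescanning the base table."""

    def __init__(self):
        self.sums = {}
        self.counts = {}

    def on_insert(self, group, value):
        # Apply the new row as a delta to the materialized aggregates.
        self.sums[group] = self.sums.get(group, 0) + value
        self.counts[group] = self.counts.get(group, 0) + 1

    def avg(self, group):
        return self.sums[group] / self.counts[group]

imv = IncrementalMaterializedView()
imv.on_insert("cn-hangzhou", 10)
imv.on_insert("cn-hangzhou", 20)
print(imv.avg("cn-hangzhou"))  # -> 15.0, served from the maintained state
```

Deletes and updates would apply negative deltas in the same way, which is why aggregates such as SUM, COUNT, and AVG are natural fits for incremental maintenance.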