What is AnalyticDB?
AnalyticDB is a cloud-native, real-time data warehouse service developed in-house by Alibaba Cloud. It ingests data from online transaction processing (OLTP) databases and log files in real time and can analyze petabytes of data within seconds. AnalyticDB uses a cloud-native architecture that decouples storage from computing, supporting pay-as-you-go billing for storage and elastic scaling for computing. It provides batch processing and real-time analysis based on resource isolation to meet enterprise requirements for data processing efficiency, cost control, and system stability. AnalyticDB is compatible with the MySQL, PostgreSQL, and Spark ecosystems.
AnalyticDB provides two engines: AnalyticDB for MySQL and AnalyticDB for PostgreSQL.
| Item | | AnalyticDB for MySQL | AnalyticDB for PostgreSQL |
| --- | --- | --- | --- |
| Ecosystem | | Highly compatible with MySQL<br>Highly compatible with Spark | Fully compatible with PostgreSQL<br>Highly compatible with Oracle |
| Architecture | | Storage-compute decoupled architecture | Storage-compute decoupled architecture |
| Scalability | Similarities | Vertical scaling<br>Horizontal scaling | Vertical scaling<br>Horizontal scaling |
| | Differences | Uses a multi-cluster scaling model to automatically scale resources<br>Uses a min-max model to automatically scale resources in a scheduled manner | Uses scheduled jobs to change configurations in a scheduled manner<br>Scales resources on demand in Serverless mode |
| Features | Similarities | Vector search<br>Full-text search<br>Batch processing<br>Real-time materialized views | Vector search<br>Full-text search<br>Batch processing<br>Real-time materialized views |
| | Differences | Data lakes<br>Spark batch processing<br>Intelligent diagnostics and optimization of query performance | Retrieval-Augmented Generation (RAG) service<br>Spatio-temporal data analysis |
| Scenarios | Similarities | Real-time data warehouses<br>Real-time log analysis<br>Business intelligence (BI) reports | Real-time data warehouses<br>Real-time log analysis<br>Business intelligence (BI) reports |
| | Differences | Precision marketing<br>Multi-source joint analysis<br>Big data storage and analysis<br>Accelerated query of offline data<br>Migration from other data lake or data warehouse services, such as Databricks, Athena, and self-managed Spark or Presto clusters | End-to-end building of large language model (LLM) applications<br>Dedicated enterprise knowledge bases<br>Geographic information system (GIS)-based big data analysis<br>Integrated batch processing and real-time analysis<br>Migration from other data warehouse services, such as Greenplum, Redshift, Synapse, Snowflake, and BigQuery |
| Industries | | Gaming, retail, and automobile | Retail, e-commerce, and education |
| Cost-effectiveness | Similarities | Data storage fees based on actual data volumes<br>Tiered storage of hot and cold data to reduce storage costs<br>Scheduled scaling based on regular traffic fluctuations to ensure sufficient resources during traffic spikes and prevent idle resources afterward | Data storage fees based on actual data volumes<br>Tiered storage of hot and cold data to reduce storage costs<br>Scheduled scaling based on regular traffic fluctuations to ensure sufficient resources during traffic spikes and prevent idle resources afterward |
| | Differences | Auto scaling based on business workloads | Manual instance starting or pausing based on business requirements |
Introduction to AnalyticDB for MySQL
Data source
AnalyticDB Pipeline Service (APS) is provided to implement low-cost access to data sources, such as databases, logs, and big data platforms.
Storage layer and compute layer
Data Lakehouse Edition provides two in-house engines: the XIHE compute engine and the XUANWU storage engine. Data Lakehouse Edition also supports the open source Spark compute engine and Hudi storage engine. Data Lakehouse Edition is suitable for a variety of data analysis scenarios and supports access between the in-house and open source engines to implement centralized data management.
Storage layer: One copy of full data can be used for both batch processing and real-time analysis.
In batch processing scenarios, data needs to be stored on low-cost storage media to reduce costs. In real-time analysis scenarios, data needs to be stored on fast storage media to improve performance. To meet the requirements of batch processing, Data Lakehouse Edition stores one copy of full data on low-cost, high-throughput storage media. This reduces data storage and I/O costs while ensuring high throughput. To meet the requirements of real-time analysis within 100 milliseconds, Data Lakehouse Edition stores real-time data on dedicated elastic I/O units (EIUs). This meets the timeliness requirements for row data queries, full indexing, and cache acceleration.
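The tiered read path described above can be sketched as follows. This is an illustrative model only, not AnalyticDB's actual implementation: the `TieredStore` class, its tier names, and the promotion policy are all assumptions made for the example.

```python
# Illustrative sketch (assumed design, not AnalyticDB's actual code):
# a tiered store that keeps one full copy of data on a cheap tier
# (modeling low-cost, high-throughput media for batch processing) and
# serves real-time reads from a fast tier (modeling EIU-backed storage).

class TieredStore:
    def __init__(self):
        self.hot = {}    # fast tier: row data, indexes, cache
        self.cold = {}   # cheap, high-throughput tier: full data copy

    def write(self, key, value, realtime=False):
        # The full data copy always lands in the cold tier; real-time
        # writes are additionally kept hot so point queries stay fast.
        self.cold[key] = value
        if realtime:
            self.hot[key] = value

    def read(self, key):
        # Serve from the hot tier when possible; otherwise fall back to
        # the cold tier and promote the row for cache acceleration.
        if key in self.hot:
            return self.hot[key], "hot"
        value = self.cold[key]
        self.hot[key] = value
        return value, "cold"

store = TieredStore()
store.write("order:1", {"amount": 42}, realtime=True)
store.write("order:2", {"amount": 7})
print(store.read("order:1"))  # served from the hot tier
print(store.read("order:2"))  # cold read, then promoted
```

The design choice modeled here is that batch jobs can always scan the cold tier's full copy, while real-time queries hit the hot tier without touching it.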
Compute layer: The system automatically selects an appropriate computing mode for the XIHE compute engine. The open source Spark compute engine is suitable for various scenarios.
The XIHE compute engine provides two computing modes: massively parallel processing (MPP) and bulk synchronous parallel (BSP). The MPP mode uses stream computing, which is not suitable for low-cost, high-throughput batch processing scenarios. The BSP mode divides a job into tasks along a directed acyclic graph (DAG) and computes data task by task. This way, large amounts of data can be processed with limited resources, and intermediate data can be spilled to disk. If the MPP mode fails to process data within a specific period of time, the XIHE compute engine automatically switches to the BSP mode.
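The MPP-to-BSP fallback described above can be sketched in a few lines. This is assumed logic for illustration only, not the XIHE engine's actual implementation; the time budget, batch size, and function names are invented for the example.

```python
# Illustrative sketch (assumed logic, not the XIHE engine's code):
# try a streaming MPP-style run first; if it exceeds its budget,
# fall back to BSP-style execution that processes bounded batches,
# so intermediate results could be spilled to disk between stages.

def run_mpp(rows, budget_rows):
    # Model streaming execution: the whole input must fit one pass
    # within the budget, otherwise the attempt is abandoned.
    if len(rows) > budget_rows:
        raise TimeoutError("MPP attempt exceeded its budget")
    return [r * 2 for r in rows]

def run_bsp(rows, batch_size):
    # Model staged execution: process the data batch by batch with
    # bounded resources per stage.
    out = []
    for i in range(0, len(rows), batch_size):
        out.extend(r * 2 for r in rows[i:i + batch_size])
    return out

def execute(rows, budget_rows=1000, batch_size=100):
    try:
        return run_mpp(rows, budget_rows), "mpp"
    except TimeoutError:
        return run_bsp(rows, batch_size), "bsp"

print(execute(list(range(10)))[1])    # small input stays in MPP mode
print(execute(list(range(5000)))[1])  # large input falls back to BSP
```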
The open source Spark compute engine is suitable for more complex batch processing and machine learning scenarios. The compute layer and storage layer are separated but interconnected, which allows you to easily create and configure Spark resource groups.
Access layer
The access layer provides unified billing, metadata and permission management, development languages, and data transmission links to improve development efficiency.
For more information about AnalyticDB for MySQL editions, see Editions.
Introduction to AnalyticDB for PostgreSQL
AnalyticDB for PostgreSQL is available in elastic storage mode and Serverless mode. The elastic storage mode uses a shared-nothing architecture based on Elastic Compute Service (ECS) and enhanced SSDs (ESSDs) and provides MPP capabilities. The Serverless mode uses a shared-storage architecture based on ECS, local cache, and Object Storage Service (OSS) and provides decoupled storage and computing capabilities.
An AnalyticDB for PostgreSQL instance consists of a coordinator node and multiple compute nodes. The coordinator node is responsible for metadata management and load balancing. The compute nodes are responsible for data processing. The compute nodes integrate the Orca optimizer and the self-developed Laser execution engine and Beam storage engine to implement high-performance queries. The compute nodes also use incremental materialized views (IMVs) to build real-time materialized views. AnalyticDB for PostgreSQL stores hot data on ESSDs attached to the compute nodes and cold data in OSS. The tiered storage of hot and cold data helps improve query performance and reduce storage costs. You can separately scale the computing and storage resources of the compute nodes.
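The coordinator-node role described above can be sketched as hash-based routing of rows to compute nodes. This is an assumed model for illustration, not AnalyticDB for PostgreSQL's actual implementation; the `Coordinator` class and the use of MD5 as the hash function are inventions of the example.

```python
import hashlib

# Illustrative sketch (assumed behavior, not AnalyticDB for
# PostgreSQL's code): a coordinator that routes each row to a compute
# node by hashing its distribution key, so data processing is spread
# across the compute nodes while the coordinator keeps only metadata.

class Coordinator:
    def __init__(self, num_compute_nodes):
        # One row buffer per compute node, standing in for the data
        # each node is responsible for processing.
        self.nodes = [[] for _ in range(num_compute_nodes)]

    def route(self, distribution_key):
        # Deterministic hash: the same key always maps to the same node.
        digest = hashlib.md5(str(distribution_key).encode()).hexdigest()
        return int(digest, 16) % len(self.nodes)

    def insert(self, distribution_key, row):
        self.nodes[self.route(distribution_key)].append(row)

coord = Coordinator(num_compute_nodes=4)
for i in range(1000):
    coord.insert(i, {"id": i})

# The same key always routes to the same node, and rows end up
# distributed across all compute nodes.
assert coord.route(42) == coord.route(42)
print([len(n) for n in coord.nodes])
```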