AnalyticDB:Overall architecture

Last Updated:Jul 24, 2024

AnalyticDB for MySQL is a real-time data warehouse service developed in-house by Alibaba Cloud. It can process petabytes of data and has been proven in ultra-large-scale core business workloads.

Overview

AnalyticDB for MySQL was first released within Alibaba Group in 2012. Since then, it has gone through nearly 100 versions and has supported real-time analysis for Alibaba Group business sectors such as e-commerce, advertising, logistics, entertainment, tourism, and risk control. In 2014, AnalyticDB for MySQL was officially released to the public. It serves traditional large and medium-sized enterprises, public service sectors, and Internet enterprises in more than a dozen industries.

AnalyticDB for MySQL is a cloud-native data warehouse service that integrates database and big data capabilities.

Technical architecture

AnalyticDB for MySQL adopts a cloud-native architecture that separates computing from storage and hot data from cold data. It supports high-throughput real-time writes with strong data consistency, high query concurrency, and high-throughput batch processing.

AnalyticDB for MySQL Data Warehouse Edition is suitable for high-performance, real-time analysis. As data volumes grow and more data formats are supported, data must be preprocessed before extract, transform, load (ETL) operations are performed. To address this need, AnalyticDB for MySQL Data Lakehouse Edition was released. It provides high-throughput batch processing capabilities to meet both batch processing and real-time analysis requirements.

Data Warehouse Edition

The following figure shows the architecture of Data Warehouse Edition.

[Figure: Architecture of Data Warehouse Edition]

Access layer

The access layer consists of linearly scalable coordinator nodes. It handles protocol access, SQL parsing and optimization, real-time sharding of written data, data scheduling, and query scheduling.
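The "real-time sharding of written data" step can be pictured with a minimal sketch. The text does not say how rows are assigned to shards, so hash-based routing on a distribution key is an assumption here, and `NUM_SHARDS`, `shard_for`, `route_rows`, and the `order_id` field are hypothetical names used only for illustration.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 8  # hypothetical shard count, chosen only for illustration


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a distribution key to a shard with a stable hash (CRC32 here, as an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def route_rows(rows, key_field, num_shards=NUM_SHARDS):
    """Group incoming rows by target shard, as a coordinator node might before forwarding them."""
    batches = defaultdict(list)
    for row in rows:
        batches[shard_for(str(row[key_field]), num_shards)].append(row)
    return dict(batches)


rows = [
    {"order_id": 1001, "amount": 25.0},
    {"order_id": 1002, "amount": 13.5},
    {"order_id": 1001, "amount": 7.25},
]
batches = route_rows(rows, "order_id")
for shard_id, batch in sorted(batches.items()):
    print(shard_id, batch)
```

Because the hash is stable, rows that share a key always land on the same shard, which is what lets writes and later queries for that key be handled in parallel across shards.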

Compute engine

The compute engine integrates distributed massively parallel processing (MPP) and directed acyclic graph (DAG) capabilities. It leverages an intelligent optimizer to support high-concurrency and complex SQL queries. The cloud-native infrastructure allows compute nodes to be scaled within seconds, which keeps resource utilization efficient.

Storage engine

The storage engine uses the Raft consensus protocol to support real-time data writes with strong consistency and high availability. It uses data sharding and Multi-Raft for parallel processing, tiered storage of hot and cold data to reduce costs, and hybrid row-column storage with intelligent indexing to improve query performance.
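As a rough illustration of how sharding and Multi-Raft fit together, the sketch below gives each shard its own consensus group and commits a write only after a majority of that group's replicas acknowledge it. This is a heavily simplified model of the Raft commit rule (no leader election, terms, or log repair) and not the XUANWU implementation. The `RaftGroup` class, the replica names, and the three-replica default are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class RaftGroup:
    """One consensus group per shard (the Multi-Raft idea). Replica names are hypothetical."""
    shard_id: int
    replicas: tuple = ("r1", "r2", "r3")
    log: list = field(default_factory=list)

    def quorum(self) -> int:
        """Smallest number of replicas that forms a majority."""
        return len(self.replicas) // 2 + 1

    def append(self, entry, acks) -> bool:
        """Commit the entry only if a majority of this group's replicas acknowledged it."""
        valid_acks = set(acks) & set(self.replicas)
        if len(valid_acks) >= self.quorum():
            self.log.append(entry)
            return True
        return False


# Each shard has an independent group, so writes to different shards do not wait on each other.
groups = {shard_id: RaftGroup(shard_id) for shard_id in range(4)}
print(groups[0].append("insert row A", {"r1", "r2"}))  # two of three replicas acknowledged
print(groups[1].append("insert row B", {"r3"}))        # only one replica acknowledged
```

With three replicas the quorum is two, so a group keeps accepting writes while one replica is unavailable.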

Data Warehouse Edition uses this three-layer architecture to support failover within seconds, cross-zone deployment, automatic fault detection, and deletion and recreation of replicas. It supports three-replica data storage and full and incremental backups, which provide the level of data reliability required in the finance industry. It also provides tools to migrate, synchronize, manage, integrate, and protect your data, which allows you to focus on business development.

Data Lakehouse Edition

Whereas Data Warehouse Edition focuses on high-performance, real-time analysis, Data Lakehouse Edition supports both low-cost batch processing and high-performance, real-time analysis. Data Lakehouse Edition significantly improves data processing capabilities in collection, storage, computing, management, and application.

The following figure shows the architecture of Data Lakehouse Edition.

[Figure: Architecture of Data Lakehouse Edition]

Data source

AnalyticDB Pipeline Service (APS) provides low-cost access to data sources such as databases, logs, and big data platforms.

Storage layer and compute layer

Data Lakehouse Edition provides two in-house engines: the XIHE compute engine and the XUANWU storage engine. Data Lakehouse Edition also supports the open source Spark compute engine and Hudi storage engine. Data Lakehouse Edition is suitable for a variety of data analysis scenarios and supports access between the in-house and open source engines to implement centralized data management.

  • Storage layer: One copy of full data can be used for both batch processing and real-time analysis.

    Batch processing needs data on low-cost storage media to reduce costs, whereas real-time analysis needs data on fast storage media to improve performance. To meet the requirements for batch processing, Data Lakehouse Edition stores one copy of full data on low-cost, high-throughput storage media. This reduces data storage and I/O costs and ensures high throughput. To meet the requirement of real-time analysis within 100 milliseconds, Data Lakehouse Edition stores real-time data on separate elastic I/O units (EIUs). This helps meet the timeliness requirements for row data query, full indexing, and cache acceleration.

  • Compute layer: The system automatically selects an appropriate computing mode for the XIHE compute engine. The open source Spark compute engine is suitable for various scenarios.

    The XIHE compute engine provides two computing modes: MPP and bulk synchronous parallel (BSP). The MPP mode uses stream computing, which is not suitable for low-cost, high-throughput batch processing. The BSP mode splits the work into tasks within a DAG and computes the data for each task separately. This way, large amounts of data can be processed with limited resources, and the data can be stored on disks. If the MPP mode fails to process data within a specific period of time, the XIHE compute engine can automatically switch to the BSP mode.

    The open source Spark compute engine is suitable for more complex batch processing and machine learning scenarios. The compute layer and storage layer are separated but interconnected, which allows you to easily create and configure Spark resource groups.
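The automatic switch from MPP to BSP described for the XIHE compute engine can be sketched as a try-then-fall-back pattern. This is only a toy model: stage costs are simulated seconds rather than real execution, and `run_mpp`, `run_bsp`, `execute`, and the time budget are hypothetical names and parameters, not part of any AnalyticDB API.

```python
def run_mpp(stage_costs, budget_s):
    """Streaming-style run: the whole query must finish within the time budget."""
    elapsed = 0.0
    for cost in stage_costs:
        elapsed += cost
        if elapsed > budget_s:
            raise TimeoutError(f"MPP run exceeded the {budget_s}s budget")
    return "mpp"


def run_bsp(stage_costs):
    """Stage-by-stage run: each task finishes and materializes its output before the next starts."""
    materialized = []
    for index, _cost in enumerate(stage_costs):
        # In this toy model, every stage writes an intermediate result (for example, to disk).
        materialized.append(f"stage-{index}-output")
    return "bsp", materialized


def execute(stage_costs, budget_s):
    """Try MPP first and fall back to BSP if the budget is exceeded."""
    try:
        return run_mpp(stage_costs, budget_s), []
    except TimeoutError:
        return run_bsp(stage_costs)


print(execute([1.0, 1.5], budget_s=5.0))  # fits in the budget, stays in MPP mode
print(execute([3.0, 4.0], budget_s=5.0))  # exceeds the budget, reruns in BSP mode
```

The fallback trades latency for the ability to finish with limited resources, because BSP only needs enough capacity for one task at a time.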

Access layer

The access layer leverages unified billing units, metadata and permissions, development languages, and transmission links to improve development efficiency.

AnalyticDB for MySQL combines the advantages of distributed architecture, elastic computing, and cloud computing to significantly improve scalability, usability, reliability, and security, which helps meet data warehousing requirements in different scenarios. It supports concurrent access on a larger scale, provides faster read and write performance, and manages hybrid query workloads more intelligently. It also enables finer-grained resource usage at a lower cost, which allows you to focus more on business development and data value.