AnalyticDB for MySQL: Your Choice for Real-time Data Analysis in the AI Era

The article compares AnalyticDB for MySQL with Rockset and describes the design and implementation of AnalyticDB for MySQL.

By Jiye Tu (Hejun), from AnalyticDB for MySQL Storage Kernel R&D Team

AnalyticDB for MySQL and Rockset share many architectural similarities, such as storage-computing separation and real-time writes. However, AnalyticDB for MySQL offers more mature and advanced capabilities in many respects: more flexible scalability, strongly consistent real-time reads and writes, comprehensive index types, high-throughput writes, complete DML and Online DDL operations, and intelligent data lifecycle management. For vector retrieval and analysis, AnalyticDB for MySQL also provides higher retrieval accuracy. Its design principles include distributed tables, a synchronization layer based on the Raft protocol, an engine layer that supports DML and DDL, and a high-performance, low-cost persistence layer. Together, these ensure the high performance and cost-effectiveness of AnalyticDB for MySQL as a real-time data warehouse in the AI era.

1. Introduction

After years of research and development within Alibaba Group and Alibaba Cloud, AnalyticDB for MySQL has continuously addressed the challenges of massive data analysis and met the diverse, differentiated needs of customers, evolving into a real-time data warehouse well suited to the AI era. This article introduces the design and implementation principles of AnalyticDB for MySQL in depth and compares it with Rockset along the way.

2. Introduction to AnalyticDB for MySQL and Comparative Analysis with Rockset

AnalyticDB for MySQL is a real-time data warehouse service developed in-house by Alibaba Cloud. It can process petabytes of data and has been tried and tested in ultra-large-scale core businesses. It serves online analysis systems for traditional large and medium-sized enterprises, government agencies, and Internet companies, covering industries such as e-commerce, advertising, logistics, entertainment, tourism, and risk control. AnalyticDB for MySQL adopts a cloud-native architecture that scales components horizontally or vertically, separates computing from storage, and separates hot data from cold data. It supports high-throughput real-time writes with strong data consistency, high query concurrency, and high-throughput batch processing. Its underlying XUANWU storage engine is a self-developed column-oriented storage engine that supports high-throughput real-time writes, high-performance real-time queries, ultra-large data volumes, complete DML and Online DDL, and automatic data tiering with lifecycle-based eviction for low cost. It is the foundation on which AnalyticDB for MySQL is built.

Rockset is a real-time data warehouse company founded in 2016. Its cloud service is built on RocksDB and is applied in fields such as recommendation engines, logistics tracking, financial technology, and e-commerce. Its main features include storage-computing separation, real-time writes, full indexing, schemaless ingestion, low-cost storage, and integrated vector retrieval.

In the AI era, the training, fine-tuning, deployment, and iteration of large models depend heavily on massive amounts of high-quality data. Multi-source collection, mass storage, low-cost management, fast import, and fast retrieval of this data are core competitive capabilities. As typical representatives of real-time data warehouses, AnalyticDB for MySQL and Rockset both provide powerful data processing support, but there are some differences, which are described in detail below.

2.1 Storage-Computing Separation Architecture

AnalyticDB for MySQL is similar to Rockset in completely separating computing from storage. The underlying storage nodes of AnalyticDB for MySQL are based on Alibaba Cloud Object Storage Service (OSS) and Alibaba Cloud Elastic Block Storage (EBS), and the upper-layer computing nodes are based on ECS and Kubernetes. Storage and computing resources can be independently expanded. In addition, the computing layer further supports multiple resource groups. Different resource groups are completely isolated from each other and can be used for different businesses without mutual interference.

AnalyticDB for MySQL differs from Rockset in that it supports a wider range of scaling options:

Scheduled scaling: You can configure the rules for the scheduled upgrade and downgrade of computing resources based on business characteristics.

On-demand scaling: Resources are pulled up only when ETL SQL statements are received, which minimizes costs.

Auto-scaling Multi-Cluster within a single resource group: The scale of resources is adjusted in real time based on actual resource usage, such as CPU utilization and memory usage.

The training of large AI models requires massive amounts of data, and the amount of data required will increase over time, which consumes an astonishing amount of storage, computing, and power resources.

With the storage-computing separation architecture of AnalyticDB for MySQL, storage resources can be provisioned and grown continuously without a corresponding increase in computing resources, greatly reducing CPU and memory consumption. Moreover, because AnalyticDB for MySQL provides scheduled scaling, on-demand scaling, Multi-Cluster, and other capabilities that Rockset does not have, it can keep resource consumption closely matched to actual needs and thus minimize costs for AI enterprises.

2.2 Real-Time Consistency

AnalyticDB for MySQL uses a synchronous write model with strong consistency: data is visible immediately after high-throughput write operations. This is implemented based on log synchronization, a real-time engine, primary keys, DeleteSet, and row-level MVCC. All DML statements are visible in real time at the row level, even on ultra-large datasets. When you query data, each shard takes a row-level snapshot corresponding to the latest offset of the leader, ensuring full visibility and strong consistency in real time.
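As a minimal illustration of this behavior (the table and columns are hypothetical), a row written by a DML statement is visible to the very next query, with no replication delay to wait out:

    -- Hypothetical table; the write is synchronous and strongly consistent.
    INSERT INTO user_events (id, dt, city) VALUES (1001, '2024-01-01 12:00:00', 'Hangzhou');

    -- The newly written row is immediately visible.
    SELECT id, city FROM user_events WHERE id = 1001;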

In contrast, Rockset's writes are asynchronous and only guarantee eventual consistency: data typically becomes queryable 1 to 5 seconds after it is written.

The data required for AI model training comes from a variety of data sources. Traditional offline import pipelines often work on a T+1 basis, making data available only the next day, which sacrifices freshness. In AI scenarios such as intelligent Q&A bots and AI-driven news recommendation, reduced data freshness seriously lowers business value. AnalyticDB for MySQL provides real-time read and write capabilities with stronger consistency than Rockset and ensures data freshness through real-time visibility. In addition, strong consistency greatly simplifies the development of upper-layer data systems and improves the efficiency of data cleaning, extraction, and transformation, accelerating every stage of model training.

2.3 Full Indexes of All Types

In terms of indexing, AnalyticDB for MySQL is similar to Rockset in that both automatically index all columns. The differences are that:

1) Rockset builds three kinds of indexes for each column: inverted indexes for point queries, in-memory column indexes for aggregation, and row indexes for data retrieval. While this provides rich access patterns, it also incurs large storage overhead.

2) AnalyticDB for MySQL provides richer and finer-grained indexing capabilities.

AnalyticDB for MySQL is designed for extreme query pushdown and late materialization. It supports almost all types of column-level indexes, including inverted indexes, BKD indexes, bitmap indexes, vector indexes, full-text indexes, and JSON indexes, covering structured and semi-structured data and serving query performance needs in different scenarios. With full indexing enabled by default, the XUANWU engine automatically selects the appropriate index for each column type, so you only need to focus on the SQL itself, without manual intervention.

In addition to column-level secondary indexes, a Clustering Key (a sort key for row data) is also supported. It lays data out in the defined column order and enables I/O aggregation on the sort columns, greatly improving read efficiency on those columns. AnalyticDB for MySQL further provides intelligent indexing, which automatically recommends indexes and removes useless ones based on the user's SQL patterns.

RAG (Retrieval-Augmented Generation) is the most typical application scenario for large AI models and a key building block of Q&A bots such as ChatGPT. High-quality, efficient retrieval can greatly improve answer accuracy and is essential infrastructure for the AI era. Compared with Rockset, AnalyticDB for MySQL provides more index types at a finer granularity and higher search performance. It also provides the Clustering Key to store data in a more ordered fashion, which further improves the compression ratio and I/O performance and ultimately reduces end-to-end response latency on the client side.

2.4 High-Throughput Write

Both systems support high-throughput data import. In AnalyticDB for MySQL, append-only writes and background compaction are designed around the primary key and DeleteSet, enabling high-performance real-time writes at the engine level. In addition, the sharding architecture allows write concurrency to be scaled out horizontally, supporting high-throughput real-time ingestion at the instance level through concurrent writes to multiple tables. Rockset uses a Tailer-Leaf-Aggregator architecture with independent Tailer resources to handle data writes.

In addition to real-time writes, both support bulk load for batch import. AnalyticDB for MySQL allows you to import large amounts of data in one pass using elastic third-party resources such as Spark.
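A minimal sketch of this batch import path, assuming an external table staging_oss mapped to the source files already exists (the table names are hypothetical, and the exact external-table DDL depends on the data source):

    -- INSERT OVERWRITE runs as a batch job and can draw on elastic resources such as Spark.
    INSERT OVERWRITE INTO user_events
    SELECT id, dt, city
    FROM staging_oss;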

The training of high-quality models depends on high-quality data input, while the cleaning, transformation, and loading of large training datasets often require repeated trial and error and continuous adjustment of cleaning strategies, which is time-consuming. High-throughput real-time writes and batch import are powerful tools for improving efficiency. The mark-for-delete-based real-time write technology of AnalyticDB for MySQL allows any update or deletion on petabytes of data to complete within milliseconds, and its elastic data import capability makes it easy to scale resources out horizontally to maintain import efficiency when large amounts of data need to be loaded.

2.5 Complete DML and OnlineDDL

With its sharding and partitioning design, AnalyticDB for MySQL still implements the complete row-level DML semantics of a database on ultra-large datasets. In addition, the rich DDL operations currently supported (such as adding and dropping columns and indexes, and modifying column types, partition formats, and storage policies) are fully online. Users can apply them as needed without extra maintenance or worrying about table locks affecting online business.
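The statements below sketch the kinds of Online DDL operations described above (the table, column, and index names are hypothetical; see the official DDL reference for the exact supported forms):

    -- Add a column; the table stays readable and writable while the DDL runs.
    ALTER TABLE user_events ADD COLUMN device_type varchar(32);

    -- Add or drop a secondary index.
    ALTER TABLE user_events ADD KEY idx_device (device_type);
    ALTER TABLE user_events DROP KEY idx_device;

    -- Widen a column type.
    ALTER TABLE user_events MODIFY COLUMN device_type varchar(64);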

Rockset, on the other hand, provides a Document data model, which does not require specifying the type of each field in advance. It can automatically add columns and infer types based on the written data. You can even write data of different types in the same column.

In contrast, AnalyticDB for MySQL is a strongly typed system that still supports flexible subsequent changes. Up-front data modeling carries a certain learning cost but greatly helps later data governance, data quality, and query performance. Rockset is a weakly typed system that is easy to get started with, but data governance and management become harder later on.

For the massive datasets that must be managed in the AI era, data governance is often performed repeatedly over a long lifecycle to ensure data quality, and the strongly typed design of AnalyticDB for MySQL can greatly ease this management burden.

2.6 Tiered Storage and Automatic Data Lifecycle Management

Both support tiered storage of hot and cold data and TTL lifecycle management of data. The differences are that:

Rockset stores all data on OSS and additionally caches hot data on SSD. AnalyticDB for MySQL, by contrast, stores only cold data on OSS and places hot data directly on SSD, which reduces storage costs. AnalyticDB for MySQL supports a mixed storage policy: you can designate the most recent partitions as hot partitions and historical partitions as cold partitions. Hot partitions are stored on [E]SSDs for online analysis, and cold partitions are automatically archived to DFS to reduce costs. Archiving runs in the background without user intervention. For cold data that is accessed only occasionally, a cache accelerates the more frequently accessed portions. Both hot and cold partitions are billed based on actual usage, so allocated but unused disk space incurs no cost.

Rockset can only expire data based on the _event_time field, whereas AnalyticDB for MySQL can designate any field as the partition key column and expire data at the partition level.
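A hedged sketch of how these policies might be declared when a table is created (the table, columns, lifecycle length, and property keywords are assumptions; check the official documentation for the exact syntax):

    CREATE TABLE sensor_readings (
      device_id bigint,
      ts datetime,
      metric double,
      PRIMARY KEY (device_id, ts)
    )
    DISTRIBUTED BY HASH(device_id)
    -- Keep 365 daily partitions; older partitions are expired automatically.
    PARTITION BY VALUE(DATE_FORMAT(ts, '%Y%m%d')) LIFECYCLE 365
    -- The newest 7 partitions stay hot on [E]SSD; the rest are archived to DFS in the background.
    STORAGE_POLICY = 'MIXED' HOT_PARTITION_COUNT = 7;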

In AI scenarios such as financial time-series prediction, health monitoring, and network security protection, data has a natural hot/cold division: cold data must be retained but is rarely accessed. In addition, the storage cost of massive data often accounts for the majority of the cost of the data infrastructure. In this case, low-cost, reliable storage media and an automatic, user-transparent archiving mechanism are essential, along with the guarantee that hot-data query performance does not degrade. Rockset can only manage the lifecycle based on time, while AnalyticDB for MySQL can expire and archive data based on any column, which is more flexible for diverse AI business scenarios.

2.7 Vector Retrieval and Integrated Analysis

Both AnalyticDB for MySQL and Rockset provide vector retrieval and support integrated retrieval over vectors and data of basic types.

The difference is that, in addition to vector indexes and inverted indexes, AnalyticDB for MySQL provides full-text indexes dedicated to text, along with features such as custom tokenizers, custom dictionaries, and keyword highlighting, so that businesses can tune retrieval to their knowledge domain and obtain more precise results.

As mentioned earlier, high-quality retrieval is essential for the accuracy of AI Q&A models. AnalyticDB for MySQL not only provides integrated vector and basic-type retrieval comparable to Rockset, but also supports custom tokenization and dictionary configuration for text in specialized fields such as security, technical investigation, and healthcare, providing more accurate text retrieval results as input for domain-specific AI models.
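As a simple illustration (the table and columns are hypothetical; MATCH ... AGAINST is the typical full-text search form referred to later in this article), a full-text filter can be combined with ordinary predicates in a single SQL statement:

    -- Full-text search over a document table, combined with a structured filter.
    SELECT doc_id, title
    FROM kb_articles
    WHERE MATCH (content) AGAINST ('storage-computing separation')
      AND publish_date >= '2024-01-01';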

3. Design Principle and In-depth Analysis of AnalyticDB for MySQL

The preceding section compared the key features of AnalyticDB for MySQL and Rockset; the following sections describe the design and implementation principles of AnalyticDB for MySQL in detail.

3.1 Data Model

The data abstraction of the XUANWU engine is a distributed table: data is hashed to a shard (sub-table) based on the value of the distribution key and is further partitioned within the shard, typically by date. Concepts related to data distribution include databases, physical databases, tables, hash distribution, and partitions.

[Figure 1: Data distribution model (databases, physical databases/shards, tables, and partitions)]

  1. Physical database (shard). When you create an instance, a number of shards are allocated based on resource availability. Each shard corresponds to a physical database, and the entire instance is equivalent to a distributed database composed of all physical databases (shards). Scale-out is performed at the shard level by migrating hot data between nodes.
  2. Database. The user database is a logical concept with no physical distribution significance.
  3. Table. The user table is also a logical concept. Real data is distributed across the physical databases, each of which holds a sub-table. As shown in the figure above, Table2 is distributed across the physical databases (physical database 1 is not expanded).
  4. Hash distribution. When defining a table, you can use Distributed by Hash(column_name,...) to specify a distribution key. When data is written, the hash value of the distribution key is mapped onto the number of shards to determine the shard (physical database) to which the record is written.
  5. Partition. After a record is routed to a specific shard, it falls into a specific partition according to the partitioning method specified in the table definition. Partitioning further splits data on top of sharding. Partitions are typically organized by date in year-month-day order. Two definition types are currently supported; both are value-based partitioning (PARTITION BY VALUE) and differ only in where the value comes from.
  • Partition By Value(column_name): partitions by the value of the partition key column column_name. For example, if column_name holds values such as 20240101, data is logically partitioned by day. This method requires the data to already be formatted as the desired partition value before loading.
  • Partition by Value(date_format(column_name,'format')): partitions by the value of the partition key column column_name after formatting with date_format. For example, if column_name is of a date type, it can be formatted as year, month, and day.

Here we only describe the macro-level data distribution method: Hash distribution and Partition by Value distribute data across the nodes of an instance and organize it by date, which meets the business requirements of ultra-large data volumes. For more information about table structure definitions, please refer to the official documentation.
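Putting these concepts together, a table definition might look roughly like the following sketch (the table and columns are hypothetical, and the clause spellings follow the forms quoted above; refer to the official DDL documentation for the exact syntax):

    CREATE TABLE user_events (
      id bigint,
      dt datetime,
      city varchar(32),
      age int,
      PRIMARY KEY (id, dt)
    )
    -- Route rows to shards by hashing the distribution key:
    DISTRIBUTED BY HASH(id)
    -- Organize data inside each shard into daily partitions:
    PARTITION BY VALUE(DATE_FORMAT(dt, '%Y%m%d'))
    -- Optional sort key (Clustering Key) to enable I/O aggregation on the sort column:
    CLUSTERED BY (city);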

3.2 Storage Architecture

[Figure 2: Logical layers of the XUANWU engine (synchronization, engine, and persistence layers)]

From a logical perspective, the XUANWU engine can be roughly divided into the synchronization layer, engine layer, and persistence layer from top to bottom.

  1. Synchronization layer. Following the principle of log first, the synchronization layer is responsible for the temporary persistence of data in the database. The implementation is based on the Raft protocol and uses multiple replicas to ensure data reliability and consistency.
  2. Engine layer. The engine layer manages the entire lifecycle of data, distributes the log data from Raft to each table engine, consumes and persistently stores the log data in a timely manner, and provides external query services. Various types of memory caches for different purposes are also maintained at the node level to improve the query performance of high-frequency access.
  3. Persistence layer. As a trade-off between performance and cost, the persistence layer offers high-performance Elastic Block Storage devices as well as low-cost, high-throughput Object Storage Service, with automatic cache acceleration for Object Storage Service access.

Synchronization Layer

[Figure 3: Synchronization layer based on the Raft protocol]

The synchronization layer is based on the Raft protocol.

  1. Multi-Raft. In the table engine hash sharding model, one shard corresponds to one Raft cluster, and the entire instance consists of a synchronous replication layer formed by several Raft clusters.
  2. Shard distribution. All shards of an instance are evenly distributed across all EIUs, and scale-out is performed at the shard level. In the preceding figure, suppose that there are four shards and two groups of EIUs, two shards are allocated to each group of EIUs.
  3. Raft Leader. For each shard allocated to an EIU, a RaftServer is created on each replica to form a Raft cluster, which elects a Leader.
  4. Leader writes data. When data is written, it is first written to the Leader replica, and the Leader then synchronizes logs to the other replicas. Write reliability is guaranteed once a majority of replicas have acknowledged the write.
  5. Query consistency. The Leader replicas are preferentially queried during queries to ensure strong consistency in queries. For queries that are sent to the Follower for other reasons, you can also synchronously obtain the latest Leader offset and wait to ensure consistency.
  6. Load balancing. The Leader is responsible for read and write services to ensure real-time visibility and consistency. Therefore, Leaders need to be spread evenly across all EIUs to balance the load. As shown for EIU1 above, the Leader of Shard1 is on Replica1 and the Leader of Shard3 is on Replica3.
  7. Log cleanup. Logs are consumed in the table engine after they are written. The table engine periodically performs checkpoint data persistence. The log data before the persistence point can be cleaned up to avoid excessive space occupancy.
  8. Restart. Recovery after a restart is based on the engine's persisted state plus replay of the retained logs: during a restart, logs are replayed from the table engine's persistence point.

Engine Layer

[Figure 4: Engine layer design]

The table engine as a whole is based on a design of append-only writes plus background merging of modifications, which reduces random I/O overhead on the write path and improves write performance. Complete DML and Online DDL are supported, providing database-grade data management on ultra-large datasets.

  1. The primary key. The primary key (index) is used to quickly locate duplicate records and add, delete, and modify data in real time based on DML semantics.
  2. DeleteSet. In the append write-oriented design, historical data is read-only, and updates to read-only data are mapped to mark the deletion of read-only data and appending new records. Therefore, the modification of read-only data only means deletion, which is recorded by DeleteSet. The deleted historical data rows are filtered out during the query.
  3. DML. A DML statement first checks the primary key to determine whether the record exists and then behaves according to its semantics. For example, when the primary key already exists (there is an old record), Insert Into skips writing the new record, Replace Into marks the old record as deleted and then writes the new record, Delete marks the old record as deleted, Update marks the old record as deleted and writes the updated record, and InsertDuplicate performs an update when the old record exists. Thanks to the primary key and DeleteSet design, complete real-time row-level DML is currently supported (see the SQL sketch after this list).
  4. In-memory data. The appended data is first cached in memory and persisted to disk after reaching the threshold. Basic lightweight statistics are collected during persistence.
  5. Disk layout. The XUANWU engine mainly targets analysis scenarios. Like the in-memory data shown above, it uses column-oriented storage as the data layout, which takes full advantage of the high compression ratio of columnar storage and reads only the queried columns. On disk, either pure column-oriented storage or a PAX-like RowGroup organization can be chosen depending on the scenario.
  6. Meta. During batch storage on disks, simple statistical information, such as min/max/cnt/nullCnt, can be collected to perform simple rough filtering when indexes are not available and align the rough filtering capability of typical data warehouses, so as to reduce the amount of scanned data.
  7. RealtimeEngine. The real-time engine is an appendable and modifiable engine that acts as a real-time buffer. To improve write performance, it has no indexes by default, and queries against it are coarsely filtered using the collected Meta. By contrast, some systems write one or more independent small files to disk per batch, which can leave large numbers of small files unmerged in time under heavy batch writes. When the real-time engine exceeds a certain threshold, background scheduling automatically merges it into the full data (PartitionEngines), appending new records and removing the records previously marked for deletion.
  8. PartitionEngine. The partition engine splits data based on the defined partition format. Compared with the real-time engine, the partition engine has an additional index part. Indexes are built by using asynchronous merging in the background to improve query performance.
  9. Indexes. XUANWU engine supports all types of indexes, including general column indexes, vector indexes, full-text indexes, and JSON indexes.
  10. OnlineDDL. All DDL statements supported by the XUANWU engine are online and do not affect online services.
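The statements below sketch the row-level DML semantics just described (the table is hypothetical); internally, updates and deletes are realized as a mark-for-delete on the old row plus, where applicable, an appended new row:

    -- INSERT INTO skips the write if a row with the same primary key already exists.
    INSERT INTO user_events (id, dt, city) VALUES (1001, '2024-01-01 12:00:00', 'Hangzhou');

    -- REPLACE INTO marks the old row as deleted and appends the new row.
    REPLACE INTO user_events (id, dt, city) VALUES (1001, '2024-01-01 12:00:00', 'Shanghai');

    -- UPDATE marks the old row as deleted and appends the updated row;
    -- DELETE only marks the old row as deleted.
    UPDATE user_events SET city = 'Beijing' WHERE id = 1001;
    DELETE FROM user_events WHERE id = 1001;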

Persistence Layer

Due to the trade-off between performance and cost, the persistence layer offers high-performance, low-latency [E]SSDs, mainly used for hot data, and high-throughput, low-cost DFS, mainly used for cold data. In addition, user data is billed according to usage, so there is no need to manually scale disks in or out.

  1. [E]SSD (hot). [E]SSDs feature high performance and low latency, which suits latency-sensitive hot-data scenarios such as online analysis. To meet high-performance write requirements, the real-time write path of the XUANWU engine is also based on cloud disks, and data is then asynchronously archived according to the hot and cold partition definitions during compaction.
  2. DFS (cold). DFS is suitable for offline throughput scenarios that are cost-sensitive, which corresponds to cold storage.
  3. Automatic archiving. XUANWU engine allows you to set the storage policy to mixed mode and set the number of hot partitions to be retained. This way, historical partitions are automatically archived in cold storage during background compaction to reduce costs.
  4. Cache (warm). For occasional access to data in cold storage, reading DFS directly limits query performance due to access latency and total bandwidth. The XUANWU engine therefore automatically caches cold-storage data locally based on access popularity, striking a balance between performance and cost.

3.3 Query Process

When a query runs, each shard is served by its Leader to ensure strong consistency. Inside a shard, the table engine takes a lightweight snapshot composed of the real-time engine's virtual row number and the DeleteSet. All data and deletions before the snapshot are visible, while data and deletions added concurrently after the snapshot are ignored. This provides row-level MVCC semantics.

The XUANWU engine is responsible for ScanFilterAndProject in queries. As the data source, its input is the predicate conditions and projection columns (together with indexes), and its output is exactly the projection-column data that satisfies the predicates. It pushes projections and filters down to storage, relying on column-oriented storage and indexes to minimize the amount of data involved in computation. Beyond common column pushdown, it also supports richer pushdowns such as semi-structured data functions. This aggressive near-storage pushdown delivers an ultra-high-performance query experience.

[Figure 5: Query example (ScanFilterAndProject)]

The preceding figure shows a query example whose ScanFilterAndProject step can be expressed as the SQL statement in the example. In this SQL statement, the XUANWU engine needs to match rows whose city is "Hangzhou" or "Shanghai" and whose age is in the range [15, 30], or rows where the distance between the facial feature and the target feature is within a constant value and that also contain certain labels, and finally retrieve the id and name of the qualifying rows. The XUANWU engine makes full use of indexes to accelerate such queries, achieving extreme late materialization and minimizing I/O reads.
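A hedged reconstruction of that example query follows; the table and column names, the vector-distance function, and the JSON predicate form are assumptions used for illustration, and the actual function names may differ:

    SELECT id, name
    FROM person
    WHERE (city IN ('Hangzhou', 'Shanghai') AND age BETWEEN 15 AND 30)  -- idx_city (inverted) and idx_age (BKD)
       OR (l2_distance(feature, '[0.12, 0.35, 0.08]') < 0.5             -- idx_feature (vector, SquaredL2); function name assumed
           AND json_contains(tags, '"vip"'));                           -- idx_tags (JSON array wildcard); predicate form assumed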

First, based on the column-level indexes, the set of row numbers satisfying each condition is obtained. Then, according to the query's composite condition, intersection, union, and difference operations on these row-number sets produce the final effective row set, RowIdSet. The XUANWU engine supports a wide variety of index types, including

Inverted indexes: The key of the index is the value of the column, and the value of the index is the virtual row number, which is more efficient for common filters. Compression based on ordered data can also reduce the space occupied by the index, such as the idx_city condition shown in the preceding figure.

BKD indexes: The tree structure is more advantageous than inverted indexes for range queries, thus being typically used for numeric and spatial data types, such as the idx_age condition shown in the preceding figure.

Bitmap indexes: Suitable for columns with low cardinality, such as gender, and for tracking NULL distribution.

Vector indexes: This type of index is used for face recognition and other vector analysis and supports the HNSW_PQ search algorithm and the SquaredL2 distance calculation formula, such as the idx_feature condition shown in the preceding figure, which also accordingly supports the pushdown of vector functions.

Full-text indexes: This type of index is used for full-text searches that support typical tokenizers such as IK, AliNLP, Ngram, and Standard and also supports custom dictionaries. The typical match against, phrase, and fuzzy full-text search functions are supported.

JSON indexes: This type of index is used to query data of the JSON type. It flattens nested JsonPaths and automatically selects a specific index implementation based on the value type. It can either automatically flatten all nested JsonPaths or index only specified JsonPaths to reduce indexing overhead. For JSON arrays, the wildcard [*] is also supported, meaning array elements are uniformly indexed so that specific values inside an array can be searched efficiently, as with the idx_tags condition shown in the preceding figure. Pushdown of JSON functions is supported accordingly.

After the qualifying RowIdSet is obtained, the data of the required projection columns (id, name) is read directly by addressing the virtual row numbers, which requires the least possible I/O.

4. Summary

This article compares AnalyticDB for MySQL with Rockset and elaborates on the design and implementation of AnalyticDB for MySQL. As a cloud-native data warehouse developed by Alibaba Group and Alibaba Cloud over many years and verified by massive businesses, AnalyticDB for MySQL uses a storage-computing separation architecture and features strongly consistent, high-throughput real-time writes, full indexes of all types, flexible schemas, tiered storage of hot and cold data, and mixed-workload capability for high query concurrency and high-throughput batch processing. It provides a real-time data warehouse solution with extreme performance and high cost-effectiveness in the AI era.

Learn more about AnalyticDB for MySQL.
