By Jiaze and Zhiliu
In the era of AI-native application architectures, message queues must support efficient asynchronous communication for large-scale data and complex AI model training and inference scenarios. Optimizing cost-effectiveness is increasingly important. As message volumes grow with large models or big data, ApsaraMQ aims to reduce costs, ease user burden, and improve data processing capabilities, security, performance, and resource utilization through architectural evolution. This enables AI developers to achieve higher efficiency at lower costs.
ApsaraMQ focuses on three core areas: high elasticity and low cost, enhanced stability and security, and intelligent, maintenance-free operation. To achieve high elasticity and low cost, all ApsaraMQ products (including RocketMQ, Kafka, RabbitMQ, MQTT, and MNS) have been made serverless, supporting adaptive elasticity and scaling to tens of thousands of QPS. They use a pay-as-you-go model, reducing instance costs by 50% on average.
This article explores the cost management practices of ApsaraMQ and introduces architectural optimizations and new capabilities in the serverless version. The goal is to provide cost control references for enterprises and individuals, helping them better understand and use ApsaraMQ to maximize cost-effectiveness.
In software development, beyond product iteration and operational costs, runtime costs are primarily composed of resource and O&M costs, which are interdependent.
• To reduce resource costs, we need to adjust the existing architecture. To ensure the safety and stability of these changes, we need a robust O&M system, including effective monitoring and rapid recovery mechanisms.
• By introducing monitoring and alert systems to improve system reliability, we may increase the resource overhead of core components and affect product performance. However, without a robust O&M system, system issues can lead to significant business losses.
Therefore, a strong O&M system is essential to ensure system stability and reliability while reducing costs. We must find a balance between resource utilization and O&M management so that we can maximize cost-effectiveness.
To reduce resource and O&M costs while improving system performance, stability, and operational efficiency, we have implemented the following strategies:
1. Reducing resource costs: simplify the architecture, for example by decoupling compute from storage so that resources can be scaled independently and utilized more fully.
2. Reducing O&M costs: strengthen monitoring, alerting, and automated recovery so that architectural changes and daily operations remain safe with less manual effort.
For example, open-source Kafka uses an integrated computing and storage architecture, often built on local file systems or local disks. This setup has several limitations:
(1) Performance bottleneck: The throughput and capacity of a single disk are limited, causing a performance bottleneck.
(2) Limited resource flexibility: The fixed ratio of storage to computing resources prevents flexible adjustments.
(3) Lengthy scaling time: Stateful nodes require data migration for scaling, which is influenced by factors such as node load, data volume, and disk throughput. Migrating terabyte-level data often takes hours, increasing risk and operational pressure.
(4) Complex storage architecture: Cold data is stored in object storage, while hot data is stored on local disks. Local disks need multiple replicas for data reliability, increasing network resource consumption. In addition, a logical mapping mechanism between local and secondary storage files adds system complexity.
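To make the "hours" claim in (3) concrete, here is a back-of-the-envelope estimate of migration time. The throughput and load figures are hypothetical and chosen only for illustration; they are not measurements of any particular cluster.

```python
# Illustrative estimate of why stateful-node scaling is slow: rebalancing a
# broker requires copying its partition data over the network, while live
# traffic consumes part of the disk throughput. All numbers are hypothetical.

def migration_hours(data_tb: float, disk_mb_per_s: float,
                    load_factor: float = 0.5) -> float:
    """Rough time to migrate `data_tb` terabytes when only a fraction
    (`load_factor`) of disk throughput is spare for migration traffic."""
    data_mb = data_tb * 1024 * 1024
    effective_mb_per_s = disk_mb_per_s * load_factor
    return data_mb / effective_mb_per_s / 3600

# 2 TB on a 300 MB/s disk, with half the throughput serving live traffic:
print(f"{migration_hours(2, 300):.1f} hours")  # -> 3.9 hours
```

Even under these generous assumptions, a terabyte-scale migration takes hours, during which the cluster runs at elevated risk.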
The figure above shows the architecture of Kafka 3.0, which separates compute from storage. The compute layer is stateless and uses the open-source ISR (in-sync replicas) mechanism for leader-follower election. The RDMA protocol is introduced for inter-node interactions, significantly reducing CPU consumption. The storage layer uses the Apsara Distributed File System for shared storage: message data, checkpoints, and index files are all written to it to ensure reliability.
In addition, the storage structure is optimized in the following areas:
• Memory batching: Supports multiple commit policies, such as time-based, space-based, and frequency-based policies; reduces network jitter and minimizes long-tail impacts on service quality.
• User-space caching: Implements multi-level caching mechanisms for faster data access; separates hot and cold data to prevent cache pollution.
• Cold read optimization: Separates cold and hot threads (coroutines) to prevent global unavailability; includes data preloading and pre-reading with adaptive IO size adjustments.
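As a rough illustration of the memory-batching idea above, the sketch below buffers messages and commits when any of three thresholds is hit: batch size (space-based), message count (frequency-based), or batch age (time-based). The class and parameter names are invented for illustration and are not ApsaraMQ internals; a production implementation would also flush on a background timer rather than only on append.

```python
import time

class BatchBuffer:
    """Minimal sketch of memory batching with multiple commit policies.
    Thresholds are illustrative defaults, not ApsaraMQ's configuration."""

    def __init__(self, flush, max_bytes=1 << 20, max_msgs=500, max_age_s=0.01):
        self.flush = flush          # callback that persists one batch
        self.max_bytes = max_bytes  # space-based policy
        self.max_msgs = max_msgs    # frequency-based policy
        self.max_age_s = max_age_s  # time-based policy
        self._reset()

    def _reset(self):
        self.buf, self.size, self.first_ts = [], 0, None

    def append(self, msg: bytes):
        if self.first_ts is None:
            self.first_ts = time.monotonic()
        self.buf.append(msg)
        self.size += len(msg)
        # Commit as soon as any policy fires.
        if (self.size >= self.max_bytes
                or len(self.buf) >= self.max_msgs
                or time.monotonic() - self.first_ts >= self.max_age_s):
            self._commit()

    def _commit(self):
        if self.buf:
            self.flush(self.buf)
            self._reset()

batches = []
buf = BatchBuffer(batches.append, max_msgs=3)
for i in range(7):
    buf.append(f"msg-{i}".encode())
print(len(batches))  # -> 2 (two full batches of 3; the 7th message is still buffered)
```

Amortizing many small writes into one persistence call is what smooths out network jitter and keeps long-tail latency from degrading service quality.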
Performance tests show that the optimized Kafka 3.0 outperforms the open-source version in both batching and fragmented sending scenarios.
Regarding resource consumption, Kafka 3.0 simplifies the architecture and reduces resource costs by separating storage from compute. Data reliability is guaranteed by a single set of storage resources, reducing overall storage costs by about 30%.
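The savings come from no longer paying for multiple broker-side replicas: the shared storage layer provides its own redundancy. The arithmetic below is purely illustrative; the replication factors and price are assumptions, and the article's ~30% figure reflects overall costs rather than raw capacity alone.

```python
# Hypothetical comparison of effective storage cost under different redundancy
# schemes. Replication factors and price are assumptions, not ApsaraMQ figures.

def storage_cost(logical_tb: float, redundancy_factor: float,
                 price_per_tb: float) -> float:
    """Physical storage cost after redundancy overhead is applied."""
    return logical_tb * redundancy_factor * price_per_tb

logical_tb, price = 100, 20.0  # 100 TB of data at a made-up $20/TB-month
local = storage_cost(logical_tb, 3.0, price)   # 3-way replication on local disks
shared = storage_cost(logical_tb, 1.5, price)  # assumed overhead of shared storage
print(f"saving: {(local - shared) / local:.0%}")  # -> saving: 50%
```

Under these assumed numbers the raw-capacity saving is 50%; after accounting for the shared storage service's own pricing and the remaining metadata overhead, an overall reduction on the order of 30% is plausible.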
To improve O&M efficiency and reduce costs while ensuring system stability and reliability, we offer the following enhancements: