
Cost Management Practices for ApsaraMQ

This article discusses the cost management practices of ApsaraMQ and introduces architectural optimizations and new capabilities in the serverless version.

By Jiaze and Zhiliu

Introduction

In the era of AI-native application architectures, message queues must support efficient asynchronous communication for large-scale data and complex AI model training and inference scenarios. Optimizing cost-effectiveness is increasingly important. As message volumes grow with large models or big data, ApsaraMQ aims to reduce costs, ease user burden, and improve data processing capabilities, security, performance, and resource utilization through architectural evolution. This enables AI developers to achieve higher efficiency at lower costs.

Background

ApsaraMQ focuses on three core areas: high elasticity and low cost, enhanced stability and security, and intelligent, maintenance-free operation. To achieve high elasticity and low cost, all ApsaraMQ products (including RocketMQ, Kafka, RabbitMQ, MQTT, and MNS) have been made serverless, supporting adaptive elasticity and scaling to tens of thousands of QPS. They use a pay-as-you-go model, reducing instance costs by 50% on average.

This article explores the cost management practices of ApsaraMQ and introduces architectural optimizations and new capabilities in the serverless version. The goal is to provide cost control references for enterprises and individuals, helping them better understand and use ApsaraMQ to maximize cost-effectiveness.

Balancing Resource and O&M Costs

In software development, beyond product iteration and operational costs, runtime costs are primarily composed of resource and O&M costs, which are interdependent.

• To reduce resource costs, we need to adjust the existing architecture. To ensure the safety and stability of these changes, we need a robust O&M system, including effective monitoring and rapid recovery mechanisms.

• By introducing monitoring and alert systems to improve system reliability, we may increase the resource overhead of core components and affect product performance. However, without a robust O&M system, system issues can lead to significant business losses.

Therefore, a strong O&M system is essential to ensure system stability and reliability while reducing costs. We must find a balance between resource utilization and O&M management so that we can maximize cost-effectiveness.


To reduce resource and O&M costs while improving system performance, stability, and operational efficiency, we have implemented the following strategies:

1.  Reducing Resource Costs:

  • Improving software performance: Optimize the software to enhance performance, increase efficiency, and reduce resource usage.
  • Lowering resource consumption: Use cost-effective product dependencies to improve resource efficiency.
  • Increasing resource utilization: Work with the O&M team to monitor resource use, ensuring it matches system stability requirements.

2.  Reducing O&M Costs:

  • Comprehensive monitoring and alerts: Develop a complete metrics system to reflect system status accurately, ensuring performance improvement and architecture optimization meet goals.
  • Fast recovery: Reduce mean time to repair (MTTR) with strong monitoring and automated recovery systems, shortening the time needed to detect and fix issues.
  • Seamless upgrades: Perform upgrades without disrupting services or users, reducing upgrade costs.
  • Elastic scalability: Enhance cloud service components to dynamically adjust resources based on demand, controlling costs and improving resource use.

Optimizing the Architecture to Improve Performance and Reduce Resource Consumption

Consider open-source Apache Kafka, which uses an integrated compute-storage architecture, typically built on local file systems or local disks. This setup has several limitations:

(1) Performance bottleneck: The throughput and capacity of a single disk are limited, causing a performance bottleneck.

(2) Limited resource flexibility: The fixed ratio of storage to computing resources prevents flexible adjustments.

(3) Lengthy scaling time: Stateful nodes require data migration for scaling, which is influenced by factors such as node load, data volume, and disk throughput. Migrating terabyte-level data often takes hours, increasing risk and operational pressure.

(4) Complex storage architecture: Cold data is stored in object storage, while hot data is stored on local disks. Local disks need multiple replicas for data reliability, increasing network resource consumption. In addition, a logical mapping mechanism between local and secondary storage files adds system complexity.

[Figure: Kafka 3.0 compute-storage separation architecture]

The figure above shows the architecture of Kafka 3.0, which separates compute from storage. The compute layer is stateless and uses the open-source ISR mechanism for leader election. The RDMA protocol is introduced, which significantly reduces CPU consumption during data exchange. The storage layer uses the Apsara Distributed File System for shared storage; message data, together with checkpoint and index files, is written to it to ensure reliability.

In addition, the storage structure is optimized in the following areas:

Memory batching: Supports multiple commit policies, such as time-based, space-based, and frequency-based policies; reduces network jitter and minimizes long-tail impacts on service quality.
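To make the commit policies concrete, here is a minimal sketch of a batcher that flushes when any of the three policies (time, space, or record count) triggers. The class name, thresholds, and API are illustrative assumptions, not ApsaraMQ internals.

```python
import time


class BatchCommitter:
    """Illustrative batcher: flushes when any commit policy fires.

    Hypothetical sketch; names and default thresholds are assumptions,
    not ApsaraMQ's actual implementation.
    """

    def __init__(self, commit_fn, max_bytes=64 * 1024, max_records=100,
                 max_delay_s=0.005):
        self.commit_fn = commit_fn      # called with the batched records
        self.max_bytes = max_bytes      # space-based policy
        self.max_records = max_records  # frequency (record-count) policy
        self.max_delay_s = max_delay_s  # time-based policy
        self.buffer, self.size, self.first_ts = [], 0, None

    def append(self, record: bytes):
        if self.first_ts is None:
            self.first_ts = time.monotonic()
        self.buffer.append(record)
        self.size += len(record)
        # Flush as soon as any one of the three policies triggers.
        if (self.size >= self.max_bytes
                or len(self.buffer) >= self.max_records
                or time.monotonic() - self.first_ts >= self.max_delay_s):
            self.flush()

    def flush(self):
        if self.buffer:
            self.commit_fn(self.buffer)
        self.buffer, self.size, self.first_ts = [], 0, None
```

Combining policies this way bounds both latency (time policy) and memory footprint (space policy), which is what smooths out network jitter and long-tail latency.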

User-space caching: Implements multi-level caching mechanisms for faster data access; separates hot and cold data to prevent cache pollution.
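The idea of separating hot and cold data to prevent cache pollution can be sketched as a two-tier cache: cold (backlog) reads go through a small dedicated tier, so a large historical scan cannot evict entries serving hot traffic. Tier sizes and the routing rule below are illustrative assumptions.

```python
from collections import OrderedDict


class TieredCache:
    """Hot/cold two-tier LRU cache sketch (capacities are assumptions,
    not ApsaraMQ internals)."""

    def __init__(self, hot_capacity=4, cold_capacity=2):
        self.hot = OrderedDict()
        self.cold = OrderedDict()
        self.hot_capacity = hot_capacity
        self.cold_capacity = cold_capacity

    def get(self, key, load_fn, is_cold_read=False):
        tier, cap = ((self.cold, self.cold_capacity) if is_cold_read
                     else (self.hot, self.hot_capacity))
        if key in tier:
            tier.move_to_end(key)        # LRU touch
            return tier[key]
        value = tier[key] = load_fn(key)  # miss: load from shared storage
        if len(tier) > cap:
            tier.popitem(last=False)      # evict only within this tier
        return value
```

Because eviction happens only within a tier, a cold scan of arbitrary size can displace at most `cold_capacity` entries and never touches the hot tier.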

Cold read optimization: Separates cold and hot threads (coroutines) to prevent global unavailability; includes data preloading and pre-reading with adaptive IO size adjustments.
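The isolation and pre-reading ideas can be sketched as follows: cold reads run in their own small thread pool so a slow backlog scan cannot exhaust the workers serving hot reads, and the pre-read size grows as sequential cold reads keep hitting. Pool sizes and the growth heuristic are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Separate pools: a stalled cold scan cannot starve hot (tail) reads.
HOT_POOL = ThreadPoolExecutor(max_workers=8, thread_name_prefix="hot-read")
COLD_POOL = ThreadPoolExecutor(max_workers=2, thread_name_prefix="cold-read")


def adaptive_io_size(consecutive_hits=0, base=64 * 1024, cap=4 * 1024 * 1024):
    """Double the pre-read size for each consecutive sequential hit,
    up to a cap, so long scans amortize storage round-trips."""
    return min(base << consecutive_hits, cap)


def submit_read(offset, is_cold, read_fn, consecutive_hits=0):
    pool = COLD_POOL if is_cold else HOT_POOL
    size = (adaptive_io_size(consecutive_hits) if is_cold
            else 64 * 1024)  # hot reads stay small for low latency
    return pool.submit(read_fn, offset, size)
```

Using coroutines instead of threads (as the article mentions) follows the same pattern: the point is that hot and cold requests never compete for the same bounded execution resource.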

Performance tests show that the optimized Kafka 3.0 outperforms the open-source version in both batching and fragmented sending scenarios.

Regarding resource consumption, Kafka 3.0 simplifies the architecture and reduces resource costs by separating storage from compute. It ensures data reliability with a single set of storage resources, reducing overall storage costs by about 30%.
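A back-of-envelope comparison shows how a saving of this order can arise. The unit prices below are assumed for illustration only (they are not published figures): local disks require three full replicas, while shared storage exposes a single logical copy and handles redundancy internally at a lower effective multiplier.

```python
def provisioned_cost(replicas, unit_price):
    """Cost of provisioned bytes per logical byte stored."""
    return replicas * unit_price

# Assumed illustrative numbers, not actual pricing:
local_3_replica = provisioned_cost(replicas=3, unit_price=1.0)
# Shared storage (e.g. with internal erasure coding) is priced higher
# per byte but needs only one logical copy.
shared_single_copy = provisioned_cost(replicas=1, unit_price=2.1)

saving = 1 - shared_single_copy / local_3_replica  # 0.30 under these assumptions
```

With these assumed inputs the saving works out to 30%, consistent with the figure quoted above; the real saving depends on actual replica counts and storage pricing.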


Enhancing O&M Capability and Reducing O&M Costs

To improve O&M efficiency and reduce costs while ensuring system stability and reliability, we offer the following enhancements:

  • Monitoring system: We optimize resource usage and control costs by monitoring key performance indicators. ApsaraMQ provides comprehensive monitoring and alerting metrics that help users understand and adjust resource allocation. This includes monitoring physical nodes, network, disk, IO, and other operating system metrics, as well as business metrics such as message volume, RPC exceptions, and message accumulation. In addition, end-to-end inspections simulate user SDK behavior to detect system anomalies in real time, with minute-level detection and alerts.
  • Message health manager: ApsaraMQ Copilot for RocketMQ offers advanced end-to-end health inspections and diagnostics. It is a crucial tool for building efficient message integration systems. With comprehensive monitoring, quantitative analysis, customizable configurations, and simplified diagnostic processes, it significantly enhances monitoring and diagnostic capabilities.
  • Non-disruptive upgrades: RocketMQ 5.2.0 introduces a feature that notifies clients during server upgrades, enabling graceful shutdowns. Clients automatically reconnect to non-upgraded nodes and retry the last sent message, ensuring business continuity.
  • Auto scaling: Auto scaling is a core feature of ApsaraMQ Serverless, ideal for handling fluctuating workloads. By monitoring business traffic, cluster water levels, and resource availability, it automatically performs vertical or horizontal scaling or elastic scheduling across cluster instances. Now, it supports second-level elastic expansion to handle tens of thousands of QPS, meeting large-scale business demands.
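The water-level logic behind auto scaling can be sketched as a simple threshold check: scale out when utilization crosses a high-water mark, scale in when it falls below a low-water mark, and hold otherwise. The thresholds and capacity model are illustrative assumptions, not ApsaraMQ's actual scheduler.

```python
import math


def scaling_decision(current_qps, capacity_per_node, node_count,
                     scale_out_at=0.8, scale_in_at=0.3):
    """Illustrative water-level check; thresholds are assumptions."""
    utilization = current_qps / (capacity_per_node * node_count)
    if utilization > scale_out_at:
        # Provision enough nodes to bring utilization back under target.
        target = math.ceil(current_qps / (capacity_per_node * scale_out_at))
        return "scale_out", target
    if utilization < scale_in_at and node_count > 1:
        return "scale_in", node_count - 1  # shrink conservatively, one node at a time
    return "hold", node_count
```

A production scheduler additionally has to account for cluster water levels across instances, resource availability, and warm-up time, which is why second-level elasticity requires pre-provisioned capacity rather than a purely reactive loop like this one.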

Alibaba Cloud Native

208 posts | 12 followers
