Modern Message Queue Services and Cloud Storage

The article delves into the evolution of modern messaging systems, focusing on key features such as software-hardware integration, millions of queues, and tiered storage.

By Xieyang

The article delves into the evolution of modern messaging systems, focusing on key features such as software-hardware integration, millions of queues, and tiered storage. It explores advanced technologies like shared-log storage-compute separation, FFM with coroutines, RDMA transmission, and column-oriented storage, extending messaging into streaming.

In the late 1970s, messaging systems managed print jobs across multiple hosts, decoupling and load shifting. This ability was standardized as a P2P model and a more complex publish-subscribe model, enabling distributed data processing. Over time, products like Kafka, RabbitMQ, and others provide developers with competitive solutions for various scenarios, adding features like ordered messages, transactional messages, timed messages, and exactly-once delivery. These advancements have made messaging systems a standard component of distributed systems. With the growth of cloud-native and microservice concepts, dynamic and asynchronous systems like Serverless and event-driven systems have evolved. The architectures and models of message queue services have also changed.

This article shares how messaging systems, as PaaS/MaaS infrastructure services, have benefited from advancements in software and hardware, meeting the growing demands for data analysis in stream processing and real-time analysis scenarios while balancing cost and efficiency. To fully embrace cloud-native architecture, messaging infrastructure must leverage recent advancements like shared-log storage-compute separation for cost reduction and integrate technologies like FFM, coroutines, user-space TCP, RDMA transmission, and column-oriented storage. These technologies further extend messaging into streaming, laying the foundation for future event-driven architectures (EDAs), comprehensive serverless, and lightweight computing.

1. Sending and Receiving Messages

1.1 Sending: Fairness or Greed?

From a user or client perspective, sending a message might seem as simple as using RPC to transmit serialized data to the message queue server and handling the response. However, in a distributed environment, the following issues need to be considered:

(1) Latency: Messaging system servers are usually deployed across zones for disaster recovery. How can we send messages to reduce latency and traffic costs?

(2) Multi-language support: Producers have more diverse origins compared to consumers, which are typically scaled in the cloud. Producers range from small microcontroller units (MCUs) and various frontend pages to complex internal communications of backend services. This diversity necessitates strong support for multiple languages and protocols.

(3) Multi-scenario support: The system must adapt to various models in multiple scenarios , such as scenarios that require ordered messaging with binlog synchronization and massive IoT clients.

(4) Sending failure: If a message fails to send due to server downtime or network issues, what should the client do? Should it opt for data backpressure or fast failure?

Figure: Direct Request Data Transmission to Broker Compared to Transmission via Proxy

Reflecting on the time when network speeds were slow and mobile data was less efficient than mobile computing, we often mention the direct-connect network architecture. Typical products using this architecture include Hadoop Distributed File System (HDFS), Kafka, Redis, and Elasticsearch. These systems allow clients to establish direct connections with multiple server nodes, reducing latency and improving real-time performance compared to architectures where all traffic passes through a proxy. However, with direct connections, clients must handle various distributed system issues, requiring complex service discovery and load balancing mechanisms. Clients also need to know how to gracefully handle single points of failure (SPOFs). All of these have complicated client version upgrades.

In contrast, using a proxy for storage-compute separation offers several advantages. Proxies can handle request caching, shared authentication, and service discovery. This division of responsibilities simplifies the configuration of advanced capabilities such as cross-region networking and active geo-redundancy. With a proxy, it becomes easier to integrate multi-language clients, as the stateless proxy can parse data over various protocols. Regarding additional communication latency, proxies can use high-performance private protocols between the server-side proxy and the backend storage cluster. Technologies like FlatBuffer for low serialization overhead and RDMA for communication further minimize latency differences.

Figure: TCP Congestion Control [2]

Modern message queue services are evolving to offer smarter error handling capabilities and give users the flexibility to choose their strategies. From a network perspective, message sending failures are caused by inaccessibility or congestion. Two common solutions are the TCP-BBR congestion control algorithm and multi-packet transmission algorithms like LotServer. In log collection-oriented applications, message queues prioritize global throughput and may choose algorithms like TCP-BBR, which aim to find the optimal bandwidth-delay product (BDP). When the server is busy, backpressure is used to reduce the send rate of the client.

For example, the accumulation mechanism of Kafka configured with the linger.ms parameter is similar to the Nagle algorithm, where small packets are accumulated into larger ones to improve performance. Other methods include sliding windows at the framework level and the trust-based flow control of Flink, which handle backpressure more gracefully. In contrast, LotServer maximizes bandwidth utilization by sending multiple packets simultaneously. Message queues also offer a fail-fast strategy, allowing users to decide how to handle errors, whether to retry more times or use a fallback route. Configuring short timeout periods for sending messages and using quick retries is a globally non-fair, greedy strategy. This is commonly used in message queue services such as RocketMQ, which emphasize data importance and real-time performance. These message queues require that the asynchronous write latency is less than 1 millisecond and the synchronous write or multi-replica latency is a few milliseconds in the server. When the storage layer is busy, these queues prefer "fail fast" over waiting in line, prioritizing the needs of online applications.

1.2. Consuming Messages: Tracking State across Multiple Models

Consuming messages involves a two-phase commit process facilitated by the server. Engineers with a distributed systems background might describe it in the following way: The server maintains logical or physical queues, and the clients use long polling to send requests to the server. If there are messages, the messages are returned. If not, the requests stay pending on the server. After the messages are processed based on the business logic, the clients report the consumption results back to the server.

The mission of the server is to efficiently satisfy these clients by tracking the current offsets in the queue and maintaining the handles and quantities of the consumed messages. At the storage layer of message queues, we must design data structures to maintain this information and use consistency protocols such as Paxos and Raft to spread this information within the server for high availability. This information is also known as the state of the server. Here are some examples: Kafka records group offsets using an internal topic and reuses message replication pathways to ensure reliability, making it easy to monitor and rewind offsets. RocketMQ handles a much larger number of subscription groups than Kafka, so it periodically saves in-memory offsets as checkpoints. For a specific subscription group and queue, a single offset number can describe the consumption progress. The queue model is simple and efficient for most scenarios but has some inherent limitations:

(1) Queue-based load balancing for message consumers can be implemented only when certain preconditions and assumptions are met:

When the number of queues is not evenly divisible by the number of consumers, it leads to load imbalance. For example, the optimal method to distribute eight queues to three consumers is 3, 3, and 2.
This model assumes that all clients have equal capabilities. In the actual production environment, old and new models are deployed in a hybrid manner, which prevents optimal utilization of computing resources.

(2) Slow tasks block the queue. If five messages exist at offsets 3, 4, 5, 6, and 7, and it takes a long time to process the message at offset 5, messages at offsets 6 and 7 might be processed quickly in parallel. However, the queue appears stuck at offset 3, leading to a misinterpretation of the backlog.

(3) In scenarios such as SMS push notifications, even a few seconds of duplicate consumption due to consumer or server downtime can negatively impact user experience. More representative scenarios include rendering, where each message represents a rendering task.

a) A single subscription group might have hundreds or thousands of machines consuming messages simultaneously.

b) Processing a single message can take anywhere from a few seconds to several hours.

c) Due to high consumer load and extensive use of spot instances, the rate of consumer process crashes or hang-ups is significantly higher than in typical scenarios.

Traditional message queues use a model similar to that of Kafka, which often involves the classic Work-Stealing problem. This results in uneven task distribution among consumers, where the blocking of a single message can delay the processing of subsequent messages. To address this problem, a delivery algorithm based on invisibility time can be used. The algorithm works in the following process:

(1) The client sets an invisibility time, such as 5 minutes, and requests a batch of messages from the server.

(2) The server returns a batch of messages and starts a 5-minute countdown in the backend. Each message is tagged with a handle.

(3) If the client does not acknowledge the successful consumption of the messages (ack by handle) within 5 minutes, the messages become available for other clients to retrieve again after the 5 minutes elapse.

However, this model has its flaws. For example, if a consumer retrieves messages and then crashes after 1 minute, the business must tolerate a 4-minute delay before other consumers can process these messages, even if they are idle. To mitigate this, the invisibility time can be set to 1 minute. While processing data, the client continuously refreshes the invisibility time by calling the change invisible time operation every 30 seconds to update the remaining invisibility time back to 1 minute. This way, if the client crashes at a random point in time, the message delay is kept within 1 minute. In RocketMQ, this consumption method based on intervals and individual messages is known as pop consumption, implemented by the SimpleConsumer. Clients no longer need to manage complex load balancing and offset tracking. It also makes it easy to support multiple languages. In Pulsar, this management capability is more complex and is called the WorkQueue mode with Range Ack management. The more the server handles, the less the clients and users need to worry about. This provides significant flexibility. This evolution in business models has driven changes in cloud storage models for messages. However, this comes at a cost. For example, stateless consumption models such as SimpleConsumer typically have a longer average retrieval time compared to the commonly used PullConsumer, and it also results in more frequent interactions between the server and the client.

2. Enhancing Server Capabilities

The richness of client interfaces and strategies relies on advancements in server technology. Techniques such as asynchronous I/O (AIO), zero-copy, and direct I/O have become more widespread, significantly simplifying the construction of high-performance systems. A well-designed single-machine storage engine can perform hundreds of thousands or even millions of data writes per second. The focus of users has shifted from mere read/write throughput to product features. While throughput can be scaled horizontally, the complexity to maintain additional separate components is increased exponentially.

Users prefer to minimize external dependencies and do not choose a product solely for a fancy dashboard. However, they might be forced to abandon a product due to missing features, such as the need for transmission and storage encryption in financial-grade products. Unlike open source communities, cloud service providers see the unified message kernel as a key competitive advantage in message queues. This kernel supports multiple product access protocols, providing consistent underlying capabilities for all products and maximizing the benefits of feature reuse. In this scenario, the marginal cost of adapting to a new product decreases, leading to a more integrated ecosystem for NoSQL databases, message queues, caching components, and even logging services.

I believe that modern message queues enhance storage features in the following areas: support for massive queues, tiered storage to reduce costs, layered replication architecture for multi-modal storage as distributed file systems mature, flexible multi-replica strategies, and better, faster support for streaming tasks.

2.1 Massive Queues and Multi-Mode Unification

Different messaging products focus on different areas. Kafka emphasizes global throughput, RocketMQ targets real-time applications, RabbitMQ integrates business logic with its message model, and MQTT supports a vast number of devices and topics. These distinct features extend the two messaging models: the topic-based publish-subscribe model and the queue-based P2P model. A unified message kernel must support these multi-mode scenarios.

In Apache Kafka, each partition has an independent LogSegment to store messages, using strategies such as disk space preallocation. This approach faces significant performance issues with massive queues. RocketMQ stores all messages in a unified CommitLog format to ensure high write performance in the frontend. However, this has led to problems such as thread blocks caused by page faults when zero-copy is used in I/O threads. To address this, extensive custom optimizations are required at the storage engine layer, such as using separate cold and hot data services for weight calculations and dedicated thread pools for cold data read. However, indexes remain as independent small files. By default, each consumer queue stores 300,000 message indexes, with each index occupying 20 bytes. In this case, the size of each index file is approximately 5.72 MB, and a million queues can occupy several terabytes of disk space.

In scenarios with a large number of queues, these indexes are often mixed and merged into large files in the file system. Log-Structured Merge-Tree (LSM-Tree) structures such as RocksDB, where data is sorted based on key values, can merge small files and write them into an SST file. This significantly reduces fragmented files. The performance test result shows that a single machine can support millions of queues when RocksDB is used to store indexes. For example, in a scenario with 40,000 queues (including retry queues), the local disk space occupied by indexes is reduced from 200 GB to 30 GB, with a 10% increase in CPU overhead compared to using file-based indexes.

What Are the Advantages of LSM-Tree Structures?

Typically, storage engines use two update structures: in-place updates and out-of-place updates. The most common in-place update structure is the B-tree and its variants. A B-tree is a balanced multi-way search tree that inserts multiple entries into each leaf. This reduces tree height and improves query performance and stability. Because B-trees sort new index entries by key value and place them with existing entries, data read from a B-tree is unlikely to be referenced again in the buffer for a second insertion. This makes batch writes, like those in LSM trees, impossible in B-trees. In addition, the volume of data persisted by storage engines usually far exceeds the memory directly used to cache data. The frequency of access by mergeable operations such as sequential access by queue decreases over time, making the data cold. In LSM-Tree structures, the compaction mechanism groups related data together, making sequential data retrieval more efficient. This works well with prefetching. This is a significant advantage.

While the compaction mechanism works well in databases, it poses challenges for message queues due to the large value sizes. The additional read/write amplification brings more overheads to messaging systems. Techniques like WiscKey, which separate keys from values, can help reduce this amplification. The compaction mechanism can also be optimized in a way similar to that used for TerarkDB and Titan. However, this makes it more complex to implement topic-level time to live (TTL) and periodic offset correction in messaging systems.

2.2 Tiered Storage Transforming Data Assets

In recent years, controlling the ever-expanding infrastructure costs has been a hot topic in the community. By reducing costs, commercial products can offer more competitive prices. This has been a key reason why they shift from building their own VMs to using commercial solutions.

Why not further shrink block storage disks to minimize costs? In tiered storage scenarios, the continuous reduction of local disk capacity yields diminishing returns.

Here are the main reasons why minimizing local disk capacity is not always beneficial:

• Fault tolerance: Message queues are critical infrastructure components, and the stability is paramount. While object storage services are highly available, using this type of service as the primary storage method can cause backpressure during network issues, preventing hot data from being written. This can severely impact service availability by affecting the read/write performance of hot data in online business operations.

• Cost efficiency: Small local disks do not offer significant cost advantages. Cloud computing emphasizes broad accessibility and fairness. With tiered storage, the proportion of computing costs increases while the write traffic for hot data remains unchanged. If you use a cloud disk of 50 GB for block storage and require the IOPS capability of a 200 GB Enterprise SSD (ESSD), the unit cost of the cloud disk can be several times higher than the standard low-IOPS cloud disk.

• Batch uploading: Data is accumulated over time on local disks and then uploaded as a single batch to object storage services. This significantly reduces request costs.

• Data access: Local disks provide lower latency and save read costs for "warm" data, and can cache retrieved cold data.

"Whoever is closest to the data is most likely to succeed." This is the core principle of data-driven decision-making. Message queues, as data channels for application development, are often limited by the capacity and cost constraints of local disks. Data is typically stored for only a few days. Tiered storage offers a low-cost solution to significantly extend the lifecycle of messages. As data cools, various data formats such as FlatBuffer and Parquet can be used when hot data is transformed into cold data. This transforms message queues from mere channels into storage pools for the data assets of users. These technologies further extend messages into the realm of streaming, becoming the foundation for future EDAs and lightweight computing.

2.3. Reducing Computing Costs with Distributed Storage

While tiered storage effectively addresses the storage cost issue of cold data, it does not fully reduce the total cost of ownership. In messaging systems, multi-replication for hot data ensures reliability and high availability. Readable replicas also provide greater read bandwidth for read-heavy scenarios. However, this architecture can lead to several issues:

• Suboptimal write performance: To simplify client complexity, message queues often use Y-type writing. Server-side algorithms such as Raft or other log replication methods consume significant bandwidth, reducing write performance and throughput. Every message update requires updating the primary replica, replicating over a consistency protocol, and performing a quorum calculation. This process involves at least four network round trips (client to primary replica, primary replica to secondary replica, secondary replica to primary replica, and primary replica back to client), causing latency and long-tail delays.

• Wasted computing costs: Secondary replicas generally handle fewer read requests from the clients. As a result, the CPU usage is typically half that of the primary replica. When the primary replica can fully handle read and write requests, the computing capability of secondary replicas is wasted. This inefficiency is difficult to optimize by using hybrid deployments and single-process multi-container setups, such as the slot mechanism of Flink and the broker container mechanism of RocketMQ.

• Slow scaling: In the event of hotspots or urgent scaling needs, Apache Kafka requires time-consuming data replication to take effect.

• Complex multi-replica management: Managing multiple replicas requires control-plane components for identity arbitration, which many teams may be hesitant to maintain due to potential issues with services such as Zookeeper. Kafka Raft (KRaft), RocketMQ Dledger for synchronizing CommitLogs, and JRaft Controller all add complexity and increase the operational burden of the entire system.

A multi-replica architecture faces numerous challenges, such as difficulty in achieving monotonic reads and potential synchronization issues with in-sync replicas (ISRs). Fortunately, we have excellent theoretical guidance. In 2008, Microsoft published the PacificA paper, which proposed three methods for achieving replicated data consistency in log-based systems:

Figure: From Microsoft PacificA: Replication in Log-Based Distributed Storage Systems [8]

(1) Log Replication: This method is similar to the Replicated State Machine described in Raft. It involves replicating logs between the primary and secondary servers. Each node executes the same instructions in the same order.

(2) Log Merge: In this approach, the primary server maintains the data structure, while the secondary server does not keep the in-memory data structure. They only receive checkpoints and log replications. When the primary server fails, the secondary server can recover the state by loading and replaying the logs.

(3) Layered Replication: This method delegates data replication to the underlying distributed file system such as HDFS for data consistency.

Figure: Network Traffic Between Primary and Read-only Nodes Is Reduced by 98% After Optimization by the PolarDB Team (Image from "Analyze the Technical Essentials of PolarDB at the Architecture Level")

Message queues are data-intensive and latency-sensitive storage applications. In a layered replication architecture, write latency is fully optimized. However, due to kernel limitations, methods like sendfile system call in Kafka and the mmap function in RocketMQ cannot fully utilize modern hardware. Distributed file systems often use user-space file systems based on Storage Performance Development Kit (SPDK), the run-to-complete thread model, and star-shaped 2-3 asynchronous writes. These systems ensure data reliability and provide write performance that far surpasses cloud disks (which are also based on distributed file systems) and old local SSDs. In a layered replication architecture, the use of computing and memory resources becomes more flexible. The system supports fine-grained management of computing capabilities, and can scale dynamically with the user load. Compute nodes can be quickly added or removed without identity binding, making scaling extremely fast. Data reliability and read/write performance are handled by specialized teams. Of course, some existing technologies will change accordingly. For example, reading data can no longer rely on the page cache of the operating system. For this issue, many solutions are available, such as the shared buffer pool of PolarDB and "distributed mmap" of WarpStream, which can effectively use memory across multiple nodes. Modern application-layer storage engines follow the "Log is Streaming" concept. They offload storage complexity to lower-level distributed storage systems. Given the limitations of human resources, it is essential to simplify storage when appropriate. Introducing new technologies adds complexity, so it is crucial to avoid over-engineering, which can increase maintenance costs. For example, Apache Flink, with its rich features, has deterred many small and medium-sized users due to the maintenance complexity. Keeping storage engine dependencies low will attract a broader developer base and ensure sustainable development.

2.4 Self-Contained Streaming Processing Capabilities

For many years, message queue services such as RocketMQ have been used in core processes such as transactions and supply chain fulfillment, handling large volumes of high-value business data. Log-based messaging services like Kafka also accumulate significant amounts of user behavior data. To leverage these data assets and generate value by computing, the community has proposed various solutions, such as lightweight computing tools like KStream and KsqlDB. I am also familiar with more feature-rich platforms like Spark and Flink. After contributing to the community by improving the new Flink-Connector-RocketMQ based on FLIP 27/191, I realized that modern message queue services are evolving into integrated platforms for messaging, events, and streaming.

The complexity of streaming processing lies in balancing performance, accuracy, and cost.

• Repeatable read: This ensures that data can be replayed without causing inconsistencies, allowing for correct recovery from the last successful processing, such as a snapshot, in the event of a failure. Message consumers, which are often referred to as sources in computing frameworks, must be able to accurately locate and replay messages from specific offsets when issues like crashes or network interruptions occur. The message storage system must ensure that downstream consumers only see data progress up to the latest state confirmed by the cluster, adhering to the major threshold for resource usage defined by the quorum principle. This guarantees idempotency in the consumption process.

• Partition strategy and time watermark: A common engineering requirement is to maintain a stable number of partitions at the data source to avoid load imbalances caused by partition changes. This model has many implicit assumptions, such as a robust high-availability mechanism on the server side, no hotspots or data skew in partitions, equal capabilities of nodes where downstream consumers reside, and support for periodic time watermarks in the message queue. In models that do not require awareness of the partition quantity, such as Google Pub/Sub, achieving end-to-end "exactly-once" processing requires maintaining numerous handle states on the computing side. In domain-specific language (DSL) mechanisms, users need to implement logic for both accumulations and retractions to correct results. For example, if a join operation generates a new table, corrections will propagate from the current sub-topology, which can result in higher end-to-end latency.

• I/O overhead of data transfer: Even in modern network environments with a network speed of 100 Gbit/s, operations like broadcast, aggregation, and shuffle within stream processing frameworks involve repeated reads and writes to message queues. This leads to significant I/O latency. Maintaining distributed transactions by using methods such as TwoPhaseCommit in Flink Sink is complex and causes read-write amplification in message queues.

By separating computing frameworks and message queue storage, we can flexibly decouple computing logic from data storage. However, it also makes maintenance more complex. In the future, the storage layer may be on top of the core features of message queues and integrate lightweight computing capabilities such as data transformation, rolling aggregation, and window support. Currently, message queues only support simple tag filtering and SQL92 filtering actions. By introducing schemas in the storage layer, you can retrieve specific data based on your business requirements. This further reduces read-write operations.

3. Cutting-Edge Developments in the Industry

Beyond widely used messaging services such as RocketMQ and Kafka, we also focus on niche market challengers who address industry pain points or enhance performance to create differentiated competitiveness. Notable examples include WarpStream and Redpanda.

3.1 Completely Removing Local Disks

When Kafka nodes crash and recover, the data replication process is complex and time-consuming because Kafka relies heavily on local disks. Even with community-proposed solutions such as tiered storage, hot data must be kept on multiple local disks for at least 12 to 24 hours, which is costly. WarpStream, compatible with the Kafka protocol, eliminates the dependency on block storage by building Kafka directly on Amazon Simple Storage Service (Amazon S3). The architecture includes agents, similar to proxies, and Meta, a metadata management service supporting millions of transactions per second (TPS). The following content describes the core process:

(1) Sending messages: The agent batches data from different topics and writes it to object storage services. Once successful, the Meta service sequences it. a. The write request latency can be as high as 400 milliseconds due to batching, cost saving, and the high inherent latency of object storage services. b. In multi-zone scenarios, write load balancing and stream switchover are achieved by modifying the client ID without disrupting the native Kafka protocol. c. Idempotent message support is managed by the Meta service, which arbitrates offsets and ignores failed requests. This design aligns with the "seal and new chunk" calls in Apsara Distributed File System and the fast rolling designs in various tiered storage models.

(2) Receiving messages: Multiple agents form a consistent hashing ring, directing reads from the same partition to a single agent to improve cache hit rates. This setup is known as distributed mmap. Backend compaction increases replay speed and resolves TTL issues. This design is entirely stateless, making horizontal scaling straightforward and enabling simple multi-tenant and large-cluster implementations. The drawback is the heavy reliance on the Meta service.

The official website also mentions low-latency improvement techniques. For example, for Express One Zone, a single availability zone version of object storage services, the following actions may be done:

• Reducing upload buffer and timeout period to improve performance.

• Supporting writes to multiple buckets to achieve 2-3 write techniques, and using the quorum principle for fast acknowledgment sending.

3.2 Rebuilding with Native Languages

Redpanda leverages modern hardware features and uses native languages to achieve low latency and reduced cloud costs, especially for improving long-tail latencies. The nodes of Redpanda rely on an enhanced version of Raft (including an optimistic approach to Raft and parallel commits) and use object storage services for cold data. Many challengers to Kafka use LogSegment to reduce compatibility issues with the evolving computing layer protocol of Kafka. However, Redpanda opts for a bottom-up approach, using a single fixed thread to handle all operations for a partition, including network polling, asynchronous I/O, event fetching, and task scheduling. This thread model is known as thread-per-core or run-to-complete. The actor model, a powerful concurrency model, minimizes critical sections and replaces the multi-threading approach under Reactor that uses mutexes. This ensures that all operations can be completed within 500 microseconds. Using C++ for development provides deterministic latency, effectively reducing the long-tail latencies associated with JVM-based applications and offering predictable P99 latencies.

The second approach involves managing hot data as it transitions to cold data. The leader of each partition is responsible for uploading and reusing the Raft chain to replicate metadata.

• The scheduler_service and archival_metadata_stm components use a PID Controller-like fair scheduling algorithm. This algorithm calculates the total data to be uploaded to object storage services and dynamically updates priorities. Partitions with larger backlogs are given higher priority, while those with smaller backlogs are deprioritized to minimize backend traffic interference with frontend read and write operations.

• The remote_partition and cache_service components manage downloading and caching data from object storage services. They calculate the relative offset of the hybrid log segment (hydrated log segment) based on the partition and offset requested by the consumer. This prefetching and caching reduce the number of calls to object storage services, lowering the average response time (RT). These components also support strategies for nearby data retrieval to reduce cross-zone traffic and costs.

• Performance improvements include developing the HTTP client using Seastar with Boost.Beast to enhance access to object storage services. The management of data uploads and caching requires fairness considerations, and the internal state management of each partition is complex.

While we hope for the server to "write once, run everywhere," in practice, to fully leverage modern hardware, we often embed JNI or optimize instruction sets to enhance the performance of hotspot functions. Redpanda specifically mentions support for ARM architecture. Since the project is entirely developed in C++, it is necessary to dynamically compile dependency libraries by enabling certain features. This process is not as complex as cross-platform development. The conclusion is that ARM can achieve about 20% cost savings compared to x86, similar to the benefits that we observed when we migrate RocketMQ from x86 to ARM.

3.3 Introducing New Technologies

Modern message queues are not only expanding their use cases but also experimenting with new technologies and integrated hardware-software solutions to enhance performance and reduce costs. Here are some examples.

• Communication layer: Traditionally, the computing layer proxy and storage nodes use TCP for communication. However, TCP has inherent latency and bandwidth limitations, especially in high-density container deployments. TCP, designed for WANs, is not optimal for data center environments. We are experimenting with RDMA to replace TCP. RDMA allows direct access to remote host memory without involving the network software stack. Data can be directly sent to or received from buffers without being copied to the network layer. This direct user-space data transfer eliminates context switching between the kernel and user space and allows applications to access remote memory without consuming CPUs of the remote host. Many network protocol stack operations are offloaded to hardware. This reduces end-to-end network latency, ensuring persistent storage, high throughput, and real-time performance of messages. Our tests show an 8% reduction in CPU usage. However, for Java applications not deployed in CPU set mode, this can lead to increased long-tail latency.

• Computing layer: Message queues are also introducing JDK17 coroutine technology to improve code maintainability for numerous asynchronous operations. Traditional optimizations, such as reducing buffer copies with reference counting, hotspot analysis with targeted JNI optimization, and native language refactoring, have not been fully realized. For example, SpinLock performance varies significantly between x86 and ARM architectures. By converting these optimizations into platform-specific dynamic link libraries, we can greatly improve the performance at a low cost. In addition, repetitive operations can be optimized using FFM API and Single Instruction, Multiple Data (SIMD) technologies in JDK21. This significantly reduces CPU overhead.

• Storage layer: In a message queue system, the payload data of messages under the same topic is highly correlated. A typical compression ratio is up to 10:1. As messaging extends into streaming, we are experimenting with storage formats such as FlatBuffer and Parquet, which are memory-friendly and have low deserialization overhead, to improve query performance.

References

John K. Ousterhout, et al. "Cloud Programming Simplified: A Berkeley View on Serverless Computing." arXiv preprint arXiv:1902.03383v1 (2019). https://arxiv.org/abs/1902.03383v1
TCP Slow-Start and Congestion Avoidance, https://commons.wikimedia.org/w/index.php?title=File:TCP_Slow-Start_and_Congestion_Avoidance.svg
Asterios Katsifodimos, et al. "Consistency and Completeness: Rethinking Distributed Stream Processing in Apache Kafka." In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '21). Association for Computing Machinery, 2021.
Flink, Apache. "Stateful computations over data streams." Accessed: Apr 23 (2021): 2021.
State management in Apache Flink®: consistent stateful distributed stream processing. https://dl.acm.org/doi/10.14778/3137765.3137777
Michael Armbrust, et al. "Delta Lake: High-performance ACID Table Storage over Cloud Object Stores." Databricks White Paper, Aug. 2020. https://www.databricks.com/wp-content/uploads/2020/08/p975-armbrust.pdf
Wei Cao, Zhenjun Liu, Peng Wang, Sen Chen, Caifeng Zhu, Song Zheng, Yuhui Wang, and Guoqing Ma. 2018. PolarFS: an ultra-low latency and failure resilient distributed file system for shared storage cloud database. Proc. VLDB Endow. 11, 12 (August 2018), 1849–1862. https://doi.org/10.14778/3229863.3229872
PacificA: Replication in Log-Based Distributed Storage Systems, https://www.microsoft.com/en-us/research/publication/pacifica-replication-in-log-based-distributed-storage-systems/
Heidi Howard and Richard Mortier. 2020. Paxos vs Raft: have we reached consensus on distributed consensus?In Proceedings of the 7th Workshop on Principles and Practice of Consistency for Distributed Data (PaPoC '20). Association for Computing Machinery, New York, NY, USA, Article 8, 1–9. https://doi.org/10.1145/3380787.3393681
Abhishek Verma, etc. 2015. Large-scale cluster management at Google with Borg. In Proceedings of the Tenth European Conference on Computer Systems (EuroSys '15). Association for Computing Machinery, New York, NY, USA, Article 18, 1–17. https://doi.org/10.1145/2741948.2741964
Peter A. Alsberg and John D. Day. 1976. A principle for resilient sharing of distributed resources. In Proceedings of the 2nd international conference on Software engineering (ICSE '76). IEEE Computer Society Press, Washington, DC, USA, 562–570. https://dl.acm.org/doi/10.5555/800253.807732
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 2003. The Google file system. SIGOPS Oper. Syst. Rev. 37, 5 (December 2003), 29–43. https://doi.org/10.1145/1165389.945450
K. Mani Chandy and Leslie Lamport. 1985. Distributed snapshots: determining global states of distributed systems. ACM Trans. Comput. Syst. 3, 1 (Feb. 1985), 63–75. https://doi.org/10.1145/214451.214456

Community

Modern Message Queue Services and Cloud Storage

1. Sending and Receiving Messages

1.1 Sending: Fairness or Greed?

1.2. Consuming Messages: Tracking State across Multiple Models

2. Enhancing Server Capabilities

2.1 Massive Queues and Multi-Mode Unification

What Are the Advantages of LSM-Tree Structures?

2.2 Tiered Storage Transforming Data Assets

2.3. Reducing Computing Costs with Distributed Storage

2.4 Self-Contained Streaming Processing Capabilities

3. Cutting-Edge Developments in the Industry

3.1 Completely Removing Local Disks

3.2 Rebuilding with Native Languages

3.3 Introducing New Technologies

References

Read previous post:

Read next post:

Alibaba Cloud Native

You may also like

Comments

Alibaba Cloud Native

Related Products

ApsaraMQ for RocketMQ

AliwareMQ for IoT

Storage Capacity Unit

Message Queue for RabbitMQ