By Guyi
The early batch processing model in RocketMQ had certain limitations. To further enhance performance, RocketMQ underwent an index construction pipeline upgrade. Additionally, the BatchCQ and AutoBatch models optimized the batch processing workflow, providing a more streamlined user experience.
RocketMQ aims to build a super-converged platform for messages, events, and streams. This means it needs to meet a variety of requirements across different scenarios. Batch processing is a classic solution for achieving maximum throughput in stream computing, which naturally implies that RocketMQ also has its own unique batch processing model.
So what does a "unique" batch processing model mean? Let me explain.
First, since we are discussing the batch processing model of RocketMQ, let's talk about what batch processing is and why it is a classic solution for achieving maximum throughput. In my view, batch processing is a generalized methodology that permeates all aspects of various systems. Whether in traditional industries, the internet, or even daily life, you can find its presence everywhere.
The core idea of batch processing is to group multiple tasks or data sets together for unified processing. The advantage of this method lies in making full use of system resources and reducing the overhead associated with task switching, thus improving overall efficiency. For example, in industrial manufacturing, factories typically produce the same type of parts in batches to reduce production costs and increase speed. In the realm of the internet, batch processing manifests as the storage, transmission, and processing of data in batches, optimizing performance and enhancing system throughput.
The application of batch processing becomes even more pronounced under extreme throughput demands. For instance, in big data analysis, vast amounts of data need to be processed collectively to derive meaningful results. Processing data piece by piece would not only be inefficient but could also create system bottlenecks. Through batch processing, data can be divided into several batches and then processed within predefined time windows, thereby improving the parallel processing capability of the system and improving the overall throughput.
Moreover, batch processing does not necessarily mean sacrificing latency. For example, in CPU Cache, operating on a single byte will always be faster than on multiple bytes. However, such comparisons are meaningless because the perception of latency is not infinitely small. Users often do not care how long it takes for the CPU to execute a single instruction but care about how long it takes to complete an entire "task/job." On a macro level, batch processing actually results in lower latency.
Next, let's take a look at how RocketMQ and batch processing are closely intertwined. In fact, the seeds of batch processing were planted when RocketMQ was first created. We will call this seed the early batch processing model.
The following figure shows the three main components from a user's perspective: Producer, Consumer, and Broker:
The early batch processing model only involves the Producer and the Broker. In this chain, the concept of a batch message exists only until the messages reach the Broker.
Let's look at how this works specifically. First, the source of batch messages is actually the Send interface of the Producer. In most scenarios, we will use the following format to send a message:
SendResult send(Message msg);
The code is very succinct for sending a message to the Broker. If we use the early batch processing model to do so, we just need to make a slight modification:
SendResult send(Collection<Message> msgs);
As you can see, multiple messages are grouped into a collection, and then the send interface is called to complete the use of the early batch processing model (from the user's perspective, this is already fine). As shown in the figure below, it is like a battle where the side with stronger firepower clearly wins.
So, is that the whole story? Of course not. First, the collection has some specific requirements. It is not just a matter of casually grouping multiple messages together to send. It needs to meet certain constraints:
• Same Topic.
• Not a RetryTopic.
• Not timed messages.
• Same isWaitStoreMsgOK mark.
I won't go into detail about these constraints now since they are self-explanatory. However, these constraints mean that the usage of the collection is not unconditional. There is a certain learning cost and development requirements before using it. You need to categorize messages based on these constraints and then bundle them before sending.
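As a rough illustration, the sketch below shows one way a sender might perform this categorization before calling the batch Send interface. The EarlyBatchSender class and its bucketing logic are purely illustrative assumptions on my part; only DefaultMQProducer.send(Collection<Message>) is the actual batch interface discussed above.

// A minimal sketch of grouping messages by Topic before batch sending (illustrative only).
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.common.message.Message;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EarlyBatchSender {
    public static void sendGroupedByTopic(DefaultMQProducer producer, List<Message> messages) throws Exception {
        // Bucket messages by Topic so that each batch satisfies the "same Topic" constraint.
        Map<String, List<Message>> buckets = new HashMap<>();
        for (Message msg : messages) {
            buckets.computeIfAbsent(msg.getTopic(), topic -> new ArrayList<>()).add(msg);
        }
        // Retry-topic and timed messages would also need to be filtered out before this point.
        for (List<Message> batch : buckets.values()) {
            producer.send(batch);  // the early batch processing interface: send(Collection<Message>)
        }
    }
}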
Someone might ask, isn’t this making things difficult? Why add so many constraints? Is it intentional? Actually, that is not the case. Imagine if we were a merchant:
• Customer A bought two items. Naturally, we can pack them together (group multiple Messages into an ArrayList) and send them out in one go, potentially saving on postage.
• Customer B and Customer C each bought one item. Suppose I still pack them together as before, tell the delivery person to send one to Heilongjiang and the other to Hainan, and pay only a single postage fee...
Clearly, the second scenario doesn’t work out. That is why messages in the same collection must meet various constraints. When the Broker receives a "batch message," it processes them as follows:
First, it will select the corresponding queue based on certain attributes of this batch of messages, which corresponds to the bottom part of the figure labeled p1, p2... After selecting the queue, subsequent operations like writing can proceed. This is why messages must have the same Topic, as different Topics cannot be assigned to the same queue.
Next, the process shown in the figure above takes place. Three batch messages arrive, containing four messages, one message, and three messages respectively. They then go through the unPack process in sequence. This process is somewhat similar to deserialization: messages sent from the client are in an in-memory structure that differs from the structure actually stored in the file system. During unPack, the batches are unpacked into four, one, and three individual messages respectively. At this point, there is no difference compared with sending eight messages one by one. This is where the lifecycle of a batch message ends, and from this moment on, all messages are treated equally.
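As a rough mental model (a simplified assumption of mine, not the Broker's actual code), unPack can be pictured as walking the batch body and slicing it back into individually encoded messages, since each encoded message carries its own total length up front:

// Conceptual sketch of unPack: the batch body is a concatenation of individually encoded
// messages, each prefixed with its total size, so it can be split back into single messages.
// The field layout is simplified for illustration.
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public final class BatchUnpacker {
    static List<ByteBuffer> unPack(ByteBuffer batchBody) {
        List<ByteBuffer> messages = new ArrayList<>();
        while (batchBody.hasRemaining()) {
            int start = batchBody.position();
            int totalSize = batchBody.getInt(start);          // each entry starts with its own length
            ByteBuffer single = batchBody.duplicate();
            single.position(start).limit(start + totalSize);  // narrow the view to exactly one message
            messages.add(single.slice());
            batchBody.position(start + totalSize);            // jump to the next message
        }
        return messages;
    }
}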
Because of this mechanism, the Consumer does not know whether the Producer sent messages like "firing arrows" or "lighting a cannon." This method has the significant advantage of high compatibility, making it seem no different from the classic usage of sending individual messages. In this scenario, each message retains the highest degree of freedom, such as independent tags, independent keys, and unique msgIds. The ecosystem derived from these features, such as message tracking, remains seamlessly connected. This means that simply by changing the Send interface used by the sender, significant sending performance improvements can be achieved without any changes needed on the consumer side.
I always use words very carefully. In the previous paragraph, I mentioned "significant sending performance improvements." The reason I say this is because there is still some distance from the overall system improvement, which leads us to the title of this section: "Index Construction Pipeline Upgrade."
First, we need to agree on one thing: for a message queue system, the overall ratio of consumption capacity to production capacity should be greater than or equal to one. This is because, in most cases, every message produced should be consumed at least once; otherwise, there would be no need to send it at all, right?
In the past, before sending performance was improved, it was the bottleneck in the entire production-to-consumption chain. This means that consumption rates could easily exceed production rates, making the entire process very smooth. But! After using the early batch processing model, the significant increase in production rate exposed another issue, that is, consumption rates could not keep up with production rates. In such a case, talking about the overall system performance is meaningless.
The reason for the consumption rate bottleneck lies in the index construction process. Since consumption requires finding the exact position of a message, indexes are thus necessary. This means that a message cannot be consumed until its index is built. The figure below is a simplified illustration of the index construction process:
This is the process that directly determines the upper limit of the consumption rate. Through a thread called ReputMessageService, it sequentially scans the CommitLog files, splits them into individual messages, validates these messages, converts them into index entries, and writes them into the corresponding ConsumeQueue files.
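To make the serial nature concrete, here is a minimal sketch of that loop. The names loosely echo ReputMessageService, DispatchRequest, and putMessagePositionInfo from the open-source code, but the interfaces below are simplified stand-ins I introduce for illustration, not the real classes:

// Abbreviated sketch of the single-threaded index-construction loop; not the real implementation.
public final class SerialIndexBuilderSketch {

    // Simplified stand-in for a DispatchRequest: the index information of one message.
    static final class IndexEntry {
        final long commitLogOffset;
        final int msgSize;
        IndexEntry(long commitLogOffset, int msgSize) {
            this.commitLogOffset = commitLogOffset;
            this.msgSize = msgSize;
        }
    }

    interface CommitLogScanner {
        // Splits out and validates the next message starting at offset; returns null when caught up.
        IndexEntry checkMessageAndReturnSize(long offset);
    }

    interface ConsumeQueueWriter {
        void putMessagePositionInfo(IndexEntry entry);  // append one index entry to the ConsumeQueue file
    }

    static void reputLoop(CommitLogScanner commitLog, ConsumeQueueWriter consumeQueue, long reputFromOffset) {
        IndexEntry entry;
        while ((entry = commitLog.checkMessageAndReturnSize(reputFromOffset)) != null) {
            consumeQueue.putMessagePositionInfo(entry);               // write its index before moving on
            reputFromOffset = entry.commitLogOffset + entry.msgSize;  // advance strictly one message at a time
        }
    }
}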
The whole process is completely serial, from splitting messages to converting indexes and writing to files. Each message goes through this flow once. Because it was initially implemented serially, upgrading it was quite natural. The goal was to improve concurrency through pipeline optimization. However, several issues need to be addressed here:
• Scanning CommitLog files in parallel is challenging because the length of each message is inconsistent, making it difficult to clearly define message boundaries for task distribution.
• The task of building indexes for individual messages is not heavy, so the overhead of task transitions (enqueue/dequeue) cannot be ignored.
• Writing to ConsumeQueue files requires maintaining order within the queue dimension; otherwise, it would introduce additional checking overhead.
To address these challenges, the design incorporated the concept of "batch processing." This idea is reflected both in the architecture design and in the implementation details. The figure below shows the process after the upgrade:
Since parallelizing the CommitLog scanning process is difficult, we decided not to parallelize it and instead use a single thread for sequential scanning. However, during scanning, a simple batch processing is performed. The messages scanned are not individual ones but are collected into larger buffer blocks, defaulting to 4MB. We can refer to these buffer blocks as batch msgs.
Next, these batch msgs are parsed in parallel: each buffer is scanned at the granularity of individual messages and converted into DispatchRequest structures, which are then written sequentially to the ConsumeQueue files. The key points here are preserving the order of the batch msgs and keeping the DispatchRequests both ordered and efficient during this processing. To achieve this, I implemented a lightweight queue called DispatchRequestOrderlyQueue. It uses a circular structure that advances incrementally with sequence numbers and achieves unordered enqueue, ordered dequeue. The detailed design and implementation are available in the open-source RocketMQ repository, so I will not elaborate further.
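The sketch below illustrates the "unordered enqueue, ordered dequeue" idea with a ring indexed by sequence number. It is my simplified illustration of the concept, assuming the ring capacity always exceeds how far the parallel parsers can run ahead; it is not the actual DispatchRequestOrderlyQueue implementation.

// Parallel workers publish their results under a sequence number in any order; a single
// writer drains only the contiguous prefix, so downstream ConsumeQueue writes stay ordered.
public class OrderlyRing<T> {
    private final Object[] ring;
    private long nextToDequeue = 0;  // sequence number the writer expects next

    public OrderlyRing(int capacity) {
        this.ring = new Object[capacity];
    }

    // Unordered enqueue: any worker can publish its result under its own sequence number.
    // Assumes workers never run more than `capacity` entries ahead of the writer.
    public synchronized void put(long sequence, T item) {
        ring[(int) (sequence % ring.length)] = item;
    }

    // Ordered dequeue: return results only while the contiguous prefix is complete.
    @SuppressWarnings("unchecked")
    public synchronized T pollInOrder() {
        int slot = (int) (nextToDequeue % ring.length);
        T item = (T) ring[slot];
        if (item == null) {
            return null;  // the next expected result has not arrived yet
        }
        ring[slot] = null;
        nextToDequeue++;
        return item;
    }
}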
After the upgrade, the index construction process no longer holds back the system.
After the above index construction pipeline upgrade, the entire system achieves the most basic batch processing model, allowing for significant performance improvements with minimal modification and maximum compatibility.
However, this is not enough! Due to considerations such as compatibility, the early model still had limitations in both performance and capabilities, which led to the birth of the BatchCQ model.
How does BatchCQ address the above issues? It is quite straightforward. It applies batching to the ConsumeQueue as well. This model removes the unpacking step before Broker writes and constructs the index only once:
As shown in the figure above, if we compare indexes to envelopes, originally each envelope could only contain one index entry. With batching, an envelope can hold any number of index entries. The storage structure has also changed significantly:
For example, suppose two batches of messages arrive, containing three and two messages respectively. In the common CQ model, they would occupy five slots, with each slot indexing one message. In the BatchCQ model, the two batches would occupy only two slots, indexing three and two messages respectively.
Due to this feature, the original format of CQ has changed. To record more information, elements like Base Offset and Batch Num are added, altering the logic for locating indexes.
• Common CQ: each slot is fixed-length, so the index for a given QueueOffset can be located directly at [Slot Length * QueueOffset]. The complexity is O(1).
• BatchCQ: a slot covers an entire batch, so locating a QueueOffset requires a binary search over the slots. The complexity is O(log n) (see the sketch below).
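For concreteness, here is a hedged sketch of such a lookup. The parallel arrays baseOffsets and batchNums are my own illustration of the Base Offset and Batch Num fields mentioned above, not the actual on-disk format:

// Binary search for the slot whose range [baseOffset, baseOffset + batchNum) contains queueOffset.
public final class BatchCqLookupSketch {
    static int locateSlot(long[] baseOffsets, int[] batchNums, long queueOffset) {
        int lo = 0, hi = baseOffsets.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (queueOffset < baseOffsets[mid]) {
                hi = mid - 1;                                        // look in the lower half
            } else if (queueOffset >= baseOffsets[mid] + batchNums[mid]) {
                lo = mid + 1;                                        // look in the upper half
            } else {
                return mid;                                          // queueOffset falls inside this batch
            }
        }
        return -1;                                                   // not found
    }
}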
Although this part only involves a modification of the ConsumeQueue, it has a significant impact as part of the core chain. The messages in a batch are treated as a single message and no longer need to be unpacked, so they share the same Tag, Keys, and even MessageId. The only way to distinguish messages within the same batch is by their QueueOffset. This means that features depending on MessageId, such as message tracking, are not directly compatible. However, because consumption still locates individual messages by QueueOffset, the granularity of message processing remains unchanged.
After the BatchCQ upgrade, we have already achieved extreme throughput. So what is AutoBatch?
We need to start from the beginning again. In the summary of the early batch processing model, a major flaw was mentioned, that is, it is not user-friendly. Users need to be concerned with various constraints, such as Topic, message type, and special flags; and in BatchCQ, additional constraints like Keys and Tags were introduced, which can lead to unexpected situations if used incorrectly.
It is clear that both the early batch processing model and the BatchCQ model come with learning costs. Besides needing to understand various usage methods, there are hidden issues that users need to actively resolve:
• Both the early batch processing model and BatchCQ require the sender to classify and package messages.
• Classifying and packaging messages are costly tasks. Classification requires understanding the basis for classification, and packaging requires knowing the trigger timing.
• The classification basis is complex. The early batch processing model needs to pay attention to multiple attributes, and BatchCQ adds multiple restrictions on this basis.
• Timing for packaging is hard to grasp, and improper use can lead to performance degradation, unstable latency, and uneven partitioning.
To solve these problems, AutoBatch was born. It acts as an automated sorting and packaging machine, operating continuously, precisely, and efficiently, shielding users from the details they previously had to manage. It has the following advantages:
• AutoBatch manages classification and packaging with simple configurations.
• Users do not perceive the managed process. They can use the existing send interfaces to enjoy performance improvements from batching, while maintaining compatibility with synchronous and asynchronous sending.
• AutoBatch is compatible with both the early batch processing model and the BatchCQ model.
• It is lightweight, performs well, and optimizes for latency jitter and small partition issues.
First, how simple is it? Let's take a look:
// Enable AutoBatch on the sender
rmqProducer.setAutoBatch(true);
This means that by adding just one line, you can enable the performance mode of RocketMQ and achieve the extreme throughput improvements provided by the early batch processing model or the BatchCQ model. After enabling AutoBatch, users do not need to change any existing behavior and can continue using the classic Send(Message msg) method. Of course, finer control over memory and delay can also be achieved:
// Set the maximum size of a single MessageBatch (in bytes)
rmqProducer.batchMaxBytes(32 * 1024);
// Set the maximum aggregation wait time (in ms)
rmqProducer.batchMaxDelayMs(10);
// Set the maximum memory usage across all aggregators (in bytes)
rmqProducer.totalBatchMaxBytes(32 * 1024 * 1024);
So where exactly is it lightweight and efficient? The following simplified flowchart should provide an answer:
First, it introduces only a single background thread, which runs on a cycle of maxDelayMs / 2. This thread moves buffers whose messages have exceeded the waiting period out of the accumulator and submits them to the asynchronous sending thread pool, completing the aggregation in the time dimension. Aggregation in the space dimension is checked by the sending thread as it hands messages over: once maxBytes is reached, the batch is sent in place.
The design is very streamlined, introducing only one periodically running thread. This approach avoids creating performance bottlenecks due to the AutoBatch model itself. Additionally, the serialization process for batchMessages is simplified, removing all checks during sending (since classification has already occurred during the aggregation process).
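The sketch below is a minimal illustration of this time-plus-size aggregation, under a few assumptions of mine: one buffer per Topic (so every buffered message already satisfies the batch constraints), a single background task running at half of maxDelayMs, and an in-place flush on the sending path once maxBytes is reached. It shows the mechanism only and is not the actual AutoBatch aggregator inside the client.

// Illustrative accumulator: time-triggered flush by one background task, size-triggered flush in place.
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.common.message.Message;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class TinyAccumulator {
    private final DefaultMQProducer producer;
    private final int maxBytes;
    private final int maxDelayMs;

    private final List<Message> buffer = new ArrayList<>();  // assumed: all messages share one Topic
    private int bufferedBytes = 0;
    private long firstMessageTime = 0;

    private final ScheduledExecutorService guard = Executors.newSingleThreadScheduledExecutor();

    public TinyAccumulator(DefaultMQProducer producer, int maxBytes, int maxDelayMs) {
        this.producer = producer;
        this.maxBytes = maxBytes;
        this.maxDelayMs = maxDelayMs;
        // Time dimension: a single background task checks for expired buffers every maxDelayMs / 2.
        long period = Math.max(1, maxDelayMs / 2);
        guard.scheduleAtFixedRate(this::flushIfExpired, period, period, TimeUnit.MILLISECONDS);
    }

    // Space dimension: the sending thread checks the size threshold and flushes in place.
    public synchronized void append(Message msg) throws Exception {
        if (buffer.isEmpty()) {
            firstMessageTime = System.currentTimeMillis();
        }
        buffer.add(msg);
        bufferedBytes += msg.getBody().length;
        if (bufferedBytes >= maxBytes) {
            flush();
        }
    }

    private synchronized void flushIfExpired() {
        try {
            if (!buffer.isEmpty() && System.currentTimeMillis() - firstMessageTime >= maxDelayMs) {
                flush();
            }
        } catch (Exception ignored) {
            // a real implementation would surface the failure to the caller
        }
    }

    private void flush() throws Exception {
        producer.send(new ArrayList<>(buffer));  // hand the aggregated batch to the real batch send path
        buffer.clear();
        bufferedBytes = 0;
    }
}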
We have shared the evolution of RocketMQ’s batch processing models. Now, let’s showcase their specific effects. All the stress testing results below are from the Openmessaging-Benchmark framework. The configurations used during stress testing are as follows:
| | Stress Testing Machine | x86 Chip Machine |
|---|---|---|
| Specification | 32 cores (vCPU), 64 GiB, 20 Mbps, ecs.c7.8xlarge | 8 cores (vCPU), 64 GiB, 20 Mbps, ecs.r7.2xlarge |
| Cloud Disk | None | ESSD cloud disk PL1, 965 GiB (50,000 IOPS) |
| Operating System | Alibaba Cloud Linux 3.2104 LTS 64-bit | Alibaba Cloud Linux 3.2104 LTS 64-bit |
| JDK Version | openjdk version "11.0.19" 2023-04-18 LTS; OpenJDK Runtime Environment (Red_Hat-11.0.19.0.7-1.0.1.al8) (build 11.0.19+7-LTS) | openjdk version "11.0.19" 2023-04-18 LTS; OpenJDK Runtime Environment (Red_Hat-11.0.19.0.7-1.0.1.al8) (build 11.0.19+7-LTS) |
To set up the stress testing environment for Openmessaging-Benchmark, first deploy the latest version of RocketMQ from the open-source community, then configure information such as the Namesrv access point, and enable the performance mode AutoBatch of RocketMQ by setting the autoBatch field to true:
bin/benchmark --drivers driver-rocketmq/rocketmq.yaml workloads/1-topic-100-partitions-1kb-4p-4c-1000k.yaml
With AutoBatch enabled, the early batch processing model already improves performance significantly: throughput rises from around 80,000 TPS to approximately 270,000 TPS, more than a threefold increase.
Pipeline optimization needs to be enabled on the server side. Here is a simple configuration example:
# Enable index construction pipeline optimization
enableBuildConsumeQueueConcurrently=true
# Adjust the maximum consumption threshold for messages in memory
maxTransferBytesOnMessageInMemory=256M
maxTransferCountOnMessageInMemory=32K
# Adjust the maximum consumption threshold for messages on disk
maxTransferBytesOnMessageInDisk=64M
maxTransferCountOnMessageInDisk=32K
It can be seen that only with index construction optimization enabled can the system consistently reach a throughput of 270,000. Without enabling it, insufficient consumption rates can trigger cold reads, affecting the stability of the entire system, and this would not be practical for production use. Therefore, it is essential to enable index construction optimization when using batch processing models.
The usage of the BatchCQ model differs from the earlier models mentioned. It is not activated by a switch; instead, BatchCQ is a type of Topic. When creating a topic, you can specify it as a BatchCQ type, which allows you to achieve the highest throughput benefits.
// Definition in TopicAttributes: queue.type can be BatchCQ or SimpleCQ, with SimpleCQ as the default
public static final EnumAttribute QUEUE_TYPE_ATTRIBUTE = new EnumAttribute("queue.type", false, newHashSet("BatchCQ", "SimpleCQ"), "SimpleCQ");

// When creating the topic, set the queue.type attribute to BatchCQ
topicConfig.getAttributes().put("+" + TopicAttributes.QUEUE_TYPE_ATTRIBUTE.getName(), "BatchCQ");
When using the BatchCQ model, the difference compared with the early batch processing model is significant. Therefore, we sought a comparison with open-source Kafka. The deployment architecture is as follows:
Three masters and three slaves, deployed with the lightweight container:
• Node 1: Master-A, Slave-C
• Node 2: Master-C, Slave-B
• Node 3: Master-B, Slave-A
Three nodes with the number of partition replicas set to two.
| | RocketMQ | Kafka |
|---|---|---|
| 16 partitions | TPS: 251,439.34 / P99: 264.0 | TPS: 267,296.34 / P99: 1,384.01 |
| 10,000 partitions | TPS: 249,981.94 / P99: 1,341.01 | Error - No Data |
As can be seen, when using a BatchCQ-type Topic, the performance of RocketMQ is nearly on par with that of Kafka:
• With 16 partitions, the throughput difference between the two is within about 6%, while RocketMQ exhibits significantly lower latency.
• With 10,000 partitions, thanks to RocketMQ's more concentrated storage structure, throughput remains almost unchanged even with a very large number of partitions, whereas Kafka encounters errors and becomes unusable under its default configuration.
Therefore, the BatchCQ model can handle the traffic required for extreme throughput. With a higher-performance local disk, the same machine configuration could reach an even higher upper limit.
If you want to learn more about RocketMQ, please refer to: https://rocketmq-learning.com/