Message middleware used in most video departments includes RedisMQ, ActiveMQ, RocketMQ, and Kafka. This article describes the usage scenarios and problems of MQ in several typical businesses.
RocketMQ was first used in the counting service, which calculates and displays client-side view counts in real time. At that time, Redis was used for real-time counting, and the counts were written to the database asynchronously. At first, there was nothing wrong with this model, but as the business volume increased, so did the pressure on the database. Sometimes the CPU resources of the database were almost used up. In addition, writes had to be suspended while the database was migrated, which put the counts at risk of data loss.
Thus, the counting service urgently needed a reliable MQ that could accumulate a backlog of messages and support real-time consumption.
We considered RocketMQ and Kafka but eventually chose RocketMQ for the reasons listed below:
The delivery service delivers the content recommended for users to each area, and the recommendation service needs users' feedback on that content. Thus, Kafka was adopted in the delivery service to interact with the recommendation service. However, a machine failure triggered a failover in the Kafka cluster. Unfortunately, the cluster had too many partitions, so the failover took several minutes to complete.
This blocked the service thread, and the service entered an unresponsive state. Later, we learned that even if a broker of RocketMQ is down, messages will be sent to other brokers without blocking the entire cluster. So, the delivery service migrated all message interactions to RocketMQ.
In the past, our basic video service used RedisMQ to notify callers of video data changes so that they could update their data. However, the message push of RedisMQ is based on the pub/sub model. This model is highly real-time, but it guarantees neither reliable delivery nor message persistence.
In some cases, these two defects made the caller unable to receive the notification. When the message was lost, it was nearly impossible to get it back.
Therefore, the video business eventually abandoned RedisMQ and turned to RocketMQ. RocketMQ ensures that messages are delivered at least once and can be persisted. Even if the client is restarted, it can resume consumption from where it left off.
Previously, the basic service for user videos used ActiveMQ, mainly to notify dependent parties of data changes. Its message body contained the changed data. Unfortunately, when the number of messages was large, ActiveMQ often failed to respond, and consumers did not receive messages for a long time. We learned that a single RocketMQ broker can support a TPS of over 100,000 and hundreds of millions of accumulated messages. Thus, this business was also migrated to RocketMQ.
Currently, RocketMQ is used in basic video services, user services, livestreaming services, payment services, audit services, and other business systems. Kafka is mostly used for log-related processing services, such as log reporting and log collection.
In addition, since RocketMQ supports more clients, it is easier for our businesses written in other languages to access RocketMQ, such as Python-based clients for AI groups and Go-based clients for some services.
In the early stage, we relied on command lines and the RocketMQ-Console to maintain RocketMQ. The questions frequently asked by business parties include:
There are many strange problems!
As O&M personnel, in addition to answering business parties' questions, we had to be very careful when maintaining RocketMQ from the command line. A small mistake might cause a large-scale failure. As we became more familiar with O&M, we wrote several documents on usage specifications, best practices, naming conventions, operation procedures, and other topics. However, we found that these documents contributed little to production efficiency. It is better to convert experience and practice into products that serve the business than to write documents. As a result, MQCloud was created.
Let's first look at the positioning of MQCloud:
It is an all-in-one service platform that integrates client SDKs, monitoring and alerting, and cluster O&M. The system architecture of MQCloud is shown below:
Now, I will explain how MQCloud solves the pain points mentioned above.
The dimensions of the user and the resource are introduced to achieve this goal. Users and resources are managed to ensure that different users only focus on their data.
Showing different views to different people makes the operations users can perform very clear.
All operations are submitted as application forms and approved by administrators in the background approval system. This keeps cluster operations standardized and improves security significantly.
One of the core functions of MQCloud is monitoring and alerting. Currently, MQCloud supports alerting of the following aspects:
Statistics are essential for monitoring. MQCloud does a lot of statistical work to understand the operating status of RocketMQ clusters. (Most of it depends on broker statistics.) It mainly includes the following items:
RocketMQ does not provide traffic statistics for individual producers. (Statistics are available per topic, but the situation of each producer is unknown.) MQCloud collects producer statistics through the hook function of RocketMQ.
Statistics mainly contain the following information:
After the statistics are completed, the data is regularly sent to MQCloud for storage, real-time monitoring, and display.
Time consumption statistics generally include maximum, minimum, and average values, but the time consumption that 99% of requests stay within (the 99th percentile, which sits below the maximum) usually represents the real response situation better. The main obstacle is memory usage, because an exact percentile requires sorting every time consumption recorded within a given period. There are algorithms and data structures for statistics on streaming data, such as t-digest. MQCloud uses an approximate but relatively simple segmentation statistics method. It is shown below:
1) Create a piecewise array based on the maximum time consumption, using a different precision for each time span:
Advantages: This method occupies a fixed memory. For example, if the maximum time consumption is 3,500 milliseconds, only an array with a size of 96 is required. Disadvantages: The accuracy needs to be set in advance and cannot be changed.
2) For the piecewise array above, create a counting array of AtomicLong of the same size so that the counts can be updated concurrently.
3) When recording a time consumption, calculate its index in the piecewise array and then increment the counter at the same index in the counting array. Please see the following figure:
As such, time consumption statistics can be obtained from the counting array in real-time. This process is shown in the following figure:
4) Then, the scheduled sampling task takes snapshots of the counting array every minute to generate the following time consumption data:
5) Since the time consumption data above is naturally arranged in order, it is easy to calculate the time consumption data of 99% and 90% of requests as well as the average time consumption.
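Steps 1 through 5 can be sketched roughly as follows. This is a minimal illustration, not MQCloud's actual code: the class and method names are hypothetical, and the bucket layout (1 ms steps up to 10 ms, 5 ms steps up to 100 ms, 50 ms steps up to 3,500 ms) is an assumption chosen only because it yields the 96 entries mentioned above.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of segmented time-consumption statistics.
class SegmentedLatencyStats {
    // Piecewise array: upper bound (ms) of each segment. The layout is an assumption.
    private final int[] upperBounds = new int[96];
    // Counting array of the same size; AtomicLong allows concurrent updates.
    private final AtomicLong[] counters = new AtomicLong[96];

    SegmentedLatencyStats() {
        int i = 0;
        for (int t = 1; t <= 10; t += 1) upperBounds[i++] = t;      // 10 segments, 1 ms precision
        for (int t = 15; t <= 100; t += 5) upperBounds[i++] = t;    // 18 segments, 5 ms precision
        for (int t = 150; t <= 3500; t += 50) upperBounds[i++] = t; // 68 segments, 50 ms precision
        for (int j = 0; j < counters.length; j++) counters[j] = new AtomicLong();
    }

    int size() {
        return upperBounds.length;
    }

    // Binary search for the first segment whose upper bound covers the value.
    // Values above the maximum fall into the last segment.
    int indexOf(long millis) {
        int lo = 0, hi = upperBounds.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (upperBounds[mid] >= millis) hi = mid; else lo = mid + 1;
        }
        return lo;
    }

    // Step 3: record one time consumption.
    void record(long millis) {
        counters[indexOf(millis)].incrementAndGet();
    }

    // Step 4: a scheduled task would call this once per sampling period.
    long[] snapshotAndReset() {
        long[] snapshot = new long[counters.length];
        for (int i = 0; i < counters.length; i++) snapshot[i] = counters[i].getAndSet(0);
        return snapshot;
    }

    // Step 5: the segments are already ordered, so a running sum finds the percentile.
    // Returns the upper bound of the matching segment, or 0 if nothing was recorded.
    int percentile(long[] snapshot, int percent) {
        long total = 0;
        for (long c : snapshot) total += c;
        if (total == 0) return 0;
        long target = (total * percent + 99) / 100; // ceiling of total * percent / 100
        long seen = 0;
        for (int i = 0; i < snapshot.length; i++) {
            seen += snapshot[i];
            if (seen >= target) return upperBounds[i];
        }
        return upperBounds[upperBounds.length - 1];
    }

    // Approximate average: every sample is counted at its segment's upper bound.
    double average(long[] snapshot) {
        long total = 0, weighted = 0;
        for (int i = 0; i < snapshot.length; i++) {
            total += snapshot[i];
            weighted += snapshot[i] * upperBounds[i];
        }
        return total == 0 ? 0.0 : (double) weighted / total;
    }

    public static void main(String[] args) {
        SegmentedLatencyStats stats = new SegmentedLatencyStats();
        for (int ms = 1; ms <= 100; ms++) stats.record(ms);
        long[] snapshot = stats.snapshotAndReset();
        System.out.println("p99=" + stats.percentile(snapshot, 99)
                + " p90=" + stats.percentile(snapshot, 90)
                + " avg=" + stats.average(snapshot));
    }
}
```

The trade-off is the one described in step 1: memory is fixed at 96 counters no matter how many requests arrive, but every reported value is rounded up to its segment's upper bound, so precision is limited to the segment width.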
The trace function added in RocketMQ 4.4.0 is also implemented through hooks, so it initially conflicted with the statistics of MQCloud. The two are now compatible. Trace and statistics are two dimensions. Trace reflects the process of messages from production to storage and consumption, while MQCloud performs statistics on producers. With the statistics data, MQCloud can display production time consumption and provide alerts for production exceptions.
MQCloud automatically places nmon in the /tmp directory of each machine to collect machine conditions. A scheduled task then connects to the machine over SSH, executes the nmon command, parses the returned data, and stores it.
The process above has laid a solid data foundation for monitoring and alerting.
To meet certain client-side demands, mq-client was developed and customized on top of rocketmq-client.
MQCloud stores the relationship between producers, consumers, and clusters. Clients can route to the target cluster automatically through route adaptation, making clients transparent to multiple clusters.
Trace data can be sent to separate clusters by building separate trace clusters and customized clients. This will not influence the primary cluster.
The client integrates different serialization mechanisms and works with MQCloud, so businesses do not need to care about serialization issues. Currently, Protobuf and JSON are supported, and the mechanism can be switched online through type detection.
Traffic control is enabled automatically through token bucket and leaky bucket throttling mechanisms. This prevents message peaks from flooding the business end and is convenient for businesses that need to control the traffic rate accurately.
An isolation API for producing messages, built on Hystrix, keeps the business end from being affected when a broker fails.
Any client-side anomaly can be found in time through statistics, collection, and monitoring in MQCloud.
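As a rough illustration of the token bucket half of this throttling, the sketch below refills tokens at a fixed rate up to a capacity and lets a message through only when a token is available. It is a minimal example under stated assumptions, not the mq-client implementation: the class name, constructor parameters, and the explicit clock argument (added so the behavior can be checked deterministically) are all hypothetical.

```java
// Hypothetical token bucket sketch; not the actual mq-client code.
class TokenBucket {
    private final long capacity;          // maximum burst size
    private final double tokensPerSecond; // steady refill rate
    private double tokens;                // tokens currently available
    private long lastRefillNanos;         // time of the last refill

    TokenBucket(long capacity, double tokensPerSecond, long nowNanos) {
        this.capacity = capacity;
        this.tokensPerSecond = tokensPerSecond;
        this.tokens = capacity; // start full so an initial burst is allowed
        this.lastRefillNanos = nowNanos;
    }

    // Convenience overload that uses the system clock.
    boolean tryAcquire() {
        return tryAcquire(System.nanoTime());
    }

    // Returns true and consumes one token if one is available at the given time.
    synchronized boolean tryAcquire(long nowNanos) {
        refill(nowNanos);
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }

    // Adds the tokens earned since the last refill, capped at the capacity.
    private void refill(long nowNanos) {
        long elapsed = nowNanos - lastRefillNanos;
        if (elapsed <= 0) {
            return; // clock did not advance
        }
        double earned = (elapsed * tokensPerSecond) / 1_000_000_000.0;
        tokens = Math.min((double) capacity, tokens + earned);
        lastRefillNanos = nowNanos;
    }

    public static void main(String[] args) {
        TokenBucket bucket = new TokenBucket(2, 1.0, System.nanoTime());
        for (int i = 0; i < 3; i++) {
            System.out.println("message " + i + " allowed: " + bucket.tryAcquire());
        }
    }
}
```

A consumer wrapper could call `tryAcquire()` before handing each message to business code and wait or retry later when it returns false. A token bucket tolerates short bursts up to the capacity, whereas a leaky bucket smooths output to a constant rate.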
Certain conventions, specifications, and best practices can be implemented through coding assurance, including (but not limited to):
Manual deployment of a broker instance is not very difficult. However, when the number of instances increases, manual deployment is highly error-prone and time-consuming.
MQCloud provides a set of automated deployment mechanisms, including writing suspension, enabling and disabling, local update, and remote migration (including data verification).
Support Quick Deployment:
In addition, as the core of RocketMQ, the broker has hundreds of configuration items, and many of them involve performance tuning. This often requires careful tuning according to the status of the server. MQCloud has developed the configuration template feature to support flexible deployment.
As an O&M platform, MQCloud involves the following things that we need to consider:
1) Broker configuration items are complicated and need to be managed clearly.
2) Prompts and suggestions are provided when adjusting existing broker parameters. In addition, the following situations need to be considered:
3) Parameters are inherited when a broker is newly deployed. Parameters that have been optimized and verified by the online brokers are expected to be used automatically when a broker is newly deployed.
Broker Configuration Template
MQCloud uses the following methods to solve the problems above:
MQCloud provides a complete set of machine O&M mechanisms to improve productivity.
RocketMQ has supported ACL since version 4.4.0, but it is not enabled by default. This means that anyone can control online clusters using management tools or the API. However, enabling ACL has too much impact on the existing business. MQCloud provides a reinforcement designed specifically for this problem.
Modeled on the RocketMQ ACL mechanism, the reinforcement adds permission verification only to RocketMQ administrator operations.
It also supports customization and hot-loading of administrator request codes, which prevents unauthorized operations on RocketMQ clusters.
Communication Reinforcement of the Broker
Since the data synchronization code in the broker does not verify incoming data, there are security risks. If any client connects to the port on which the master listens for slave connections and sends at least 8 bytes of data, it may cause synchronization offset errors. The code is listed below:
if ((this.byteBufferRead.position() - this.processPostion) >= 8) {
    int pos = this.byteBufferRead.position() - (this.byteBufferRead.position() % 8);
    long readOffset = this.byteBufferRead.getLong(pos - 8);
    this.processPostion = pos;
    HAConnection.this.slaveAckOffset = readOffset;
    if (HAConnection.this.slaveRequestOffset < 0) {
        HAConnection.this.slaveRequestOffset = readOffset;
        log.info("slave[" + HAConnection.this.clientAddr + "] request offset " + readOffset);
    }
    HAConnection.this.haService.notifyTransferSome(HAConnection.this.slaveAckOffset);
}
MQCloud ensures communication security by verifying the first packet of data.
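The risk in the synchronization code above can be reproduced in isolation. The sketch below is a standalone illustration with hypothetical class and method names: it copies only the offset-parsing step onto a plain `ByteBuffer` to show that any 8 bytes are accepted as a slave offset. It does not show how MQCloud verifies the first packet, which the article does not detail.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical standalone replay of the offset-parsing step; not RocketMQ code.
class HaOffsetParseDemo {
    // Returns the value the master would treat as the slave's offset,
    // or -1 if fewer than 8 unprocessed bytes have been read.
    static long parseReportedOffset(ByteBuffer byteBufferRead, int processPosition) {
        if ((byteBufferRead.position() - processPosition) >= 8) {
            // Align to the last complete 8-byte group, as the original code does.
            int pos = byteBufferRead.position() - (byteBufferRead.position() % 8);
            return byteBufferRead.getLong(pos - 8);
        }
        return -1L;
    }

    public static void main(String[] args) {
        // A legitimate slave reports its offset as one 8-byte long.
        ByteBuffer fromSlave = ByteBuffer.allocate(1024);
        fromSlave.putLong(12345L);
        System.out.println("slave offset: " + parseReportedOffset(fromSlave, 0));

        // An unrelated client sends arbitrary bytes to the same port.
        ByteBuffer fromStranger = ByteBuffer.allocate(1024);
        fromStranger.put("GET / HTTP/1.1\r\n".getBytes(StandardCharsets.US_ASCII));
        // The last 8 bytes are read as an offset without any validation.
        System.out.println("bogus offset: " + parseReportedOffset(fromStranger, 0));
    }
}
```

Because nothing distinguishes the two inputs at this step, a check on the first packet is a natural place to reject connections that are not real slaves.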
The O&M scale of MQCloud is listed below:
After taking the needs of the business into account, MQCloud takes the focus of each role as its core and comprehensive monitoring as the goal to meet the needs of each business end. MQCloud is constantly developing and improving.
After MQCloud matured gradually, we opened the source code to gain more experience and serve the community. After the design and split, MQCloud was open-sourced in 2018. By now, more than 20 update versions have been released. These versions include function updates, bug fixes, and descriptions in the Wiki. Each major version has undergone detailed testing and internal operation. After that, many users were eager to try it out and provided many useful suggestions. Then, we improved it according to the feedback.
We will follow our goal and remain focused on the path of open-source: