By Lao Hu
The era of data has arrived, and the value of data is increasingly important and pervasive. The data middle platform emerged to put that data to use.
The data middle platform has six functions:
There are many data synchronization solutions in the industry. These synchronization solutions are based on two aspects.
The architecture of most synchronization solutions in the industry is similar to the following figure:
An industrial-grade component cannot pick just one option. It needs point-to-point synchronization as well as many-to-many synchronization, and both offline and real-time synchronization. The following figure shows the architecture of the heterogeneous data synchronization component:
Heterogeneous and synchronization components based on message middleware have the following advantages.
As can be seen from the figure, there are many types of sources:
Gather/SDK: Mainly external data, such as customer data and data collected from apps
RPC: Mainly valuable data generated by internal business systems
Agent: Collects logs and system and hardware operation information
Data Sources: Reads data from various storage systems
The read speed of the source and the write speed of the sink are uncontrollable. In common cases, the source is several times faster than the sink, which may make the gather services unavailable and cause serious incidents. No one knows when a peak will occur, and this unpredictability threatens the stability and high availability of the entire system. Therefore, message middleware is introduced as a buffer.
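The buffering effect can be illustrated with a minimal sketch. It is a toy simulation, not RocketMQ code: the `deque` stands in for a topic, and the per-tick rates are made-up numbers chosen only to show a peak being absorbed.

```python
from collections import deque


def simulate(source_rates, sink_rate):
    """Return the backlog held in the buffer after each tick.

    source_rates: records the source produces in each tick (hypothetical numbers)
    sink_rate: maximum number of records the sink can write per tick
    """
    buffer = deque()  # stands in for a topic in the message middleware
    backlog = []
    next_id = 0
    for produced in source_rates:
        # The source writes at its own pace and never waits for the sink.
        for _ in range(produced):
            buffer.append(next_id)
            next_id += 1
        # The sink drains at most sink_rate records per tick.
        for _ in range(min(sink_rate, len(buffer))):
            buffer.popleft()
        backlog.append(len(buffer))
    return backlog


# A burst of 20 records arrives in the second tick; the sink handles 6 per tick.
backlog = simulate(source_rates=[5, 20, 5, 0, 0, 0], sink_rate=6)
print(backlog)
```

In this sketch the source is never blocked during the burst; the backlog grows in the buffer and is worked off once the peak has passed.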
A customer has two commodity systems, one online and one offline, and plans to share recommendation behavior across both. The commodity data of both systems must be collected at the same time and synchronized to five storage systems. A heterogeneous solution based on message middleware handles this smoothly.
The first process shows the architecture of one source with multiple sinks, and the second process shows multiple sources with one sink. In the heterogeneous architecture, the number of sources and sinks, and whether each one is running, can be combined flexibly, which saves server resources.
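The two patterns can be sketched as follows. This is a simplified in-memory model, assuming each sink reads the topic through its own consumer group with its own offset; the `Topic` class and the group and source names are hypothetical and are not part of any RocketMQ API.

```python
from collections import defaultdict


class Topic:
    """A toy append-only topic; each consumer group keeps its own read offset."""

    def __init__(self):
        self.log = []
        self.offsets = defaultdict(int)

    def publish(self, source, record):
        self.log.append((source, record))

    def poll(self, group):
        """Return everything this group has not read yet and advance its offset."""
        start = self.offsets[group]
        self.offsets[group] = len(self.log)
        return self.log[start:]


# One source, multiple sinks: every group sees the full stream independently.
goods = Topic()
goods.publish("online-shop", {"sku": "A1"})
for_search = goods.poll("sink-search")
for_warehouse = goods.poll("sink-warehouse")

# Multiple sources, one sink: both sources write to the same topic.
merged = Topic()
merged.publish("online-shop", {"sku": "A1"})
merged.publish("offline-shop", {"sku": "B7"})
for_sink = merged.poll("sink-warehouse")
```

Because sources and sinks only meet at the topic, either side can be added, removed, or stopped without changing the other.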
Why do we choose RocketMQ? The features of RocketMQ help us solve many problems. The following chart lists the problems:
A basic principle of a data synchronization component is that data cannot be lost. A SaaS platform faces many uncertain factors and unpredictable situations, and data recovery is difficult. In real-time scenarios, a difference between the synchronized data and the customer's internal data may be fatal. When comparing message middleware, data safety is therefore the most important criterion. The following features of RocketMQ ensure that data will not be lost.
Data synchronization includes addition, modification, and deletion. Tables (result sets) can be classified as appended tables and modified tables.
Appended tables only have the add behavior, which is suitable for parallel consumption.
A modified table has three behaviors: add, modify, and delete. Under high concurrency and multithreading with large data volumes, the data middle platform can easily apply operations in a different order from the business, which leaves the data middle platform inconsistent with the business data. The ordered (sequential) consumption feature of RocketMQ is used to ensure that operations are executed in the same sequence as the business operations.
The preceding figure demonstrates data inconsistency in the case of parallel consumption. The sequence at the source is 1234. Ideally, the sink would execute in the sequence 5678, but under parallel consumption the actual sequence is 6785. Therefore, RocketMQ's sequential message feature is used to ensure data consistency.
Ordered messages allow only one consumer to consume at a time, and the other consumers compete for the consumption right.
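A small sketch shows why execution order matters for a modified table. The operations, keys, and prices are invented for illustration, and the "parallel" order is just one possible interleaving in which the first operation happens to be applied last.

```python
def apply(ops):
    """Apply (action, key, value) operations to an in-memory table, in the given order."""
    table = {}
    for action, key, value in ops:
        if action == "add":
            table[key] = value
        elif action == "modify":
            if key in table:  # modifying a row that does not exist is a no-op
                table[key] = value
        elif action == "delete":
            table.pop(key, None)
    return table


# The order in which the business system performed the operations.
business_order = [
    ("add", "sku-1", {"price": 10}),
    ("modify", "sku-1", {"price": 12}),
    ("delete", "sku-1", None),
    ("add", "sku-1", {"price": 15}),
]

# One possible interleaving under parallel consumption: the first add lands last.
parallel_order = business_order[1:] + business_order[:1]

in_order = apply(business_order)
out_of_order = apply(parallel_order)
print(in_order, out_of_order)
```

Applied in business order, the row ends with price 15; in the reordered run, the stale first add overwrites it and the sink ends with price 10.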
Ordered consumption is much less efficient than parallel consumption, and the sink writes much more slowly than the source reads, which often causes a large backlog of data. As a result, the synchronized data lags behind the business data, and sometimes the business cannot tolerate the delay.
The first synchronization is a full synchronization with no modification operations. Therefore, we use parallel consumption for the first synchronization and switch to ordered consumption afterward. To further improve the performance of ordered consumption, we analyzed the problem in depth and found that efficiency improves if the data is classified. Classification is implemented with RocketMQ queues and tags, so that different classes are consumed in parallel.
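The queue-based half of this idea can be sketched as below, assuming that the classification key is a record identifier and that a stable hash picks the queue. The function names, keys, and queue count are hypothetical; tag-based classification is not shown.

```python
import zlib
from collections import defaultdict


def pick_queue(key, queue_count):
    """Map a classification key to a queue index with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % queue_count


def route(messages, queue_count):
    """Distribute (key, action) messages over queues, keeping arrival order per queue."""
    queues = defaultdict(list)
    for seq, (key, action) in enumerate(messages):
        queues[pick_queue(key, queue_count)].append((seq, key, action))
    return queues


messages = [
    ("sku-1", "add"),
    ("sku-2", "add"),
    ("sku-1", "modify"),
    ("sku-2", "delete"),
    ("sku-1", "delete"),
]
queues = route(messages, queue_count=4)
for index, items in sorted(queues.items()):
    print(index, items)
```

All operations for one key land in the same queue in their original order, so each queue can be consumed in order while different queues are consumed in parallel.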
Currently, there are a large number of synchronization topics, and creating, deleting, testing, locating, observing, and finding them is difficult. Every developer, tester, and product manager involved in the project needs to observe and operate on these topics. RocketMQ-console helps us solve these problems and improves overall development efficiency and progress.
The preceding diagram shows the flow of data within the system in the recommendation business scenario. Such a flow is internally called a task, and each node is an operator. A piece of data may fail to produce the expected result of the task because of some unavoidable factor. With complex tasks and systems, high concurrency, and high performance requirements, it is necessary to monitor the flow of data, find abnormal flows in time, and correct the data and problems quickly. Therefore, one capability is needed: data tracking.
Observation shows that most operators currently read data from RocketMQ, so the first-generation data tracking capability was designed on top of RocketMQ's message trace.
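The general idea of data tracking can be sketched as follows. This is an illustrative in-memory model, not the design described above and not RocketMQ's message trace API; the `Tracker` class, operator names, and message IDs are all hypothetical.

```python
from collections import defaultdict


class Tracker:
    """Records which operators each message passed through, and whether each step succeeded."""

    def __init__(self):
        self.traces = defaultdict(list)

    def record(self, msg_id, operator, ok=True):
        self.traces[msg_id].append((operator, ok))

    def incomplete(self, expected):
        """Return message IDs that did not pass successfully through every expected operator."""
        bad = []
        for msg_id, steps in self.traces.items():
            passed = [op for op, ok in steps if ok]
            if passed != expected:
                bad.append(msg_id)
        return bad


expected = ["collect", "clean", "recommend"]  # operators of a hypothetical task
tracker = Tracker()
for op in expected:
    tracker.record("m1", op)
tracker.record("m2", "collect")
tracker.record("m2", "clean", ok=False)  # m2 fails in the second operator

print(tracker.incomplete(expected))
```

Comparing each trace with the expected operator sequence points directly at the message and the operator where the flow went wrong.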
RocketMQ-connect is an open-source heterogeneous data component built on RocketMQ. It supports synchronization between multiple data sources. More and more companies use RocketMQ to support their business platforms and data platforms, and the data on each platform is a continuous stream. With RocketMQ-connect, a streaming data platform can be built simply and quickly on top of the existing architecture.
The RocketMQ-connect architecture has two distinct characteristics:
The connect-cli sends a heterogeneous synchronization task to any connect-runtime. The runtime processes the task information and sends it to the broker. Every connect-runtime in the cluster receives the task and stores it locally. A runtime then starts and runs the task without directly relying on the broker.
RocketMQ-connect uses no other components in its overall architecture design, which keeps the whole system simple.
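The distribution flow described above can be modeled with a short sketch. It is a toy model under simplifying assumptions (synchronous delivery, in-memory storage) and is not the RocketMQ-connect implementation; the `Broker` and `Runtime` classes and the task name are hypothetical.

```python
class Broker:
    """A toy broker that delivers every published message to all subscribers."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, message):
        for callback in self.subscribers:
            callback(message)


class Runtime:
    """A toy connect-runtime that keeps its own local copy of every task configuration."""

    def __init__(self, name, broker):
        self.name = name
        self.broker = broker
        self.local_tasks = {}
        broker.subscribe(self._on_config)

    def submit(self, task_name, config):
        # Called by the CLI against any one runtime; the config goes out through the broker.
        self.broker.publish((task_name, config))

    def _on_config(self, message):
        task_name, config = message
        self.local_tasks[task_name] = dict(config)  # store a local copy


broker = Broker()
runtimes = [Runtime(f"runtime-{i}", broker) for i in range(3)]
runtimes[1].submit("file-sync", {"topic": "fileTopic"})
print([sorted(r.local_tasks) for r in runtimes])
```

Since every runtime ends up with the same local copy, any of them can start the task from its own storage, and no extra coordination component is needed.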
    {
      "connector-class": "org.apache.rocketmq.connect.file.FileSourceConnector",
      "topic": "fileTopic",
      "filename": "/opt/source-file/source-file.txt",
      "source-record-converter": "org.apache.rocketmq.connect.runtime.converter.JsonConverter"
    }
connector-class: the class that executes the source
source-record-converter: the class that converts the source records
topic: a file-source setting (the topic, here fileTopic)
filename: a file-source setting (the file path, here /opt/source-file/source-file.txt)
As the task information shows, starting a task requires the source or sink execution class that the task needs. RocketMQ-connect looks for this startup class in the plug-in directory.
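The lookup step can be sketched as below. This is a simplified analogy, not how RocketMQ-connect loads plug-ins: a dictionary stands in for the plug-in directory, and the Python `FileSourceConnector` class, `PLUGINS` registry, and `start_task` function are hypothetical stand-ins.

```python
import json

REQUIRED_KEYS = ("connector-class", "source-record-converter")


class FileSourceConnector:
    """Hypothetical stand-in for the Java connector class named in the task config."""

    def __init__(self, config):
        self.topic = config["topic"]
        self.filename = config["filename"]


# Stands in for the plug-in directory: class name -> implementation.
PLUGINS = {
    "org.apache.rocketmq.connect.file.FileSourceConnector": FileSourceConnector,
}


def start_task(raw_config, plugins=PLUGINS):
    """Parse the task config, check required keys, and instantiate the connector."""
    config = json.loads(raw_config)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    connector_class = plugins.get(config["connector-class"])
    if connector_class is None:
        raise LookupError(f"no plug-in found for {config['connector-class']}")
    return connector_class(config)


raw = """
{
  "connector-class": "org.apache.rocketmq.connect.file.FileSourceConnector",
  "topic": "fileTopic",
  "filename": "/opt/source-file/source-file.txt",
  "source-record-converter": "org.apache.rocketmq.connect.runtime.converter.JsonConverter"
}
"""
task = start_task(raw)
print(task.topic, task.filename)
```

Failing early when the class cannot be found keeps a misconfigured task from starting half-way.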
Lao Hu is a full-stack technical expert and RocketMQ multilingual client contributor. He focuses on middleware, distributed, storage, and AI software solutions in the technical field.