Application of RocketMQ in Data Heterogeneous System

By Lao Hu

Scenarios

The data era has arrived, and data is becoming valuable in every part of the business. The data middle platform emerged to put that data to use.

The data middle platform has six functions:

Data Collection
ETL
Data Computing
Storage
Data Analysis
Data Display

Selection

There are many data synchronization solutions in the industry. Most of them are built around two modes:

Point-to-Point Synchronization
Offline Synchronization

The architecture of most synchronization solutions in the industry is similar to the following figure:

An industrial-grade component cannot treat these as either-or choices. It needs point-to-point synchronization as well as many-to-many synchronization, and it needs both offline and real-time synchronization. The following figure shows the architecture of our heterogeneous data synchronization component:

A heterogeneous synchronization component built on message middleware has the following advantages.

Multiple Types of Sources

As can be seen from the figure, there are many types of sources:

Gather/SDK: Mainly external data, such as customer data and data collected from apps

RPC: Mainly valuable data generated by internal business systems

Agent: Collects logs, system, and hardware operation information

Data Sources: Data read from various storage systems

Peak-Load Shifting

The read speed of the source and the write speed of the sink are uncontrollable. Typically, the source is several times faster than the sink, which can make the gather services unavailable and cause serious incidents. No one knows when a peak will occur, and this unpredictability threatens the stability and high availability of the entire system. The message middleware is therefore introduced as a buffer.
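The buffering effect can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `simulate` function, the per-tick burst sizes, and the sink rate are all hypothetical, and an in-memory `deque` stands in for the message middleware.

```python
from collections import deque


def simulate(bursts, sink_rate):
    """Replay per-tick source bursts through a buffer drained at sink_rate.

    Returns (delivered, max_backlog): every message the sink wrote, in order,
    and the largest number of messages left in the buffer at the end of a tick.
    """
    buffer = deque()   # stands in for the message middleware topic
    delivered = []
    max_backlog = 0
    next_id = 0
    tick = 0
    # Keep ticking until the source is exhausted and the buffer is drained.
    while tick < len(bursts) or buffer:
        if tick < len(bursts):
            for _ in range(bursts[tick]):   # the source writes as fast as it likes
                buffer.append(next_id)
                next_id += 1
        for _ in range(min(sink_rate, len(buffer))):   # the sink drains at its own pace
            delivered.append(buffer.popleft())
        max_backlog = max(max_backlog, len(buffer))
        tick += 1
    return delivered, max_backlog


# A peak of 10 messages in one tick against a sink that writes 2 per tick.
delivered, max_backlog = simulate(bursts=[1, 10, 1, 0], sink_rate=2)
```

The peak shows up as a temporary backlog in the buffer rather than as pressure on the sink, and every message is still delivered in order once the sink catches up.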

Multi-Data Source Heterogeneity

A customer has two commodity systems, online and offline, and plans for the two systems to share recommendation behavior. This requires collecting the commodity data of both systems at the same time and synchronizing it to five storage systems. A heterogeneous solution based on message middleware handles this smoothly.

Better Resource Allocation

The first process shows the architecture of one source with multiple sinks, and the second process shows multiple sources with one sink. In the heterogeneous architecture, the number of sources and sinks, and whether each of them is running, can be matched flexibly, which saves server resources.

In-Depth

Why do we choose RocketMQ? The features of RocketMQ help us solve many problems. The following chart lists the problems:

Data Security

A basic principle of a data synchronization component is that data must not be lost. A SaaS platform faces many uncertain factors and unpredictable situations, and data recovery is difficult. In real-time scenarios, a difference between the synchronized data and the customer's internal data can be fatal. When we compared message middleware, data security was therefore the most important criterion. The following features of RocketMQ ensure that data is not lost.

RocketMQ's overall architecture design ensures data security through synchronous replication and synchronous flushing to the broker's disk.
Retry: When a message fails to be consumed, it is delivered again.
Dead-Letter Queue:

There is no need to maintain separate storage for messages that failed to be consumed.
Messages that hit exceptions (such as a missing queue caused by abnormal operations) can be sent to the dead-letter queue.
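The retry and dead-letter behavior described above can be sketched in a few lines. This is an illustration in plain Python, not RocketMQ client code: `consume_with_retry`, `flaky_handler`, and the `max_retries` value are hypothetical names and numbers chosen for the example.

```python
def consume_with_retry(messages, handler, max_retries=3):
    """Deliver each message to handler, retrying failures up to max_retries.

    Messages that still fail are parked in a dead-letter list instead of
    being dropped. max_retries is an illustrative value, not a RocketMQ default.
    """
    done, dead_letters = [], []
    for msg in messages:
        for attempt in range(1, max_retries + 1):
            try:
                handler(msg, attempt)
                done.append(msg)
                break
            except Exception:
                continue   # a real broker would redeliver after a delay
        else:
            # Every attempt failed: keep the message for later inspection.
            dead_letters.append(msg)
    return done, dead_letters


def flaky_handler(msg, attempt):
    """Hypothetical sink write: 'b' succeeds on the 2nd try, 'c' never does."""
    if msg == "b" and attempt < 2:
        raise RuntimeError("transient sink error")
    if msg == "c":
        raise RuntimeError("sink table missing")


done, dead_letters = consume_with_retry(["a", "b", "c"], flaky_handler)
```

A transient failure is absorbed by the retry, while a message that can never succeed ends up in the dead-letter list, where it can be examined and replayed after the underlying problem is fixed.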

Parallel and Synchronous Consumption

Data synchronization involves add, modify, and delete operations. Tables (result sets) can be classified as append-only tables and modified tables.

Append-only tables only have the add operation, so they are suitable for parallel consumption.

Modified tables have all three operations: add, modify, and delete. Under high concurrency and multithreading at big-data volumes, the data middle platform can easily apply operations in a different order than the business did, which leaves the data middle platform inconsistent with the business data. We therefore use RocketMQ's sequential (synchronous) consumption so that operations are executed in the same order as in the business system.

The preceding figure demonstrates data inconsistency in the case of parallel consumption. Ideally, the source sequence is 1234 and the sink sequence is 5678, but the actual execution sequence is 6785. Therefore, RocketMQ's sequential message feature is used to ensure data consistency.

With ordered messages, only one consumer consumes a queue at a time, and the other consumers compete for the right to consume it.
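The following sketch shows why order matters for modified tables. It is plain Python rather than RocketMQ client code, and the `select_queue` and `apply_ops` helpers and the sample operations are hypothetical. In the RocketMQ Java client, the same idea is usually expressed with a `MessageQueueSelector` on the producer side and a `MessageListenerOrderly` on the consumer side.

```python
import zlib


def select_queue(key, queue_count):
    """Pick a queue from a stable hash of the business key (e.g. a row id)."""
    return zlib.crc32(key.encode("utf-8")) % queue_count


def apply_ops(ops):
    """Apply (action, key, value) operations to an in-memory 'sink table'."""
    table = {}
    for action, key, value in ops:
        if action in ("add", "modify"):
            table[key] = value
        elif action == "delete":
            table.pop(key, None)
    return table


# Operations in the order the business system performed them.
source_ops = [
    ("add", "sku-1", 10),
    ("modify", "sku-1", 12),
    ("add", "sku-2", 5),
    ("delete", "sku-2", None),
]

# Route every operation to a queue by key; each queue keeps arrival order.
queue_count = 4
queues = [[] for _ in range(queue_count)]
for op in source_ops:
    queues[select_queue(op[1], queue_count)].append(op)

# One consumer per queue applies its operations in order.
ordered_result = {}
for q in queues:
    ordered_result.update(apply_ops(q))

# A reordered delivery, as unordered parallel consumption could produce.
reordered_result = apply_ops(
    [source_ops[1], source_ops[0], source_ops[3], source_ops[2]]
)
```

Because all operations on the same key land in the same queue, the ordered result matches the business data, while the reordered delivery leaves a stale value for `sku-1` and a row for `sku-2` that should have been deleted.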

Classification Parallelism

Sequential (synchronous) consumption is much less efficient than parallel consumption, and the sink writes much more slowly than the source reads, which often causes a large backlog of data. As a result, the synchronized data lags behind the business data, and sometimes the business cannot tolerate that lag.

The first synchronization is a full synchronization with no modification operations, so we use parallel consumption for it and then switch to sequential consumption. To further improve the performance of sequential consumption, we analyzed the problem in depth and found that classifying the data improves efficiency. We implement classified parallelism based on RocketMQ queues and tags.
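Classified parallelism can be sketched as follows. The example assumes that the class is the table name and that each class maps to its own queue; these choices, the `classify` and `consume_group` helpers, and the sample change stream are all hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def classify(messages):
    """Group messages by class (here: table name), keeping arrival order."""
    groups = defaultdict(list)
    for table, seq, payload in messages:
        groups[table].append((seq, payload))
    return groups


def consume_group(table, items):
    """Sequentially 'write' one class to the sink and return what was written."""
    written = []
    for seq, _payload in items:   # strictly in order within a class
        written.append(seq)
    return table, written


# Hypothetical change stream: (table, sequence number, payload).
messages = [
    ("orders", 1, "add"), ("goods", 1, "add"), ("orders", 2, "modify"),
    ("goods", 2, "delete"), ("users", 1, "add"), ("orders", 3, "delete"),
]

groups = classify(messages)
# One worker per class: classes run in parallel, each class stays ordered.
with ThreadPoolExecutor(max_workers=len(groups)) as pool:
    futures = [pool.submit(consume_group, t, items) for t, items in groups.items()]
    results = dict(f.result() for f in futures)
```

Order is only guaranteed where it is needed, within a class, so unrelated classes no longer wait for each other.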

Observation and O&M Capabilities

Currently, there are a large number of synchronization topics, and creating, deleting, testing, locating, observing, and finding them is difficult. Every developer, tester, and product manager involved in the project needs to observe and operate on these topics. RocketMQ-console helps us solve these problems and improves overall development efficiency and progress.

Initial Data Tracking Based on Message Tracing

The preceding diagram shows the flow of data within the system in the recommendation business scenario. Such a flow is internally called a task, and each node is an operator. A piece of data may fail to produce the expected result of the task because of some unavoidable factor. With complex tasks and systems, high concurrency, and high performance requirements, it is necessary to monitor the flow of data, detect abnormal flows in time, and correct the data and the problem quickly. This requires one capability: data tracking.

We observed that most operators currently read their data from RocketMQ, so we designed the first-generation data tracking capability on top of RocketMQ's message trace.
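As a rough illustration of what such tracking computes, the sketch below walks trace events and reports where each record stopped. The operator names, the `(msg_id, operator, status)` event shape, and the `find_stalled` helper are assumptions made for the example; they are not the fields of RocketMQ's message trace.

```python
from collections import defaultdict

# Hypothetical task: the operators a record passes through, in order.
TASK_OPERATORS = ["collect", "clean", "compute", "store"]


def find_stalled(trace_events, operators=TASK_OPERATORS):
    """Return {msg_id: last operator reached} for records that did not finish.

    trace_events is an iterable of (msg_id, operator, status) tuples; the
    field names are assumed for this sketch.
    """
    reached = defaultdict(set)
    for msg_id, operator, status in trace_events:
        ops = reached[msg_id]   # track the record even if every step failed
        if status == "ok":
            ops.add(operator)

    stalled = {}
    for msg_id, ops in reached.items():
        last = None
        for op in operators:   # walk the task in order
            if op not in ops:
                break
            last = op
        if last != operators[-1]:
            stalled[msg_id] = last   # None: it never passed the first operator
    return stalled


events = [
    ("m1", "collect", "ok"), ("m1", "clean", "ok"),
    ("m1", "compute", "ok"), ("m1", "store", "ok"),
    ("m2", "collect", "ok"), ("m2", "clean", "failed"),
    ("m3", "collect", "failed"),
]
stalled = find_stalled(events)
```

Records that complete the task are left out of the result, so what remains is the list of abnormal flows and the operator after which each one needs attention.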

RocketMQ-Connect

RocketMQ-connect is an open-source data heterogeneity component built on RocketMQ. It supports synchronization between multiple data sources. More and more companies use RocketMQ to support their business platforms and data platforms, and the data on each platform is constantly flowing. With RocketMQ-connect, a streaming data platform can be built simply and quickly on top of the existing architecture.

The RocketMQ-connect architecture has two distinct characteristics:

Decentralized and Dependency-Free Architecture Design
SPI-Based Pluggable Design

Decentralized Design

The connect-cli sends a heterogeneous task to any connect-runtime. That runtime processes the task information and sends it to the broker. Every connect-runtime in the cluster then receives the task and stores it locally. Each runtime starts and runs the task without directly relying on the broker.

RocketMQ-connect uses no other components in its architecture design, which keeps the overall architecture simple.
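The propagation of a task through the cluster can be imitated with a tiny in-memory model. The `Broker` and `Runtime` classes below are hypothetical stand-ins, not RocketMQ-connect classes; they only show that any runtime can accept a task and that every runtime ends up with a local copy.

```python
class Broker:
    """Minimal stand-in for a config topic: fan out each message to subscribers."""

    def __init__(self):
        self.subscribers = []

    def publish(self, message):
        for callback in self.subscribers:
            callback(message)


class Runtime:
    def __init__(self, name, broker):
        self.name = name
        self.broker = broker
        self.local_tasks = {}   # local copy of every task config
        broker.subscribers.append(self.on_config)

    def submit(self, task_name, config):
        """Entry point used by the CLI: any runtime can accept a task."""
        self.broker.publish((task_name, config))

    def on_config(self, message):
        task_name, config = message
        self.local_tasks[task_name] = config   # store locally, then run from here


broker = Broker()
runtimes = [Runtime(f"runtime-{i}", broker) for i in range(3)]
runtimes[1].submit("file-sync", {"topic": "fileTopic"})
```

No runtime plays a special coordinating role in this model, which is the property the decentralized design relies on.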

SPI-Based Pluggable Design

```json
{
  "connector-class":"org.apache.rocketmq.connect.file.FileSourceConnector",
  "topic":"fileTopic",
  "filename":"/opt/source-file/source-file.txt",
  "source-record-converter":"org.apache.rocketmq.connect.runtime.converter.JsonConverter"
}
```

connector-class: the execution object of the source
source-record-converter: data processing object
topic: configuration of file-source
filename: configuration of file-source

As shown in the task information, starting a task requires providing the source or sink execution class required by the task. RocketMQ-connect will look for the startup class in the plug-in directory.
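The lookup step can be imitated in Python with a simple registry. This is an analogy, not how RocketMQ-connect loads Java classes: the `PLUGINS` dictionary stands in for the plug-in directory, and the `plugin` decorator, `start_task` function, and `example.FileSourceConnector` name are hypothetical.

```python
import json

PLUGINS = {}   # stands in for the plug-in directory: class name -> class


def plugin(name):
    """Register a class under the name a task config would refer to."""
    def register(cls):
        PLUGINS[name] = cls
        return cls
    return register


@plugin("example.FileSourceConnector")   # hypothetical name, for illustration only
class FileSourceConnector:
    def __init__(self, config):
        self.topic = config["topic"]
        self.filename = config["filename"]


def start_task(task_json):
    """Look up the class named by 'connector-class' and instantiate it."""
    config = json.loads(task_json)
    class_name = config["connector-class"]
    if class_name not in PLUGINS:
        raise LookupError(f"no plug-in provides {class_name}")
    return PLUGINS[class_name](config)


task = start_task(json.dumps({
    "connector-class": "example.FileSourceConnector",
    "topic": "fileTopic",
    "filename": "/opt/source-file/source-file.txt",
}))
```

The runtime itself knows nothing about file sources; a new source or sink only has to be registered under a name that a task configuration can reference.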

About the Author

Lao Hu is a full-stack technical expert and a contributor to the RocketMQ multilingual clients. He focuses on middleware, distributed systems, storage, and AI software solutions.
