
Interview Questions We've Learned Over the Years: Kafka

This article is part of a series focusing on interview questions for technicians, with a specific emphasis on Kafka.


By Taosu

Why Kafka

The role of Message Queue (MQ): asynchronous processing, load shifting, and decoupling

For small and medium-sized enterprises with fewer technical challenges, the active community and open-source nature make RabbitMQ a good choice. On the other hand, for large enterprises with robust infrastructure research and development capabilities, RocketMQ, developed in Java, is a solid option.

In big-data scenarios such as real-time computing and log collection, Kafka is the industry standard. With its highly active community and global prominence in the field, Kafka is a safe choice.

| | RabbitMQ | RocketMQ | Kafka |
| --- | --- | --- | --- |
| Single-machine throughput | Order of 10,000 msg/s | Order of 100,000 msg/s | Order of 100,000 msg/s |
| Programming language | Erlang | Java | Java and Scala |
| Message delay | Microseconds (μs) | Milliseconds (ms) | Milliseconds (ms) |
| Message loss | Low possibility | No loss after parameter optimization | No loss after parameter optimization |
| Consumption mode | Push and pull | Push and pull | Pull |
| Impact of topic count on throughput | N/A | Hundreds or thousands of topics have a small impact on throughput | Dozens or hundreds of topics have a great impact on throughput |
| Availability | High (primary/secondary) | Very high (primary/secondary) | Very high (distributed) |

RabbitMQ

RabbitMQ was initially used for reliable communication in telecommunication services. It is one of the few products that support Advanced Message Queuing Protocol (AMQP).

Advantages:

  • Lightweight, fast, and easy to deploy and use.
  • It supports flexible routing configurations. RabbitMQ has an exchange module between the producer and the queue. Messages sent by producers can be sent to different queues based on the configured routing rules. Routing rules are flexible and can be implemented in a customized way.
  • The RabbitMQ client supports most programming languages and AMQP.


Disadvantages:

  • If a large number of messages accumulate in the queue, performance drops sharply.
  • RabbitMQ processes on the order of tens of thousands to hundreds of thousands of messages per second. If the application requires higher throughput, do not choose RabbitMQ.
  • RabbitMQ is developed in Erlang, so the cost of feature extension and secondary development is very high.

RocketMQ

RocketMQ has drawn on the design of Kafka and made many improvements. It has almost all the features and functions that a message queue should have.

  • RocketMQ is mainly used in scenarios such as ordering, transaction, stream computing, message pushing, log stream processing, and binary log distribution.
  • Having been tested in previous Double 11 Shopping Festivals, its performance, stability, and reliability are proven.
  • It is developed in Java, which makes reading the source code, extension, and secondary development convenient.
  • Many optimizations have been made for the response latency in the e-commerce field.
  • It can process hundreds of thousands of messages per second while responding in milliseconds. If your application demands short response time, you can use RocketMQ.
  • The performance of RocketMQ is an order of magnitude higher than that of RabbitMQ.
  • It supports dead-letter queues. The dead-letter exchange (DLX) is a very useful feature: when a message cannot be correctly consumed, it is put into the dead-letter queue. A follow-up analysis program can consume the contents of the dead-letter queue to analyze the abnormal situations, and the system can then be improved and optimized.

Disadvantages:

The integration and compatibility with peripheral systems are not very good.

Kafka

Broad compatibility: Kafka works with almost all open-source software and meets most application scenarios, especially big data and stream computing.

  • Kafka is efficient, scalable, and has message persistence. It supports partitions, replicas, and fault tolerance.
  • Kafka has a lot of designs for batch and asynchronous processing to get high performance.
  • Hundreds of thousands of asynchronous messages are processed per second. If compression is enabled, the system can process up to 20 million messages per second.
  • However, due to asynchronous and batch processing, the delay will be high, which is not suitable for e-commerce scenarios.

What Is Kafka

  • Producer API: It allows an application to publish record streams to one or more Kafka topics.
  • Consumer API: It allows an application to subscribe to one or more topics and process the record streams generated for them.
  • Streams API: It allows an application to act as a stream processor, transforming an input stream into an output stream.
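As a minimal illustration of the Producer and Consumer APIs, the following sketch (assuming a local broker at localhost:9092 and an example topic named topic_x) publishes one record and reads it back:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class QuickStart {
    public static void main(String[] args) {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Producer API: publish a record stream to a topic
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>("topic_x", "key", "hello kafka"));
        }

        Properties c = new Properties();
        c.put("bootstrap.servers", "localhost:9092");
        c.put("group.id", "demo-group");
        c.put("auto.offset.reset", "earliest");
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // Consumer API: subscribe to a topic and process the record stream
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(List.of("topic_x"));
            for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("partition=%d offset=%d value=%s%n",
                        r.partition(), r.offset(), r.value());
            }
        }
    }
}
```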


Message

Message is the data unit of Kafka. A message can be seen as a "row of data" or a "record" in a database.

  • Batch

Messages are written to Kafka in batches to improve efficiency. Larger batches yield more throughput but increase response time.

  • Topic

Messages are classified by topics. It is similar to tables in a database.

  • Partition

A topic can be divided into multiple partitions distributed across the Kafka cluster. Each single partition is ordered, which facilitates scale-out. If global order is needed, you must set the number of partitions to one.

  • Replica

Each topic is divided into multiple partitions, and each partition has multiple replicas.

  • Producer

By default, the producer distributes messages evenly across all partitions of a topic. The partition for a given message is chosen in one of three ways:

  • The partition is specified directly.
  • The partition is derived from a hash of the message key.
  • Partitions are assigned in a round-robin manner.
  • Consumer

The consumer uses the offset to distinguish messages that have already been read and consumes from there. The last read offset of each partition is stored in ZooKeeper or in Kafka itself, so if the consumer shuts down or restarts, its read position is not lost.

  • Consumer Group

The consumer group ensures that each partition is consumed by only one consumer in the group, which avoids repeated consumption. If one consumer in the group fails, the other consumers take over its work: the group rebalances and partitions are reassigned.

  • Broker

It connects producers and consumers. A single broker can easily process thousands of partitions and millions of messages per second.

  • Brokers receive messages from producers, set offsets for the messages, and commit the messages to disk for storage.
  • Brokers serve consumers, respond to requests for reading partitions, and return messages that have been committed to disk.
  • Cluster

Each partition has a leader. When a partition is replicated across multiple brokers, the leader is used to coordinate replication of that partition.

  • Producer Offset

As messages are written, each partition tracks an offset; the producer offset is the latest (largest) offset in each partition.

  • Consumer Offset

Consumers in different consumer groups can store different offsets for a partition without affecting each other.

  • LogSegment
  • A partition consists of multiple LogSegments.
  • A LogSegment consists of a .log file, an .index file, and a .timeindex file.
  • The .log file is append-only and written sequentially. It is named after the offset of the first message it contains.
  • The .index file is used to quickly locate a position when deleting logs or searching for data.
  • The .timeindex file is used to find the offset corresponding to a timestamp.

How Is Kafka

Advantages:

  • High throughput: A single server can process tens of millions of messages per second and keeps stable performance even when terabytes of messages are stored.

    • Zero copy: It reduces data copies between kernel mode and user mode. With sendfile, disk data is DMA-copied directly to the socket buffer.
    • Sequential read-write: It makes full use of the very high performance of sequential disk I/O.
    • Page cache and mmap: Disk files are mapped into memory, so the user can modify a disk file by modifying memory.
  • High performance: A single node supports thousands of clients with zero downtime and zero data loss.
  • Persistence: Messages are persisted to the disk. Data loss is prevented by persisting data to the disk and by replication.
  • Distributed system: Easy to scale. All components are distributed, allowing deployment of more machines without downtime.
  • Reliability: Kafka is distributed, and has partitions, replicas, and fault tolerance.
  • Client status maintenance: The status of message processing is maintained on the consumer side. When processing fails, consumption can be automatically rebalanced.

Scenarios

  • Log collection: With Kafka, you can collect logs of various services and process them on the big data platform.
  • Message system: Kafka can be used to decouple producers from consumers, and cache messages.
  • User activity tracking: Kafka is often used to record various activities of web users or app users, such as web browsing, searching, and clicking. These activities are published by various servers to Kafka topics. Then, consumers can subscribe to these topics for real-time monitoring and analysis of operation data. The activities can also be saved to databases.

Basic Process of Production and Consumption


1.  When a producer is created, a sender thread is also created and set as a daemon thread.

2.  The produced message goes through the interceptor -> serializer -> partitioner and then is cached in the buffer.

3.  The condition for batch sending is that the buffer data size reaches the batch.size or linger.ms reaches the upper limit.

4.  After batch sending, the message is sent to the specified partition and then flushed to disk on the broker.

  • acks=0: The message is considered sent as soon as it is placed in the buffer.
  • acks=1: The message only needs to be written to the leader partition. If the leader goes down after acknowledging the message but before the follower replicas have synchronized it, the message is lost.
  • acks=all (default): The leader partition waits for the records to be acknowledged by all in-sync replica (ISR) partitions. This ensures that messages are not lost as long as at least one ISR replica remains active.

5.  If the producer configures the retries parameter greater than 0 and does not receive an acknowledgment, the client will retry the message.

6.  If the message is successfully persisted on the broker, the production metadata is returned to the producer, as in the sketch below.
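The settings in this flow map directly onto producer configuration. Below is a minimal sketch, assuming a local broker and an example topic topic_x; the callback receives the production metadata mentioned in step 6:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ReliableProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384); // send when buffered data reaches batch.size...
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);     // ...or when linger.ms elapses
        props.put(ProducerConfig.ACKS_CONFIG, "all");       // wait for all ISR replicas to acknowledge
        props.put(ProducerConfig.RETRIES_CONFIG, 3);        // retry if no acknowledgment is received

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("topic_x", "user-42", "payload"),
                    (metadata, exception) -> {
                        if (exception == null) { // production metadata returned to the producer
                            System.out.printf("partition=%d offset=%d%n",
                                    metadata.partition(), metadata.offset());
                        }
                    });
        }
    }
}
```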

Leader Election

  • Kafka maintains the ISR collection for each topic on Zookeeper.
  • Kafka considers a message committed once the replicas in the ISR collection have synchronized it from the leader.
  • Only followers that are in sync with the leader are eligible to be elected as the new leader.
  • Assume a topic has N + 1 replicas. Kafka tolerates N servers becoming unavailable while requiring comparatively low redundancy.

If the replicas in the ISR are all lost:

  • You can wait for any replica in the ISR to recover and then provide external services, which requires waiting time.
  • You can select a replica from the out-of-sync replicas (OSR) as the leader, which will cause data loss.

Replica Message Synchronization

First, the follower sends a FETCH request to the leader. The leader reads the message data from the underlying log file and updates the follower replica's LEO in its memory using the fetchOffset value in the FETCH request; it then tries to update the HW value. After receiving the FETCH response, the follower writes the messages to its underlying log and then updates its own LEO and HW values.

Related Concepts: LEO and HW

  • LEO: Log end offset. It records the offset value of the next message in the replica log. If LEO=10, it means that the replica stores 10 messages, and the offset value range is [0,9].
  • HW: High watermark. It marks the offset up to which messages have been replicated. The HW value of a replica is never greater than its LEO value, and all messages with offsets less than or equal to the HW are considered "replicated".
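As a simplified numeric illustration (the values are hypothetical): suppose the leader's LEO is 10 and a follower sends a FETCH request with fetchOffset = 8. The leader updates its in-memory record of that follower's LEO to 8 and recomputes the HW as the minimum LEO across the leader and its ISR followers, here 8; messages up to that watermark are then considered replicated.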

Rebalance

  • The number of group members has changed.
  • The number of subscribed topics has changed.
  • The number of partitions of subscribed topics has changed.

After the leader election is completed and when the above three situations occur, the leader starts to allocate consumption plans according to the configured RangeAssignor, that is, which consumer is responsible for consuming which partitions of which topics. Once the allocation is completed, the leader will encapsulate this plan into the SyncGroup request and send it to the coordinator. Non-leaders will also send a SyncGroup request, but the content is empty. After receiving the allocation plan, the coordinator will insert the plan into the response of the SyncGroup and send it to each consumer. In this way, all members of the group know which partitions they should consume.

Partition Allocation Algorithm: RangeAssignor

  • The principle is to distribute all partitions evenly across all consumers, based on the total number of consumers and the total number of partitions.
  • The consumers subscribing to a topic are sorted by name in lexicographic order, and the partitions are divided evenly among them; any remaining partitions are assigned, one each, to the consumers at the front of the sorted list.
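For example, with a single topic of 7 partitions (P0 to P6) and three consumers C0, C1, and C2 sorted by name, 7 / 3 = 2 with a remainder of 1, so C0 receives P0, P1, and P2 (one extra partition), C1 receives P3 and P4, and C2 receives P5 and P6.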

Adding, Deleting, Modifying, and Querying

```bash
kafka-topics.sh --zookeeper localhost:2181/myKafka --create --topic topic_x --partitions 1 --replication-factor 1
kafka-topics.sh --zookeeper localhost:2181/myKafka --delete --topic topic_x
kafka-topics.sh --zookeeper localhost:2181/myKafka --alter --topic topic_x --config max.message.bytes=1048576
kafka-topics.sh --zookeeper localhost:2181/myKafka --describe --topic topic_x
```

How do you find the message whose offset is 23?

By querying the skip list (a ConcurrentSkipListMap), you locate the segment file 00000000000000000000.index. A binary search in that offset index file finds the largest index entry not greater than 23, namely the entry for offset 20. Starting from that entry's physical position, 320, in the log segment file, you then scan sequentially until you reach the message whose offset is 23.
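A sketch of that two-step lookup, using a hypothetical in-memory copy of the sparse offset index (the real broker reads these entries from the .index file):

```java
import java.util.Map;
import java.util.TreeMap;

public class IndexLookup {
    public static void main(String[] args) {
        // Sparse offset index: message offset -> physical position in the .log file
        TreeMap<Long, Long> offsetIndex = new TreeMap<>();
        offsetIndex.put(0L, 0L);
        offsetIndex.put(20L, 320L);
        offsetIndex.put(40L, 700L);

        long target = 23L;
        // Largest index entry not greater than the target offset (here: offset 20, position 320)
        Map.Entry<Long, Long> entry = offsetIndex.floorEntry(target);
        System.out.printf("scan the log sequentially from position %d to find offset %d%n",
                entry.getValue(), target);
    }
}
```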


Splitting Files

  • Size splitting: The size of the current log segment file exceeds the value configured by the broker parameter log.segment.bytes.
  • Time splitting: The difference between the maximum message timestamp in the current log segment and the system timestamp is greater than the value configured by log.roll.ms.
  • Index splitting: The size of the offset index file or the timestamp index file reaches the value configured by the broker parameter log.index.size.max.bytes.
  • Offset splitting: The difference between the offset of the message appended and the offset of the current log segment is greater than Integer.MAX_VALUE.

Consistency

Idempotence

Idempotence ensures that processing is not repeated when a message is resent: even if the consumer receives duplicate messages, handling them again still yields a consistent final result. The mathematical notion of idempotence is f(f(x)) = f(x).


How to Implement?

You can add a unique ID that is similar to the primary key of a database to uniquely mark a message.

```
ProducerID:      # Each new producer is assigned a unique PID when it is initialized.
SequenceNumber:  # For each topic-partition that receives data from a PID, the SN increases monotonically from 0.
```
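In the Java producer, this PID and sequence-number mechanism is switched on with a single configuration flag. A minimal sketch (the broker address is illustrative):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

public class IdempotentProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        // The broker assigns a PID and tracks sequence numbers per partition,
        // discarding duplicates caused by producer retries.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // retried sends of the same record are deduplicated by the broker
        }
    }
}
```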


How to Elect?

  1. Use the distributed lock of Zookeeper to elect a controller and notify the controller when a node joins or exits the cluster.
  2. The controller is responsible for the partition leader election when a node joins or leaves the cluster.
  3. The controller uses epochs and ignores requests with a smaller epoch, which avoids split-brain: two nodes simultaneously considering themselves the current controller.

Availability

  • When you create a topic, you can specify --replication-factor 3 to set the number of replicas; it cannot exceed the number of brokers.
  • Only the leader is responsible for reading and writing, and the follower regularly pulls data from the leader.
  • The ISR is the list of replicas that the leader maintains and keeps synchronized with, that is, the list of active replicas. If a follower falls too far behind, the leader removes it from the ISR. During leader election, followers are selected preferentially from the ISR.
  • Set acks = all. The leader sends an acknowledgment to the producer only after receiving acknowledgments from all the replicas in the ISR.

Interview Questions

Online Questions: Rebalance

If the Kafka cluster contains a large number of nodes, such as hundreds of nodes, the rebalance within the consumer group caused by the changes in cluster architecture may take several minutes to several hours. In this case, Kafka is almost unavailable, which greatly affects the transactions per second (TPS) of Kafka.

Causes

  • The number of group members has changed.
  • The number of subscribed topics has changed.
  • The number of partitions of subscribed topics has changed.

A crash and a voluntary leave of a group member are two different scenarios. When a crash occurs, the member does not actively inform the coordinator, so the coordinator may need a full session.timeout period (the heartbeat window) to detect it, which inevitably makes the consumer lag. To put it simply, leaving the group initiates a rebalance actively, while crashing initiates one passively.

Solutions

  • Increase the session timeout: session.timeout.ms = 6s.
  • Decrease the heartbeat interval: heartbeat.interval.ms = 2s.
  • Increase the poll interval: max.poll.interval.ms = t + 1 minute, where t is the maximum processing time.
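Expressed as consumer configuration, a sketch with the values suggested above (t is an assumed upper bound on the time needed to process one poll batch):

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class RebalanceTuning {
    public static Properties tunedProps() {
        long t = 120_000L; // assumed maximum time (ms) to process one poll batch
        Properties props = new Properties();
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 6_000);    // detect crashed members within 6s
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 2_000); // at least 3 heartbeats per session timeout
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, (int) (t + 60_000)); // t + 1 minute
        return props;
    }
}
```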

The Role of ZooKeeper

Currently, Kafka uses ZooKeeper to take on cluster metadata storage, member management, controller election, and other management tasks. Once the KIP-500 proposal is fully implemented, Kafka will no longer depend on ZooKeeper at all.

  • Metadata storage means that the metadata of all topics and partitions is stored in ZooKeeper, and all other components must stay consistent with it.
  • Member management refers to the registration, deregistration, and property changes of broker nodes.
  • Controller election refers to electing the cluster controller, whose duties include, but are not limited to, topic deletion and parameter configuration.

In conclusion, KIP-500 replaces ZooKeeper with a Raft-based consensus mechanism developed by the community, which implements controller self-election.

For metadata storage, the etcd based on Raft has become more popular in recent years.

More and more systems are using etcd to store critical data. For example, the flash sale system often uses it to store information about each node to control the number of services that consume MQ. Some configuration data of business systems are also synchronized to each node of the business system in real time through etcd. For example, etcd is used in the flash sale management to synchronize the configuration data of the flash sale activity to each node of the flash sale API service in real time.

The Role of Replica

Only the leader replica in Kafka provides external read and write services and responds to client requests. Follower replicas only synchronize data from the leader passively, using the PULL method, and stand ready to take over as leader when the broker hosting the leader replica goes down.

  • Since Kafka 2.4, the community allows follower replicas to provide limited read services by parameter configuration.
  • Previously, the HW mechanism was mainly used to ensure data consistency. However, HW values cannot guarantee data consistency in scenarios where the leader changes continuously. Therefore, the community has introduced the leader epoch mechanism to fix the drawbacks of HW values.

Why Is Read-Write Splitting Not Supported?

  • Since version 2.4, Kafka has provided limited read-write splitting.
  • Not applicable in common scenarios. Read-write splitting suits workloads where reads are very heavy and writes are relatively infrequent, which does not match typical message-queue usage.
  • Synchronization mechanism. Kafka synchronizes followers with the PULL method, and the replication latency is relatively high.

How to Prevent Repeated Consumption?

  • At the code level, each consumption must commit an offset.
  • You can use a MySQL unique key constraint combined with Redis to check whether a message ID has already been consumed, for example, by storing IDs in Redis with the SET command; a minimal sketch follows this list.
  • If the volume is large and some misjudgment is tolerable, a Bloom filter can also be used.
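A minimal dedup sketch; the in-memory set stands in for the Redis SET or MySQL unique-key check, and the message IDs and handler are hypothetical:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class DedupConsumer {
    // In production this would be a Redis SET or a MySQL unique-key insert,
    // so the check survives restarts and is shared across consumer instances.
    private final Set<String> consumed = ConcurrentHashMap.newKeySet();

    public void handle(String messageId, String payload) {
        // add() returns false if the ID was already present: skip the duplicate
        if (!consumed.add(messageId)) {
            return;
        }
        System.out.println("processing " + payload);
    }
}
```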

How to Ensure No Data Loss?

  • Producer messages can be protected with the acknowledgment configuration acks = all.
  • If the leader goes down during broker synchronization, configure ISR replicas and retries.
  • Against loss on the consumer side, disable automatic offset submission and submit the offset only when processing completes, as in the sketch below.
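A sketch of the consumer side of this guarantee, with auto-commit disabled and the offset committed only after processing completes (topic and group names are illustrative):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // no automatic offset submission

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("topic_x"));
            while (true) {
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.println("processed " + r.value());
                }
                consumer.commitSync(); // commit only after the batch is fully processed
            }
        }
    }
}
```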

How to Ensure Ordered Consumption?

  • A single topic with a single partition, a single consumer, and single-threaded consumption guarantees global order but has low throughput, so it is not recommended.
  • If you only need ordered consumption per key, create a memory queue for each key and let each thread consume one memory queue. This ensures ordered consumption for a single key, such as a user ID or an activity ID; see the sketch below.
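A sketch of the per-key memory-queue idea (the queue count and handler are illustrative): messages with the same key always land in the same queue, and each queue is drained by exactly one thread, so order per key is preserved.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class KeyOrderedDispatcher {
    private static final int N = 4; // number of memory queues / worker threads
    private final BlockingQueue<String>[] queues;

    @SuppressWarnings("unchecked")
    public KeyOrderedDispatcher() {
        queues = new BlockingQueue[N];
        for (int i = 0; i < N; i++) {
            queues[i] = new ArrayBlockingQueue<>(1024);
            final BlockingQueue<String> q = queues[i];
            // one dedicated thread per queue preserves ordering within that queue
            new Thread(() -> {
                try {
                    while (true) {
                        System.out.println("consumed " + q.take());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
    }

    // same key -> same queue -> same thread -> ordered consumption per key
    public void dispatch(String key, String message) throws InterruptedException {
        queues[Math.floorMod(key.hashCode(), N)].put(message);
    }
}
```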

[Online] How to Solve Consumption Backlog?

  • Fix the consumer so it can consume again, and scale out to N consumers.
  • Write a distribution program to evenly distribute topics to temporary topics.
  • Use N consumers at the same time to consume different temporary topics.

How to Avoid Message Backlog?

  • Increase consumption parallelism
  • Batch consumption
  • Reduce the number of component I/O interactions
  • Priority consumption
```java
if (maxOffset - curOffset > 100000) {
    // TODO: priority-handling logic when messages pile up
    // Unprocessed messages can be discarded or logged
    return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
}
// TODO: normal consumption process
return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
```

How to Design MQ?

Quick scale-out needs to be supported: brokers plus partitions, with partitions placed on different machines. When machines are added, data is migrated by topic. In distributed mode, consistency, availability, and partition tolerance need to be considered.

  • Consistency: producer message confirmation, consumer idempotence, and broker data synchronization
  • Availability: how to ensure that data is neither lost nor duplicated, how to persist data, and how to read and write during persistence
  • Partition fault tolerance: the choice of the election mechanism and how to synchronize multiple replicas
  • Large data capacity: How to resolve message backlog and performance degradation of a large number of topics

In terms of performance, you can learn from TimingWheel, Zero Copy, I/O multiplexing, sequential read-write, and compressed batch processing.


Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.
