You can deploy a Kafka cluster on Elastic Compute Service (ECS) instances on which Elastic Remote Direct Memory Access (eRDMA) is enabled. The cluster can then use the low latency, high throughput, and low CPU utilization of eRDMA to transmit data between its nodes more efficiently, which suits scenarios that require high message throughput and low latency. This topic describes how to deploy a Kafka cluster on eRDMA-capable ECS instances and how to measure the performance improvement that eRDMA provides.
Kafka is a distributed stream processing platform that efficiently processes and stores a large number of data streams and supports real-time message publishing and subscription. Kafka is widely used in scenarios such as log aggregation, event sourcing, and real-time analytics. For more information, see Kafka documentation.
eRDMA is a Remote Direct Memory Access (RDMA) service developed by Alibaba Cloud to ensure high network performance with low latency, high throughput, and high elasticity. For more information, see Overview.
Step 1: Prepare ECS instances
Before you deploy a Kafka cluster, prepare multiple ECS instances on which to run the Broker and ZooKeeper services and to build a stress testing environment.
A Broker-enabled instance serves as a core data node in a cluster to store, transmit, and manage messages.
A ZooKeeper-enabled instance implements distributed service coordination and management in the Kafka cluster.
A stress testing instance is used to test the performance of the deployed Kafka cluster.
In this example, five ECS instances are created, which include one ZooKeeper-enabled instance, three Broker-enabled instances, and one stress testing instance. The following table describes the requirements for the instance configurations.
The selected instance types must support eRDMA. For information about the instance types that support eRDMA, see the Limits section in the "Configure eRDMA on an enterprise-level instance" topic.
Purpose | Instance requirement | Disk requirement | Network requirement | Image requirement |
Broker-enabled instance | Create three instances. In this example, the ecs.g8a.2xlarge instance type is used. | Select Enterprise SSDs (ESSDs) at performance level 3 (PL3). Select the disk capacity based on your business requirements. | | Alibaba Cloud Linux 3.2104 LTS 64-bit. |
ZooKeeper-enabled instance | Create one instance. In this example, the ecs.g8a.xlarge instance type is used. | None. | | |
Stress testing instance | Create one instance. In this example, the ecs.g8a.16xlarge instance type is used. | None. | | |
Step 2: Install required tools and Kafka
Log on to each of the five ECS instances prepared in Step 1 and run the relevant commands to install Shared Memory Communications over Remote Direct Memory Access (SMC-R), the Java tool, and Kafka.
Before you can use the eRDMA feature, you must deploy SMC-R. SMC-R works in the kernel space, and the SMC-R protocol stack helps instances use, manage, and maintain eRDMA resources. For information about SMC-R, see Use SMC.
Log on to all ECS instances in sequence.
For more information, see Connect to a Linux instance by using a password or key.
(Conditionally required) Run the uname -r command to view the kernel version. Make sure that the kernel version of all instances is 5.10.134-16.3 or later. If the kernel version of an instance is earlier than 5.10.134-16.3, run the following commands to upgrade the kernel to the latest version:
sudo yum update kernel
sudo reboot
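The kernel version check in this step can also be scripted. The following is a minimal sketch, not part of the original procedure. It assumes that GNU sort with the -V (version sort) option is available and that uname -r returns a string such as 5.10.134-16.3.al8.x86_64, where everything after the numeric part is a distribution suffix. The function names numeric_kernel_version and kernel_at_least are hypothetical.

```shell
# Minimum kernel version required in this topic.
MIN_KERNEL="5.10.134-16.3"

# numeric_kernel_version VERSION
# Prints only the leading numeric part of a kernel release string.
# Assumption: the release looks like <x.y.z>-<a.b><distribution suffix>.
numeric_kernel_version() {
    printf '%s\n' "$1" | sed -E 's/^([0-9]+(\.[0-9]+)*(-[0-9]+(\.[0-9]+)*)?).*$/\1/'
}

# kernel_at_least CURRENT MINIMUM
# Succeeds if CURRENT is equal to or later than MINIMUM in version order.
kernel_at_least() {
    current=$(numeric_kernel_version "$1")
    lowest=$(printf '%s\n%s\n' "$current" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Example usage on an instance:
#   kernel_at_least "$(uname -r)" "$MIN_KERNEL" || echo "Upgrade the kernel first"
```

Stripping the suffix before the comparison keeps a release such as 5.10.134-16.al8.x86_64 from being sorted after 5.10.134-16.3 by accident.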
Run the following command to install the smc-tools toolkit on each instance:
sudo yum install smc-tools -y
Run the following command to check whether eRDMA is enabled on each instance:
smcr dev
Sample output:
Net-Dev  IB-Dev   IB-P  IB-State  Type    Crit  #Links  PNET-ID
eth0     erdma_0  1     ACTIVE    0x107f  No    0
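To check the result on several instances without reading the table by eye, the output can be filtered. The following sketch assumes that the columns appear in the same order as in the sample output above (Net-Dev, IB-Dev, IB-P, IB-State, and so on) and that eRDMA devices are named with the erdma prefix. The function name has_active_erdma is hypothetical.

```shell
# has_active_erdma
# Reads `smcr dev` output on stdin. Succeeds if at least one data row lists
# an erdma device (column 2) whose IB-State (column 4) is ACTIVE.
# Assumption: column order matches the sample output in this topic.
has_active_erdma() {
    awk 'NR > 1 && $2 ~ /^erdma/ && $4 == "ACTIVE" { found = 1 } END { exit !found }'
}

# Example usage on an instance:
#   smcr dev | has_active_erdma && echo "eRDMA device is active"
```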
Run the following command to disable IPv6 on each instance.
Note: Alibaba Cloud eRDMA devices and SMC devices do not support IPv6 addresses. After you disable IPv6, traffic can pass through RDMA channels by using IPv4.
sudo sysctl net.ipv6.conf.all.disable_ipv6=1
Run the following command to install Java and Git:
sudo yum install java-11-openjdk-1:11.0.21.0.9-2.0.3.al8 java-11-openjdk-devel-1:11.0.21.0.9-2.0.3.al8 git -y
Run the following commands to download and decompress the Kafka package:
wget https://archive.apache.org/dist/kafka/3.5.0/kafka_2.13-3.5.0.tgz
tar -xf kafka_2.13-3.5.0.tgz
Step 3: Start ZooKeeper and Broker for Kafka
Log on to all ECS instances in sequence.
Add the mapping between the private IP address and hostname of an instance to the /etc/hosts file on each instance.
Run the following command on the ZooKeeper-enabled instance to start ZooKeeper:
bash $HOME/kafka_2.13-3.5.0/bin/zookeeper-server-start.sh -daemon $HOME/kafka_2.13-3.5.0/config/zookeeper.properties
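For the /etc/hosts mapping mentioned earlier in this step, the following sketch shows one way to add an entry only if the hostname is not already present. The IP addresses and hostnames in the usage example are hypothetical placeholders, and the function name add_host_mapping is an assumption for illustration.

```shell
# add_host_mapping IP HOSTNAME [FILE]
# Appends "IP HOSTNAME" to FILE (default: /etc/hosts) unless HOSTNAME already
# appears in FILE as a whole word. Writing to /etc/hosts requires root privileges.
add_host_mapping() {
    ip=$1
    name=$2
    file=${3:-/etc/hosts}
    if ! grep -qwF -- "$name" "$file" 2>/dev/null; then
        printf '%s %s\n' "$ip" "$name" >> "$file"
    fi
}

# Example usage (hypothetical addresses and hostnames, run as root):
#   add_host_mapping 172.17.0.10 kafka-zookeeper
#   add_host_mapping 172.17.0.11 kafka-broker-0
```

Because the function skips hostnames that already exist, it can be run repeatedly on every instance without creating duplicate entries.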
Start Broker on each of the three Broker-enabled instances.
Note: If you perform a test without using the eRDMA feature, remove smc_run from the command.
On the first Broker-enabled instance, set the Broker ID to 0 and start Broker. Replace <zookeeper ip> with the private IP address of the ZooKeeper-enabled instance.
KAFKA_HEAP_OPTS="-Xmx4G -Xms4G" smc_run bash $HOME/kafka_2.13-3.5.0/bin/kafka-server-start.sh -daemon $HOME/kafka_2.13-3.5.0/config/server.properties --override broker.id=0 --override log.dirs=$HOME/kafka-logs --override zookeeper.connect=<zookeeper ip>:2181
On the second Broker-enabled instance, set the Broker ID to 1 and start Broker. Replace <zookeeper ip> with the private IP address of the ZooKeeper-enabled instance.
KAFKA_HEAP_OPTS="-Xmx4G -Xms4G" smc_run bash $HOME/kafka_2.13-3.5.0/bin/kafka-server-start.sh -daemon $HOME/kafka_2.13-3.5.0/config/server.properties --override broker.id=1 --override log.dirs=$HOME/kafka-logs --override zookeeper.connect=<zookeeper ip>:2181
On the third Broker-enabled instance, set the Broker ID to 2 and start Broker. Replace <zookeeper ip> with the private IP address of the ZooKeeper-enabled instance.
KAFKA_HEAP_OPTS="-Xmx4G -Xms4G" smc_run bash $HOME/kafka_2.13-3.5.0/bin/kafka-server-start.sh -daemon $HOME/kafka_2.13-3.5.0/config/server.properties --override broker.id=2 --override log.dirs=$HOME/kafka-logs --override zookeeper.connect=<zookeeper ip>:2181
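The three start commands differ only in the Broker ID, and the test without eRDMA differs only in the missing smc_run. The following sketch prints the command for a given Broker ID so that you can review it before you run it on the corresponding instance. The function name broker_start_cmd and its third argument are hypothetical, and the IP address in the usage example is a placeholder.

```shell
# broker_start_cmd BROKER_ID ZOOKEEPER_IP [smc|tcp]
# Prints (does not run) the Broker start command used in this topic.
# The third argument defaults to "smc", which prefixes the command with smc_run.
# Pass "tcp" to print the command for the test without eRDMA.
broker_start_cmd() {
    if [ "${3:-smc}" = "smc" ]; then
        runner="smc_run "
    else
        runner=""
    fi
    # $HOME is printed literally so that it expands on the Broker instance.
    kafka='$HOME/kafka_2.13-3.5.0'
    printf 'KAFKA_HEAP_OPTS="-Xmx4G -Xms4G" %sbash %s/bin/kafka-server-start.sh -daemon %s/config/server.properties --override broker.id=%s --override log.dirs=$HOME/kafka-logs --override zookeeper.connect=%s:2181\n' \
        "$runner" "$kafka" "$kafka" "$1" "$2"
}

# Example usage (172.17.0.10 is a placeholder for the ZooKeeper private IP):
#   for id in 0 1 2; do broker_start_cmd "$id" 172.17.0.10; done
```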
Step 4: Perform performance tests
Download the benchmark tool and configure it to drive traffic at the highest available network bandwidth. Test Kafka performance with eRDMA enabled and with eRDMA disabled, and then compare the results to evaluate the performance improvement that eRDMA brings to the Kafka cluster.
Log on to the stress testing instance. Download and compile Open Messaging Benchmark.
Download and install Maven, which is used to compile Open Messaging Benchmark.
wget https://dlcdn.apache.org/maven/maven-3/3.8.8/binaries/apache-maven-3.8.8-bin.tar.gz
tar -xf apache-maven-3.8.8-bin.tar.gz
export PATH=$PATH:$HOME/apache-maven-3.8.8/bin/
Modify the configuration file of the Maven repository to accelerate downloads.
vi $HOME/apache-maven-3.8.8/conf/settings.xml
Add the following content to the <mirrors> element in the settings.xml file. Then, save and close the configuration file.
<mirror>
  <id>nexus-aliyun</id>
  <mirrorOf>central</mirrorOf>
  <name>Nexus aliyun</name>
  <url>http://maven.aliyun.com/nexus/content/groups/public</url>
</mirror>
Download and compile the Open Messaging Benchmark source code.
git clone https://github.com/openmessaging/benchmark.git
cd benchmark && mvn clean verify -DskipTests
Specify the private IP addresses of Broker-enabled instances in the kafka-throughput.yaml file.
vi $HOME/benchmark/driver-kafka/kafka-throughput.yaml
In the kafka-throughput.yaml file, set the bootstrap.servers parameter to the private IP addresses of Broker-enabled instances in the following format: <Private IP address of Broker 0>:9092,<Private IP address of Broker 1>:9092,<Private IP address of Broker 2>:9092.
commonConfig: |
  bootstrap.servers=<172.17.XX.XX>:9092,<172.17.XX.XX>:9092,<172.17.XX.XX>:9092
  default.api.timeout.ms=1200000
  request.timeout.ms=1200000
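The angle brackets in the format above mark placeholders and are not part of the final value. The following sketch builds the bootstrap.servers value from a list of private IP addresses. The function name bootstrap_servers is hypothetical, and the addresses in the usage example are placeholders.

```shell
# bootstrap_servers IP...
# Joins the given Broker private IP addresses into a comma-separated
# host:port list that uses port 9092, as in the format shown in this topic.
bootstrap_servers() {
    out=""
    for ip in "$@"; do
        out="${out:+$out,}$ip:9092"
    done
    printf '%s\n' "$out"
}

# Example usage (placeholder addresses):
#   echo "bootstrap.servers=$(bootstrap_servers 172.17.0.11 172.17.0.12 172.17.0.13)"
```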
Set the message sending rate so that the test drives the Kafka cluster at the highest network bandwidth available to it.
vi $HOME/benchmark/workloads/1-topic-100-partitions-1kb-4p-4c-2000k.yaml
Change the producerRate: <Message sending rate> parameter in the file. The message sending rate is calculated by using the following formula: Available bandwidth of a Broker-enabled instance/Size of a single message.
In this example, the instance type of the Broker-enabled instances is ecs.g8a.2xlarge. The maximum network bandwidth of this instance type is 4 Gbit/s, and the total bandwidth of the three Broker-enabled instances is 12 Gbit/s. Because each message is stored in three replicas (the Kafka triplicate mechanism), the actual available bandwidth is about one third of the total bandwidth. Therefore, the available bandwidth of a Broker-enabled instance is calculated as 12 Gbit/s divided by 3, which equals 4 Gbit/s (or 512 MB/s). The workload file in the workloads directory uses a message size of 1 KB, so the message sending rate is calculated as 512 MB/s divided by 1 KB, which equals 524,288 messages per second. In this case, change the producerRate: <Message sending rate> parameter to producerRate: 524288 in the file. In the actual test, set the message sending rate based on your business requirements.
Then, test the performance of the Kafka cluster with eRDMA enabled and with eRDMA disabled.
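The producerRate calculation in this step can be reproduced with shell integer arithmetic, which is useful if you use a different instance type or message size. The following sketch uses binary units, as the example does (1 Gbit = 1024^3 bits, 1 KB = 1,024 bytes). The function name producer_rate and its argument order are hypothetical.

```shell
# producer_rate PER_BROKER_GBIT BROKER_COUNT REPLICAS MESSAGE_BYTES
# Prints the message sending rate in messages per second:
#   (per-Broker bandwidth x Broker count / replicas) / message size
# Binary units are assumed, matching the worked example in this topic.
producer_rate() {
    total_bits=$(( $1 * $2 * 1024 * 1024 * 1024 ))
    usable_bytes=$(( total_bits / $3 / 8 ))
    echo $(( usable_bytes / $4 ))
}

# Example usage with the values from this topic
# (4 Gbit/s per Broker, 3 Brokers, 3 replicas, 1 KB messages):
#   producer_rate 4 3 3 1024    # prints 524288
```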
Perform a performance test when eRDMA is enabled
smc_run $HOME/benchmark/bin/benchmark --drivers $HOME/benchmark/driver-kafka/kafka-throughput.yaml $HOME/benchmark/workloads/1-topic-100-partitions-1kb-4p-4c-2000k.yaml
During the Kafka performance test, you can perform the following operations at the same time:
Run the smcss -a command in another window on the stress testing instance to check whether SMC-R is used to transmit messages.
Run the sar command on each of the three Broker-enabled instances to check CPU utilization. For example, the sar 1 20 command samples data once per second, 20 times. Then, sum the CPU utilization values of the three Broker-enabled instances to obtain the total CPU utilization.
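The CPU utilization summation described above can be sketched as follows. The sketch assumes the default CPU report of sar in an English locale, in which the summary line starts with Average: and %idle is the last column, and it treats 100 minus %idle as the CPU utilization. The function names cpu_busy_from_sar and sum_values are hypothetical.

```shell
# cpu_busy_from_sar
# Reads `sar 1 20` output on stdin and prints 100 - %idle from the
# "Average:" line, formatted with two decimal places.
# Assumption: %idle is the last column of the Average line.
cpu_busy_from_sar() {
    awk '$1 == "Average:" { printf "%.2f\n", 100 - $NF }'
}

# sum_values NUMBER...
# Prints the sum of the given numbers with two decimal places.
sum_values() {
    printf '%s\n' "$@" | awk '{ total += $1 } END { printf "%.2f\n", total }'
}

# Example usage:
#   On each Broker-enabled instance:  sar 1 20 | cpu_busy_from_sar
#   Then, with the three results:     sum_values <broker 0> <broker 1> <broker 2>
```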
Perform a performance test when eRDMA is disabled
$HOME/benchmark/bin/benchmark --drivers $HOME/benchmark/driver-kafka/kafka-throughput.yaml $HOME/benchmark/workloads/1-topic-100-partitions-1kb-4p-4c-2000k.yaml
Important: To prevent residual test data from affecting the performance test in which eRDMA is disabled, delete the Broker and ZooKeeper records and restart Broker and ZooKeeper on the instances before you run the test.
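The cleanup described in the preceding note could look like the following sketch, which only prints the commands so that you can review them before you run them. It assumes that the Broker data is stored in $HOME/kafka-logs, as set by log.dirs in Step 3, and that ZooKeeper uses the dataDir value /tmp/zookeeper from the zookeeper.properties file shipped with Kafka. If you changed either path, adjust the commands. The function name reset_cmds is hypothetical.

```shell
# reset_cmds broker|zookeeper
# Prints (does not run) the commands that stop the service and delete its data.
# Assumptions: Broker data is in $HOME/kafka-logs (log.dirs from Step 3) and
# ZooKeeper data is in /tmp/zookeeper (dataDir in the shipped zookeeper.properties).
reset_cmds() {
    case "$1" in
        broker)
            printf '%s\n' \
                'bash $HOME/kafka_2.13-3.5.0/bin/kafka-server-stop.sh' \
                'rm -rf $HOME/kafka-logs'
            ;;
        zookeeper)
            printf '%s\n' \
                'bash $HOME/kafka_2.13-3.5.0/bin/zookeeper-server-stop.sh' \
                'rm -rf /tmp/zookeeper'
            ;;
        *)
            echo "usage: reset_cmds broker|zookeeper" >&2
            return 1
            ;;
    esac
}
```

Stop the three Brokers before you stop ZooKeeper. Afterward, start ZooKeeper first and then start each Broker by using the command from Step 3 without smc_run.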
Obtain the latency information from the results of the two tests to evaluate the performance enhancement that eRDMA contributes to the Kafka cluster.
Find and view the last Aggregated Pub Latency (ms) entry. avg represents the average latency, 99% represents the P99 latency, and 99.9% represents the P999 latency.
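If you save the benchmark output to a file, the last latency entry can be extracted instead of searched for by hand. The following sketch matches only the Aggregated Pub Latency (ms) label named in this topic and makes no assumption about the rest of the line. The function name last_pub_latency and the log file names in the usage example are hypothetical.

```shell
# last_pub_latency
# Reads benchmark output on stdin and prints the last line that contains
# the "Aggregated Pub Latency (ms)" label. Prints nothing if no line matches.
last_pub_latency() {
    grep -F 'Aggregated Pub Latency (ms)' | tail -n 1
}

# Example usage (hypothetical log file names, one per test run):
#   last_pub_latency < erdma-on.log
#   last_pub_latency < erdma-off.log
```

Comparing the two extracted lines side by side shows the difference in the average, P99, and P999 latency between the two tests.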