Elastic Compute Service: Deploy a high-network-performance Kafka cluster on eRDMA-capable instances

Last Updated: May 20, 2024

You can deploy a Kafka cluster on Elastic Compute Service (ECS) instances on which Elastic Remote Direct Memory Access (eRDMA) is enabled. The cluster can then take advantage of the low latency, high throughput, and low CPU utilization of eRDMA to transmit data between nodes more efficiently, which suits scenarios that require high message throughput and low latency. This topic describes how to deploy a Kafka cluster on eRDMA-capable ECS instances and how to measure the performance improvement that eRDMA provides.

Note
  • Kafka is a distributed stream processing platform that efficiently processes and stores a large number of data streams and supports real-time message publishing and subscription. Kafka is widely used in scenarios such as log aggregation, event sourcing, and real-time analytics. For more information, see Kafka documentation.

  • eRDMA is a Remote Direct Memory Access (RDMA) service developed by Alibaba Cloud to ensure high network performance with low latency, high throughput, and high elasticity. For more information, see Overview.

Step 1: Prepare ECS instances

Before you deploy a Kafka cluster, prepare multiple ECS instances on which to deploy the Broker and ZooKeeper services and to build a stress testing environment.

  • A Broker-enabled instance serves as a core data node in a cluster to store, transmit, and manage messages.

  • A ZooKeeper-enabled instance implements distributed service coordination and management in the Kafka cluster.

  • A stress testing instance is used to test the performance of the deployed Kafka cluster.

In this example, five ECS instances are created, which include one ZooKeeper-enabled instance, three Broker-enabled instances, and one stress testing instance. The following table describes the requirements for the instance configurations.

Important

The selected instance types must support eRDMA. For information about the instance types that support eRDMA, see the Limits section in the "Configure eRDMA on an enterprise-level instance" topic.

| Purpose | Instance requirement | Disk requirement | Network requirement | Image requirement |
| --- | --- | --- | --- | --- |
| Broker-enabled instance | Create three instances. In this example, the ecs.g8a.2xlarge instance type is used. | Select Enterprise SSDs (ESSDs) at performance level 3 (PL3). Select the disk capacity based on your business requirements. | You must configure public IP addresses for the instances. You must deploy the instances in the same virtual private cloud (VPC); by default, the instances can communicate with each other within the internal network. You must enable the eRDMA feature for the instances. For more information, see Configure eRDMA on an enterprise-level instance. | Alibaba Cloud Linux 3.2104 LTS 64-bit. |
| ZooKeeper-enabled instance | Create one instance. In this example, the ecs.g8a.xlarge instance type is used. | None. | Same as for the Broker-enabled instances. | Same as for the Broker-enabled instances. |
| Stress testing instance | Create one instance. In this example, the ecs.g8a.16xlarge instance type is used. | None. | Same as for the Broker-enabled instances. | Same as for the Broker-enabled instances. |

Step 2: Install required tools and Kafka

Log on to each of the five ECS instances prepared in Step 1 and run the following commands to install the Shared Memory Communications over Remote Direct Memory Access (SMC-R) toolkit, Java, Git, and Kafka.

Note

Before you can use the eRDMA feature, you must deploy SMC-R. SMC-R works in the kernel space, and the SMC-R protocol stack helps instances use, manage, and maintain eRDMA resources. For information about SMC-R, see Use SMC.

  1. Log on to all ECS instances in sequence.

    For more information, see Connect to a Linux instance by using a password or key.

  2. (Conditionally required) Run the uname -r command to view the kernel version. Make sure that the kernel version of all instances is 5.10.134-16.3 or later. If the kernel version of an instance is earlier than 5.10.134-16.3, run the following commands to upgrade the kernel to the latest version:

    sudo yum update kernel
    sudo reboot
  3. Run the following command to install the smc-tools toolkit on each instance:

    sudo yum install smc-tools -y
  4. Run the following command to check whether eRDMA is enabled on each instance:

    smcr dev

    Sample output:

    Net-Dev  IB-Dev   IB-P  IB-State  Type    Crit  #Links  PNET-ID
    eth0     erdma_0  1     ACTIVE    0x107f  No    0
  5. Run the following command to disable IPv6 on each instance.

    Note

    Alibaba Cloud eRDMA devices and SMC devices do not support IPv6 addresses. After you disable IPv6, traffic can pass through RDMA channels by using IPv4.

    sudo sysctl net.ipv6.conf.all.disable_ipv6=1
  6. Run the following command to install Java and Git:

    sudo yum install java-11-openjdk-1:11.0.21.0.9-2.0.3.al8 java-11-openjdk-devel-1:11.0.21.0.9-2.0.3.al8 git -y
  7. Run the following commands to download and decompress the Kafka package:

    wget https://archive.apache.org/dist/kafka/3.5.0/kafka_2.13-3.5.0.tgz
    tar -xf kafka_2.13-3.5.0.tgz
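
The kernel version check in step 2 can also be scripted. The following is a minimal sketch and not part of the original procedure: kernel_at_least is a hypothetical helper name, and the comparison assumes GNU sort, whose -V option orders version strings by numeric component.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if version $1 is greater than or equal to version $2.
# Assumes GNU sort; the -V option compares version strings component by component.
kernel_at_least() {
    local current="$1" required="$2"
    # The older of the two versions sorts first. If that is the required
    # version, the current version is new enough.
    [ "$(printf '%s\n%s\n' "$current" "$required" | sort -V | head -n 1)" = "$required" ]
}

REQUIRED_KERNEL="5.10.134-16.3"   # minimum version stated in step 2

if kernel_at_least "$(uname -r)" "$REQUIRED_KERNEL"; then
    echo "Kernel $(uname -r) meets the minimum version $REQUIRED_KERNEL."
else
    echo "Kernel $(uname -r) is older than $REQUIRED_KERNEL. Upgrade the kernel and reboot."
fi
```

Running the check on all five instances before you install smc-tools shows which instances need the kernel upgrade.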

Step 3: Start ZooKeeper and Broker for Kafka

  1. Log on to all ECS instances in sequence.

  2. Add the mapping between the private IP address and hostname of an instance to the /etc/hosts file on each instance.


  3. Run the following command on the ZooKeeper-enabled instance to start ZooKeeper:

    bash $HOME/kafka_2.13-3.5.0/bin/zookeeper-server-start.sh -daemon $HOME/kafka_2.13-3.5.0/config/zookeeper.properties
  4. Start Broker on each of the three Broker-enabled instances.

    Note

    If you perform a test without using the eRDMA feature, remove the smc_run prefix from the command.

    • On the first Broker-enabled instance, set the Broker ID to 0 and start Broker. Replace <zookeeper ip> with the private IP address of the ZooKeeper-enabled instance.

      KAFKA_HEAP_OPTS="-Xmx4G -Xms4G" smc_run bash $HOME/kafka_2.13-3.5.0/bin/kafka-server-start.sh -daemon $HOME/kafka_2.13-3.5.0/config/server.properties --override broker.id=0 --override log.dirs=$HOME/kafka-logs --override zookeeper.connect=<zookeeper ip>:2181
    • On the second Broker-enabled instance, set the Broker ID to 1 and start Broker. Replace <zookeeper ip> with the private IP address of the ZooKeeper-enabled instance.

      KAFKA_HEAP_OPTS="-Xmx4G -Xms4G" smc_run bash $HOME/kafka_2.13-3.5.0/bin/kafka-server-start.sh -daemon $HOME/kafka_2.13-3.5.0/config/server.properties --override broker.id=1 --override log.dirs=$HOME/kafka-logs --override zookeeper.connect=<zookeeper ip>:2181
    • On the third Broker-enabled instance, set the Broker ID to 2 and start Broker. Replace <zookeeper ip> with the private IP address of the ZooKeeper-enabled instance.

      KAFKA_HEAP_OPTS="-Xmx4G -Xms4G" smc_run bash $HOME/kafka_2.13-3.5.0/bin/kafka-server-start.sh -daemon $HOME/kafka_2.13-3.5.0/config/server.properties --override broker.id=2 --override log.dirs=$HOME/kafka-logs --override zookeeper.connect=<zookeeper ip>:2181
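
The three start commands above differ only in the broker.id value. As a sketch, a small helper can print the command for a given Broker ID so that you can review it before you run it on the matching instance. The function name broker_start_cmd, the KAFKA_DIR variable, and the sample IP address 192.168.0.10 are illustrative assumptions. The printed command line mirrors the commands in step 4.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Broker start command for one Broker ID.
# Arguments: <broker id> <private IP of the ZooKeeper-enabled instance> [yes|no: prefix with smc_run]
KAFKA_DIR='$HOME/kafka_2.13-3.5.0'   # kept unexpanded so that $HOME expands on the target instance

broker_start_cmd() {
    local id="$1" zk_ip="$2" use_smc="${3:-yes}"
    local prefix=""
    # smc_run lets the TCP sockets of the process use SMC-R; omit it for the test without eRDMA.
    if [ "$use_smc" = "yes" ]; then
        prefix="smc_run "
    fi
    printf 'KAFKA_HEAP_OPTS="-Xmx4G -Xms4G" %sbash %s/bin/kafka-server-start.sh -daemon %s/config/server.properties --override broker.id=%s --override log.dirs=$HOME/kafka-logs --override zookeeper.connect=%s:2181\n' \
        "$prefix" "$KAFKA_DIR" "$KAFKA_DIR" "$id" "$zk_ip"
}

# Print the commands for Broker IDs 0, 1, and 2 (sample ZooKeeper IP address).
for id in 0 1 2; do
    broker_start_cmd "$id" "192.168.0.10"
done
```

Passing no as the third argument prints the command without smc_run, which corresponds to the test in which eRDMA is not used.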

Step 4: Perform performance tests

Download the Benchmark tool and simulate an environment for which the highest available network bandwidth is configured. Test Kafka performance when eRDMA is enabled and when eRDMA is disabled. Compare the test results to evaluate the performance enhancement that eRDMA contributes to the Kafka cluster.

  1. Log on to the stress testing instance. Download and compile Open Messaging Benchmark.

    1. Download and install Maven, which is used to compile Open Messaging Benchmark.

      wget https://dlcdn.apache.org/maven/maven-3/3.8.8/binaries/apache-maven-3.8.8-bin.tar.gz
      tar -xf apache-maven-3.8.8-bin.tar.gz
      export PATH=$PATH:$HOME/apache-maven-3.8.8/bin/
    2. Modify the configuration file of the Maven repository to accelerate downloads.

      vi $HOME/apache-maven-3.8.8/conf/settings.xml

      Add the following content inside the <mirrors> element of the settings.xml file. Then, save and close the file.

      <mirror>
          <id>nexus-aliyun</id>
          <mirrorOf>central</mirrorOf>
          <name>Nexus aliyun</name>
          <url>http://maven.aliyun.com/nexus/content/groups/public</url>
      </mirror>
    3. Download and compile the Open Messaging Benchmark source code.

      git clone https://github.com/openmessaging/benchmark.git
      cd benchmark && mvn clean verify -DskipTests
  2. Specify the private IP addresses of Broker-enabled instances in the kafka-throughput.yaml file.

    vi $HOME/benchmark/driver-kafka/kafka-throughput.yaml

    In the kafka-throughput.yaml file, set the bootstrap.servers parameter to the private IP addresses of Broker-enabled instances in the following format: <Private IP address of Broker 0>:9092,<Private IP address of Broker 1>:9092,<Private IP address of Broker 2>:9092.

    commonConfig: |
      bootstrap.servers=<172.17.XX.XX>:9092,<172.17.XX.XX>:9092,<172.17.XX.XX>:9092
      default.api.timeout.ms=1200000
      request.timeout.ms=1200000
  3. Specify the rate at which Kafka messages are sent to simulate the highest available network bandwidth for the environment in which the performance of the Kafka cluster is tested.

    vi $HOME/benchmark/workloads/1-topic-100-partitions-1kb-4p-4c-2000k.yaml

    Change the value of the producerRate parameter in the file. Calculate the message sending rate by using the following formula: Available bandwidth of a Broker-enabled instance / Size of a single message.

    In this example, the Broker-enabled instances are of the ecs.g8a.2xlarge instance type, which has a maximum network bandwidth of 4 Gbit/s, so the total bandwidth of the three Broker-enabled instances is 12 Gbit/s. Because of the Kafka triplicate mechanism, in which each message is stored three times, the actual available bandwidth is about one third of the total bandwidth: 12 Gbit/s divided by 3 equals 4 Gbit/s, or 512 MB/s. The size of each message in this workload is 1 KB, so the message sending rate is 512 MB/s divided by 1 KB, which equals 524,288 messages per second. In this case, set producerRate: 524288 in the file. In an actual test, set the message sending rate based on your business requirements.

  4. Test the performance of the Kafka cluster, first with eRDMA enabled and then with eRDMA disabled.

    Perform a performance test when eRDMA is enabled

    smc_run $HOME/benchmark/bin/benchmark --drivers $HOME/benchmark/driver-kafka/kafka-throughput.yaml $HOME/benchmark/workloads/1-topic-100-partitions-1kb-4p-4c-2000k.yaml

    During the Kafka performance test, you can perform the following operations at the same time:

    • Run the smcss -a command in another window on the stress testing instance to check whether SMC-R is used to transmit messages.

    • Run the sar command on each of the three Broker-enabled instances to check CPU utilization. For example, the sar 1 20 command samples data once per second, 20 times. Sum the CPU utilization values of the three Broker-enabled instances to obtain the total CPU utilization.

    Perform a performance test when eRDMA is disabled

    $HOME/benchmark/bin/benchmark --drivers $HOME/benchmark/driver-kafka/kafka-throughput.yaml $HOME/benchmark/workloads/1-topic-100-partitions-1kb-4p-4c-2000k.yaml

    Important

    To prevent leftover test data from skewing the test in which eRDMA is disabled, delete the Broker and ZooKeeper records and restart Broker and ZooKeeper on the instances before you run that test.

    Delete the Broker and ZooKeeper records and restart Broker and ZooKeeper

    • Stop ZooKeeper on the ZooKeeper-enabled instance and delete the ZooKeeper records.

      bash $HOME/kafka_2.13-3.5.0/bin/zookeeper-server-stop.sh
      rm -rf /tmp/zookeeper/
    • Restart ZooKeeper on the ZooKeeper-enabled instance.

      bash $HOME/kafka_2.13-3.5.0/bin/zookeeper-server-start.sh -daemon $HOME/kafka_2.13-3.5.0/config/zookeeper.properties
    • Stop Broker on each Broker-enabled instance and delete the Broker records.

      bash $HOME/kafka_2.13-3.5.0/bin/kafka-server-stop.sh
      rm -rf $HOME/kafka-logs/
    • Restart Broker on each Broker-enabled instance. On each instance, run only the command whose broker.id value is assigned to that instance.

      KAFKA_HEAP_OPTS="-Xmx4G -Xms4G" smc_run bash $HOME/kafka_2.13-3.5.0/bin/kafka-server-start.sh -daemon $HOME/kafka_2.13-3.5.0/config/server.properties --override broker.id=0 --override log.dirs=$HOME/kafka-logs --override zookeeper.connect=<zookeeper ip>:2181
      KAFKA_HEAP_OPTS="-Xmx4G -Xms4G" smc_run bash $HOME/kafka_2.13-3.5.0/bin/kafka-server-start.sh -daemon $HOME/kafka_2.13-3.5.0/config/server.properties --override broker.id=1 --override log.dirs=$HOME/kafka-logs --override zookeeper.connect=<zookeeper ip>:2181
      KAFKA_HEAP_OPTS="-Xmx4G -Xms4G" smc_run bash $HOME/kafka_2.13-3.5.0/bin/kafka-server-start.sh -daemon $HOME/kafka_2.13-3.5.0/config/server.properties --override broker.id=2 --override log.dirs=$HOME/kafka-logs --override zookeeper.connect=<zookeeper ip>:2181
      • Replace <zookeeper ip> with the private IP address of the ZooKeeper-enabled instance.

      • If you test the performance without using the eRDMA feature, remove the smc_run prefix from the command.

  5. Obtain the latency information from the results of the two tests to evaluate the performance enhancement that eRDMA contributes to the Kafka cluster.

    Find the last Aggregated Pub Latency (ms) entry in the output of each test. avg represents the average latency, 99% represents the P99 latency, and 99.9% represents the P999 latency.
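
The producerRate arithmetic from step 3 of this section can be reproduced in the shell. This is a sketch under the same assumptions as the example: binary units (1 Gbit/s is treated as 1,024 Mbit/s and 1 MB as 1,024 KB), three copies of each message, and 1 KB messages. The variable names are illustrative.

```shell
#!/usr/bin/env bash
# Illustrative variables; the values come from the example in step 3.
BROKER_COUNT=3          # number of Broker-enabled instances
PER_BROKER_GBITS=4      # maximum bandwidth of one ecs.g8a.2xlarge instance, in Gbit/s
REPLICAS=3              # triplicate mechanism: each message is stored three times
MESSAGE_SIZE_KB=1       # message size used by the workload, in KB

total_gbits=$(( BROKER_COUNT * PER_BROKER_GBITS ))               # total bandwidth in Gbit/s
available_gbits=$(( total_gbits / REPLICAS ))                    # available bandwidth in Gbit/s
available_mbytes=$(( available_gbits * 1024 / 8 ))               # Gbit/s converted to MB/s
producer_rate=$(( available_mbytes * 1024 / MESSAGE_SIZE_KB ))   # messages per second

echo "producerRate: ${producer_rate}"
```

With the example values, the script prints producerRate: 524288. To size the rate for a different instance type or message size, change the variables at the top.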