Elastic Compute Service: Monitor and check eRDMA

Last Updated: Dec 24, 2024

You can monitor and check elastic Remote Direct Memory Access (eRDMA) to identify and resolve issues at the earliest opportunity, ensure system security, and efficiently manage and optimize system resources. This topic describes several methods and tools that you can use to monitor and check eRDMA.

Prerequisites

eRDMA is installed and configured on an Elastic Compute Service (ECS) instance. For information about how to configure eRDMA, see Enable eRDMA on an ECS instance.

Use CloudMonitor to monitor eRDMA

You can use Alibaba Cloud CloudMonitor to monitor the working status of eRDMA. You can specify custom CloudMonitor metrics based on your business requirements to process, report, and alert on eRDMA monitoring data. For more information, see Custom Monitoring.

View the CloudMonitor metrics supported by eRDMA

  1. Log on to the CloudMonitor Metric console.

  2. On the Elastic Compute Service (ECS) page, enter eri in the search box above the metric list to search for the CloudMonitor metrics that are supported by eRDMA.

    image

Use eadm to diagnose and troubleshoot issues in eRDMA

eadm is an in-house, user-space management tool that the eRDMA driver installs automatically on an ECS instance. It provides diagnostic and real-time monitoring capabilities that help you identify faults. eadm provides the following features:

  • Collect device-wide real-time traffic statistics for traffic monitoring and assisted diagnostics.

  • Query and apply configurations, such as the delay ack feature and congestion control (CC) algorithms.

The following section describes several common eadm commands. For information about other eadm commands, run the eadm -h command to view the command help.

Warning

eadm is used only for diagnostic and debugging purposes and is subject to changes. eadm is not suitable for all scenarios.

  • View the help documentation for the eadm commands

    eadm -h

    image

  • Monitor real-time traffic of an eRDMA device

    eRDMA devices whose driver versions are 0.2.34 or later support the traffic statistics feature.

    eadm stat -d <ibdev_name> -l

    <ibdev_name> specifies the name of the eRDMA device. You can run the ibv_devinfo command to query the names of eRDMA devices. Replace <ibdev_name> with an actual eRDMA device name. If only one eRDMA device is available in your environment, you can omit the -d <ibdev_name> parameter.

    image

  • Retrieve statistics about an eRDMA device, such as the number of CM and verbs messages and traffic volumes.

    eadm stat -d <ibdev_name>

    <ibdev_name> specifies the name of the eRDMA device. You can run the ibv_devinfo command to query the names of eRDMA devices. Replace <ibdev_name> with an actual eRDMA device name. If only one eRDMA device is available in your environment, you can omit the -d <ibdev_name> parameter.

    image

  • Retrieve the version information of the current eRDMA driver.

    eadm ver
Note

Limits apply when you run other eadm commands, such as info, dump, and conf. We recommend that you do not use other eadm commands.
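
When several eRDMA devices are attached to an instance, the per-device statistics command described above can be run once for each device name. The following is a minimal sketch, not part of eadm. It assumes that ibv_devinfo -l prints a count line followed by one indented device name per line, which is the usual rdma-core layout. The helper names list_ibdevs and stat_all_devices are hypothetical.

```shell
#!/bin/sh
# Extract device names from `ibv_devinfo -l` output read on stdin.
# Assumed layout: a "<n> HCAs found:" header, then one indented name per line.
list_ibdevs() {
    awk 'NR > 1 && NF > 0 { print $1 }'
}

# Run `eadm stat -d <ibdev_name>` once for each detected device.
# (Hypothetical helper; only calls the commands shown in this topic.)
stat_all_devices() {
    ibv_devinfo -l | list_ibdevs | while read -r dev; do
        echo "== $dev =="
        eadm stat -d "$dev"
    done
}
```

After the functions are defined in a shell on the instance, call stat_all_devices to print one statistics block per device.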

Use Iproute2 to monitor and check eRDMA

Iproute2 is a suite of tools used to configure and manage Linux networks. Iproute2 provides a series of command-line utilities, such as ip and ss, which are used to manage and configure network interfaces, routing tables, and the traffic control feature. This helps network administrators quickly identify and resolve network connectivity issues. Iproute2 also provides rdma commands that can be used to monitor and check RDMA subsystems.

Note

Iproute2 is pre-installed in most Linux distributions, including Alibaba Cloud Linux 3 and Ubuntu 20.04 and later. For more information, see the official documentation of each operating system.

  • Query the status of eRDMA devices.

    rdma link

    image

  • Query the resource usage of the eRDMA device, such as the number of Completion Queues (CQs), Queue Pairs (QPs), and Memory Regions (MRs).

    Note

    In RDMA network communication, Queue Pair (QP), Completion Queue (CQ), Memory Region (MR), and verbs Opcode are the core components. They play important roles in RDMA communication and ensure high efficiency and low latency of RDMA network communication.

    For more information, see Basic capabilities and specifications of eRDMA.

    rdma res

    image

  • Query the performance statistics about the eRDMA device, such as the number of connections, connection status, and number of sent and received packets.

    rdma -p stat

    image
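
The device status query can also be scripted, for example to flag links that are not in the ACTIVE state. The following is a minimal sketch under an assumption about the output format: each rdma link line contains the token state followed by the state name, and the device/port name is the token immediately before state. This matches both the older "1/1: dev/1: state ..." layout and the newer "link dev/1 state ..." layout. The function name inactive_links is hypothetical.

```shell
#!/bin/sh
# Print the name of every link whose state is not ACTIVE.
# Reads `rdma link` output on stdin. The name is taken from the token
# right before "state"; a trailing colon (older iproute2 format) is removed.
inactive_links() {
    awk '{
        for (i = 2; i < NF; i++) {
            if ($i == "state" && $(i + 1) != "ACTIVE") {
                name = $(i - 1)
                sub(/:$/, "", name)
                print name
            }
        }
    }'
}
```

A typical call is rdma link | inactive_links. Empty output means that every listed link reported ACTIVE.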

Use the diagnose tool to check RDMA-related issues and evaluate eRDMA performance

You can use the diagnose tool to check the basic functionality of eRDMA, eRDMA high-performance computing (HPC) environments, and the basic latencies of eRDMA. This helps you effectively use eRDMA.

The diagnose tool may return one of the following results for a check item:

  • PASS: The check item passed the check.

  • SKIP: The check is not supported in the current environment and is skipped.

  • FAIL: The check tool is not installed or the check item failed the check. You can run the commands that are listed in the fail info section to check the FAIL items and troubleshoot issues.

  • Other INFO entries: eRDMA-related configuration information, such as the installation mode, driver versions, and CC algorithms.

Install the diagnose tool

Run one of the following commands on an ECS instance on which eRDMA is configured to obtain the diagnose tool.

  • Run the following command to obtain the diagnose tool from an internal URL:

    wget http://mirrors.cloud.aliyuncs.com/erdma/tools/diagnose.py
  • Run the following command to obtain the diagnose tool from a public URL:

    wget https://mirrors.aliyun.com/erdma/tools/diagnose.py
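
If you do not know in advance whether the instance can reach the internal mirror, the two downloads above can be tried in order. The following is a minimal sketch, assuming that a short timeout is acceptable for the first attempt. The function name fetch_diagnose and the FETCH variable are hypothetical. The wget options used (-q, -T, -t, -O) are standard GNU Wget options.

```shell
#!/bin/sh
# Download command; the URL is appended as the last argument.
# -T 5: 5-second timeout, -t 1: a single attempt, -O: output file name.
FETCH="${FETCH:-wget -q -T 5 -t 1 -O diagnose.py}"

# Try each URL in order and stop at the first successful download.
fetch_diagnose() {
    for url in "$@"; do
        # $FETCH is intentionally unquoted so that its options are split.
        if $FETCH "$url"; then
            echo "downloaded from $url"
            return 0
        fi
    done
    echo "all downloads failed" >&2
    return 1
}

# Example call (internal URL first, then the public URL):
# fetch_diagnose http://mirrors.cloud.aliyuncs.com/erdma/tools/diagnose.py \
#                https://mirrors.aliyun.com/erdma/tools/diagnose.py
```

Note that wget -O can leave an empty diagnose.py file behind after a failed attempt. A later successful download overwrites it.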

View the help of the diagnose tool

python diagnose.py -h

image

Check the basic functionality of eRDMA

You can use the diagnose tool to check the basic functionality of eRDMA, including whether eRDMA drivers are installed as expected, whether network connectivity is good, and how the eRDMA kernel-mode driver is installed. Passing these checks indicates that the basic functionality of eRDMA works as expected, so that your workloads can benefit from the high throughput and low latency of eRDMA.

Check items used in an eRDMA basic functionality check

Check item: erdma device
  Description: Whether eRDMA devices exist.
  Expected result: PASS
  Error result and solution: FAIL: eRDMA may not have been enabled, or an eRDMA interface (ERI) may not have been added as a secondary elastic network interface (ENI), when the ECS instance was created. Enable eRDMA or add an ERI as a secondary ENI. For more information, see Enable eRDMA on an ECS instance.

Check item: erdma installed
  Description: Whether eRDMA drivers are installed as expected.
  Expected result: PASS
  Error result and solution: FAIL: eRDMA drivers are not installed as expected. Check the steps that you performed to install eRDMA drivers during eRDMA configuration or re-install the drivers. For more information, see Install eRDMA drivers on an ECS instance.

Check item: erdma loaded
  Description: Whether eRDMA drivers load as expected.
  Expected result: PASS
  Error result and solution: FAIL: eRDMA drivers do not load as expected. This issue may occur when the drivers are installed before the instance restarts. Run the modprobe erdma command to resolve the issue.

Check item: ibverbs loaded
  Description: Whether the ib_uverbs driver loads as expected.
  Expected result: PASS
  Error result and solution: FAIL: The ib_uverbs driver does not load as expected. Run the modprobe ib_uverbs command to resolve the issue.

Check item: erdma tools
  Description: Whether eRDMA-related tools are installed.
  Expected result: PASS
  Error result and solution: FAIL: Run the eadm, rdma, and ibv_devinfo commands to find out which tool is missing. In most cases, eRDMA-related tools are installed together with eRDMA drivers. Check the steps that you performed to install eRDMA drivers during eRDMA configuration or re-install the drivers. For more information, see the "Install eRDMA drivers on an ECS instance" section of the "Use eRDMA" topic.

Check item: hca detected
  Description: Whether eRDMA devices are detected by the user-space driver.
  Expected result: PASS
  Error result and solution: FAIL: eRDMA devices are not detected by the user-space driver. This issue occurs when the erdma device, erdma installed, erdma loaded, and ibverbs loaded check items fail the check. Check whether eRDMA drivers are installed and load as expected.

Check item: hca active
  Description: Whether the current device is enabled.
  Expected result: PASS
  Error result and solution: FAIL: This issue occurs when the ENI that corresponds to the current eRDMA device is not in the running state. The issue may occur in specific early kernel versions. Run the dhclient -v ethx command to enable the ENI and check whether the eRDMA device is in the ACTIVE state. For more information, see Check whether eRDMA is configured as expected.

Check item: erdma stats
  Description: Whether the statistics of eRDMA devices are free of errors.
  Expected result: PASS
  Error result and solution:
    • SKIP: The operating system may not support the rdma stat command.
    • FAIL: Error statistics about eRDMA devices may exist. When you request technical support, we recommend that you provide the rdma -p stat command output.

Check item: network config
  Description: Whether network connectivity is good.
  Expected result: PASS
  Error result and solution: FAIL: If the IP addresses of multiple ENIs fall within the same subnet, eRDMA may not work as expected in specific scenarios.

Check item: erdma dmesg
  Description: Whether the kernel is free of eRDMA-related alerts.
  Expected result: PASS
  Error result and solution: FAIL: eRDMA-related alerts exist in the kernel. Check the error details of the alerts and reload drivers to resolve the issues.

Check item: atomic support
  Description: Whether the eRDMA device supports RDMA Atomic Operation.
  Expected result: PASS
  Error result and solution: FAIL: The current eRDMA device does not support RDMA Atomic Operation. If you do not require RDMA Atomic Operation, ignore the error.
  Note: RDMA Atomic Operation is a feature that performs complete and consistent operations on memory at the atomic level and is suitable only for specific scenarios.

Check item: go-back-n support
  Description: Whether the eRDMA device supports the Go-back-N feature.
  Expected result: PASS
  Error result and solution:
    • SKIP: The current eRDMA device may not support queries for Go-back-N configurations.
    • FAIL: The eadm tool may not be installed as expected or the eRDMA device may not support the Go-back-N feature.
  Note: Go-back-N is an extension of eRDMA that is suitable only for specific scenarios. If you do not require the Go-back-N feature, ignore the error.

Check item: erdma install mode
  Description: The mode in which the eRDMA kernel-mode driver is installed.
  Expected result:
    • Standard: The eRDMA kernel-mode driver is installed in standard mode and supports only RDMA Connection Manager (CM) connections.
    • Compat: The eRDMA kernel-mode driver is installed in compatible mode and supports RDMA CM connections and out-of-band (OOB) connections.
  Error result and solution: FAIL: The installation mode of the eRDMA kernel-mode driver is not detected. This issue may occur when the erdma loaded item fails the check. Re-install the eRDMA kernel-mode driver. For more information, see Install eRDMA drivers on an ECS instance.

Check item: kernel driver version
  Description: The version of the eRDMA kernel-mode driver.
  Expected result: The version number of the eRDMA kernel-mode driver. Example: 0.2.37.
  Error result and solution: FAIL: The version of the eRDMA kernel-mode driver is not detected. This issue may occur when the erdma loaded or erdma tools item fails the check. Check whether the eRDMA driver is installed and loads as expected. For more information, see Check whether eRDMA is configured as expected.

Check item: rdma-core version
  Description: The version of the eRDMA user-mode driver.
  Expected result: The version number of the eRDMA user-mode driver. Example: 44.1-2.
  Error result and solution: FAIL: The version of the eRDMA user-mode driver is not detected. This issue may occur when the eRDMA user-mode driver is not installed as expected. Re-install the driver. For more information, see Install eRDMA drivers on an ECS instance.

Check item: cc algorithm
  Description: The CC algorithm of eRDMA.
  Expected result: The CC algorithm of eRDMA. Example: hpcc_rtt.
  Error result and solution: FAIL: The CC algorithm of eRDMA is not detected. This issue may occur when the erdma loaded or erdma tools item fails the check. Check whether eRDMA drivers are installed and load as expected.

Perform the following steps:

  1. Connect to an ECS instance on which eRDMA is configured.

    For more information, see Use Workbench to connect to a Linux instance over SSH.

  2. Run one of the following commands to obtain the diagnose tool.

    • Run the following command to obtain the diagnose tool from an internal URL:

      wget http://mirrors.cloud.aliyuncs.com/erdma/tools/diagnose.py
    • Run the following command to obtain the diagnose tool from a public URL:

      wget https://mirrors.aliyun.com/erdma/tools/diagnose.py
  3. Run the following command to check the basic functionality of eRDMA:

    python diagnose.py -d

    The following command output is returned, which includes the results of check items. For information about the check items, see Check items used in an eRDMA basic functionality check.

    image
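
The erdma loaded and ibverbs loaded check items can also be reproduced by hand when the diagnose tool is unavailable. The following is a minimal sketch, not the diagnose tool's own logic. It assumes that a module counts as loaded when its name appears in the first column of the lsmod output, and it suggests the modprobe commands given for those check items. The function names module_listed and report_missing_modules are hypothetical.

```shell
#!/bin/sh
# Succeed if the named kernel module appears in lsmod-style output on stdin.
# lsmod prints the module name in the first column.
module_listed() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Report which of the two modules are not loaded and print the
# modprobe command that the check item descriptions recommend.
report_missing_modules() {
    for mod in erdma ib_uverbs; do
        if ! lsmod | module_listed "$mod"; then
            echo "$mod is not loaded; try: modprobe $mod"
        fi
    done
}
```

Calling report_missing_modules prints nothing when both modules are loaded.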

Check an eRDMA HPC environment

If you want to run HPC applications in your eRDMA environment, you may require additional dependencies and configurations. You can use the diagnose tool to check the required dependencies for an eRDMA HPC environment.

Check items for dependencies used in an eRDMA HPC environment check

An eRDMA HPC environment check involves the following check items related to required dependencies: the CC algorithm of eRDMA, whether Go-back-N is supported, DAPL 1.0-related items, and DAPL 2.0-related items. If you do not require the dependencies, ignore the reported errors. For example, if you need only DAPL 2.0, ignore the errors that are reported about DAPL 1.0.

Check item: cc algorithm
  Description: The CC algorithm of eRDMA.
  Expected result: The CC algorithm of eRDMA. Example: hpcc_rtt.
  Error result and solution: FAIL: The CC algorithm of eRDMA is not detected. This issue may occur if the eadm tool is not installed as expected or does not support queries for the CC algorithm of eRDMA.

Check item: go-back-n support
  Description: Whether the eRDMA device supports the Go-back-N feature.
  Expected result: PASS
  Error result and solution:
    • SKIP: The current eRDMA device may not support queries for Go-back-N configurations.
    • FAIL: The eadm tool may not be installed as expected or the eRDMA device may not support the Go-back-N feature.
    If the Go-back-N feature is not supported, HPC applications may be affected. If you do not need the feature, ignore the error.

Check item: dapl1 install
  Description: Whether DAPL 1.0 is installed as expected.
  Expected result: PASS
  Error result and solution: FAIL: The shared libraries for DAPL 1.0 or the DAPL 1.0 configuration file does not exist. Check whether DAPL 1.0 is installed as expected. If you do not require DAPL 1.0, ignore the error.

Check item: dapl1 config
  Description: Whether eRDMA configurations are included in the DAPL 1.0 configuration file.
  Expected result: PASS
  Error result and solution: FAIL: No eRDMA configurations exist in the DAPL 1.0 configuration file. Check the DAPL 1.0 configuration file and add eRDMA configurations to the file. If you do not require DAPL 1.0, ignore the error.

Check item: dapl2 install
  Description: Whether DAPL 2.0 is installed as expected.
  Expected result: PASS
  Error result and solution: FAIL: The shared libraries for DAPL 2.0 or the DAPL 2.0 configuration file does not exist. Check whether DAPL 2.0 is installed as expected. If you do not require DAPL 2.0, ignore the error.

Check item: dapl2 config
  Description: Whether eRDMA configurations are included in the DAPL 2.0 configuration file.
  Expected result: PASS
  Error result and solution: FAIL: No eRDMA configurations exist in the DAPL 2.0 configuration file. Check the DAPL 2.0 configuration file and add eRDMA configurations to the file. If you do not require DAPL 2.0, ignore the error.

Check item: dapl2 test
  Description: Whether the dtest command runs as expected for DAPL 2.0.
  Expected result: PASS
  Error result and solution: FAIL: The dtest command fails to run. DAPL 2.0 may not be installed or configured as expected.

Perform the following steps:

  1. Connect to an ECS instance on which eRDMA is configured.

    For more information, see Use Workbench to connect to a Linux instance over SSH.

  2. Run one of the following commands to obtain the diagnose tool.

    • Run the following command to obtain the diagnose tool from an internal URL:

      wget http://mirrors.cloud.aliyuncs.com/erdma/tools/diagnose.py
    • Run the following command to obtain the diagnose tool from a public URL:

      wget https://mirrors.aliyun.com/erdma/tools/diagnose.py
  3. Run the following command to check the required dependencies for an eRDMA HPC environment:

    python diagnose.py --hpc-check

    In normal cases, the following command output is returned, which includes the results of check items. For information about the check items, see Check items for dependencies used in an eRDMA HPC environment check.

    image.png

Test eRDMA network performance

You can use the perftest feature of the diagnose tool to test the eRDMA network performance between ECS instances.

  • Prerequisites

    The following requirements are met:

    • eRDMA is installed and deployed as expected on all nodes (ECS instances) that you want to test. For information about how to configure eRDMA, see Enable eRDMA on an ECS instance.

    • Password-free SSH access is allowed between all nodes that you want to test. For more information, see Step 4: Configure password-free SSH logon.

    • The Python paramiko library is installed on all nodes that you want to test.

      Note
      • The diagnose tool uses paramiko for connections.

      • To install Python paramiko dependencies, use one of the following sets of commands based on the instance operating system. If you do not have special requirements for the Python version, we recommend that you use Python 3 to reduce configuration workload.

      Alibaba Cloud Linux or CentOS

      # python3
      sudo python3 -m pip install --upgrade pip
      sudo python3 -m pip install paramiko 
      # python2
      # If the Python version is Python 2 and python2-pip is not installed, install python2-pip.
      sudo yum -y install python2-pip
      sudo python2 -m pip install --upgrade pip==20.3.4
      sudo python2 -m pip install paramiko 

      Ubuntu

      # python3
      sudo python3 -m pip install --upgrade pip
      sudo python3 -m pip install paramiko
      # python2
      # If python2-pip is not installed on the current node, install python2-pip.
      sudo apt install software-properties-common
      sudo add-apt-repository universe
      sudo apt update
      sudo apt install python2
      sudo curl https://bootstrap.pypa.io/pip/2.7/get-pip.py --output get-pip.py
      sudo python2 get-pip.py
      sudo python2 -m pip install --upgrade pip==20.3.4
      sudo python2 -m pip install paramiko
  • Procedure

    1. Connect to an ECS instance on which eRDMA is configured.

      For more information, see Use Workbench to connect to a Linux instance over SSH.

    2. Run one of the following commands to obtain the diagnose tool.

      • Run the following command to obtain the diagnose tool from an internal URL:

        wget http://mirrors.cloud.aliyuncs.com/erdma/tools/diagnose.py
      • Run the following command to obtain the diagnose tool from a public URL:

        wget https://mirrors.aliyun.com/erdma/tools/diagnose.py
    3. Run the following command to check eRDMA latency:

      python diagnose.py --perftest --hosts <n1> <n2> --user <username> --key-file </path/to/private_key>

      Take note of the following parameters:

      • --hosts <n1> <n2>: specifies the nodes (ECS instances) that you want to check. Separate the nodes with spaces. Replace <n1> <n2> with the private IP addresses of ERIs on the nodes.

      • --user <username>: specifies the username that is used for password-free SSH logon. Replace <username> with an actual username.

      • --key-file </path/to/private_key>: specifies the absolute path of the private key file that is used for password-free SSH logon. Replace </path/to/private_key> with the actual absolute path of a private key file.

      The following command output is returned, which indicates the eRDMA latency between two ECS instances. For more information, see Test eRDMA network performance.

      Each table in the command output shows, for one operation, the latencies from each request initiator to each responder. Apart from the header row and header column, each cell contains the average latency in microseconds, followed by the 99.9th percentile latency in parentheses.

      image.png
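
Before you start a latency test, it can help to validate the arguments locally, because a wrong key path only surfaces after the SSH connection attempts. The following is a minimal sketch, assuming that the test needs at least two hosts (this topic shows two) and that the key file must be given as an absolute, readable path. The function name check_perftest_args is hypothetical and is not part of the diagnose tool.

```shell
#!/bin/sh
# Validate arguments before invoking the perftest command.
# Usage: check_perftest_args <key_file> <host> <host> [<host> ...]
check_perftest_args() {
    key=$1
    # The key file must be given as an absolute path.
    case $key in
        /*) ;;
        *) echo "key file path must be absolute: $key" >&2; return 1 ;;
    esac
    if [ ! -r "$key" ]; then
        echo "key file is not readable: $key" >&2
        return 1
    fi
    shift
    # Assumption: a latency test needs at least two nodes.
    if [ "$#" -lt 2 ]; then
        echo "at least two hosts are required" >&2
        return 1
    fi
    return 0
}
```

For example, run check_perftest_args </path/to/private_key> <n1> <n2> and start python diagnose.py --perftest with the same values only if the function succeeds.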