You can monitor and check elastic Remote Direct Memory Access (eRDMA) to identify and resolve issues at the earliest opportunity, ensure system security, and efficiently manage and optimize system resources. This topic describes several methods and tools that you can use to monitor and check eRDMA.
Prerequisites
eRDMA is installed and configured on an Elastic Compute Service (ECS) instance. For information about how to configure eRDMA, see Configure eRDMA on an enterprise-level instance.
Use CloudMonitor to monitor eRDMA
You can use Alibaba Cloud CloudMonitor to monitor the working status of eRDMA. Perform the following steps to view the CloudMonitor metrics that are supported by eRDMA:
Log on to the CloudMonitor Metric console.
Enter eri in the search box above the metric list to search for the CloudMonitor metrics that are supported by eRDMA.
Note: Alternatively, you can customize metrics based on your business requirements to process and receive reports and alerts about eRDMA monitoring data. For more information, see Custom Monitoring.
Use eadm to monitor eRDMA
eadm is an in-house, user-space management tool that is automatically installed by an eRDMA driver on an ECS instance to provide diagnostics and real-time monitoring capabilities and help identify faults. eadm provides the following features:
Device-wide real-time traffic statistics, including traffic monitoring and assisted diagnostics.
Configuration and query capabilities, including enabling the debugging feature and configuring congestion control (CC) algorithms.
The following section describes several common eadm commands. For information about other eadm commands, run the eadm -h command to view the command help.
eadm is used only for diagnostics and debugging purposes and is subject to changes. eadm may not be suitable for all scenarios.
Retrieve the supported primary command codes.
eadm -h
Retrieve real-time traffic information about an eRDMA device.
eadm stat -d <ibdev_name> -l
<ibdev_name> specifies the name of the eRDMA device. You can run the ibv_devinfo command to query the names of eRDMA devices. Replace <ibdev_name> with an actual eRDMA device name. If only one eRDMA device is available in your environment, you can omit the -d <ibdev_name> parameter.
Retrieve statistics about an eRDMA device, such as the number of CM and verbs messages and traffic volumes.
eadm stat -d <ibdev_name>
<ibdev_name> specifies the name of the eRDMA device. You can run the ibv_devinfo command to query the names of eRDMA devices. Replace <ibdev_name> with an actual eRDMA device name. If only one eRDMA device is available in your environment, you can omit the -d <ibdev_name> parameter.
Retrieve the version information of the current eRDMA driver.
eadm ver
Limits apply when you run other eadm commands such as info, dump, and conf. We recommend that you do not use other eadm commands.
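The eadm stat invocations described above differ only in whether the -d <ibdev_name> parameter and the -l option are present. The following Python sketch shows one way to assemble them from a script. The function names and the timeout parameter are illustrative assumptions rather than part of eadm, the device name erdma_0 in the usage note is a placeholder, and the output is returned as raw text because its format is not documented here.

```python
import subprocess
from typing import List, Optional


def eadm_stat_command(ibdev_name: Optional[str] = None, live: bool = False) -> List[str]:
    """Build the argument list for `eadm stat`.

    ibdev_name: eRDMA device name as reported by ibv_devinfo. Leave it as
                None when only one eRDMA device exists in the environment.
    live:       True appends -l to request real-time traffic information.
    """
    cmd = ["eadm", "stat"]
    if ibdev_name:
        cmd += ["-d", ibdev_name]
    if live:
        cmd.append("-l")
    return cmd


def run_eadm(cmd: List[str], timeout: float = 10.0) -> str:
    """Run an eadm command and return its raw text output.

    The timeout is an assumption of this sketch: it keeps the helper from
    blocking indefinitely if a command keeps printing updates.
    """
    result = subprocess.run(
        cmd, capture_output=True, text=True, check=True, timeout=timeout
    )
    return result.stdout
```

For example, eadm_stat_command("erdma_0", live=True) produces the arguments for eadm stat -d erdma_0 -l, and run_eadm(["eadm", "ver"]) returns the driver version text on an instance where eadm is installed.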
Use iproute2 to monitor eRDMA
Iproute2 is a next-generation toolkit that is used for TCP/IP networking and traffic control. Iproute2 is pre-installed in recent eRDMA versions and provides RDMA commands that can be used to monitor and check RDMA subsystems.
The simple and well-structured commands in Iproute2 replace the commands in net-tools, such as ifconfig, arp, route, and netstat. You can use Iproute2 to manage network interfaces, route tables, and traffic. This allows administrators to quickly identify and troubleshoot network connectivity issues.
Query statistics about eRDMA devices, such as the number of CM and verbs messages and traffic volumes.
rdma -p stat
Query the resource usage of eRDMA devices.
rdma res
Query the status of eRDMA devices.
rdma link
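When you collect information for troubleshooting, it can help to capture the output of all three rdma commands in one pass. The following Python sketch is one way to do that. The names RDMA_COMMANDS, run_command, and collect_rdma_snapshot are hypothetical, and the output is stored as unparsed text because no output format is assumed.

```python
import subprocess
from typing import Callable, Dict, List

# The three rdma invocations described in this section.
RDMA_COMMANDS: Dict[str, List[str]] = {
    "stat": ["rdma", "-p", "stat"],
    "res": ["rdma", "res"],
    "link": ["rdma", "link"],
}


def run_command(cmd: List[str]) -> str:
    """Run one command and return its stdout, or an error marker on failure."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True)
    except FileNotFoundError:
        # The rdma binary is not installed or not on PATH.
        return "ERROR: command not found: " + cmd[0]
    if done.returncode != 0:
        return "ERROR: " + done.stderr.strip()
    return done.stdout


def collect_rdma_snapshot(
    runner: Callable[[List[str]], str] = run_command,
) -> Dict[str, str]:
    """Return a mapping from command label to raw command output."""
    return {label: runner(cmd) for label, cmd in RDMA_COMMANDS.items()}
```

Passing a custom runner makes it possible to exercise the helper on a machine without eRDMA, for example collect_rdma_snapshot(lambda cmd: " ".join(cmd)).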
Use the diagnose tool to check eRDMA
You can use the diagnose tool to check basic eRDMA functionality, eRDMA high-performance computing (HPC) environments, and basic eRDMA latencies. This helps you effectively use eRDMA.
Run the following commands to obtain the diagnose tool:
wget https://mirrors.aliyun.com/erdma/tools/diagnose.py
# View how to use the diagnose tool.
python diagnose.py -h
Perform a check on eRDMA.
Basic functionality check
Run one of the following commands to check the basic functionality of eRDMA:
python diagnose.py -d
Or
python diagnose.py --diagnose
One of the following results is returned for each check item:
PASS: The check item passed the check.
SKIP: The check item does not support the check and is skipped.
FAIL: The required check tool is not installed or the check item failed the check. You can run the commands that are listed in the fail info section to check the FAIL items and troubleshoot issues.
Other INFO information: indicates eRDMA-related configuration information, such as the installation mode, driver versions, and CC algorithms.
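If you run the check from automation, a quick tally of the three statuses can indicate whether any item needs attention. The following Python sketch counts the statuses in captured report text. It assumes only that the tool prints the status names as the uppercase words PASS, SKIP, and FAIL. The exact layout of the report is not assumed, and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import Dict

STATUSES = ("PASS", "SKIP", "FAIL")


def count_statuses(report_text: str) -> Dict[str, int]:
    """Count whole-word, uppercase occurrences of each status in the text.

    Assumption: statuses appear as the literal words PASS, SKIP, and FAIL.
    A status word inside explanatory text is also counted, so treat the
    totals as a hint rather than an exact result.
    """
    found = Counter(re.findall(r"\b(PASS|SKIP|FAIL)\b", report_text))
    return {status: found.get(status, 0) for status in STATUSES}


def has_failures(report_text: str) -> bool:
    """Return True if at least one FAIL appears in the report text."""
    return count_statuses(report_text)["FAIL"] > 0
```

The report text can be captured, for example, by redirecting the output of python diagnose.py -d to a file and reading that file.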
In normal cases, the command output indicates that all check items passed the check.
Sixteen check items are checked in the basic functionality check on eRDMA. The following table describes the check items, the expected check result for each item, and what to do if the items fail the check.
| Check item | Description | Expected result | Error result and solution |
| --- | --- | --- | --- |
| erdma device | Whether eRDMA devices exist. | PASS | FAIL: The eRDMA feature may not have been enabled, or an eRDMA interface (ERI) may not have been added as a secondary network interface controller (NIC), during instance creation. Enable eRDMA or add an ERI as a secondary NIC. For more information, see Configure eRDMA on an enterprise-level instance. |
| erdma installed | Whether eRDMA drivers are properly installed. | PASS | FAIL: eRDMA drivers are not properly installed. Check the steps that you performed to install eRDMA drivers during eRDMA configuration or re-install the drivers. For more information, see Configure eRDMA on an enterprise-level instance. |
| erdma loaded | Whether eRDMA drivers are properly loaded. | PASS | FAIL: eRDMA drivers are not properly loaded. This issue may occur when the drivers are installed before the instance restarts. Run the modprobe erdma command to resolve the issue. |
| ibverbs loaded | Whether the ib_verbs driver is properly loaded. | PASS | FAIL: The ib_verbs driver is not properly loaded. Run the modprobe ib_uverbs command to resolve the issue. |
| erdma tools | Whether eRDMA-related tools are installed. | PASS | FAIL: Run the eadm, rdma, and ibv_devinfo commands individually to check for missing tools. In most cases, eRDMA-related tools are installed together with eRDMA drivers. Check the steps that you performed to install eRDMA drivers during eRDMA configuration or re-install the drivers. For more information, see Configure eRDMA on an enterprise-level instance. |
| hca detected | Whether eRDMA devices are detected by the user-space driver. | PASS | FAIL: eRDMA devices are not detected by the user-space driver. This issue occurs when the erdma device, erdma installed, erdma loaded, and ibverbs loaded check items fail the check. Check that eRDMA drivers are installed and properly loaded. |
| hca active | Whether the current device is enabled. | PASS | FAIL: This issue occurs when the elastic network interface (ENI) of the current eRDMA device is not in the running state. The issue may occur in specific early kernel versions. Run the dhclient -v ethx command to enable the ENI, and then check whether the eRDMA device is in the ACTIVE state. |
| erdma stats | Whether the statistics about eRDMA devices are free of errors. | PASS | SKIP: The operating system may not support the rdma stat command. FAIL: Error statistics about eRDMA devices may exist. When you ask for technical assistance, we recommend that you provide the rdma -p stat command output. |
| network config | Whether network connectivity is good. | PASS | FAIL: If the IP addresses of multiple NICs fall within the same subnet, eRDMA may not work as expected in specific scenarios. |
| erdma dmesg | Whether the kernel is free of eRDMA-related alerts. | PASS | FAIL: eRDMA-related alerts exist in the kernel. Check the error details of the alerts and reload drivers to resolve the issues. |
| atomic support | Whether the eRDMA device supports RDMA Atomic Operation. | PASS | FAIL: The current eRDMA device does not support RDMA Atomic Operation. Note: RDMA Atomic Operation is a feature that performs complete and consistent operations on memory at the atomic level and is suitable only for specific scenarios. If you do not need RDMA Atomic Operation, ignore the error. |
| go-back-n support | Whether the eRDMA device supports the Go-back-N feature. | PASS | SKIP: The current eRDMA device may not support queries for Go-back-N configurations. FAIL: The eadm tool may not be properly installed or the eRDMA device may not support the Go-back-N feature. Note: Go-back-N is an extension of eRDMA that is suitable only for specific scenarios. If you do not need the Go-back-N feature, ignore the error. |
| erdma install mode | The eRDMA kernel-mode driver installation mode. | Standard: The eRDMA kernel-mode driver is installed in standard mode and supports only RDMA Connection Manager (CM) connections. Compat: The eRDMA kernel-mode driver is installed in compatible mode and supports RDMA CM connections and out-of-band (OOB) connections. The driver uses TCP ports in the range of 0x7790 to 0x779F (30608 to 30623 in decimal). | FAIL: The installation mode of the eRDMA kernel-mode driver is not detected. This issue may occur when the erdma loaded item does not meet requirements and fails the check. Re-install the eRDMA kernel-mode driver. For more information, see Configure eRDMA on an enterprise-level instance. |
| kernel driver version | The version of the eRDMA kernel-mode driver. | The version number of the eRDMA kernel-mode driver. Example: 0.2.38. | FAIL: The version of the eRDMA kernel-mode driver is not detected. This issue may occur when the erdma loaded or erdma tools item does not meet requirements and fails the check. Make sure that the eRDMA driver is installed and properly loaded. For more information, see Configure eRDMA on an enterprise-level instance. |
| rdma-core version | The version of the eRDMA user-mode driver. | The version number of the eRDMA user-mode driver. Example: 44.3-1. | FAIL: The version of the eRDMA user-mode driver is not detected. This issue may occur when the eRDMA user-mode driver is not properly installed. Re-install the driver. For more information, see Configure eRDMA on an enterprise-level instance. |
| cc algorithm | The CC algorithm of eRDMA. | The CC algorithm of eRDMA. Example: cubic. | FAIL: The CC algorithm of eRDMA is not detected. This issue may occur when the erdma loaded or erdma tools item does not meet requirements and fails the check. Make sure that eRDMA drivers are installed and properly loaded. |

eRDMA HPC environment check
If you want to run HPC applications in your eRDMA environment, you may need additional dependencies and configurations. You can use the diagnose tool to check the dependencies that are required for an eRDMA HPC environment. If you do not use HPC applications, ignore this section.
Run the following command to check the dependencies that are required for an eRDMA HPC environment:
python diagnose.py --hpc-check
In normal cases, the following command output is returned.
During the eRDMA HPC environment check, the following items about required dependencies are checked: the CC algorithm of eRDMA, whether Go-back-N is supported, DAPL 1.0-related items, and DAPL 2.0-related items. If you do not need a dependency, ignore the errors that are reported for it. For example, if you need only DAPL 2.0, ignore the errors that are reported about DAPL 1.0.
| Check item | Description | Expected result | Error result and solution |
| --- | --- | --- | --- |
| cc algorithm | The CC algorithm of eRDMA. | The CC algorithm of eRDMA. Example: cubic. | FAIL: The CC algorithm of eRDMA is not detected. This issue may occur when the eadm tool is not properly installed or does not support queries for the CC algorithm of eRDMA. |
| go-back-n support | Whether the eRDMA device supports the Go-back-N feature. | PASS | SKIP: The current eRDMA device may not support queries for Go-back-N configurations. FAIL: The eadm tool may not be properly installed or the eRDMA device may not support the Go-back-N feature. The absence of the Go-back-N feature may affect HPC applications. If you do not need the feature, ignore the error. |
| dapl1 install | Whether DAPL 1.0 is properly installed. | PASS | FAIL: The shared libraries for DAPL 1.0 or the DAPL 1.0 configuration file does not exist. Check whether DAPL 1.0 is properly installed. If you do not need DAPL 1.0, ignore the error. |
| dapl1 config | Whether eRDMA configurations are included in the DAPL 1.0 configuration file. | PASS | FAIL: No eRDMA configurations exist in the DAPL 1.0 configuration file. Check the DAPL 1.0 configuration file and add eRDMA configurations to the file. If you do not need DAPL 1.0, ignore the error. |
| dapl2 install | Whether DAPL 2.0 is properly installed. | PASS | FAIL: The shared libraries for DAPL 2.0 or the DAPL 2.0 configuration file does not exist. Check whether DAPL 2.0 is properly installed. If you do not need DAPL 2.0, ignore the error. |
| dapl2 config | Whether eRDMA configurations are included in the DAPL 2.0 configuration file. | PASS | FAIL: No eRDMA configurations exist in the DAPL 2.0 configuration file. Check the DAPL 2.0 configuration file and add eRDMA configurations to the file. If you do not need DAPL 2.0, ignore the error. |
| dapl2 test | Whether the dtest command runs as expected for DAPL 2.0. | PASS | FAIL: The dtest command fails to run. DAPL 2.0 may not be properly installed or configured. |

eRDMA latency check
Prerequisites
Before you perform an eRDMA latency check, make sure that the following requirements are met:
eRDMA is properly installed and deployed on all nodes that you want to check. For more information, see Configure eRDMA on an enterprise-level instance.
Password-free SSH access is allowed between all nodes that you want to check. For more information, see Build a Hadoop environment.
Python paramiko dependencies are installed on all nodes that you want to check.
Procedure
Run the following command to check the eRDMA latency:
python diagnose.py --perftest --hosts <n1> <n2> --user <username> --key-file </path/to/private_key>
Take note of the following parameters:
--hosts <n1> <n2>: specifies the nodes that you want to check. Separate the nodes with spaces. Replace <n1> <n2> with the private IP addresses of ERIs on the nodes.
--user <username>: specifies the username that is used for password-free SSH logons. Replace <username> with an actual username.
--key-file </path/to/private_key>: specifies the absolute path of the private key file that is used for password-free SSH logons. Replace </path/to/private_key> with the actual absolute path of a private key file.
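The parameters above can be validated before the command is run. The following Python sketch builds the argument list for the latency check and rejects a key file path that is not absolute. The function name perftest_command is hypothetical, the minimum of two hosts is an assumption based on the <n1> <n2> form of the command, and the IP addresses, username, and key path in the usage note are placeholders.

```python
import posixpath
from typing import List, Sequence


def perftest_command(
    hosts: Sequence[str],
    user: str,
    key_file: str,
    script: str = "diagnose.py",
) -> List[str]:
    """Build the argument list for the eRDMA latency check.

    hosts:    private IP addresses of the ERIs on the nodes to check.
    user:     username used for password-free SSH logons.
    key_file: absolute path of the private key file.
    """
    # Assumption of this sketch: the command form shows two hosts, so
    # fewer than two is treated as a mistake.
    if len(hosts) < 2:
        raise ValueError("specify at least two hosts")
    # The key file must be given as an absolute path.
    if not posixpath.isabs(key_file):
        raise ValueError("key_file must be an absolute path")
    return [
        "python", script, "--perftest",
        "--hosts", *hosts,
        "--user", user,
        "--key-file", key_file,
    ]
```

For example, perftest_command(["192.168.0.10", "192.168.0.11"], "root", "/root/.ssh/id_rsa") returns a list that can be passed to subprocess.run on a node where diagnose.py is present.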
The command output shows the check results.