You can check whether a node is running as expected based on its health status, which is derived from the results of multiple health check items. This topic describes how to view the health status of a node and the related health check items.
Prerequisites
An E-MapReduce (EMR) cluster is created. For more information, see Create a cluster.
Limits
This topic applies only to DataLake, Dataflow, online analytical processing (OLAP), DataServing, and custom clusters.
View the latest health status of nodes
Go to the Nodes tab.
Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.
In the top navigation bar, select the region in which your cluster resides and select a resource group based on your business requirements.
On the EMR on ECS page, find the desired cluster and click Nodes in the Actions column.
On the Nodes tab, view the health status of nodes in each node group.
Green number in the Health Status column: indicates the number of nodes in the Good state in the current node group.
Yellow number in the Health Status column: indicates the number of nodes in the Warning state in the current node group.
Red number in the Health Status column: indicates the number of nodes in the Abnormal state in the current node group.
Gray number in the Health Status column: indicates the combined number of nodes in the Unknown state or the Stateless state in the current node group.
On the Nodes tab, click the icon on the left of the name of a node group. In the node list that appears, you can view the health status of each node in the Health Status column.
A node can be in one of the following states: Good, Warning, Abnormal, Unknown, or Stateless. Each state is indicated by a different icon.
| Health status | Description |
| --- | --- |
| Good | The node is running as expected. |
| Warning | The node is running as expected, but the health check items of the node detected potential risks. Pay attention to these risks. |
| Abnormal | The node is unavailable. The health check items of the node detected serious issues. Troubleshoot the issues as soon as possible. |
| Stateless | No health check is performed on the node after an installation process or a manual stop. You can ignore nodes in this state. |
| Unknown | The results of the health check items of the node cannot be obtained. If your business is not affected, you can ignore nodes in this state. |
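The color-coded numbers shown for a node group can be read as a tally of the per-node states. The following Python sketch illustrates that mapping only. The function name `summarize_node_group` and the input format (a list of state strings) are assumptions made for illustration and are not part of any EMR API.

```python
# Color of the number in the Health Status column for each node state,
# as described on the Nodes tab. Unknown and Stateless share the gray number.
STATE_TO_COLOR = {
    "Good": "green",
    "Warning": "yellow",
    "Abnormal": "red",
    "Unknown": "gray",
    "Stateless": "gray",
}


def summarize_node_group(node_states):
    """Return per-color node counts for one node group (hypothetical helper)."""
    counts = {"green": 0, "yellow": 0, "red": 0, "gray": 0}
    for state in node_states:
        if state not in STATE_TO_COLOR:
            raise ValueError(f"unexpected health state: {state!r}")
        counts[STATE_TO_COLOR[state]] += 1
    return counts


if __name__ == "__main__":
    # Example: a node group with five nodes in mixed states.
    states = ["Good", "Good", "Warning", "Unknown", "Stateless"]
    print(summarize_node_group(states))
```

In this sketch, a node group with one Unknown node and one Stateless node reports a gray count of 2, which matches how the console combines the two states.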
View health check items of a node
On the Nodes tab, find the desired node group and click the icon on the left of the name of the node group.
Find the desired node and click View Check Items to the right of the health status in the Health Status column.
In the panel that appears, view the latest results of health check items and the health check history of the current node.
The following table describes the health check items. In the Threshold column, u denotes the measured value of the check item.
| Name | Description | Threshold | Unit |
| --- | --- | --- | --- |
| status_alive | Checks whether the node status is normal. | None | - |
| host_fd_usage | Checks the file descriptor usage. | Warning: 95 ≤ u < 99; Abnormal: u ≥ 99 | % |
| host_disk_fault | Checks whether a disk exception occurs on the underlying layer. | None | - |
| host_system_env | Checks the availability of important configuration files, Java, and Python. | None | - |
| host_service_env | Checks whether the storage directories and package files on which the cluster services depend are available. | None | - |
| host_network_transmit_drop_rate | Checks the outbound packet loss rate during network transmission. | Warning: 1.0 ≤ u < 2.5; Abnormal: u ≥ 2.5 | % |
| host_network_receive_error_rate | Checks the inbound packet error rate during network transmission. | Warning: 0.1 ≤ u < 0.5; Abnormal: u ≥ 0.5 | % |
| host_disk_io_latency | Checks the average disk read/write latency. | Warning: 400 ≤ u < 800; Abnormal: u ≥ 800 | ms |
| host_network_receive_drop_rate | Checks the inbound packet loss rate during network transmission. | Warning: 1.0 ≤ u < 2.5; Abnormal: u ≥ 2.5 | % |
| host_network_transmit_error_rate | Checks the outbound packet error rate during network transmission. | Warning: 0.1 ≤ u < 0.5; Abnormal: u ≥ 0.5 | % |
| host_system_fault | Checks whether a system exception occurs on the underlying layer. | None | - |
| host_cpu_usage | Checks the CPU usage of the node. | Warning: 95 ≤ u < 99; Abnormal: u ≥ 99 | % |
| host_disk_inode_usage | Checks the index node (inode) usage of disks. | Warning: 90 ≤ u < 99; Abnormal: u ≥ 99 | % |
| host_mem_usage | Checks the memory usage of the node. | Warning: 95 ≤ u < 99; Abnormal: u ≥ 99 | % |
| host_disk_space_usage | Checks the disk space usage. | Warning: 90 ≤ u < 99; Abnormal: u ≥ 99 | % |
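To make the threshold ranges concrete, the following Python sketch classifies a measured value u for a subset of the check items listed above. It is an illustration, not the EMR implementation. The function name `classify_check_item` is hypothetical, and treating a value below the Warning range as Good is an assumption, because the table defines only the Warning and Abnormal ranges.

```python
# (warning_lower_bound, abnormal_lower_bound) for a subset of check items,
# copied from the threshold table. Items with no threshold are not covered.
THRESHOLDS = {
    "host_fd_usage": (95, 99),                         # %
    "host_cpu_usage": (95, 99),                        # %
    "host_mem_usage": (95, 99),                        # %
    "host_disk_inode_usage": (90, 99),                 # %
    "host_disk_space_usage": (90, 99),                 # %
    "host_disk_io_latency": (400, 800),                # ms
    "host_network_transmit_drop_rate": (1.0, 2.5),     # %
    "host_network_transmit_error_rate": (0.1, 0.5),    # %
}


def classify_check_item(name, u):
    """Map a measured value u to a state for one check item (hypothetical helper)."""
    warning, abnormal = THRESHOLDS[name]
    if u >= abnormal:
        return "Abnormal"
    if u >= warning:
        return "Warning"
    # Assumption: values below the Warning range count as Good.
    return "Good"


if __name__ == "__main__":
    for name, value in [
        ("host_cpu_usage", 97.0),
        ("host_disk_io_latency", 800),
        ("host_disk_space_usage", 42.5),
    ]:
        print(name, value, classify_check_item(name, value))
```

Both bounds are inclusive at the lower end, so a value exactly equal to the Abnormal bound, such as a disk latency of 800 ms, is classified as Abnormal rather than Warning.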