E-MapReduce: View the health status of nodes

Last Updated: Nov 07, 2024

You can use the health status of a node to check whether the node is running as expected. The health status is derived from the results of multiple health check items. This topic describes how to view the health status of a node and its health check items.

Prerequisites

An E-MapReduce (EMR) cluster is created. For more information, see Create a cluster.

Limits

This topic applies only to DataLake, Dataflow, online analytical processing (OLAP), DataServing, and custom clusters.

View the latest health status of nodes

  1. Go to the Nodes tab.

    1. Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.

    2. In the top navigation bar, select the region in which your cluster resides and select a resource group based on your business requirements.

    3. On the EMR on ECS page, find the desired cluster and click Nodes in the Actions column.

  2. On the Nodes tab, view the health status of nodes in each node group.

    • Green number in the Health Status column: indicates the number of nodes in the Good state in the current node group.

    • Yellow number in the Health Status column: indicates the number of nodes in the Warning state in the current node group.

    • Red number in the Health Status column: indicates the number of nodes in the Abnormal state in the current node group.

    • Gray number in the Health Status column: indicates the number of nodes in the Unknown state and nodes in the Stateless state in the current node group.

    On the Nodes tab, click the icon to the left of the name of a node group. In the node list that appears, you can view the health status of each node in the Health Status column.

    A node can be in one of the following states: Good, Warning, Abnormal, Unknown, or Stateless. Each state is indicated by a different icon.

    | Health status | Description |
    | --- | --- |
    | Good | The node is running as expected. |
    | Warning | The node is running as expected, but its health check items detected hidden risks. Pay attention to these risks. |
    | Abnormal | The node is unavailable. Its health check items detected serious issues. Troubleshoot the issues as soon as possible. |
    | Stateless | No health check is performed on the node after an installation process or a manual stop. You can ignore nodes in this state. |
    | Unknown | The results of the health check items of the node cannot be obtained. If your business is not affected, you can ignore nodes in this state. |
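
The way the colored numbers in the Health Status column summarize a node group can be sketched in a few lines of Python. This is only an illustration of the counting rule described above, not code from the EMR console or any EMR API. The function name `summarize_node_group` and the input format (a list of state names) are assumptions made for the example.

```python
# Mapping from node health state to the color of the number shown in the
# Health Status column. Unknown and Stateless share the gray number.
STATE_TO_COLOR = {
    "Good": "green",
    "Warning": "yellow",
    "Abnormal": "red",
    "Unknown": "gray",
    "Stateless": "gray",
}


def summarize_node_group(node_states):
    """Count the nodes of one node group per display color.

    node_states is a hypothetical list of state names, one per node.
    """
    counts = {"green": 0, "yellow": 0, "red": 0, "gray": 0}
    for state in node_states:
        counts[STATE_TO_COLOR[state]] += 1
    return counts


if __name__ == "__main__":
    states = ["Good", "Good", "Warning", "Unknown", "Stateless"]
    print(summarize_node_group(states))
```

In this sketch, a node group with one Unknown node and one Stateless node reports a gray count of 2, which is why the gray number alone does not tell you which of the two states a node is in. Expand the node group to see the state of each node.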

View health check items of a node

  1. On the Nodes tab, find the desired node group and click the icon to the left of the name of the node group.

  2. Find the desired node and click View Check Items to the right of the health status in the Health Status column.

  3. In the panel that appears, view the latest results of health check items and the health check history of the current node.

    The following table describes the health check items. In the thresholds, u denotes the value of the check item.

    | Name | Description | Threshold | Unit |
    | --- | --- | --- | --- |
    | status_alive | Checks whether the node status is normal. | None | - |
    | host_fd_usage | Checks the file descriptor usage. | Warning: 95 ≤ u < 99; Abnormal: u ≥ 99 | % |
    | host_disk_fault | Checks whether a disk exception occurs on the underlying layer. | None | - |
    | host_system_env | Checks the availability of important configuration files, Java, and Python. | None | - |
    | host_service_env | Checks whether the storage directories and package files on which the cluster services depend are available. | None | - |
    | host_network_transmit_drop_rate | Checks the outbound packet loss rate during network transmission. | Warning: 1.0 ≤ u < 2.5; Abnormal: u ≥ 2.5 | % |
    | host_network_receive_error_rate | Checks the inbound packet error rate during network transmission. | Warning: 0.1 ≤ u < 0.5; Abnormal: u ≥ 0.5 | % |
    | host_disk_io_latency | Checks the average disk read/write latency. | Warning: 400 ≤ u < 800; Abnormal: u ≥ 800 | ms |
    | host_network_receive_drop_rate | Checks the inbound packet loss rate during network transmission. | Warning: 1.0 ≤ u < 2.5; Abnormal: u ≥ 2.5 | % |
    | host_network_transmit_error_rate | Checks the outbound packet error rate during network transmission. | Warning: 0.1 ≤ u < 0.5; Abnormal: u ≥ 0.5 | % |
    | host_system_fault | Checks whether a system exception occurs on the underlying layer. | None | - |
    | host_cpu_usage | Checks the CPU load of the node. | Warning: 95 ≤ u < 99; Abnormal: u ≥ 99 | % |
    | host_disk_inode_usage | Checks the index node (inode) usage of disks. | Warning: 90 ≤ u < 99; Abnormal: u ≥ 99 | % |
    | host_mem_usage | Checks the memory usage of the node. | Warning: 95 ≤ u < 99; Abnormal: u ≥ 99 | % |
    | host_disk_space_usage | Checks the disk usage. | Warning: 90 ≤ u < 99; Abnormal: u ≥ 99 | % |
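
To make the threshold ranges concrete, the following Python sketch classifies a measured value u for some of the threshold-based check items. It is a minimal illustration under stated assumptions, not the logic that EMR itself runs: the `classify` function and the `THRESHOLDS` dictionary are hypothetical names, only a subset of the items is included, and the sketch assumes that a value below the Warning threshold counts as Good, which the table implies but does not state.

```python
# (warning_threshold, abnormal_threshold) for a subset of the check items,
# copied from the thresholds listed above. Units are % except for
# host_disk_io_latency, which is in ms.
THRESHOLDS = {
    "host_fd_usage": (95, 99),
    "host_cpu_usage": (95, 99),
    "host_mem_usage": (95, 99),
    "host_disk_inode_usage": (90, 99),
    "host_disk_space_usage": (90, 99),
    "host_disk_io_latency": (400, 800),
    "host_network_transmit_drop_rate": (1.0, 2.5),
    "host_network_transmit_error_rate": (0.1, 0.5),
}


def classify(item, u):
    """Return "Good", "Warning", or "Abnormal" for value u of a check item.

    Assumption: values below the Warning threshold are treated as Good.
    """
    warning, abnormal = THRESHOLDS[item]
    if u >= abnormal:
        return "Abnormal"
    if u >= warning:
        return "Warning"
    return "Good"


if __name__ == "__main__":
    print(classify("host_fd_usage", 96))          # within 95 <= u < 99
    print(classify("host_disk_io_latency", 800))  # u >= 800
    print(classify("host_cpu_usage", 50))         # below the Warning threshold
```

The comparison order matters: the Abnormal threshold is checked first so that a value such as 99.5 for host_fd_usage is not reported as Warning. Check items whose threshold is None, such as status_alive, are not numeric ranges and are outside the scope of this sketch.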