You can use the health check feature of an E-MapReduce (EMR) cluster to learn the health status of the cluster and resolve issues in the cluster based on suggestions. This can help ensure that the cluster remains in a healthy state.
Precautions
The health diagnostics feature is available only for DataLake, Dataflow, OLAP, DataServing, and custom clusters. For more information, see Create a cluster.
Health diagnostics is used to analyze the health status of nodes and services, such as Hive, HDFS, YARN, and ZooKeeper, in a cluster. You can identify issues based on the diagnostic result and troubleshoot the issues based on suggestions.
View daily cluster reports
Go to the Monitoring and Diagnostics tab.
Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.
In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
On the EMR on ECS page, find the desired cluster and click the name of the cluster in the Cluster ID/Name column.
On the page that appears, click the Monitoring and Diagnostics tab.
On the Monitoring and Diagnostics tab, click the Daily Cluster Reports subtab. You can view the list of health check reports of the cluster.
On the Daily Cluster Reports tab, the Health Status column displays the health status of the cluster.
The following table describes the health status that corresponds to each score range.
Score range | Description |
0 <= x <= 60 | The cluster is in an unhealthy state. Resolve issues in the cluster at the earliest opportunity. |
60 < x <= 80 | The cluster is in a sub-healthy state. We recommend that you optimize the cluster. |
80 < x <= 100 | The cluster is in a healthy state and no issues need to be resolved. |
Note: The score indicates the health status of the cluster. Valid values range from 0 to 100.
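The score ranges above can be expressed as a small lookup. The following is a minimal sketch for illustration only; the function name health_status and the returned labels are hypothetical and are not part of any EMR API.

```python
def health_status(score: float) -> str:
    """Map a cluster health score (0 to 100) to the status from the score range table."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score <= 60:
        return "unhealthy"    # 0 <= x <= 60
    if score <= 80:
        return "sub-healthy"  # 60 < x <= 80
    return "healthy"          # 80 < x <= 100


for s in (45, 60, 72.5, 80, 96):
    print(s, health_status(s))
```

Note that the boundary values 60 and 80 belong to the lower range, as the table specifies.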
View the details of a daily cluster report.
Click View Report in the Actions column of a report to view the details of the report for the cluster.
This page displays an overview of the cluster health status and the basic information about the report, such as the health score, cluster ID, report ID, and diagnostics time. The diagnostic items and an analysis overview of the diagnostic items displayed on this page vary based on the type of the cluster. The analysis overview provides a summary of the cluster issues and directly displays the issues. You can refer to the details of diagnostic items to obtain analysis results of a specific issue.
Analysis of computing resources
Analysis details
This tab displays analysis details of computing resources. You can learn the basic information about the computing resource usage of a cluster, such as the computing score, the number of scanned jobs, and the health status distribution of jobs. This tab also displays the identified issues such as low memory usage. You can check the information about the job in which the issue is identified to resolve the issue.
Basic computing information
This section displays the trend charts of cluster computing scores, cluster memory consumed by different types of engines (GB*Sec), and cluster vCPUs consumed by different types of engines (VCore*Sec).
The following table provides information about cluster memory and cluster vCPUs.
Metric | Description |
Cluster Memory (GB*Sec) | The total cluster memory that is consumed by jobs in the cluster. It is an accumulated value and is calculated by using the following formula: |
Cluster vCPU (VCore*Sec) | The total number of cluster vCPUs that are consumed by jobs in the cluster. It is an accumulated value and is calculated by using the following formula: |
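The units GB*Sec and VCore*Sec suggest that each metric accumulates the amount of a resource multiplied by the time it is held. The following sketch shows that interpretation under stated assumptions: the Allocation type and its fields are hypothetical, and the exact formula that EMR Doctor uses may differ.

```python
from dataclasses import dataclass


@dataclass
class Allocation:
    """One resource allocation held by a job (hypothetical structure)."""
    memory_gb: float  # memory held, in GB
    vcores: int       # vCPUs held
    seconds: float    # how long the allocation was held


def resource_seconds(allocations):
    """Accumulate resource x time over all allocations (assumed formula)."""
    memory_gb_sec = sum(a.memory_gb * a.seconds for a in allocations)
    vcore_sec = sum(a.vcores * a.seconds for a in allocations)
    return memory_gb_sec, vcore_sec


allocations = [
    Allocation(memory_gb=4.0, vcores=2, seconds=100.0),
    Allocation(memory_gb=8.0, vcores=4, seconds=50.0),
]
memory_gb_sec, vcore_sec = resource_seconds(allocations)
print(memory_gb_sec, vcore_sec)  # 800.0 GB*Sec and 400.0 VCore*Sec
```

Under this reading, a job that holds little memory for a long time can consume as much as a job that holds a lot of memory briefly.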
Computing information analysis
This section displays the following charts:
Trend chart of compute engine scores
Trend chart of the number of compute engine jobs
Pie chart for memory consumed by different types of engines
Pie chart for vCPUs consumed by different types of engines
Pie chart for memory consumed by jobs that are submitted by different users
Job information
EMR Doctor collects and analyzes jobs, and then displays the key jobs that affect cluster execution. You can resolve the issues that are identified in these jobs to improve cluster computing efficiency and increase cluster utilization and cost-effectiveness.
This section displays the top 50 jobs that consume the most memory (GB*Sec) and the top 50 jobs sorted by scores in ascending order.
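The two rankings can be reproduced from any exported job list. The following is a minimal sketch that assumes each job is a dictionary with the hypothetical keys memory_gb_sec and score; it does not reflect how EMR Doctor stores or exposes job data.

```python
import heapq


def key_jobs(jobs, n=50):
    """Return (top n jobs by memory consumption, n jobs with the lowest scores)."""
    top_memory = heapq.nlargest(n, jobs, key=lambda j: j["memory_gb_sec"])
    lowest_score = heapq.nsmallest(n, jobs, key=lambda j: j["score"])
    return top_memory, lowest_score


jobs = [
    {"id": "job-a", "memory_gb_sec": 100, "score": 90},
    {"id": "job-b", "memory_gb_sec": 300, "score": 40},
    {"id": "job-c", "memory_gb_sec": 200, "score": 70},
]
top_memory, lowest_score = key_jobs(jobs, n=2)
print([j["id"] for j in top_memory])    # ['job-b', 'job-c']
print([j["id"] for j in lowest_score])  # ['job-b', 'job-c']
```

Jobs that appear in both rankings, such as job-b in this example, are usually the most worthwhile to investigate first.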
Analysis of HDFS storage resources
By default, EMR Doctor does not collect information about storage resources. To analyze Hadoop Distributed File System (HDFS) or Hive storage resources, turn on Collect Information About Storage Resources on the Daily Cluster Report tab of the Health Check tab, or change the storage resource collection settings as described in the Configuration topic.
Analysis details
This tab displays analysis details of HDFS storage resources. The analysis details describe basic information about the cluster resources, such as the total number of files and the total volume of stored data. This tab also displays the identified issues such as high proportion of small files and high proportion of stored cold data. In the issue details section, you can view the directory in which a specific issue is identified and the method to resolve the issue.
Basic HDFS information
In the Basic HDFS Information section, you can view the following information in charts:
Trend chart of the volume of stored data
Trend chart of the number of files
Trend chart of HDFS storage scores
Total number of files, total volume of stored data, number of small files, number of very small files, and volume of stored cold data
HDFS usage analysis
In the HDFS Usage Analysis section, you can view the following information in charts:
Pie chart for storage resources consumed by different HDFS users
Pie chart for the number of files used by different HDFS users
Pie chart for storage resources consumed by different HDFS groups
Pie chart for the number of files used by different HDFS groups
Pie chart for the distribution of HDFS files of different sizes
Pie chart for the distribution of cold data and hot data in HDFS
Distribution of data stored in level-1 HDFS directories
Distribution of files of different sizes stored in HDFS directories
Small files in HDFS put pressure on the NameNode and can cause shard issues, so the number of small files in HDFS is an important metric. In the Directory File Size Distribution section, you can view the distribution of empty files, very small files, small files, medium files, and large files at each directory level. EMR Doctor lets you drill down through up to four levels of directories.
The following table describes the file definitions.
File type | Description |
Empty file | Files whose size is 0. |
Very small file | Files whose size is less than 1 MB. |
Small file | Files whose size is less than 128 MB. |
Medium file | Files whose size is greater than or equal to 128 MB and is less than or equal to 1 GB. |
Large file | Files whose size is greater than 1 GB. |
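The definitions in the table can be applied to a list of file sizes as follows. This is a minimal sketch with two assumptions: the categories are checked in table order, so a file lands in the first category that matches, and 1 MB is taken as 1024 * 1024 bytes, which the table does not specify. The function name classify_file is hypothetical.

```python
from collections import Counter

MB = 1024 ** 2  # assumption: binary units
GB = 1024 ** 3


def classify_file(size_bytes: int) -> str:
    """Classify a file by size, checking categories in table order."""
    if size_bytes < 0:
        raise ValueError("file size cannot be negative")
    if size_bytes == 0:
        return "empty"
    if size_bytes < 1 * MB:
        return "very small"
    if size_bytes < 128 * MB:
        return "small"
    if size_bytes <= 1 * GB:
        return "medium"
    return "large"


# Distribution of file types for the files in one directory.
sizes = [0, 4096, 20 * MB, 256 * MB, 3 * GB]
print(Counter(classify_file(s) for s in sizes))
```

Each of the five sample sizes falls into a different category, so every count in the printed distribution is 1.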
The Directory File Size Distribution section displays the following information:
Top N directories at a specific level that store the maximum number of empty files
Top N directories at a specific level that store the maximum number of very small files
Top N directories at a specific level that store the maximum number of small files
Top N directories at a specific level that store the maximum number of medium files
Top N directories at a specific level that store the maximum number of large files
Each table displays the information about the top N directories, such as the specific path, volume of stored data, day-to-day comparison, and daily increment.
Distribution of cold data and hot data in directories
Cold data is data that has not been accessed for a long period of time. We recommend that you store cold data in cold standby storage mode, such as the Cold Archive storage class in Object Storage Service (OSS). The distribution of cold data and hot data in directories can help you understand cluster usage and reduce costs. In the Directory Cold Data and Hot Data Distribution section, you can view the distribution of very cold data, cold data, warm data, and hot data at each directory level. EMR Doctor lets you drill down through up to four levels of directories.
Data type | Description |
Very cold data | Data that has not been accessed for more than three months. |
Cold data | Data that has not been accessed for more than one month but was accessed within the last three months. |
Warm data | Data that has not been accessed for more than seven days but was accessed within the last month. |
Hot data | Data that was accessed within the last seven days. |
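The data types in the table can be derived from the last access time of a file. The following sketch assumes that one month is 30 days and three months are 90 days, which the table does not state; the function name classify_temperature is hypothetical.

```python
from datetime import datetime, timedelta


def classify_temperature(last_access: datetime, now: datetime) -> str:
    """Classify data by how long ago it was last accessed."""
    idle = now - last_access
    if idle > timedelta(days=90):  # assumption: three months = 90 days
        return "very cold"
    if idle > timedelta(days=30):  # assumption: one month = 30 days
        return "cold"
    if idle > timedelta(days=7):
        return "warm"
    return "hot"


now = datetime(2024, 6, 1)
for days in (3, 10, 45, 120):
    print(days, classify_temperature(now - timedelta(days=days), now))
```

With these thresholds, the four sample ages map to hot, warm, cold, and very cold, in that order.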
The Directory Cold Data and Hot Data Distribution section displays the following information:
Top N directories at a specific level that store the maximum volume of very cold data
Top N directories at a specific level that store the maximum volume of cold data
Top N directories at a specific level that store the maximum volume of warm data
Top N directories at a specific level that store the maximum volume of hot data
Each table displays the information about the top N directories, such as the specific path, volume of stored data, day-to-day comparison, and daily increment.
Analysis of HBase storage resources
Analysis details
This tab displays analysis details of HBase storage resources. The analysis details describe basic information about HBase usage, such as the average cluster load, cluster partition balancing degree, and the health status of RegionServers and user tables. This tab also displays the identified issues such as high average cluster load, low cluster partition balancing degree, and abnormal health status of RegionServers and user tables. In the issue details section, you can view the information such as the RegionServer, table, or partition in which a specific issue is identified and the method to resolve the issue.
Cluster overview analysis
In the Cluster Overview section, you can view the following information in charts:
Trend chart of cluster health scores
Trend chart of cluster partition balancing degrees
Pie chart for the number of partitions in the cluster for different RegionServers
Trend chart of the number of cluster requests
Total number of tables, total number of partitions, total number of nodes, average load, total volume of data, total number of read requests, total number of write requests, and total number of requests
RegionServer-related information
The RegionServer Related Information section displays detailed information such as the cache hit ratio, average GC duration, and number of daily read/write requests of a RegionServer.
Ranking of RegionServers sorted by the cache hit ratio in ascending order (table headers: RegionServer and Cache Hit Ratio)
Ranking of RegionServers sorted by the average GC duration (table headers: RegionServer and Average GC Duration)
Ranking of RegionServers sorted by the number of daily read requests (table headers: RegionServer and Number of Daily Read Requests)
Ranking of RegionServers sorted by the day-to-day daily read request increment (table headers: RegionServer and Day-to-Day Daily Read Request Increment)
Ranking of RegionServers sorted by the number of daily write requests (table headers: RegionServer and Number of Daily Write Requests)
Ranking of RegionServers sorted by the day-to-day daily write request increment (table headers: RegionServer and Day-to-Day Daily Write Request Increment)
Table-related information
The Table Related Information section displays detailed information such as the hot partitions in a table, volume of data in a table, number of partitions in a table, and number of read/write requests in a table.
Details of tables that contain hot partitions
Top N tables sorted by the partition balancing degree in ascending order
Top N tables sorted by the average data volume in partitions in ascending order
Top N tables sorted by the volume of stored data
Top N tables sorted by the day-to-day data storage increment
Top N tables sorted by the number of partitions
Top N tables sorted by the day-to-day partition increment
Top N tables sorted by the number of read requests
Top N tables sorted by the day-to-day read request increment
Top N tables sorted by the number of write requests
Top N tables sorted by the day-to-day write request increment
Analysis of Hive storage resources
Analysis details
This tab displays analysis details of Hive storage resources. The analysis details describe basic information about Hive usage, such as the total number of Hive databases, total number of Hive tables, total number of files in Hive tables, and total volume of data stored in Hive. This tab also displays the identified issues such as high proportion of small files, high proportion of stored cold data, and uneven distribution of storage formats. In the issue details section, you can view the database or table in which a specific issue is identified and the method to resolve the issue.
Basic Hive information
This section displays multiple common storage metrics for the usage of Hive storage resources, including the storage resource usage trend, file quantity trend, and score trend.
Hive usage analysis
In the Hive Usage Analysis section, you can view the following information in charts:
Distribution chart for consumed storage resources in different Hive databases
Distribution chart for total volume of data stored by different Hive users
Pie chart for the distribution of files of different sizes in Hive tables
Pie chart for the distribution of cold data and hot data in Hive tables
Pie chart for the distribution of storage formats of Hive tables
Hive details
The Hive Information section displays details of Hive databases and Hive tables.
Hive database information
The Hive Database Information section displays the following information:
Hive database details
Top N Hive databases sorted by the distribution of files of different sizes
Top N Hive databases sorted by the distribution of cold data and hot data
Top N Hive databases sorted by the distribution of storage formats
The Hive Database Details section displays the following data:
Ranking of Hive databases sorted by storage resource consumption: name, consumed storage resources, day-to-day comparison, and daily increment
Ranking of Hive databases sorted by the number of files: name, number of files, day-to-day comparison, and daily increment
Ranking of Hive databases sorted by score: name and score
Ranking of Hive databases sorted by the number of partitions: name, number of partitions, day-to-day comparison, and daily increment
You can obtain the following information based on the top N Hive databases sorted by the distribution of files of different sizes:
Top N Hive databases that store the maximum number of empty files
Top N Hive databases that store the maximum number of very small files
Top N Hive databases that store the maximum number of small files
Top N Hive databases that store the maximum number of medium files
Top N Hive databases that store the maximum number of large files
Small files in Hive put pressure on the NameNode and can cause shard issues. A large number of small files may also slow down computing, so the number of small files in Hive is an important metric.
You can obtain the following information based on the top N Hive databases sorted by the distribution of cold data and hot data:
Top N Hive databases that store the maximum volume of very cold data
Top N Hive databases that store the maximum volume of cold data
Top N Hive databases that store the maximum volume of warm data
Top N Hive databases that store the maximum volume of hot data
Cold data is data that has not been accessed for a long period of time. We recommend that you store cold data in cold standby storage mode, such as the Cold Archive storage class in OSS. The distribution of cold data and hot data can help you understand cluster usage and reduce costs.
Hive supports different storage formats. Different storage formats are suitable for different use scenarios. In most cases, the mainstream columnar format reduces storage costs and improves query efficiency.
You can obtain the following information based on the top N Hive databases sorted by the distribution of storage formats:
Top N Hive databases that store the maximum volume of TextFile-formatted data
Top N Hive databases that store the maximum volume of Parquet-formatted data
Top N Hive databases that store the maximum volume of ORC-formatted data
Hive table information
The Hive Table Information section displays the following information:
Hive table details
Top N Hive tables sorted by the distribution of files of different sizes
Top N Hive tables sorted by the distribution of cold data and hot data
Top N Hive tables sorted by the distribution of storage formats
For more information, see Hive database information.