How to view daily cluster reports and understand the health status of the cluster - E-MapReduce

You can use the daily cluster report feature to understand the current health status of the cluster and make adjustments based on improvement suggestions to maintain a healthy state.

Precautions

By default, the Hadoop cluster health check feature does not include daily cluster report analysis. To view daily cluster report analysis, you must enable EMR Doctor. For more information, see Enable EMR Doctor (Hadoop cluster type).

View report

Access the monitoring diagnostics page.
1. Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.
2. In the top menu bar, select the region and resource group based on your actual situation.
3. On the Cluster Management page, click the Cluster ID of the target cluster.
4. Click the Monitoring Diagnostics tab at the top.

Click the Daily Cluster Report tab to view all health diagnostic reports for the current cluster.

The Health Status column in the Daily Cluster Reports section displays the health score of the cluster. The following table describes the score ranges, where x is the health score.

Health Status	Description
0 <= x <= 60	The cluster is in an unhealthy state. Resolve issues in the cluster at the earliest opportunity.
60 < x <= 80	The cluster is in a sub-healthy state. We recommend that you optimize the cluster.
80 < x <= 100	The cluster is in a healthy state and no issues need to be resolved.

Note

The score indicates the health status of the cluster. Valid values range from 0 to 100.
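The score ranges in the table above can be expressed as a small lookup. The following Python sketch is illustrative only: the function name `health_status` and the returned labels are assumptions made for this example, not part of any EMR API.

```python
def health_status(score: float) -> str:
    """Map a daily cluster report health score (0 to 100) to a status label.

    The thresholds follow the table above; the label strings are illustrative.
    """
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score <= 60:
        return "unhealthy"    # 0 <= x <= 60: resolve issues as soon as possible
    if score <= 80:
        return "sub-healthy"  # 60 < x <= 80: optimization is recommended
    return "healthy"          # 80 < x <= 100: no issues need to be resolved
```

Note that the boundary values 60 and 80 belong to the lower range, so a score of exactly 60 is still unhealthy.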

View the report details.

Click View Report in the Operation column to view detailed inspection information for the current cluster.

This page displays an overview of the cluster health status and basic information about the report, such as the health score, cluster ID, report ID, and diagnostics time. The diagnostic items and their analysis overview vary based on the cluster type. The analysis overview concisely summarizes the issues found in the cluster. To investigate a specific issue, view the detailed analysis of the corresponding diagnostic item.

Resource analysis

Compute resources

Detailed information

This page provides a detailed analysis of computing resource usage within a cluster, including the computing score, the number of scanned jobs, and the distribution of job health statuses. Additionally, it identifies issues such as low memory usage and offers information on the affected jobs to facilitate issue resolution.

Basic computing information

This section presents trend charts for cluster computing scores, memory usage in GB*Sec, and vCPU usage in VCore*Sec. It also includes the overall health score of computing tasks, the distribution of task scores, and related trend charts.

The table below provides data on cluster memory and vCPUs.

Metric	Description
Cluster Memory (gb*hour)	The total cluster memory consumed by all jobs. The memory consumption of a job is an accumulated value calculated by `allocated memory (GB) × runtime (Hours)`.
Cluster Vcpus (core*hour)	The total cluster vCPUs consumed by all jobs. The vCPU consumption of a job is an accumulated value calculated by `allocated CPU cores (Cores) × runtime (Hours)`.
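As a worked example of the two formulas in the table, the following Python sketch sums per-job consumption into cluster totals. The `JobUsage` class and its field names are hypothetical, and the sketch assumes that each job holds a constant allocation for its whole runtime, which is a simplification.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

SECONDS_PER_HOUR = 3600


@dataclass
class JobUsage:
    # Hypothetical record; the field names are chosen for this example only.
    allocated_memory_gb: float
    allocated_vcores: int
    runtime_hours: float


def memory_gb_hours(job: JobUsage) -> float:
    """allocated memory (GB) x runtime (hours)"""
    return job.allocated_memory_gb * job.runtime_hours


def vcpu_core_hours(job: JobUsage) -> float:
    """allocated CPU cores x runtime (hours)"""
    return job.allocated_vcores * job.runtime_hours


def cluster_totals(jobs: Iterable[JobUsage]) -> Tuple[float, float]:
    """Return (total memory in GB-hours, total vCPUs in core-hours) over all jobs."""
    jobs = list(jobs)
    return (
        sum(memory_gb_hours(j) for j in jobs),
        sum(vcpu_core_hours(j) for j in jobs),
    )


def gb_seconds_to_gb_hours(gb_seconds: float) -> float:
    """Convert a GB*Sec value, as shown in the trend charts, to GB-hours."""
    return gb_seconds / SECONDS_PER_HOUR
```

For example, a job with 8 GB and 4 cores that runs for 0.5 hours plus a job with 16 GB and 2 cores that runs for 2 hours add up to 36 GB-hours of memory and 6 core-hours of vCPUs.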

Compute engine analysis

The following charts are displayed in this section:

Trend chart of compute engine scores
Trend chart of the number of compute engine jobs
Pie chart and trend chart of compute engine memory
Pie chart and trend chart of compute engine vCPUs

Compute queue information

This section presents charts of the top 20 compute queues by memory usage.

Job information

EMR Doctor analyzes jobs and highlights the key jobs that affect cluster performance. Resolving the issues identified in these jobs can improve the computing efficiency and resource utilization of the cluster, which in turn improves cost-effectiveness.

Displayed here are the top 50 memory-consuming jobs in GB*Sec and the top 50 jobs by ascending score. The table below details each data record.

Parameter	Description
Task Name	The name of the task.
Engine Type	The engine type of the task.
SQL Statement	The SQL statement of the job. This field is displayed only for SQL-type jobs.
APP IDS	The application IDs of the job. For Hive on MR, a single statement may have multiple APP IDs.
Username	The user who submitted the job.
Score	The score of the job.
Health Status	Indicates whether the job is flagged for governance.
Suggestion	The optimization suggestion for the job.
Memory (gb*sec)	The total cluster memory consumed by the job.
Memory Usage	The average memory usage of the job.
CPU (vcore*sec)	The total cluster vCPUs consumed by the job.
CPU Usage	The average CPU utilization of the job.
Current Configuration	The current configuration of the job. You can consider how to adjust the current configuration based on the suggestions.
IO Information	The read/write, Shuffle, and other data of the job.
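The two rankings described above (top 50 jobs by memory consumption, and top 50 jobs by ascending score) amount to two sorts over the same job records. The following Python sketch shows the idea for records you have exported yourself. The dictionary keys `memory_gbsec` and `score` are hypothetical names chosen for this example and do not reflect an actual EMR Doctor export format.

```python
from typing import Dict, List


def top_memory_jobs(jobs: List[Dict], n: int = 50) -> List[Dict]:
    """Return the n jobs that consumed the most memory (GB*Sec), largest first."""
    return sorted(jobs, key=lambda job: job["memory_gbsec"], reverse=True)[:n]


def lowest_score_jobs(jobs: List[Dict], n: int = 50) -> List[Dict]:
    """Return the n jobs with the lowest scores, lowest first."""
    return sorted(jobs, key=lambda job: job["score"])[:n]
```

Jobs that appear in both lists consume a lot of memory and score poorly, so they are likely the most valuable candidates for optimization.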

YARN schedule resources

YARN schedule resource analysis provides precise insights into resource usage and job execution patterns across dimensions such as YARN engine, queue, and user.

Detailed analysis

This page offers a comprehensive analysis of YARN schedule resources, detailing memory usage, job counts, and unhealthy nodes. It highlights periods of peak and trough resource usage and job execution, aiding in understanding resource allocation through usage curves.

Basic YARN information

The Basic YARN Information section provides charts on completed, failed, and killed jobs, memory, vCPUs, and additional metrics.

Trend chart of YARN memory resource usage
Trend chart of YARN CPU resource usage
Trend chart of the number of running YARN jobs

YARN engine information

This section displays the following charts:

Pie chart of engine memory distribution
Pie chart of engine job distribution

YARN user information

This section displays the following charts:

Pie chart of user memory distribution
Pie chart of user job distribution

YARN queue information

This section displays the following charts:

Pie chart of queue memory distribution
Pie chart of queue job distribution

YARN unhealthy node information

The YARN Unhealthy Node Information section lists the nodes that were unhealthy at any time during the day, including the time of occurrence, the duration, and the health reports from YARN.

HDFS storage resources

To analyze HDFS or Hive storage resources, turn on the Storage Resource Information Collection switch on the Daily Cluster Report tab of the Monitoring Diagnostics page, or adjust storage information collection as described in the configuration instructions.

Detailed analysis

This page offers an in-depth analysis of HDFS storage resources, covering the overall state of cluster resources, such as file counts and data volume. It also identifies issues like a high proportion of small files and cold data, providing directories and resolution methods for each issue.

Basic HDFS information

The Basic HDFS Information section includes charts on data volume trends, file counts, HDFS storage scores, and more.

Trend chart of stored data volume
Trend chart of file count
Trend chart of HDFS storage scores
Metrics on file counts, data volume, small files, and cold data volume

HDFS usage analysis

The HDFS Usage Analysis section provides charts on:

Storage resource distribution by HDFS user
File count distribution by HDFS user
Storage resource distribution by HDFS group
File count distribution by HDFS group
Distribution of HDFS file sizes
Distribution of cold and hot data in HDFS
Data distribution in top-level HDFS directories

Distribution of files of different sizes stored in HDFS directories

Small files put pressure on the NameNode and cause shard issues, so the number of small files is a crucial metric. This section shows the distribution of file sizes across directory levels. EMR Doctor supports drill-down to four directory levels.

The table below defines the file size categories.

Category	Description
Empty file	Files whose size is 0.
Very small file	Files whose size is greater than 0 and less than 1 MB.
Small file	Files whose size is greater than or equal to 1 MB and less than 128 MB.
Medium file	Files whose size is greater than or equal to 128 MB and less than or equal to 1 GB.
Large file	Files whose size is greater than 1 GB.
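The size categories in the table above translate directly into a chain of comparisons. The following Python sketch is illustrative: the function name is hypothetical, and it assumes binary units (1 MB = 1024 × 1024 bytes), which the table does not state explicitly.

```python
MB = 1024 ** 2  # assumption: binary megabyte
GB = 1024 ** 3  # assumption: binary gigabyte


def file_size_category(size_bytes: int) -> str:
    """Classify a file by size, following the category table above."""
    if size_bytes < 0:
        raise ValueError("file size cannot be negative")
    if size_bytes == 0:
        return "empty file"
    if size_bytes < MB:
        return "very small file"   # greater than 0 and less than 1 MB
    if size_bytes < 128 * MB:
        return "small file"        # 1 MB or more, less than 128 MB
    if size_bytes <= GB:
        return "medium file"       # 128 MB or more, up to and including 1 GB
    return "large file"            # greater than 1 GB
```

The boundaries are asymmetric: a file of exactly 128 MB is a medium file, and a file of exactly 1 GB is also still a medium file.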

The Directory File Size Distribution section includes:

Top directories with the most empty files at a specific level
Top directories with the most very small files at a specific level
Top directories with the most small files at a specific level
Top directories with the most medium files at a specific level
Top directories with the most large files at a specific level

Each table provides details on the top directories, including specific paths, data volume, day-to-day comparison, and daily increments.
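To make the idea of a per-level ranking concrete, the following Python sketch counts very small files per directory at a chosen level and returns the top directories. It is a simplified illustration and not how EMR Doctor is implemented. The function names are hypothetical, the input is assumed to be a list of (absolute path, size in bytes) pairs that you collected yourself, and 1 MB is assumed to be 1024 × 1024 bytes.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Iterable, List, Tuple

MB = 1024 ** 2  # assumption: binary megabyte
MAX_LEVEL = 4   # the report supports drill-down to four directory levels


def directory_at_level(file_path: str, level: int) -> str:
    """Return the ancestor directory of a file at the given depth (1 = top level)."""
    parts = PurePosixPath(file_path).parent.parts  # e.g. ('/', 'user', 'hive')
    # parts[0] is the root '/', so keep at most `level` components after it.
    return str(PurePosixPath(*parts[: level + 1]))


def top_dirs_with_very_small_files(
    files: Iterable[Tuple[str, int]], level: int, n: int = 10
) -> List[Tuple[str, int]]:
    """Rank directories at `level` by their number of very small files."""
    if not 1 <= level <= MAX_LEVEL:
        raise ValueError(f"level must be between 1 and {MAX_LEVEL}")
    counts: Counter = Counter()
    for path, size in files:
        if 0 < size < MB:  # "very small file" as defined in the table above
            counts[directory_at_level(path, level)] += 1
    return counts.most_common(n)
```

Files that sit in a directory shallower than the requested level are counted under their own parent directory.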

Distribution of cold data and hot data in directories

Cold data is data that has not been accessed for an extended period. We recommend that you store cold data in a storage class intended for it, such as the OSS Cold Archive storage class. This section shows the distribution of very cold, cold, warm, and hot data across directory levels. EMR Doctor supports drill-down to four directory levels. The following table defines the data categories.

Category	Description
Very cold data	Data that has not been accessed for more than three months.
Cold data	Data that has not been accessed for more than one month but was accessed within the last three months.
Warm data	Data that has not been accessed for more than seven days but was accessed within the last month.
Hot data	Data that was accessed within the last seven days.
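The categories above can be sketched as a function of the time since the last access. The following Python example is illustrative only. The function name is hypothetical, and it assumes that one month is 30 days and three months are 90 days, because the table does not specify exact day counts.

```python
from datetime import datetime, timedelta

# Assumptions: the table speaks of months; fixed day counts are used here.
ONE_WEEK = timedelta(days=7)
ONE_MONTH = timedelta(days=30)
THREE_MONTHS = timedelta(days=90)


def data_temperature(last_access: datetime, now: datetime) -> str:
    """Classify data by how long ago it was last accessed."""
    idle = now - last_access
    if idle > THREE_MONTHS:
        return "very cold"  # not accessed for more than three months
    if idle > ONE_MONTH:
        return "cold"       # not accessed for more than one month
    if idle > ONE_WEEK:
        return "warm"       # not accessed for more than seven days
    return "hot"            # accessed within the last seven days
```

Passing `now` explicitly instead of calling `datetime.now()` inside the function keeps the classification reproducible, which is convenient when you compare reports from different days.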

The Directory Cold Data and Hot Data Distribution section includes:

Top directories with the most very cold data at a specific level
Top directories with the most cold data at a specific level
Top directories with the most warm data at a specific level
Top directories with the most hot data at a specific level

Each table provides details on the top directories, including specific paths, data volume, day-to-day comparison, and daily increments.

HBase storage resources

Detailed analysis

This page provides a detailed analysis of HBase storage resources, including average cluster load, partition balance, and the health of RegionServers and user tables. It also identifies issues such as high load or low balance and offers information on the affected RegionServer, table, or partition, along with resolution methods.

Cluster overview analysis

The Cluster Overview section presents charts detailing cluster health scores, partition balancing degrees, and the distribution of partitions across RegionServers, along with trends in cluster request numbers.

Trend chart of cluster health scores
Trend chart of cluster partition balancing degrees
Pie chart showing the number of partitions per RegionServer
Trend chart of the number of cluster requests
Metrics on table count, partition count, node count, average load, data volume, read requests, write requests, and total requests

RegionServer-related information

This section provides detailed metrics such as cache hit ratios, average GC durations, and daily read and write request counts for RegionServers.

Ranking of RegionServers by cache hit ratio in ascending order
Ranking of RegionServers by average GC duration
Ranking of RegionServers by number of daily read requests
Ranking of RegionServers by day-to-day increment in read requests
Ranking of RegionServers by number of daily write requests
Ranking of RegionServers by day-to-day increment in write requests

Table-related information

The Table-Related Information section provides details on hot partitions, data volume, partition count, and read/write requests for tables.

Details of tables with hot partitions
Top tables by partition balancing degree in ascending order
Top tables by average data volume in partitions in ascending order
Top tables by volume of stored data
Top tables by day-to-day data storage increment
Top tables by number of partitions
Top tables by day-to-day partition increment
Top tables by number of read requests
Top tables by day-to-day read request increment
Top tables by number of write requests
Top tables by day-to-day write request increment

Hive storage resources

Detailed analysis

This tab details the usage of Hive storage resources, including the total number of databases and tables, file counts, and data volume. It also identifies issues such as a high proportion of small files, cold data, and uneven storage format distribution, providing databases or tables where issues are found along with resolution methods.

Basic Hive information

This section displays various metrics for Hive storage resource usage, including trends in storage usage, file quantity, and scores.

Hive usage analysis

The Hive Usage Analysis section includes charts on:

Storage resource distribution across Hive databases
Data volume distribution by Hive user
File size distribution in Hive tables
Cold and hot data distribution in Hive tables
Storage format distribution of Hive tables

Hive details

The Hive Details section provides details on Hive databases and tables.

Hive database information

The Hive Database Information section includes:

Hive database details
Top Hive databases by file size distribution
Top Hive databases by cold and hot data distribution
Top Hive databases by storage format distribution

The Hive Database Details section shows data on:

Hive databases ranked by storage resource consumption
Hive databases ranked by file count
Score ranking for Hive databases
Hive databases ranked by partition count

The Top Hive Databases by File Size Distribution section shows:

Top Hive databases with the most empty files
Top Hive databases with the most very small files
Top Hive databases with the most small files
Top Hive databases with the most medium files
Top Hive databases with the most large files

Note

Small files in Hive can degrade NameNode performance and cause shard issues, which slows down computation. The number of small files in Hive is therefore a significant metric.

The Top Hive Databases by Cold and Hot Data Distribution section shows:

Top Hive databases with the most very cold data
Top Hive databases with the most cold data
Top Hive databases with the most warm data
Top Hive databases with the most hot data

Note

Cold data is data that is infrequently accessed. Storing cold data in a cold storage class, such as the OSS Cold Archive storage class, can help manage cluster storage usage and reduce costs.

Hive supports various storage formats, each suited to different scenarios. Columnar formats like Parquet and ORC generally reduce storage costs and improve query performance.

The Top Hive Databases by Storage Format Distribution section shows:

Top Hive databases with the most TextFile-formatted data
Top Hive databases with the most Parquet-formatted data
Top Hive databases with the most ORC-formatted data

Hive table information

The Hive Table Information section includes:

Hive table details
Top Hive tables by file size distribution
Top Hive tables by cold and hot data distribution
Top Hive tables by storage format distribution

Note

For more information, see the referenced document.

OSS storage resources

To analyze OSS storage resources, turn on the Storage Resource Information Collection switch on the Daily Cluster Report tab of the Monitoring Diagnostics page, and configure OSS storage collection as described in Enable and configure OSS storage analysis.

Detailed analysis

This page offers an in-depth look at OSS (excluding OSS-HDFS) storage resources, detailing the state of OSS Bucket resources, such as file counts and data volumes. It also highlights issues like a high proportion of small files and provides directories and solutions for each identified issue.

Basic OSS information

The Basic OSS Information section displays charts for:

Buckets
Total storage size
Total number of files
Number of small files, including empty and very small files
Trend chart of stored data volume
Trend chart of file count

OSS usage analysis

In the OSS Usage Analysis section, you can view charts for:

File size distribution in OSS
Storage volume distribution in OSS
Trend chart of small and large file proportions in OSS

OSS Bucket summary information

The Bucket Details section presents charts for:

Ranking of Buckets by storage volume
Ranking of Buckets by number of files
Ranking of Buckets by number of empty files
Ranking of Buckets by number of very small files
Ranking of Buckets by number of small files

OSS Bucket directory Top information

A large number of small files in OSS can slow down tasks and consume computing resources. The Bucket Directory Top Information section lists the top-ranked directories by storage volume, file count, and small file count, including the specific Bucket and directory names, file counts, and day-to-day changes. EMR Doctor supports drill-down analysis to four directory levels.

The table below defines the file size categories.

Category	Description
Empty file	Files whose size is 0.
Very small file	Files whose size is greater than 0 and less than 1 MB.
Small file	Files whose size is greater than or equal to 1 MB and less than 128 MB.
Medium file	Files whose size is greater than or equal to 128 MB and less than or equal to 1 GB.
Large file	Files whose size is greater than 1 GB.

The Bucket Directory Top Information section displays:

Top directories at a specific level with the most storage volume
Top directories at a specific level with the largest daily increase in storage volume
Top directories at a specific level with the most files
Top directories at a specific level with the largest daily increase in file count
Top directories at a specific level with the most very small files
Top directories at a specific level with the largest daily increase in very small files
Top directories at a specific level with the most small files
Top directories at a specific level with the largest daily increase in small files