E-MapReduce: View daily cluster reports and analysis

Last Updated: Feb 27, 2025

The daily cluster report shows the current health status of a cluster and provides improvement suggestions that you can act on to keep the cluster healthy.

Precautions

By default, the Hadoop cluster health check feature does not include daily cluster report analysis. To view daily cluster report analysis, you must enable EMR Doctor. For more information, see Enable EMR Doctor (Hadoop cluster type).

View report

  1. Access the monitoring diagnostics page.

    1. Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.

    2. In the top menu bar, select the region and resource group based on your actual situation.

    3. On the Cluster Management page, click the Cluster ID of the target cluster.

    4. Click the Monitoring Diagnostics tab at the top.

  2. Click the Daily Cluster Report tab to view all health diagnostic reports for the current cluster.

    The Health Status column in the Daily Cluster Reports section displays the health score (x) of the cluster. The following table describes the score ranges.

    | Health Status | Description |
    | --- | --- |
    | 0 <= x <= 60 | The cluster is in an unhealthy state. Resolve issues in the cluster at the earliest opportunity. |
    | 60 < x <= 80 | The cluster is in a sub-healthy state. We recommend that you optimize the cluster. |
    | 80 < x <= 100 | The cluster is in a healthy state and no issues need to be resolved. |

    Note

    The score indicates the health status of the cluster. Valid values range from 0 to 100.

  3. View the report details.

    Click View Report in the Operation column to view detailed inspection information for the current cluster.

    This page displays an overview of the cluster health status and basic information about the report, such as the health score, cluster ID, report ID, and diagnostics time. The diagnostic items and their overview analysis vary based on the cluster type. The overview analysis briefly summarizes the issues found in the cluster. To analyze a specific issue, view the detailed analysis of the corresponding diagnostic item.
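
The score ranges in the Health Status table can be written as a small function, for example if you record report scores in your own tracking script. This is a minimal Python sketch: the function name and the returned labels are illustrative and are not part of any EMR API.

```python
def health_status(score: float) -> str:
    """Map a daily-report health score (0 to 100) to its status band."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score <= 60:
        return "unhealthy"      # 0 <= x <= 60
    if score <= 80:
        return "sub-healthy"    # 60 < x <= 80
    return "healthy"            # 80 < x <= 100


for score in (45, 60, 72.5, 80, 93):
    print(score, health_status(score))
```

Note that the upper bound of each band is inclusive, so a score of exactly 60 is still unhealthy and a score of exactly 80 is still sub-healthy.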

Resource analysis

Compute resources

Detailed information

This page provides a detailed analysis of computing resource usage within a cluster, including the computing score, the number of scanned jobs, and the distribution of job health statuses. Additionally, it identifies issues such as low memory usage and offers information on the affected jobs to facilitate issue resolution.

Basic computing information

This section presents trend charts for cluster computing scores, memory usage in GB*Sec, and vCPU usage in VCore*Sec. It also includes the overall health score of computing tasks, the distribution of task scores, and related trend charts.

The table below provides data on cluster memory and vCPUs.

| Metric | Description |
| --- | --- |
| Cluster Memory (gb*hour) | The total cluster memory consumed by all jobs. The memory consumption of a job is an accumulated value calculated as allocated memory (GB) × runtime (hours). |
| Cluster Vcpus (core*hour) | The total cluster vCPUs consumed by all jobs. The vCPU consumption of a job is an accumulated value calculated as allocated CPU cores × runtime (hours). |
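
As a worked illustration of the two formulas, the following Python sketch sums the consumption of a few jobs. It assumes that each job holds a constant allocation for its whole runtime, which is a simplification: in practice the accumulated value changes as containers start and stop. The `Job` class, its field names, and the sample jobs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    memory_gb: float      # memory allocated to the job, in GB
    vcores: int           # CPU cores allocated to the job
    runtime_hours: float  # time the allocation was held, in hours


def memory_gb_hours(job: Job) -> float:
    # allocated memory (GB) x runtime (hours)
    return job.memory_gb * job.runtime_hours


def vcore_hours(job: Job) -> float:
    # allocated CPU cores x runtime (hours)
    return job.vcores * job.runtime_hours


jobs = [
    Job("etl_daily", memory_gb=64, vcores=16, runtime_hours=1.5),
    Job("adhoc_query", memory_gb=8, vcores=2, runtime_hours=0.25),
]

cluster_memory = sum(memory_gb_hours(j) for j in jobs)
cluster_vcpus = sum(vcore_hours(j) for j in jobs)
print(cluster_memory)  # 64 * 1.5 + 8 * 0.25 = 98.0
print(cluster_vcpus)   # 16 * 1.5 + 2 * 0.25 = 24.5
```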

Compute engine analysis

The following charts are displayed in this section:

  • Trend chart of compute engine scores

  • Trend chart of the number of compute engine jobs

  • Pie chart and trend chart of compute engine memory

  • Pie chart and trend chart of compute engine vCPUs

Compute queue information

This section presents charts of the top 20 compute queues by memory usage.

Job information

EMR Doctor analyzes jobs and highlights the key jobs that affect cluster performance. Resolving the issues identified in these jobs can improve the computing efficiency and resource utilization of the cluster, which in turn improves cost-effectiveness.

This section lists the top 50 jobs by memory consumption (GB*Sec) and the 50 jobs with the lowest scores. The following table describes the fields of each record.

| Parameter | Description |
| --- | --- |
| Task Name | The name of the task. |
| Engine Type | The engine type of the task. |
| SQL Statement | The SQL statement of the job. This field applies only to SQL-type jobs. |
| APP IDS | The application IDs of the job. For Hive on MR, a statement may have multiple APP IDs. |
| Username | The user who submitted the job. |
| Score | The score of the job. |
| Health Status | Indicates whether the job is marked for governance. |
| Suggestion | The optimization suggestion for the job. |
| Memory (gb*sec) | The total cluster memory consumed by the job. |
| Memory Usage | The average memory usage of the job. |
| CPU (vcore*sec) | The total cluster vCPUs consumed by the job. |
| CPU Usage | The average CPU utilization of the job. |
| Current Configuration | The current configuration of the job. Use the suggestions to decide how to adjust the configuration. |
| IO Information | The read/write, shuffle, and other I/O data of the job. |

YARN schedule resources

YARN schedule resource analysis provides precise insights into resource usage and job execution patterns across dimensions such as YARN engine, queue, and user.

Detailed analysis

This page offers a comprehensive analysis of YARN schedule resources, detailing memory usage, job counts, and unhealthy nodes. It highlights periods of peak and trough resource usage and job execution, aiding in understanding resource allocation through usage curves.

Basic YARN information

The Basic YARN Information section provides charts on completed, failed, and killed jobs, memory, vCPUs, and additional metrics.

  • Trend chart of YARN memory resource usage

  • Trend chart of YARN CPU resource usage

  • Trend chart of the number of running YARN jobs

YARN engine information

This section displays the following charts:

  • Pie chart of engine memory distribution

  • Pie chart of engine job distribution

YARN user information

This section displays the following charts:

  • Pie chart of user memory distribution

  • Pie chart of user job distribution

YARN queue information

This section displays the following charts:

  • Pie chart of queue memory distribution

  • Pie chart of queue job distribution

YARN unhealthy node information

The YARN Unhealthy Node Information section lists nodes that were unhealthy throughout the day, including times, duration, and health reports from YARN.

HDFS storage resources

To analyze HDFS or Hive storage resources, turn on the Storage Resource Information Collection switch on the Daily Cluster Report tab of the Monitoring Diagnostics page, or adjust storage information collection as described in the configuration instructions.

Detailed analysis

This page offers an in-depth analysis of HDFS storage resources, covering the overall state of cluster resources, such as file counts and data volume. It also identifies issues like a high proportion of small files and cold data, providing directories and resolution methods for each issue.

Basic HDFS information

The Basic HDFS Information section includes charts on data volume trends, file counts, HDFS storage scores, and more.

  • Trend chart of stored data volume

  • Trend chart of file count

  • Trend chart of HDFS storage scores

  • Metrics on file counts, data volume, small files, and cold data volume

HDFS usage analysis

The HDFS Usage Analysis section provides charts on:

  • Storage resource distribution by HDFS user

  • File count distribution by HDFS user

  • Storage resource distribution by HDFS group

  • File count distribution by HDFS group

  • Distribution of HDFS file sizes

  • Distribution of cold and hot data in HDFS

  • Data distribution in top-level HDFS directories

Distribution of files of different sizes stored in HDFS directories

A large number of small files strains the NameNode and can cause shard issues, so the number of small files is a key metric. This section shows the distribution of file sizes across directory levels. EMR Doctor supports drill-down to four directory levels.

The table below defines file sizes.

| Parameter | Description |
| --- | --- |
| Empty file | Files whose size is 0. |
| Very small file | Files whose size is greater than 0 and less than 1 MB. |
| Small file | Files whose size is greater than or equal to 1 MB and less than 128 MB. |
| Medium file | Files whose size is greater than or equal to 128 MB and less than or equal to 1 GB. |
| Large file | Files whose size is greater than 1 GB. |
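
The boundaries in the table above can be checked with a short function. The following Python sketch is illustrative only. It assumes binary units (1 MB = 1024 × 1024 bytes), which the table does not specify, and the function name is hypothetical.

```python
MB = 1024 ** 2  # assumption: binary megabyte
GB = 1024 ** 3  # assumption: binary gigabyte


def file_size_category(size_bytes: int) -> str:
    """Classify a file by size, following the table above."""
    if size_bytes < 0:
        raise ValueError("size must not be negative")
    if size_bytes == 0:
        return "empty"
    if size_bytes < 1 * MB:
        return "very small"
    if size_bytes < 128 * MB:
        return "small"
    if size_bytes <= 1 * GB:
        return "medium"   # the 1 GB upper bound is inclusive
    return "large"


for size in (0, 4096, 64 * MB, 128 * MB, 2 * GB):
    print(size, file_size_category(size))
```

A file of exactly 128 MB counts as a medium file, and a file of exactly 1 GB is still medium. Only files larger than 1 GB are classified as large.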

The Directory File Size Distribution section includes:

  • Top directories with the most empty files at a specific level

  • Top directories with the most very small files at a specific level

  • Top directories with the most small files at a specific level

  • Top directories with the most medium files at a specific level

  • Top directories with the most large files at a specific level

Each table provides details on the top directories, including specific paths, data volume, day-to-day comparison, and daily increments.
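
To make the idea of ranking directories "at a specific level" concrete, the following Python sketch groups a file listing by its ancestor directory at a chosen depth and returns the directories with the most small files. It is a sketch of the concept and does not describe how EMR Doctor computes its rankings. The listing, the function names, and the binary-unit size threshold are assumptions.

```python
from collections import Counter
from pathlib import PurePosixPath

MB = 1024 ** 2  # assumption: binary megabyte


def directory_at_level(path: str, level: int) -> str:
    """Return the ancestor directory of a file at the given depth (1 = top level).

    Files that sit above the requested depth are grouped under their own directory.
    """
    parts = PurePosixPath(path).parent.parts  # ('/', 'user', 'hive', ...)
    return "/" + "/".join(parts[1:level + 1])


def top_small_file_dirs(files, level, n=3):
    """Rank directories at `level` by the number of small files (>= 1 MB and < 128 MB)."""
    counts = Counter()
    for path, size in files:
        if 1 * MB <= size < 128 * MB:
            counts[directory_at_level(path, level)] += 1
    return counts.most_common(n)


# Hypothetical listing of (path, size in bytes) pairs.
files = [
    ("/user/hive/warehouse/t1/part-0", 5 * MB),
    ("/user/hive/warehouse/t1/part-1", 7 * MB),
    ("/user/hive/warehouse/t2/part-0", 300 * MB),
    ("/user/spark/logs/app-1", 2 * MB),
    ("/tmp/scratch/x", 0),
]

print(top_small_file_dirs(files, level=2))
print(top_small_file_dirs(files, level=4))
```

Changing `level` from 1 to 4 corresponds to the four drill-down levels mentioned above: a higher level yields more specific directories.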

Distribution of cold data and hot data in directories

Cold data refers to data that has not been accessed for an extended period. We recommend that you store cold data in a low-cost storage class, such as OSS Cold Archive. This section shows the distribution of cold, warm, and hot data across directory levels. EMR Doctor supports drill-down to four directory levels.

| Parameter | Description |
| --- | --- |
| Very cold data | Data that has not been accessed for more than three months. |
| Cold data | Data that has not been accessed for more than one month but was accessed within the last three months. |
| Warm data | Data that has not been accessed for more than seven days but was accessed within the last month. |
| Hot data | Data that was accessed within the last seven days. |
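
The four categories above depend only on how long ago the data was last accessed. The following Python sketch shows one way to express them. It assumes that one month means 30 days and three months means 90 days, which the table does not define, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def data_temperature(last_access: datetime, now: datetime) -> str:
    """Classify data by the time elapsed since it was last accessed."""
    idle = now - last_access
    if idle > timedelta(days=90):   # assumption: three months = 90 days
        return "very cold"
    if idle > timedelta(days=30):   # assumption: one month = 30 days
        return "cold"
    if idle > timedelta(days=7):
        return "warm"
    return "hot"


now = datetime(2025, 2, 27)
for days in (3, 10, 45, 200):
    print(days, data_temperature(now - timedelta(days=days), now))
```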

The Directory Cold Data and Hot Data Distribution section includes:

  • Top directories with the most very cold data at a specific level

  • Top directories with the most cold data at a specific level

  • Top directories with the most warm data at a specific level

  • Top directories with the most hot data at a specific level

Each table provides details on the top directories, including specific paths, data volume, day-to-day comparison, and daily increments.

HBase storage resources

Detailed analysis

This page provides a detailed analysis of HBase storage resources, including average cluster load, partition balance, and the health of RegionServers and user tables. It also identifies issues such as high load or low balance and offers information on the affected RegionServer, table, or partition, along with resolution methods.

Cluster overview analysis

The Cluster Overview section presents charts detailing cluster health scores, partition balancing degrees, and the distribution of partitions across RegionServers, along with trends in cluster request numbers.

  • Trend chart of cluster health scores

  • Trend chart of cluster partition balancing degrees

  • Pie chart showing the number of partitions per RegionServer

  • Trend chart of the number of cluster requests

  • Metrics on table count, partition count, node count, average load, data volume, read requests, write requests, and total requests

RegionServer-related information

This section provides detailed metrics such as cache hit ratios, average GC durations, and daily read and write request counts for RegionServers.

  • Ranking of RegionServers by cache hit ratio in ascending order

  • Ranking of RegionServers by average GC duration

  • Ranking of RegionServers by number of daily read requests

  • Ranking of RegionServers by day-to-day increment in read requests

  • Ranking of RegionServers by number of daily write requests

  • Ranking of RegionServers by day-to-day increment in write requests

Table-related information

The Table-Related Information section provides details on hot partitions, data volume, partition count, and read/write requests for tables.

  • Details of tables with hot partitions

  • Top tables by partition balancing degree in ascending order

  • Top tables by average data volume in partitions in ascending order

  • Top tables by volume of stored data

  • Top tables by day-to-day data storage increment

  • Top tables by number of partitions

  • Top tables by day-to-day partition increment

  • Top tables by number of read requests

  • Top tables by day-to-day read request increment

  • Top tables by number of write requests

  • Top tables by day-to-day write request increment

Hive storage resources

Detailed analysis

This tab details the usage of Hive storage resources, including the total number of databases and tables, file counts, and data volume. It also identifies issues such as a high proportion of small files, cold data, and uneven storage format distribution, providing databases or tables where issues are found along with resolution methods.

Basic Hive information

This section displays various metrics for Hive storage resource usage, including trends in storage usage, file quantity, and scores.

Hive usage analysis

The Hive Usage Analysis section includes charts on:

  • Storage resource distribution across Hive databases

  • Data volume distribution by Hive user

  • File size distribution in Hive tables

  • Cold and hot data distribution in Hive tables

  • Storage format distribution of Hive tables

Hive details

The Hive Information section provides details on Hive databases and tables.

Hive database information

The Hive Database Information section includes:

  • Hive database details

  • Top Hive databases by file size distribution

  • Top Hive databases by cold and hot data distribution

  • Top Hive databases by storage format distribution

The Hive Database Details section shows data on:

  • Hive databases ranked by storage resource consumption

  • Hive databases ranked by file count

  • Score ranking for Hive databases

  • Hive databases ranked by partition count

The Top Hive Databases by File Size Distribution section shows:

  • Top Hive databases with the most empty files

  • Top Hive databases with the most very small files

  • Top Hive databases with the most small files

  • Top Hive databases with the most medium files

  • Top Hive databases with the most large files

Note

A large number of small files in Hive degrades NameNode performance and causes shard issues, which slows down computation. The number of small files in Hive is therefore a key metric.

The Top Hive Databases by Cold and Hot Data Distribution section shows:

  • Top Hive databases with the most very cold data

  • Top Hive databases with the most cold data

  • Top Hive databases with the most warm data

  • Top Hive databases with the most hot data

Note

Cold data refers to infrequently accessed data. Moving cold data to a low-cost storage class, such as OSS Cold Archive, helps manage cluster storage usage and reduce costs.

Hive supports various storage formats, each suited to different scenarios. Columnar formats like Parquet and ORC generally reduce storage costs and improve query performance.

The Top Hive Databases by Storage Format Distribution section shows:

  • Top Hive databases with the most TextFile-formatted data

  • Top Hive databases with the most Parquet-formatted data

  • Top Hive databases with the most ORC-formatted data

Hive table information

The Hive Table Information section includes:

  • Hive table details

  • Top Hive tables by file size distribution

  • Top Hive tables by cold and hot data distribution

  • Top Hive tables by storage format distribution

Note

For more information, see the referenced document.

OSS storage resources

To analyze OSS storage resources, turn on the Storage Resource Information Collection switch on the Daily Cluster Report tab of the Monitoring Diagnostics page, and configure OSS storage collection as described in Enable and configure OSS storage analysis.

Detailed analysis

This page offers an in-depth look at OSS (excluding OSS-HDFS) storage resources, detailing the state of OSS Bucket resources, such as file counts and data volumes. It also highlights issues like a high proportion of small files and provides directories and solutions for each identified issue.

Basic OSS information

The Basic OSS Information section displays charts for:

  • Buckets

  • Total storage size

  • Total number of files

  • Number of small files, including empty and very small files

  • Trend chart of stored data volume

  • Trend chart of file count

OSS usage analysis

In the OSS Usage Analysis section, you can view charts for:

  • File size distribution in OSS

  • Storage volume distribution in OSS

  • Trend chart of small and large file proportions in OSS

OSS Bucket summary information

The Bucket Details section presents charts for:

  • Ranking of Buckets by storage volume

  • Ranking of Buckets by number of files

  • Ranking of Buckets by number of empty files

  • Ranking of Buckets by number of very small files

  • Ranking of Buckets by number of small files

OSS Bucket directory Top information

The presence of many small files in OSS can slow down tasks and consume computing resources. The Bucket Directory Top Information section lists the top-ranked Buckets by storage volume, file count, and small file count, including specific Bucket and directory names, file counts, and day-to-day changes. EMR Doctor allows for drill-down analysis up to four directory levels.

Below is a table defining file sizes.

| Parameter | Description |
| --- | --- |
| Empty file | Files whose size is 0. |
| Very small file | Files whose size is greater than 0 and less than 1 MB. |
| Small file | Files whose size is greater than or equal to 1 MB and less than 128 MB. |
| Medium file | Files whose size is greater than or equal to 128 MB and less than or equal to 1 GB. |
| Large file | Files whose size is greater than 1 GB. |

The Bucket Directory Top Information section displays:

  • Top directories at a specific level with the most storage volume

  • Top directories at a specific level with the largest daily increase in storage volume

  • Top directories at a specific level with the most files

  • Top directories at a specific level with the largest daily increase in file count

  • Top directories at a specific level with the most very small files

  • Top directories at a specific level with the largest daily increase in very small files

  • Top directories at a specific level with the most small files

  • Top directories at a specific level with the largest daily increase in small files