Store the logs of YARN, MapReduce, and Spark jobs - E-MapReduce

By default, EMR Hadoop clusters store job logs in the Hadoop Distributed File System (HDFS). When a pay-as-you-go cluster is released, HDFS is destroyed along with it, and all job logs are permanently lost. To keep logs available for troubleshooting after a cluster is released, redirect YARN container logs, MapReduce job history, and Spark event logs to JindoFS (JindoFileSystem) or Object Storage Service (OSS) when you create the cluster.

EMR clusters support the pay-as-you-go and subscription billing methods to meet different needs.

Prerequisites

Before you begin, ensure that you have:

An OSS bucket to store logs
(For JindoFS) A JindoFS namespace backed by OSS

How log storage works

When an application finishes, YARN aggregates the container logs from each node into a single file per node and writes these files to the configured remote log directory. The yarn.nodemanager.remote-app-log-dir parameter controls this destination.

MapReduce uses the Job History Server to archive completed job metadata. The server stores the logs of completed jobs in mapreduce.jobhistory.done-dir and buffers the logs of jobs that are not yet archived in mapreduce.jobhistory.intermediate-done-dir.

Spark uses the Spark History Server to replay job execution. It reads event logs from the path set in spark_eventlog_dir.

Configuration reference

The following tables list the parameters for each storage backend. Apply all parameters as custom software configurations at cluster creation.

JindoFS

Set these parameters in the bigboot configuration file:

Parameter	Description	Example value
`jfs.namespaces`	Namespaces supported by JindoFS. Separate multiple namespaces with commas.	`emr-jfs`
`jfs.namespaces.emr-jfs.oss.uri`	OSS storage backend for the `emr-jfs` namespace.	`oss://oss-bucket/oss-dir`
`jfs.namespaces.emr-jfs.mode`	Storage mode for the `emr-jfs` namespace. JindoFS supports block mode and cache mode.	`block`

YARN container logs and MapReduce job history

Configuration file	Parameter	Description	Example value
`yarn-site`	`yarn.nodemanager.remote-app-log-dir`	Remote directory where YARN aggregates and stores container logs after an application finishes. The log aggregation feature of YARN is enabled by default.	`jfs://emr-jfs/emr-cluster-log/yarn-apps-logs` or `oss://${oss-bucket}/emr-cluster-log/yarn-apps-logs`
`mapred-site`	`mapreduce.jobhistory.done-dir`	Directory where the Job History Server stores logs of completed Hadoop jobs.	`jfs://emr-jfs/emr-cluster-log/jobhistory/done` or `oss://${oss-bucket}/emr-cluster-log/jobhistory/done`
`mapred-site`	`mapreduce.jobhistory.intermediate-done-dir`	Directory where the Job History Server buffers logs of Hadoop jobs not yet archived.	`jfs://emr-jfs/emr-cluster-log/jobhistory/done_intermediate` or `oss://${oss-bucket}/emr-cluster-log/jobhistory/done_intermediate`

Spark History Server

Configuration file	Parameter	Description	Example value
`spark-defaults`	`spark_eventlog_dir`	Directory where the event logs of Spark jobs are stored. The Spark History Server reads the logs from this directory.	`jfs://emr-jfs/emr-cluster-log/spark-history` or `oss://${oss-bucket}/emr-cluster-log/spark-history`

Apply the configuration at cluster creation

Pass the configuration as custom software configurations when creating an EMR cluster. The console provides a Custom Software Configuration field for this JSON input, as shown in the following figure.

Store logs in JindoFS

Replace oss-bucket and the directory paths with your actual OSS bucket and preferred paths:

[
  {
    "ServiceName": "BIGBOOT",
    "FileName": "bigboot",
    "ConfigKey": "jfs.namespaces",
    "ConfigValue": "emr-jfs"
  },
  {
    "ServiceName": "BIGBOOT",
    "FileName": "bigboot",
    "ConfigKey": "jfs.namespaces.emr-jfs.oss.uri",
    "ConfigValue": "oss://oss-bucket/jindoFS"
  },
  {
    "ServiceName": "BIGBOOT",
    "FileName": "bigboot",
    "ConfigKey": "jfs.namespaces.emr-jfs.mode",
    "ConfigValue": "block"
  },
  {
    "ServiceName": "YARN",
    "FileName": "mapred-site",
    "ConfigKey": "mapreduce.jobhistory.done-dir",
    "ConfigValue": "jfs://emr-jfs/emr-cluster-log/jobhistory/done"
  },
  {
    "ServiceName": "YARN",
    "FileName": "mapred-site",
    "ConfigKey": "mapreduce.jobhistory.intermediate-done-dir",
    "ConfigValue": "jfs://emr-jfs/emr-cluster-log/jobhistory/done_intermediate"
  },
  {
    "ServiceName": "YARN",
    "FileName": "yarn-site",
    "ConfigKey": "yarn.nodemanager.remote-app-log-dir",
    "ConfigValue": "jfs://emr-jfs/emr-cluster-log/yarn-apps-logs"
  },
  {
    "ServiceName": "SPARK",
    "FileName": "spark-defaults",
    "ConfigKey": "spark_eventlog_dir",
    "ConfigValue": "jfs://emr-jfs/emr-cluster-log/spark-history"
  }
]
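Because every jfs:// path must refer to a namespace declared in jfs.namespaces, it can help to check the JSON before pasting it into the console. The following is a minimal sketch of such a check, not an official tool: the function name check_jindofs_config and the specific rules it applies are assumptions derived only from the example above.

```python
import json

# Log-directory keys used in the example above.
LOG_KEYS = (
    "mapreduce.jobhistory.done-dir",
    "mapreduce.jobhistory.intermediate-done-dir",
    "yarn.nodemanager.remote-app-log-dir",
    "spark_eventlog_dir",
)


def check_jindofs_config(entries):
    """Return a list of problems found in a custom software configuration.

    `entries` is the parsed JSON list; this is a hypothetical helper.
    """
    problems = []
    values = {e["ConfigKey"]: e["ConfigValue"] for e in entries}

    # jfs.namespaces holds a comma-separated list of namespace names.
    raw = values.get("jfs.namespaces", "")
    namespaces = [n.strip() for n in raw.split(",") if n.strip()]

    # Each declared namespace needs an OSS backend URI.
    for ns in namespaces:
        uri_key = f"jfs.namespaces.{ns}.oss.uri"
        if not values.get(uri_key, "").startswith("oss://"):
            problems.append(f"{uri_key} is missing or is not an oss:// URI")

    # Each log directory must be set and must use a known namespace.
    for key in LOG_KEYS:
        value = values.get(key)
        if value is None:
            problems.append(f"{key} is not set")
        elif value.startswith("jfs://"):
            ns = value[len("jfs://"):].split("/", 1)[0]
            if ns not in namespaces:
                problems.append(f"{key} uses undeclared namespace {ns!r}")
        elif not value.startswith("oss://"):
            problems.append(f"{key} uses neither jfs:// nor oss://")
    return problems


if __name__ == "__main__":
    # A deliberately incomplete sample: only one log directory is set.
    sample = json.loads("""
    [
      {"ServiceName": "BIGBOOT", "FileName": "bigboot",
       "ConfigKey": "jfs.namespaces", "ConfigValue": "emr-jfs"},
      {"ServiceName": "YARN", "FileName": "yarn-site",
       "ConfigKey": "yarn.nodemanager.remote-app-log-dir",
       "ConfigValue": "jfs://emr-jfs/emr-cluster-log/yarn-apps-logs"}
    ]
    """)
    for problem in check_jindofs_config(sample):
        print(problem)
```

An empty result means only that the entries are consistent with each other. It does not confirm that the bucket exists or that the cluster can write to it.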

Store logs in OSS

Replace oss_bucket and the directory paths with your actual OSS bucket and preferred paths:

[
  {
    "ServiceName": "YARN",
    "FileName": "mapred-site",
    "ConfigKey": "mapreduce.jobhistory.done-dir",
    "ConfigValue": "oss://oss_bucket/emr-cluster-log/jobhistory/done"
  },
  {
    "ServiceName": "YARN",
    "FileName": "mapred-site",
    "ConfigKey": "mapreduce.jobhistory.intermediate-done-dir",
    "ConfigValue": "oss://oss_bucket/emr-cluster-log/jobhistory/done_intermediate"
  },
  {
    "ServiceName": "YARN",
    "FileName": "yarn-site",
    "ConfigKey": "yarn.nodemanager.remote-app-log-dir",
    "ConfigValue": "oss://oss_bucket/emr-cluster-log/yarn-apps-logs"
  },
  {
    "ServiceName": "SPARK",
    "FileName": "spark-defaults",
    "ConfigKey": "spark_eventlog_dir",
    "ConfigValue": "oss://oss_bucket/emr-cluster-log/spark-history"
  }
]
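The four log entries differ between the JindoFS and OSS examples only in the base URI, so they can be generated rather than edited by hand. The following sketch assumes the directory layout used in the examples above. The function name build_log_config is hypothetical, and the BIGBOOT entries required for JindoFS are not generated.

```python
import json

# (ServiceName, FileName, ConfigKey, subdirectory) as used in the examples above.
LOG_SETTINGS = [
    ("YARN", "mapred-site", "mapreduce.jobhistory.done-dir",
     "jobhistory/done"),
    ("YARN", "mapred-site", "mapreduce.jobhistory.intermediate-done-dir",
     "jobhistory/done_intermediate"),
    ("YARN", "yarn-site", "yarn.nodemanager.remote-app-log-dir",
     "yarn-apps-logs"),
    ("SPARK", "spark-defaults", "spark_eventlog_dir",
     "spark-history"),
]


def build_log_config(base_uri):
    """Build the four log entries under base_uri (hypothetical helper)."""
    base = base_uri.rstrip("/")  # avoid a double slash when joining
    return [
        {
            "ServiceName": service,
            "FileName": file_name,
            "ConfigKey": key,
            "ConfigValue": f"{base}/{subdir}",
        }
        for service, file_name, key, subdir in LOG_SETTINGS
    ]


if __name__ == "__main__":
    # Use "jfs://emr-jfs/emr-cluster-log" instead for the JindoFS variant.
    config = build_log_config("oss://oss_bucket/emr-cluster-log")
    print(json.dumps(config, indent=2))
```

Keeping the subdirectories in one table makes it harder for the done and done_intermediate paths to drift apart when the base path changes.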

Where to find logs after the cluster is released

After the cluster is released, access the logs directly in OSS. With the OSS configuration shown above, the logs are stored in the following paths:

Log type	OSS path
YARN container logs	`oss://<oss-bucket>/emr-cluster-log/yarn-apps-logs/`
MapReduce completed job logs	`oss://<oss-bucket>/emr-cluster-log/jobhistory/done/`
MapReduce in-progress job logs	`oss://<oss-bucket>/emr-cluster-log/jobhistory/done_intermediate/`
Spark event logs	`oss://<oss-bucket>/emr-cluster-log/spark-history/`
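
Most OSS tools and SDKs take a bucket name and an object prefix rather than a full oss:// URI. The following sketch splits the paths from the table into those two parts using only the Python standard library. The bucket name my-bucket and the helper split_oss_uri are placeholders for illustration, and the listing call itself is left to whichever client you use.

```python
from urllib.parse import urlparse


def split_oss_uri(uri):
    """Split 'oss://bucket/prefix' into (bucket, prefix). Hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != "oss" or not parsed.netloc:
        raise ValueError(f"not an oss:// URI: {uri!r}")
    # Object keys in a bucket do not start with a slash.
    return parsed.netloc, parsed.path.lstrip("/")


# Paths from the table above, with a placeholder bucket name.
LOG_PATHS = {
    "YARN container logs":
        "oss://my-bucket/emr-cluster-log/yarn-apps-logs/",
    "MapReduce completed job logs":
        "oss://my-bucket/emr-cluster-log/jobhistory/done/",
    "MapReduce in-progress job logs":
        "oss://my-bucket/emr-cluster-log/jobhistory/done_intermediate/",
    "Spark event logs":
        "oss://my-bucket/emr-cluster-log/spark-history/",
}

if __name__ == "__main__":
    for log_type, uri in LOG_PATHS.items():
        bucket, prefix = split_oss_uri(uri)
        print(f"{log_type}: bucket={bucket} prefix={prefix}")
```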