This topic answers frequently asked questions about YARN for E-MapReduce (EMR) on ECS.
Cluster operations and configuration
What happens during a stateful cluster restart?
A stateful cluster restart covers both a ResourceManager restart and a NodeManager restart. ResourceManager maintains the basic information and status of applications. NodeManager maintains the information and status of running containers. Both components continuously sync their state to external storage systems such as Apache ZooKeeper, LevelDB, and Hadoop Distributed File System (HDFS).
After a restart, the state is automatically reloaded and recovered. Applications and containers resume without interruption. In most cases, a cluster upgrade or restart is imperceptible to running applications and containers.
How do I enable ResourceManager high availability (HA)?
On the Configure tab of the YARN service page in the EMR console, check or set these parameters:
| Parameter | Description | Default value |
|---|---|---|
| yarn.resourcemanager.ha.enabled | Enable ResourceManager HA. Set to true. | false |
| yarn.resourcemanager.ha.automatic-failover.enabled | Enable automatic failover. | true |
| yarn.resourcemanager.ha.automatic-failover.embedded | Enable embedded automatic failover. | true |
| yarn.resourcemanager.ha.curator-leader-elector.enabled | Use Curator for leader election. Set to true to use Curator. | false |
| yarn.resourcemanager.ha.automatic-failover.zk-base-path | Path where leader election information is stored. | /yarn-leader-election |
How do I check whether the ResourceManager service is normal?
Three methods, in order of increasing thoroughness:
1. Check ResourceManager HA status. In an HA cluster, exactly one ResourceManager process must be Active. Verify that haState is ACTIVE or STANDBY and that haZooKeeperConnectionState is CONNECTED.
- Command-line interface (CLI): Run `yarn rmadmin -getAllServiceState`.
- RESTful API: Access `http://<rmAddress>/ws/v1/cluster/info`.
2. Check YARN application status. Run yarn application -list and look for applications stuck in the SUBMITTED or ACCEPTED state.
3. Submit a test application. Confirm that a new application can run and complete:
hadoop jar <hadoop_home>/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar sleep -m 1 -mt 1000 -r 0

To submit to a specific queue, add `-Dmapreduce.job.queuename` between `sleep` and `-m`. The default queue is `default`.
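The HA status check in method 1 can be sketched against the JSON that the `/ws/v1/cluster/info` endpoint returns. The `check_rm_health` helper and the sample payload below are illustrative assumptions, not part of YARN:

```python
# Sketch: evaluate ResourceManager health from /ws/v1/cluster/info JSON.
# Field names follow the ResourceManager REST API ("clusterInfo" object);
# the payload here is a made-up example.

def check_rm_health(cluster_info):
    """Return True if the RM reports a valid HA state and a live ZK session."""
    info = cluster_info["clusterInfo"]
    ha_ok = info.get("haState") in ("ACTIVE", "STANDBY")
    zk_ok = info.get("haZooKeeperConnectionState") == "CONNECTED"
    return ha_ok and zk_ok

# Illustrative payload, as an active ResourceManager might return it.
sample = {"clusterInfo": {"haState": "ACTIVE",
                          "haZooKeeperConnectionState": "CONNECTED"}}
print(check_rm_health(sample))  # True
```

In practice you would fetch the JSON with an HTTP client and pass the parsed response to a check like this.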
What parameters control the maximum resources for a single task or container?
Maximum resources per task or container depend on cluster-level and queue-level scheduler settings:
| Parameter | Description | Default value |
|---|---|---|
| yarn.scheduler.maximum-allocation-mb | Cluster-level maximum memory (MB). | The available memory of the core or task node group with the largest memory size. Equals the yarn.nodemanager.resource.memory-mb value for that node group. |
| yarn.scheduler.maximum-allocation-vcores | Cluster-level maximum CPU (vCores). | 32 |
| yarn.scheduler.capacity.<queue-path>.maximum-allocation-mb | Queue-level maximum memory (MB). Overrides the cluster-level setting for this queue. | Not configured by default. |
| yarn.scheduler.capacity.<queue-path>.maximum-allocation-vcores | Queue-level maximum CPU (vCores). Overrides the cluster-level setting for this queue. | Not configured by default. |
If a request exceeds the maximum, the exception InvalidResourceRequestException: Invalid resource request... appears in application logs.
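The limit logic can be sketched as follows. `effective_max` and `validate_request` are illustrative helpers that only mimic the scheduler-side check, not YARN APIs:

```python
# Sketch: how the scheduler-side limit check behaves. A request whose memory
# or vCores exceed the effective maximum is rejected (YARN raises
# InvalidResourceRequestException); this mimics that decision locally.

def effective_max(cluster_max, queue_max=None):
    """Queue-level setting, when present, overrides the cluster-level one."""
    return queue_max if queue_max is not None else cluster_max

def validate_request(mem_mb, vcores, max_mb, max_vcores):
    if mem_mb > max_mb or vcores > max_vcores:
        raise ValueError(
            f"Invalid resource request: <memory:{mem_mb}, vCores:{vcores}> "
            f"exceeds maximum <memory:{max_mb}, vCores:{max_vcores}>")

max_mb = effective_max(8192)                 # no queue override configured
validate_request(mem_mb=4096, vcores=2, max_mb=max_mb, max_vcores=32)  # passes
```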
What do I do if YARN configuration changes do not take effect?
Different configuration files require different follow-up actions:
| Configuration file | Category | Required action |
|---|---|---|
| capacity-scheduler.xml, fair-scheduler.xml | Scheduler configurations | Run the refresh_queues operation on ResourceManager. These are hot-update configurations. |
| yarn-env.sh, yarn-site.xml, mapred-env.sh, mapred-site.xml | YARN component configurations | Restart the associated component. |
Component restart examples:
- Restart ResourceManager after modifying `YARN_RESOURCEMANAGER_HEAPSIZE` in `yarn-env.sh` or `yarn.resourcemanager.nodes.exclude-path` in `yarn-site.xml`.
- Restart NodeManager after modifying `YARN_NODEMANAGER_HEAPSIZE` in `yarn-env.sh` or `yarn.nodemanager.log-dirs` in `yarn-site.xml`.
- Restart MRHistoryServer after modifying `MAPRED_HISTORYSERVER_OPTS` in `mapred-env.sh` or `mapreduce.jobhistory.http.policy` in `mapred-site.xml`.
Queue and resource management
How do I configure a hot update?
Hot updates require Hadoop 3.2.0 or later.
Step 1: Set key parameters.
On the Configure tab of the YARN service page in the EMR console:
| Parameter | Description | Recommended value |
|---|---|---|
| yarn.scheduler.configuration.store.class | Backing store type. Set to fs to use a file system. | fs |
| yarn.scheduler.configuration.max.version | Maximum number of stored configuration files. Excess files are deleted automatically. | 100 |
| yarn.scheduler.configuration.fs.path | Storage path for capacity-scheduler.xml. If unset, a path is created automatically. Without a prefix, the relative path of the default file system is used. | /yarn/<Cluster name>/scheduler/conf |
Replace <Cluster name> with your actual cluster name. Multiple clusters that share a YARN service may use the same distributed storage.
Step 2: View the current scheduler configuration.
- RESTful API: Access `http://<rm-address>/ws/v1/cluster/scheduler-conf`.
- HDFS: Browse `${yarn.scheduler.configuration.fs.path}/capacity-scheduler.xml.<timestamp>`. The file with the largest timestamp is the latest configuration.
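Picking the latest configuration file from the store can be sketched as follows (the file names are illustrative):

```python
# Sketch: select the latest capacity-scheduler.xml.<timestamp> file,
# mirroring the "largest timestamp wins" rule described above.

def latest_conf(filenames):
    """Return the file name with the numerically largest timestamp suffix."""
    return max(filenames, key=lambda n: int(n.rsplit(".", 1)[-1]))

files = ["capacity-scheduler.xml.1700000000000",
         "capacity-scheduler.xml.1700000500000",
         "capacity-scheduler.xml.1699999000000"]
print(latest_conf(files))  # capacity-scheduler.xml.1700000500000
```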
Step 3: Update configurations through the API.
Example: modify yarn.scheduler.capacity.maximum-am-resource-percent and delete yarn.scheduler.capacity.xxx (remove the value field to delete a parameter):
curl -X PUT -H "Content-type: application/json" 'http://<rm-address>/ws/v1/cluster/scheduler-conf' -d '
{
"global-updates": [
{
"entry": [{
"key":"yarn.scheduler.capacity.maximum-am-resource-percent",
"value":"0.2"
},{
"key":"yarn.scheduler.capacity.xxx"
}]
}
]
}'

How do I handle uneven resource distribution among applications in a queue?
This requires Hadoop 2.8.0 or later.
Large jobs often consume all resources in a queue, starving smaller jobs. Two changes fix this:
1. Switch the queue ordering policy from FIFO to fair.
Change yarn.scheduler.capacity.<queue-path>.ordering-policy from the default fifo to fair.
First In, First Out (FIFO) scheduler and fair scheduler are two types of schedulers in YARN.
Optionally, set yarn.scheduler.capacity.<queue-path>.ordering-policy.fair.enable-size-based-weight. The default is false, which sorts jobs by resource usage in ascending order. Set to true to sort by the quotient of resource usage divided by resource demand in ascending order.
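A minimal sketch of the difference between the two sort orders (the application names and numbers are made up):

```python
# Sketch: the two fair-policy orderings described above. With
# enable-size-based-weight=false, apps are ordered by current usage;
# with true, by usage divided by demand. Lower values are scheduled first.

apps = [
    {"name": "small", "usage": 100, "demand": 110},   # almost satisfied
    {"name": "big",   "usage": 300, "demand": 1000},  # still very hungry
]

by_usage = sorted(apps, key=lambda a: a["usage"])
by_weighted = sorted(apps, key=lambda a: a["usage"] / a["demand"])

print([a["name"] for a in by_usage])     # ['small', 'big']
print([a["name"] for a in by_weighted])  # ['big', 'small']
```

With size-based weight enabled, the large job that has received little of its demand moves ahead of the nearly satisfied small job.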
2. Enable intra-queue resource preemption.
Inter-queue resource preemption is enabled by default and cannot be disabled. To also enable intra-queue preemption, configure these parameters:
| Parameter | Description | Recommended value |
|---|---|---|
| yarn.resourcemanager.scheduler.monitor.enable | Enable preemption. Configure on the yarn-site tab. All other preemption parameters go on the capacity-scheduler tab. | true |
| yarn.resourcemanager.monitor.capacity.preemption.intra-queue-preemption.enabled | Enable intra-queue preemption. | true |
| yarn.resourcemanager.monitor.capacity.preemption.intra-queue-preemption.preemption-order-policy | Intra-queue preemption policy. Default: userlimit_first. | priority_first |
| yarn.scheduler.capacity.<queue-path>.disable_preemption | Disable inter-queue preemption for this queue. Default: false. Set to true to protect the queue from preemption. Child queues inherit this setting. | true |
| yarn.scheduler.capacity.<queue-path>.intra-queue-preemption.disable_preemption | Disable intra-queue preemption for this queue. Default: false. Set to true to disable. Child queues inherit this setting. | true |
How do I view the resource usage of a queue?
Check the Used Capacity metric on the YARN web UI. Used Capacity is the percentage of resources consumed by a queue relative to all resources allocated to that queue. The percentage of memory resources and the percentage of vCores are calculated separately. The higher value becomes the Used Capacity value.
To view it:
Open the YARN web UI. For details, see Access the web UIs of open source components.
On the All Applications page, click the ID of a specific job.
Click the desired queue in the Queue row.
In the Application Queues section, view the resource usage.
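The Used Capacity calculation described above can be sketched as follows (the function and values are illustrative):

```python
# Sketch: memory and vCore percentages are computed separately against the
# queue's allocation, and the larger one is reported as Used Capacity.

def used_capacity(used_mem_mb, queue_mem_mb, used_vcores, queue_vcores):
    mem_pct = 100.0 * used_mem_mb / queue_mem_mb
    vcore_pct = 100.0 * used_vcores / queue_vcores
    return max(mem_pct, vcore_pct)

# A queue allocated 10240 MB / 8 vCores, currently using 4096 MB / 6 vCores:
# memory is at 40%, vCores at 75%, so Used Capacity is 75%.
print(used_capacity(4096, 10240, 6, 8))  # 75.0
```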
Application status and troubleshooting
How do I obtain the status of an application?
| Information | How to access |
|---|---|
| Basic information (ID, User, Name, Application Type, State, Queue, App-Priority, StartTime, FinishTime, FinalStatus, Running Containers, Allocated CPU VCores, Allocated Memory MB, Diagnostics) | - Applications page: http://<rmAddress>/cluster/apps - Application details page: http://<rmAddress>/cluster/app/<applicationId> - Application attempt details page: http://<rmAddress>/cluster/appattempt/<appAttemptId> - Application RESTful API: http://<rmAddress>/ws/v1/cluster/apps/<applicationId> - Application attempt RESTful API: http://<rmAddress>/ws/v1/cluster/apps/<applicationId>/appattempts |
| Queue information | - Schedulers page (expand a leaf node): http://<rmAddress>/cluster/scheduler - Scheduler RESTful API: http://<rmAddress>/ws/v1/cluster/scheduler |
| Container logs (running applications) | - NodeManager Log page: http://<nmHost>:8042/node/containerlogs/<containerId>/<user> - On-disk subdirectory: find the <containerId> subdirectory under ${yarn.nodemanager.local-dirs} |
| Container logs (finished applications) | - CLI: yarn logs -applicationId <applicationId> -appOwner <user> -containerId <containerId> - HDFS: hadoop fs -ls /logs/<user>/logs/<applicationId> |
How do I troubleshoot issues for an application?
Follow this progressive workflow:
Step 1: Check the application state.
View the state on the application details page or through the RESTful API (http://<rmAddress>/ws/v1/cluster/apps/<applicationId>). Diagnose based on the state:
Unknown state:
The application either failed before submission to YARN or cannot reach ResourceManager.
Check application submission logs for issues in client components such as BRS and FlowAgent.
If the network connection is abnormal, this error appears on the client:
com.aliyun.emr.flow.agent.common.exceptions.EmrFlowException: ###[E40001,RESOURCE_MANAGER]: Failed to access to resource manager, cause: The stream is closed
NEW_SAVING:
Application info is being written to the ZooKeeper state store. If stuck here, check whether ZooKeeper is healthy. For ZooKeeper read/write errors, see What do I do if ResourceManager cannot switch from Standby to Active?
SUBMITTED:
Rare under normal conditions. In large clusters, lock contention in Capacity Scheduler can cause this. See YARN-9618 for optimization details.
ACCEPTED -- check the diagnostic message:
| Error message | Cause | Solution |
|---|---|---|
| Queue's AM resource limit exceeded | The sum of used ApplicationMaster (AM) resources and requested AM resources exceeds the queue limit. The constraint is: ${Used Application Master Resources} + ${AM Resource Request} < ${Max Application Master Resources}. | Increase the AM resource limit. For example, set yarn.scheduler.capacity.<queue-path>.maximum-am-resource-percent to 0.5. |
| User's AM resource limit exceeded | The sum of a specific user's used AM resources and requested AM resources exceeds the per-user queue limit. | Modify yarn.scheduler.capacity.<queue-path>.user-limit-factor and yarn.scheduler.capacity.<queue-path>.minimum-user-limit-percent. |
| AM container is launched, waiting for AM container to Register with RM | The AM started but initialization is not complete (for example, a ZooKeeper connection timed out). | Check the AM logs for the root cause. |
| Application is Activated, waiting for resources to be assigned for AM | Insufficient resources for the AM. | Proceed to Step 3 to analyze resource limits. |
RUNNING:
If the application is running but appears stuck, proceed to Step 2 to check whether container resource requests are pending.
FAILED -- check the diagnostic message:
| Error message | Cause | Solution |
|---|---|---|
| Maximum system application limit reached, cannot accept submission of application | Running applications exceed yarn.scheduler.capacity.maximum-applications (default: 10000). | Check Java Management Extensions (JMX) metrics for the number of running applications per queue. Investigate repeatedly submitted applications. Increase the parameter if all applications are legitimate. |
| Application XXX submitted by user YYY to unknown queue: ZZZ | The target queue does not exist. | Submit to an existing leaf queue. |
| Application XXX submitted by user YYY to non-leaf queue: ZZZ | The target queue is a parent queue. | Submit to an existing leaf queue. |
| Queue XXX is STOPPED. Cannot accept submission of application: YYY | The queue is in the STOPPED or DRAINING state. | Submit to a queue in the RUNNING state. |
| Queue XXX already has YYY applications, cannot accept submission of application: ZZZ | Applications in the queue reached the upper limit. | 1. Check for repeatedly submitted applications. 2. Increase yarn.scheduler.capacity.<queue-path>.maximum-applications. |
| Queue XXX already has YYY applications from user ZZZ cannot accept submission of application: AAA | A specific user's applications in the queue reached the upper limit. | 1. Check for repeatedly submitted applications. 2. Modify yarn.scheduler.capacity.<queue-path>.maximum-applications, yarn.scheduler.capacity.<queue-path>.minimum-user-limit-percent, and yarn.scheduler.capacity.<queue-path>.user-limit-factor. |
Step 2: Verify that YARN resource allocation is pending.
On the applications list page, click the application ID.
On the application details page, click the application attempt ID.
Check the Total Outstanding Resource Requests list for pending resources. You can also query pending resources through the RESTful API.
If no pending resources exist, the YARN resource allocation is complete. Exit this workflow and check the AM resource allocation.
If pending resources exist, proceed to Step 3.
Step 3: Check resource limits.
Examine cluster and queue resources, specifically the Effective Max Resource and Used Resources values.
Check whether cluster resources, queue resources, or parent queue resources are fully consumed.
Check whether any resource dimension in a leaf queue is close to or at its upper limit.
If cluster resource usage is high (for example, above 85%), the speed at which resources are allocated to applications may decrease. Two common reasons:
- Most machines have no available resources, which triggers container reservations. When many resources are tied up in reservations, allocation slows down.
- Memory and CPU resources are imbalanced across machines: some machines have idle memory but no idle CPU, and vice versa.
Step 4: Check whether allocated containers fail to start.
On the App Attempt page of the YARN web UI, observe the number of allocated containers and changes over a short period. If a container fails to start, troubleshoot from NodeManager logs or the container logs.
Step 5: Enable debug-level logging for deeper investigation.
On the Log Level page of the YARN web UI at http://RM_IP:8088/logLevel, change the log level for org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity to DEBUG.
Enable DEBUG only while reproducing the issue, and revert to INFO after a few tens of seconds. DEBUG logging generates a large volume of logs very quickly.
ResourceManager issues
What do I do if ResourceManager cannot switch from Standby to Active?
First, verify the HA parameters. These three parameters must all be true:
| Parameter | Required value |
|---|---|
| yarn.resourcemanager.ha.enabled | true |
| yarn.resourcemanager.ha.automatic-failover.enabled | true |
| yarn.resourcemanager.ha.automatic-failover.embedded | true |
If those are correct, investigate ZooKeeper:
- Check whether ZooKeeper is healthy.
- ZooKeeper client buffer overflow. ResourceManager logs contain `Zookeeper error len*** is out of range!` or `Unreasonable length = ***`. Fix: On the Configure tab of the YARN service page in the EMR console, click the yarn-env tab and set `yarn_resourcemanager_opts` to `-Djute.maxbuffer=4194304`. Then restart ResourceManager.
- ZooKeeper server buffer overflow. ZooKeeper logs contain `Exception causing close of session 0x1000004d5701b6a: Len error ***`. Fix: Add or update the `-Djute.maxbuffer=` parameter for each ZooKeeper node to increase the buffer limit (unit: bytes).
- Ephemeral ZooKeeper node stuck. If neither ResourceManager nor ZooKeeper logs show exceptions, check whether the leader election ephemeral node is occupied by a stale session: run the `stat` command in the ZooKeeper CLI on the znode `${yarn.resourcemanager.zk-state-store.parent-path}/${yarn.resourcemanager.cluster-id}/ActiveStandbyElectorLock`. Fix: Switch to Curator-based leader election. On the yarn-site tab, add or set `yarn.resourcemanager.ha.curator-leader-elector.enabled` to `true`. Then restart ResourceManager.
What do I do if an out of memory (OOM) issue occurs in ResourceManager?
Identify the OOM type from the ResourceManager logs.
Error: Java heap space, GC overhead limit exceeded, or repeated full garbage collection (GC)
The Java virtual machine (JVM) heap memory is exhausted. ResourceManager keeps many resident objects in memory: clusters, queues, applications, containers, and nodes. The heap memory these objects consume grows with cluster size. Historical application data also accumulates over time. Even a single-node cluster consumes memory for historical data. Minimum recommended heap memory for ResourceManager: 2 GB.
Fixes:
- Increase heap memory by modifying `YARN_RESOURCEMANAGER_HEAPSIZE` in `yarn-env.sh`.
- Reduce the number of stored historical applications by modifying `yarn.resourcemanager.max-completed-applications` in `yarn-site.xml` (default: `10000`).
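As an illustration of the second fix, the corresponding yarn-site.xml fragment might look like this; the value 3000 is an example, not a recommendation:

```xml
<!-- yarn-site.xml: keep fewer completed applications in ResourceManager
     memory (illustrative value; the default is 10000) -->
<property>
  <name>yarn.resourcemanager.max-completed-applications</name>
  <value>3000</value>
</property>
```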
Error: unable to create new native thread
The node running ResourceManager has reached the system thread limit. The maximum number of threads depends on the per-user limit and the system process identifier (PID) limit.
Check the limits:
ulimit -a | grep "max user processes"
cat /proc/sys/kernel/pid_max

Find the top 10 processes by thread count:

ps -eLf | awk '{print $2}' | uniq -c | awk '{print $2"\t"$1}' | sort -nrk2 | head

The output shows [Process ID] [Thread count].
Fixes:
If the thread or PID limit is too low, increase it in system settings. Small-spec nodes typically need tens of thousands; large-spec nodes need hundreds of thousands.
If limits are reasonable, investigate the processes that consume the most threads.
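As an alternative sketch of the same inspection, thread counts can be read from /proc directly (Linux only; the helper below is illustrative and returns an empty list on systems without /proc):

```python
# Sketch: count threads per process from /proc, similar to the ps pipeline
# above. Each /proc/<pid>/status file has a "Threads:" line.
import os

def thread_counts(top_n=10):
    """Return [(pid, thread_count)] for the top_n processes by thread count."""
    counts = []
    if not os.path.isdir("/proc"):
        return counts  # non-Linux system
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/status") as f:
                for line in f:
                    if line.startswith("Threads:"):
                        counts.append((int(entry), int(line.split()[1])))
                        break
        except OSError:
            continue  # the process exited while we were scanning
    return sorted(counts, key=lambda c: c[1], reverse=True)[:top_n]

for pid, n in thread_counts():
    print(pid, n)
```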
What do I do if the RPC version mismatch error appears?
The error message Exception while invoking getClusterNodes...Trying to fail over immediately indicates the active ResourceManager cannot be accessed. ResourceManager logs contain:
WARN org.apache.hadoop.ipc.Server: Incorrect header or version mismatch from 10.33.**.**:53144 got version 6 expected version 9

An early version of Hadoop is in use. The remote procedure call (RPC) version used by the application client is incompatible with the Hadoop version on the cluster.
Fix: Use a Hadoop version compatible with the application client's RPC version.
NodeManager issues
Why does localization fail when a node starts a job, and why are job logs unable to be collected or deleted?
NodeManager logs contain:
java.io.IOException: Couldn't create proxy provider class org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider

The HDFS configurations are incorrect. This exception is a wrapper, not the root cause. Enable debug-level logging to find the real issue:
- In a Hadoop client CLI, enable debugging with `export HADOOP_LOGLEVEL=DEBUG`, then run a command such as `hadoop fs -ls /`.
- In a runtime environment that uses Log4j, add this line to the Log4j configuration: `log4j.logger.org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider=DEBUG`
A common root cause: the NameServices configuration was modified (for example, emr-cluster changed to hadoop-emr-cluster), but new nodes from a scale-out still use the original NameServices configuration.
Fix: On the Configure tab of the HDFS service page in the EMR console, verify that the parameters are correctly configured.
How do I handle a resource localization exception?
Symptoms:
The AM container fails to start with this diagnostic:
Application application_1412960082388_788293 failed 2 times due to AM Container for appattempt_1412960082388_788293_000002 exited with exitCode: -1000 due to: EPERM: Operation not permitted

NodeManager logs show an error when decompressing a downloaded resource package:

INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs://hadoopnnvip.cm6:9000/user/heyuan.lhy/apv/small_apv_20141128.tar.gz, 1417144849604, ARCHIVE, null },pending,[(container_1412960082388_788293_01_000001)],14170282104675332,DOWNLOADING}
EPERM: Operation not permitted
  at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmodImpl(Native Method)
  at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmod(NativeIO.java:226)
  at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:629)
  ...
The application resource package contains soft links, which cause the chmod operation to fail.
Fix: Remove the soft links from the application resource package.
What do I do if the error "No space left on device" appears and a container fails to start or run?
Check these causes in order:
- Disk space. Verify that the disk has available space.
- cgroups `cgroup.clone_children` setting. Check the value in `/sys/fs/cgroup/cpu/hadoop-yarn/` and `/sys/fs/cgroup/cpu/`. If `cgroup.clone_children` is `0`, change it to `1`: `echo 1 > /sys/fs/cgroup/cpu/cgroup.clone_children`. If the issue persists, check the `cpuset.mems` or `cpuset.cpus` file at the same directory level; the value in the `hadoop-yarn` directory must match the value in the upper-level directory.
- cgroups subdirectory limit. Check whether the number of subdirectories under the cgroups directory exceeds the upper limit of 65,535. If it does, review the `yarn.nodemanager.linux-container-executor.cgroups.delete-delay-ms` and `yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms` parameters in the YARN configuration file.
Domain names fail to be resolved in NodeManager or during job execution. What do I do?
The error message:
java.net.UnknownHostException: Invalid host name: local host is: (unknown)

Check these causes in order:
- DNS configuration. Verify that the Domain Name System (DNS) server is correctly configured: `cat /etc/resolv.conf`
- Firewall rules. Check whether the required rules are configured for port 53. If the rules are configured, disable the firewall.
- NSCD service. Check whether the Name Service Cache Daemon (NSCD) service is running: `systemctl status nscd`. If NSCD is running, stop it: `systemctl stop nscd`
What do I do if an OOM issue occurs in NodeManager?
Identify the OOM type from the NodeManager logs.
Error: Java heap space, GC overhead limit exceeded, or repeated full GC
NodeManager has few resident objects (the current node, applications, and containers), but the cache and buffer of an external shuffle service can consume significant heap memory. Relevant parameters include:
- Spark: `spark.shuffle.service.index.cache.size`, `spark.shuffle.file.buffer`
- MapReduce: `mapreduce.shuffle.ssl.file.buffer.size`, `mapreduce.shuffle.transfer.buffer.size`
The heap memory consumed by the shuffle service is proportional to the number of applications and containers using it. Larger nodes with more tasks require more memory. Minimum recommended heap memory for NodeManager: 1 GB.
Fixes:
- Increase heap memory by modifying `YARN_NODEMANAGER_HEAPSIZE` in `yarn-env.sh`.
- Check whether the shuffle service cache settings are reasonable. For example, the Spark shuffle cache should not consume most of the heap.
Error: Direct buffer memory
Off-heap memory overflowed, typically caused by external shuffle services that use NIO DirectByteBuffer for RPCs.
Off-heap memory consumption is proportional to the number of applications and containers using shuffle services. For nodes running many shuffle-intensive tasks, the off-heap memory allocation may be too small.
Fix: Check the `-XX:MaxDirectMemorySize` value in `YARN_NODEMANAGER_OPTS` in `yarn-env.sh`. If it is not configured, the off-heap memory size defaults to the heap memory size. Increase the value if it is too small.
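A hedged sketch of what such a setting might look like in yarn-env.sh; the 2g value is an example, not a recommendation:

```shell
# yarn-env.sh: cap NodeManager direct (off-heap) memory.
# The 2g value is illustrative; size it to your shuffle workload.
export YARN_NODEMANAGER_OPTS="$YARN_NODEMANAGER_OPTS -XX:MaxDirectMemorySize=2g"
```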
Error: unable to create new native thread
Same root cause and solution as the ResourceManager OOM thread issue. See What do I do if an OOM issue occurs in ResourceManager?
After an ECS instance restarts, NodeManager fails to start because the cgroups directory is missing. What do I do?
Error message:
ResourceHandlerException: Unexpected: Cannot create yarn cgroup Subsystem:cpu Mount point:/proc/mounts User:hadoop Path:/sys/fs/cgroup/cpu/hadoop-yarn

The ECS instance restarted abnormally, likely due to a kernel defect. This is a known issue in kernel version 4.19.91-21.2.al7.x86_64. After the restart, the cgroups data in memory is deleted, which invalidates the cgroups.
Fix: Add a bootstrap script to your node groups that creates the cgroups directory at startup and persists it through rc.local:
# enable cgroups
mkdir -p /sys/fs/cgroup/cpu/hadoop-yarn
chown -R hadoop:hadoop /sys/fs/cgroup/cpu/hadoop-yarn
# enable cgroups after reboot
echo "mkdir -p /sys/fs/cgroup/cpu/hadoop-yarn" >> /etc/rc.d/rc.local
echo "chown -R hadoop:hadoop /sys/fs/cgroup/cpu/hadoop-yarn" >> /etc/rc.d/rc.local
chmod +x /etc/rc.d/rc.local

What do I do if NodeManager resource configurations do not take effect after saving and restarting?
The yarn.nodemanager.resource.cpu-vcores and yarn.nodemanager.resource.memory-mb parameters were modified and saved, but the changes do not take effect after restarting NodeManager.
CPU cores and memory size can vary across instances in a node group. These parameters must be set at the node group level, not the cluster level.
Fix: In the EMR console, go to the Configure tab of the YARN service page. Select Node Group Configuration from the drop-down list next to the search box. Select the node group that NodeManager belongs to. Then modify yarn.nodemanager.resource.cpu-vcores and yarn.nodemanager.resource.memory-mb. For details, see Manage configuration items.
What do I do if a node is marked as unhealthy?
Two possible causes:
1. Disk health check failure. If the ratio of healthy directories to total directories on a node falls below yarn.nodemanager.disk-health-checker.min-healthy-disks (default: 0.25), the node is marked as unhealthy. For a node with four disks, all four disks must have abnormal directories before the node is marked unhealthy. Otherwise, the report only shows "local-dirs are bad" or "log-dirs are bad". See What do I do if "local-dirs are bad" or "log-dirs are bad" appears?
2. NodeManager health check script. By default, the script is disabled. Enable it by configuring yarn.nodemanager.health-checker.script.path in yarn-site.xml. If this script detects a problem, resolve the issue based on the custom script logic.
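The disk health decision in cause 1 can be sketched as follows (the helper is illustrative):

```python
# Sketch: a node is marked unhealthy only when the fraction of healthy
# local/log directories drops below
# yarn.nodemanager.disk-health-checker.min-healthy-disks (default 0.25).

def node_is_healthy(healthy_dirs, total_dirs, min_healthy_fraction=0.25):
    return (healthy_dirs / total_dirs) >= min_healthy_fraction

# Four local-dirs: the node stays "healthy" (with bad-dir warnings)
# until every directory has gone bad.
print(node_is_healthy(1, 4))  # True  -> only "local-dirs are bad" reported
print(node_is_healthy(0, 4))  # False -> node marked unhealthy
```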
What do I do if "local-dirs are bad" or "log-dirs are bad" appears?
The disk health check (enabled by default) periodically checks whether local-dirs (task cache directory for files and intermediate data) and log-dirs (task log directory) meet these conditions:
- Readable
- Writable
- Executable
- Disk usage is below `yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage` (default: 90%)
- Remaining available disk space exceeds `yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb` (default: 0)
If any condition is not met, the directory is marked as bad.
Fixes:
- Insufficient disk space (most common). Scale out disks if any of these apply: the node has high specs with many tasks, disk space is small, task data or intermediate data is large, or task logs are large.
- Cache too large. Check `yarn.nodemanager.localizer.cache.target-size-mb` in `yarn-site.xml`. If the cache-to-disk ratio is too high, cached task data can occupy most of the disk, because the cache is cleaned only after the threshold is exceeded.
- Damaged disk. See Replace a damaged local disk for an EMR cluster.
Component log management
Why do a large number of .out log files accumulate for YARN service components?
Certain Hadoop dependency libraries call Java logging APIs instead of Log4j, so they do not support log rotation. The standard error (stderr) output of daemon components is redirected to .out log files. Without an automatic cleanup mechanism, these logs accumulate and can fill the data disk.
Fix: Use the head and tail commands with timestamp information in the logs to identify which Java logging API logs are consuming the most space. In most cases, these are INFO-level logs from dependency libraries that do not affect component function. Disable the generation of INFO-level logs for the offending packages.
Example: Disable Jersey log generation for Timeline Server.
1. Monitor `.out` log files for `timelineserver-` in the YARN log directory. The output contains records generated by the `com.sun.jersey` package.
   - DataLake cluster path: `/var/log/emr/yarn/`
   - Hadoop cluster path: `/mnt/disk1/log/hadoop-yarn`
   Example: `tail /var/log/emr/yarn/*-hadoop-timelineserver-*.out`
2. On the Timeline Server node, create a configuration file as the root user to disable Jersey logging: `sudo su root -c "echo 'com.sun.jersey.level = OFF' > $HADOOP_CONF_DIR/off-logging.properties"`
3. On the Configure tab of the YARN service page in the EMR console, find `YARN_TIMELINESERVER_OPTS` (or `yarn_timelineserver_opts` in a Hadoop cluster) and add this to the value: `-Djava.util.logging.config.file=off-logging.properties`
4. Save the configuration and restart Timeline Server. If Timeline Server starts normally and `.out` log files no longer contain `com.sun.jersey` entries, the fix is working.
Web UI and RESTful API issues
What do I do if the error "User [dr.who] is not authorized to view the logs for application" appears?
Access control list (ACL) rules are checked when accessing the NodeManager Log page. If ACL rules are enabled, a remote user must meet one of these conditions to view application logs:
The remote user is the admin user.
The remote user is the owner of the application.
The remote user meets the ACL rules customized for the application.
Fix: Verify that the remote user meets one of these conditions.
What do I do if "HTTP ERROR 401 Authentication required" or "HTTP ERROR 403 Unauthenticated users are not authorized to access this page" appears?
YARN uses simple authentication and does not allow anonymous access by default. See Authentication for Hadoop HTTP web-consoles.
Fixes:
- Method 1: Append a `user.name=***` parameter to the URL to specify the remote user.
- Method 2: In the Configuration Filter section on the Configure tab of the HDFS service page in the EMR console, search for `hadoop.http.authentication.simple.anonymous.allowed` and set it to `true` to allow anonymous access. Then restart the HDFS service. For details, see Restart a service.
Why is the display value of TotalVcore incorrect?
In the cluster or metrics RESTful API section of the YARN web UI (upper-right corner), the TotalVcore value may be wrong. This is a computing logic bug in Apache Hadoop versions earlier than 2.9.2. See YARN-8443.
The issue is fixed in EMR V3.37.x, EMR V5.3.x, and their later minor versions.
What do I do if application information on the Tez web UI is incomplete?
Open the Developer tools of your browser and check which requests are failing:
- Request to `http://<rmAddress>/ws/v1/cluster/apps/APPID` fails: ResourceManager has cleared the application data. By default, ResourceManager retains information for a maximum of 1,000 applications. Applications beyond this limit are cleared in the order in which they started.
- Request to `http://<tsAddress>/ws/v1/applicationhistory/...` returns error code 500 (application not found): The application information either failed to be stored or was cleared by the Timeline store.
  - Check `yarn.resourcemanager.system-metrics-publisher.enabled` to determine whether storing failed.
  - Check the time to live (TTL) of LevelDB to determine whether the data was cleared.
- Request to `http://<tsAddress>/ws/v1/timeline/...` returns code 200 but shows `NotFound`: Check the AM syslog initialization output. Normal initialization looks like this:

  [INFO] [main] |history.HistoryEventHandler|: Initializing HistoryEventHandler with recoveryEnabled=true, historyServiceClassName=org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService
  [INFO] [main] |ats.ATSHistoryLoggingService|: Initializing ATSHistoryLoggingService with maxEventsPerBatch=5, maxPollingTime(ms)=10, waitTimeForShutdown(ms)=-1, TimelineACLManagerClass=org.apache.tez.dag.history.ats.acls.ATSHistoryACLPolicyManager

  If the following warning appears instead, `yarn.timeline-service.enabled` is set to `false` for the running AM:

  [WARN] [main] |ats.ATSHistoryLoggingService|: org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService is disabled due to Timeline Service being disabled, yarn.timeline-service.enabled set to false

  A possible cause is a FlowAgent issue: a Hive job implemented through FlowAgent (using a Hive command or a Beeline command) has `yarn.timeline-service.enabled` set to `false` by default.
Why is a Spark Thrift JDBC/ODBC Server task running on the YARN web UI?
If you selected Spark when creating the cluster, a default Spark Thrift Server service starts automatically. This service occupies one YARN driver resource. By default, tasks running in Spark Thrift Server request resources from YARN through this driver.
Timeline Server issues
Can the yarn.timeline-service.leveldb-timeline-store.path parameter be set to an OSS bucket URL?
No. The yarn.timeline-service.leveldb-timeline-store.path parameter cannot be set to an Object Storage Service (OSS) bucket URL.
In a Hadoop cluster, the default value of yarn.timeline-service.leveldb-timeline-store.path equals the value of hadoop.tmp.dir. Do not modify hadoop.tmp.dir, because changes affect yarn.timeline-service.leveldb-timeline-store.path.
What do I do if the connection to Timeline Server times out or the CPU load or memory usage is extremely high?
With a large number of Apache Tez jobs, writing data to Timeline Server can time out because the Timeline Server process consumes excessive CPU resources and the CPU load on its node reaches the upper limit. This can affect job development and non-core services such as report generation.
As an immediate measure, stop the Timeline Server process to reduce the node's CPU load. Then apply these configuration changes:
Step 1: Set the Tez event flush timeout.
On the tez-site.xml tab on the Configure tab of the Tez service page in the EMR console, add a configuration item:
- Name: `tez.yarn.ats.event.flush.timeout.millis`
- Value: `60000`
This sets the timeout for a Tez job writing events to Timeline Server.
Step 2: Optimize Timeline Server storage.
On the yarn-site.xml tab on the Configure tab of the YARN service page in the EMR console, add or modify these configuration items:
| Parameter | Value | Description |
|---|---|---|
| yarn.timeline-service.store-class | org.apache.hadoop.yarn.server.timeline.RollingLevelDBTimelineStore | Event storage class for Timeline Server. |
| yarn.timeline-service.rolling-period | daily | Event rolling period. |
| yarn.timeline-service.leveldb-timeline-store.read-cache-size | 4194304 | Read cache size for the LevelDB store. |
| yarn.timeline-service.leveldb-timeline-store.write-buffer-size | 4194304 | Write buffer size for the LevelDB store. |
| yarn.timeline-service.leveldb-timeline-store.max-open-files | 500 | Maximum number of open files in the LevelDB store. |
After making these changes, restart Timeline Server on the Status tab of the YARN service page. For details on managing configurations, see Manage configuration items.