This topic describes how to troubleshoot issues related to Hive jobs.
Troubleshooting
If a performance exception occurs when running a job on the Hive client, follow these steps to locate the exception:
- Check the Hive client logs.
  - Client logs for jobs submitted via Hive CLI are located on the cluster or gateway node at /tmp/hive/$USER/hive.log or /tmp/$USER/hive.log.
  - Logs for jobs submitted through Hive Beeline or JDBC can be found in the HiveServer service logs, typically in the /var/log/emr/hive or /mnt/disk1/log/hive directory.
- Check the YARN application logs for Hive jobs. Retrieve the logs using the yarn command.
  yarn logs -applicationId application_xxx_xxx -appOwner userName
Memory-related issues
An out-of-memory (OOM) error occurs due to insufficient container memory
Error message: java.lang.OutOfMemoryError: GC overhead limit exceeded or java.lang.OutOfMemoryError: Java heap space.
Solution: Increase the memory of the container. For Hive jobs running on MapReduce (MR), also increase the Java virtual machine (JVM) heap memory.
- For Hive on MR: On the YARN service configuration page, click the mapred-site.xml tab and increase the memory for mappers and reducers.
  mapreduce.map.memory.mb=4096
  mapreduce.reduce.memory.mb=4096
  At the same time, set the -Xmx value in the JVM parameters mapreduce.map.java.opts and mapreduce.reduce.java.opts to 80% of the mapreduce.map.memory.mb and mapreduce.reduce.memory.mb settings.
  mapreduce.map.java.opts=-Xmx3276m (other parameters remain unchanged)
  mapreduce.reduce.java.opts=-Xmx3276m (other parameters remain unchanged)
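As a quick sanity check, the 80% rule above can be computed directly. A minimal shell sketch (the 4096 MB value mirrors the example settings above):

```shell
# Derive an -Xmx value as 80% of the YARN container memory (in MB).
container_mb=4096
xmx_mb=$((container_mb * 80 / 100))
echo "-Xmx${xmx_mb}m"   # prints -Xmx3276m
```

This matches the example java.opts values above; substitute your own container memory setting.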
- For Hive on Tez:
  - To increase the Tez container memory, on the Hive service configuration page, click the hive-site.xml tab.
    hive.tez.container.size=4096
  - To increase the Tez ApplicationMaster (AM) memory, on the Tez service configuration page, click the tez-site.xml tab.
    tez.am.resource.memory.mb=4096
- For Hive on Spark: Increase the Spark executor memory in the spark-defaults.conf file of the Spark service.
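For example, the executor memory can be raised in spark-defaults.conf. The value below is illustrative, not a recommendation:

```properties
# Illustrative: give each Spark executor a 4 GB heap
spark.executor.memory=4g
```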
The container is killed by YARN due to excessive memory usage
Error message: Container killed by YARN for exceeding memory limits.
Cause: The memory used by a Hive task exceeds the amount requested from YARN. This memory includes the JVM heap, JVM off-heap memory, and memory used by child processes. For example, if the heap size of the Map Task JVM process of a Hive job running on MR is 4 GB (mapreduce.map.java.opts=-Xmx4g) but the memory requested from YARN is 3 GB (mapreduce.map.memory.mb=3072), the YARN NodeManager kills the container.
Solution:
- For Hive on MR jobs, increase the values of the mapreduce.map.memory.mb and mapreduce.reduce.memory.mb parameters. Make sure they are at least 1.25 times the -Xmx values specified in the JVM parameters mapreduce.map.java.opts and mapreduce.reduce.java.opts.
- For Hive on Spark jobs, increase the spark.executor.memoryOverhead parameter to at least 25% of the spark.executor.memory parameter value.
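For example, with a 4 GB executor heap, an overhead of 1024 MB satisfies the 25% guideline. These values are illustrative:

```properties
# Illustrative: memoryOverhead (in MB by default) >= 25% of executor memory
spark.executor.memory=4g
spark.executor.memoryOverhead=1024
```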
An OOM error occurs because the sort buffer size is excessively large
- Error message:
  Error running child: java.lang.OutOfMemoryError: Java heap space
  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:986)
- Cause: The sort buffer size exceeds the memory available to the Hive task's container. For example, the container memory size is 1300 MB, but the sort buffer size is 1024 MB.
- Solution: Increase the container memory size or decrease the sort buffer size.
  tez.runtime.io.sort.mb (Hive on Tez)
  mapreduce.task.io.sort.mb (Hive on MR)
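For example, the sort buffer can be lowered at the session level. The 512 MB value is illustrative:

```sql
-- Hive on Tez (illustrative value)
set tez.runtime.io.sort.mb=512;
-- Hive on MR (illustrative value)
set mapreduce.task.io.sort.mb=512;
```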
An OOM error occurs due to GroupBy statements
An OOM error occurs when Snappy files are read
- Cause: The format of standard Snappy files written by services such as Log Service differs from that of Hadoop Snappy files. By default, EMR processes Hadoop Snappy files, which leads to an OOM error when standard Snappy files are processed.
- Solution: Configure the following parameter for the Hive job.
  set io.compression.codec.snappy.native=true;
Metadata-related errors
The operation for dropping a large partitioned table timed out
Jobs fail due to dynamic partitions in INSERT OVERWRITE
- Error message: When you perform an insert overwrite operation on dynamic partitions, or execute a job that involves a similar operation, the error Exception when loading xxx in table is reported, and the following error message appears in the HiveServer logs.
  Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Directory oss://xxxx could not be cleaned up.;
- Cause: The metadata is inconsistent with the data. The metadata contains information about a partition, but the partition directory cannot be found in the data storage system, which causes an error during the cleanup operation.
- Solution: Troubleshoot the metadata issue and rerun the job.
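If the stale partition is known, one way to remove the inconsistent metadata before rerunning the job is to drop that partition explicitly. The table and partition names below are hypothetical:

```sql
-- Hypothetical names: drop the partition whose directory is missing from storage
alter table test_tbl drop if exists partition (pt='20240101');
```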
The error "java.lang.IllegalArgumentException: java.net.UnknownHostException: emr-header-1.xxx" occurs when a Hive job reads or deletes a table
- Cause: When the EMR cluster uses DLF unified metadata or a unified meta database (an old feature), the initial path of a created database is the HDFS path of the current EMR cluster (for example, hdfs://master-1-1.xxx:9000/user/hive/warehouse/test.db or hdfs://emr-header-1.cluster-xxx:9000/user/hive/warehouse/test.db). The path of a Hive table inherits the path of the database and therefore also uses the HDFS path of the current cluster (for example, hdfs://master-1-1.xxx:9000/user/hive/warehouse/test.db/test_tbl). If you use Hive in a cluster in the new EMR console to read data from or write data to a Hive table or database created by a cluster in the old EMR console, the new cluster may fail to connect to the old cluster. In addition, if the old cluster has been released, the error "java.net.UnknownHostException" is returned.
- Solution:
  - Solution 1: If the data in the Hive table is temporary or test data, modify the location of the Hive table to an OSS path and then run the drop table or drop database command again.
    -- Hive SQL
    alter table test_tbl set location 'oss://bucket/not/exists';
    drop table test_tbl;
    alter table test_pt_tbl partition (pt=xxx) set location 'oss://bucket/not/exists';
    alter table test_pt_tbl drop partition (pt=xxx);
    alter database test_db set location 'oss://bucket/not/exists';
    drop database test_db;
  - Solution 2: If the data in the Hive table of the old EMR cluster is valid but inaccessible from the new cluster, and the data is stored in HDFS, migrate the data to OSS and create a new table.
    hadoop fs -cp hdfs://emr-header-1.xxx/old/path oss://bucket/new/path
    hive -e "create table new_tbl like old_tbl location 'oss://bucket/new/path'"
Issues related to Hive UDFs and third-party packages
A conflict occurs due to third-party packages that are placed in the Hive lib directory
- Cause: Placing third-party packages in the Hive lib directory ($HIVE_HOME/lib), or replacing Hive packages in that directory, often causes conflicts. Avoid such operations.
- Solution: Remove the third-party packages from $HIVE_HOME/lib and restore the original Hive JAR packages.
Hive fails to use the reflect function
Jobs run slowly due to custom UDFs
- Cause: If a job runs slowly but no error logs are returned, the issue may be due to low performance of custom Hive UDFs.
- Solution: Identify the performance bottleneck from the thread dump of a Hive task and optimize the custom Hive UDFs.
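To obtain a thread dump, a minimal sketch assuming the JDK tools are available on the NodeManager host (<pid> is the task JVM process ID):

```shell
# List task JVMs (MR tasks run as YarnChild, Tez tasks as TezChild), then dump threads.
jps -lm | grep -Ei 'yarnchild|tezchild'
jstack <pid> > /tmp/hive-task-threads.txt
```

If the same UDF stack frames recur across several dumps, that code path is the likely bottleneck.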
Issues related to the grouping() function
- Symptom: The following error message is returned when you use the grouping() function.
  grouping() requires at least 2 argument, got 1
  This error indicates that an exception occurred during parameter parsing for the grouping() function.
- Cause: The issue stems from a known bug in the open-source version of Hive, which is case-sensitive when parsing the grouping() function. Using the lowercase grouping() may cause Hive to misidentify the function, resulting in an error during parameter parsing.
- Solution: Change the grouping() function in the SQL statement to uppercase GROUPING().
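For example, on affected versions the lowercase call fails while the uppercase call parses correctly. The table and column names are hypothetical:

```sql
-- Hypothetical schema: lowercase grouping() fails on affected versions
-- SELECT category, grouping(category) FROM sales GROUP BY category WITH ROLLUP;
-- Uppercase works:
SELECT category, GROUPING(category) FROM sales GROUP BY category WITH ROLLUP;
```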
Issues related to engine compatibility
The execution result differs because the time zone of Hive is different from that of Spark
- Symptom: Hive uses UTC while Spark uses the local time zone, so the same query can return different results on the two engines.
- Solution: Change Spark's time zone to UTC. Run the following statement in Spark SQL:
  set spark.sql.session.timeZone=UTC;
  Alternatively, add the following setting to the Spark configuration file:
  spark.sql.session.timeZone=UTC
Issues related to Hive versions
Hive jobs run slowly on Spark because dynamic resource allocation is enabled (known defect)
- Cause: Open-source Hive has a defect: when you connect to Spark by using Beeline with the spark.dynamicAllocation.enabled parameter set to true, the shuffle partition count becomes 1.
- Solution: Disable dynamic resource allocation for Hive jobs running on Spark, or run the Hive jobs on Tez.
  spark.dynamicAllocation.enabled=false
Tez throws an exception when the hive.optimize.dynamic.partition.hashjoin parameter is set to true (known defect)
MapJoinOperator throws the NullPointerException exception (known defect)
Hive throws the IllegalStateException exception when a Hive job runs on Tez (known defect)
- Error message:
java.lang.RuntimeException: java.lang.IllegalStateException: Was expecting dummy store operator but found: FS[17]
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
- Cause: Open-source Hive has a defect. This issue occurs when the tez.am.container.reuse.enabled parameter is set to true.
- Solution: Set the tez.am.container.reuse.enabled parameter to false for Hive jobs.
set tez.am.container.reuse.enabled=false;
Other issues
The result of SELECT COUNT(1) is 0
- Cause: The select count(1) statement uses the table's statistics, which may be inaccurate.
- Solution: Set the hive.compute.query.using.stats parameter to false.
  hive.compute.query.using.stats=false
  Alternatively, recompute the table's statistics by using the analyze command.
  analyze table <table_name> compute statistics;
An error occurs when a Hive job is submitted on a self-managed Elastic Compute Service (ECS) instance
Hive jobs submitted from a self-managed ECS instance may fail. Submit Hive jobs from an EMR gateway cluster or by using EMR CLI instead. For more information, see Use EMR CLI to deploy a gateway environment.
An exception occurs on a job due to data skew
- Symptom:
  - Shuffle data exhausts the disk space.
  - Some tasks take an excessively long time to complete.
  - OOM errors occur in some tasks or containers.
- Solution:
  - Enable skew join optimization in Hive.
    set hive.optimize.skewjoin=true;
  - Increase the concurrency of mappers and reducers.
  - Increase the container memory. For details, see An out-of-memory (OOM) error occurs due to insufficient container memory.
The error "Too many counters: 121 max=120" occurs
- Symptom: An error is reported when a job runs on the Tez or MR engine using Hive SQL.
- Cause: The number of counters in the job exceeds the default maximum limit.
- Solution: On the Configuration tab of the YARN service in the EMR console, search for the mapreduce.job.counters.max parameter and increase its value. Then resubmit the Hive job. If you use Beeline or JDBC to submit the job, restart the HiveServer service first.
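For example, the limit can be raised in the YARN configuration. The value below is illustrative and should exceed the number of counters the job actually uses:

```properties
# Illustrative: the default limit is 120; raise it above the job's counter count
mapreduce.job.counters.max=240
```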