This topic describes how to troubleshoot issues related to Hive jobs.
Troubleshooting
If a performance exception occurs when running a job on the Hive client, follow these steps to locate the exception:
- Check the Hive client logs.
  - Client logs for jobs submitted via Hive CLI are located on the cluster or gateway node at /tmp/hive/$USER/hive.log or /tmp/$USER/hive.log.
  - Logs for jobs submitted through Hive Beeline or JDBC can be found in the HiveServer service logs, typically in the /var/log/emr/hive or /mnt/disk1/log/hive directory.
- Check the YARN application logs for Hive jobs. Retrieve the logs using the yarn command.
  yarn logs -applicationId application_xxx_xxx -appOwner userName
Memory-related issues
An out-of-memory (OOM) error occurs due to insufficient container memory
Error message: java.lang.OutOfMemoryError: GC overhead limit exceeded or java.lang.OutOfMemoryError: Java heap space.
Solution: Increase the memory of the container. For Hive jobs running on MapReduce (MR), also increase the Java virtual machine (JVM) heap memory.
- For Hive on MR: On the YARN service configuration page, click the mapred-site.xml tab and increase the memory for mappers and reducers.
  mapreduce.map.memory.mb=4096
  mapreduce.reduce.memory.mb=4096
  At the same time, set the -Xmx value in the JVM parameters mapreduce.map.java.opts and mapreduce.reduce.java.opts to 80% of the mapreduce.map.memory.mb and mapreduce.reduce.memory.mb settings.
  mapreduce.map.java.opts=-Xmx3276m (other parameters remain unchanged)
  mapreduce.reduce.java.opts=-Xmx3276m (other parameters remain unchanged)
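As a quick sanity check, the 80% rule above can be computed directly. A minimal shell sketch (the 4096 MB value mirrors the example settings above):

```shell
# Derive an -Xmx value as 80% of the YARN container memory (in MB).
container_mb=4096
xmx_mb=$((container_mb * 80 / 100))
echo "-Xmx${xmx_mb}m"   # prints -Xmx3276m
```

This matches the example java.opts values above; substitute your own container memory setting.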
- For Hive on Tez:
  - To increase the Tez container memory, on the Hive service configuration page, click the hive-site.xml tab.
    hive.tez.container.size=4096
  - To increase the Tez ApplicationMaster (AM) memory, on the Tez service configuration page, click the tez-site.xml tab.
    tez.am.resource.memory.mb=4096
- For Hive on Spark: Increase the Spark executor memory in the spark-defaults.conf file of the Spark service.
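For example, the executor memory can be raised in spark-defaults.conf. The value below is illustrative, not a recommendation:

```properties
# Illustrative: give each Spark executor a 4 GB heap
spark.executor.memory=4g
```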
The container is killed by YARN due to excessive memory usage
Error message: Container killed by YARN for exceeding memory limits.
Cause: The memory used by a Hive task exceeds the amount requested from YARN. This memory includes the JVM heap, JVM off-heap memory, and memory used by child processes. For example, if the heap size of the Map Task JVM process of a Hive job running on MR is 4 GB (mapreduce.map.java.opts=-Xmx4g) but the memory requested from YARN is 3 GB (mapreduce.map.memory.mb=3072), the YARN NodeManager kills the container.
Solution:
- For Hive on MR jobs, increase the values of the mapreduce.map.memory.mb and mapreduce.reduce.memory.mb parameters. Make sure they are at least 1.25 times the -Xmx values specified in the JVM parameters mapreduce.map.java.opts and mapreduce.reduce.java.opts.
- For Hive on Spark jobs, increase the spark.executor.memoryOverhead parameter to at least 25% of the spark.executor.memory parameter value.
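For example, with a 4 GB executor heap, an overhead of 1024 MB satisfies the 25% guideline. These values are illustrative:

```properties
# Illustrative: memoryOverhead (in MB by default) >= 25% of executor memory
spark.executor.memory=4g
spark.executor.memoryOverhead=1024
```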
An OOM error occurs because the sort buffer size is excessively large
- Error message:
  Error running child: java.lang.OutOfMemoryError: Java heap space
  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:986)
- Cause: The sort buffer size exceeds the memory available to the Hive task's container. For example, the container memory size is 1300 MB, but the sort buffer size is 1024 MB.
- Solution: Increase the container memory size or decrease the sort buffer size.
  tez.runtime.io.sort.mb (Hive on Tez)
  mapreduce.task.io.sort.mb (Hive on MR)
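For example, the sort buffer can be lowered at the session level. The 512 MB value is illustrative:

```sql
-- Hive on Tez (illustrative value)
set tez.runtime.io.sort.mb=512;
-- Hive on MR (illustrative value)
set mapreduce.task.io.sort.mb=512;
```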
An OOM error occurs due to GroupBy statements
An OOM error occurs when Snappy files are read
- Cause: The format of standard Snappy files written by services such as Log Service differs from that of Hadoop Snappy files. By default, EMR processes Hadoop Snappy files, which leads to an OOM error when standard Snappy files are processed.
- Solution: Configure the following parameter for the Hive job.
  set io.compression.codec.snappy.native=true;
Metadata-related errors
The operation for dropping a large partitioned table timed out
Jobs fail due to dynamic partitions in INSERT OVERWRITE
- Error message: When you perform an insert overwrite operation on dynamic partitions, or execute a job that involves a similar operation, the error Exception when loading xxx in table is reported, and the following error message appears in the HiveServer logs.
  Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Directory oss://xxxx could not be cleaned up.;
- Cause: The metadata is inconsistent with the data. The metadata contains information about a partition, but the partition directory cannot be found in the data storage system, which causes an error during the cleanup operation.
- Solution: Troubleshoot the metadata issue and rerun the job.
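If the stale partition is known, one way to remove the inconsistent metadata before rerunning the job is to drop that partition explicitly. The table and partition names below are hypothetical:

```sql
-- Hypothetical names: drop the partition whose directory is missing from storage
alter table test_tbl drop if exists partition (pt='20240101');
```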
The error "java.lang.IllegalArgumentException: java.net.UnknownHostException: emr-header-1.xxx" occurs when a Hive job reads or deletes a table
- Cause: When the EMR cluster uses DLF unified metadata or a unified meta database (an old feature), the initial path of a created database is the HDFS path of the current EMR cluster (for example, hdfs://master-1-1.xxx:9000/user/hive/warehouse/test.db or hdfs://emr-header-1.cluster-xxx:9000/user/hive/warehouse/test.db). The path of a Hive table inherits the path of the database and therefore also uses the HDFS path of the current cluster (for example, hdfs://master-1-1.xxx:9000/user/hive/warehouse/test.db/test_tbl). If you use Hive in a cluster in the new EMR console to read data from or write data to a Hive table or database created by a cluster in the old EMR console, the new cluster may fail to connect to the old cluster. In addition, if the old cluster has been released, the error "java.net.UnknownHostException" is returned.
- Solution:
  - Solution 1: If the data in the Hive table is temporary or test data, modify the location of the Hive table to an OSS path and then run the drop table or drop database command again.
    -- Hive SQL
    alter table test_tbl set location 'oss://bucket/not/exists';
    drop table test_tbl;
    alter table test_pt_tbl partition (pt=xxx) set location 'oss://bucket/not/exists';
    alter table test_pt_tbl drop partition (pt=xxx);
    alter database test_db set location 'oss://bucket/not/exists';
    drop database test_db;
  - Solution 2: If the data in the Hive table of the old EMR cluster is valid but inaccessible from the new cluster, and the data is stored in HDFS, migrate the data to OSS and create a new table.
    hadoop fs -cp hdfs://emr-header-1.xxx/old/path oss://bucket/new/path
    hive -e "create table new_tbl like old_tbl location 'oss://bucket/new/path'"
Issues related to Hive UDFs and third-party packages
A conflict occurs due to third-party packages that are placed in the Hive lib directory
- Cause: Placing third-party packages in the Hive lib directory ($HIVE_HOME/lib), or replacing Hive packages in that directory, often causes conflicts. Avoid such operations.
- Solution: Remove the third-party packages from $HIVE_HOME/lib and restore the original Hive JAR packages.
Hive fails to use the reflect function
Jobs run slowly due to custom UDFs
- Cause: If a job runs slowly but no error logs are returned, the issue may be due to low performance of custom Hive UDFs.
- Solution: Identify the performance bottleneck from the thread dump of a Hive task and optimize the custom Hive UDFs.
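To obtain a thread dump, a minimal sketch assuming the JDK tools are available on the NodeManager host (<pid> is the task JVM process ID):

```shell
# List task JVMs (MR tasks run as YarnChild, Tez tasks as TezChild), then dump threads.
jps -lm | grep -Ei 'yarnchild|tezchild'
jstack <pid> > /tmp/hive-task-threads.txt
```

If the same UDF stack frames recur across several dumps, that code path is the likely bottleneck.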
Issues related to the grouping() function
- Symptom: The following error message is returned when you use the grouping() function.
  grouping() requires at least 2 argument, got 1
  This error indicates that an exception occurred during parameter parsing for the grouping() function.
- Cause: The issue stems from a known bug in the open-source version of Hive, which is case-sensitive when parsing the grouping() function. Using the lowercase grouping() may cause Hive to misidentify the function, resulting in an error during parameter parsing.
- Solution: Change the grouping() function in the SQL statement to uppercase GROUPING().
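For example, on affected versions the lowercase call fails while the uppercase call parses correctly. The table and column names are hypothetical:

```sql
-- Hypothetical schema: lowercase grouping() fails on affected versions
-- SELECT category, grouping(category) FROM sales GROUP BY category WITH ROLLUP;
-- Uppercase works:
SELECT category, GROUPING(category) FROM sales GROUP BY category WITH ROLLUP;
```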
Issues related to engine compatibility
The execution result differs because the time zone of Hive is different from that of Spark
- Symptom: Hive uses UTC while Spark uses the local time zone, so the same query can return different results on the two engines.
- Solution: Change Spark's time zone to UTC. Run the following statement in Spark SQL:
  set spark.sql.session.timeZone=UTC;
  Alternatively, add the following setting to the Spark configuration file:
  spark.sql.session.timeZone=UTC
Issues related to Hive versions
Hive jobs run slowly on Spark because dynamic resource allocation is enabled (known defect)
- Cause: Open-source Hive has a defect: when you connect to Spark by using Beeline with the spark.dynamicAllocation.enabled parameter set to true, the shuffle partition count becomes 1.
- Solution: Disable dynamic resource allocation for Hive jobs running on Spark, or run the Hive jobs on Tez.
  spark.dynamicAllocation.enabled=false
Tez throws an exception when the hive.optimize.dynamic.partition.hashjoin parameter is set to true (known defect)
MapJoinOperator throws the NullPointerException exception (known defect)
Hive throws the IllegalStateException exception when a Hive job runs on Tez (known defect)
- Error message:
java.lang.RuntimeException: java.lang.IllegalStateException: Was expecting dummy store operator but found: FS[17]
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
- Cause: Open-source Hive has a defect. This issue occurs when the tez.am.container.reuse.enabled parameter is set to true.
- Solution: Set the tez.am.container.reuse.enabled parameter to false for Hive jobs.
set tez.am.container.reuse.enabled=false;
Other issues
The result of SELECT COUNT(1) is 0
- Cause: The select count(1) statement uses the table's statistics, which may be inaccurate.
- Solution: Set the hive.compute.query.using.stats parameter to false.
  hive.compute.query.using.stats=false
  Alternatively, recompute the table's statistics by using the analyze command.
  analyze table <table_name> compute statistics;
An error occurs when a Hive job is submitted on a self-managed Elastic Compute Service (ECS) instance
Hive jobs submitted from a self-managed ECS instance may fail. Submit Hive jobs from an EMR gateway cluster or by using EMR CLI instead. For more information, see Use EMR CLI to deploy a gateway environment.
An exception occurs on a job due to data skew
- Symptom:
  - Shuffle data exhausts the disk space.
  - Some tasks take an excessively long time to complete.
  - OOM errors occur in some tasks or containers.
- Solution:
  - Enable skew join optimization in Hive.
    set hive.optimize.skewjoin=true;
  - Increase the concurrency of mappers and reducers.
  - Increase the container memory. For details, see An out-of-memory (OOM) error occurs due to insufficient container memory.
The error "Too many counters: 121 max=120" occurs
- Symptom: An error is reported when a job runs on the Tez or MR engine using Hive SQL.
- Cause: The number of counters in the job exceeds the default maximum limit.
- Solution: On the Configuration tab of the YARN service in the EMR console, search for the mapreduce.job.counters.max parameter and increase its value. Then resubmit the Hive job. If you use Beeline or JDBC to submit the job, restart the HiveServer service first.
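For example, the limit can be raised in the YARN configuration. The value below is illustrative and should exceed the number of counters the job actually uses:

```properties
# Illustrative: the default limit is 120; raise it above the job's counter count
mapreduce.job.counters.max=240
```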