Hive - E-MapReduce - Alibaba Cloud Documentation Center

Hive is a data warehouse framework based on Hadoop that supports extract, transform, and load (ETL) operations and metadata management in big data scenarios.

Hive components

Name	Description
HiveServer2	A HiveQL query server that receives SQL requests from a Java Database Connectivity (JDBC) client over the Thrift or HTTP protocol. It supports concurrent access from multiple clients and identity authentication.
Hive MetaStore	The metadata management component. It stores metadata for objects such as databases and tables, and serves that metadata to other engines. For example, both Spark and Presto use this component for metadata management.
Hive Client	The Hive client. It submits SQL jobs and converts them into MapReduce, Tez, or Spark jobs based on the configured execution engine. This component is installed on all nodes of an EMR cluster.
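
As a rough illustration of how a JDBC client might address HiveServer2 over the two transports described above, the following Python sketch assembles a connection URL. The helper name hiveserver2_jdbc_url and the host name master-1-1 are hypothetical. The ports 10000 (Thrift binary) and 10001 (HTTP) and the HTTP path cliservice are the Apache Hive defaults; they are assumed here and may differ on a given EMR cluster. The optional hive.execution.engine setting shows one way a session could select the execution engine (mr, tez, or spark).

```python
def hiveserver2_jdbc_url(host, port=10000, database="default",
                         http_path=None, execution_engine=None):
    """Build a HiveServer2 JDBC URL (illustrative helper, not an EMR API).

    General shape: jdbc:hive2://<host>:<port>/<db>;<session vars>?<hive conf>
    """
    url = f"jdbc:hive2://{host}:{port}/{database}"

    if http_path is not None:
        # HTTP transport: assumes HiveServer2 runs in HTTP transport mode.
        url += f";transportMode=http;httpPath={http_path}"

    if execution_engine is not None:
        # Engines named in the component table: MapReduce, Tez, Spark.
        if execution_engine not in ("mr", "tez", "spark"):
            raise ValueError(f"unsupported engine: {execution_engine}")
        url += f"?hive.execution.engine={execution_engine}"

    return url


if __name__ == "__main__":
    # Hypothetical host name; replace with the address of your HiveServer2 node.
    print(hiveserver2_jdbc_url("master-1-1"))
    print(hiveserver2_jdbc_url("master-1-1", port=10001, http_path="cliservice"))
    print(hiveserver2_jdbc_url("master-1-1", execution_engine="tez"))
```

The resulting string can be passed to any JDBC-capable client, such as Beeline with its -u option. Whether a session-level engine override takes effect depends on the cluster configuration.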

Feature enhancements

For more information about the compatibility between EMR, Hadoop, and Hive versions, see Release versions. The following tables describe the features that are enhanced for Hive in different EMR versions.

EMR 5.x series

EMR version	Component version	Feature enhancement
EMR-5.20.0	Hive 3.1.3	Optimized the performance of adding fields to partitioned tables.
EMR-5.17.4	Hive 3.1.3	Supports the deployment of Master-Extend node groups.
EMR-5.12.1	Hive 3.1.3	By default, OSS-HDFS is used to store data in Hive warehouse files.
EMR-5.9.0	Hive 3.1.3	Kerberos authentication is supported.
EMR-5.8.0	Hive 3.1.2	LDAP authentication can be enabled with one click.
EMR-5.6.0	Hive 3.1.2	The following issue is fixed: After speculative execution is enabled for Hive on Tez, both the original task and the speculative task are committed.
EMR-5.5.0	Hive 3.1.2	The issue about batch deletion that occurs on Hive Jindo is fixed. The out of memory (OOM) issue that occurs on HiveServer2 is fixed. Hive on Spark is optimized. Hive is adapted to JindoSDK.
EMR-5.4.0	Hive 3.1.2	In JindoFS in block storage mode, the metadata of multiple Hive tables can be optimized at the same time. By default, this feature is disabled.
EMR-5.3.0	Hive 3.1.2	In JindoFS in block storage mode, the metadata of multiple Hive tables can be optimized at the same time.
EMR-5.2.1	Hive 3.1.2	The issue that the output of the `show create table` command based on Data Lake Formation (DLF) metadata is inaccurate is fixed. The default parameters of Hive are optimized to improve the performance of Hive jobs. In the EMR console, the parameter names on the hive-env tab of the Configure tab for the Hive service are changed to uppercase, which makes the parameters easier to use. The issue that user-defined functions (UDFs) cause a memory leak in HiveServer2 is fixed. The error message that is reported because of the incompatibility between the file system and Hive metastore when you write data to a Hive table is optimized.

EMR 3.x series

EMR version	Component version	Feature enhancement
EMR-3.51.4	Hive 2.3.9	Supports the deployment of Master-Extend node groups.
EMR-3.46.1	Hive 2.3.9	By default, OSS-HDFS is used to store data in Hive warehouse files.
EMR-3.40.0	Hive 2.3.8	The following issue is fixed: After speculative execution is enabled for Hive on Tez, both the original task and the speculative task are committed. The following issue is fixed: User-defined functions (UDFs) can be called only after you reload the functions.
EMR-3.39.1	Hive 2.3.8	Hive is adapted to JindoSDK.
EMR-3.36.1	Hive 2.3.8	Hive is updated to 2.3.8. The issue that the output of the `show create table` command based on Data Lake Formation (DLF) metadata is inaccurate is fixed. The default parameters of Hive are optimized to improve the performance of Hive jobs. In the EMR console, the parameter names on the hive-env tab of the Configure tab for the Hive service are changed to uppercase. This facilitates the use of the parameters. The error message that is reported because of the incompatibility between the file system and Hive metastore when you write data to a Hive table is optimized.
EMR-3.35.0	Hive 2.3.7	Fixed community-reported issues related to fetch tasks.
EMR-3.34.0	Hive 2.3.7	Some default configurations are optimized. Performance is optimized. The cost-based optimization (CBO) feature is enhanced. LDAP authentication can be enabled or disabled with a click. Calcite is updated to 1.12.0. The hive.security.authorization.sqlstd.confwhitelist.append parameter is added.
EMR-3.33.0	Hive 2.3.7	Hive is updated to 2.3.7. Metadata from Alibaba Cloud Data Lake Formation (DLF) in an HCatalog table is supported. Hive metadata and job running information can be sent to DataWorks.
EMR-3.32.0	Hive 2.3.5	The connection leak issue of the HiveServer connection pool is fixed. The data collection feature of JindoTable can be enabled or disabled. The performance of `ADD COLUMN` is optimized. The issue that causes data read from Hudi tables to be invalid is fixed. The default configurations can be adjusted based on the sizes of cluster nodes.
EMR-3.30.0	Hive 2.3.5	Metadata from Alibaba Cloud DLF is supported. The issue caused when you read an empty Delta table directory and write data into a dummy file is fixed. HAS dependencies are updated to 2.0.1.
EMR-3.29.0	Hive 2.3.5	Hive is updated to 2.3.5.6.0. A third-party metastore is supported. The datalake metastore-client parameter is added.
EMR-3.28.0	Hive 2.3.5	Supports Delta Lake 0.6.0.
EMR-3.27.2	Hive 2.3.5	The magic committer in an HCatalog table is supported. Some outdated default configurations are removed.
EMR-3.26.3	Hive 2.3.5	HCatalog tables support the direct committer.
EMR-3.25.0	Hive 2.3.5	Fixed an issue where MapReduce jobs failed in automatic LOCAL mode.
EMR-3.24.0	Hive 2.3.5	SQL statement compatibility can be checked. Hive 2.3.5 and Hadoop 2.8.5 are released as a combination. When Hive is restarted, the content in hiveserver2-site.xml is not synchronized to hive-site.xml in the spark-conf folder. The MSCK command can be used to add incremental directories. The bug triggered by the reuse of a Tez container in Hive is fixed. The MSCK command can be used to optimize column directories.
EMR-3.23.0	Hive 2.3.5	Removed Hive hooks configured in earlier versions of Hive. Supports multiple COUNT(DISTINCT) expressions when the hive.groupby.skewindata optimization is enabled. Fixed the issue of data loss when joining tables with different bucket versions.
Versions earlier than EMR-3.23.0	Hive 2.x	The external unified database is saved to the Hive metastore. All clusters that use the external Hive metastore share the same metadata.

EMR 4.x series

EMR version	Component version	Feature enhancement
EMR-4.10.0	Hive 3.1.2	The issue that garbled characters are displayed when Hue is used to query historical records is fixed. The UI display exception that occurs when you use Hue together with Oozie is fixed. The issue that YARN Job Browser sometimes cannot present or terminate jobs is fixed. YARN Job Browser is accessible by default. The Presto protocol is supported by default.
EMR-4.8.0	Hive 3.1.2	Some default configurations are optimized. Performance is optimized. The cost-based optimization (CBO) feature is enhanced. LDAP authentication can be enabled or disabled with a click.
EMR-4.6.0	Hive 3.1.2	Metadata from Alibaba Cloud Data Lake Formation (DLF) in an HCatalog table is supported. Hive metadata and job running information can be sent to DataWorks.
EMR-4.5.0	Hive 3.1.2	Metadata stored in Alibaba Cloud DLF is supported. Ownership-related permissions of Ranger are supported.
EMR-4.4.1	Hive 3.1.2	Optimized the default parameter configurations.
EMR-4.4.0	Hive 3.1.2	Hive is updated to 3.1.2. JindoFS is optimized. Metastore consistency check (MSCK) is optimized. The Jindo Job Committer in an HCatalog table is supported. HAS dependencies are updated.
EMR-4.3.0	Hive 3.1.1	Supports custom deployments.

Hive syntax

To ensure a consistent user experience, EMR retains the syntax of open source components as much as possible. EMR Hive is fully compatible with the syntax of Apache Hive.

For more information about Apache Hive, visit the Apache Hive official website.

References

For more information about connecting to Hive with a Hive client, see Hive connection methods.
For more information about identity authentication for the Hive service, see Use Kerberos authentication and Use LDAP authentication.
For information about accessing data lake data using Hive, see Use Hive to access Delta Lake and Hudi data.
For more information about common optimization methods for Hive jobs, see Hive job optimization.
For information about how to troubleshoot common issues with Hive jobs, see Troubleshoot exceptions for Hive jobs.