You can use a Hive connector to query and analyze data in a Hive data warehouse.
Background information
A Hive data warehouse system consists of the following parts:
Data files in various formats. In most cases, the data files are stored in Hadoop Distributed File System (HDFS) or an object storage system, such as Alibaba Cloud Object Storage Service (OSS).
Metadata about how the data files are mapped to schemas and tables. The metadata is stored in a database such as a MySQL database. You can access the metadata by using a Hive metastore.
A query language called HiveQL. This query language is executed on a distributed computing framework, such as MapReduce or Tez.
This topic describes the prerequisites, limits, configurations, and supported features of Hive connectors.
Prerequisites
A data lake cluster or Hadoop cluster is created, and the Presto service is selected. For more information, see Create a cluster.
Limits
To configure a Hive connector, you must first configure a Hive metastore.
Hive connectors support a variety of distributed storage systems, such as HDFS, Alibaba Cloud OSS, and OSS-compatible systems. You can use a Hive connector to query data from these storage systems. Make sure that the coordinator node and all the worker nodes of Presto can access the Hive metastore that you configured and the distributed storage system whose data you want to query. By default, the Hive metastore is accessed over the Thrift protocol on port 9083.
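The following is a minimal sketch of a catalog configuration file for a Hive connector. The metastore host is a placeholder that you must replace with your own address, and the connector name depends on your Presto version:

```properties
# Minimal sketch of a Hive catalog file, for example, etc/catalog/hive.properties.
# The connector name may be hive or hive-hadoop2, depending on your Presto version.
connector.name=hive
# Replace <metastore-host> with the address of your Hive metastore (Thrift, port 9083 by default).
hive.metastore.uri=thrift://<metastore-host>:9083
```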
Modify the configurations of a Hive connector
This section describes how to modify the configurations of a Hive connector. For more information, see Configure connectors.
Default configurations of a Hive connector
Go to the Presto service page in the EMR console, click the Configure tab, and then click the hive.properties tab. The parameters described in the following table are displayed. Modify these parameters based on your business requirements.
| Parameter | Description |
| --- | --- |
| hive.recursive-directories | Specifies whether data can be read from the subdirectories of a table or partition location. This parameter is similar to the hive.mapred.supports.subdirectories property in Hive. |
| hive.metastore.uri | The uniform resource identifier (URI) that is used to access the Hive metastore based on the Thrift protocol. By default, this parameter is set to a value in the format of thrift://<host>:9083. |
| hive.config.resources | The HDFS configuration file. If you want to specify multiple configuration files, separate the file names with commas (,). Make sure that the configuration files exist on all the hosts where Presto is running. Important: Set this parameter only if you need to access HDFS. |
| hive.delta-table-enabled | Specifies whether Presto can read data from a Delta Lake table. Valid values: true and false. |
| hive.delta-compatible-mode-enabled | Specifies whether the compatibility mode is enabled for a Delta Lake table. Valid values: true and false. |
| hive.hdfs.impersonation.enabled | Specifies whether to enable user impersonation. Valid values: true and false. |
Configurations for multiple Hive clusters
If you have multiple Hive clusters, you can create multiple configuration files in the etc/catalog directory. Make sure that the file name extension is .properties.
For example, if you create a configuration file named sales.properties, Presto uses the connector configured in the file to create a catalog named sales.
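For reference, a sketch of such a file; the connector name and metastore address are placeholders for your own values:

```properties
# etc/catalog/sales.properties -- defines a catalog named sales
connector.name=hive
# Hypothetical metastore address of the second Hive cluster
hive.metastore.uri=thrift://sales-metastore:9083
```

Tables in this cluster can then be referenced in queries as sales.<schema>.<table>.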
HDFS configuration
In most cases, the HDFS client is automatically configured for Presto, and you do not need to specify a configuration file. However, in some special scenarios, such as when HDFS federation is enabled or NameNode is deployed in high-availability mode, you must add the hive.config.resources property to reference the required HDFS configuration files.
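The following sketch assumes that the Hadoop configuration files are located at the standard paths on every Presto node; adjust the paths to match your deployment:

```properties
# Reference the HDFS client configuration files (comma-separated)
hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
```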
We recommend that you specify configuration files only when necessary. Keep the number of configuration files and the number of properties that they contain to a minimum, which reduces the possibility of property incompatibilities.
Make sure that the configuration files exist on all the hosts where Presto is running. If you want to reference an existing Hadoop configuration file, make sure that the configuration file is copied to all the Presto nodes on which Hadoop is not running.
HDFS username and permissions
Before you execute the CREATE TABLE or CREATE TABLE AS statement in Presto to create a Hive table, check whether the account that you use to access HDFS from Presto has permissions to access the Hive warehouse directory. The warehouse directory is specified by the hive.metastore.warehouse.dir configuration variable in the hive-site.xml file. The default value is /user/hive/warehouse.
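For reference, the warehouse directory is defined in hive-site.xml as follows; the snippet shows the default value:

```xml
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
</property>
```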
Supported file types
Hive connectors support the file types that are described in the following table.
| File type | Remarks |
| --- | --- |
| ORC | N/A |
| Parquet | N/A |
| Avro | N/A |
| RCText | Refers to RCFile that uses the ColumnarSerDe serializer and deserializer (SerDe). |
| RCBinary | Refers to RCFile that uses the LazyBinaryColumnarSerDe SerDe. |
| SequenceFile | N/A |
| JSON | The org.apache.hive.hcatalog.data.JsonSerDe SerDe is used. |
| CSV | The org.apache.hadoop.hive.serde2.OpenCSVSerde SerDe is used. |
| TextFile | N/A |
Supported table types
Hive connectors support the table types that are described in the following table.
| Table type | Description |
| --- | --- |
| Atomicity, consistency, isolation, durability (ACID) table | If a Hive metastore of the 3.X version is used, you can use a Hive connector to read data from and write data to insert-only and ACID tables, with full support for partitioning and bucketing. You can perform row-level DELETE and UPDATE operations on ACID tables. However, you cannot perform UPDATE operations on partition key columns or bucket columns, and you cannot use Hive Streaming Ingest to create ACID tables. For more information, see Streaming Data Ingest. |
| Materialized view | You can use a Hive connector to read data from materialized views of Hive. In Presto, materialized views are presented as regular, read-only tables. |
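For example, with a Hive metastore of the 3.X version, you can perform row-level operations of the following form on an ACID table. The catalog, schema, table, and column names are hypothetical:

```sql
-- Requires a transactional (ACID) Hive table; names are hypothetical
DELETE FROM hive.sales_db.orders WHERE order_date < DATE '2021-01-01';
UPDATE hive.sales_db.orders SET status = 'closed' WHERE order_id = 1001;
```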
Hive views
Hive views are defined in HiveQL and stored in a Hive metastore.
A Hive connector supports Hive views in the following modes: Disabled, Legacy, and Experimental.
| Mode | Description |
| --- | --- |
| Disabled | In this mode, the business logic and data encoded in the views are invisible in Presto. Hive views are ignored by default. |
| Legacy | In this mode, simple Hive views can be read in Presto. To enable this mode, set the hive.translate-hive-views and hive.legacy-hive-view-translation catalog configuration properties to true. If you want to temporarily enable this traditional access mode for a specific catalog, set the legacy_hive_view_translation catalog session property to true. HiveQL is similar to SQL. In this mode, a HiveQL query that defines a view is interpreted as if it were written in SQL, and no translation is performed. This mode is suitable for simple Hive views but may cause problems for complex queries. For example, if a HiveQL function has the same signature as SQL but has different behavior, the returned results may differ. In extreme scenarios, the query may fail or may not be parsed and executed. |
| Experimental | In this mode, you can analyze, process, and rewrite Hive views, including the expressions and statements that the views contain. To enable this mode, set the hive.translate-hive-views catalog configuration property to true. Some HiveQL features are not supported in this mode, so complex views may still fail to translate. |
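For example, assuming a catalog named hive, the Legacy mode can be enabled for the current session only; the session property name is based on the hive.legacy-hive-view-translation property described later in this topic:

```sql
-- Temporarily enable legacy Hive view translation for the hive catalog
SET SESSION hive.legacy_hive_view_translation = true;
```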
Configuration properties
Hive configuration properties
Hive connectors support query acceleration based on JindoTable. Each EMR cluster has two built-in Hive connectors: hive.properties and hive-acc.properties. The native engine of JindoTable is built into the hive-acc.properties connector and accelerates queries of ORC or Parquet files. For more information about query acceleration based on JindoTable, see the related topic for your SmartData version. For example, if you use SmartData 3.6.X, see Enable query acceleration based on a native engine.
The following table describes the configuration properties of a Hive connector.
| Property | Description |
| --- | --- |
| hive.config.resources | The HDFS configuration file. If you want to specify multiple configuration files, separate the file names with commas (,). Make sure that the configuration files exist on all the hosts where Presto is running. Note: Configure this property only if you need to access HDFS. |
| hive.recursive-directories | Specifies whether data can be read from the subdirectories of a table or partition location. This property is similar to the hive.mapred.supports.subdirectories property in Hive. |
| hive.ignore-absent-partitions | Specifies whether to ignore a partition rather than report a query failure if the file system path specified for the partition does not exist. If a partition is ignored, some data in the table may be skipped. Default value: false. |
| hive.storage-format | The default file format that is used when you create a table. Default value: ORC. |
| hive.compression-codec | The compression codec that is used when you write data to files. Valid values: NONE, SNAPPY, LZ4, ZSTD, and GZIP. Default value: GZIP. |
| hive.force-local-scheduling | Specifies whether to forcefully schedule splits on the node on which the Hadoop DataNode processes the data of the splits. This is useful in deployments where Presto is collocated with each DataNode. Default value: false. |
| hive.respect-table-format | Specifies whether data in new partitions is written in the existing table format or in the default Presto format. Valid values: true (use the existing table format) and false (use the default Presto format). |
| hive.immutable-partitions | Specifies whether new data can be inserted into an existing partition. If you set this property to true, the hive.insert-existing-partitions-behavior property cannot be set to APPEND. Default value: false. |
| hive.insert-existing-partitions-behavior | The mode in which data is inserted into an existing partition. Valid values: APPEND (append data to the partition), OVERWRITE (overwrite the partition), and ERROR (fail the query). |
| hive.create-empty-bucket-files | Specifies whether to create an empty file for a bucket that stores no data. Valid values: true and false. |
| hive.max-partitions-per-writers | The maximum number of partitions per writer. Default value: 100. |
| hive.max-partitions-per-scan | The maximum number of partitions for a single table scan. Default value: 100000. |
| hive.hdfs.authentication.type | The HDFS authentication mode. Valid values: NONE and KERBEROS. |
| hive.hdfs.impersonation.enabled | Specifies whether to enable HDFS user impersonation. Valid values: true and false. |
| hive.hdfs.trino.principal | The Kerberos principal that is used when Presto connects to HDFS. |
| hive.hdfs.trino.keytab | The path of the keytab file for the HDFS client. |
| hive.dfs.replication | The HDFS replication factor. |
| hive.security | The security property. Default value: legacy. For more information, see Hive connector security configuration. |
| security.config-file | The path of the security configuration file. This property is required if you set the hive.security property to file. |
| hive.non-managed-table-writes-enabled | Specifies whether to enable data writes to unmanaged (external) Hive tables. Default value: false. |
| hive.non-managed-table-creates-enabled | Specifies whether to enable the creation of unmanaged (external) Hive tables. Default value: true. |
| hive.collect-column-statistics-on-write | Specifies whether to enable the automatic collection of column-level statistics during data writes. For more information, see Configuration properties. Default value: true. |
| hive.file-status-cache-tables | Specifies the tables whose file status is cached, for example, fruit.apple,fruit.orange. |
| hive.file-status-cache-size | The maximum number of cached file status entries. Default value: 1000000. |
| hive.file-status-cache-expire-time | The validity period of a cached directory listing. Default value: 1. Unit: minutes. |
| hive.rcfile.time-zone | The time zone to which binary-encoded timestamp values are adjusted. Default value: JVM default. Note: If you use Hive 3.1 or a later version, you must set this property to UTC. |
| hive.timestamp-precision | The precision of a Hive column of the TIMESTAMP type. Valid values: MILLISECONDS, MICROSECONDS, and NANOSECONDS. Default value: MILLISECONDS. Note: Values with higher precision are rounded. |
| hive.temporary-staging-directory-enabled | Specifies whether to use the temporary staging directory that is specified by the hive.temporary-staging-directory-path property for data writes. Default value: true. |
| hive.temporary-staging-directory-path | The path of the temporary staging directory that is used for data writes. Default value: /tmp/presto-${USER}. Note: You can use the ${USER} placeholder to specify a unique location for each user. |
| hive.translate-hive-views | Specifies whether to enable translation for Hive views. Default value: false. |
| hive.legacy-hive-view-translation | Specifies whether to use the legacy algorithm to translate Hive views. You can specify the legacy_hive_view_translation catalog session property to enable this behavior for a specific session. Default value: false. |
| hive.parallel-partitioned-bucketed-writes | Specifies whether to improve the parallelism of data writes to partitioned and bucketed tables. Default value: true. Note: If you set this property to false, the number of write threads cannot exceed the number of buckets. |
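For example, a sketch of write-related overrides that you might add to hive.properties; the values are illustrative, not recommendations:

```properties
# Illustrative overrides for write behavior
hive.storage-format=PARQUET
hive.compression-codec=SNAPPY
hive.insert-existing-partitions-behavior=OVERWRITE
```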
Configuration properties for ORC files
The following table describes the properties that are configured when you use a Hive connector to read data from or write data to ORC files.
| Property | Description |
| --- | --- |
| hive.orc.time-zone | The default time zone for an ORC file that is of an early ORC version and does not declare a time zone. Default value: JVM default. |
| hive.orc.use-column-names | Specifies whether to access the columns of an ORC file by name. By default, the columns in an ORC file are accessed based on their sequential positions in the Hive table definition. The equivalent catalog session property is orc_use_column_names. Default value: false. |
Configuration properties for Parquet files
The following table describes the properties that are configured when you use a Hive connector to read data from or write data to Parquet files.
| Property | Description |
| --- | --- |
| hive.parquet.time-zone | The time zone to which timestamp values are adjusted. Default value: JVM default. Note: If you use Hive 3.1 or a later version, you must set this property to UTC. |
| hive.parquet.use-column-names | Specifies whether to access the columns of a Parquet file by name. Valid values: true (access columns by name) and false (access columns based on their sequential positions in the Hive table definition). The equivalent catalog session property is parquet_use_column_names. |
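For example, to access columns by name in both formats, you can add the following sketch to hive.properties; by default, columns are accessed by position:

```properties
# Match columns by name instead of by position in the Hive table definition
hive.orc.use-column-names=true
hive.parquet.use-column-names=true
```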
Configuration properties for a Hive metastore
The following table describes the configuration properties for a Hive metastore. You can use dedicated properties to configure a Hive metastore that is connected based on the Thrift protocol. For more information, see Configuration properties for a Thrift-based Hive metastore.
| Property | Description |
| --- | --- |
| hive.metastore | The type of the Hive metastore that is used. Presto supports the default Thrift-based Hive metastore (thrift) and its derivatives. Default value: thrift. |
| hive.metastore-cache-ttl | The validity period of cached metastore data. Default value: 0. Unit: seconds. |
| hive.metastore-cache-maximum-size | The maximum number of cached metastore data objects. Default value: 10000. |
| hive.metastore-refresh-interval | The interval at which cached metastore data is asynchronously refreshed after a data access operation. Only unexpired data is refreshed. The refresh ensures that the latest data is obtained in subsequent accesses. |
| hive.metastore-refresh-max-threads | The maximum number of threads that are used to refresh cached metastore data. Default value: 10. |
| hive.metastore-timeout | The timeout period of a Hive metastore request. Default value: 10. Unit: seconds. |
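For example, a sketch that enables metastore caching; the durations are illustrative:

```properties
# Cache metastore data for 20 minutes and refresh cached entries 10 minutes after access
hive.metastore-cache-ttl=20m
hive.metastore-refresh-interval=10m
```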
Configuration properties for a Thrift-based Hive metastore
The following table describes the configuration properties for a Thrift-based Hive metastore of a Hive connector.
| Property | Description |
| --- | --- |
| hive.metastore.uri | The URI that is used to access the Hive metastore based on the Thrift protocol. If multiple URIs are configured, the first URI is used by default, and the Hive metastores that correspond to the other URIs are considered secondary metastores. This property is required. Example: thrift://<host>:9083. |
| hive.metastore.username | The username used by Presto to access the Hive metastore. |
| hive.metastore.authentication.type | The authentication mode of the Hive metastore. Valid values: NONE and KERBEROS. |
| hive.metastore.thrift.impersonation.enabled | Specifies whether to enable user impersonation for the Hive metastore. |
| hive.metastore.thrift.delegation-token.cache-ttl | The validity period of the delegation token cache for the Hive metastore. Default value: 1. Unit: hours. |
| hive.metastore.thrift.delegation-token.cache-maximum-size | The maximum size of the delegation token cache. Default value: 1000. |
| hive.metastore.thrift.client.ssl.enabled | Specifies whether to enable SSL for connections to the Hive metastore. Valid values: true and false. |
| hive.metastore.thrift.client.ssl.key | The path of the private key and client certificate in the key store. |
| hive.metastore.thrift.client.ssl.key-password | The password of the private key. |
| hive.metastore.thrift.client.ssl.trust-certificate | The path of the server certificate chain in the trust store. Note: This property is required when SSL is enabled. |
| hive.metastore.thrift.client.ssl.trust-certificate-password | The password of the server certificate chain. |
| hive.metastore.service.principal | The Kerberos principal of the Hive metastore. |
| hive.metastore.client.principal | The Kerberos principal that is used when Presto connects to the Hive metastore. |
| hive.metastore.client.keytab | The path of the keytab file on the Hive metastore client. |
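For example, a sketch of a Kerberos-secured connection to a Thrift-based Hive metastore; the principals and keytab path are placeholders that you must replace with your own values:

```properties
hive.metastore.authentication.type=KERBEROS
# Placeholder principals and keytab path
hive.metastore.service.principal=hive/_HOST@EXAMPLE.COM
hive.metastore.client.principal=presto@EXAMPLE.COM
hive.metastore.client.keytab=/etc/security/keytabs/presto.keytab
```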
Configuration properties for performance tuning
The following table describes the configuration properties for performance tuning of a Hive connector.
If you modify the default settings of the properties in the following table, the performance of the Hive connector may become unstable or deteriorate. Proceed with caution.
| Property | Description |
| --- | --- |
| hive.max-outstanding-splits | The maximum number of cached splits for each table scan in a query before the scheduler tries to pause. Default value: 1000. |
| hive.max-splits-per-second | The maximum number of splits generated per second in each table scan. This property can be used to reduce the load on the storage system. By default, no limit is specified, and Presto maximizes the data access parallelism. |
| hive.max-initial-splits | The maximum number of initial splits. For each table scan, the coordinator node first assigns the initial splits, each of which cannot exceed the value of hive.max-initial-split-size in size. After the coordinator node assigns the initial splits, the maximum size of each subsequent split is determined by the hive.max-split-size property. Default value: 200. |
| hive.max-initial-split-size | The maximum size of each split that is assigned to a worker node while the number of assigned splits does not exceed the value of the hive.max-initial-splits property. A smaller split size increases the data access parallelism, which accelerates small queries. Default value: 32. Unit: MB. |
| hive.max-split-size | The maximum size of a single split that is assigned to a worker node. A smaller split size increases the data access parallelism, which reduces latency but increases overheads and system load. Default value: 64. Unit: MB. |
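For example, if table scans overload the storage system, you can cap split generation; the value is illustrative:

```properties
# Throttle split generation to reduce the load on the storage system
hive.max-splits-per-second=500
```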
Table statistics
You can use a Hive connector to collect and manage table statistics. Then, you can improve the query performance based on the statistics.
When you use a Hive connector to write data, the connector collects basic information, such as the number of files, the number of rows, the size of raw data, and the total data size. The connector also collects the column-level statistics described in the following table.
| Column type | Collectable information |
| --- | --- |
| TINYINT | Number of null values, number of distinct values, and minimum and maximum values |
| SMALLINT | Number of null values, number of distinct values, and minimum and maximum values |
| INTEGER | Number of null values, number of distinct values, and minimum and maximum values |
| BIGINT | Number of null values, number of distinct values, and minimum and maximum values |
| DOUBLE | Number of null values, number of distinct values, and minimum and maximum values |
| REAL | Number of null values, number of distinct values, and minimum and maximum values |
| DECIMAL | Number of null values, number of distinct values, and minimum and maximum values |
| DATE | Number of null values, number of distinct values, and minimum and maximum values |
| TIMESTAMP | Number of null values, number of distinct values, and minimum and maximum values |
| VARCHAR | Number of null values and number of distinct values |
| CHAR | Number of null values and number of distinct values |
| VARBINARY | Number of null values |
| BOOLEAN | Number of null values and number of true or false values |
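You can also collect statistics for an existing table by executing the ANALYZE statement and inspect them with SHOW STATS. The catalog, schema, and table names in the following sketch are hypothetical:

```sql
-- Collect statistics for a table (names are hypothetical)
ANALYZE hive.sales_db.orders;
-- Inspect the collected statistics
SHOW STATS FOR hive.sales_db.orders;
```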