By default, EMR-HOOK is integrated with the Hive service that is deployed in an E-MapReduce (EMR) cluster. EMR-HOOK collects SQL information about jobs, such as data lineage and access frequency. Based on metadata that is managed in Data Lake Formation (DLF), you can use EMR-HOOK to collect the access frequency of tables or partitions. You can also use DataWorks to manage data lineage. This topic describes how to configure EMR-HOOK for Hive.
Prerequisites
A DataLake or custom cluster is created and the Hive service is selected when you create the cluster. For more information, see Create a cluster.
Limits
You cannot use EMR-HOOK to collect the SQL information of jobs in a gateway that is deployed by using EMR-CLI.
In a minor version earlier than EMR V5.16.0 or EMR V3.50.0, the settings of the hive.exec.post.hooks parameter that is configured for Hive and the spark.sql.queryExecutionListeners parameter that is configured for Spark cannot be synchronized to a gateway. In EMR V5.16.0, EMR V3.50.0, or a later minor version, the settings of these parameters can be synchronized to a gateway, and the hive_aux_jars_path_gateway_only parameter is introduced. You can configure the hive_aux_jars_path_gateway_only parameter so that the gateway independently uses a JAR file with a custom extension to enhance functionality.
Precautions
EMR-HOOK is enabled by default in a minor version earlier than EMR V5.14.0 or EMR V3.48.0.
EMR-HOOK is disabled by default in EMR V5.14.0, EMR V3.48.0, or a minor version later than EMR V5.14.0 or EMR V3.48.0. If you want to use EMR-HOOK, you must manually enable EMR-HOOK.
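The version thresholds in the Limits and Precautions sections can be sketched as a small check. This is a minimal illustration, not an EMR API: the function names and the accepted version string formats are assumptions, and only the V3.x and V5.x lines named in this topic are covered. The sample version strings in the loop are illustrative inputs only.

```python
import re

# Thresholds copied from the Limits and Precautions sections, keyed by major version.
HOOK_DISABLED_BY_DEFAULT_SINCE = {5: (5, 14, 0), 3: (3, 48, 0)}
GATEWAY_SYNC_SINCE = {5: (5, 16, 0), 3: (3, 50, 0)}


def parse_version(version):
    """Extract (major, minor, patch) from strings such as 'EMR V5.14.0' or '3.48.0'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"no x.y.z version found in {version!r}")
    return tuple(int(part) for part in match.groups())


def _reaches_threshold(version, thresholds):
    """Return True if the version is at or above the threshold of its own major line."""
    parsed = parse_version(version)
    threshold = thresholds.get(parsed[0])
    if threshold is None:
        raise ValueError(f"only the V3.x and V5.x lines are covered, got {version!r}")
    return parsed >= threshold


def hook_enabled_by_default(version):
    """EMR-HOOK is enabled by default only before V5.14.0 / V3.48.0."""
    return not _reaches_threshold(version, HOOK_DISABLED_BY_DEFAULT_SINCE)


def gateway_sync_supported(version):
    """Hook settings can be synchronized to a gateway from V5.16.0 / V3.50.0 onward."""
    return _reaches_threshold(version, GATEWAY_SYNC_SINCE)


if __name__ == "__main__":
    # Hypothetical version strings, used only to exercise the checks.
    for v in ("EMR V5.13.1", "EMR V5.14.0", "EMR V3.50.0"):
        print(v, hook_enabled_by_default(v), gateway_sync_supported(v))
```

Versions are compared as integer tuples rather than as strings, so that, for example, 5.9.0 sorts before 5.14.0.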
Procedure
Go to the Services tab.
Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.
In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
On the EMR on ECS page, find the desired cluster and click Services in the Actions column.
Configure EMR-HOOK.
On the Services tab, find the Hive service and click Configure.
On the Configure tab, modify or add the following EMR-HOOK-related configuration items on specific subtabs.
| Subtab | Parameter | Description |
| --- | --- | --- |
| hive-site.xml | hive.exec.post.hooks | Listens to the SQL information of Hive, including data lineage and access frequency. If EMR-HOOK is enabled, set this parameter to com.aliyun.emr.meta.hive.hook.LineageLoggerHook. If EMR-HOOK is disabled, leave this parameter empty. |
| hive-site.xml | dlf.emrhook.webtracking | Specifies whether to enable access frequency reporting. Valid values: true and false. |
| hivemetastore-site.xml | hive.metastore.event.listeners | Listens to the event information about metadata changes in Hive, including data lineage. If EMR-HOOK is enabled, set this parameter to com.aliyun.emr.meta.hive.listener.MetaStoreListener. If EMR-HOOK is disabled, leave this parameter empty. |
| hivemetastore-site.xml | hive.metastore.pre.event.listeners | Listens to the event information before a metadata change in Hive, including data lineage. If EMR-HOOK is enabled, set this parameter to com.aliyun.emr.meta.hive.listener.MetaStorePreAuditListener. If EMR-HOOK is disabled, leave this parameter empty. |
Note: If EMR-HOOK is disabled, the Data Overview tab of a specific table in the DLF console no longer displays the data in the following columns: File Visits within Last Day, File Visits within Last Seven Days, and File Visits within Last 30 Days.
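For reference, the parameter values described above can be collected per subtab and rendered in the standard Hadoop property layout (configuration, property, name, value). The following is a minimal sketch: the helper names are hypothetical, the report_access_frequency flag is an illustrative assumption, and in the EMR console you enter these keys and values on the Configure tab instead of editing XML files by hand.

```python
from xml.sax.saxutils import escape

# Class names as listed in the parameter descriptions of this topic.
LINEAGE_HOOK = "com.aliyun.emr.meta.hive.hook.LineageLoggerHook"
METASTORE_LISTENER = "com.aliyun.emr.meta.hive.listener.MetaStoreListener"
METASTORE_PRE_LISTENER = "com.aliyun.emr.meta.hive.listener.MetaStorePreAuditListener"


def hive_site_settings(enabled, report_access_frequency=True):
    """Values for the hive-site.xml subtab; the hook is left empty when disabled."""
    return {
        "hive.exec.post.hooks": LINEAGE_HOOK if enabled else "",
        "dlf.emrhook.webtracking": "true" if report_access_frequency else "false",
    }


def hivemetastore_site_settings(enabled):
    """Values for the hivemetastore-site.xml subtab; listeners are empty when disabled."""
    return {
        "hive.metastore.event.listeners": METASTORE_LISTENER if enabled else "",
        "hive.metastore.pre.event.listeners": METASTORE_PRE_LISTENER if enabled else "",
    }


def render_properties(settings):
    """Render a dict of settings in the Hadoop-style XML property layout."""
    lines = ["<configuration>"]
    for name, value in settings.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(value)}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_properties(hive_site_settings(enabled=True)))
    print(render_properties(hivemetastore_site_settings(enabled=True)))
```

Calling the two settings helpers with enabled=False yields empty values for the hook and listener parameters, which matches the instruction to leave them empty when EMR-HOOK is disabled.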
Save the configurations.
On the Configure tab, click Save.
In the dialog box that appears, configure the Execution Reason parameter and click Save.
Restart Hive.
In the upper-right corner of the Configure tab, choose More > Restart.
In the dialog box that appears, configure the Execution Reason parameter and click OK.
In the Confirm message, click OK.
References
For information about how to configure EMR-HOOK for Spark, see Use the Spark SQL extension feature to record data lineage and historical access information.