By default, EMR-HOOK is integrated with Spark 2 and Spark 3 deployed in E-MapReduce (EMR) clusters. EMR-HOOK collects SQL information about jobs, such as data lineage and access frequency. You can use EMR-HOOK to collect the access frequency of tables or partitions based on metadata managed in Data Lake Formation (DLF), and to manage data lineage in DataWorks. This topic describes how to configure EMR-HOOK for Spark.
Prerequisites
A DataLake or custom cluster is created and the Spark service is selected when you create the cluster. For more information, see Create a cluster.
Limits
You cannot use EMR-HOOK to collect the SQL information of jobs in a gateway that is deployed by using EMR-CLI.
In minor versions earlier than EMR V5.16.0 or EMR V3.50.0, the settings of the hive.exec.post.hooks parameter for Hive and the spark.sql.queryExecutionListeners parameter for Spark cannot be synchronized to a gateway. In EMR V5.16.0, EMR V3.50.0, and later minor versions, the settings of these parameters can be synchronized to a gateway, and the hive_aux_jars_path_gateway_only parameter is introduced. You can configure the hive_aux_jars_path_gateway_only parameter to use a custom extension JAR file on the gateway independently to enhance functionality.
Precautions
EMR-HOOK is enabled by default in minor versions earlier than EMR V5.14.0 or EMR V3.48.0.
EMR-HOOK is disabled by default in EMR V5.14.0, EMR V3.48.0, and later minor versions. To use EMR-HOOK in these versions, you must enable it manually.
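The version thresholds in the Limits and Precautions sections can be expressed as a small lookup. The following is a minimal sketch, not an official tool. It assumes that version strings look like "EMR-5.14.0" or "3.48.0" and that "minor version" means a release within the same major line (3.x or 5.x). The function names are hypothetical.

```python
# Thresholds taken from the Limits and Precautions sections, keyed by major version.
# Assumption: "minor version" means a release within the same major line (3.x or 5.x).
DEFAULT_OFF_SINCE = {5: (5, 14, 0), 3: (3, 48, 0)}
GATEWAY_SYNC_SINCE = {5: (5, 16, 0), 3: (3, 50, 0)}


def parse_version(text):
    """Parse an assumed version string such as 'EMR-5.14.0', 'EMR V3.48.0', or '3.48.0'."""
    digits = text.upper().replace("EMR", "").lstrip("-V ").strip()
    parts = tuple(int(p) for p in digits.split("."))
    if len(parts) != 3:
        raise ValueError(f"expected major.minor.patch, got {text!r}")
    return parts


def _at_least(version, thresholds):
    """Return True if the version is at or above the threshold for its major line."""
    major = version[0]
    if major not in thresholds:
        raise ValueError(f"no threshold known for major version {major}")
    return version >= thresholds[major]


def enabled_by_default(text):
    """EMR-HOOK is on by default only before V5.14.0 / V3.48.0."""
    return not _at_least(parse_version(text), DEFAULT_OFF_SINCE)


def syncs_to_gateway(text):
    """Hook settings can be synchronized to a gateway from V5.16.0 / V3.50.0 onward."""
    return _at_least(parse_version(text), GATEWAY_SYNC_SINCE)
```

For example, under these assumptions enabled_by_default("EMR-5.13.0") returns True, while enabled_by_default("EMR-5.14.0") returns False, which means that EMR-HOOK must be enabled manually on that version.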
Procedure
Go to the Services tab.
Log on to the EMR console. In the left-side navigation pane, click EMR on ECS.
In the top navigation bar, select the region where your cluster resides and select a resource group based on your business requirements.
On the EMR on ECS page, find the desired cluster and click Services in the Actions column.
Configure EMR-HOOK.
On the Services tab, find the Spark 2 or Spark 3 service and click Configure.
On the Configure tab, modify or add the following EMR-HOOK-related configuration items on specific subtabs.
Subtab: spark-defaults.conf
Parameter: spark.sql.queryExecutionListeners
Description: Listens for the SQL information of Spark, including the data lineage and access frequency. If EMR-HOOK is enabled, set this parameter to com.aliyun.emr.meta.spark.listener.EMRQueryLogger. If EMR-HOOK is disabled, leave this parameter empty.

Subtab: hive-site.xml
Parameter: dlf.emrhook.webtracking
Description: Specifies whether to enable access frequency reporting. Valid values: true and false.

Note: If EMR-HOOK is disabled, the Data Overview tab of a specific table in the DLF console no longer displays the data in the following columns: File Visits within Last Day, File Visits within Last Seven Days, and File Visits within Last 30 Days.
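To make the two settings concrete, the following sketch renders what the resulting entries might look like in the two configuration files. It is illustrative only: the EMR console manages these files, and the helper names are hypothetical. It assumes the usual whitespace-separated key and value format for spark-defaults.conf and the Hadoop-style property element for hive-site.xml.

```python
import xml.etree.ElementTree as ET

# Listener class name as given in the parameter description above.
LISTENER_CLASS = "com.aliyun.emr.meta.spark.listener.EMRQueryLogger"


def spark_defaults_line(hook_enabled):
    """Return the spark-defaults.conf entry; the value is empty when EMR-HOOK is off."""
    value = LISTENER_CLASS if hook_enabled else ""
    return f"spark.sql.queryExecutionListeners {value}".rstrip()


def hive_site_property(web_tracking):
    """Return a Hadoop-style <property> element for dlf.emrhook.webtracking."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "dlf.emrhook.webtracking"
    ET.SubElement(prop, "value").text = "true" if web_tracking else "false"
    return ET.tostring(prop, encoding="unicode")


if __name__ == "__main__":
    print(spark_defaults_line(True))
    print(hive_site_property(True))
```

The two settings are kept as separate arguments because the parameter descriptions treat the listener and access frequency reporting as independent switches.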
Save the configurations.
On the Configure tab, click Save.
In the dialog box that appears, configure the Execution Reason parameter and click Save.
Restart Spark.
In the upper-right corner of the Configure tab, choose More > Restart.
In the dialog box that appears, configure the Execution Reason parameter and click OK.
In the Confirm message, click OK.
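After the restart, you may want to confirm that the listener setting took effect. The following sketch checks the text of a spark-defaults.conf file for the EMR-HOOK listener. It assumes that you can read the rendered file content from a cluster node (this topic does not state the file location) and that the parameter value may be a comma-separated list of listener class names. The function names are hypothetical.

```python
import re

LISTENER_KEY = "spark.sql.queryExecutionListeners"
LISTENER_CLASS = "com.aliyun.emr.meta.spark.listener.EMRQueryLogger"


def parse_spark_defaults(text):
    """Parse 'key value' or 'key=value' lines, skipping blank lines and # comments."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split once on the first '=' or run of whitespace that follows the key.
        parts = re.split(r"\s*=\s*|\s+", line, maxsplit=1)
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf


def hook_listener_enabled(text):
    """Return True if the EMR-HOOK listener appears in the configured listener list."""
    listeners = parse_spark_defaults(text).get(LISTENER_KEY, "")
    return LISTENER_CLASS in [name.strip() for name in listeners.split(",")]
```

A result of False for the content read from the cluster suggests that the parameter is empty or missing, which matches the disabled state described in the parameter description.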
References
For information about how to configure EMR-HOOK for Hive, see Use the Hive extension feature to record data lineage and historical access information.