To prevent errors when you run E-MapReduce (EMR) nodes in DataWorks, make sure that
the key configurations of the associated EMR data lake cluster meet the requirements.
For example, you must configure Lightweight Directory Access Protocol (LDAP), a Ranger
whitelist, and a security policy so that the EMR data lake cluster can authenticate
the account that you use to run EMR nodes in DataWorks. This topic describes how to
configure these key items for an EMR data lake cluster.
Limits
The EMR data lake cluster must run V3.41.0 or a later minor version, or V5.7.0 or
a later minor version. If the cluster runs a minor version earlier than V3.41.0 or
V5.7.0, specific DataWorks features cannot be used.
Configure an EMR data lake cluster
- Optional: Enable LDAP.
If you want to associate the EMR data lake cluster as an EMR compute engine instance
with a DataWorks workspace in security mode and enable user authentication, you must
enable LDAP for the EMR data lake cluster.
- Add the required properties of DataWorks to the Hive property whitelist on the Ranger
service page of the EMR data lake cluster.
If you integrate Hive with Ranger in EMR, you must add the required properties of
DataWorks to the Hive property whitelist on the Ranger service page of the EMR data
lake cluster and restart Hive before you develop EMR Hive nodes in DataWorks. Otherwise,
the error message "Cannot modify spark.yarn.queue at runtime" or
"Cannot modify SKYNET_BIZDATE at runtime" is returned when you run EMR Hive nodes.
- Add a custom whitelist parameter to the Hive component.
Add a custom parameter that consists of a key and a value. The following sample shows
a custom parameter that is configured for the Hive component in an EMR data lake
cluster:
hive.security.authorization.sqlstd.confwhitelist.append=tez.*|spark.*|mapred.*|mapreduce.*|ALISA.*|SKYNET.*
Note: In the preceding code, ALISA.* and SKYNET.* are supported only for DataWorks.
- Restart the Hive service.
After the whitelist is configured, you must restart the Hive service for the configuration
to take effect.
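To check whether a given property name is covered by the appended whitelist value, the pattern can be tested offline. The following Python snippet is a minimal sketch, not part of EMR or DataWorks. It assumes that Hive treats the whitelist value as a regular expression that must match the whole property name, and it checks only the appended value, not any whitelist that Hive applies by default. The function name is_covered and the property name custom.setting are hypothetical and used only for illustration.

```python
import re

# Value copied from the sample custom parameter in this topic.
WHITELIST_APPEND = "tez.*|spark.*|mapred.*|mapreduce.*|ALISA.*|SKYNET.*"


def is_covered(property_name: str, pattern: str = WHITELIST_APPEND) -> bool:
    """Return True if the whole property name matches the appended pattern.

    Assumption: the entire name must match the regular expression.
    Hive's built-in default whitelist is not considered here.
    """
    return re.fullmatch(pattern, property_name) is not None


if __name__ == "__main__":
    # The first two names appear in the error messages quoted in this topic.
    # "custom.setting" is a hypothetical name that no alternative matches.
    for name in ("spark.yarn.queue", "SKYNET_BIZDATE", "custom.setting"):
        print(name, is_covered(name))
```

If a property that DataWorks sets at run time is not covered, extending the value with another alternative separated by a vertical bar follows the same format as the sample parameter.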
- Change the default priority based on which an EMR node is run in the yarn-site.xml file.
If you want to change the priority of an EMR node that is run in DataWorks, you must
add the configuration item yarn.cluster.max-application-priority to the yarn-site.xml
file for the EMR cluster in the EMR console and specify a priority higher than the
default value 0. Otherwise, the priority that you specified for the EMR node in DataWorks
does not take effect.
Note: After the change, you must restart the YARN service for the configuration to take
effect.
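For reference, the configuration item described above corresponds to a standard Hadoop-style property entry in yarn-site.xml. The following Python snippet is a minimal sketch that renders such an entry so you can compare it with what the EMR console writes. The helper render_property is hypothetical, and the value 10 is only an illustrative choice; this topic requires only a value higher than the default value 0.

```python
import xml.etree.ElementTree as ET


def render_property(name: str, value: str) -> str:
    """Build a Hadoop-style <property> element and return it as text."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    ET.indent(prop, space="  ")  # pretty-print; requires Python 3.9 or later
    return ET.tostring(prop, encoding="unicode")


# Assumption: 10 is an example value, not a recommendation from this topic.
YARN_PRIORITY_PROPERTY = render_property(
    "yarn.cluster.max-application-priority", "10"
)

if __name__ == "__main__":
    print(YARN_PRIORITY_PROPERTY)
```

The snippet does not modify the cluster. The property still has to be added in the EMR console, followed by a restart of the YARN service.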