In E-MapReduce (EMR) V3.43.0, V5.9.0, and later minor versions, you can create a high-security cluster in which open source components are started in Kerberos security mode. In this mode, only authenticated clients can access a cluster service, such as Hadoop Distributed File System (HDFS).
Background information
After you enable Kerberos authentication for a cluster, the following benefits are provided:
Clients: Only trusted clients that pass authentication can submit jobs. Malicious users cannot impersonate other users to access the cluster or submit jobs.
Servers: All services in the cluster are trusted and use keys to authenticate each other when they communicate. This prevents service impersonation attacks.
Kerberos authentication improves cluster security but makes the cluster more complex to use. Take note of the following items:
After Kerberos authentication is enabled, you must add Kerberos authentication content to the jobs that you submit.
Before you enable Kerberos authentication, you must understand how Kerberos works and how to use it effectively.
After you enable Kerberos authentication for a cluster, services in the cluster must authenticate each other before they communicate. This authentication takes time, so a job takes longer to run than the same job in a cluster of the same specifications for which Kerberos authentication is disabled.
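One common way to give a job its Kerberos credentials is to obtain a ticket from a keytab file before the job is submitted. The following Python sketch wraps the standard MIT Kerberos command-line tools kinit and klist. It is an illustration under stated assumptions, not the procedure that EMR prescribes: the principal and keytab path are hypothetical placeholders, and the helper names are introduced here for illustration only. For the supported procedure, see Basic operations on Kerberos.

```python
import subprocess
from typing import List


def build_kinit_command(principal: str, keytab_path: str) -> List[str]:
    """Build the kinit call that obtains a TGT from a keytab (no password prompt)."""
    # -k: use a keytab, -t: path of the keytab file
    return ["kinit", "-kt", keytab_path, principal]


def has_valid_ticket() -> bool:
    """Return True if `klist -s` finds a valid, unexpired credentials cache."""
    try:
        return subprocess.run(["klist", "-s"]).returncode == 0
    except FileNotFoundError:
        # The Kerberos client tools are not installed on this machine.
        return False


def ensure_ticket(principal: str, keytab_path: str) -> None:
    """Obtain a ticket only if the current credentials cache has none."""
    if not has_valid_ticket():
        subprocess.run(build_kinit_command(principal, keytab_path), check=True)


# Hypothetical usage, run before the job is submitted:
#   ensure_ticket("alice@EXAMPLE.COM", "/path/to/alice.keytab")
```

Checking the credentials cache first avoids a round trip to the KDC when a valid ticket already exists, which matters for scripts that submit many jobs in sequence.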
Kerberos authentication
Kerberos is an identity authentication protocol based on symmetric-key cryptography that supports single sign-on (SSO). After a client is authenticated, the client can access multiple services, such as HBase and HDFS.
Kerberos authentication involves the following objects:
Key Distribution Center (KDC): a Kerberos server.
Client: a client (principal) that needs to access a service. The KDC and the service authenticate the identity of the client.
Service: a service of a cluster that is integrated with Kerberos, such as HDFS, YARN, and HBase.
The following figure shows the Kerberos authentication process.
The Kerberos authentication process is divided into two stages:
Stage 1: KDC-based identity authentication
Before a client can access a service in a cluster that is integrated with Kerberos, the client must be authenticated by the KDC.
After the client is authenticated, the client obtains a Ticket Granting Ticket (TGT). The client then uses the TGT to request tickets for the services in the cluster that it wants to access.
Stage 2: Service-based identity authentication
After the client obtains a TGT, the client can request access to individual services in the cluster.
The client sends the TGT and the name of the service that it wants to access, such as HDFS, to the KDC to obtain a Service Granting Ticket (SGT). The client then presents the SGT to the service. The service uses the information in the SGT to authenticate the client. After the client is authenticated, the client can access the service.
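The two stages can be illustrated with a toy model. The following Python sketch is not real Kerberos and does not use any Kerberos library. It makes several simplifying assumptions: tickets are plain dictionaries protected by an HMAC tag instead of being encrypted, the client sends its password to the KDC directly (real Kerberos never does this), and all class and function names (ToyKDC, ToyService, and so on) are hypothetical. It only shows which party issues and which party verifies each ticket.

```python
import hashlib
import hmac
import json
import os
import time


def _seal(key: bytes, payload: dict) -> dict:
    """Attach an HMAC tag so that only a holder of `key` can issue a valid ticket."""
    body = json.dumps(payload, sort_keys=True).encode()
    return {"body": payload, "tag": hmac.new(key, body, hashlib.sha256).hexdigest()}


def _verify(key: bytes, ticket: dict) -> dict:
    """Check the tag and the expiry time, then return the ticket contents."""
    body = json.dumps(ticket["body"], sort_keys=True).encode()
    expected = hmac.new(key, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, ticket["tag"]):
        raise PermissionError("ticket was not issued with the expected key")
    if ticket["body"]["expires"] < time.time():
        raise PermissionError("ticket has expired")
    return ticket["body"]


class ToyKDC:
    def __init__(self) -> None:
        self._client_keys = {}           # principal -> key derived from password
        self._service_keys = {}          # service name -> key shared with the service
        self._tgt_key = os.urandom(32)   # known only to the KDC

    def register_client(self, principal: str, password: str) -> None:
        self._client_keys[principal] = hashlib.sha256(password.encode()).digest()

    def register_service(self, name: str) -> bytes:
        self._service_keys[name] = os.urandom(32)
        return self._service_keys[name]

    def authenticate(self, principal: str, password: str) -> dict:
        """Stage 1: the KDC authenticates the client and issues a TGT."""
        key = hashlib.sha256(password.encode()).digest()
        stored = self._client_keys.get(principal)
        if stored is None or not hmac.compare_digest(stored, key):
            raise PermissionError("unknown principal or wrong password")
        return _seal(self._tgt_key, {"client": principal, "expires": time.time() + 600})

    def grant_service_ticket(self, tgt: dict, service: str) -> dict:
        """Stage 2, first half: exchange a TGT and a service name for an SGT."""
        claims = _verify(self._tgt_key, tgt)
        if service not in self._service_keys:
            raise PermissionError("unknown service")
        payload = {"client": claims["client"], "service": service,
                   "expires": time.time() + 300}
        return _seal(self._service_keys[service], payload)


class ToyService:
    def __init__(self, name: str, key: bytes) -> None:
        self.name, self._key = name, key

    def access(self, sgt: dict) -> str:
        """Stage 2, second half: the service verifies the SGT by itself."""
        claims = _verify(self._key, sgt)
        if claims.get("service") != self.name:
            raise PermissionError("ticket was issued for another service")
        return claims["client"]


kdc = ToyKDC()
kdc.register_client("alice", "secret")
hdfs = ToyService("hdfs", kdc.register_service("hdfs"))

tgt = kdc.authenticate("alice", "secret")     # stage 1
sgt = kdc.grant_service_ticket(tgt, "hdfs")   # stage 2: request the SGT
print(hdfs.access(sgt))                       # stage 2: present the SGT
```

In this model, the service never contacts the KDC when it checks an SGT. It only needs the key that it shares with the KDC. A TGT that is presented directly to the service is rejected, because the TGT is sealed with a key that only the KDC holds.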
Enable Kerberos authentication
When you create a cluster, you can turn on Kerberos Authentication in the Advanced Settings section of the Software Configuration step.
In EMR V3.43.1, EMR V5.9.1, and later minor versions, you can also specify a KDC source when you turn on Kerberos Authentication.
The following KDC sources are supported:
Self-managed KDC: the KDC created by the cluster. This is the default value.
External KDC: a KDC that is deployed outside the cluster. If you select this option, you must configure the parameters related to the external KDC. For more information, see Connect to an external KDC.
References
For information about how to use Kerberos, see Basic operations on Kerberos.
For information about how to connect to an external KDC, see Connect to an external KDC.