A DataLake cluster is a big data computing cluster that allows you to analyze data in a flexible, reliable, and efficient manner. You can create a DataLake cluster only in the new E-MapReduce (EMR) console. You can easily build a scalable data pipeline based on DataLake clusters. This topic describes the features supported by DataLake clusters and the differences between DataLake clusters and Hadoop clusters.
Features
Reliability
If you enable high availability for a DataLake cluster, three master nodes that are distributed across different underlying physical servers are deployed for the cluster. This reduces the risks caused by hardware failures. You can no longer deploy only two master nodes in a DataLake cluster, because recovery from node failures in a two-master deployment takes an extended period of time. If you enable high availability for a DataLake cluster, you cannot use an on-premises MySQL database deployed in standalone mode as the Hive metastore database. You can use only a Data Lake Formation (DLF) metadatabase or an ApsaraDB RDS database as the Hive metastore database.
When you create or scale out a DataLake cluster, the system checks the health status of Elastic Compute Service (ECS) instances to prevent abnormal ECS instances from being added to the cluster. While a DataLake cluster is running, the EMR cluster manager can identify serious issues, such as disk damage and long-term read or write failures, and trigger the node supplementation mechanism.
Flexibility
In the new EMR console, all services are optional for a DataLake cluster. You can select services for a DataLake cluster based on your business requirements. For example, you can select only HDFS to create an independent distributed storage cluster or select only Presto to create an independent cluster for ad hoc analysis.
DataLake clusters support the Alibaba Cloud DNS PrivateZone service. Node-to-node communication within a DataLake cluster no longer depends on the hosts file, which prevents the issues caused by that dependency.
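For example, you can verify on any node that a private domain name resolves through DNS rather than through an entry in the hosts file. The following is a minimal sketch; the domain name is a placeholder:

```bash
# Resolve another node's private domain name. With DNS PrivateZone, the
# answer comes from DNS instead of an entry in /etc/hosts.
getent hosts master-1-1.c-494bea2977d9****

# Query DNS directly for the same name.
dig +short master-1-1.c-494bea2977d9****
```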
Additional security groups and assignment of public IP addresses
By default, a Hadoop cluster has an easy-to-use security group. However, you cannot use the security group to manage the open ports of a cluster in a fine-grained manner. For example, if you want to open HDFS port 50070 only on the master nodes of a cluster, any inbound rule that you add for the port applies to all nodes in the cluster. In the new EMR console, you can associate at most two additional security groups with each node group in a DataLake cluster. This allows you to manage the inbound and outbound rules of ports in a fine-grained manner.
In the new EMR console, you can enable the assignment of public IP addresses at the node group level for a DataLake cluster. You can use this feature together with additional security groups to manage public IP addresses in a fine-grained manner.
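For example, if you associate an additional security group with only the master node group, an inbound rule added to that security group opens the port only for the master nodes. The following sketch uses the Alibaba Cloud CLI and the standard ECS AuthorizeSecurityGroup API; the region, security group ID, and CIDR block are placeholders:

```bash
# Open HDFS port 50070 only for the nodes in the node group that is
# associated with this additional security group (placeholder values).
aliyun ecs AuthorizeSecurityGroup \
  --RegionId cn-hangzhou \
  --SecurityGroupId sg-bp1example***** \
  --IpProtocol tcp \
  --PortRange 50070/50070 \
  --SourceCidrIp 10.0.0.0/8
```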
Deployment of Spark
You can use the following solutions to deploy Spark and Hadoop in a DataLake cluster: Hadoop 2 + Spark 2, Hadoop 3 + Spark 3, Hadoop 2 + Spark 3, or Hadoop 3 + Spark 2. Hadoop clusters support only the first two solutions. You can select a solution based on your business requirements. DataLake clusters also support the Kyuubi service. Kyuubi is an enterprise-level data lake computing engine that provides a unified, multi-tenant JDBC interface for Spark SQL, which helps you manage computing resources and process large amounts of data.
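For example, after Kyuubi is deployed, a client can submit Spark SQL statements over the JDBC interface. The following is a minimal sketch that assumes Kyuubi's default frontend port 10009; the host name and table name are placeholders:

```bash
# Connect to Kyuubi through its HiveServer2-compatible JDBC endpoint
# (default port 10009) and run a Spark SQL statement.
beeline -u 'jdbc:hive2://master-1-1.c-494bea2977d9****:10009/default' \
  -n emr-user \
  -e 'SELECT count(*) FROM sample_table;'
```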
Domain names
The format of domain names for nodes in a cluster is optimized. For example, the domain names of nodes in a Hadoop cluster are in the emr-header-1.cluster-13*** format, whereas the domain names of nodes in a DataLake cluster are in the master-1-1.c-494bea2977d9*** format.
If you enable high availability for a DataLake cluster, the domain names of the nodes in the cluster are in the master-1-{1-3}.<Cluster ID> format. The host display name is in the emr-user@master-1-1({IP address}) format. This helps you easily obtain the IP address of the current node and perform O&M operations.
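For example, you can confirm the domain name and private IP address of the node that you are on, as shown in the following sketch; the values in the comments are illustrative:

```bash
# Print this node's domain name, for example master-1-1.c-494bea2977d9****.
hostname

# Print the private IP address that the domain name resolves to.
hostname -i
```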
Logon user and private key
When you log on to a cluster by using a private key, the default username is changed from root to emr-user. We recommend that you perform O&M operations on nodes as the user emr-user. If you still want to use the root user to log on to a DataLake cluster, you can run the sudo command to switch to the root user after you log on to the cluster as the user emr-user.
You can use a private key to log on to all nodes of a DataLake cluster instead of only the master node. If you use a password for logon, you can still log on to the DataLake cluster as the root user.
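For example, a typical logon flow looks like the following sketch; the private key path and IP address are placeholders:

```bash
# Log on to a node as emr-user with the cluster's private key
# (the key path and IP address are placeholders).
ssh -i ~/.ssh/emr_private_key.pem emr-user@192.168.**.**

# After logon, switch to the root user for operations that require it.
sudo su - root
```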
emr-metadata
You can run the emr-metadata command on each node of a DataLake cluster. The command outputs metadata about the current node, such as the cluster ID, the role of the node, the instance ID, and the network and hardware configurations. This way, you can obtain the information about the current node that is required when you use a bootstrap action script.
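For example, a bootstrap action script can branch on the role of the node on which it runs. The following sketch assumes that the emr-metadata output contains a field that identifies the node role; the exact field name and output format are assumptions, so adjust the pattern to the actual output:

```bash
#!/bin/bash
# Bootstrap action sketch: run different initialization logic per node role.
# Assumption: the emr-metadata output includes the node role; the exact
# format may differ, so the grep pattern below must match the real output.
NODE_METADATA=$(emr-metadata)
echo "${NODE_METADATA}"

if echo "${NODE_METADATA}" | grep -qi "master"; then
  echo "Running master-only initialization..."
else
  echo "Running initialization for core or task nodes..."
fi
```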
Differences between DataLake clusters and Hadoop clusters
| Category | Comparison item | DataLake cluster | Hadoop cluster |
| --- | --- | --- | --- |
| Cluster | Time required to create a cluster | The average time is less than 5 minutes. | The average time is less than 10 minutes. |
| | Time required to add a node to a cluster | The average time is less than 3.5 minutes. | The average time is less than 10 minutes. |
| | API | Supported. | Supported. |
| | Domain name | You can use the Alibaba Cloud DNS PrivateZone service to resolve private domain names to IP addresses. | You can use the hosts file to resolve domain names to IP addresses. |
| | Disk capacity expansion | Hot disk capacity expansion is supported. You can expand the capacity of a disk without the need to restart the related service. | Hot disk capacity expansion is not supported. You must restart the related service after you expand the capacity of a disk. |
| | Service adding | Supported. | Supported. |
| Node group | vSwitch | You can select a vSwitch when you add a node group. | You can select a vSwitch only when you create an EMR cluster. After the cluster is created, the selected vSwitch cannot be changed. |
| | Assignment of public IP addresses | Supported at the node group level. | |
| | Additional security group | Supported. | Not supported. |
| | Deployment set | Supported. | Not supported. |
| | Status of node groups | Supported. | Not supported. |
| | Instance type | Nodes of the same specifications but different instance types are supported in a node group. | |
| Auto scaling | Auto scaling | Auto scaling is decoupled from auto scaling groups and can be implemented at the node group level. This facilitates scaling. | A dedicated auto scaling group is required for auto scaling. This node group cannot be manually scaled. |
| | Auto scaling rule | | |
| | Auto scaling history | You can view the auto scaling history, the causes of triggering auto scaling activities, and information about the added or removed nodes. | You can view only the auto scaling history. |
| | Metric collection frequency | Metrics are collected every 30 seconds. | Metrics are collected every 30 seconds. |
| | Time required for an auto scaling activity to take effect | 1 to 30 seconds after an auto scaling rule is applied. | 1 to 2 minutes after an auto scaling rule is applied. |
| Scaling | Scaling activity | | |
| High availability and software | High availability | You cannot use an on-premises MySQL database as a Hive metastore database. | You can use an on-premises MySQL database as a Hive metastore database. |
| | | Deployment sets are used. The three master nodes of a DataLake cluster are distributed to different underlying physical servers. This way, the risks that are related to hardware failures can be reduced. | Deployment sets are not used. |
| | | NameNode and ResourceManager are deployed on three nodes. A high-availability cluster with only two master nodes is no longer supported. | NameNode and ResourceManager are deployed only on two nodes. A high-availability cluster with only two master nodes is supported. |
| | Service | All services are optional. | Required services and optional services are provided. |
| | Combination of Spark 2 and Hadoop 3 | Supported. | Not supported. |
| | Combination of Spark 3 and Hadoop 2 | Supported. | Supported in EMR V3.38.0 and later. |