A DataLake cluster is a big data computing cluster that allows you to analyze data in a flexible, reliable, and efficient manner. You can create a DataLake cluster only in the new E-MapReduce (EMR) console. You can easily build a scalable data pipeline based on DataLake clusters. This topic describes the features supported by DataLake clusters and the differences between DataLake clusters and Hadoop clusters.
Features
Reliability
If you enable high availability for a DataLake cluster, three master nodes that are distributed across different underlying physical servers are deployed for the cluster. This reduces the risks caused by hardware failures. You can no longer deploy only two master nodes in a DataLake cluster, because recovery from node failures in a two-master deployment takes an extended period of time. If you enable high availability for a DataLake cluster, you cannot use an on-premises MySQL database deployed in standalone mode as the Hive metastore database. You can use only a Data Lake Formation (DLF) metadatabase or an ApsaraDB RDS database as the Hive metastore database.
When you create or scale out a DataLake cluster, the system checks the health status of Elastic Compute Service (ECS) instances to prevent abnormal ECS instances from being added to the cluster. While a DataLake cluster is running, the EMR cluster manager can identify serious issues, such as disk damage and long-term read or write failures, and trigger the node supplementation mechanism.
Flexibility
In the new EMR console, all services are optional for a DataLake cluster. You can select services for a DataLake cluster based on your business requirements. For example, you can select only HDFS to create an independent distributed storage cluster or select only Presto to create an independent cluster for ad hoc analysis.
DataLake clusters support the Alibaba Cloud DNS PrivateZone service. Node-to-node communication within a DataLake cluster no longer depends on the hosts file, which prevents the issues caused by that dependency.
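For example, you can verify on any node that a private domain name resolves through DNS rather than through an entry in the hosts file. The following is a minimal sketch; the domain name is a placeholder:

```bash
# Resolve another node's private domain name. With DNS PrivateZone, the
# answer comes from DNS instead of an entry in /etc/hosts.
getent hosts master-1-1.c-494bea2977d9****

# Query DNS directly for the same name.
dig +short master-1-1.c-494bea2977d9****
```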
Additional security groups and assignment of public IP addresses
By default, a Hadoop cluster has an easy-to-use security group. However, you cannot use the security group to manage the open ports of a cluster in a fine-grained manner. For example, if you want to open HDFS port 50070 only on the master nodes of a cluster, any inbound rule that you add for the port applies to all nodes in the cluster. In the new EMR console, you can associate at most two additional security groups with each node group in a DataLake cluster. This allows you to manage the inbound and outbound rules of ports in a fine-grained manner.
In the new EMR console, you can enable the assignment of public IP addresses at the node group level for a DataLake cluster. You can use this feature together with additional security groups to manage public IP addresses in a fine-grained manner.
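For example, if you associate an additional security group with only the master node group, an inbound rule added to that security group opens the port only for the master nodes. The following sketch uses the Alibaba Cloud CLI and the standard ECS AuthorizeSecurityGroup API; the region, security group ID, and CIDR block are placeholders:

```bash
# Open HDFS port 50070 only for the nodes in the node group that is
# associated with this additional security group (placeholder values).
aliyun ecs AuthorizeSecurityGroup \
  --RegionId cn-hangzhou \
  --SecurityGroupId sg-bp1example***** \
  --IpProtocol tcp \
  --PortRange 50070/50070 \
  --SourceCidrIp 10.0.0.0/8
```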
Deployment of Spark
You can use the following solutions to deploy Spark and Hadoop in a DataLake cluster: Hadoop 2 + Spark 2, Hadoop 3 + Spark 3, Hadoop 2 + Spark 3, or Hadoop 3 + Spark 2. Hadoop clusters support only the first two solutions. You can select a solution based on your business requirements. DataLake clusters also support the Kyuubi service. Kyuubi is an enterprise-level data lake computing engine that provides a unified, multi-tenant JDBC interface for Spark SQL, which helps you manage computing resources and process large amounts of data.
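For example, after Kyuubi is deployed, a client can submit Spark SQL statements over the JDBC interface. The following is a minimal sketch that assumes Kyuubi's default frontend port 10009; the host name and table name are placeholders:

```bash
# Connect to Kyuubi through its HiveServer2-compatible JDBC endpoint
# (default port 10009) and run a Spark SQL statement.
beeline -u 'jdbc:hive2://master-1-1.c-494bea2977d9****:10009/default' \
  -n emr-user \
  -e 'SELECT count(*) FROM sample_table;'
```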
Domain names
The format of domain names for nodes in a cluster is optimized. For example, the domain names of nodes in a Hadoop cluster are in the emr-header-1.cluster-13*** format, whereas the domain names of nodes in a DataLake cluster are in the master-1-1.c-494bea2977d9*** format.
If you enable high availability for a DataLake cluster, the domain names of the nodes in the cluster are in the master-1-{1-3}.<Cluster ID> format. The host display name is in the emr-user@master-1-1({IP address}) format. This helps you easily obtain the IP address of the current node and perform O&M operations.
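For example, you can confirm the domain name and private IP address of the node that you are on, as shown in the following sketch; the values in the comments are illustrative:

```bash
# Print this node's domain name, for example master-1-1.c-494bea2977d9****.
hostname

# Print the private IP address that the domain name resolves to.
hostname -i
```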
Logon user and private key
When you log on to a cluster by using a private key, the default username is changed from root to emr-user. We recommend that you perform O&M operations on nodes as the user emr-user. If you still want to use the root user to log on to a DataLake cluster, you can run the sudo command to switch to the root user after you log on to the cluster as the user emr-user.
You can use a private key to log on to all nodes of a DataLake cluster instead of only the master node. If you use a password for logon, you can still log on to the DataLake cluster as the root user.
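For example, a typical logon flow looks like the following sketch; the private key path and IP address are placeholders:

```bash
# Log on to a node as emr-user with the cluster's private key
# (the key path and IP address are placeholders).
ssh -i ~/.ssh/emr_private_key.pem emr-user@192.168.**.**

# After logon, switch to the root user for operations that require it.
sudo su - root
```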
emr-metadata
You can run the emr-metadata command on each node of a DataLake cluster. The command outputs metadata about the current node, such as the cluster ID, the role of the node, the instance ID, and the network and hardware configurations. This way, you can obtain the information about the current node that is required when you use a bootstrap action script.
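For example, a bootstrap action script can branch on the role of the node on which it runs. The following sketch assumes that the emr-metadata output contains a field that identifies the node role; the exact field name and output format are assumptions, so adjust the pattern to the actual output:

```bash
#!/bin/bash
# Bootstrap action sketch: run different initialization logic per node role.
# Assumption: the emr-metadata output includes the node role; the exact
# format may differ, so the grep pattern below must match the real output.
NODE_METADATA=$(emr-metadata)
echo "${NODE_METADATA}"

if echo "${NODE_METADATA}" | grep -qi "master"; then
  echo "Running master-only initialization..."
else
  echo "Running initialization for core or task nodes..."
fi
```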
Differences between DataLake clusters and Hadoop clusters
| Category | Comparison item | DataLake cluster | Hadoop cluster |
| --- | --- | --- | --- |
| Cluster | Time required to create a cluster | The average time is less than 5 minutes. | The average time is less than 10 minutes. |
| | Time required to add a node to a cluster | The average time is less than 3.5 minutes. | The average time is less than 10 minutes. |
| | API | Supported. | Supported. |
| | Domain name | You can use the Alibaba Cloud DNS PrivateZone service to resolve private domain names to IP addresses. | You can use the hosts file to resolve domain names to IP addresses. |
| | Disk capacity expansion | Hot disk capacity expansion is supported. You can expand the capacity of a disk without the need to restart the related service. | Hot disk capacity expansion is not supported. You must restart the related service after you expand the capacity of a disk. |
| | Service adding | Supported. | Supported. |
| Node group | vSwitch | You can select a vSwitch when you add a node group. | You can select a vSwitch only when you create an EMR cluster. After the cluster is created, the selected vSwitch cannot be changed. |
| | Assignment of public IP addresses | Supported at the node group level. | |
| | Additional security group | Supported. | Not supported. |
| | Deployment set | Supported. | Not supported. |
| | Status of node groups | Supported. | Not supported. |
| | Instance type | Nodes of the same specifications but different instance types are supported in a node group. | |
| Auto scaling | Auto scaling | Auto scaling is decoupled from auto scaling groups and can be implemented at the node group level. This facilitates scaling. | A dedicated auto scaling group is required for auto scaling. This node group cannot be manually scaled. |
| | Auto scaling rule | | |
| | Auto scaling history | You can view the auto scaling history, the causes of triggering auto scaling activities, and information about the added or removed nodes. | You can view only the auto scaling history. |
| | Metric collection frequency | Metrics are collected every 30 seconds. | Metrics are collected every 30 seconds. |
| | Time required for an auto scaling activity to take effect | 1 to 30 seconds after an auto scaling rule is applied. | 1 to 2 minutes after an auto scaling rule is applied. |
| Scaling | Scaling activity | | |
| High availability and software | High availability | You cannot use an on-premises MySQL database as a Hive metastore database. | You can use an on-premises MySQL database as a Hive metastore database. |
| | | Deployment sets are used. The three master nodes of a DataLake cluster are distributed to different underlying physical servers. This way, the risks that are related to hardware failures can be reduced. | Deployment sets are not used. |
| | | NameNode and ResourceManager are deployed on three nodes. A high-availability cluster with only two master nodes is no longer supported. | NameNode and ResourceManager are deployed only on two nodes. A high-availability cluster with only two master nodes is supported. |
| | Service | All services are optional. | Required services and optional services are provided. |
| | Combination of Spark 2 and Hadoop 3 | Supported. | Not supported. |
| | Combination of Spark 3 and Hadoop 2 | Supported. | Supported in EMR V3.38.0 and later. |