To use E-MapReduce (EMR), you must select an appropriate cluster. When you select configurations for an EMR cluster, consider how big data is used in your enterprise, your budget, the amount of data to be processed, and the service reliability that you require.
Big data scenarios
The following table describes the use scenarios and core services of different types of EMR clusters.
The services that are actually available for each cluster type are those displayed in the EMR console.
| Cluster type | Use scenario | Core services |
| --- | --- | --- |
| DataLake | DataLake clusters use the Hive and Spark services to compute and analyze offline data in data lake scenarios. Data lake formats such as Delta Lake, Hudi, and Iceberg are supported. | HDFS, YARN, Hive, Spark, Presto, Impala, JindoData, Delta Lake, Hudi, Iceberg, OpenLDAP, Knox, and Kyuubi |
| Dataflow | Dataflow clusters are used for real-time data processing. The core service of a Dataflow cluster is Flink, an enterprise-level big data computing platform that Alibaba Cloud provides based on Apache Flink and EMR Hadoop. The Kafka service, which provides a comprehensive service monitoring system and a metadata management mechanism, applies to scenarios such as log collection and monitoring data aggregation, and can be used for offline data processing, stream computing, and real-time data analysis. | Flink, Kafka, and YARN |
| OLAP | Online analytical processing (OLAP) clusters are used for data analysis. The core service of an OLAP cluster is ClickHouse, an open source column-oriented database management system (DBMS) for OLAP. ClickHouse is more lightweight than Hadoop and Spark, supports linear scaling, and is convenient, highly reliable, and fault-tolerant. StarRocks is an open source massively parallel processing (MPP) database for OLAP that can respond to query requests in sub-seconds and join multiple tables. | ClickHouse, StarRocks, and ZooKeeper |
| DataServing | DataServing clusters are more flexible, reliable, and efficient than other types of clusters. These clusters provide the HBase service, with compute and storage separated based on OSS-HDFS (JindoFS). In addition, data is cached by using JindoData to improve read and write performance. | HBase, ZooKeeper, and JindoData |
EMR nodes
An EMR cluster consists of three types of nodes: master, core, and task nodes. For more information, see Node categories.
You can select ultra disks, local disks, standard SSDs, or local SSDs for EMR storage. These disks are ranked in descending order of performance: local SSDs > standard SSDs > local disks > ultra disks.
EMR underlying storage supports OSS (OSS Standard storage only) and HDFS. OSS provides higher data availability than HDFS. The data availability of OSS is 99.99999999%, whereas the data availability of HDFS depends on the reliability of the cloud disks or local disks that store the data. Data stored in OSS Archive or Cold Archive storage must be restored to OSS Standard storage before EMR can compute it.
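The restore rule above can be sketched as a simple check. This is an illustrative helper, not part of any EMR or OSS SDK; the storage class names follow common OSS naming, and `needs_restore` is a hypothetical function name.

```python
def needs_restore(storage_class: str) -> bool:
    """Return True if an OSS object must be restored before EMR can read it.

    Archive and Cold Archive objects are not directly readable and must be
    restored first; Standard objects can be read by EMR jobs as-is.
    """
    return storage_class in ("Archive", "ColdArchive")
```

For example, `needs_restore("ColdArchive")` returns `True`, signaling that a restore must complete before the data can be used as compute input.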
Storage prices:
For information about the billing of disks, visit the Pricing tab on the Elastic Compute Service product page.
Select configurations for EMR
Select master node configurations.
Master nodes are used to deploy the master processes of Hadoop, such as NameNode and ResourceManager.
EMR components such as HDFS, YARN, Hive, and HBase support a high-availability architecture. When you create a production cluster, we recommend that you enable high availability. High availability cannot be enabled after the cluster is created.
Master nodes store HDFS metadata and component log files. These nodes are compute-intensive and have low disk I/O requirements. HDFS metadata is stored in memory, so the required memory size grows with the number of files. We recommend a minimum of 16 GB of memory.
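As a rough sketch of why memory scales with file count: a common Hadoop rule of thumb (an assumption here, not an EMR-documented figure) is that each file, directory, or block consumes on the order of 150 bytes of NameNode heap. A hypothetical sizing helper might look like this:

```python
def estimate_namenode_heap_gb(num_objects: int, bytes_per_object: int = 150) -> float:
    """Estimate master-node heap (GB) for HDFS metadata.

    bytes_per_object is a rule-of-thumb assumption (~150 bytes per
    file/directory/block in NameNode heap). A 2x headroom factor is
    applied, and the result is floored at the 16 GB recommended minimum.
    """
    raw_bytes = num_objects * bytes_per_object
    return max(16.0, raw_bytes * 2 / 1024**3)
```

Under these assumptions, a few million objects stay within the 16 GB floor, while hundreds of millions of objects call for substantially more memory.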
Select core node configurations.
The difference between core nodes and task nodes is that core nodes run both the DataNode and NodeManager processes. We recommend that you store data in OSS or OSS-HDFS and use HDFS in the cluster only as temporary storage for the distribution of YARN tasks. You can select a general-purpose Elastic Compute Service (ECS) instance type that uses cloud disks for core nodes. For example, you can select ecs.g7.4xlarge as the node type and configure four 100 GB data disks.
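To gauge how much temporary HDFS space a core-node configuration yields, raw disk capacity must be divided by the HDFS replication factor (3 by default). The helper below is a hypothetical sketch; its defaults mirror the four 100 GB disks from the example above.

```python
def usable_hdfs_capacity_gib(core_nodes: int, disks_per_node: int = 4,
                             disk_size_gib: int = 100, replication: int = 3) -> float:
    """Effective HDFS capacity (GiB) available for temporary/shuffle data.

    Raw capacity (nodes x disks x disk size) divided by the HDFS
    replication factor; defaults assume four 100 GiB data disks per node.
    """
    return core_nodes * disks_per_node * disk_size_gib / replication
```

For example, three core nodes with the default configuration provide about 400 GiB of effective HDFS capacity.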
Select task node configurations.
Task nodes are used when the vCPU and memory resources of core nodes are insufficient for your computing workloads. Task nodes do not store data or run the DataNode process. You can estimate the number of task nodes based on your vCPU and memory requirements.
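The estimate described above can be sketched as sizing by whichever resource is the bottleneck. The per-node defaults below (16 vCPUs, 64 GiB) are illustrative assumptions, not a recommended instance type.

```python
import math

def task_nodes_needed(total_vcpus: int, total_mem_gib: int,
                      vcpus_per_node: int = 16, mem_per_node_gib: int = 64) -> int:
    """Estimate the task node count from aggregate resource requirements.

    Computes the node count implied by vCPUs and by memory separately,
    then takes the larger of the two (the bottleneck resource).
    """
    by_cpu = math.ceil(total_vcpus / vcpus_per_node)
    by_mem = math.ceil(total_mem_gib / mem_per_node_gib)
    return max(by_cpu, by_mem)
```

For example, a workload that needs 100 vCPUs and 200 GiB of memory is CPU-bound under these assumptions and requires 7 task nodes.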
EMR lifecycle
EMR supports auto scaling. You can scale out a cluster in an efficient and flexible manner. You can also upgrade the configurations of ECS instances in a cluster based on your business requirements.
Select a zone
To ensure high efficiency, we recommend that you deploy EMR and your business system in the same zone of the same region.