EMR is an abbreviation of E-MapReduce Service. EMR is a big data processing solution provided by Alibaba Cloud. EMR is built on Alibaba Cloud Elastic Compute Service (ECS) and developed based on both open-source Apache Hadoop and Apache Spark. It allows you to conveniently use peripheral systems in the Hadoop and Spark ecosystems to analyze and process data. EMR can also read data from or write data to other Alibaba Cloud storage systems and database systems, such as Object Storage Service (OSS) and ApsaraDB RDS.
SmartData is a storage service for the EMR Jindo engine. SmartData provides centralized storage, caching, and computing optimization for EMR computing engines and extends storage features.
If you use an open-source distributed processing system, such as Hadoop or Spark, to process data without using EMR, you must perform all the steps in the following figure.
In this procedure, only the last three steps are related to your application logic. The first seven steps are all preparations, which are complex and time-consuming. EMR integrates all the required cluster management tools to provide the following features: host selection, environment deployment, cluster building, cluster configuration, cluster running, job configuration, job running, cluster management, and performance monitoring. This frees you from the tedious procurement, preparation, and O&M work required to build clusters, so you need to focus only on the processing logic of your applications.
EMR also offers different combinations of cluster services to meet your business requirements. For example, to perform daily data measurement and batch computing, you only need to run the Hadoop service in EMR. If you also want to perform stream computing and real-time computing, you can add the Spark service.
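As a rough illustration of the rule above, the following minimal sketch maps a set of workload types to the cluster services to enable. The function name `services_for` and the workload labels are hypothetical and are not part of any EMR API; the sketch only encodes the two cases the text mentions.

```python
def services_for(workloads):
    """Return the services to run for the given workload labels.

    Hypothetical helper: the labels "batch", "stream", and "real-time"
    are illustrative names, not EMR identifiers.
    """
    # Batch computing and daily data measurement only need Hadoop.
    services = ["Hadoop"]
    # Stream or real-time computing additionally needs Spark.
    if {"stream", "real-time"} & set(workloads):
        services.append("Spark")
    return services


print(services_for(["batch"]))
print(services_for(["batch", "stream"]))
```

With only batch workloads the sketch selects Hadoop alone; adding a stream or real-time workload adds Spark to the selection.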
Clusters are the core user-oriented component of EMR. An EMR cluster is a Hadoop or Spark cluster that is deployed on one or more ECS instances. For example, a Hadoop cluster consists of several daemon processes, such as NameNode, DataNode, ResourceManager, and NodeManager. These daemon processes run on the ECS instances of the cluster. Each ECS instance corresponds to a node. The NameNode and ResourceManager processes run on master nodes, whereas the DataNode and NodeManager processes run on core and task nodes.
The following figure shows an EMR cluster that consists of one master node and three core and task nodes.
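The layout described above (one master node plus three core and task nodes) can be modeled with a small data structure. This is a minimal sketch, not an EMR API: the `Node` class, the `build_cluster` helper, the role labels, and the instance names are all hypothetical. Core and task nodes are grouped under a single "worker" role because the text assigns them the same daemons.

```python
from dataclasses import dataclass

# Daemon placement as stated in the text; the role labels are illustrative.
DAEMONS_BY_ROLE = {
    "master": ("NameNode", "ResourceManager"),
    "worker": ("DataNode", "NodeManager"),  # core and task nodes
}


@dataclass(frozen=True)
class Node:
    name: str  # hypothetical ECS instance name
    role: str  # "master" or "worker"

    @property
    def daemons(self):
        """Daemon processes that run on this node, based on its role."""
        return DAEMONS_BY_ROLE[self.role]


def build_cluster(masters=1, workers=3):
    """Build a list of nodes: each node corresponds to one ECS instance."""
    nodes = [Node(f"master-{i + 1}", "master") for i in range(masters)]
    nodes += [Node(f"worker-{i + 1}", "worker") for i in range(workers)]
    return nodes


cluster = build_cluster()
for node in cluster:
    print(node.name, "->", ", ".join(node.daemons))
```

Each entry in `cluster` stands for one ECS instance, and the `daemons` property shows which Hadoop processes that instance would host under the placement the text describes.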
EMR clusters are created based on the Hadoop ecosystem. EMR clusters can exchange data seamlessly with Alibaba Cloud services such as Object Storage Service (OSS) and ApsaraDB Relational Database Service (RDS). This enables you to share and transmit data among multiple systems to meet different business demands.
EMR provides an integrated solution for cluster management, which frees you from the complexity of managing clusters yourself. EMR has several practical advantages over self-managed clusters.
It's important to understand how big data analytics works on Alibaba Cloud E-MapReduce. It's equally important to manage the environment that you run it in. Managing a Hadoop cluster, which includes maintaining high availability, starting and stopping services, and scaling out to add computing capacity, is essential for processing big data smoothly with uninterrupted services. These tasks are easier on Alibaba Cloud because you can manage everything from the web console.
This article covers EMR cluster management for people who are new to Alibaba Cloud E-MapReduce. The previous article, Diving into Big Data: Getting Started with OSS and EMR, showed how to create a cluster in EMR as an initial step. This article goes further and looks at the different methods for creating an EMR cluster, the services that run when a cluster starts, and how to expand and release a cluster, among other things.
EMR takes care of most of the basic tasks required for cluster creation and provisioning, and it provides an integrated framework for managing and using clusters. It uses the full capabilities of Hadoop and Spark, so you do not need to provision Hadoop from scratch. Because it is based on Spark, you can also process large volumes of streaming data. It integrates easily with other Alibaba Cloud products, such as Elastic Compute Service (ECS) and OSS.
EMR is an all-in-one enterprise-ready big data platform that provides cluster, job, and data management services based on open-source ecosystems, such as Hadoop, Spark, Kafka, Flink, and Storm.