Hadoop is an open source, distributed, Java-based software framework developed by the Apache Software Foundation. Hadoop allows users to develop distributed programs and make full use of cluster capacity for high-speed computing and storage without the need to understand the underlying details of the distributed system. This topic describes how to build a Hadoop distributed environment on an Elastic Compute Service (ECS) instance that runs a Linux operating system.
Prerequisites
An ECS instance that meets the following requirements is created:
The ECS instance is associated with a system-assigned public IP address or an elastic IP address (EIP).
The ECS instance runs a Linux operating system.
Inbound rules are added to the security groups to which the ECS instance belongs to open ports 22, 443, 8088, and 9870. Port 8088 is the default web UI port for Hadoop Yet Another Resource Negotiator (YARN). Port 9870 is the default web UI port for Hadoop NameNode. For information about how to add an inbound security group rule, see Add a security group rule.
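If you prefer to open the ports from the command line, the following is a hypothetical sketch that uses the Alibaba Cloud CLI (aliyun). The region ID, security group ID, and source CIDR block are placeholder values that you must replace with your own; restricting the source CIDR block to your own IP address is safer than 0.0.0.0/0.
aliyun ecs AuthorizeSecurityGroup \
    --RegionId cn-hangzhou \
    --SecurityGroupId sg-xxxxxxxx \
    --IpProtocol tcp \
    --PortRange 8088/8088 \
    --SourceCidrIp 0.0.0.0/0    # repeat for ports 22, 443, and 9870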
Background information
The Apache Hadoop software library is a framework that allows you to process large data sets across clusters of computers in a distributed manner by using simple programming models. The framework can scale up from a single server to thousands of machines that provide local computation and storage capabilities. Hadoop does not rely on hardware to deliver high availability. It is designed to detect and handle failures at the application layer to deliver a highly available service on top of a cluster of computers that are prone to failures.
Hadoop Distributed File System (HDFS) and MapReduce are vital components of Hadoop.
HDFS is a distributed file system that is used for distributed storage of application data and provides access to application data.
MapReduce is a distributed computing framework that distributes computing jobs across servers in a Hadoop cluster. Computing jobs are split into map tasks and reduce tasks. In Hadoop 2 and later, YARN schedules the tasks for distributed processing; the JobTracker performed this role only in Hadoop 1. A conceptual single-machine analogy of these stages follows after this overview.
For more information, visit the Apache Hadoop website.
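The map, shuffle, and reduce stages can be previewed on a single machine with ordinary shell pipes. The following word count is only a conceptual analogy, not Hadoop itself; Hadoop runs the same stages in parallel across the servers in a cluster.
echo -e "apache hadoop\nhadoop hdfs" |   # input split into records
    tr ' ' '\n' |                        # map: emit one word per record
    sort |                               # shuffle: group identical keys together
    uniq -c                              # reduce: aggregate a count per key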
Hadoop is written in Java and requires a Java Development Kit (JDK) to run. Different Hadoop versions support different Java versions:
Hadoop 3.3: Java 8 and Java 11
Hadoop 3.0.x to 3.2.x: Java 8
Hadoop 2.7.x to 2.10.x: Java 7 and Java 8
In this topic, Hadoop 3.2.4 and Java 8 are used. To use other Hadoop and Java versions, visit the official Hadoop website for instructions. For more information, see Hadoop Java Versions.
Step 1: Install JDK
Connect to the ECS instance.
For more information, see Connect to a Linux instance by using a password or key.
Run the following command to download the JDK 1.8 installation package:
wget https://download.java.net/openjdk/jdk8u41/ri/openjdk-8u41-b04-linux-x64-14_jan_2020.tar.gz
Run the following command to decompress the downloaded JDK 1.8 installation package:
tar -zxvf openjdk-8u41-b04-linux-x64-14_jan_2020.tar.gz
Run the following command to move and rename the folder to which the JDK installation files are extracted.
In this example, the folder is renamed java8. You can specify a different name for the folder based on your business requirements.
sudo mv java-se-8u41-ri/ /usr/java8
Run the following commands to configure Java environment variables.
If the name you specified for the folder to which the JDK installation files are extracted is not java8, replace java8 in the following commands with the actual folder name:
sudo sh -c "echo 'export JAVA_HOME=/usr/java8' >> /etc/profile"
sudo sh -c 'echo "export PATH=\$PATH:\$JAVA_HOME/bin" >> /etc/profile'
source /etc/profile
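Optionally, confirm that the variables took effect in the current shell before you continue:
echo $JAVA_HOME           # should print /usr/java8 (or the folder name you chose)
ls $JAVA_HOME/bin/java    # the java binary should exist at this path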
Run the following command to check whether JDK is installed:
java -version
The following command output indicates that JDK is installed.
Step 2: Install Hadoop
Run the following command to download the Hadoop installation package:
wget https://mirrors.bfsu.edu.cn/apache/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz
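The BFSU mirror may not host checksum files, but the official Apache archive does. As an optional integrity check, assuming that the instance can reach archive.apache.org:
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz.sha512
sha512sum hadoop-3.2.4.tar.gz        # compare this digest with the one in the .sha512 file
cat hadoop-3.2.4.tar.gz.sha512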
Run the following commands to decompress the Hadoop installation package to the /opt/hadoop path:
sudo tar -zxvf hadoop-3.2.4.tar.gz -C /opt/
sudo mv /opt/hadoop-3.2.4 /opt/hadoop
Run the following commands to configure Hadoop environment variables:
sudo sh -c "echo 'export HADOOP_HOME=/opt/hadoop' >> /etc/profile" sudo sh -c "echo 'export PATH=\$PATH:/opt/hadoop/bin' >> /etc/profile" sudo sh -c "echo 'export PATH=\$PATH:/opt/hadoop/sbin' >> /etc/profile" source /etc/profile
Run the following commands to modify the yarn-env.sh and hadoop-env.sh configuration files:
sudo sh -c 'echo "export JAVA_HOME=/usr/java8" >> /opt/hadoop/etc/hadoop/yarn-env.sh'
sudo sh -c 'echo "export JAVA_HOME=/usr/java8" >> /opt/hadoop/etc/hadoop/hadoop-env.sh'
Run the following command to check whether Hadoop is installed:
hadoop version
The following command output indicates that Hadoop is installed.
Step 3: Configure Hadoop
Modify the core-site.xml configuration file of Hadoop.
Run the following command to open the core-site.xml configuration file:
sudo vim /opt/hadoop/etc/hadoop/core-site.xml
Press the I key to enter Insert mode. In the <configuration></configuration> section, add the following content:
<property>
    <name>hadoop.tmp.dir</name>
    <value>file:/opt/hadoop/tmp</value>
    <description>location to store temporary files</description>
</property>
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
</property>
Press the Esc key to exit Insert mode and enter :wq to save and close the file.
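If you prefer a non-interactive approach, the same configuration can be written in a single step with a heredoc instead of vim. This sketch assumes the same values as above and overwrites the entire file:
sudo tee /opt/hadoop/etc/hadoop/core-site.xml > /dev/null <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/opt/hadoop/tmp</value>
        <description>location to store temporary files</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
EOF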
Modify the hdfs-site.xml configuration file of Hadoop.
Run the following command to open the hdfs-site.xml configuration file:
sudo vim /opt/hadoop/etc/hadoop/hdfs-site.xml
Press the I key to enter Insert mode. In the <configuration></configuration> section, add the following content:
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/opt/hadoop/tmp/dfs/name</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/opt/hadoop/tmp/dfs/data</value>
</property>
Press the Esc key to exit Insert mode and enter :wq to save and close the file.
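Optionally, check that both edited files are still well-formed XML. This sketch assumes that xmllint is installed (package libxml2-utils on Debian and Ubuntu, libxml2 on RHEL-based systems); no output means the files parse cleanly:
xmllint --noout /opt/hadoop/etc/hadoop/core-site.xml
xmllint --noout /opt/hadoop/etc/hadoop/hdfs-site.xml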
Step 4: Configure password-free SSH logon
Run the following command to create a pair of public and private keys:
ssh-keygen -t rsa
The following command output indicates that the public and private keys are created:
Run the following commands to add the public key to the authorized_keys file:
cd ~/.ssh
cat id_rsa.pub >> authorized_keys
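Optionally, verify that password-free logon works before you start Hadoop. The chmod command is a precaution because, by default, sshd rejects an authorized_keys file that has overly permissive modes:
chmod 600 ~/.ssh/authorized_keys
ssh localhost exit    # confirm the host key if prompted; no password prompt should appear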
Step 5: Start Hadoop
Run the following command to initialize the NameNode. In Hadoop 3.x, hdfs namenode -format is the current form of the command; the older hadoop namenode -format still works but prints a deprecation warning.
hdfs namenode -format
Start Hadoop.
Important: To ensure system security and stability, Apache recommends that you do not start Hadoop as the root user. Starting Hadoop as the root user may also fail due to a permissions issue. Start Hadoop as a non-root user, such as ecs-user, instead. If you still want to start Hadoop as the root user, make sure that you are familiar with the instructions for managing Hadoop permissions and the associated risks, and declare the daemon users as shown in the sketch below.
Take note that starting Hadoop as the root user may create severe security risks, including, but not limited to, data leaks, vulnerabilities that malware can exploit to obtain the highest system permissions, and unexpected permission issues or operations. For information about Hadoop permissions, see Hadoop in Secure Mode.
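The following is a minimal sketch of the variables that the Hadoop 3 start scripts check before they allow daemons to run as root; without them, start-dfs.sh and start-yarn.sh exit with an error. You can append them to /opt/hadoop/etc/hadoop/hadoop-env.sh, for example:
# Required only when running the daemons as root (not recommended).
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root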
Run the following command to start the HDFS service.
The start-dfs.sh script starts the HDFS components: the NameNode, the SecondaryNameNode, and the DataNode.
start-dfs.sh
The following command output indicates that the HDFS service is started.
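Optionally, run a quick check against the file system. These commands assume that the HDFS daemons are up; the report should show one live DataNode:
hdfs dfsadmin -report                # summary of the cluster and its DataNodes
hdfs dfs -mkdir -p /user/$(whoami)   # create a home directory for the current user
hdfs dfs -ls /                       # list the root of the HDFS file system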
Run the following command to start the YARN service.
The start-yarn.sh script starts the YARN components, such as the ResourceManager and the NodeManager.
start-yarn.sh
The following command output indicates that the YARN service is started.
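Optionally, confirm that the NodeManager has registered with the ResourceManager:
yarn node -list    # should list one node in the RUNNING state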
Run the following command to view the processes that are started:
jps
The processes that are shown in the following figure, such as NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager, are started.
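As an optional end-to-end test, you can submit the sample pi estimation job that ships with the release; the job then appears on the YARN web UI that is described in the next step. This sketch assumes the default location of the examples JAR in the Hadoop 3.2.4 tarball:
hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.4.jar pi 2 10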
Enter http://<Public IP address of the ECS instance>:8088 in the address bar of a web browser on your computer to access the web UI of YARN.
The web UI shows information about the entire cluster, including resource usage, the status of applications (such as MapReduce jobs), and queue information.
Important: Make sure that an inbound rule is added to a security group to which the ECS instance belongs to open port 8088. Otherwise, you cannot access the web UI. For more information, see Add a security group rule.
Enter http://<Public IP address of the ECS instance>:9870 in the address bar of a web browser on your computer to access the web UI of the NameNode.
The web UI shows information about the HDFS file system, including the file system status, cluster health, active nodes, and NameNode logs.
The page shown in the following figure indicates that the distributed Hadoop environment is built.
Important: Make sure that an inbound rule is added to a security group to which the ECS instance belongs to open port 9870. Otherwise, you cannot access the web UI. For information about how to add an inbound security group rule, see Add a security group rule.
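To shut down the cluster cleanly when you are done, run the stop scripts that correspond to the start scripts used above:
stop-yarn.sh
stop-dfs.sh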