Hadoop is an open source, distributed, Java-based software framework developed by the Apache Software Foundation. Hadoop allows users to develop distributed programs and make full use of cluster capacity for high-speed computing and storage without the need to understand the underlying details of the distributed system. This topic describes how to build a Hadoop distributed environment on an Elastic Compute Service (ECS) instance that runs a Linux operating system.
Prerequisites
An ECS instance that meets the following requirements is created:
The ECS instance is associated with a system-assigned public IP address or an elastic IP address (EIP).
The ECS instance runs a Linux operating system.
Inbound rules are added to the security groups to which the ECS instance belongs to open ports 22, 443, 8088, and 9870. Port 8088 is the default web UI port for Hadoop Yet Another Resource Negotiator (YARN). Port 9870 is the default web UI port for Hadoop NameNode. For information about how to add an inbound security group rule, see Add a security group rule.
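If you prefer to open the ports from the command line, the following is a hypothetical sketch that uses the Alibaba Cloud CLI (aliyun). The region ID, security group ID, and source CIDR block are placeholder values that you must replace with your own; restricting the source CIDR block to your own IP address is safer than 0.0.0.0/0.
aliyun ecs AuthorizeSecurityGroup \
    --RegionId cn-hangzhou \
    --SecurityGroupId sg-xxxxxxxx \
    --IpProtocol tcp \
    --PortRange 8088/8088 \
    --SourceCidrIp 0.0.0.0/0    # repeat for ports 22, 443, and 9870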
Background information
The Apache Hadoop software library is a framework that allows you to process large data sets across clusters of computers in a distributed manner by using simple programming models. The framework can scale up from a single server to thousands of machines that provide local computation and storage capabilities. Hadoop does not rely on hardware to deliver high availability. It is designed to detect and handle failures at the application layer to deliver a highly available service on top of a cluster of computers that are prone to failures.
Hadoop Distributed File System (HDFS) and MapReduce are vital components of Hadoop.
HDFS is a distributed file system that is used for distributed storage of application data and provides access to application data.
MapReduce is a distributed computing framework that distributes computing jobs across servers in a Hadoop cluster. Computing jobs are split into map tasks and reduce tasks. In Hadoop 2 and later, YARN schedules the tasks for distributed processing; the JobTracker performed this role only in Hadoop 1. A conceptual single-machine analogy of these stages follows after this overview.
For more information, visit the Apache Hadoop website.
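The map, shuffle, and reduce stages can be previewed on a single machine with ordinary shell pipes. The following word count is only a conceptual analogy, not Hadoop itself; Hadoop runs the same stages in parallel across the servers in a cluster.
echo -e "apache hadoop\nhadoop hdfs" |   # input split into records
    tr ' ' '\n' |                        # map: emit one word per record
    sort |                               # shuffle: group identical keys together
    uniq -c                              # reduce: aggregate a count per key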
Hadoop is written in Java and requires a Java Development Kit (JDK) to run. Different Hadoop versions support different Java versions:
Hadoop 3.3: Java 8 and Java 11
Hadoop 3.0.x to 3.2.x: Java 8
Hadoop 2.7.x to 2.10.x: Java 7 and Java 8
In this topic, Hadoop 3.2.4 and Java 8 are used. To use other Hadoop and Java versions, visit the official Hadoop website for instructions. For more information, see Hadoop Java Versions.
Step 1: Install JDK
Connect to the ECS instance.
For more information, see Connect to a Linux instance by using a password or key.
Run the following command to download the JDK 1.8 installation package:
wget https://download.java.net/openjdk/jdk8u41/ri/openjdk-8u41-b04-linux-x64-14_jan_2020.tar.gz
Run the following command to decompress the downloaded JDK 1.8 installation package:
tar -zxvf openjdk-8u41-b04-linux-x64-14_jan_2020.tar.gz
Run the following command to move and rename the folder to which the JDK installation files are extracted.
In this example, the folder is renamed java8. You can specify a different name for the folder based on your business requirements.
sudo mv java-se-8u41-ri/ /usr/java8
Run the following commands to configure Java environment variables.
If the name you specified for the folder to which the JDK installation files are extracted is not java8, replace java8 in the following commands with the actual folder name:
sudo sh -c "echo 'export JAVA_HOME=/usr/java8' >> /etc/profile"
sudo sh -c 'echo "export PATH=\$PATH:\$JAVA_HOME/bin" >> /etc/profile'
source /etc/profile
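Optionally, confirm that the variables took effect in the current shell before you continue:
echo $JAVA_HOME           # should print /usr/java8 (or the folder name you chose)
ls $JAVA_HOME/bin/java    # the java binary should exist at this path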
Run the following command to check whether JDK is installed:
java -version
The following command output indicates that JDK is installed.
Step 2: Install Hadoop
Run the following command to download the Hadoop installation package:
wget https://mirrors.bfsu.edu.cn/apache/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz
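The BFSU mirror may not host checksum files, but the official Apache archive does. As an optional integrity check, assuming that the instance can reach archive.apache.org:
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.4/hadoop-3.2.4.tar.gz.sha512
sha512sum hadoop-3.2.4.tar.gz        # compare this digest with the one in the .sha512 file
cat hadoop-3.2.4.tar.gz.sha512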
Run the following commands to decompress the Hadoop installation package to the /opt/hadoop path:
sudo tar -zxvf hadoop-3.2.4.tar.gz -C /opt/
sudo mv /opt/hadoop-3.2.4 /opt/hadoop
Run the following commands to configure Hadoop environment variables:
sudo sh -c "echo 'export HADOOP_HOME=/opt/hadoop' >> /etc/profile" sudo sh -c "echo 'export PATH=\$PATH:/opt/hadoop/bin' >> /etc/profile" sudo sh -c "echo 'export PATH=\$PATH:/opt/hadoop/sbin' >> /etc/profile" source /etc/profile
Run the following commands to modify the yarn-env.sh and hadoop-env.sh configuration files:
sudo sh -c 'echo "export JAVA_HOME=/usr/java8" >> /opt/hadoop/etc/hadoop/yarn-env.sh'
sudo sh -c 'echo "export JAVA_HOME=/usr/java8" >> /opt/hadoop/etc/hadoop/hadoop-env.sh'
Run the following command to check whether Hadoop is installed:
hadoop version
The following command output indicates that Hadoop is installed.
Step 3: Configure Hadoop
Modify the core-site.xml configuration file of Hadoop.
Run the following command to open the core-site.xml configuration file:
sudo vim /opt/hadoop/etc/hadoop/core-site.xml
Press the I key to enter Insert mode. In the <configuration></configuration> section, add the following content:
<property>
    <name>hadoop.tmp.dir</name>
    <value>file:/opt/hadoop/tmp</value>
    <description>location to store temporary files</description>
</property>
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
</property>
Press the Esc key to exit Insert mode and enter :wq to save and close the file.
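If you prefer a non-interactive approach, the same configuration can be written in a single step with a heredoc instead of vim. This sketch assumes the same values as above and overwrites the entire file:
sudo tee /opt/hadoop/etc/hadoop/core-site.xml > /dev/null <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/opt/hadoop/tmp</value>
        <description>location to store temporary files</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
EOF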
Modify the hdfs-site.xml configuration file of Hadoop.
Run the following command to open the hdfs-site.xml configuration file:
sudo vim /opt/hadoop/etc/hadoop/hdfs-site.xml
Press the I key to enter Insert mode. In the <configuration></configuration> section, add the following content:
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/opt/hadoop/tmp/dfs/name</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/opt/hadoop/tmp/dfs/data</value>
</property>
Press the Esc key to exit Insert mode and enter :wq to save and close the file.
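Optionally, check that both edited files are still well-formed XML. This sketch assumes that xmllint is installed (package libxml2-utils on Debian and Ubuntu, libxml2 on RHEL-based systems); no output means the files parse cleanly:
xmllint --noout /opt/hadoop/etc/hadoop/core-site.xml
xmllint --noout /opt/hadoop/etc/hadoop/hdfs-site.xml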
Step 4: Configure password-free SSH logon
Run the following command to create a pair of public and private keys:
ssh-keygen -t rsa
The following command output indicates that the public and private keys are created:
Run the following commands to add the public key to the authorized_keys file:
cd ~/.ssh
cat id_rsa.pub >> authorized_keys
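Optionally, verify that password-free logon works before you start Hadoop. The chmod command is a precaution because, by default, sshd rejects an authorized_keys file that has overly permissive modes:
chmod 600 ~/.ssh/authorized_keys
ssh localhost exit    # confirm the host key if prompted; no password prompt should appear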
Step 5: Start Hadoop
Run the following command to initialize the NameNode. In Hadoop 3.x, hdfs namenode -format is the current form of the command; the older hadoop namenode -format still works but prints a deprecation warning.
hdfs namenode -format
Start Hadoop.
Important: To ensure system security and stability, Apache recommends that you do not start Hadoop as the root user. Starting Hadoop as the root user may also fail due to a permissions issue. Start Hadoop as a non-root user, such as ecs-user, instead. If you still want to start Hadoop as the root user, make sure that you are familiar with the instructions for managing Hadoop permissions and the associated risks, and declare the daemon users as shown in the sketch below.
Take note that starting Hadoop as the root user may create severe security risks, including, but not limited to, data leaks, vulnerabilities that malware can exploit to obtain the highest system permissions, and unexpected permission issues or operations. For information about Hadoop permissions, see Hadoop in Secure Mode.
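The following is a minimal sketch of the variables that the Hadoop 3 start scripts check before they allow daemons to run as root; without them, start-dfs.sh and start-yarn.sh exit with an error. You can append them to /opt/hadoop/etc/hadoop/hadoop-env.sh, for example:
# Required only when running the daemons as root (not recommended).
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root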
Run the following command to start the HDFS service.
The start-dfs.sh script starts the HDFS components: the NameNode, the SecondaryNameNode, and the DataNode.
start-dfs.sh
The following command output indicates that the HDFS service is started.
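Optionally, run a quick check against the file system. These commands assume that the HDFS daemons are up; the report should show one live DataNode:
hdfs dfsadmin -report                # summary of the cluster and its DataNodes
hdfs dfs -mkdir -p /user/$(whoami)   # create a home directory for the current user
hdfs dfs -ls /                       # list the root of the HDFS file system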
Run the following command to start the YARN service.
The start-yarn.sh script starts the YARN components, such as the ResourceManager and the NodeManager.
start-yarn.sh
The following command output indicates that the YARN service is started.
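Optionally, confirm that the NodeManager has registered with the ResourceManager:
yarn node -list    # should list one node in the RUNNING state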
Run the following command to view the processes that are started:
jps
The processes that are shown in the following figure, such as NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager, are started.
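As an optional end-to-end test, you can submit the sample pi estimation job that ships with the release; the job then appears on the YARN web UI that is described in the next step. This sketch assumes the default location of the examples JAR in the Hadoop 3.2.4 tarball:
hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.4.jar pi 2 10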
Enter http://<Public IP address of the ECS instance>:8088 in the address bar of a web browser on your computer to access the web UI of YARN.
The web UI shows information about the entire cluster, including resource usage, the status of applications (such as MapReduce jobs), and queue information.
Important: Make sure that an inbound rule is added to a security group to which the ECS instance belongs to open port 8088. Otherwise, you cannot access the web UI. For more information, see Add a security group rule.
Enter http://<Public IP address of the ECS instance>:9870 in the address bar of a web browser on your computer to access the web UI of the NameNode.
The web UI shows information about the HDFS file system, including the file system status, cluster health, active nodes, and NameNode logs.
The page shown in the following figure indicates that the distributed Hadoop environment is built.
Important: Make sure that an inbound rule is added to a security group to which the ECS instance belongs to open port 9870. Otherwise, you cannot access the web UI. For information about how to add an inbound security group rule, see Add a security group rule.
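To shut down the cluster cleanly when you are done, run the stop scripts that correspond to the start scripts used above:
stop-yarn.sh
stop-dfs.sh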