Hadoop is a free, open-source, scalable, and fault-tolerant framework written in Java that makes it efficient to run jobs across multiple nodes of a cluster. Hadoop has three main components: HDFS, MapReduce, and YARN.
Since Hadoop is written in Java, you will need to install Java on your server first. You can install it by running the following command:
apt-get install default-jdk -y
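Once the installation completes, you can verify that a JDK is available; the exact version reported will depend on the default-jdk package of your distribution:
# confirm the installed Java version
java -version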
Then you can create a new user account for Hadoop and set up the SSH key-based authentication.
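A minimal sketch of those steps is shown below; the user name hadoop matches the ownership used later in this guide, and the key settings (RSA, no passphrase) are only one reasonable choice:
# create a dedicated user for Hadoop
adduser hadoop
# switch to the new user
su - hadoop
# generate an SSH key pair and authorize it for passwordless login to localhost
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys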
Next, download the latest version of Hadoop from the official website and extract the downloaded file.
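For example, assuming the 3.1.0 release that the rest of this guide references (check the Apache download page for the latest version and mirror URL):
# download and unpack the Hadoop 3.1.0 release archive
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.1.0/hadoop-3.1.0.tar.gz
tar -xzvf hadoop-3.1.0.tar.gz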
Next, move the extracted directory to /opt with the following command:
mv hadoop-3.1.0 /opt/hadoop
Next, change the ownership of the hadoop directory using the following command:
chown -R hadoop:hadoop /opt/hadoop/
Next, you will need to set and initialize the Hadoop environment variables (see the sketch after the commands below). Then log in as the hadoop user and create directories for the Hadoop file system storage:
mkdir -p /opt/hadoop/hadoopdata/hdfs/namenode
mkdir -p /opt/hadoop/hadoopdata/hdfs/datanode
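A minimal sketch of the environment variables mentioned above, assuming the /opt/hadoop install path used in this guide and the default-jdk location on Ubuntu (adjust JAVA_HOME to match your system). Add them to the hadoop user's ~/.bashrc and reload the file:
# Java and Hadoop locations
export JAVA_HOME=/usr/lib/jvm/default-java
export HADOOP_HOME=/opt/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
# make the Hadoop scripts and binaries available on the PATH
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Apply the changes with source ~/.bashrc. You will typically also need to set JAVA_HOME in /opt/hadoop/etc/hadoop/hadoop-env.sh so that the Hadoop start scripts can find it.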
First, you will need to edit the core-site.xml file. This file contains the Hadoop port number, the memory allocated for the file system, the memory limit for storing data, and the size of the read/write buffers.
nano /opt/hadoop/etc/hadoop/core-site.xml
Make the following changes:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Save the file, then open the hdfs-site.xml file. This file contains the replication value and the namenode and datanode paths on the local file system.
nano /opt/hadoop/etc/hadoop/hdfs-site.xml
Make the following changes:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///opt/hadoop/hadoopdata/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///opt/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>
Save the file, then open the mapred-site.xml file.
nano /opt/hadoop/etc/hadoop/mapred-site.xml
Make the following changes:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Save the file, then open the yarn-site.xml file:
nano /opt/hadoop/etc/hadoop/yarn-site.xml
Make the following changes:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Save and close the file when you are finished.
Hadoop is now installed and configured. It's time to initialize the HDFS file system. You can do this by formatting the NameNode:
hdfs namenode -format
Next, change to the /opt/hadoop/sbin directory and start the Hadoop cluster using the following commands:
cd /opt/hadoop/sbin/
./start-dfs.sh
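Because YARN was configured in yarn-site.xml above, you will usually also want to start the YARN daemons from the same sbin directory:
./start-yarn.sh
This starts the ResourceManager and NodeManager, which should then also appear in the jps output.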
Next, check the status of the service using the following command:
jps
Now that Hadoop is installed, you can access its different services through a web browser. By default, the Hadoop NameNode web interface listens on port 9870. You can access it by visiting http://192.168.0.104:9870 in your web browser (replace 192.168.0.104 with your server's IP address).
To test the Hadoop file system cluster, create a directory in HDFS and copy a file from the local file system to HDFS storage, as shown below. For details, you can go to How to Setup Hadoop Cluster Ubuntu 16.04.
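A minimal smoke test might look like the following; the /test directory and the /etc/hosts sample file are only illustrative choices:
# create a directory in HDFS
hdfs dfs -mkdir /test
# copy a local file into HDFS and list it
hdfs dfs -put /etc/hosts /test
hdfs dfs -ls /test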
Docker is a very popular containerization tool that lets you create containers bundling an application together with the software and dependencies it needs to run.
Apache Hadoop is a core framework written in Java to store and process big data. The storage component of Hadoop is called the Hadoop Distributed File System (usually abbreviated HDFS), and the processing component is called MapReduce. Several daemons run inside a Hadoop cluster, including the NameNode, DataNode, Secondary NameNode, ResourceManager, and NodeManager.
This article shows you how to set up Docker and use it to launch a single-node Hadoop cluster inside a Docker container on Alibaba Cloud.
Hadoop User Experience (HUE) is an open-source web interface used for analyzing data with Hadoop ecosystem applications. Hue provides interfaces to interact with HDFS, MapReduce, Hive, and even Impala queries. In this article, we will explore how to access, browse, and interact with files in the Hadoop Distributed File System, and how Hue makes these tasks simpler.
Data Integration is an all-in-one data synchronization platform. The platform supports online real-time and offline data exchange between all data sources, networks, and locations.
Data Integration leverages the computing capability of Hadoop clusters to synchronize the HDFS data from clusters to MaxCompute. This is called Mass Cloud Upload. Data Integration can transmit up to 5TB of data per day. The maximum transmission rate is 2 GB/s.
Alibaba Cloud Object Storage Service (OSS) is an encrypted, secure, cost-effective, and easy-to-use object storage service that enables you to store, back up, and archive large amounts of data in the cloud, with a guaranteed reliability of 99.999999999%. RESTful APIs allow storage and access to OSS anywhere on the Internet. You can elastically scale the capacity and processing capability, and choose from a variety of storage types to optimize the storage cost.