By Avi Anish, Alibaba Cloud Community Blog author.
In this article, we will look at how you can use Docker to launch a single-node Hadoop cluster inside a Docker container on an Alibaba Cloud ECS instance.
Before we get into the main part of this tutorial, let's look at some of the major products discussed in this blog.
First, there's Docker, a very popular containerization tool that lets you package an application together with the software and dependencies it needs to run into containers. You may have heard of virtual machines; Docker containers are essentially a lightweight alternative to virtual machines. Creating a Docker container to run an application is very easy, and you can launch containers on the fly.
Next, there's Apache Hadoop, the core big data framework for storing and processing Big Data. The storage component of Hadoop is called the Hadoop Distributed File System (usually abbreviated HDFS), and the processing component is called MapReduce. Several daemons run inside a Hadoop cluster: the NameNode, DataNode, Secondary NameNode, ResourceManager, and NodeManager.
Now that you know a bit more about what Docker and Hadoop are, let's look at how you can set up a single node Hadoop cluster using Docker.
First, for this tutorial, we will be using an Alibaba Cloud ECS instance running Ubuntu 18.04. Next, as part of this tutorial, let's assume that you already have Docker installed on this Ubuntu system.
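If Docker is not installed yet, a minimal sketch of one way to install it on Ubuntu 18.04 is shown below, using the docker.io package from Ubuntu's own repositories (Docker also documents its own apt repository; either works for this tutorial). The commands assume you are logged in as root, as in the prompts later in this article.

apt-get update
apt-get install -y docker.io
# Make sure the Docker daemon is running and starts on boot
systemctl enable --now docker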
As a preliminary step, confirm that everything is up and running as it should. First, to verify that Docker is installed on the instance, run the command below to check the installed Docker version.
root@alibaba-docker:~# docker -v
Docker version 18.09.6, build 481bc77
Next, to check that docker is running correctly, launch a simple hello world container.
root@alibaba-docker:~# docker container run hello-world
Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
1b930d010525: Pull complete
Digest: sha256:41a65640635299bab090f783209c1e3a3f11934cf7756b09cb2f1e02147c6ed8
Status: Downloaded newer image for hello-world:latest
Hello from Docker!
This message shows that your installation appears to be working correctly.
To generate this message, Docker completed the following operations in the background: the Docker client contacted the Docker daemon, the daemon pulled the hello-world image from Docker Hub, created a new container from that image to run the executable that produces this output, and streamed that output back to the client, which sent it to your terminal.
To try something more ambitious, you can also run an Ubuntu container with the command below:
$ docker run -it ubuntu bash
Next, you can share images, automate workflows, and do more with a free Docker ID, which you can create on Docker's official website. And, for more examples and ideas, check out Docker's Get Started guide.
If you see the output above, Docker is running properly on your instance, and we can now set up Hadoop inside a Docker container. To do so, run a pull command to fetch a Docker image for Hadoop. A Docker image is a file made up of multiple layers, which is used to deploy containers.
root@alibaba-docker:~# docker pull sequenceiq/hadoop-docker:2.7.0
2.7.0: Pulling from sequenceiq/hadoop-docker
b253335dcf03: Pulling fs layer
a3ed95caeb02: Pulling fs layer
69623ef05416: Pulling fs layer
63aebddf4bce: Pulling fs layer
46305a4cda1d: Pulling fs layer
70ff65ec2366: Pulling fs layer
72accdc282f3: Pulling fs layer
5298ddb3b339: Pulling fs layer
ec461d25c2ea: Pulling fs layer
315b476b23a4: Pulling fs layer
6e6acc31f8b1: Pulling fs layer
38a227158d97: Pulling fs layer
319a3b8afa25: Pulling fs layer
11e1e16af8f3: Pulling fs layer
834533551a37: Pulling fs layer
c24255b6d9f4: Pulling fs layer
8b4ea3c67dc2: Pulling fs layer
40ba2c2cdf73: Pulling fs layer
5424a04bc240: Pulling fs layer
7df43f09096d: Pulling fs layer
b34787ee2fde: Pulling fs layer
4eaa47927d15: Pulling fs layer
cb95b9da9646: Pulling fs layer
e495e287a108: Pulling fs layer
3158ca49a54c: Pulling fs layer
33b5a5de9544: Pulling fs layer
d6f46cf55f0f: Pulling fs layer
40c19fb76cfd: Pull complete
018a1f3d7249: Pull complete
40f52c973507: Pull complete
49dca4de47eb: Pull complete
d26082bd2aa9: Pull complete
c4f97d87af86: Pull complete
fb839f93fc0f: Pull complete
43661864505e: Pull complete
d8908a83648e: Pull complete
af8b686deb23: Pull complete
c1214abd7b96: Pull complete
9d00f27ba8d2: Pull complete
09f787a7573b: Pull complete
4e86267d5247: Pull complete
3876cba35aed: Pull complete
23df48ffdb39: Pull complete
646aedbc2bb6: Pull complete
60a65f8179cf: Pull complete
046b321f8081: Pull complete
Digest: sha256:a40761746eca036fee6aafdf9fdbd6878ac3dd9a7cd83c0f3f5d8a0e6350c76a
Status: Downloaded newer image for sequenceiq/hadoop-docker:2.7.0
Next, run the command below to check the list of Docker images present on your system.
root@alibaba-docker:~# docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
hello-world latest fce289e99eb9 5 months ago 1.84kB
sequenceiq/hadoop-docker 2.7.0 789fa0a3b911 4 years ago 1.76GB
After that's out of the way, start a Docker container from the Hadoop image.
root@alibaba-docker:~# docker run -it sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash
/
Starting sshd: [ OK ]
Starting namenodes on [9f397feb3a46]
9f397feb3a46: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-9f397feb3a46.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-9f397feb3a46.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-9f397feb3a46.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-9f397feb3a46.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-9f397feb3a46.out
bash-4.1#
In the above output, you can see the container starting all the Hadoop daemons one by one. To make sure all the daemons are up and running, run the jps command.
bash-4.1# jps
942 Jps
546 ResourceManager
216 DataNode
371 SecondaryNameNode
126 NameNode
639 NodeManager
bash-4.1#
If you get the above output after running the jps command, you can be sure that all the Hadoop daemons are running correctly. After that, run the Docker command shown below on the host to get the details of the Docker container.
root@alibaba-docker:~# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
9f397feb3a46 sequenceiq/hadoop-docker:2.7.0 "/etc/bootstrap.sh -…" 5 minutes ago Up 5 minutes 2122/tcp, 8030-8033/tcp, 8040/tcp, 8042/tcp, 8088/tcp, 19888/tcp, 49707/tcp, 50010/tcp, 50020/tcp, 50070/tcp, 50075/tcp, 50090/tcp determined_ritchie
Back inside the container, run the below command to get the IP address on which the container is running.
bash-4.1# ifconfig
eth0 Link encap:Ethernet HWaddr 02:42:AC:11:00:02
inet addr:172.17.0.2 Bcast:172.17.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:93 errors:0 dropped:0 overruns:0 frame:0
TX packets:21 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:9760 (9.5 KiB) TX bytes:1528 (1.4 KiB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:3160 errors:0 dropped:0 overruns:0 frame:0
TX packets:3160 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:455659 (444.9 KiB) TX bytes:455659 (444.9 KiB)
From the above output, we know that the Docker container is running at 172.17.0.2, and the Hadoop cluster's web interface can be accessed on port 50070. So, open a browser (for example, Mozilla Firefox) on your Ubuntu machine and go to 172.17.0.2:50070. The Hadoop Overview page will open, where you can see that Hadoop is running on port 9000, which is the default port.
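If you prefer not to run ifconfig inside the container, the same address can also be read from the host with docker inspect; a small sketch is shown below (the container ID is the one from the docker ps output above, so yours will differ). Keep in mind that 172.17.0.2 is only reachable from the ECS instance itself; to reach the web UI from your own machine you would need to publish the ports when starting the container (for example with -p 50070:50070) and browse to the instance's public IP instead.

docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' 9f397feb3a46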
Now scroll down to "Summary", where you will find the details of the Hadoop cluster running inside the Docker container. "Live Nodes: 1" means one DataNode is up and running (a single node).
To access the Hadoop Distributed File System (HDFS), you can go to Utilities -> Browse the file system. You will find the user directory, which is present in HDFS by default.
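The same file system can also be browsed from the shell inside the container with the hdfs dfs commands; a short sketch is below (run from the Hadoop home directory, which you change to in the next step; the exact directories you see may differ).

bin/hdfs dfs -ls /
bin/hdfs dfs -ls /user/root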
Note that you are now inside the Docker container, so you get the container's bash shell rather than the default Ubuntu terminal shell. Let's run an example MapReduce job on this Hadoop cluster running inside the Docker container. The job we run below is the grep example: it takes text files as input and outputs key-value pairs, where the key is each string matching a given regular expression and the value is the number of times that string occurs. First, go to the Hadoop home directory.
bash-4.1# cd $HADOOP_PREFIX
Next, run the hadoop-mapreduce-examples-2.7.0.jar file, which ships with several ready-made example programs, including grep and word count. The command and its output are as follows:
bash-4.1# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+'
19/06/25 11:28:55 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/06/25 11:28:58 INFO input.FileInputFormat: Total input paths to process : 31
19/06/25 11:28:59 INFO mapreduce.JobSubmitter: number of splits:31
19/06/25 11:28:59 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1561475564487_0001
19/06/25 11:29:00 INFO impl.YarnClientImpl: Submitted application application_1561475564487_0001
19/06/25 11:29:01 INFO mapreduce.Job: The url to track the job: http://9f397feb3a46:8088/proxy/application_1561475564487_0001/
19/06/25 11:29:01 INFO mapreduce.Job: Running job: job_1561475564487_0001
19/06/25 11:29:22 INFO mapreduce.Job: Job job_1561475564487_0001 running in uber mode : false
19/06/25 11:29:22 INFO mapreduce.Job: map 0% reduce 0%
19/06/25 11:30:22 INFO mapreduce.Job: map 13% reduce 0%
19/06/25 11:30:23 INFO mapreduce.Job: map 19% reduce 0%
19/06/25 11:31:19 INFO mapreduce.Job: map 23% reduce 0%
19/06/25 11:31:20 INFO mapreduce.Job: map 26% reduce 0%
19/06/25 11:31:21 INFO mapreduce.Job: map 39% reduce 0%
19/06/25 11:32:11 INFO mapreduce.Job: map 39% reduce 13%
19/06/25 11:32:13 INFO mapreduce.Job: map 42% reduce 13%
19/06/25 11:32:14 INFO mapreduce.Job: map 55% reduce 15%
19/06/25 11:32:18 INFO mapreduce.Job: map 55% reduce 18%
19/06/25 11:32:59 INFO mapreduce.Job: map 58% reduce 18%
19/06/25 11:33:00 INFO mapreduce.Job: map 61% reduce 18%
19/06/25 11:33:02 INFO mapreduce.Job: map 71% reduce 19%
19/06/25 11:33:05 INFO mapreduce.Job: map 71% reduce 24%
19/06/25 11:33:45 INFO mapreduce.Job: map 74% reduce 24%
19/06/25 11:33:46 INFO mapreduce.Job: map 81% reduce 24%
19/06/25 11:33:47 INFO mapreduce.Job: map 84% reduce 26%
19/06/25 11:33:48 INFO mapreduce.Job: map 87% reduce 26%
19/06/25 11:33:50 INFO mapreduce.Job: map 87% reduce 29%
19/06/25 11:34:28 INFO mapreduce.Job: map 90% reduce 29%
19/06/25 11:34:29 INFO mapreduce.Job: map 97% reduce 29%
19/06/25 11:34:30 INFO mapreduce.Job: map 100% reduce 32%
19/06/25 11:34:32 INFO mapreduce.Job: map 100% reduce 100%
19/06/25 11:34:32 INFO mapreduce.Job: Job job_1561475564487_0001 completed successfully
19/06/25 11:34:32 INFO mapreduce.Job: Counters: 50
File System Counters
FILE: Number of bytes read=345
FILE: Number of bytes written=3697508
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=80529
HDFS: Number of bytes written=437
HDFS: Number of read operations=96
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Killed map tasks=1
Launched map tasks=32
Launched reduce tasks=1
Data-local map tasks=32
Total time spent by all maps in occupied slots (ms)=1580786
Total time spent by all reduces in occupied slots (ms)=191081
Total time spent by all map tasks (ms)=1580786
Total time spent by all reduce tasks (ms)=191081
Total vcore-seconds taken by all map tasks=1580786
Total vcore-seconds taken by all reduce tasks=191081
Total megabyte-seconds taken by all map tasks=1618724864
Total megabyte-seconds taken by all reduce tasks=195666944
Map-Reduce Framework
Map input records=2060
Map output records=24
Map output bytes=590
Map output materialized bytes=525
Input split bytes=3812
Combine input records=24
Combine output records=13
Reduce input groups=11
Reduce shuffle bytes=525
Reduce input records=13
Reduce output records=11
Spilled Records=26
Shuffled Maps =31
Failed Shuffles=0
Merged Map outputs=31
GC time elapsed (ms)=32401
CPU time spent (ms)=19550
Physical memory (bytes) snapshot=7076614144
Virtual memory (bytes) snapshot=22172876800
Total committed heap usage (bytes)=5196480512
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=76717
File Output Format Counters
Bytes Written=437
19/06/25 11:34:32 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/06/25 11:34:33 INFO input.FileInputFormat: Total input paths to process : 1
19/06/25 11:34:33 INFO mapreduce.JobSubmitter: number of splits:1
19/06/25 11:34:33 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1561475564487_0002
19/06/25 11:34:33 INFO impl.YarnClientImpl: Submitted application application_1561475564487_0002
19/06/25 11:34:33 INFO mapreduce.Job: The url to track the job: http://9f397feb3a46:8088/proxy/application_1561475564487_0002/
19/06/25 11:34:33 INFO mapreduce.Job: Running job: job_1561475564487_0002
19/06/25 11:34:50 INFO mapreduce.Job: Job job_1561475564487_0002 running in uber mode : false
19/06/25 11:34:50 INFO mapreduce.Job: map 0% reduce 0%
19/06/25 11:35:04 INFO mapreduce.Job: map 100% reduce 0%
19/06/25 11:35:18 INFO mapreduce.Job: map 100% reduce 100%
19/06/25 11:35:19 INFO mapreduce.Job: Job job_1561475564487_0002 completed successfully
19/06/25 11:35:19 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=291
FILE: Number of bytes written=230543
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=570
HDFS: Number of bytes written=197
HDFS: Number of read operations=7
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=10489
Total time spent by all reduces in occupied slots (ms)=12436
Total time spent by all map tasks (ms)=10489
Total time spent by all reduce tasks (ms)=12436
Total vcore-seconds taken by all map tasks=10489
Total vcore-seconds taken by all reduce tasks=12436
Total megabyte-seconds taken by all map tasks=10740736
Total megabyte-seconds taken by all reduce tasks=12734464
Map-Reduce Framework
Map input records=11
Map output records=11
Map output bytes=263
Map output materialized bytes=291
Input split bytes=133
Combine input records=0
Combine output records=0
Reduce input groups=5
Reduce shuffle bytes=291
Reduce input records=11
Reduce output records=11
Spilled Records=22
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=297
CPU time spent (ms)=1610
Physical memory (bytes) snapshot=346603520
Virtual memory (bytes) snapshot=1391702016
Total committed heap usage (bytes)=245891072
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=437
File Output Format Counters
Bytes Written=197
After the MapReduce job has finished, run the below command to check the output.
bash-4.1# bin/hdfs dfs -cat output/*
6 dfs.audit.logger
4 dfs.class
3 dfs.server.namenode.
2 dfs.period
2 dfs.audit.log.maxfilesize
2 dfs.audit.log.maxbackupindex
1 dfsmetrics.log
1 dfsadmin
1 dfs.servers
1 dfs.replication
1 dfs.file
bash-4.1#
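The command above ran the grep example; since the same examples jar also ships the classic word count program mentioned earlier, you could run it against the same input directory as a rough sketch (the wc-output directory name here is just an illustration, and it must not already exist in HDFS):

bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar wordcount input wc-output
bin/hdfs dfs -cat wc-output/* | head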
Now you've successfully set up a single-node Hadoop cluster using Docker. You can check out other articles on Alibaba Cloud to learn more about Docker and Hadoop.