In this tutorial, we will install and configure a Hadoop cluster using Raspberry Pis. Our cluster will consist of four nodes (one master and three slaves). Since Raspberry Pis are inexpensive and Hadoop is open source, together they offer a great platform for learning Hadoop. Make sure you have at least 8 hours set aside to complete this project, or split it up into manageable sections.

Commands and configuration shown below are meant to be copied as written; hostnames and IP addresses should be changed to match your environment. Steps that apply only to the slave nodes are called out where they occur.

Materials

  • 4x Raspberry Pi 3 Model B (quad-core 1.2 GHz CPU, 1 GB RAM)
  • 4x 16 GB microSDHC cards (SanDisk UHS-I Class 10)
  • 4x micro USB to USB for power
  • 4x Cat5e network cables (3 feet if possible)
  • 1x powered USB hub
  • 1x network switch with 5 or more ports
  • 1x dogbone case for 4x Raspberry Pis (any case will do)
  • Raspbian Jessie Lite (Linux 4.1.17-v7+)
  • Oracle Java 1.8.0_65
  • Hadoop 2.7.3

For my setup, I ordered the dogbone case, USB hub, micro USB to USB cables, and Ethernet cables from Amazon. Since Micro Center is local, I drove over and purchased 4 Raspberry Pi 3s for $29 each and 4 16 GB microSDHC cards for $7.99 each. I already owned a 24-port unmanaged switch and a 48-port Cisco switch. I also had a spare monitor, keyboard, and HDMI to DVI cable. Once I set up the Raspberry Pis using the monitor, I could ssh into each one from a remote computer. The total cost of this project was under $200.


Install Raspbian

Download Raspbian Jessie Lite: https://www.raspberrypi.org/downloads/raspbian/

Write the image to the SD card using any tool of your choice; on Windows I use Win32 Disk Imager: https://sourceforge.net/projects/win32diskimager/

Plug the SD card in and fire up your Pi.

Run raspi-config for the initial configuration:

  • Expand filesystem
  • Memory Split: choose 16 MB to give as much RAM as possible to Hadoop
  • Change the hostname (rpi0)
  • Turn on SSH
  • Change the password for the pi user

Reboot
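
If you prefer to script these settings rather than step through the menus, newer releases of raspi-config expose a non-interactive mode. The sketch below assumes those nonint function names are available on your image; verify them against your raspi-config version before relying on it.

sudo raspi-config nonint do_expand_rootfs        # expand filesystem
sudo raspi-config nonint do_memory_split 16      # 16 MB for the GPU, the rest for Hadoop
sudo raspi-config nonint do_hostname rpi0        # use rpi1/rpi2/rpi3 on the slave nodes
sudo raspi-config nonint do_ssh 0                # 0 = enable SSH
sudo reboot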

Basic Pre-installation Configuration

Change Password, Update, and Install Java, rsync, and ant

passwd
sudo apt-get update && sudo apt-get install oracle-java8-jdk rsync ant



Verify that Java was installed correctly:

java -version

Run update-alternatives and ensure jdk-8-oracle-*** is selected if you also have Java 7 installed:

sudo update-alternatives --config java

Setup Connectivity

sudo nano /etc/hosts

10.0.0.100 rpi0
10.0.0.101 rpi1
10.0.0.102 rpi2
10.0.0.103 rpi3

sudo nano /etc/dhcpcd.conf

At the bottom of the file add:

interface eth0
static ip_address=10.0.0.100/24
# static ip_address=10.0.0.101/24
# static ip_address=10.0.0.102/24
# static ip_address=10.0.0.103/24
static routers=10.0.0.1
static domain_name_servers=10.0.0.1

Restart & SSH to new IP

sudo shutdown -r now

Login locally or remote

ssh pi@10.0.0.100
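
Once you are back in, a quick optional check confirms the static address took effect:

ip addr show eth0 | grep 'inet '     # should show 10.0.0.100/24 on the master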

Configure Hadoop User

sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
sudo adduser hduser sudo
logout

Log in as the newly created hduser. From here forward, everything will be executed as hduser. Create an SSH key pair with a blank passphrase. This will enable the nodes in the cluster to communicate with each other without password prompts.

mkdir ~/.ssh
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys

SSH to your node and answer yes when prompted to trust the host key; otherwise Hadoop will fail to log in later.

ssh rpi0
exit
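
Because the SD card is cloned later in this guide, this key pair ends up on every node automatically. If you ever add a node that was not cloned from this image, ssh-copy-id does the same job as the cat redirect above (rpi1 is shown only as an example target):

ssh-copy-id hduser@rpi1     # copies the public key into the target's authorized_keys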

Install Hadoop 2.7.3 for Raspberry Pi (ARM)

Ensure you have logged out of hduser and logged back in as the pi user (so the sudo commands below behave as they did above).

Install Hadoop

cd ~/
sudo mkdir /opt (directory may already exist)
cd /opt
sudo wget http://www-us.apache.org/dist/hadoop/common/stable/hadoop-2.7.3.tar.gz
sudo tar xvzf hadoop-2.7.3.tar.gz
sudo mv hadoop-2.7.3 hadoop

Give access to hduser

sudo chown -R hduser:hadoop /opt/hadoop/

Add environment variables

nano ~/.bashrc

Add the following to the bottom of the file (repeat this for hduser's ~/.bashrc as well, since the Hadoop services will run as hduser):

# — HADOOP ENVIRONMENT VARIABLES START — #
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-XX:-PrintWarnings -Djava.library.path=$HADOOP_HOME/lib"
# — HADOOP ENVIRONMENT VARIABLES END — #

Check that the Hadoop variables have been set:

source ~/.bashrc
hadoop version
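
You can also confirm the variables directly:

echo $HADOOP_HOME     # should print /opt/hadoop
which hadoop          # should print /opt/hadoop/bin/hadoop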

Setting up the Hadoop Daemon configurations

Add environment variables to hadoop-env.sh

cd /opt/hadoop/etc/hadoop
 
sudo nano hadoop-env.sh

Add this:

export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_HEAPSIZE=250

Edit the config files (change only the hostname values to match your environment).

cd /opt/hadoop/etc/hadoop
sudo nano core-site.xml

<configuration>

<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hdfs/tmp</value>
</property>

<property>
<name>fs.default.name</name>
<value>hdfs://rpi0:9000/</value>
</property>

</configuration>
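
After saving, a quick check confirms that Hadoop picks the value up (fs.defaultFS is the current name for the deprecated fs.default.name key, so either works here):

hdfs getconf -confKey fs.defaultFS     # should print hdfs://rpi0:9000/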

sudo nano hdfs-site.xml

<configuration>

<property>
<name>dfs.replication</name>
<value>3</value>
</property>

<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>

<property>
<name>dfs.blocksize</name>
<value>134217728</value>
</property>

</configuration>

sudo nano yarn-site.xml

<configuration>

<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>rpi0:8025</value>
</property>

<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>rpi0:8035</value>
</property>

<property>
<name>yarn.resourcemanager.address</name>
<value>rpi0:8050</value>
</property>

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>4</value>
</property>

<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>768</value>
</property>

<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>64</value>
</property>

<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>256</value>
</property>

<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>

<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>4</value>
</property>

<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>true</value>
</property>

<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>true</value>
</property>

<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
</property>

</configuration>
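
These limits budget the Pi's 1 GB of RAM: each NodeManager advertises 768 MB to YARN, and every container is sized between 64 MB and 256 MB, so a node can run at most three maximum-size containers (or up to twelve 64 MB ones), leaving about 256 MB for the OS and the Hadoop daemons themselves.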

Copy the template to mapred-site.xml, then edit it:

sudo cp mapred-site.xml.template mapred-site.xml
sudo nano mapred-site.xml

<configuration>

<property>
<name>mapreduce.job.tracker</name>
<value>rpi0:9001</value>
</property>

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

<property>
<name>mapreduce.map.memory.mb</name>
<value>256</value>
</property>

<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx204m</value>
</property>

<property>
<name>mapreduce.map.cpu.vcores</name>
<value>2</value>
</property>

<property>
<name>mapreduce.reduce.memory.mb</name>
<value>128</value>
</property>

<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx102m</value>
</property>

<property>
<name>mapreduce.reduce.cpu.vcores</name>
<value>2</value>
</property>

<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>128</value>
</property>

<property>
<name>yarn.app.mapreduce.am.command-opts</name>
<value>-Xmx102m</value>
</property>

<property>
<name>yarn.app.mapreduce.am.resource.cpu-vcores</name>
<value>1</value>
</property>

<property>
<name>mapreduce.job.maps</name>
<value>4</value>
</property>

<property>
<name>mapreduce.job.reduces</name>
<value>4</value>
</property>

</configuration>
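
Note that the java.opts heap sizes track the container sizes: a 204 MB heap inside the 256 MB map container and a 102 MB heap inside the 128 MB reduce container, roughly 80% of each, which leaves headroom for JVM overhead so YARN's memory checks do not kill the containers.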

Hadoop Distributed File System (HDFS)

HDFS is included as part of the Hadoop installation. A tmp directory needs to be created on the local filesystem to hold the temporary HDFS data; this is the hadoop.tmp.dir location configured in core-site.xml earlier.

sudo mkdir -p /opt/hdfs/tmp
sudo chown hduser:hadoop /opt/hdfs/tmp
sudo chmod 750 /opt/hdfs/tmp

Add the slaves file

sudo nano /opt/hadoop/etc/hadoop/slaves

add:

rpi1
rpi2
rpi3

Copy SD Card

Use Win32 Disk Imager or any other software of your choice to clone the SD card. After cloning, configure each node with its new hostname and IP address.
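
If you are cloning from a Linux or macOS machine instead, dd can do the same job. This is only a sketch: /dev/sdX is a placeholder for your card reader's device (double-check it before writing), and status=progress needs a reasonably recent GNU coreutils.

sudo dd if=/dev/sdX of=rpi-hadoop.img bs=4M status=progress     # read the configured master card into an image file
sudo dd if=rpi-hadoop.img of=/dev/sdX bs=4M status=progress     # write that image to each slave card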

Update Datanodes one at a time

Leave rpi0 off until all of the datanodes are set up.

Plug the SD cards into rpi1, then rpi2, then rpi3, and fire up each Pi.

Log in as pi

sudo raspi-config
Change the hostname (rpi1, rpi2, rpi3)

Change the IP Address

sudo nano /etc/dhcpcd.conf

Comment out the IP for rpi0 and uncomment the IP for rpi1. Repeat for rpi2 and rpi3.

interface eth0
# static ip_address=10.0.0.100/24
static ip_address=10.0.0.101/24
# static ip_address=10.0.0.102/24
# static ip_address=10.0.0.103/24
static routers=10.0.0.1
static domain_name_servers=10.0.0.1

Reboot. This concludes the setup of the slave nodes.
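
With all four Pis now online, a quick loop from the master confirms that the /etc/hosts entries and static addresses are working before continuing:

for host in rpi0 rpi1 rpi2 rpi3; do ping -c 1 $host > /dev/null && echo "$host reachable" || echo "$host NOT reachable"; done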

 

Add the masters file to the master (rpi0) only

sudo nano /opt/hadoop/etc/hadoop/masters

add:

rpi0

Log on with ssh from the namenode (rpi0) to each of the other nodes and answer yes when asked to confirm the host's authenticity. This is important so the namenode can communicate with the other nodes without a password prompt.

Format the namenode (rpi0) and start the services

hdfs namenode -format

That will format the master as a proper namenode. Finally, start the services. Previously, Hadoop used the start-all.sh script for this, but it has been deprecated; the recommended method now is to run the start-dfs.sh and start-yarn.sh scripts individually. They live in /opt/hadoop/sbin (already on the PATH) and only need to be run on the master (rpi0), since they start the daemons on the slaves over SSH using the slaves file. As hduser on rpi0, execute the following:

start-dfs.sh
start-yarn.sh

Verify the cluster

There are several ways to confirm that everything is running properly. For example, you can point your browser to http://rpi0:50070 (the NameNode web UI) or to http://rpi0:8088 (the ResourceManager web UI), where you can see status reports for the cluster.

Alternatively, you can check the Java processes on each node. On the master node you should see at least three processes: NameNode, SecondaryNameNode, and ResourceManager. On the slaves you should see only two: NodeManager and DataNode. My favorite way is to use the hdfs dfsadmin report command. This is the output report for my configuration:

hdfs dfsadmin -report

Configured Capacity: 182240477184 (169.72 GB)
Present Capacity: 153073315840 (142.56 GB)
DFS Remaining: 153073020928 (142.56 GB)
DFS Used: 294912 (288 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------

Live datanodes (4):
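
For the per-node process check mentioned above, jps (shipped with the JDK) lists the running Java daemons on whichever node you run it:

jps     # on rpi0 expect NameNode, SecondaryNameNode, and ResourceManager; on the slaves, DataNode and NodeManager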

These commands copy the basic configuration from the master to the slave nodes. You only need to do this if you change any of the xml files on the master and want to replicate those changes to the slave nodes.

sudo rsync -avxP --rsync-path='/usr/bin/sudo /usr/bin/rsync' /opt/hadoop/etc/hadoop/ pi@rpi1:/opt/hadoop/etc/hadoop/
sudo rsync -avxP --rsync-path='/usr/bin/sudo /usr/bin/rsync' /opt/hadoop/etc/hadoop/ pi@rpi2:/opt/hadoop/etc/hadoop/
sudo rsync -avxP --rsync-path='/usr/bin/sudo /usr/bin/rsync' /opt/hadoop/etc/hadoop/ pi@rpi3:/opt/hadoop/etc/hadoop/

 
