Monday, September 30, 2013

Setting Up a Hadoop Cluster

1.       Install Java on the machine
a.       Copy the jdk1.7.0 folder to /usr/lib/jvm/   [download Java from java.sun.com and extract it to get the jdk1.7.0 folder; save it to a USB device for copying to all machines]
If the jvm folder does not exist in /usr/lib/, create it first.
b.      Link java/javac/jar into the alternatives system
$> sudo ln -s /usr/lib/jvm/jdk1.7.0/bin/java /etc/alternatives/java   [do the same for javac and jar]
$> sudo ln -s /usr/lib/jvm/jdk1.7.0/bin/javac /etc/alternatives/javac
$> sudo ln -s /usr/lib/jvm/jdk1.7.0/bin/jar /etc/alternatives/jar
If ln fails (perhaps because an older version of Java is already linked into alternatives), use update-alternatives instead:
$> sudo update-alternatives --install /usr/bin/java java /usr/lib/jvm/jdk1.7.0/bin/java 1
$> sudo update-alternatives --install /usr/bin/javac javac /usr/lib/jvm/jdk1.7.0/bin/javac 1
$> sudo update-alternatives --install /usr/bin/jar jar /usr/lib/jvm/jdk1.7.0/bin/jar 1
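If an older Java still wins after registering the new one, you can usually switch the active alternative interactively; this is standard update-alternatives usage, not specific to this setup:
$> sudo update-alternatives --config java
$> sudo update-alternatives --config javac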
c.       Check with
$> java -version   [you should see jdk1.7.0 instead of OpenJDK]
The command should output something comparable to the following on every node of your cluster:
java version "1.7.0"
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)

2.       Install the OpenSSH server and client on the machine
a.  $> sudo apt-get install openssh-server
b.  $> sudo apt-get install openssh-client
3.       Add a dedicated hadoop user on the machine
a.  $> sudo addgroup hadoop
b.  $> sudo adduser --ingroup hadoop hduser
4.       Configure ssh
a.       $> su - hduser
b.      $> ssh-keygen -t rsa -P ""   [press Enter to accept the default key file, $HOME/.ssh/id_rsa]
c.       $> cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
d.    $> ssh localhost    [to test whether the OpenSSH server has been configured correctly]
e.      $> exit
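If ssh localhost still asks for a password, the permissions on the key files are the usual culprit; tightening them is a standard OpenSSH fix (paths assume the default locations used above):
$> chmod 700 $HOME/.ssh
$> chmod 600 $HOME/.ssh/authorized_keys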
5.       Install Hadoop on the machine
a.       Download Hadoop from hadoop.apache.org and extract it to get the hadoop folder; make the changes described in step 6, then save the folder to a USB device for copying to all machines
b.      $> sudo mv hadoop-1.0.3 hadoop
c.       $> sudo cp -r hadoop /home/hduser/
d.      $> cd /home/hduser
e.      $> sudo chown -R hduser:hadoop hadoop
6.       Configure the hadoop environment   [to be done only once after copying hadoop; then copy the folder with the changes to all machines]
a.       Open the file hadoop/conf/hadoop-env.sh   [$> gedit hadoop/conf/hadoop-env.sh] and set:
     export JAVA_HOME=/usr/lib/jvm/jdk1.7.0
b.      Open the file hadoop/conf/mapred-site.xml   [$> gedit hadoop/conf/mapred-site.xml]  Replace master with the IP of the master node
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
</property>

c.       Open the file hadoop/conf/core-site.xml   [$> gedit hadoop/conf/core-site.xml]  Replace master with the IP of the master node
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310/</value>
</property>
d.  Open the file hadoop/conf/hdfs-site.xml   [$> gedit hadoop/conf/hdfs-site.xml]
<property>
  <name>dfs.data.dir</name>
  <value>/home/hduser/hadoop/dfsdata</value>
</property>
e.      Open the file hadoop/conf/masters   - type the IP of the master
f.        Open the file hadoop/conf/slaves   - type the IP of each slave, one per line; see the sample files below
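As an illustration, with a master at 192.168.1.100 and two slaves (the addresses here are made up; substitute your own), the two files would look like this:
conf/masters:
192.168.1.100
conf/slaves:
192.168.1.101
192.168.1.102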

7.       Update $HOME/.bashrc
a.       $> gedit $HOME/.bashrc   [no sudo needed; the file belongs to hduser]
b.      Add the following lines to the .bashrc file
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0
export HADOOP_HOME=/home/hduser/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin
export HADOOP_CLASSPATH=$HADOOP_HOME
c.       Reload the file so the changes take effect in the current shell
$> . ~/.bashrc
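As a quick sanity check that the new variables took effect (hadoop version is a standard Hadoop command; the exact output depends on your release):
$> echo $JAVA_HOME
$> hadoop version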
    
8.       Create the folders needed by hadoop
a.       $> mkdir -p /home/hduser/hadoop/dfsdata   [this must match the dfs.data.dir value set in step 6d]
9.       Edit /etc/hosts on all nodes so that the master and every slave can resolve one another by name
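A sketch of what the entries might look like (hostnames and addresses are illustrative; match them to your network and keep the file identical on all nodes):
192.168.1.100   master
192.168.1.101   slave01
192.168.1.102   slave02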
10.   Only on master:
a.  $> ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave01   [repeat for each slave]
b.  $> ssh slave01   [to verify that the master can talk to slave01 without a password]
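Once every node is set up, the cluster is typically brought up from the master with the standard Hadoop 1.x scripts; shown here only as a pointer to the next step:
$> hadoop namenode -format    [format HDFS - run once, before the first start; it erases dfsdata]
$> start-dfs.sh               [start the NameNode and DataNodes]
$> start-mapred.sh            [start the JobTracker and TaskTrackers]
$> jps                        [list the running Java daemons on a node]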