Everybody is talking about Big Data these days. With the amount of unstructured data we deal with growing enormously, as software architects we need to keep tabs on what “Big Data” technologies can offer. In the Big Data landscape, Hadoop has become the de-facto technology that almost everybody needs to know. It is the core, and it is open source.

I recently did a meet-up at my workplace on what Big Data can offer, and you too can have a glimpse by going to the following link:

http://www.slideshare.net/crishantha/meetup11-big-data

I hope it will help you to kick start Big Data learning.

If you are new to Hadoop, the best resource that I have come across to get you started is the following blog article by Michael G. Noll: Running Hadoop on Ubuntu – Single Node Cluster (http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/).

The steps from the above article, along with my own experiences, are summarized below.

Prerequisites

1. Since Hadoop is a Java-based application, check the Java version on the instance where you intend to install Hadoop. It should have JDK 1.5 or above.
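For example, you can verify the installed Java version with the following command (assuming java is already on your PATH):

$ java -version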

2. Add a separate user and group dedicated to Hadoop work. Here the group is called “hadoop” and the user is called “hduser”.

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser

3. Enable SSH access to localhost for hduser.

$ su - hduser
$ ssh-keygen -t rsa -P ""
$ cat /home/hduser/.ssh/id_rsa.pub >> /home/hduser/.ssh/authorized_keys
$ ssh localhost

The above will create a public/private key pair for secure communication via SSH. Executing the above commands creates the directory ‘/home/hduser/.ssh’, with the private key in ‘/home/hduser/.ssh/id_rsa’ and the public key in ‘/home/hduser/.ssh/id_rsa.pub’. Thereafter, the public key is appended to authorized_keys so that hduser can SSH to the local machine/instance without a password.

Installation

1. Download Hadoop from the Apache Hadoop mirrors and store it in a folder of your choice (the commands below assume /usr/local). I am using the hadoop-1.2.1.tar.gz distribution here.
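If you prefer to download from the command line, something like the following should work (the mirror URL here is an assumption; any Apache archive mirror carrying the 1.2.1 release will do):

$ sudo wget -P /usr/local http://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz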

$ cd /usr/local
$ sudo tar xzf hadoop-1.2.1.tar.gz
$ sudo mv hadoop-1.2.1 hadoop
$ sudo chown -R hduser:hadoop hadoop

2. Set the environment using $HOME/.bashrc – Append the following to hduser's $HOME/.bashrc file.

# Set JAVA_HOME and HADOOP_HOME
export JAVA_HOME=/usr/local/jdk1.6.0_37
export HADOOP_HOME=/usr/local/hadoop

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

# Add Hadoop bin/ directory to PATH
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
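After editing, reload the file so that the new settings take effect in the current shell:

$ source $HOME/.bashrc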

3. Check the Hadoop installation – You can verify the installation using the following command.

$ hadoop version
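If the environment is set up correctly, the first line of the output should simply report the release you installed, e.g.:

Hadoop 1.2.1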

4. Configure HDFS – After installing Hadoop and setting up the initial environment, you are required to configure HDFS. There are several configuration files that can be edited; however, the following is the minimum configuration you are required to edit to get one Hadoop instance up and running with HDFS.

- $HADOOP_HOME/conf/hadoop-env.sh

- $HADOOP_HOME/conf/core-site.xml

- $HADOOP_HOME/conf/mapred-site.xml

- $HADOOP_HOME/conf/hdfs-site.xml

(i) hadoop-env.sh

You are required to set JAVA_HOME here:

# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-6-sun

(ii) core-site.xml

It is required to set the Hadoop temporary folder (hadoop.tmp.dir) and the default file system (fs.default.name) in this configuration. These properties should be positioned within the <configuration>..</configuration> tags.

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
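Note that the directory given as hadoop.tmp.dir should exist and be owned by hduser before HDFS is formatted. For the value used above, that can be done as follows:

$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
$ sudo chmod 750 /app/hadoop/tmp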

(iii) mapred-site.xml

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

(iv) hdfs-site.xml

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

5. Formatting HDFS – When you first set up Hadoop along with HDFS, you are required to format the HDFS file system. This is similar to formatting a normal file system on an OS. However, you should not format it once it is in use, because doing so will erase all the data on it.

$ /usr/local/hadoop/bin/hadoop namenode -format

The above process usually takes a bit of time, so give it some time to complete the formatting. (This is from my own experience.)

6. Start the Single-Node Hadoop Cluster –

$ /usr/local/hadoop/bin/start-all.sh

This will start a NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker on your machine.

7. Verify the Hadoop Cluster – You can check that the above daemons have started by using the following command.

$ jps

5418 TaskTracker
11458 Jps
5029 SecondaryNameNode
4756 DataNode
5138 JobTracker
4445 NameNode

8. Stopping the Hadoop Cluster – Execute the following command.

$ /usr/local/hadoop/bin/stop-all.sh

If you were able to complete all the above steps, you have successfully set up a single-node Hadoop cluster!

Running a MapReduce Job

Once the Hadoop installation is complete, you can have a go at running a MapReduce job. As a kick start, you can always run the “wordcount” MapReduce job provided with the Hadoop examples. This example is bundled in the hadoop-examples JAR file in the distribution. (See the steps below for more details.)

Step 1: Start the Hadoop Cluster

/usr/local/hadoop/bin/start-all.sh

Step 2: Copy the text files that you are going to use for the “wordcount” to a local folder.
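For example (the file names here are only placeholders; any reasonably large plain-text files will do, and the folder matches the path used in the next step):

$ mkdir -p /home/hadoop/textfiles
$ cp ~/book1.txt ~/book2.txt /home/hadoop/textfiles/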

Step 3: Copy the text files (in the local folder) to HDFS.

hadoop dfs -copyFromLocal /home/hadoop/textfiles /user/hadoop/dfs

Step 4: List the content of the HDFS folder.

hadoop dfs -ls /user/hadoop/dfs

Step 5: If you were able to complete step 4, you are good to go ahead with the MapReduce job.

hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar wordcount /user/hadoop/dfs /user/hadoop/dfs-output

If the job was completed successfully, Congratulations!
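To have a quick look at the word counts, you can read the job output straight from HDFS. The part file name below is an assumption (the examples wordcount in Hadoop 1.2.1 typically writes its results to part-r-00000):

$ hadoop dfs -cat /user/hadoop/dfs-output/part-r-00000 | head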

You can use either the command line or the web interface to display the contents of the HDFS directories. If you choose the command line, you can try the following command.

hadoop fs -text /user/hadoop/dfs/text01.txt

If you choose to display contents via the web UI, you can try the following URL.

http://localhost:50070/

You will see the NameNode web interface.

Then click the “Browse the File System” link to navigate through the HDFS and see the contents. If your job ran successfully, the output should be in the relevant folder.
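In Hadoop 1.x, the JobTracker also exposes a web UI (by default on port 50030), which is handy for following the progress of a running job:

http://localhost:50030/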

If you want to upgrade to Hadoop 2, you may follow these articles:

Setting Up Hadoop 2 on Ubuntu 14: http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php

Running a MapReduce Job: http://www.bogotobogo.com/Hadoop/BigData_hadoop_Running_MapReduce_Job.php
