
Hadoop 2.6 (Part 01) – Installing on Ubuntu 16.04 (Single-Node Cluster)

Some time ago I wrote a blog post on “Setting up Hadoop 1.x on Ubuntu 12.04“. Since it covers the 1.x version, it is no longer the right post to refer to, so I thought I would update it to the 2.x version running on the latest Ubuntu release, which is 16.04 LTS.

Prerequisites

1. Make sure that you have Java installed.

(Version 2.7 and later of Apache Hadoop requires Java 7. It is built and tested on both OpenJDK and Oracle JDK/JRE. Earlier versions (2.6 and earlier) support Java 6.)

You may visit the Hadoop Wiki for more information.

2. Add a separate user and a group dedicated to Hadoop work. Here the group is called “hadoop” and the user is called “hduser”

Adding “sudo” to the command allows hduser to have super-user privileges.

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser sudo
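Depending on the adduser version, the combined form above may complain about the extra argument. A safe alternative (a sketch of the same setup) is to create the user first and then add it to the sudo group separately, and groups lets you verify the result:

$ sudo adduser --ingroup hadoop hduser
$ sudo adduser hduser sudo
$ groups hduser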

3. Enable SSH access to localhost for the hduser. (Hadoop requires SSH to manage its nodes, hence you are required to enable it. Since this is a single-node setup, you only need to enable SSH to localhost.)

// Though Ubuntu comes with the SSH client pre-installed, to enable sshd (the server daemon) you are required to install the ssh package.
$ sudo apt-get install ssh

$ su - hduser

// Create a key pair for the instance.
$ ssh-keygen -t rsa -P ""

// Move public key to authorized_keys file to negate password verification while login using SSH
$ cat /home/hduser/.ssh/id_rsa.pub >> /home/hduser/.ssh/authorized_keys

// Now you can check SSH on localhost
$ ssh localhost

The above will create a public/private key pair for secure communication via SSH. By executing the above commands you create the directory ‘/home/hduser/.ssh’; the private key is stored in ‘/home/hduser/.ssh/id_rsa’ and the public key in ‘/home/hduser/.ssh/id_rsa.pub’.
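If ssh localhost still prompts for a password, the usual culprit is file permissions, since sshd refuses keys whose files are too open. A minimal fix, assuming the default paths above:

$ chmod 700 /home/hduser/.ssh
$ chmod 600 /home/hduser/.ssh/authorized_keys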

Installing Hadoop

1. Download Hadoop from the Apache Hadoop mirrors and store in a folder of your choice. I am using hadoop-2.6.1.tar.gz distribution here.
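For example, assuming the standard Apache archive layout for the 2.6.1 release, the tarball can be fetched with wget:

$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.1/hadoop-2.6.1.tar.gz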

// Now copy the hadoop tar to /usr/local and execute the following commands
$ cd /usr/local
$ sudo tar xzf hadoop-2.6.1.tar.gz
$ sudo mv hadoop-2.6.1 hadoop
$ sudo chown -R hduser:hadoop hadoop

2. Set the environment in /home/hduser/.bashrc

# Set JAVA_HOME and HADOOP_HOME
export JAVA_HOME=/opt/jdk1.8.0_66
export HADOOP_HOME=/usr/local/hadoop
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

Once you edit .bashrc, log out and log back in, and then type

$ hadoop version
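(Alternatively, instead of logging out and back in, reloading the file in the current shell should also pick up the new variables.)

$ source /home/hduser/.bashrc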

3. Configure Hadoop -

After setting up the prerequisites, you are required to set the environment for Hadoop. There is a set of configuration files to be edited; however, the following is the minimum configuration that you need to edit to get a Hadoop instance up and running with HDFS.

- $HADOOP_HOME/etc/hadoop/hadoop-env.sh

- $HADOOP_HOME/etc/hadoop/core-site.xml

- $HADOOP_HOME/etc/hadoop/mapred-site.xml

- $HADOOP_HOME/etc/hadoop/hdfs-site.xml

(i) hadoop-env.sh

You are required to set JAVA_HOME here.

# The java implementation to use.
export JAVA_HOME=/opt/jdk1.8.0_66

(ii) core-site.xml

It is required to set the HDFS temporary folder (hadoop.tmp.dir) in this configuration file. These properties should be placed within the <configuration>.. </configuration> tags.

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

Then you are required to create the temporary directory mentioned in the parameters above.

$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp

(iii). mapred-site.xml

The following should be inserted within the <configuration>.. </configuration> tags.

The mapred-site.xml file is not in the folder originally; you have to rename or copy mapred-site.xml.template to mapred-site.xml before inserting the property.
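One way to do that, assuming the standard Hadoop 2.6 layout under $HADOOP_HOME/etc/hadoop:

$ cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml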

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

(iv). hdfs-site.xml

The following should be inserted within the <configuration>.. </configuration> tags.

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>

After editing the above, you are required to create two directories, which will be used by the NameNode and the DataNode on the host.

$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
$ sudo chown -R hduser:hadoop /usr/local/hadoop_store

4. Formatting the HDFS -

When you first set up Hadoop along with HDFS, you are required to format the HDFS file system. This is like formatting a normal file system that you get with an OS. However, you are not supposed to format it once HDFS is in use, mainly because doing so will erase all your data on it.

$ hadoop namenode -format
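(In Hadoop 2 the hadoop namenode form still works but prints a deprecation warning; the preferred hdfs command does the same thing.)

$ hdfs namenode -format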

5. Start the Single Node Hadoop Cluster

$ /usr/local/hadoop/sbin/start-all.sh

OR

$ /usr/local/hadoop/sbin/start-dfs.sh
$ /usr/local/hadoop/sbin/start-yarn.sh

This will basically start a NameNode, a DataNode, a SecondaryNameNode, a ResourceManager and a NodeManager on your machine.

Once you execute the above, if all is OK, you will see the following output on the console.

This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop/logs/hadoop-hduser-namenode-crishantha-HP-ProBook-6470b.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hduser-datanode-crishantha-HP-ProBook-6470b.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-hduser-secondarynamenode-crishantha-HP-ProBook-6470b.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-hduser-resourcemanager-crishantha-HP-ProBook-6470b.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hduser-nodemanager-crishantha-HP-ProBook-6470b.out

Execute the following to see the active ports after starting the Hadoop cluster

$ netstat -plten | grep java

tcp        0      0 127.0.0.1:54310         0.0.0.0:*               LISTEN      1002       48587       5439/java
tcp        0      0 0.0.0.0:50090           0.0.0.0:*               LISTEN      1002       49408       5756/java
tcp        0      0 0.0.0.0:50070           0.0.0.0:*               LISTEN      1002       46080       5439/java
tcp        0      0 0.0.0.0:50010           0.0.0.0:*               LISTEN      1002       51317       5579/java
tcp        0      0 0.0.0.0:50075           0.0.0.0:*               LISTEN      1002       51323       5579/java
tcp        0      0 0.0.0.0:50020           0.0.0.0:*               LISTEN      1002       51328       5579/java
tcp6       0      0 :::8040                 :::*                    LISTEN      1002       56335       6028/java
tcp6       0      0 :::8042                 :::*                    LISTEN      1002       54502       6028/java
tcp6       0      0 :::8088                 :::*                    LISTEN      1002       49681       5909/java
tcp6       0      0 :::39673                :::*                    LISTEN      1002       56327       6028/java
tcp6       0      0 :::8030                 :::*                    LISTEN      1002       49678       5909/java
tcp6       0      0 :::8031                 :::*                    LISTEN      1002       49671       5909/java
tcp6       0      0 :::8032                 :::*                    LISTEN      1002       52457       5909/java
tcp6       0      0 :::8033                 :::*                    LISTEN      1002       55528       5909/java

6. Verify the Hadoop Cluster – You can check that the above daemons have started by using the following command.

$ jps

3744 NameNode
4050 SecondaryNameNode
4310 NodeManager
3879 DataNode
4200 ResourceManager
4606 Jps

You may use the web interfaces provided to check the running nodes:

http://localhost:50070 - NameNode web UI (it also lists the live DataNodes)

http://localhost:50090 - SecondaryNameNode web UI
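The YARN side has a web UI as well; it is the process listening on port 8088 in the netstat output above.

http://localhost:8088 - ResourceManager web UI (applications and cluster status)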

7. Stopping the Hadoop Cluster

You are required to execute the following command.

$ /usr/local/hadoop/sbin/stop-all.sh
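As with starting, stop-all.sh is deprecated in Hadoop 2, so you can equally stop the daemons separately:

$ /usr/local/hadoop/sbin/stop-yarn.sh
$ /usr/local/hadoop/sbin/stop-dfs.sh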

If you were able to complete all the above steps, you have successfully set up a single-node Hadoop cluster!!


Cloud based SOA e-Government infrastructure in Sri Lanka

My talk on “Cloud based SOA e-Government Infrastructure in Sri Lanka”, which I did at WSO2Con 2014, can be found at the following URL:

http://youtu.be/f3pjtsX8mm0


Experiencing RedHat via Ubuntu

In the recent past, virtualization has opened up many avenues for most IT people with its dynamic capabilities and architecture. Cloud computing is one of the strongest concepts built on top of it.

As IT people, we too get added benefits from this dynamic feature. Almost all OSs now give you the ability to experience other OSs from your native OS (I know this is nothing new to most of you). You do need to have some virtualization software running on your machine to get this working; KVM and VirtualBox are a couple of the simple packages that can make this work for you.

Recently, I started to get the hang of RedHat (I have been using Debian/Ubuntu for some time now) after starting to follow a course at OpenEd. Since my laptop was giving trouble with having a separate partition for RHEL6 (Red Hat Enterprise Linux 6), mainly due to device driver issues, I opted to run a virtual machine manager like KVM on top of my Ubuntu and have RHEL6 running as a virtual machine. Temporarily, that basically solved my problem.

If you google the above requirement I am sure you will find a lot of resources to solve the matter. I did that too and found the following link, which I am sure can speed things up for you.

http://www.howtogeek.com/117635/how-to-install-kvm-and-create-virtual-machines-on-ubuntu/

Prerequisites:

A system with a 64-bit CPU that supports hardware virtualization (Intel VT-x or AMD-V) is required to run KVM.
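A quick way to check this on Ubuntu: a non-zero count from the first command means the CPU advertises the required extensions, and kvm-ok (from the cpu-checker package) gives a more definitive answer.

$ egrep -c '(vmx|svm)' /proc/cpuinfo
$ sudo apt-get install cpu-checker
$ kvm-ok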


Setting up Hadoop on Ubuntu 12.04

Everybody is talking about Big Data these days. With the amount of unstructured data that we deal with getting enormously high, as software architects we need to keep tabs on what “Big Data” technologies can offer. In the Big Data landscape, Hadoop has become the de-facto technology that almost everybody needs to know. It is the core, and it is “open source”.

I recently did a meetup on what Big Data can offer at my workplace, and you too can have a glimpse by going to the following link:

http://www.slideshare.net/crishantha/meetup11-big-data

I hope it will help you to kick start Big Data learning.

If you are new to Hadoop, the best possible resource that I have come across to get you started is the following blog article by Michael G. Noll. (http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ – Running Hadoop on Ubuntu – Single Node Cluster)

The steps from the above article, together with my own experiences, are summarized below.

Prerequisites

1. Since Hadoop is a Java-based application, check the Java version on the instance where you intend to install Hadoop. It should have JDK 1.5 or a later version.

2. Add a separate user and a group dedicated to Hadoop work. Here the group is called “hadoop” and the user is called “hduser”

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser

3. Enable SSH access to localhost for the hduser

$ su - hduser
$ ssh-keygen -t rsa -P ""
$ cat /home/hduser/.ssh/id_rsa.pub >> /home/hduser/.ssh/authorized_keys
$ ssh localhost

The above will create a public/private key pair for secure communication via SSH. By executing the above commands you create the directory ‘/home/hduser/.ssh’; the private key is in ‘/home/hduser/.ssh/id_rsa’ and the public key in ‘/home/hduser/.ssh/id_rsa.pub’. The public key is then appended to authorized_keys so that key-based SSH login to your local machine/instance works without a password.

Installation

1. Download Hadoop from the Apache Hadoop mirrors and store in a folder of your choice. I am using hadoop-1.2.1.tar.gz distribution here.

$ cd /usr/local
$ sudo tar xzf hadoop-1.2.1.tar.gz
$ sudo mv hadoop-1.2.1 hadoop
$ sudo chown -R hduser:hadoop hadoop

2. Set the environment using $HOME/.bashrc

# Set JAVA_HOME and HADOOP_HOME
export JAVA_HOME=/usr/local/jdk1.6.0_37
export HADOOP_HOME=/usr/local/hadoop

# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"

# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}

# Add Hadoop bin/ directory to PATH
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
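Once HDFS is formatted and running (steps 5 and 6 below), the aliases above shorten the usual commands a little, for example:

$ fs -ls /
$ hls /user/hduser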

3. Check the Hadoop installation – You can verify the installation using the following command.

$ hadoop version

4. Configure HDFS – After installing and setting up the initial Hadoop environment, you are required to set the environment for HDFS. There are some configuration files to be edited; however, the following is the minimum configuration that you need to edit to get a Hadoop instance up and running with HDFS.

- $HADOOP_HOME/conf/hadoop-env.sh

- $HADOOP_HOME/conf/core-site.xml

- $HADOOP_HOME/conf/mapred-site.xml

- $HADOOP_HOME/conf/hdfs-site.xml

(i) hadoop-env.sh

You are required to set JAVA_HOME here.

# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-6-sun

(ii) core-site.xml

It is required to set the HDFS temporary folder (hadoop.tmp.dir) in this configuration file. These properties should be placed within the <configuration>.. </configuration> tags.

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

(iii). mapred-site.xml

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

(iv). hdfs-site.xml

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>

5. Formatting the HDFS – When you first set up Hadoop along with HDFS, you are required to format the HDFS file system. This is like formatting a normal file system that you get with an OS. However, you are not supposed to format it once HDFS is in use, mainly because doing so will erase all your data on it.

$ /usr/local/hadoop/bin/hadoop namenode -format

The above process usually takes a bit of time, so give it some time to complete the formatting. (This is from my own experience.)

6. Start the Single Node Hadoop Cluster -

$ /usr/local/hadoop/bin/start-all.sh

This will basically start a Namenode, Datanode, Jobtracker and a Tasktracker on your machine.

7. Verify the Hadoop Cluster – You can check that the above daemons have started by using the following command.

$ jps

5418 TaskTracker
11458 Jps
5029 SecondaryNameNode
4756 DataNode
5138 JobTracker
4445 NameNode

8. Stopping the Hadoop Cluster – You are required to execute the following command.

$ /usr/local/hadoop/bin/stop-all.sh

If you were able to complete all the above steps, you have successfully set up a single-node Hadoop cluster!!

Running a MapReduce Job

Once the Hadoop installation is completed, you can have a go at running a MapReduce job. As a kick start you can always run the “wordcount” MapReduce job provided with the Hadoop examples. This example is bundled in the hadoop-examples.jar file in the distribution. (See the steps below for more details.)

Step 1: Start the Hadoop Cluster

/usr/local/hadoop/bin/start-all.sh

Step 2: Copy the text files that you are going to consider for a “wordcount” to a local folder.

Step 3: Copy the text files (in the local folder) to HDFS.

hadoop dfs -copyFromLocal /home/hadoop/textfiles /user/hadoop/dfs

Step 4: List the content of the HDFS folder.

hadoop dfs -ls /user/hadoop/dfs

Step 5: If you were able to complete step 4, you are good to go ahead with the MapReduce job.

hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar wordcount /user/hadoop/dfs /user/hadoop/dfs-output

If the job was completed successfully, Congratulations!

You can either choose the command line or the web interface to display the contents of the HDFS directories. If you choose the command line you can try the following command.

hadoop fs -text /user/hadoop/dfs/text01.txt
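To look at the wordcount output itself rather than the input, you can cat the reducer output from the output directory. (The part file name below is the typical one for this example, but it may differ in your setup.)

hadoop fs -cat /user/hadoop/dfs-output/part-r-00000 | head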

If you choose to display contents via the web UI, you can try the following URL.

http://localhost:50070/

You will get the NameNode web interface.

Then click the “Browse the File System” link to navigate through the HDFS and see the contents. If your job ran successfully, the output should be in the relevant folder.

If you want to upgrade to Hadoop 2, you may follow these articles:

Setting Up Hadoop 2 on Ubuntu 14: http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php

Running a Mapreduce Job: http://www.bogotobogo.com/Hadoop/BigData_hadoop_Running_MapReduce_Job.php


The “Big Data” Infodeck by Martin Fowler

Recently the well-known technical writer Martin Fowler published an infodeck about “Big Data”. Those who have not seen it yet can benefit from it as much as I did. My kudos to Mr. Martin Fowler for his simple teaching style in what is a somewhat complex area.

Here is the link.

http://martinfowler.com/articles/bigData/
