Some time ago I wrote a blog post on “Setting up Hadoop 1.x on Ubuntu 12.04”. Since it covers the 1.x version, it is no longer the right post to refer to, so I thought I would update it for the 2.x version running on the latest Ubuntu release, 16.04 LTS.

Prerequisites

1. Make sure that you have Java installed.

(Versions 2.7 and later of Apache Hadoop require Java 7. Hadoop is built and tested on both OpenJDK and Oracle JDK/JRE. Earlier versions (2.6 and earlier) support Java 6.)

You may visit the Hadoop Wiki for more information.

2. Add a separate user and a group dedicated to Hadoop work. Here the group is called “hadoop” and the user is called “hduser”.

Adding hduser to the “sudo” group gives it super user privileges.

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
$ sudo adduser hduser sudo
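
To confirm that the user and its groups were created as expected, you can check with:

$ id hduser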

3. Enable SSH access to localhost for hduser. (Hadoop uses SSH to manage its nodes, so you are required to enable it. Since this is a single-node setup, you only need to enable SSH access to localhost.)

// Though Ubuntu comes with the SSH client pre-installed, you are required to
// install the ssh package to get sshd (the server daemon) as well.
$ sudo apt-get install ssh

$ su - hduser

// Create a key pair for the instance.
$ ssh-keygen -t rsa -P ""

// Append the public key to the authorized_keys file so that SSH key logins do not prompt for a password
$ cat /home/hduser/.ssh/id_rsa.pub >> /home/hduser/.ssh/authorized_keys

// Now you can check SSH on localhost
$ ssh localhost

The above creates a public/private key pair for secure communication via SSH. Executing ssh-keygen creates the directory ‘/home/hduser/.ssh’; the private key is stored in ‘/home/hduser/.ssh/id_rsa’ and the public key in ‘/home/hduser/.ssh/id_rsa.pub’.
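
If ssh localhost still prompts for a password, the usual cause is the permissions on the .ssh directory and the authorized_keys file, since sshd rejects keys that are accessible to other users. You can tighten them with:

$ chmod 700 /home/hduser/.ssh
$ chmod 600 /home/hduser/.ssh/authorized_keys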

Installing Hadoop

1. Download Hadoop from the Apache Hadoop mirrors and store it in a folder of your choice. I am using the hadoop-2.6.1.tar.gz distribution here.
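If you prefer to download it from the command line, older releases are normally kept on the Apache archive; the exact URL below is an assumption and may need adjusting for the version you pick:

$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.1/hadoop-2.6.1.tar.gz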

// Now copy the hadoop tar to /usr/local and execute the following commands
$ cd /usr/local
$ sudo tar xzf hadoop-2.6.1.tar.gz
$ sudo mv hadoop-2.6.1 hadoop
$ sudo chown -R hduser:hadoop hadoop

2. Set the environment variables in /home/hduser/.bashrc

# Set JAVA_HOME and HADOOP_HOME
export JAVA_HOME=/opt/jdk1.8.0_66
export HADOOP_HOME=/usr/local/hadoop
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

Once you have edited .bashrc, log out, log back in, and type

$ hadoop version
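
If you would rather not log out and back in, you can reload the file in the current shell instead:

$ source /home/hduser/.bashrc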

3. Configure Hadoop -

After setting up the prerequisites, you are required to set up the environment for Hadoop. There is a set of configuration files to be edited. However, the following is the minimum configuration you need to edit to get a single Hadoop instance up and running with HDFS.

- $HADOOP_HOME/etc/hadoop/hadoop-env.sh

- $HADOOP_HOME/etc/hadoop/core-site.xml

- $HADOOP_HOME/etc/hadoop/mapred-site.xml

- $HADOOP_HOME/etc/hadoop/hdfs-site.xml

(i) hadoop-env.sh

You are required to set JAVA_HOME here.

# The java implementation to use.
export JAVA_HOME=/opt/jdk1.8.0_66
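
If you are not sure where your JDK is installed (for example, if it came through apt rather than a manual install under /opt), one way to find a suitable JAVA_HOME value is:

$ readlink -f /usr/bin/java | sed "s:/bin/java::"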

(ii) core-site.xml

It is required to set the HDFS temporary directory (hadoop.tmp.dir) and the default file system here. The following should be placed within the <configuration>..</configuration> tags. (fs.default.name is the old 1.x property name; in Hadoop 2.x it is deprecated in favour of fs.defaultFS, but the old name still works.)

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

Then you are required to create the temporary directory specified in hadoop.tmp.dir.

$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
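
Optionally, you can also tighten the permissions on this directory, since only the hduser account needs to write to it:

$ sudo chmod 750 /app/hadoop/tmp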

(iii). mapred-site.xml

The following should be inserted within the <configuration>.. </configuration> tags.

mapred-site.xml is not present in the folder by default. You have to copy (or rename) mapred-site.xml.template to mapred-site.xml before inserting the snippet.
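Assuming HADOOP_HOME is set as described earlier, copying the template can be done like this:

$ cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml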

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

(iv). hdfs-site.xml

The following should be inserted within the <configuration>.. </configuration> tags.

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>

After editing the above, you are required to create the two directories that will be used by the NameNode and the DataNode on the host.

$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
$ sudo chown -R hduser:hadoop /usr/local/hadoop_store

4. Formatting the HDFS -

When you first set up Hadoop with HDFS, you are required to format the HDFS file system. This is similar to formatting a normal file system that comes with an OS. However, you should not format it again once the HDFS is in use, because doing so erases all the data stored on it.

$ hadoop namenode -format
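
In Hadoop 2.x the hdfs command is the preferred entry point for HDFS operations (the hadoop namenode form is deprecated but still works), so you can equivalently run:

$ hdfs namenode -format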

5. Start the Single Node Hadoop Cluster

$ /usr/local/hadoop/sbin/start-all.sh

OR

$ /usr/local/hadoop/sbin/start-dfs.sh
$ /usr/local/hadoop/sbin/start-yarn.sh

This will start a NameNode, a DataNode, a SecondaryNameNode, a ResourceManager and a NodeManager on your machine. (In Hadoop 2.x, YARN's ResourceManager and NodeManager take over the roles of the 1.x JobTracker and TaskTracker.)

Once you execute the above, if all goes well, you will see output similar to the following on the console.

This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop/logs/hadoop-hduser-namenode-crishantha-HP-ProBook-6470b.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hduser-datanode-crishantha-HP-ProBook-6470b.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-hduser-secondarynamenode-crishantha-HP-ProBook-6470b.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-hduser-resourcemanager-crishantha-HP-ProBook-6470b.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hduser-nodemanager-crishantha-HP-ProBook-6470b.out

Execute the following to see the active ports after starting the Hadoop cluster

$ netstat -plten | grep java

tcp        0      0 127.0.0.1:54310         0.0.0.0:*               LISTEN      1002       48587       5439/java
tcp        0      0 0.0.0.0:50090           0.0.0.0:*               LISTEN      1002       49408       5756/java
tcp        0      0 0.0.0.0:50070           0.0.0.0:*               LISTEN      1002       46080       5439/java
tcp        0      0 0.0.0.0:50010           0.0.0.0:*               LISTEN      1002       51317       5579/java
tcp        0      0 0.0.0.0:50075           0.0.0.0:*               LISTEN      1002       51323       5579/java
tcp        0      0 0.0.0.0:50020           0.0.0.0:*               LISTEN      1002       51328       5579/java
tcp6       0      0 :::8040                 :::*                    LISTEN      1002       56335       6028/java
tcp6       0      0 :::8042                 :::*                    LISTEN      1002       54502       6028/java
tcp6       0      0 :::8088                 :::*                    LISTEN      1002       49681       5909/java
tcp6       0      0 :::39673                :::*                    LISTEN      1002       56327       6028/java
tcp6       0      0 :::8030                 :::*                    LISTEN      1002       49678       5909/java
tcp6       0      0 :::8031                 :::*                    LISTEN      1002       49671       5909/java
tcp6       0      0 :::8032                 :::*                    LISTEN      1002       52457       5909/java
tcp6       0      0 :::8033                 :::*                    LISTEN      1002       55528       5909/java

6. Verify the Hadoop Cluster – You can verify that the daemons listed above have started by using the following command.

$ jps

3744 NameNode
4050 SecondaryNameNode
4310 NodeManager
3879 DataNode
4200 ResourceManager
4606 Jps
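
If any of these daemons is missing, check its log file under /usr/local/hadoop/logs; the file names follow the daemon names shown in the startup output above (the hostname part will differ on your machine). For example:

$ tail -n 50 /usr/local/hadoop/logs/hadoop-hduser-namenode-*.log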

You may use the web interfaces provided to check the running daemons:

http://localhost:50070 - NameNode web UI (HDFS overview, including live DataNodes)

http://localhost:50090 - SecondaryNameNode web UI

http://localhost:8088 - ResourceManager (YARN) web UI

7. Stopping the Hadoop Cluster

You are required to execute the following command.

$ /usr/local/hadoop/sbin/stop-all.sh
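
As the startup script already warned, start-all.sh and stop-all.sh are deprecated, so you can also stop the two layers separately:

$ /usr/local/hadoop/sbin/stop-yarn.sh
$ /usr/local/hadoop/sbin/stop-dfs.sh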

If you were able to complete all of the above steps, you have set up a single-node Hadoop cluster successfully!
