Hadoop 2.6 (Part 2) – Running the MapReduce Job

This is the continuation of my previous article, "Installing Hadoop 2.6 on Ubuntu 16.04". This article explains how to run one of the examples shipped with the Hadoop binary.

Once the Hadoop installation is complete, you can run the "wordcount" example provided with the Hadoop examples to test a MapReduce job. The example is bundled with the hadoop-mapreduce-examples JAR in the distribution. (See the steps below for more details.)

Step 1: Start the Hadoop Cluster, if not already started.

$ /usr/local/hadoop/sbin/start-dfs.sh
$ /usr/local/hadoop/sbin/start-yarn.sh

Step 2: Place the text files that you want to run "wordcount" on in a local folder (e.g. /home/hadoop/textfiles).

Step 3: Copy the text files (in the local folder) to HDFS.

$ echo "Word Count Text File" > textFile.txt
$ hdfs dfs -mkdir -p /user/hduser/dfs
$ hdfs dfs -copyFromLocal textFile.txt /user/hduser/dfs

Step 4: List the content of the HDFS folder.

$ hdfs dfs -ls /user/hduser/dfs

Step 5: If you were able to complete step 4, you are good to go ahead with the MapReduce job.

$ cd /usr/local/hadoop
$ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.1.jar wordcount /user/hduser/dfs /user/hduser/dfs-output
If the job completed successfully, congratulations!

You can use either the command line or the web interface to display the contents of the HDFS directories. If you choose the command line, try the following command.

$ hdfs dfs -ls /user/hduser/dfs-output

OR

http://localhost:50070/
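
To see the actual word counts from the command line, you can print the reducer output; a minimal sketch, assuming the default part file name written by the job:

$ hdfs dfs -cat /user/hduser/dfs-output/part-r-00000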

Hadoop 2.6 (Part 01) – Installing on Ubuntu 16.04 (Single-Node Cluster)

Some time ago I wrote a blog post on "Setting up Hadoop 1.x on Ubuntu 12.04". Since it covers the 1.x version, it is no longer the right post to refer to, so I decided to update it for the 2.x version running on the latest Ubuntu release, 16.04 LTS.

Prerequisites

1. Make sure that you have Java installed.

(Version 2.7 and later of Apache Hadoop requires Java 7. It is built and tested on both OpenJDK and Oracle JDK/JRE. Earlier versions (2.6 and earlier) support Java 6.)

You may visit Hadoop Wiki for more information.
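
Before moving on, you may want to confirm the Java version and the actual JDK path on your machine (the JAVA_HOME value used later in this post is just the path on my machine):

$ java -version
$ readlink -f $(which java)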

2. Add a separate user and a group dedicated to Hadoop work. Here the group is called "hadoop" and the user is called "hduser".

Appending "sudo" to the adduser command below also adds hduser to the sudo group, giving it super-user privileges.

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser sudo

3. Enable SSH access to localhost for hduser. (Hadoop requires SSH to manage its nodes, hence you are required to enable it. Since this is a single-node setup, you only need to enable SSH to localhost.)

// Though Ubuntu ships with the SSH client by default, you need to install the ssh package to get sshd (the server daemon) running.
$ sudo apt-get install ssh

$ su - hduser

// Create a key pair for the instance.
$ ssh-keygen -t rsa -P ""

// Append the public key to the authorized_keys file so that SSH login does not prompt for a password
$ cat /home/hduser/.ssh/id_rsa.pub >> /home/hduser/.ssh/authorized_keys

// Now you can check SSH on localhost
$ ssh localhost

The above creates a public/private key pair for secure communication via SSH. Executing ssh-keygen creates the directory '/home/hduser/.ssh' and stores the private key in '/home/hduser/.ssh/id_rsa' and the public key in '/home/hduser/.ssh/id_rsa.pub'.

Installing Hadoop

1. Download Hadoop from the Apache Hadoop mirrors and store it in a folder of your choice. I am using the hadoop-2.6.1.tar.gz distribution here.

// Now copy the hadoop tar to /usr/local and execute the following commands
$ cd /usr/local
$ sudo tar xzf hadoop-2.6.1.tar.gz
$ sudo mv hadoop-2.6.1 hadoop
$ sudo chown -R hduser:hadoop hadoop

2. Set the environment in /home/hduser/.bashrc

# Set JAVA_HOME and HADOOP_HOME
export JAVA_HOME=/opt/jdk1.8.0_66
export HADOOP_HOME=/usr/local/hadoop
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

Once you have edited .bashrc, log out and log back in (or run "source ~/.bashrc") and type

$ hadoop version

3. Configure Hadoop -

After setting up the prerequisites, you are required to set the environment for Hadoop. There is a set of configuration files to be edited; however, the following is the minimum configuration you need to edit to get a Hadoop instance up and running with HDFS.

- $HADOOP_HOME/etc/hadoop/hadoop-env.sh

- $HADOOP_HOME/etc/hadoop/core-site.xml

- $HADOOP_HOME/etc/hadoop/mapred-site.xml

- $HADOOP_HOME/etc/hadoop/hdfs-site.xml

(i) hadoop-env.sh

You are required to set JAVA_HOME here.

# The java implementation to use.
export JAVA_HOME=/opt/jdk1.8.0_66

(ii) core-site.xml

Set the Hadoop temporary directory (hadoop.tmp.dir) and the default file system (fs.default.name) in this configuration. These properties should be placed within the <configuration>..</configuration> tags.

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

Then you are required to create the temporary directory referred to in the configuration above.

$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp

(iii). mapred-site.xml

The mapred-site.xml file is not present in the folder by default; you have to copy (or rename) mapred-site.xml.template to mapred-site.xml before editing, as sketched below.
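
A minimal sketch, assuming $HADOOP_HOME is set as in the .bashrc above:

$ cd $HADOOP_HOME/etc/hadoop
$ cp mapred-site.xml.template mapred-site.xml

Then insert the following within the <configuration>..</configuration> tags.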

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

(iv). hdfs-site.xml

The following should be inserted within the <configuration>.. </configuration> tags.

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>

After editing the above, you are required to create two directories, which will be used by the NameNode and the DataNode on the host.

$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
$ sudo chown -R hduser:hadoop /usr/local/hadoop_store

4. Formatting the HDFS -

When you first set up Hadoop along with HDFS, you are required to format the HDFS file system. This is like formatting a normal file system that comes with an OS. However, you are not supposed to format it again once HDFS is in use, because formatting erases all the data on it.

$ hdfs namenode -format

5. Start the Single Node Hadoop Cluster

$ /usr/local/hadoop/sbin/start-all.sh

OR

$ /usr/local/hadoop/sbin/start-dfs.sh
$ /usr/local/hadoop/sbin/start-yarn.sh

This will start a NameNode, a DataNode, a SecondaryNameNode, a ResourceManager and a NodeManager on your machine (in Hadoop 2.x, YARN's ResourceManager and NodeManager replace the old JobTracker and TaskTracker).

Once you execute the above, if all is OK, you will see the following output on the console.

This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop/logs/hadoop-hduser-namenode-crishantha-HP-ProBook-6470b.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hduser-datanode-crishantha-HP-ProBook-6470b.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-hduser-secondarynamenode-crishantha-HP-ProBook-6470b.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-hduser-resourcemanager-crishantha-HP-ProBook-6470b.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hduser-nodemanager-crishantha-HP-ProBook-6470b.out

Execute the following to see the active ports after starting the Hadoop cluster

$ netstat -plten | grep java

tcp        0      0 127.0.0.1:54310         0.0.0.0:*               LISTEN      1002       48587       5439/java
tcp        0      0 0.0.0.0:50090           0.0.0.0:*               LISTEN      1002       49408       5756/java
tcp        0      0 0.0.0.0:50070           0.0.0.0:*               LISTEN      1002       46080       5439/java
tcp        0      0 0.0.0.0:50010           0.0.0.0:*               LISTEN      1002       51317       5579/java
tcp        0      0 0.0.0.0:50075           0.0.0.0:*               LISTEN      1002       51323       5579/java
tcp        0      0 0.0.0.0:50020           0.0.0.0:*               LISTEN      1002       51328       5579/java
tcp6       0      0 :::8040                 :::*                    LISTEN      1002       56335       6028/java
tcp6       0      0 :::8042                 :::*                    LISTEN      1002       54502       6028/java
tcp6       0      0 :::8088                 :::*                    LISTEN      1002       49681       5909/java
tcp6       0      0 :::39673                :::*                    LISTEN      1002       56327       6028/java
tcp6       0      0 :::8030                 :::*                    LISTEN      1002       49678       5909/java
tcp6       0      0 :::8031                 :::*                    LISTEN      1002       49671       5909/java
tcp6       0      0 :::8032                 :::*                    LISTEN      1002       52457       5909/java
tcp6       0      0 :::8033                 :::*                    LISTEN      1002       55528       5909/java

6. Verify the Hadoop Cluster – You can check whether the above daemons have started by using the following command.

$ jps

3744 NameNode
4050 SecondaryNameNode
4310 NodeManager
3879 DataNode
4200 ResourceManager
4606 Jps

You may use the web interfaces provided to check the running daemons:

http://localhost:50070 - NameNode web UI (HDFS overview and DataNodes)

http://localhost:50090 - SecondaryNameNode web UI
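
Alternatively, from the command line, the dfsadmin report gives a quick health check of HDFS, including the live DataNodes and their capacity:

$ hdfs dfsadmin -report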

7. Stopping the Hadoop Cluster

You are required to execute the following command.

$ /usr/local/hadoop/sbin/stop-all.sh

If you were able to complete all the above steps, you have successfully set up a single-node Hadoop cluster!


A Collection of IS Theories

Recently I got to know about this wiki on IS (Information Systems) theories. As a researcher, I have experienced the difficulty of finding the right mix of theories in our research areas. If you are an IS researcher, I am sure this wiki will give you a concise view of most of the IS theories around. It may not give you everything, but I am sure it will give you a kick start to move forward. Kudos to those who initiated this collaborative effort.

Here is the link: http://is.theorizeit.org


Building SaaS based Cloud Applications

Introduction

According to NIST, "cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction" (NIST, 2011).

The traditional approach incurs a huge upfront capital expenditure and a lot of excess capacity, without allowing capacity to be matched to market demand. The cloud computing approach has the ability to provision computing capabilities without requiring human interaction with the service provider. In addition, broad network access, resource pooling, elasticity and measured service are a few of the characteristics that overpower the traditional hardware approach. As benefits, it can drastically cut procurement lead time and deliver better scalability and substantial cost savings (no capital cost, pay for what you use), with fewer management headaches in terms of operational costs.

Cloud Service Models

There are three (03) basic cloud service models: Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS).

SaaS Service Model

In the SaaS cloud service model, the consumer does not manage or control the underlying cloud infrastructure (network, servers, operating systems, storage or even individual application capabilities), with the possible exception of limited user-specific application configuration settings. A typical SaaS-based application primarily provides multi-tenancy, scalability, customization and resource pooling features to the client.

Multi-tenancy

Multi-tenancy is the ability for multiple customers (tenants) to share the same applications and/or compute resources. A single application can be used and customized by different organizations as if they each had a separate instance, while a single shared stack of software and hardware is actually used. Further, it ensures that each tenant's data and customizations remain secure and insulated from the activity of all other tenants.

Multi-tenancy Models

There are three basic multi-tenancy models.

1. Separate applications and separate databases

2. One shared application and separate databases

3. One shared application and one shared database

Figure 1 – Multi-tenancy Models

Multi-tenancy Data Architectures

According to Figure 1, there are three types of multi-tenancy data architectures based on the way data is being stored.

1. Separate Databases

In this approach, each tenant's data is stored in a separate database, ensuring that a tenant can access only its own database. A separate database connection pool is set up per tenant, and the application selects the pool based on the tenant ID associated with the logged-in user.

2. Shared Database – Separate Schemas

In this approach, the data of each tenant is stored in a separate schema of a single database. Similar to the first approach, a separate connection pool can be created for each schema. Alternatively, a single connection pool can be used, and the relevant schema is selected based on the tenant ID of the connection (e.g. using the SET SCHEMA SQL command).

3. Shared Database – Shared Schema (Horizontally Partitioned)

In this approach, the data of all tenants is stored in a single database schema. Tenants are separated by a tenant ID, which is represented by a separate column in each table of the schema. Only one connection pool is configured at the application level. Based on the tenant ID, the tables should be partitioned (horizontally) or indexed to speed up queries.

Figure 2 – Multi-tenancy Data Architecture

References

1. NIST (2011), NIST Definition, https://www.nist.gov/news-events/news/2011/10/final-version-nist-cloud-computing-definition-published


Robotic Process Automation (RPA)

Introduction

Robotic Process Automation (RPA) refers to the use of software “robots” that are trained to mimic human actions on the user interface of applications, and then reenact these actions automatically.

RPA technology can be applied to a wide range of areas:

  • Process Automation
    • Mimics the steps of a rules-based process without compromising the existing IT architecture; robots can carry out "prescribed" functions and easily scale up or down based on demand
    • Back-office processes (Finance, Procurement, SCM, Accounting, Customer Service, HR) can be expedited: data entry, purchase order issuing, creation of online access credentials, and other complex and connected business processes
  • IT Support and Management
    • Processes in IT infrastructure management such as service desk management and network monitoring
  • Automated Assistant
    • Call center software

Benefits

There are several benefits that RPA can bring in.

1. Cost and Speed – On average, an RPA robot costs about a third of an FTE (Full Time Equivalent)

2. Scalability – Additional robots can be deployed quickly with minimal expenditure. Tens, hundreds or thousands of robots can be trained at exactly the same time through workflow creation.

3. Accuracy – Eliminates human error

4. Analytics – RPA tools can provide analytics with ease

Steps Involved

1. Identify the business processes involved in your organization

2. Map how employees interact with the current business processes (there is no process re-engineering involved)

3. Start with the business processes that require simple data transfers, and then move to complex ones (data manipulations and data transfers)

RPA Solutions

1. No proper “Open Source” solutions so far.

2. UI Path (http://www.uipath.com/community) – Community Edition (Free for non commercial use)

3. Python based Automation – https://automatetheboringstuff.com/

4. Robot Framework – http://robotframework.org/ – Open source, but still immature for RPA use

5. Pega Open Span – Commercial Version - https://www.pega.com/products/pega-7-platform/openspan


Citrix ICAClient (Client Receiver) 13 on Ubuntu 14.04

If you are planning to use the Citrix ICAClient (Client Receiver) on Ubuntu 14.04 LTS, you are sure to encounter many issues, at the very least an SSL error.

I too encountered many issues and finally ended up resolving them, thanks to the Ubuntu Community Help page. Generally, the Citrix ICAClient works pretty well on Windows, but not so smoothly on Linux and Macs.

If you are an Ubuntu 14.04 LTS user, I am sure the following link will help you resolve most of those issues. Just execute the commands and you should be able to resolve them.

Link: https://help.ubuntu.com/community/CitrixICAClientHowTo

VN:F [1.9.22_1171]
Rating: 0.0/10 (0 votes cast)
VN:F [1.9.22_1171]
Rating: 0 (from 0 votes)

Securing Apache with SSL on Ubuntu 14

Prerequisites

$ sudo apt-get update
$ sudo apt-get install apache2

Activate the SSL Module

$ sudo a2enmod ssl
$ sudo service apache2 restart

Create a Self Signed SSL Certificate

You are required to create a self-signed certificate and attach it to the Apache SSL configuration. You may create it at any preferred location; here it is stored in a new directory, /etc/apache2/ssl.

$ sudo mkdir -p /etc/apache2/ssl
$ sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout /etc/apache2/ssl/apache.key -out /etc/apache2/ssl/apache.crt
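
Optionally, you can inspect the generated certificate before wiring it into Apache; for example:

$ sudo openssl x509 -in /etc/apache2/ssl/apache.crt -noout -subject -dates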
Configure Apache to use SSL

Edit default-ssl.conf (in /etc/apache2/sites-available), the file that contains the default SSL configuration.
<IfModule mod_ssl.c>
    <VirtualHost _default_:443>
        ServerAdmin admin@example.com
        ServerName example.com
        ServerAlias www.example.com
        DocumentRoot /var/www/html
        ErrorLog ${APACHE_LOG_DIR}/error.log
        CustomLog ${APACHE_LOG_DIR}/access.log combined
        SSLEngine on
        SSLCertificateFile /etc/apache2/ssl/apache.crt
        SSLCertificateKeyFile /etc/apache2/ssl/apache.key
        <FilesMatch "\.(cgi|shtml|phtml|php)$">
                        SSLOptions +StdEnvVars
        </FilesMatch>
        <Directory /usr/lib/cgi-bin>
                        SSLOptions +StdEnvVars
        </Directory>
        BrowserMatch "MSIE [2-6]" \
                        nokeepalive ssl-unclean-shutdown \
                        downgrade-1.0 force-response-1.0
        BrowserMatch "MSIE [17-9]" ssl-unclean-shutdown
    </VirtualHost>
</IfModule>
Activate the SSL Virtual Host

$ sudo a2ensite default-ssl.conf
$ sudo service apache2 restart

Test the Virtual Host with SSL

Now you can test the application at https://<your-domain>. Since the certificate is self-signed, the browser will show a warning, but the site should load.
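
You can also test from the command line on the server itself. Since the certificate is self-signed, curl needs the -k (insecure) flag; the s_client command lets you inspect the certificate Apache is serving:

$ curl -k https://localhost
$ echo | openssl s_client -connect localhost:443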

Tomcat Startup Script – Ubuntu 14.04 LTS

Environment : Ubuntu 14.04 LTS

Prerequisites: Java and Tomcat installed on your machine/instance. JAVA_HOME should already be set before starting.

Step 1: Create a file called “tomcat” under /etc/init.d folder and have the contents as below.

#!/bin/bash
#
# tomcat
#
# chkconfig:
# description:  Start up the Tomcat servlet engine.

# Tomcat installation directory
TOMCAT_DIR=/home/crishantha/lib/apache-tomcat-7.0.63

case "$1" in
 start)
   $TOMCAT_DIR/bin/startup.sh
   ;;
 stop)
   $TOMCAT_DIR/bin/shutdown.sh
   sleep 10
   ;;
 restart)
   $TOMCAT_DIR/bin/shutdown.sh
   sleep 20
   $TOMCAT_DIR/bin/startup.sh
   ;;
 *)
   echo "Usage: tomcat {start|stop|restart}" >&2
   exit 3
   ;;
esac

Step 2: Make the script executable

sudo chmod a+x /etc/init.d/tomcat

Step 3: Test the above script by executing the commands below

sudo /etc/init.d/tomcat start
sudo /etc/init.d/tomcat stop

Step 4: Register the above script as an init script. The following ensures that "start" or "stop" is executed at the appropriate system run levels. By default, start happens on run levels 2, 3, 4 and 5, and stop happens on run levels 0, 1 and 6.

sudo update-rc.d tomcat defaults
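
You can confirm that the run-level symlinks were created (the exact sequence numbers may differ on your system):

ls -l /etc/rc2.d/ | grep tomcat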

Step 5: Now reboot the machine/instance to verify that everything works.


Why Apache Thrift?

Thrift is an interface definition language (IDL) and binary communication protocol. It is used as an RPC framework; it was developed at Facebook and is now maintained as an open source project by the Apache Software Foundation.

Thrift is used by many popular projects such as Hadoop, Cassandra, HBase and Hypertable, as well as by Facebook itself.

Why Thrift over JSON?

Thrift is lightweight and language independent, and supports data transport/serialization for building cross-language services.

It has libraries supporting languages such as C++, Java, Python, Ruby, Erlang, Perl, Haskell, C#, JavaScript and Node.js.

1) Strong Typing

JSON is great if you are working with scripting languages like Python, Ruby, PHP, Javascript, etc. However, if you’re building significant portions of your application in a strongly-typed language like C++ or Java, JSON often becomes a bit of a headache to work with. Thrift lets you transparently work with strong, native types and also provides a mechanism for throwing application-level exceptions across the wire.

2) Performance

Performance is one of Thrift's main design considerations. JSON is geared much more towards human readability, which comes at the cost of being more CPU intensive to work with.

3) Serialization

If you are serializing large amounts of data, Thrift’s binary protocol is more efficient than JSON.

4) Versioning Support

Thrift has inbuilt mechanisms for versioning data. This can be very helpful in a distributed environment where your service interfaces may change, but you cannot atomically update all your client and server code.

5) Server Implementation

Thrift includes RPC server implementations for a number of languages. Because they are streamlined to support just Thrift requests, they are lighter weight and higher performance than a typical HTTP server serving JSON.

Advantages

Thrift generates both the server and client interfaces for a given service, so client calls are more consistent and generally less error prone. Thrift supports various transports and protocols, not just HTTP; if you are dealing with large volumes of service calls, or have bandwidth requirements, the client/server can transparently switch to more efficient transports.
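
As a quick illustration of that code-generation step, here is a minimal sketch using the Thrift compiler; the calculator.thrift IDL file name is hypothetical, and the generated sources end up under gen-java/ (or gen-py/, etc., depending on the target language):

$ thrift -r --gen java calculator.thrift
$ thrift -r --gen py calculator.thrift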

Disadvantages

It is more work to get started on the client side when the clients are building the calling code directly; it is less work for the service owner if they are building client libraries. The bottom line: if you are providing a simple service and API, Thrift is probably not the right tool.

References

1. Thrift Tutorial - http://thrift-tutorial.readthedocs.org/en/latest/intro.html

2. The programmers guide to Apache Thrift (MEAP) – Chapter 01

3. Writing your first Thrift Service - http://srinathsview.blogspot.com/2011/09/writing-your-first-thrift-service.html


Hadoop Eco System

As you may be aware, the Hadoop ecosystem consists of many open source tools. There is a lot of research going on in this area, and every day you see a new version of an existing framework, or a new framework altogether, getting popular and undermining existing ones. Hence, if you are a Hadoop developer, you need to constantly keep up with the technological advancements happening around you.

As a start to understanding the frameworks around Hadoop, I tried to sketch a diagram summarizing some of the key open source frameworks, their relationships and their usage. I will try to evolve this diagram as I learn more in the future, and I will not forget to share it with you all as well.


Steps

1. Feeding RDBMS data into HDFS via Sqoop (see the Sqoop sketch after this list)

2. Cleansing imported data via Pig

3. Loading HDFS data into Hive using Hive scripts. This can be done by running Hive scripts manually or by scheduling them through the Oozie workflow scheduler

4. Hive table schemas are stored separately in an RDBMS-backed schema (the Hive metastore)

5. In Hadoop 1.x, Spark and Shark need to be installed separately to run real-time queries against Hive data. In Hadoop 2.x, Spark (and Shark) can run directly on top of YARN

6. Batch queries can be executed directly via Hive
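
To make step 1 concrete, here is a minimal Sqoop import sketch; the JDBC URL, credentials, table name and target directory are all hypothetical examples:

$ sqoop import \
    --connect jdbc:mysql://dbhost/sales \
    --username dbuser -P \
    --table customers \
    --target-dir /user/hduser/sales/customers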
