Big Data

Hadoop 2.6 (Part 2) – Running a MapReduce Job

This is the continuation of my previous article, “Installing Hadoop 2.6 on Ubuntu 16.04”. This article explains how to run one of the examples shipped with the Hadoop binaries.

Once the Hadoop installation is complete, you can run the “wordcount” example provided with the Hadoop examples in order to test a MapReduce job. This example is bundled with the hadoop-mapreduce-examples JAR file in the distribution. (See the steps below for more details.)

Step 1: Start the Hadoop Cluster, if not already started.

$ /usr/local/hadoop/sbin/start-dfs.sh
$ /usr/local/hadoop/sbin/start-yarn.sh
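
Optionally, before moving on, you can verify that the daemons actually came up by listing the running Java processes with jps. The exact set depends on your configuration, but on a single-node setup you would roughly expect the daemons listed below:

$ jps
# Typically shows: NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager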

Step 2: Place the text files that you are going to use for the word count in a local folder (/home/hadoop/textfiles).

Step 3: Copy the text files (in the local folder) to HDFS. For simplicity, here we create a small sample file and copy it across.

$ echo "Word Count Text File" > textFile.txt
$ hdfs dfs -mkdir -p /user/hduser/dfs
$ hdfs dfs -copyFromLocal textFile.txt /user/hduser/dfs

Step 4: List the contents of the HDFS folder.

$ hdfs dfs -ls /user/hduser/dfs

Step 5: If you were able to complete step 4, you are good to go ahead with the MapReduce job.

$ cd /usr/local/hadoop
$ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.1.jar wordcount /user/hduser/dfs /user/hduser/dfs-output
If the job completed successfully, congratulations!
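
While the job is running, you can also monitor its progress in the YARN ResourceManager web UI, which listens on port 8088 by default:

http://localhost:8088/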

You can use either the command line or the web interface to display the contents of the HDFS directories. If you choose the command line, you can try the following command.

$ hdfs dfs -ls /user/hduser/dfs-output

OR

http://localhost:50070/ (the NameNode web UI)
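
To print the actual word counts rather than just listing the output directory, you can cat the reducer output file. Assuming the default output file naming of the wordcount example (a single part-r-00000 file), the following should work:

$ hdfs dfs -cat /user/hduser/dfs-output/part-r-00000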

Why Apache Thrift?

Thrift is an interface definition language (IDL) and binary communication protocol used as an RPC framework. It was developed at Facebook and is now maintained as an open-source project by the Apache Software Foundation.

Thrift is used by many popular projects and companies, such as Facebook, Hadoop, Cassandra, HBase and Hypertable.

Why Thrift over JSON?

Thrift is lightweight and language independent, and it supports data transport and serialization for building cross-language services.

It provides libraries for languages such as C++, Java, Python, Ruby, Erlang, Perl, Haskell, C#, JavaScript and Node.js.
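
For example (a rough sketch, using a hypothetical calculator.thrift file that is not part of this post), a single IDL definition can be compiled into bindings for several of these languages with the Thrift compiler:

$ thrift --gen java calculator.thrift     # Java stubs are generated under gen-java/
$ thrift --gen py calculator.thrift       # Python stubs are generated under gen-py/
$ thrift --gen js:node calculator.thrift  # Node.js stubs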

1) Strong Typing

JSON is great if you are working with scripting languages like Python, Ruby, PHP, Javascript, etc. However, if you’re building significant portions of your application in a strongly-typed language like C++ or Java, JSON often becomes a bit of a headache to work with. Thrift lets you transparently work with strong, native types and also provides a mechanism for throwing application-level exceptions across the wire.

2) Performance

Performance is one of Thrift’s main design considerations. JSON is geared much more towards human readability, which comes at the cost of being more CPU intensive to work with.

3) Serialization

If you are serializing large amounts of data, Thrift’s binary protocol is more efficient than JSON.

4) Versioning Support

Thrift has inbuilt mechanisms for versioning data. This can be very helpful in a distributed environment where your service interfaces may change, but you cannot atomically update all your client and server code.

5) Server Implementation

Thrift includes RPC server implementations for a number of languages. Because they are streamlined to support just Thrift requests, they are lighter weight and higher performance than typical HTTP server implementations serving JSON.

Advantages

Thrift generates both the server and client interfaces for a given service, so client calls are more consistent and generally less error prone. Thrift also supports various transports, not just HTTP: if you are dealing with large volumes of service calls, or have bandwidth constraints, the client and server can transparently switch to more efficient options such as the binary protocol over plain sockets.

Disadvantages

It is more work to get started on the client side when clients are building the calling code directly. It is less work for the service owner if they are building libraries for clients. The bottom line: if you are providing a simple service and API, Thrift is probably not the right tool.

References

1. Thrift Tutorial - http://thrift-tutorial.readthedocs.org/en/latest/intro.html

2. The Programmer's Guide to Apache Thrift (MEAP) – Chapter 01

3. Writing your first Thrift Service - http://srinathsview.blogspot.com/2011/09/writing-your-first-thrift-service.html


Hadoop Ecosystem

As you may be aware, the Hadoop ecosystem consists of many open-source tools. A lot of research is going on in this area, and every day you see a new version of an existing framework, or an entirely new framework, becoming popular and displacing existing ones. Hence, if you are a Hadoop developer, you need to constantly keep track of the technological advancements happening around you.

As a start to understanding the frameworks around Hadoop, I sketched a diagram summarizing some of the key open-source frameworks, their relationships and their usage. I will evolve this diagram as I learn more, and I will not forget to share it with you all as well.


Steps

1. Feeding RDBMS data into HDFS via Sqoop (a sample import command is sketched after this list)

2. Cleansing imported data via Pig

3. Loading HDFS data into Hive using Hive scripts. This can be done by running Hive scripts manually or by scheduling them through the Oozie workflow scheduler

4. Hive data warehouse schemas are stored separately in an RDBMS schema (the Hive metastore)

5. In Hadoop 1.x, Spark and Shark need to be installed separately to run near-real-time queries against Hive data. In Hadoop 2.x, YARN allows Spark (and Shark) to run as applications on the same cluster

6. Batch queries can be executed directly via Hive
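
As a rough illustration of step 1 above (a hypothetical sketch; the JDBC connection string, credentials, table name and target directory are made up for the example), a Sqoop import from an RDBMS into HDFS looks roughly like this:

$ sqoop import \
    --connect jdbc:mysql://dbhost:3306/salesdb \
    --username dbuser -P \
    --table customers \
    --target-dir /user/hduser/customers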


Apache Hadoop Summit 2015 (Brussels) – Videos and Slides

You can catch up on some very important sessions from the Hadoop world by going through the session videos and slides at the link below.

http://2015.hadoopsummit.org/brussels/agenda/


Setting Up HBase (Standalone) on Ubuntu 12.04

Though HDFS is the default distributed file system attached to Hadoop, HBase came into the limelight due to several limitations of HDFS.

HDFS Limitations

1. HDFS is optimized for streaming relatively large files, from hundreds of megabytes upwards, that are accessed through MapReduce in “batch mode”.

2. HDFS files are write-once, read-many, and HDFS does not perform well for random writes or random reads.

To overcome the above, HBase was developed with the following features.

HBase Features

1. It can access small amounts of data from a large data set, such as a billion-row table, in real time.

2. Fast scanning across tables.

3. Flexible data model.

So, with all of the above, let's dive into the world of HBase.

Installation (Standalone Mode)

HBase can be installed in three different modes.

1. Standalone / Local Mode

2. Pseudo-Distributed Mode

3. Fully Distributed Mode

Just to get a feel for HBase, we will first try out the Standalone Mode here.

Step 1: Download HBase. Use binaries at the Apache HBase mirrors.

In this tutorial I am using hbase-0.94.8.tar.gz.

Further, you can check the configuration dependencies at http://hbase.apache.org/book/configuration.html (see Table 2.1, the Hadoop version support matrix). According to this, you are required to have Hadoop 1.0.3 to run this version of HBase.

Step 2: Extract the HBase binaries to a desired location. Here I extract them to /usr/local/hbase-0.94.8 and make that HBASE_HOME.

Step 3: Set HBASE_HOME in .bashrc

Define HBASE_HOME in the .bashrc file and add it to the PATH:

export HBASE_HOME=/usr/local/hbase-0.94.8
...
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HBASE_HOME/bin:$PATH

Step 4: Edit the HBase configuration files.

Edit $HBASE_HOME/conf/hbase-env.sh to set JAVA_HOME:

export JAVA_HOME=/usr/local/jdk1.6.0_37

$HBASE_HOME/conf/hbase-site.xml

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///usr/local/hbase-0.94.8/var</value>
  </property>
</configuration>

The hbase.rootdir needs to be specified in order to give HBase a fixed working directory. Otherwise the system uses a temporary folder that it assigns itself.

Step 5: Change the /etc/hosts file to reflect your external IP.

The changed entries should look like the ones below. The second line should use your external IP (192.168.1.33 here) or 127.0.0.1, replacing the default 127.0.1.1.

127.0.0.1       localhost
192.168.1.33    crishantha-Notebook-PC

Step 6: Start HBase

$ $HBASE_HOME/bin/start-hbase.sh

If everything goes well, HBase should start successfully.

Check the $HBASE_HOME/logs folder for any errors during startup.

Step 7: Start the HBase shell – Once HBase is started, make sure the HBase shell is working by executing the following command.

$ hbase shell
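
As a quick smoke test inside the shell (a minimal sketch; the table and column family names here are made up), you can create a table, insert a cell, scan it and drop it:

hbase> create 'testtable', 'cf'
hbase> put 'testtable', 'row1', 'cf:greeting', 'Hello HBase'
hbase> scan 'testtable'
hbase> disable 'testtable'
hbase> drop 'testtable'
hbase> exit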

References

1. http://diggdata.in/post/67561846971/fetch-data-from-hbase-database-from-r-using-rhbase

2. http://blog.revolutionanalytics.com/2011/09/mapreduce-hadoop-r.html

3. https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md


Setting up RHadoop on Ubuntu 12.04

RHadoop is an open-source collection of packages, developed by Revolution Analytics, that binds R to Hadoop and allows MapReduce algorithms to be expressed in R. RHadoop is the most mature and best-integrated project connecting R with Hadoop so far; it was written by Antonio Piccolboni with support from Revolution Analytics.

RHadoop Packages

RHadoop consists of 3 main packages.

1. rmr - Provides MapReduce functionality in R

2. rhdfs – Provides HDFS file management in R

3. rhbase – Provides HBase database management in R

RHadoop binaries can be downloaded from: https://github.com/RevolutionAnalytics/RHadoop

Prerequisites

1. Make sure the Java, R and Hadoop binaries are installed on the machine on which you intend to install RHadoop.

2. You can test this by executing the following commands:

$ java -version
$ hadoop version
$ R

3. Check that the JAVA_HOME, HADOOP_HOME and HADOOP_CMD variables are set in the $HOME/.bashrc file. If not, add them.

export JAVA_HOME=/usr/local/jdk1.6.0_37
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CMD=$HADOOP_HOME/bin/hadoop

...
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
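
After editing .bashrc, you can reload it and quickly verify the variables (an optional check):

$ source ~/.bashrc
$ echo $HADOOP_CMD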

4. To check the Hadoop installation, run $HADOOP_HOME/bin/start-all.sh and verify that all the nodes of the Hadoop cluster are functioning properly.

Installation

Step 1: Install rmr with its dependencies (Rcpp, RJSONIO, digest, functional, stringr and plyr).

If you use the command prompt to do this:

sudo R CMD INSTALL Rcpp_0.10.2.tar.gz
sudo R CMD INSTALL RJSONIO_1.0-1.tar.gz
sudo R CMD INSTALL digest_0.6.2.tar.gz
sudo R CMD INSTALL functional_0.1.tar.gz
sudo R CMD INSTALL stringr_0.6.2.tar.gz
sudo R CMD INSTALL plyr_1.8.tar.gz
sudo R CMD INSTALL rmr2_2.0.2.tar.gz

Step 2: Install rhdfs with its dependencies (rJava)

sudo JAVA_HOME=/usr/local/jdk1.6.0_37 R CMD javareconf
sudo R CMD INSTALL rJava_0.9-3.tar.gz
sudo HADOOP_CMD=/usr/local/hadoop/bin/hadoop R CMD INSTALL rhdfs_1.0.5.tar.gz

The library versions given above are tested and working. You are not required to stick to them, mainly because they may not be the latest available when you try out this tutorial. If you are not sure about the dependency versions, let the package repositories decide for you by using the install.packages command within R.

For example,

> install.packages("Rcpp", lib="/usr/local/lib/R/site-library")

You can repeat this for all the dependencies, and most of the time it will work for you.
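
For example (a sketch along the same lines), you could install all the CRAN dependencies in a single call and let the repository resolve the versions; note that rmr2 and rhdfs themselves still have to be installed from the downloaded tarballs as shown above:

> install.packages(c("Rcpp", "RJSONIO", "digest", "functional", "stringr", "plyr", "rJava"),
+                  lib="/usr/local/lib/R/site-library")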

Step 3: Verify rmr and rhdfs installations (within R)

> library(rmr2,lib.loc="/usr/local/lib/R/site-library")
> library(rhdfs,lib.loc="/usr/local/lib/R/site-library")

In the above, “lib.loc” is where R has installed the specific dependency libraries. If it is not specified, R looks in the $HOME/R folder for libraries. RHadoop sometimes throws errors because it cannot find specific libraries, so it is always good to specify the library location in the library() command.

If there are no errors, you have successfully managed to install rmr and rhdfs on top of R and Hadoop.
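
As an additional sanity check (a minimal sketch based on the standard rmr2 tutorial example; the library paths are the same assumptions as above), you can run a trivial MapReduce job from within R:

> library(rmr2, lib.loc="/usr/local/lib/R/site-library")
> library(rhdfs, lib.loc="/usr/local/lib/R/site-library")
> hdfs.init()                              # initialise the rhdfs HDFS connection
> small.ints <- to.dfs(1:10)               # write a small vector to HDFS
> result <- mapreduce(input = small.ints,
+                     map = function(k, v) keyval(v, v^2))
> from.dfs(result)                         # returns each number paired with its square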

References:

1. http://bighadoop.wordpress.com/2013/02/25/r-and-hadoop-data-analysis-rhadoop/


Installing R and RStudio on Ubuntu 12.04 LTS

If you are in the data science field or involved with any data analysis research work, you will need a data analysis tool to analyze the collected data. “R” is one of the most popular FOSS tools in this space now. You can get more details about R and related information from the CRAN web site (http://cran.r-project.org/).

Once you install R, it gives you only a basic console UI. As an alternative you can use RStudio (https://www.rstudio.com/) on top of the installed R binaries. RStudio is a very user-friendly interface that supports most of R's features.

Installing R

1. Update the repositories to get the latest R distributions:

Use sudo vi /etc/apt/sources.list to add the following:

deb http://cran.stat.ucla.edu/bin/linux/ubuntu precise/

OR

sudo add-apt-repository "deb http://cran.stat.ucla.edu/bin/linux/ubuntu precise/"

2. Get the repository signing key and import it into apt.

gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
gpg -a --export E084DAB9 | sudo apt-key add -

3. Update the package lists and install R.

sudo apt-get update
sudo apt-get install r-base

Now you may go to the command prompt and type R to get the R console.
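
For a quick check that the installation worked, start R and print its version string:

$ R
> R.version.string
> q()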

Installing RStudio

There are two modes of installation in RStudio.

1. Desktop Version

2. Server Version (This is basically the preferred option mainly because of its ability to be accessed from anywhere on the network)

Installing RStudio (Desktop Version)

1. Visit www.rstudio.com and use http://www.rstudio.com/ide/download/desktop to download the latest binary available.

2. As a prerequisite, you are required to install R (R version 2.11.1 or higher). Follow steps given above to install R.

3. Install the binary package

sudo dpkg -i rstudio-0.xxx-amd64.deb

After installation, an “RStudio” icon will be available to launch the application.

Installing RStudio (The Server Version)

1. Visit www.rstudio.com and use http://www.rstudio.com/ide/download/server.html to download the latest binary available.

2. As a prerequisite, you are required to install R (R version 2.11.1 or higher). Follow steps given above to install R.

3. Use the following steps to install RStudio Server along with its dependencies.

32-bit

$ sudo apt-get install gdebi-core
$ sudo apt-get install libapparmor1  # Required only for Ubuntu, not Debian
$ wget http://download2.rstudio.org/rstudio-server-0.98.501-i386.deb
$ sudo gdebi rstudio-server-0.98.501-i386.deb

64-bit

$ sudo apt-get install gdebi-core
$ sudo apt-get install libapparmor1  # Required only for Ubuntu, not Debian
$ wget http://download2.rstudio.org/rstudio-server-0.98.501-amd64.deb
$ sudo gdebi rstudio-server-0.98.501-amd64.deb

4. Now use the following link to test that the installation was successful.

http://localhost:8787

Use your Linux user ID and password to log in to RStudio Server. That's all there is to it!
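
If the page does not load, a quick way to check the server process (assuming the standard rstudio-server admin commands) is:

$ sudo rstudio-server status
$ sudo rstudio-server verify-installation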
