RHadoop is an open-source collection of packages developed by Revolution Analytics that binds R to Hadoop and lets you express MapReduce algorithms in R. Written by Antonio Piccolboni with support from Revolution Analytics, it is the most mature and best-integrated project for connecting R with Hadoop so far.

RHadoop Packages

RHadoop consists of three main packages:

1. rmr – Provides MapReduce functionality in R

2. rhdfs – Provides HDFS file management in R

3. rhbase – Provides HBase database management in R

The RHadoop packages can be downloaded from: https://github.com/RevolutionAnalytics/RHadoop


1. Make sure Java, R and Hadoop are installed on the machine where you intend to install RHadoop.

2. You can test this by executing the following commands:

$ java -version
$ hadoop version
$ R

3. Check that the JAVA_HOME, HADOOP_HOME and HADOOP_CMD variables are set in the $HOME/.bashrc file. If not, add them:

export JAVA_HOME=/usr/local/jdk1.6.0_37
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CMD=$HADOOP_HOME/bin/hadoop


4. To check the Hadoop installation, run $HADOOP_HOME/bin/start-all.sh and verify that all the nodes of the Hadoop cluster are functioning properly.
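
The four checks above can be collected into one small script. This is only a sketch: the helper names (missing_tools, missing_vars, missing_daemons) are made up for illustration, and the daemon list assumes a Hadoop 1.x single-node setup started with start-all.sh, as reported by the JDK's jps tool. Adjust the list if your cluster runs different services.

```shell
#!/bin/sh
# Print the name of every command in "$@" that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Print the name of every variable in "$@" that is unset or empty.
missing_vars() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    [ -n "$value" ] || echo "$name"
  done
}

# Read `jps` output on stdin and print each expected daemon that is absent.
# Assumption: Hadoop 1.x daemon names; the second jps column is the class name.
missing_daemons() {
  running=$(cat)
  for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
    echo "$running" |
      awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' ||
      echo "$daemon"
  done
}

echo "Missing tools: $(missing_tools java hadoop R | tr '\n' ' ')"
echo "Missing variables: $(missing_vars JAVA_HOME HADOOP_HOME HADOOP_CMD | tr '\n' ' ')"
if command -v jps >/dev/null 2>&1; then
  echo "Missing daemons: $(jps | missing_daemons | tr '\n' ' ')"
fi
```

An empty list after each label means that check passed. Run it from a shell that has already sourced $HOME/.bashrc, otherwise the variables will be reported as missing.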


Step 1: Install rmr with its dependencies (Rcpp, RJSONIO, digest, functional, stringr and plyr).

To do this from the command prompt:

# R CMD INSTALL takes the source tarball only; install in dependency order
sudo R CMD INSTALL Rcpp_0.10.2.tar.gz
# RJSONIO is also a dependency: install its source tarball the same way
sudo R CMD INSTALL digest_0.6.2.tar.gz
sudo R CMD INSTALL functional_0.1.tar.gz
sudo R CMD INSTALL stringr_0.6.2.tar.gz
sudo R CMD INSTALL plyr_1.8.tar.gz
sudo R CMD INSTALL rmr2_2.0.2.tar.gz
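
If you prefer not to type each install by hand, the sequence can be wrapped in a loop that stops at the first failure. The following is a minimal sketch: install_tarballs and the DRY_RUN switch are hypothetical names introduced here, and the tarball names are the ones listed above (add the RJSONIO tarball you downloaded to the list).

```shell
#!/bin/sh
# Install each source tarball in the order given; stop at the first failure.
# With DRY_RUN=1 the commands are only printed, nothing is installed.
install_tarballs() {
  for tarball in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "sudo R CMD INSTALL $tarball"
    else
      sudo R CMD INSTALL "$tarball" || return 1
    fi
  done
}

# Preview the commands first; dependencies come before rmr2.
DRY_RUN=1 install_tarballs \
  Rcpp_0.10.2.tar.gz digest_0.6.2.tar.gz functional_0.1.tar.gz \
  stringr_0.6.2.tar.gz plyr_1.8.tar.gz rmr2_2.0.2.tar.gz
```

Once the preview looks right, call install_tarballs without DRY_RUN=1 from the folder that holds the tarballs.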

Step 2: Install rhdfs with its dependencies (rJava)

sudo JAVA_HOME=/usr/local/jdk1.6.0_37 R CMD javareconf
sudo R CMD INSTALL rJava_0.9-3.tar.gz
sudo HADOOP_CMD=/usr/local/hadoop/bin/hadoop R CMD INSTALL rhdfs_1.0.5.tar.gz

The library versions given above are tested and working. You are not required to stick to them, since they may no longer be the latest when you try this tutorial. If you are not sure about the dependency versions, let R resolve them from CRAN by using the install.packages command within R.

For example,

> install.packages("Rcpp", lib="/usr/local/lib/R/site-library")

You can repeat this for each dependency; most of the time it works.
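
To avoid repeating the command for every dependency, you can build a single install.packages call from the shell. This is a sketch under a few assumptions: install_expr is a made-up helper name, the library path is the one used in the example above, and the CRAN mirror URL is simply one public mirror (set REPO to whichever you prefer).

```shell
#!/bin/sh
LIB=/usr/local/lib/R/site-library
REPO=https://cloud.r-project.org   # assumption: any CRAN mirror will do

# Build one install.packages() call covering all package names in "$@".
install_expr() {
  pkgs=""
  for p in "$@"; do
    pkgs="${pkgs:+$pkgs, }\"$p\""
  done
  printf 'install.packages(c(%s), lib="%s", repos="%s")\n' "$pkgs" "$LIB" "$REPO"
}

# Print the R expression for the rmr dependencies listed in Step 1.
install_expr Rcpp RJSONIO digest functional stringr plyr
```

To actually run it, pass the generated expression to Rscript, for example: sudo Rscript -e "$(install_expr Rcpp RJSONIO digest functional stringr plyr)".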

Step 3: Verify rmr and rhdfs installations (within R)

> library(rmr2,lib.loc="/usr/local/lib/R/site-library")
> library(rhdfs,lib.loc="/usr/local/lib/R/site-library")

In the commands above, lib.loc is the location where R has installed the libraries. If it is not specified, R searches its default library paths, such as the $HOME/R folder. RHadoop sometimes throws errors because it cannot find specific libraries, so it is good practice to specify the library location in the library command.

If there are no errors, you have successfully managed to install rmr and rhdfs on top of R and Hadoop.
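
The same verification can be run from the shell without opening an interactive R session. This is a rough sketch: load_expr is a hypothetical helper, the library path is the one used above, and it assumes HADOOP_CMD is exported in the environment as set up earlier.

```shell
#!/bin/sh
LIB=/usr/local/lib/R/site-library

# Build the R expression that loads one package from LIB.
load_expr() {
  printf 'library(%s, lib.loc="%s")\n' "$1" "$LIB"
}

# Try each package in a fresh R session and report the outcome.
for pkg in rmr2 rhdfs; do
  if Rscript -e "$(load_expr "$pkg")" >/dev/null 2>&1; then
    echo "$pkg: OK"
  else
    echo "$pkg: FAILED"
  fi
done
```

If a package is reported as FAILED, rerun the Rscript command for it without the output redirection to see the actual error message.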


