Big Data

Towards a Cloud Enabled Data Intensive Digital Transformation

Today (08/07/2018) I had the privilege of delivering a one-hour presentation on “Towards a Cloud Enabled Data Intensive Digital Transformation” to the IT students of Jaffna University. I hope they were able to learn something from it. You can reach the slide deck using the following link:


SMACK Stack for building Data Intensive Enterprise Applications

With the advent of Big Data, enterprise applications are increasingly following a data-intensive, microservices-based architecture, moving away from the monolithic architectures we have been used to for decades.

These data-intensive applications should meet a set of requirements:

1. Ingest data at scale without loss

2. Analyze data in real time

3. Trigger actions based on the analyzed data

4. Store the data at cloud scale

5. Run on a distributed and highly resilient cloud platform

SMACK is one such stack. It can be used for building modern enterprise applications because it performs each of the above tasks with a loosely coupled tool chain of technologies that are all open source and production-proven at scale.

(S – Spark, M – Mesos, A – Akka, C – Cassandra, K – Kafka)

  • Spark – A general engine for large-scale data processing, enabling analytics from SQL queries to machine learning, graph analytics, and stream processing
  • Mesos – Distributed systems kernel that provides resourcing and isolation across all the other SMACK stack components. Mesos is the foundation on which other SMACK stack components run.
  • Akka – A toolkit and runtime to easily create concurrent and distributed apps that are responsive to messages.
  • Cassandra – Distributed database management system that can handle large amounts of data across servers with high availability.
  • Kafka – A high throughput, low-latency platform for handling real-time data feeds with no data loss.



SMACK Data Pipeline


The following commercial options are available for some of the components of SMACK:
1. Spark – Lightbend and Databricks
2. Cassandra – DataStax
3. Kafka – Confluent
4. Mesos – Mesosphere DC/OS

Further reading:
1. Building Data-Rich apps with the “SMACK” stack –
2. The SMACK Stack is the new LAMP Stack –

Apache Spark with Python on Ubuntu – Part 01

1.0 Introduction

Spark is a fast and general cluster computing system for Big Data, written in Scala. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools, including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing. Apache Spark is built on one main concept: the “Resilient Distributed Dataset (RDD)”.
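The RDD idea can be illustrated with plain Python: transformations such as map and filter derive new datasets, while an action reduces a dataset to a final value. The sketch below only mimics that flow locally; it is an illustration of the concept, not actual Spark code.

```python
from functools import reduce

# Think of the list as an RDD created with sc.parallelize([1, 2, 3, 4, 5]).
data = [1, 2, 3, 4, 5]

# Transformations build new datasets from existing ones...
squared = list(map(lambda x: x * x, data))           # like rdd.map(...)
evens = list(filter(lambda x: x % 2 == 0, squared))  # like rdd.filter(...)

# ...while an action computes a final result on the driver.
total = reduce(lambda a, b: a + b, evens)            # like rdd.reduce(...)

print(squared)  # [1, 4, 9, 16, 25]
print(evens)    # [4, 16]
print(total)    # 20
```

In real Spark, the transformations are lazy and are only evaluated when an action such as reduce or collect is called.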

Spark Components

Python vs Scala vs Java – which one with Spark?

The most popular languages associated with Spark are Python and Scala. Both follow a similar syntax and, compared to Java, are quite easy to follow. However, Scala tends to be faster than Python, mainly because Spark is written in Scala, which avoids the overhead of going through another layer of libraries for interpretation when you choose Python. In general, though, both are capable of handling almost all use cases.

Installing Apache Spark with Python

In order to complete this task, you are required to follow the following steps one by one.

Step 1: Install Java

- I assume you already have a Java Development Kit (JDK) installed on your machine. As of March 2018, the JDK version supported by Spark is JDK 8.

- You may verify the Java installation:

$ java -version

Step 2: Install Scala

- If you do not have Scala installed in your system, use this link to install it.

- Get the “tgz” bundle and extract it to the /usr/local/scala folder (as a best practice).

$ tar xvf scala-2.11.12.tgz

// Move the extracted folder to /usr/local/scala
$ su -
$ mv scala-2.11.12 /usr/local/scala
$ exit

- Then update .bashrc to set SCALA_HOME and add $SCALA_HOME/bin to $PATH.
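The lines to add to .bashrc would look something like the following (the path assumes the /usr/local/scala location used above):

```shell
# Add to ~/.bashrc, then reload it with: source ~/.bashrc
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin
```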

- Finally, verify the Scala installation:

$ scala -version

Step 3: Install Spark

After installing both Java and Scala, you are now ready to download Spark. Use this link to download the “tgz” file.

Note: When downloading Apache Spark, be sure to get the latest version of the Spark 2.3 line.

$ tar xvf spark-2.3.0-bin-hadoop2.7.tgz
// Move the extracted folder to /usr/local/spark
$ su -
$ mv spark-2.3.0-bin-hadoop2.7 /usr/local/spark
$ exit

- Then update .bashrc to set SPARK_HOME and add $SPARK_HOME/bin to $PATH.
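As with Scala, the .bashrc additions would look something like this (the path assumes the /usr/local/spark location used above):

```shell
# Add to ~/.bashrc, then reload it with: source ~/.bashrc
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
```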

- Now you may verify the Spark installation:

$ spark-shell

If all goes well, you will see the Spark prompt displayed! (Use :quit or Ctrl+D to exit the shell.)

Step 4: Install Python

You may install Python using Canopy

Use this link to download the binaries for your system. (The Linux 64-bit Python 3.5 download is used for this blog.)

Once you have installed Canopy, you have a Python development environment to work with Spark, with all the libraries including PySpark.

Once all of these are installed, you can try PySpark by simply typing “pyspark” in a terminal window.

$ pyspark

This will allow you to execute your Python scripts on Spark.


Hadoop 2.6 (Part 2) – Running a MapReduce Job

This is the continuation of my previous article, “Installing Hadoop 2.6 on Ubuntu 16.04”. This article explains how to run one of the examples shipped with the Hadoop binary.

Once the Hadoop installation is complete, you can run the “wordcount” example provided with the Hadoop examples in order to test a MapReduce job. This example is bundled in the examples jar file of the distribution. (See the steps below for more details.)
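The logic behind the bundled wordcount job can be sketched in plain Python: the map phase emits a (word, 1) pair for every word, and the reduce phase sums the counts per word. This is only an illustration of the MapReduce flow, not the bundled Java code:

```python
from collections import defaultdict

def map_phase(line):
    # Map step: emit a (word, 1) pair for every word in the line
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Reduce step: sum the counts for each distinct word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["Word Count Text File", "Word Count"]
pairs = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(pairs))  # {'Word': 2, 'Count': 2, 'Text': 1, 'File': 1}
```

In the real job, Hadoop shuffles the (word, 1) pairs between the map and reduce phases so that all pairs for a given word reach the same reducer.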

Step 1: Start the Hadoop Cluster, if not already started.

$ /usr/local/hadoop/sbin/
$ /usr/local/hadoop/sbin/

Step 2: Copy the text files that you are going to consider for a “wordcount” to a local folder (/home/hadoop/textfiles)

Step 3: Copy the text files (in the local folder) to HDFS.

$ echo "Word Count Text File" > textFile.txt
$ hdfs dfs -mkdir -p /user/hduser/dfs
$ hdfs dfs -copyFromLocal textFile.txt /user/hduser/dfs

Step 4: List the contents of the HDFS folder.

$ hdfs dfs -ls /user/hduser/dfs

Step 5: If you were able to complete Step 4, you are good to go ahead with the MapReduce job.

$ cd /usr/local/hadoop
$ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.1.jar wordcount /user/hduser/dfs /user/hduser/dfs-output

If the job completed successfully, congratulations!

You can choose either the command line or the web interface to display the contents of the HDFS directories. If you choose the command line, you can try the following command.

$ hdfs dfs -ls /user/hduser/dfs-output



Why Apache Thrift?

Thrift is an interface definition language (IDL) and binary communication protocol. It is used as an RPC framework; it was developed at Facebook and is now maintained as an open-source project by the Apache Software Foundation.

Thrift is used by many popular projects, including Facebook, Hadoop, Cassandra, HBase, and Hypertable.

Why Thrift over JSON?

Thrift is lightweight and language independent, and supports data transport/serialization for building cross-language services.

It has libraries supporting languages such as C++, Java, Python, Ruby, Erlang, Perl, Haskell, C#, JavaScript, and Node.js.

1) Strong Typing

JSON is great if you are working with scripting languages like Python, Ruby, PHP, Javascript, etc. However, if you’re building significant portions of your application in a strongly-typed language like C++ or Java, JSON often becomes a bit of a headache to work with. Thrift lets you transparently work with strong, native types and also provides a mechanism for throwing application-level exceptions across the wire.
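As an illustration of this strong typing, a Thrift IDL file declares typed fields, a typed exception, and a service interface, from which Thrift generates client and server stubs. The user service below is a hypothetical example sketched for this post, not part of any particular project:

```thrift
// user.thrift -- a hypothetical example definition
struct User {
  1: required i64 id,
  2: required string name,
  3: optional string email
}

exception UserNotFound {
  1: string message
}

service UserService {
  // The generated client raises UserNotFound as a native
  // exception in each target language.
  User getUser(1: i64 id) throws (1: UserNotFound err)
}
```

Running `thrift --gen java user.thrift` (or `--gen py`, `--gen cpp`, etc.) produces the strongly typed stubs for the chosen language.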

2) Performance

Performance is one of Thrift’s main design considerations. JSON is geared much more towards human readability, which comes at the cost of being more CPU-intensive to work with.

3) Serialization

If you are serializing large amounts of data, Thrift’s binary protocol is more efficient than JSON.

4) Versioning Support

Thrift has inbuilt mechanisms for versioning data. This can be very helpful in a distributed environment where your service interfaces may change, but you cannot atomically update all your client and server code.

5) Server Implementation

Thrift includes RPC server implementations for a number of languages. Because they are streamlined to support just Thrift requests, they are lighter-weight and higher-performance than typical HTTP servers serving JSON.


Thrift generates both the server and client interfaces for a given service. Client calls are more consistent and generally less error prone. Thrift supports various protocols, not just HTTP. If you are dealing with large volumes of service calls, or have bandwidth requirements, the client/server can transparently switch to more efficient transports.


It is more work to get started on the client side when clients are directly building the calling code, and less work for the service owner if they are building libraries for clients. The bottom line: if you are providing a simple service and API, Thrift is probably not the right tool.


1. Thrift Tutorial -

2. The programmers guide to Apache Thrift (MEAP) – Chapter 01

3. Writing your first Thrift Service -


Hadoop Eco System

As you may be aware, the Hadoop ecosystem consists of many open-source tools. There is a lot of research going on in this area, and every day you see a new version of an existing framework, or a brand-new framework gaining popularity and undermining existing ones. Hence, if you are a Hadoop developer, you need to keep up with the technological advancements happening around you.

As a start to understanding the frameworks around, I tried to sketch a diagram summarizing some of the key open-source frameworks, their relationships, and their usage. I will try to evolve this diagram as I learn more, and I will not forget to share it with you all as well.


1. Feeding RDBMS data to HDFS via Sqoop

2. Cleansing imported data via Pig

3. Loading HDFS data into Hive using Hive scripts. This can be done by running Hive scripts manually or by scheduling them through the Oozie workflow scheduler

4. Hive data warehouse schemas are stored separately in an RDBMS schema for the Hive data warehouse

5. In Hadoop 1.x, Spark and Shark need to be installed separately to run real-time queries via Hive. In Hadoop 2.x, YARN basically bundles the Spark and Shark components

6. Batch queries can be executed directly via Hive


Apache Hadoop Summit 2015 (Brussels) – Videos and Slides

You can catch up on very important sessions from the Hadoop world by going through the videos and slides.


Setting Up HBase (Standalone) on Ubuntu 12.04

Though HDFS is the default distributed file system attached to Hadoop, HBase also came into the limelight due to several limitations in HDFS.

HDFS Limitations

1. HDFS is optimized for streaming relatively large files, in the range of 100s of MB and upwards, which are accessed through MapReduce in “batch mode”.

2. HDFS files are write-once, read-many files, and HDFS does not perform well for random writes or random reads.

To overcome the above, HBase was developed with the following features.

HBase Features

1. Can access small amounts of data from a large data set, e.g., from a billion-row table, in real time.

2. Fast scanning across tables.

3. Flexible data model.

So, with all of the above, let’s dive into the world of HBase.

Installation (Standalone Mode)

HBase installation can happen in three different modes:

1. Standalone / Local Mode

2. Pseudo-Distributed Mode

3. Fully Distributed Mode

Just to feel HBase, we will first try out the Standalone Mode here.

Step 1: Download HBase. Use binaries at the Apache HBase mirrors.

In this tutorial I am using hbase-0.94.8.tar.gz.

Further, you can check the configuration dependencies from the URL (see Table 2.1, Hadoop version support matrix). According to this, you are required to have Hadoop 1.0.3 to run HBase.

Step 2: Extract the HBase binaries to a desired location. Here I extract them to /usr/local/hbase-0.94.8 and make it HBASE_HOME.

Step 3: Set the HBASE_HOME in .bashrc

Add the HBASE_HOME variable to the .bashrc file:

export HBASE_HOME=/usr/local/hbase-0.94.8

Step 4: Change the HBase configuration files.

$HBASE_HOME/conf/ to set the JAVA_HOME

export JAVA_HOME=/usr/local/jdk1.6.0_37




The hbase.rootdir needs to be specified in order to give HBase a fixed working directory; otherwise the system uses a temporary folder it assigns itself.
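For example, hbase.rootdir can be set in $HBASE_HOME/conf/hbase-site.xml; the directory below is an assumed local path, so adjust it to your setup:

```xml
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <!-- In standalone mode a local file:// path is sufficient -->
    <value>file:///usr/local/hbase-data</value>
  </property>
</configuration>
```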

Step 5: Change the /etc/hosts file to reflect your external IP.

The changed entry should map your hostname (here, crishantha-Notebook-PC) to the machine’s external IP instead of the localhost address.

Step 6: Start HBase

$ $HBASE_HOME/bin/

If everything goes well, you should be able to start HBase successfully.

Check the $HBASE_HOME/logs folder for any errors while starting.

Step 7: Start the HBase shell – Once HBase is started, make sure the HBase shell is working. Execute the following command to check:

$ hbase shell






Setting up RHadoop on Ubuntu 12.04

RHadoop is an open-source package developed by Revolution Analytics that binds R to Hadoop and allows the representation of MapReduce algorithms in R. RHadoop is the most mature and best integrated project connecting R with Hadoop so far. It was written by Antonio Piccolboni, with support from Revolution Analytics.

RHadoop Packages

RHadoop consists of 3 main packages.

1. rmr – Provides MapReduce functionality in R

2. rhdfs – Provides HDFS file management in R

3. rhbase – Provides HBase database management in R

RHadoop binaries can be downloaded from:


1. Make sure the Java, R, and Hadoop binaries are installed on the machine on which you intend to install RHadoop.

2. You can test this by executing the following commands:

$ java -version
$ hadoop version
$ R

3. Check that the JAVA_HOME, HADOOP_HOME, and HADOOP_CMD variables are set in the $HOME/.bashrc file. If not, add them:

export JAVA_HOME=/usr/local/jdk1.6.0_37
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CMD=$HADOOP_HOME/bin/hadoop


4. Just to check the Hadoop installation, run the $HADOOP_HOME/bin/ and see that all the nodes of the Hadoop cluster are functioning properly.


Step 1: Install rmr with its dependencies (Rcpp, RJSONIO, digest, functional, stringr, and plyr).

If you use the command prompt to do this:

sudo R CMD INSTALL Rcpp_0.10.2.tar.gz
sudo R CMD INSTALL digest_0.6.2.tar.gz
sudo R CMD INSTALL functional_0.1.tar.gz
sudo R CMD INSTALL stringr_0.6.2.tar.gz
sudo R CMD INSTALL plyr_1.8.tar.gz
sudo R CMD INSTALL rmr2_2.0.2.tar.gz

Step 2: Install rhdfs with its dependencies (rJava)

sudo JAVA_HOME=/usr/local/jdk1.6.0_37 R CMD javareconf
sudo R CMD INSTALL rJava_0.9-3.tar.gz
sudo HADOOP_CMD=/usr/local/hadoop/bin/hadoop R CMD INSTALL rhdfs_1.0.5.tar.gz

In the above, the given library versions are tested and working. You are not required to stick to these, mainly because they may no longer be the latest by the time you try this tutorial. If you are not sure about the dependency versions, let the repositories decide for you by using the install.packages command within R.

For example,

> install.packages("Rcpp", lib="/usr/local/lib/R/site-library")

You may continue this for all the dependencies, and most of the time it will work for you.

Step 3: Verify rmr and rhdfs installations (within R)

> library(rmr2,lib.loc="/usr/local/lib/R/site-library")
> library(rhdfs,lib.loc="/usr/local/lib/R/site-library")

In the above, “lib.loc” is where R has installed the specific dependency libraries. If it is not specified, R looks in the $HOME/R folder for libraries. RHadoop sometimes throws errors when it cannot find specific libraries, so it is always good to specify the library location in the library command.

If there are no errors, you have successfully managed to install rmr and rhdfs on top of R and Hadoop.




Installing R and RStudio on Ubuntu 12.04 LTS

If you are in the Data Science field or involved with any data-analytics research work, you will need the help of a data-analysis tool to analyze the collected data. “R” is one of the most popular FOSS tools in the market now. You can get more details of R and related information from the CRAN website.

Once you install R, it gives you only a basic UI. As an alternative, you can use RStudio on top of the installed R binaries. RStudio is a very user-friendly interface with many features for executing most of R’s functionality.

Installing R

1. Update the repositories to get the latest R distributions:

Use sudo vi /etc/apt/sources.list to add the following:

deb precise/


sudo add-apt-repository "deb precise/"

2. Get the repository GPG key and import it into apt.

gpg --keyserver --recv-key E084DAB9
gpg -a --export E084DAB9 | sudo apt-key add -

3. Update the repositories.

sudo apt-get update
sudo apt-get install r-base

Now you may go to the command prompt and type R to get the R console.

Installing RStudio

There are two modes of installation in RStudio.

1. Desktop Version

2. Server Version (This is basically the preferred option mainly because of its ability to be accessed from anywhere on the network)

Installing RStudio (Desktop Version)

1. Visit the download page and get the latest binary available.

2. As a prerequisite, you are required to install R (R version 2.11.1 or higher). Follow steps given above to install R.

3. Install the package using dpkg:

sudo dpkg -i

After installation, an “RStudio” icon will be available to proceed.

Installing RStudio (The Server Version)

1. Visit the download page and get the latest binary available.

2. As a prerequisite, you are required to install R (R version 2.11.1 or higher). Follow steps given above to install R.

3. Use following steps to install all the dependencies related to RStudio.


$ sudo apt-get install gdebi-core
$ sudo apt-get install libapparmor1  # Required only for Ubuntu, not Debian
$ wget
$ sudo gdebi rstudio-server-0.98.501-i386.deb


$ sudo apt-get install gdebi-core
$ sudo apt-get install libapparmor1  # Required only for Ubuntu, not Debian
$ wget
$ sudo gdebi rstudio-server-0.98.501-amd64.deb

4. Now use the following link to test that the installation succeeded.


Use your Linux userid/password to log into the RStudio system. That’s all there is to it!
