Why Apache Thrift?

Thrift is an interface definition language (IDL) and binary communication protocol used as an RPC framework. It was originally developed at Facebook and is now maintained as an open source project by the Apache Software Foundation.

Thrift is used by many popular projects and companies such as Facebook, Hadoop, Cassandra, HBase, and Hypertable.

Why Thrift over JSON?

Thrift is lightweight and language independent, and supports data transport and serialization for building cross-language services.

It provides libraries for languages such as C++, Java, Python, Ruby, Erlang, Perl, Haskell, C#, JavaScript, and Node.js.

1) Strong Typing

JSON is great if you are working with scripting languages like Python, Ruby, PHP, JavaScript, etc. However, if you are building significant portions of your application in a strongly-typed language like C++ or Java, JSON often becomes a bit of a headache to work with. Thrift lets you transparently work with strong, native types and also provides a mechanism for throwing application-level exceptions across the wire (a minimal IDL sketch follows this list).

2) Performance

Performance is one of Thrift’s main design considerations. JSON is geared much more towards human readability, which comes at the cost of being more CPU intensive to work with.

3) Serialization

If you are serializing large amounts of data, Thrift’s binary protocol is more efficient than JSON.

4) Versioning Support

Thrift has inbuilt mechanisms for versioning data. This can be very helpful in a distributed environment where your service interfaces may change, but you cannot atomically update all your client and server code.

5) Server Implementation

Thrift includes RPC server implementations for a number of languages. Because they are streamlined to support just Thrift requests, they are lighter weight and higher performance than a typical HTTP server serving JSON.
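
To make the strong typing, exception and versioning points above concrete, here is a minimal sketch of a Thrift IDL definition. The file name (user.thrift), the struct and the service below are hypothetical examples, not part of any existing API.

struct User {
  1: required i64 id,
  2: optional string name
}

exception InvalidRequest {
  1: string message
}

service UserService {
  User getUser(1: i64 id) throws (1: InvalidRequest error)
}

The numbered field IDs are what allow fields to be added or deprecated without breaking older clients and servers, which is the basis of the versioning support described above.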

Advantages

Thrift generates both the server and client interfaces for a given service. Client calls are more consistent and generally less error prone. Thrift supports various protocols and transports, not just HTTP. If you are dealing with large volumes of service calls, or have bandwidth requirements, the client and server can transparently switch to more efficient transports.
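
For example, assuming the hypothetical user.thrift definition sketched above and an installed Thrift compiler, the client and server stubs for each target language are generated with a single command:

thrift --gen java user.thrift
thrift --gen py user.thrift

The generated code contains the typed data classes together with the client and server (processor) skeletons, so no hand-written serialization code is needed.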

Disadvantages

It is more work to get started on the client side when clients are building the calling code themselves, although it is less work for the service owner if they are building libraries for clients. The bottom line: if you are providing a simple service and API, Thrift is probably not the right tool.

References

1. Thrift Tutorial - http://thrift-tutorial.readthedocs.org/en/latest/intro.html

2. The Programmer’s Guide to Apache Thrift (MEAP) – Chapter 01

3. Writing your first Thrift Service - http://srinathsview.blogspot.com/2011/09/writing-your-first-thrift-service.html


Hadoop Eco System

As you may be aware, the Hadoop ecosystem consists of many open source tools. There is a lot of research going on in this area, and every day you see a new version of an existing framework, or an entirely new framework, gaining popularity and undermining existing ones. Hence, if you are a Hadoop developer, you need to constantly keep up with the technological advancements happening around you.

As a start to understanding these frameworks, I sketched a diagram to summarize some of the key open source frameworks, their relationships, and their usage. I will evolve this diagram as I learn more, and I will not forget to share it with you all as well.


Steps

1. Feeding RDBMS data to HDFS via Sqoop (a command sketch follows this list)

2. Cleansing imported data via Pig

3. Loading HDFS data to Hive using Hive scripts. This can be done by running Hive scripts manually, or by scheduling them through the Oozie workflow scheduler

4. Hive data warehouse schemas are stored separately in an RDBMS schema (the Hive metastore)

5. In Hadoop 1.x, Spark and Shark need to be installed separately to run real-time queries against Hive data. In Hadoop 2.x, Spark and Shark can run directly on YARN

6. Batch queries can be executed directly via Hive
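
As a rough illustration of steps 1, 3 and 6 (the JDBC URL, credentials, table name and script name below are hypothetical), a Sqoop import and a Hive script execution look along these lines:

sqoop import --connect jdbc:mysql://dbhost/sales --username etl_user -P --table customers --target-dir /data/raw/customers
hive -f load_customers.hql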


Creating LXC instances on Ubuntu 14 LTS

Hypervisor Virtualization Vs OS level / Container Virtualization

Unlike hypervisor virtualization, where one or more independent machines run virtually on physical hardware via an intermediation layer, containers instead run user space on top of an operating system’s kernel. As a result, container virtualization is often called operating system-level virtualization.

Container/OS-level virtualization provides multiple isolated Linux environments on a single Linux host. Containers share the host OS kernel and make use of the guest OS system libraries to provide the required OS capabilities. This allows containers to have very low overhead and much faster startup times compared to VMs.

As a limitation, containers are also considered less secure than hypervisor virtualization. Countering this argument, however, containers lack the larger attack surface of the full operating systems deployed under hypervisor virtualization.

Well-known OS-level virtualization/container technologies include OpenVZ, Oracle Solaris Zones, and Linux Containers (LXC).

LXC Containers

Linux Containers (LXC) is a fast, lightweight, OS-level virtualization technology that allows us to host multiple isolated Linux systems on a single host.

Installing LXC on Ubuntu 14 LTS

LXC is available in Ubuntu’s default repositories. Simply type the following for a complete installation.

sudo apt-get install lxc lxctl lxc-templates

To verify the installation, type

sudo lxc-checkconfig

If everything is fine, it will show something similar to the following

Kernel configuration not found at /proc/config.gz; searching...
Kernel configuration found at /boot/config-3.13.0-32-generic
--- Namespaces ---
Namespaces: enabled
Utsname namespace: enabled
Ipc namespace: enabled
Pid namespace: enabled
User namespace: enabled
Network namespace: enabled
Multiple /dev/pts instances: enabled

--- Control groups ---
Cgroup: enabled
Cgroup clone_children flag: enabled
Cgroup device: enabled
Cgroup sched: enabled
Cgroup cpu account: enabled
Cgroup memory controller: enabled
Cgroup cpuset: enabled

--- Misc ---
Veth pair device: enabled
Macvlan: enabled
Vlan: enabled
File capabilities: enabled

Note : Before booting a new kernel, you can check its configuration
usage : CONFIG=/path/to/config /usr/bin/lxc-checkconfig
Creating LXC
sudo lxc-create -n <container-name> -t <template>

The <template> can be found in the  /usr/share/lxc/templates/ folder.
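
To see which templates are available on your installation, you can list that folder:

ls /usr/share/lxc/templates/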

For example, if you need to create an Ubuntu container, you may execute,

sudo lxc-create -n ubuntu01 -t ubuntu

If you want to create an OpenSUSE container you may execute,

sudo lxc-create -n opensuse1 -t opensuse

If you want to create a Centos container, you may execute,

sudo apt-get install yum   # yum is required as a prerequisite for the CentOS template
sudo lxc-create -n centos01 -t centos

Once created, you should be able to list all the created LXCs.

sudo lxc-ls

To list the complete container information,

sudo lxc-info -n ubuntu01
Starting LXC

Execute the following command to start a created container.

sudo lxc-start -n ubuntu01 -d

Now use the following to log in to a started container.

sudo lxc-console -n ubuntu01

The default userid/password is ubuntu/ubuntu.

[To exit from the console, press “Ctrl+a” followed by the letter “q”.]

If you need to see the assigned IP address and the state of any created instance,

sudo lxc-ls --fancy ubuntu01
Stopping LXC
sudo lxc-stop -n ubuntu01
Cloning LXC
sudo lxc-stop -n ubuntu01
sudo lxc-clone ubuntu01 ubuntu02
sudo lxc-start -n ubuntu02
Deleting LXC

sudo lxc-destroy -n ubuntu01

Managing LXC using a Web Console
sudo wget http://lxc-webpanel.github.io/tools/install.sh -O - | bash

Then, access the LXC web panel at http://<ip-address>:5000. The default username/password is admin/admin.

References:

1. Setting up Multiple Linux System Containers using Ubuntu 14 LTS – http://www.unixmen.com/setting-multiple-isolated-linux-systems-containers-using-lxc-ubuntu-14-04/

2. LXC Complete Guide – https://help.ubuntu.com/12.04/serverguide/lxc.html

3. The Evolution of Linux Containers and Future – https://dzone.com/articles/evolution-of-linux-containers-future

4. Can containers really ship software –  https://dzone.com/articles/can-containers-really-ship-software


Apache Hadoop Summit 2015 (Brussels) – Videos and Slides

You can catch up on some very important sessions from the Hadoop world by going through the session videos and slides linked below.

http://2015.hadoopsummit.org/brussels/agenda/


About Mesosphere

Mesosphere

URL: http://mesosphere.com/

The Mesosphere DCOS (Data Center Operating System) is a new kind of operating system that organizes all of your machines, VMs, and cloud instances into a single pool of intelligently and dynamically shared resources. It runs on top of and enhances any modern version of Linux.

The Mesosphere DCOS is highly-available and fault-tolerant, and runs in both private and public clouds (in Data centers).

The Mesosphere DCOS includes a distributed systems kernel with enterprise-grade security, based on Apache Mesos. Data center services are available through both public and private repositories. You can build your own data center services or use data center services built and supported by Mesosphere and third parties. Data center services include Apache Spark, Apache Cassandra, Apache Kafka, Apache Hadoop, Apache YARN, Apache HDFS, Google’s Kubernetes, and more.
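
As an illustration only (assuming the DCOS CLI is installed and pointed at a cluster, and that the package name exists in the configured repository), installing one of these services is a one-line operation:

dcos package install spark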

The Mesosphere DCOS runs containerized workloads at scale, managing resource isolation (including memory, processor, and network) and optimizing resource utilization. It is highly adaptable, employing plug-ins for native Linux containers, Docker containers, and other emerging container systems. Automating standard operations, the Mesosphere DCOS is highly-elastic, highly-available and fault-tolerant.

Apache Mesos, on which the DCOS is based, was invented at UC Berkeley’s AMPLab and is used at large scale in production at companies like Twitter, Netflix, and Airbnb.

Apache Mesos

URL: http://mesos.apache.org/

Apache Mesos abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively.

References

1. Scale like Twitter with Apache Mesos – http://opensource.com/business/14/8/interview-chris-aniszczyk-twitter-apache-mesos

2. An Introduction to Mesosphere – https://www.digitalocean.com/community/tutorials/an-introduction-to-mesosphere

3. How to configure a production ready Mesosphere cluster on Ubuntu 14.04 – https://www.digitalocean.com/community/tutorials/how-to-configure-a-production-ready-mesosphere-cluster-on-ubuntu-14-04

4. Mesosphere Introductory Course – http://open.mesosphere.com/intro-course/intro.html


Cloud based SOA e-Government infrastructure in Sri Lanka

My talk on “Cloud based SOA e-Government Infrastructure in Sri Lanka”, which I did at WSO2Con 2014, can be found at the following URL:

http://youtu.be/f3pjtsX8mm0


EA Vs SOA – Part (1)

I recently found this article on the Open Group web site relating SOA to the TOGAF EA reference model. As EA/SOA practitioners, I guess this article would give anyone a complete picture of how an EA reference architecture can map to SOA.

http://www.opengroup.org/soa/source-book/togaf/entsoa.htm

The following link gives another perspective on EA and might be useful:

http://grahamberrisford.com/01EAingeneral/EA%20in%20general.htm


Experiencing RedHat via Ubuntu

In the recent past, virtualization has opened up many avenues for most IT people with its dynamic capabilities and architecture. Cloud computing is one of the strong concepts that inherits from it.

As IT people, we too get added benefits from this dynamic capability. Almost all OSs now give you the ability to experience other OSs from your native OS (I know this is nothing new to most of you). You do need some virtualization software running on your machine to get this working. KVM and VirtualBox are a couple of the simple tools that can make this work for you.

Recently, I started to get the hang of Red Hat (I have been using Debian/Ubuntu for some time now) after starting to follow a course at OpenEd. Since my laptop was giving me trouble creating a separate partition for RHEL 6 (Red Hat Enterprise Linux 6), mainly due to device driver issues, I opted to run a virtualization manager like KVM on top of my Ubuntu installation and run RHEL 6 as a virtual machine. For the time being, that basically solved my problem.

If you google the above requirement, I am sure you will find a lot of resources on the matter. I did just that and found the following link, which I am sure can speed things up for you.

http://www.howtogeek.com/117635/how-to-install-kvm-and-create-virtual-machines-on-ubuntu/

Prerequisites:

A system with a 64-bit CPU and hardware virtualization support (Intel VT-x or AMD-V) is required to run KVM.
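
To check whether your CPU exposes the required virtualization extensions, you can run the following; a count greater than zero means the extensions are present. On Ubuntu, kvm-ok from the cpu-checker package gives a more definitive answer.

egrep -c '(vmx|svm)' /proc/cpuinfo
sudo apt-get install cpu-checker
sudo kvm-ok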


Setting Up HBase (Standalone) on Ubuntu 12.04

Though HDFS is the default distributed file system attached to Hadoop, HBase came into the limelight due to several limitations in HDFS.

HDFS Limitations

1. HDFS is optimized for streaming relatively large files, typically hundreds of MB and upwards, which are accessed through MapReduce in “batch mode“.

2. HDFS files are write-once, read-many and do not perform well with random writes or random reads.

To overcome the above, HBase was developed with the following features.

HBase Features

1. Real-time access to small amounts of data from a large data set, such as a billion-row table.

2. Fast scanning across tables.

3. Flexible data model.

So, with all the above, let’s dive into the world of HBase.

Installation (Standalone Mode)

HBase installation can happen in three different modes.

1. Standalone / Local Mode

2. Pseudo-Distributed Mode

3. Fully Distributed Mode

Just to get a feel for HBase, we will first try out the standalone mode here.

Step 1: Download HBase. Use binaries at the Apache HBase mirrors.

In this tutorial I am using the hbase-0.94.8.tar.gz.

Further, you can check the configuration dependencies at http://hbase.apache.org/book/configuration.html (see Table 2.1, Hadoop Version Support Matrix). According to this, you are required to have Hadoop 1.0.3 to run this version of HBase.

Step 2: Extract the HBase binaries to a desired location. Here I extract them to /usr/local/hbase-0.94.8 and make that HBASE_HOME.

Step 3: Set the HBASE_HOME in .bashrc

You are required to define HBASE_HOME in the .bashrc file and add it to the PATH.

----
export HBASE_HOME=/usr/local/hbase-0.94.8
----
----
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HBASE_HOME/bin:$PATH
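
After editing, reload the file so the new variables take effect in the current shell:

source ~/.bashrc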

Step 4: Change the HBase configuration files.

Edit $HBASE_HOME/conf/hbase-env.sh to set JAVA_HOME:

export JAVA_HOME=/usr/local/jdk1.6.0_37

Edit $HBASE_HOME/conf/hbase-site.xml:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///usr/local/hbase-0.94.8/var</value>
  </property>
</configuration>

The hbase.rootdir needs to be specified in order to give HBase a fixed working directory. Otherwise the system uses a temporary folder that it assigns itself.

Step 5: Change the /etc/hosts file to reflect your external IP.

The changed entries should look like below. The second line should carry your external IP (192.168.1.33 here) or 127.0.0.1, replacing the default 127.0.1.1.

127.0.0.1       localhost
192.168.1.33    crishantha-Notebook-PC

Step 6: Start HBase

$ $HBASE_HOME/bin/start-hbase.sh

If everything goes well, HBase should start successfully.

Check the $HBASE_HOME/logs folder for any errors while starting.

Step 7: Start the HBase shell – Once HBase is started, make sure the HBase shell is working. Execute the following command to ensure this.

$ hbase shell
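
As a quick sanity check from within the shell (the table and column family names below are arbitrary examples), you can create a table, insert a cell and read it back:

create 'test', 'cf'
put 'test', 'row1', 'cf:greeting', 'Hello HBase'
scan 'test'
get 'test', 'row1'
disable 'test'
drop 'test'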

References

1. http://diggdata.in/post/67561846971/fetch-data-from-hbase-database-from-r-using-rhbase

2. http://blog.revolutionanalytics.com/2011/09/mapreduce-hadoop-r.html

3. https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md


Setting up RHadoop on Ubuntu 12.04

RHadoop is an open-source package collection developed by Revolution Analytics that binds R to Hadoop and allows MapReduce algorithms to be expressed in R. RHadoop is the most mature and best integrated project connecting R with Hadoop so far. It was written by Antonio Piccolboni, with support from Revolution Analytics.

RHadoop Packages

RHadoop consists of 3 main packages.

1. rmr - Provides MapReduce functionality in R

2. rhdfs – Provides HDFS file management in R

3. rhbase – Provides HBase database management in R

RHadoop binaries can be downloaded from: https://github.com/RevolutionAnalytics/RHadoop

Prerequisites

1. Make sure the Java, R and Hadoop binaries are installed on the machine on which you intend to install RHadoop

2. You can test this by executing the following commands

$ java -version
$ hadoop version
$ R

3. Check that the JAVA_HOME, HADOOP_HOME and HADOOP_CMD variables are set in the $HOME/.bashrc file. If not, add them.

export JAVA_HOME=/usr/local/jdk1.6.0_37
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CMD=$HADOOP_HOME/bin/hadoop

-
-
-
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH

4. To check the Hadoop installation, run $HADOOP_HOME/bin/start-all.sh and verify that all the nodes of the Hadoop cluster are functioning properly.

Installation

Step 1: Install rmr with its dependencies (Rcpp, RJSONIO, digest, functional, stringr and plyr).

If you use the command prompt to do this,

sudo R CMD INSTALL Rcpp_0.10.2.tar.gz
sudo R CMD INSTALL RJSONIO_1.0-1.tar.gz
sudo R CMD INSTALL digest_0.6.2.tar.gz
sudo R CMD INSTALL functional_0.1.tar.gz
sudo R CMD INSTALL stringr_0.6.2.tar.gz
sudo R CMD INSTALL plyr_1.8.tar.gz
sudo R CMD INSTALL rmr2_2.0.2.tar.gz

Step 2: Install rhdfs with its dependencies (rJava)

sudo JAVA_HOME=/usr/local/jdk1.6.0_37 R CMD javareconf
sudo R CMD INSTALL rJava_0.9-3.tar.gz
sudo HADOOP_CMD=/usr/local/hadoop/bin/hadoop R CMD INSTALL rhdfs_1.0.5.tar.gz

The library versions given above are tested and working. You are not required to stick to these, mainly because they may no longer be the latest around when you try out this tutorial. If you are not sure about the dependency versions, let the package repositories resolve them for you by using the install.packages command within R.

For example,

> install.packages("Rcpp", lib="/usr/local/lib/R/site-library")

You may continue this for all the dependencies; most of the time it will work for you.

Step 3: Verify rmr and rhdfs installations (within R)

> library(rmr2,lib.loc="/usr/local/lib/R/site-library")
> library(rhdfs,lib.loc="/usr/local/lib/R/site-library")

In the above, “lib.loc” is where R has installed the specific dependency libraries. If it is not specified, R looks for libraries in the $HOME/R folder. RHadoop sometimes throws errors because it cannot find the specific libraries, hence it is always good to specify the library location in the library command.

If there are no errors, you have successfully managed to install rmr and rhdfs on top of R and Hadoop.
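
To go one step further and verify that MapReduce jobs actually run from R, a minimal sketch along the following lines can be used (hdfs.init, to.dfs, mapreduce, keyval and from.dfs are the standard rhdfs/rmr2 functions; the input numbers are arbitrary):

library(rhdfs)
library(rmr2)
hdfs.init()                                 # initialise the HDFS connection (rhdfs)
small.ints <- to.dfs(1:10)                  # write a small vector to HDFS
squares <- mapreduce(input = small.ints,
                     map = function(k, v) keyval(v, v^2))  # emit (value, value squared) pairs
from.dfs(squares)                           # read the MapReduce output back into R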

References:

1. http://bighadoop.wordpress.com/2013/02/25/r-and-hadoop-data-analysis-rhadoop/
