Thrift is an interface definition language (IDL) and binary communication protocol. It is used as an RPC framework; it was developed at Facebook and is now maintained as an open source project by the Apache Software Foundation.
Thrift is used by many popular projects such as Facebook, Hadoop, Cassandra, HBase, and Hypertable.
Why Thrift over JSON?
Thrift is lightweight and language independent, and supports data transport/serialization for building cross-language services.
1) Strong Typing
2) Performance

Performance is one of Thrift’s main design considerations. JSON is geared much more towards human readability, which comes at the cost of being more CPU intensive to work with.
If you are serializing large amounts of data, Thrift’s binary protocol is more efficient than JSON.
4) Versioning Support
Thrift has inbuilt mechanisms for versioning data. This can be very helpful in a distributed environment where your service interfaces may change, but you cannot atomically update all your client and server code.
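Versioning works through Thrift’s numbered fields and required/optional annotations: readers skip field ids they do not recognize, so adding a new optional field does not break old clients. As an illustration (the service, struct, and field names below are made up for this sketch), an IDL file can be written out and later compiled once the Thrift compiler is installed:

```shell
# Write a minimal, hypothetical Thrift IDL file.
# Field 3 was added in a later version; because it is optional and keeps a
# unique numeric id, old clients and servers still interoperate.
cat > user_service.thrift <<'EOF'
struct User {
  1: required i64 id,
  2: required string name,
  3: optional string email   // added later; safe for old readers
}

service UserService {
  User getUser(1: i64 id)
}
EOF

# With the Thrift compiler installed, generate per-language bindings:
# thrift --gen java user_service.thrift
# thrift --gen py user_service.thrift
```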
5) Server Implementation
Thrift includes RPC server implementations for a number of languages. Because they are streamlined to support just Thrift requests, they are lighter-weight and higher-performance than a typical HTTP server stack serving JSON.
It is more work to get started on the client side when clients build the calling code directly. It is less work for the service owner if they build libraries for clients. The bottom line: if you are providing a simple service and API, Thrift is probably not the right tool.
1. Thrift Tutorial - http://thrift-tutorial.readthedocs.org/en/latest/intro.html
2. The programmers guide to Apache Thrift (MEAP) – Chapter 01
3. Writing your first Thrift Service - http://srinathsview.blogspot.com/2011/09/writing-your-first-thrift-service.html
As you may be aware, the Hadoop ecosystem consists of many open source tools. A lot of research is going on in this area, and every day you see a new version of an existing framework, or a brand new framework, gaining popularity and undermining existing ones. Hence, if you are a Hadoop developer, you need to constantly keep track of the technological advancements happening around you.
As a start to understanding the technological frameworks around Hadoop, I tried to sketch a diagram summarizing some of the key open source frameworks, their relationships, and their usage. I will evolve this diagram as I learn more, and I will not forget to share it with you all as well.
1. Feeding RDBMS data to HDFS via Sqoop
2. Cleansing imported data via Pig
4. Hive data warehouse schemas are stored separately in an RDBMS schema (the Hive metastore)
6. Batch queries can be executed directly via Hive
Hypervisor Virtualization vs. OS-level / Container Virtualization
Unlike hypervisor virtualization, where one or more independent machines run virtually on physical hardware via an intermediation layer, containers instead run user space on top of an operating system’s kernel. As a result, container virtualization is often called operating system-level virtualization.
Container/OS-level virtualization provides multiple isolated Linux environments on a single Linux host. Each container shares the host OS kernel and uses its own system libraries to provide the required OS capabilities. This allows containers to have a very low overhead and a much faster startup time compared to VMs.
As a limitation, containers are often considered less secure than hypervisor virtualization. Countering this argument, however, containers present a smaller attack surface than the full operating systems deployed under hypervisor virtualization.
The most notable recent OS-level virtualization/container technologies are OpenVZ, Oracle Solaris Zones, and Linux Containers (LXC).
Linux Containers (LXC) is a fast, lightweight OS-level virtualization technology that allows us to host multiple isolated Linux systems on a single host.
Installing LXC on Ubuntu 14 LTS
LXC is available on Ubuntu default repositories. Simply type the following for a complete installation.
sudo apt-get install lxc lxctl lxc-templates
To check the successful completion, type

lxc-checkconfig
If everything is fine, it will show something similar to the following
Kernel configuration not found at /proc/config.gz; searching...
Kernel configuration found at /boot/config-3.13.0-32-generic
--- Namespaces ---
Namespaces: enabled
Utsname namespace: enabled
Ipc namespace: enabled
Pid namespace: enabled
User namespace: enabled
Network namespace: enabled
Multiple /dev/pts instances: enabled
--- Control groups ---
Cgroup: enabled
Cgroup clone_children flag: enabled
Cgroup device: enabled
Cgroup sched: enabled
Cgroup cpu account: enabled
Cgroup memory controller: enabled
Cgroup cpuset: enabled
--- Misc ---
Veth pair device: enabled
Macvlan: enabled
Vlan: enabled
File capabilities: enabled

Note : Before booting a new kernel, you can check its configuration
usage : CONFIG=/path/to/config /usr/bin/lxc-checkconfig
sudo lxc-create -n <container-name> -t <template>
The <template> can be found in the /usr/share/lxc/templates/ folder.
For example, if you need to create an Ubuntu container, you may execute,
sudo lxc-create -n ubuntu01 -t ubuntu
If you want to create an OpenSUSE container you may execute,
sudo lxc-create -n opensuse1 -t opensuse
If you want to create a Centos container, you may execute,
sudo apt-get install yum   # required as a prerequisite for the CentOS template
sudo lxc-create -n centos01 -t centos
Once created, you should be able to list all the created LXCs.

sudo lxc-ls
To list the complete container information,
sudo lxc-info -n ubuntu01
Execute the following command to start the created container.
sudo lxc-start -n ubuntu01 -d
Now use the following to log in to the started containers.
sudo lxc-console -n ubuntu01
The default userid/password is ubuntu/ubuntu.
[To exit from the console, press “Ctrl+a” followed by the letter “q”.]
If you need to see the assigned IP address and the state of any created instance,
sudo lxc-ls --fancy ubuntu01
To stop a running container,

sudo lxc-stop -n ubuntu01

To clone a container, stop it first, then clone and start the copy.

sudo lxc-stop -n ubuntu01
sudo lxc-clone ubuntu01 ubuntu02
sudo lxc-start -n ubuntu02

To permanently delete a container,

sudo lxc-destroy -n ubuntu01
Managing LXC using a Web Console
To install the LXC Web Panel, execute,

wget http://lxc-webpanel.github.io/tools/install.sh -O - | sudo bash
Then, access the LXC Web Panel using the URL http://<ip-address>:5000. The default username/password is admin/admin.
1. Setting up Multiple Isolated Linux Systems (Containers) using Ubuntu 14.04 LTS - http://www.unixmen.com/setting-multiple-isolated-linux-systems-containers-using-lxc-ubuntu-14-04/
2. LXC Complete Guide – https://help.ubuntu.com/12.04/serverguide/lxc.html
3. The Evolution of Linux Containers and Future – https://dzone.com/articles/evolution-of-linux-containers-future
4. Can containers really ship software – https://dzone.com/articles/can-containers-really-ship-software
You can catch some very important sessions on the Hadoop world if you go through the sessions.
The Mesosphere DCOS (Data Center Operating System) is a new kind of operating system that organizes all of your machines, VMs, and cloud instances into a single pool of intelligently and dynamically shared resources. It runs on top of and enhances any modern version of Linux.
The Mesosphere DCOS is highly-available and fault-tolerant, and runs in both private and public clouds (in Data centers).
The Mesosphere DCOS includes a distributed systems kernel with enterprise-grade security, based on Apache Mesos. Data center services are available through both public and private repositories. Build your own data center services, or use data center services built and supported by Mesosphere and third parties. Data center services include Apache Spark, Apache Cassandra, Apache Kafka, Apache Hadoop, Apache YARN, Apache HDFS, Google’s Kubernetes, and more.
The Mesosphere DCOS runs containerized workloads at scale, managing resource isolation (including memory, processor, and network) and optimizing resource utilization. It is highly adaptable, employing plug-ins for native Linux containers, Docker containers, and other emerging container systems. Automating standard operations, the Mesosphere DCOS is highly-elastic, highly-available and fault-tolerant.
Apache Mesos was invented at UC Berkeley’s AMPLab and is used at large scale in production at companies like Twitter, Netflix, and Airbnb.
Apache Mesos abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively.
1. Scale like Twitter with Apache Mesos – http://opensource.com/business/14/8/interview-chris-aniszczyk-twitter-apache-mesos
2. An Introduction to Mesosphere – https://www.digitalocean.com/community/tutorials/an-introduction-to-mesosphere
3. How to configure a production ready Mesosphere cluster on Ubuntu 14.04 – https://www.digitalocean.com/community/tutorials/how-to-configure-a-production-ready-mesosphere-cluster-on-ubuntu-14-04
4. Mesosphere Introductory Course – http://open.mesosphere.com/intro-course/intro.html
I recently found this article on the TOGAF Open Group web site relating SOA to the TOGAF EA reference model. As EA/SOA practitioners, I guess this article would give anyone a complete picture of how an EA reference architecture can map to SOA.
The following link will give another perspective about EA and might be useful:
In the recent past, virtualization has opened up many avenues for most IT people with its dynamic capabilities and architecture. Cloud computing is one of the strongest concepts to inherit from it.
As IT people, we too get added benefits from this dynamic capability. Almost all OSs now give you the ability to experience other OSs from within your native OS (I know this is nothing new to most of you). You do need some virtualization software running on your machine to get this working. KVM and VirtualBox are a couple of the simple packages that can make this work for you.
Recently, I started to get the hang of Red Hat (I had been using Debian/Ubuntu for some time) after starting to follow a course at OpenEd. Since my laptop was giving me trouble with keeping a separate partition for RHEL6 (Red Hat Enterprise Linux 6), mainly due to device driver issues, I opted to run a virtual machine manager like KVM on top of my Ubuntu and have RHEL6 running as a virtual machine. That temporarily solved my problem.
If you google the above requirement, I am sure you will find a lot of resources on the matter. I did too, and found the following link, which I am sure can speed up your experience.
A system with a 64-bit CPU is required to run KVM.
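Beyond the 64-bit requirement, KVM also needs the CPU’s hardware virtualization extensions (Intel VT-x or AMD-V). A quick, commonly used check on Linux is to count the relevant CPU flags; a result of 0 means KVM hardware acceleration is not available:

```shell
# vmx = Intel VT-x, svm = AMD-V; a count of 0 means the CPU offers no
# hardware virtualization support and KVM guests would be very slow.
egrep -c '(vmx|svm)' /proc/cpuinfo
```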
Though HDFS is the default distributed file system attached to Hadoop, HBase came into the limelight due to several limitations in HDFS.
1. HDFS is optimized for streaming relatively large files, 100s of MB and upwards, which are accessed through MapReduce in “batch mode“.
2. HDFS files are write-once, read-many files, and HDFS does not perform well for random writes or random reads.
To overcome the above, HBase was developed with the following features:

1. Access to small amounts of data from a large (billion-row) data set in real time.
2. Fast scanning across tables.
3. A flexible data model.

So, with all the above, let’s dive into the world of HBase.
Installation (Standalone Mode)
HBase installation can happen in three different modes:

1. Standalone / Local Mode
2. Pseudo-Distributed Mode
3. Fully Distributed Mode

Just to get a feel for HBase, we will first try out the Standalone Mode here.
Step 1: Download HBase. Use binaries at the Apache HBase mirrors.
In this tutorial I am using the hbase-0.94.8.tar.gz.
Further, you can check the configuration dependencies at http://hbase.apache.org/book/configuration.html (see Table 2.1, “Hadoop version support matrix”). According to this, you are required to have Hadoop 1.0.3 to run HBase.
Step 2: Extract the HBase binaries to a desired location. Here I extract them to /usr/local/hbase-0.94.8 and make it HBASE_HOME.
Step 3: Set the HBASE_HOME in .bashrc
You are required to define HBASE_HOME and add it to the PATH in the .bashrc file.

export HBASE_HOME=/usr/local/hbase-0.94.8
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HBASE_HOME/bin:$PATH
Step 4: Change the HBase configuration files.

Edit $HBASE_HOME/conf/hbase-env.sh to set JAVA_HOME.

Then add the following to $HBASE_HOME/conf/hbase-site.xml:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///usr/local/hbase-0.94.8/var</value>
  </property>
</configuration>

The hbase.rootdir needs to be specified in order to give HBase a fixed working directory. Otherwise the system uses a temporary folder it assigns itself.
Step 5: Change the /etc/hosts file to reflect your external IP.

The changed entry should look like the following. The second line should carry the external IP (192.168.1.33 here), or 127.0.0.1, replacing the default 127.0.1.1.

127.0.0.1	localhost
192.168.1.33	crishantha-Notebook-PC
Step 6: Start HBase

$HBASE_HOME/bin/start-hbase.sh

If everything goes well, you should be able to start HBase successfully.
Check $HBASE_HOME/logs folder to find any errors while starting.
Step 7: Start the HBase shell – Once HBase is started, make sure the HBase shell is working. Execute the following command to ensure it.
$ hbase shell
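Once inside the shell, a short smoke test confirms that writes and reads work end to end. The commands below are only a sketch (the table name 'test' and column family 'cf' are made-up examples); they are saved to a script file so they can be fed to the shell in one go:

```shell
# Save a small HBase shell script (hypothetical table/column family names).
cat > smoke.hbase <<'EOF'
create 'test', 'cf'
put 'test', 'row1', 'cf:greeting', 'Hello HBase'
scan 'test'
disable 'test'
drop 'test'
EOF

# Run it against the local standalone instance (HBase must be running):
# hbase shell smoke.hbase
```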
RHadoop is an open-source package developed by Revolution Analytics that binds R to Hadoop and allows MapReduce algorithms to be expressed in R. RHadoop is the most mature and best-integrated project connecting R with Hadoop so far; it was written by Antonio Piccolboni with support from Revolution Analytics.
RHadoop consists of 3 main packages:

1. rmr – Provides MapReduce functionality in R
2. rhdfs – Provides HDFS file management in R
3. rhbase – Provides HBase database management in R
RHadoop binaries can be downloaded from: https://github.com/RevolutionAnalytics/RHadoop
1. Make sure the Java, R, and Hadoop binaries are installed on the machine on which you intend to install RHadoop.

2. You can test this by executing the following commands:

$ java -version
$ hadoop version
$ R
3. Check that the JAVA_HOME, HADOOP_HOME, and HADOOP_CMD variables are set in the $HOME/.bashrc file. If not, add them.
export JAVA_HOME=/usr/local/jdk1.6.0_37
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CMD=$HADOOP_HOME/bin/hadoop
- - -
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
4. Just to check the Hadoop installation, run $HADOOP_HOME/bin/start-all.sh and see that all the nodes of the Hadoop cluster are functioning properly.
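The three variables from item 3 can be verified in one go with a small POSIX shell loop (printenv prints nothing and exits non-zero for unset variables, so the fallback marks them):

```shell
# Print each required variable, or <unset> if it is not defined.
for v in JAVA_HOME HADOOP_HOME HADOOP_CMD; do
  printf '%s=%s\n' "$v" "$(printenv "$v" || echo '<unset>')"
done
```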
Step 1: Install rmr with its dependencies (Rcpp, RJSONIO, digest, functional, stringr and plyr).
If you use the command prompt to do this,
sudo R CMD INSTALL Rcpp_0.10.2.tar.gz
sudo R CMD INSTALL RJSONIO_1.0-1.tar.gz
sudo R CMD INSTALL digest_0.6.2.tar.gz
sudo R CMD INSTALL functional_0.1.tar.gz
sudo R CMD INSTALL stringr_0.6.2.tar.gz
sudo R CMD INSTALL plyr_1.8.tar.gz
sudo R CMD INSTALL rmr2_2.0.2.tar.gz
Step 2: Install rhdfs with its dependencies (rJava)
sudo JAVA_HOME=/usr/local/jdk1.6.0_37 R CMD javareconf
sudo R CMD INSTALL rJava_0.9-3.tar.gz
sudo HADOOP_CMD=/usr/local/hadoop/bin/hadoop R CMD INSTALL rhdfs_1.0.5.tar.gz
The library versions given above are tested and working. However, you are not required to stick to them, mainly because they may not be the latest around by the time you try this tutorial. If you are not sure about the dependency versions, let the Ubuntu repositories decide for you by using the install.packages command within R.
> install.packages("Rcpp", lib="/usr/local/lib/R/site-library")
You may continue this for all the dependencies, and most of the time it will work for you.
Step 3: Verify rmr and rhdfs installations (within R)
> library(rmr2, lib.loc="/usr/local/lib/R/site-library")
> library(rhdfs, lib.loc="/usr/local/lib/R/site-library")
In the above, “lib.loc” is where R has installed the specific dependency libraries. If it is not specified, R looks in the $HOME/R folder for libraries. RHadoop sometimes throws errors when it cannot find specific libraries, hence it is always good to specify the library location in the library() command.
If there are no errors, you have successfully managed to install rmr and rhdfs on top of R and Hadoop.