Today (08/07/2018) I had the privilege of giving a one-hour presentation on “Towards a Cloud Enabled Data Intensive Digital Transformation” to Jaffna University IT students. I hope they were able to learn something from this presentation. You can reach the slide deck using the following link:
With the advent of Big Data, enterprise applications nowadays follow a data-intensive, microservices-based architecture, moving away from the monolithic architectures we have been used to for decades.
These data intensive applications should meet a set of requirements.
1. Ingest data at scale without loss
2. Analyze data in real time
3. Trigger actions based on the analyzed data
4. Store the data at cloud scale
5. Run on a distributed and highly resilient cloud platform
SMACK is such a stack. It can be used for building modern enterprise applications because it meets each of the above requirements with a loosely coupled tool chain of technologies that are all open source and production-proven at scale.
(S – Spark, M – Mesos, A – Akka, C – Cassandra, K – Kafka)
- Spark – A general engine for large-scale data processing, enabling analytics from SQL queries to machine learning, graph analytics, and stream processing
- Mesos – Distributed systems kernel that provides resourcing and isolation across all the other SMACK stack components. Mesos is the foundation on which other SMACK stack components run.
- Akka – A toolkit and runtime to easily create concurrent and distributed apps that are responsive to messages.
- Cassandra – Distributed database management system that can handle large amounts of data across servers with high availability.
- Kafka – A high throughput, low-latency platform for handling real-time data feeds with no data loss.
Spark is a fast, general-purpose cluster computing system for Big Data, written in Scala. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing. Apache Spark is built on one main concept, the “Resilient Distributed Dataset (RDD)”.
(Python vs Scala vs Java) with Spark?
The languages most commonly used with Spark are Python and Scala. Both have a concise syntax and, compared to Java, are quite easy to follow. However, Scala tends to be faster than Python, mainly because Spark itself is written in Scala, so Scala code avoids the overhead of passing through an additional layer of libraries, as Python code must. In general, though, both are capable of handling almost all use cases.
Installing Apache Spark with Python
To complete this task, follow the steps below one by one.
Step 1: Install Java
- I assume you already have the Java Development Kit (JDK) installed on your machine. As of March 2018, the JDK version supported by Spark is JDK 8.
- You may verify the Java installation
$ java -version
Step 2: Install Scala
- If you do not have Scala installed in your system, use this link to install it.
- Get the “tgz” bundle and extract it to the /usr/local/scala folder (this is a best practice)
$ tar xvf scala-2.11.12.tgz
# Extract Scala into the /usr/local folder
$ su -
$ mv scala-2.11.12 /usr/local/scala
$ exit
- Then update .bashrc to set SCALA_HOME and add $SCALA_HOME/bin to the $PATH.
- Finally, verify the Scala installation
$ scala -version
Step 3: Install Spark
After installing both Java and Scala, you are ready to download Spark. Use this link to download the “tgz” file.
Note: When downloading Apache Spark, be sure to get the latest release of Spark 2.3.
$ tar xvf spark-2.3.0-bin-hadoop2.7.tgz
# Extract Spark into /usr/local/spark
$ su -
$ mv spark-2.3.0-bin-hadoop2.7 /usr/local/spark
$ exit
- Then update .bashrc to set SPARK_HOME and add $SPARK_HOME/bin to the $PATH.
- Now you may verify the Spark installation by starting the Spark shell
$ spark-shell
If all goes well, you will see a Spark prompt. (Use :quit or Ctrl+D to exit from the shell.)
Step 4: Install Python
You may install Python using Canopy
Use this link to download the binaries to your system. (I used the Linux 64-bit Python 3.5 download for this blog.)
Once you have installed Canopy, you have a Python development environment to work with Spark, with all the libraries including PySpark.
Once all of these are installed, you can try PySpark by just typing “pyspark” in the terminal window.
This will allow you to continue to execute your Python scripts on Spark.
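As a quick check that Python scripts really run on Spark, a small script along the following lines may help. It is only a sketch: the function names, the application name "install-check" and the local[*] master are my own choices for illustration, and it assumes the pyspark package is importable (for example when the script is launched with spark-submit).

```python
def expected_sum_of_squares(n):
    """Closed-form value of 1^2 + 2^2 + ... + n^2, used to verify Spark's answer."""
    return n * (n + 1) * (2 * n + 1) // 6


def spark_sum_of_squares(n):
    """Compute 1^2 + ... + n^2 as an RDD job on a local Spark session."""
    # Imported lazily so that a missing PySpark install can be reported cleanly.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")          # assumption: run on the local machine
             .appName("install-check")    # hypothetical application name
             .getOrCreate())
    try:
        numbers = spark.sparkContext.parallelize(range(1, n + 1))
        return numbers.map(lambda x: x * x).sum()
    finally:
        spark.stop()


def check_install(n=10):
    """Return a short status message describing whether the Spark job worked."""
    try:
        result = spark_sum_of_squares(n)
    except ImportError:
        return "pyspark is not importable; check SPARK_HOME and PYTHONPATH"
    except Exception as exc:  # e.g. Java missing or Spark failing to start
        return "Spark failed to run the job: {}".format(exc)
    if result == expected_sum_of_squares(n):
        return "ok"
    return "unexpected result: {}".format(result)


if __name__ == "__main__":
    print(check_install())
```

Save it as, say, check_spark.py and run it with spark-submit check_spark.py. A printed "ok" suggests that Java, Spark and Python are wired together correctly.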
This is the continuation of my previous article on “Installing Hadoop 2.6 on Ubuntu 16.04“. This article will explain how we run one of the examples given with the Hadoop binary.
Once the Hadoop installation is complete, you can run the “wordcount” example provided with Hadoop in order to test a MapReduce job. This example is bundled in the hadoop-mapreduce-examples jar file in the distribution. (See the steps below for more details.)
Step 1: Start the Hadoop Cluster, if not already started.
$ /usr/local/hadoop/sbin/start-dfs.sh
$ /usr/local/hadoop/sbin/start-yarn.sh
Step 2: Copy the text files that you are going to consider for a “wordcount” to a local folder (/home/hadoop/textfiles)
Step 3: Copy the text files (in the local folder) to HDFS.
$ echo "Word Count Text File" > textFile.txt
$ hdfs dfs -mkdir -p /user/hduser/dfs
$ hadoop dfs -copyFromLocal textFile.txt /user/hduser/dfs
$ hadoop dfs -ls /user/hduser/dfs
$ cd /usr/local/hadoop
$ hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.1.jar wordcount /user/hduser/dfs /user/hduser/dfs-output
You can either choose the command line or the web interface to display the contents of the HDFS directories. If you choose the command line you can try the following command.
$ hadoop dfs -ls /user/hduser/dfs-output
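To make clearer what the “wordcount” job computes, here is a rough plain-Python sketch of the map, shuffle and reduce phases. It is not the Hadoop example's actual Java source, and the function names are made up for illustration. It only mimics the data flow on a single machine.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word, counts):
    """Reducer: sum all the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three phases in order and return a {word: total} dictionary."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    totals = word_count(["Word Count Text File", "Text File"])
    for word, total in sorted(totals.items()):
        print(word, total)
```

In the real job the mappers and reducers run on different nodes and the results are written as part files under the output directory, but the per-word totals follow the same logic.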
Thrift is an interface definition language (IDL) and binary communication protocol used as an RPC framework. It was developed by Facebook and is now maintained as an open source project by the Apache Software Foundation.
Thrift is used by many popular projects and organizations, such as Facebook, Hadoop, Cassandra, HBase and Hypertable.
Why Thrift over JSON?
Thrift is lightweight and language independent, and supports data transport and serialization for building cross-language services.
1) Strong Typing
Performance is one of Thrift’s main design considerations. JSON is much more geared towards human readability, which comes at the cost of making it more CPU intensive to work with.
If you are serializing large amounts of data, Thrift’s binary protocol is more efficient than JSON.
4) Versioning Support
Thrift has inbuilt mechanisms for versioning data. This can be very helpful in a distributed environment where your service interfaces may change, but you cannot atomically update all your client and server code.
5) Server Implementation
Thrift includes RPC server implementations for a number of languages. Because they are streamlined to support only Thrift requests, they are lighter-weight and higher-performance than typical HTTP server implementations serving JSON.
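The points about strong typing and compact binary encoding can be illustrated with a small Python sketch. Note that this uses Python's standard struct module with a fixed layout I made up for illustration. It is not Thrift's actual wire format or API, but it shows the same trade-off against JSON.

```python
import json
import struct

# Hypothetical record layout: 32-bit int id, 64-bit float temperature, 1-byte bool.
# "<" selects little-endian with no padding, so the record is 4 + 8 + 1 = 13 bytes.
RECORD_FORMAT = "<id?"


def encode_binary(record_id, temperature, active):
    """Pack the record into a fixed, typed binary layout."""
    return struct.pack(RECORD_FORMAT, record_id, temperature, active)


def decode_binary(payload):
    """Unpack the binary record back into (id, temperature, active)."""
    return struct.unpack(RECORD_FORMAT, payload)


def encode_json(record_id, temperature, active):
    """Encode the same record as UTF-8 JSON text for comparison."""
    record = {"id": record_id, "temperature": temperature, "active": active}
    return json.dumps(record).encode("utf-8")


if __name__ == "__main__":
    binary = encode_binary(42, 21.5, True)
    text = encode_json(42, 21.5, True)
    print("binary bytes:", len(binary))
    print("json bytes:  ", len(text))
    print("round trip:  ", decode_binary(binary))
    try:
        encode_binary("42", 21.5, True)  # wrong type for the id field
    except struct.error as exc:
        print("rejected by the typed layout:", exc)
```

The typed layout rejects a string where an integer is expected, while JSON would happily serialize it, and the binary payload is a fraction of the size of the JSON text.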
It is more work to get started on the client side, when the clients are directly building the calling code. It’s less work for the service owner if they are building libraries for clients. The bottom line: If you are providing a simple service & API, Thrift is probably not the right tool.
1. Thrift Tutorial - http://thrift-tutorial.readthedocs.org/en/latest/intro.html
2. The programmers guide to Apache Thrift (MEAP) – Chapter 01
3. Writing your first Thrift Service - http://srinathsview.blogspot.com/2011/09/writing-your-first-thrift-service.html
As you may be aware, the Hadoop ecosystem consists of many open source tools. A lot of research is going on in this area, and every day you see a new version of an existing framework, or an altogether new framework, gaining popularity and displacing existing ones. Hence, if you are a Hadoop developer, you need to constantly keep up with the technological advancements happening around you.
As a start towards understanding these frameworks, I sketched a diagram to summarize some of the key open source frameworks and how they relate to each other in use. I will evolve this diagram as I learn more, and I will share the updates with you as well.
1. Feeding RDBMS data to HDFS via Sqoop
2. Cleansing imported data via Pig
4. Hive Data Warehouse schemas are stored separately in a Hive Data Warehouse RDBMS schema
6. Batch queries can be executed directly via Hive
You can find some very important sessions from the Hadoop world if you go through these sessions.
Though HDFS is the default distributed file system attached to Hadoop, HBase came into the limelight due to several limitations in HDFS.
1. HDFS is optimized for streaming relatively large files, from hundreds of MB upwards, which are accessed through MapReduce in “batch mode”.
2. HDFS files are write-once, read-many files, and HDFS does not perform well for random writes or random reads.
To overcome the above, HBase was developed with the following features.
1. Can access a small amount of data from a large data set, such as a billion-row table, in real time.
2. Fast scanning across tables.
3. Flexible data model.
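The three features above can be pictured with a toy in-memory table. This is only a sketch of the idea (rows sorted by row key, each row holding an arbitrary set of columns). The class and method names are invented for illustration and are not the HBase client API.

```python
import bisect


class SortedRowStore:
    """Toy table: row keys kept in sorted order, each row a dict of column -> value."""

    def __init__(self):
        self._keys = []   # sorted list of row keys
        self._rows = {}   # row key -> {column: value}

    def put(self, row_key, column, value):
        """Insert or update one cell; rows may have different columns (flexible model)."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Random read of a single row by key, without touching other rows."""
        return self._rows.get(row_key, {})

    def scan(self, start_key, stop_key):
        """Range scan over row keys in [start_key, stop_key), in sorted order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return [(key, self._rows[key]) for key in self._keys[lo:hi]]


if __name__ == "__main__":
    table = SortedRowStore()
    table.put("user#0003", "info:name", "Carol")
    table.put("user#0001", "info:name", "Alice")
    table.put("user#0001", "info:email", "alice@example.com")
    table.put("user#0002", "info:name", "Bob")
    print(table.get("user#0001"))
    print(table.scan("user#0001", "user#0003"))
```

Because the keys stay sorted, a single-row lookup and a scan over a contiguous key range both avoid reading the whole data set, which is the behaviour HDFS alone does not offer.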
So, with all of the above, let's dive into the world of HBase.
Installation (Standalone Mode)
HBase can be installed in three different modes.
1. Standalone / Local Mode
2. Pseudo Distributed Mode
3. Fully Distributed Mode
Just to get a feel for HBase, we will first try out the Standalone Mode here.
Step 1: Download HBase. Use binaries at the Apache HBase mirrors.
In this tutorial I am using the hbase-0.94.8.tar.gz.
Further, you can check the configuration dependencies at http://hbase.apache.org/book/configuration.html (see Table 2.1, Hadoop version support matrix). According to this, you are required to have Hadoop 1.0.3 to run HBase.
Step 2: Extract the HBase binaries to a desired location. Here I extract them to /usr/local/hbase-0.94.8 and make it HBASE_HOME.
Step 3: Set the HBASE_HOME in .bashrc
You are required to define HBASE_HOME and add $HBASE_HOME/bin to the PATH in the .bashrc file.
----
export HBASE_HOME=/usr/local/hbase-0.94.8
----
----
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HBASE_HOME/bin:$PATH
Step 3: Change the HBASE Configuration Files.
Edit $HBASE_HOME/conf/hbase-env.sh to set JAVA_HOME, and set hbase.rootdir in $HBASE_HOME/conf/hbase-site.xml:
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///usr/local/hbase-0.94.8/var</value>
  </property>
</configuration>
The hbase.rootdir property needs to be specified so that HBase works from a fixed directory. Otherwise the system uses a temporary folder that it assigns itself.
Step 4: Change the /etc/hosts file to reflect the external IP of yours.
After the change, the entries should look like the ones below. The second line should be given the external IP (192.168.1.33 here) or 127.0.0.1, replacing 127.0.1.1.
127.0.0.1 localhost
192.168.1.33 crishantha-Notebook-PC
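If you want to double-check the hosts file before starting HBase, a small helper like the one below can flag a leftover 127.0.1.1 entry. It is just a sketch. The function names are my own, and it works on the file's text rather than assuming where or how you read it.

```python
def addresses_for(hosts_text, hostname):
    """Return every IP address that the hosts file text maps to the given hostname."""
    addresses = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding spaces
        if not line:
            continue
        fields = line.split()
        if hostname in fields[1:]:            # fields[0] is the IP, the rest are names
            addresses.append(fields[0])
    return addresses


def needs_fix(hosts_text, hostname):
    """True if the hostname is still mapped to 127.0.1.1."""
    return "127.0.1.1" in addresses_for(hosts_text, hostname)


if __name__ == "__main__":
    before = "127.0.0.1 localhost\n127.0.1.1 crishantha-Notebook-PC\n"
    after = "127.0.0.1 localhost\n192.168.1.33 crishantha-Notebook-PC\n"
    print(needs_fix(before, "crishantha-Notebook-PC"))
    print(needs_fix(after, "crishantha-Notebook-PC"))
```

To run it against the real file, read /etc/hosts into a string and pass it in together with your machine's hostname.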
Step 5: Start HBase
If everything goes well, you should be able to start HBase successfully.
Check $HBASE_HOME/logs folder to find any errors while starting.
Step 6: Start the HBase shell – Once HBase is started, make sure the HBase shell is working. Execute the following command to check.
$ hbase shell
RHadoop is an open-source package developed by Revolution Analytics that binds R to Hadoop and allows MapReduce algorithms to be expressed in R. Written by Antonio Piccolboni with support from Revolution Analytics, RHadoop is the most mature and best integrated project for connecting R with Hadoop so far.
RHadoop consists of 3 main packages.
1. rmr - Provides MapReduce functionality in R
2. rhdfs – Provides HDFS file management in R
3. rhbase – Provides HBase database management in R
RHadoop binaries can be downloaded from: https://github.com/RevolutionAnalytics/RHadoop
1. Make sure the Java, R and Hadoop binaries are installed on the machine where you intend to install RHadoop
2. You can test this by executing the following commands
$ java -version
$ hadoop version
$ R
3. Check that the JAVA_HOME, HADOOP_HOME and HADOOP_CMD variables are set in the $HOME/.bashrc file. If not, add them.
export JAVA_HOME=/usr/local/jdk1.6.0_37
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CMD=$HADOOP_HOME/bin/hadoop
- - -
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
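A quick way to confirm that these variables are visible to new processes is a check like the following. It is a small sketch with a made-up function name. It only reports which of the three variables named above are missing or empty.

```python
import os

# The three variables this tutorial expects to find in the environment.
REQUIRED_VARIABLES = ("JAVA_HOME", "HADOOP_HOME", "HADOOP_CMD")


def missing_variables(env=None):
    """Return the required variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All required variables are set")
```

Run it from a fresh terminal so that the latest .bashrc changes are picked up.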
4. To check the Hadoop installation, run $HADOOP_HOME/bin/start-all.sh and see that all the nodes of the Hadoop cluster are functioning properly.
Step 1: Install rmr with its dependencies (Rcpp, RJSONIO, digest, functional, stringr and plyr).
If you use the command prompt to do this,
sudo R CMD INSTALL Rcpp Rcpp_0.10.2.tar.gz
sudo R CMD INSTALL RJSONIO RJSONIO_1.0-1.tar.gz
sudo R CMD INSTALL digest digest_0.6.2.tar.gz
sudo R CMD INSTALL functional functional_0.1.tar.gz
sudo R CMD INSTALL stringr stringr_0.6.2.tar.gz
sudo R CMD INSTALL plyr plyr_1.8.tar.gz
sudo R CMD INSTALL rmr rmr2_2.0.2.tar.gz
Step 2: Install rhdfs with its dependencies (rJava)
sudo JAVA_HOME=/usr/local/jdk1.6.0_37 R CMD javareconf
sudo R CMD INSTALL rJava rJava_0.9-3.tar.gz
sudo HADOOP_CMD=/usr/local/hadoop/bin/hadoop R CMD INSTALL rhdfs rhdfs_1.0.5.tar.gz
sudo R CMD INSTALL rhdfs rhdfs_1.0.5.tar.gz
The library versions given above are tested and working. You are not required to stick to them, mainly because they may not be the latest available when you try out this tutorial. If you are not sure about the dependency versions, let R resolve them from the CRAN repositories by using the install.packages command within R.
> install.packages("Rcpp", lib="/usr/local/lib/R/site-library")
You may continue this for all the dependencies and most of the time it works for you.
Step 3: Verify rmr and rhdfs installations (within R)
> library(rmr2,lib.loc="/usr/local/lib/R/site-library") > library(rhdfs,lib.loc="/usr/local/lib/R/site-library")
Above, “lib.loc” is where R has installed the specific dependency libraries. If it is not specified, R looks in the $HOME/R folder for libraries. RHadoop sometimes throws errors because it cannot find specific libraries. Hence it is always good to specify the library location in the library command.
If there are no errors, you have successfully managed to install rmr and rhdfs on top of R and Hadoop.
If you are in the Data Science field or involved in any data analytics research work, you will need a data analysis tool to analyze the collected data. “R” is one of the most popular FOSS tools available now. You can get more details about R and related information from the CRAN web site (http://cran.r-project.org/).
Once you install R, it gives you only a basic UI. As an alternative, you can use RStudio (https://www.rstudio.com/) on top of the installed R binaries. RStudio is a very user-friendly interface with many features for working with R.
1. Update the repositories to get the latest R distributions:
Use sudo vi /etc/apt/sources.list to add the following:
deb http://cran.stat.ucla.edu/bin/linux/ubuntu precise/
sudo add-apt-repository "deb http://cran.stat.ucla.edu/bin/linux/ubuntu precise/"
2. Get the repository signing key and import it into apt.
gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
gpg -a --export E084DAB9 | sudo apt-key add -
3. Update the repositories.
sudo apt-get update
sudo apt-get install r-base
Now you may go to the command prompt and type R to get the R console.
There are two modes of installation in RStudio.
1. Desktop Version
2. Server Version (This is basically the preferred option mainly because of its ability to be accessed from anywhere on the network)
Installing RStudio (Desktop Version)
1. Visit www.rstudio.com and use http://www.rstudio.com/ide/download/desktop to download the latest binary available.
2. As a prerequisite, you are required to install R (R version 2.11.1 or higher). Follow steps given above to install R.
3. Install the downloaded package
sudo dpkg -i rstudio-0.xxx-amd64.deb
After installation, an “RStudio” icon will be available to proceed.
Installing RStudio (The Server Version)
1. Visit www.rstudio.com and use http://www.rstudio.com/ide/download/server.html to download the latest binary available.
2. As a prerequisite, you are required to install R (R version 2.11.1 or higher). Follow steps given above to install R.
3. Use the following steps to install RStudio Server and its dependencies.
32-bit:
$ sudo apt-get install gdebi-core
$ sudo apt-get install libapparmor1 # Required only for Ubuntu, not Debian
$ wget http://download2.rstudio.org/rstudio-server-0.98.501-i386.deb
$ sudo gdebi rstudio-server-0.98.501-i386.deb
64-bit:
$ sudo apt-get install gdebi-core
$ sudo apt-get install libapparmor1 # Required only for Ubuntu, not Debian
$ wget http://download2.rstudio.org/rstudio-server-0.98.501-amd64.deb
$ sudo gdebi rstudio-server-0.98.501-amd64.deb
4. Now use the following link to test the successful installation.
Use your Linux userid/password to log into the RStudio system. That's all there is to it!