Apache Spark with Python on Ubuntu – Part 01
Spark is a fast, general cluster computing system for Big Data. It is written in Scala. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools, including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing. Apache Spark is built on one main concept: the “Resilient Distributed Dataset (RDD)”.
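To get an intuition for what an RDD offers, here is a rough local analogy in plain Python (no Spark required): an RDD behaves like an immutable collection that you transform with operations such as map and filter, except that in Spark those transformations are distributed across a cluster.

```python
# A rough local analogy to RDD transformations, using plain Python.
# In Spark, these operations would be partitioned and run in parallel
# across the cluster instead of on a single in-memory list.
data = [1, 2, 3, 4, 5]

# map and filter are lazy transformations in Spark; here we
# materialize them eagerly with list() purely for illustration.
squared = list(map(lambda x: x * x, data))
evens = list(filter(lambda x: x % 2 == 0, squared))

print(squared)  # [1, 4, 9, 16, 25]
print(evens)    # [4, 16]
```

The key difference is that Spark records these transformations as a lineage and only computes results when an action (such as collect) is called, which is what makes RDDs resilient to node failures.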
Python vs Scala vs Java with Spark?
The most popular languages used with Spark are Python and Scala. The two follow a similar syntax and, compared to Java, are quite easy to follow. However, Scala tends to be faster than Python, mainly because Spark itself is written in Scala, which avoids the overhead of going through another layer of libraries to interpret your code when you choose Python. In general, though, both are capable of handling almost all use cases.
Installing Apache Spark with Python
To complete this task, follow the steps below one by one.
Step 1: Install Java
- I assume you already have a Java Development Kit (JDK) installed on your machine. As of March 2018, the JDK version supported by Spark is JDK 8.
- You may verify the Java installation
$ java -version
Step 2: Install Scala
- If you do not have Scala installed in your system, use this link to install it.
- Get the “tgz” bundle and extract it to the /usr/local/scala folder (this is a best practice).
$ tar xvf scala-2.11.12.tgz   # extract Scala
$ su -
$ mv scala-2.11.12 /usr/local/scala
$ exit
- Then update .bashrc to set SCALA_HOME and add $SCALA_HOME/bin to the $PATH.
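For example, the lines appended to ~/.bashrc could look like this (assuming Scala was moved to /usr/local/scala as above; run `source ~/.bashrc` afterwards to apply them):

```shell
# Scala environment (append to ~/.bashrc)
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin
```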
- Finally, verify the Scala installation
$ scala -version
Step 3: Install Spark
After installing both Java and Scala, now you are ready to download the Spark version. Use this link to download the “tgz” file.
Note: when downloading Apache Spark itself, be sure to get the latest release of Spark 2.3.
$ tar xvf spark-2.3.0-bin-hadoop2.7.tgz   # extract Spark
$ su -
$ mv spark-2.3.0-bin-hadoop2.7 /usr/local/spark
$ exit
- Then update .bashrc to set SPARK_HOME and add $SPARK_HOME/bin to the $PATH.
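As with Scala, the lines appended to ~/.bashrc could look like this (assuming Spark was moved to /usr/local/spark as above):

```shell
# Spark environment (append to ~/.bashrc)
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
```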
- Now you may verify the Spark installation
$ spark-shell
If all goes well, you will see a Spark prompt being displayed. (Use :quit or Ctrl+D to exit the shell.)
Step 4: Install Python
You may install Python using Canopy.
Use this link to download the binaries for your system (use the Linux 64-bit Python 3.5 download for this blog).
Once Canopy is installed, you have a Python development environment ready to work with Spark, with all the libraries including PySpark.
Once all of these are installed, you can try PySpark by simply typing “pyspark” in a terminal window.
This will allow you to execute your Python scripts on Spark.