1.0 Introduction

Spark is a fast and general cluster computing system for Big Data. It is written in Scala Language. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing. Apache Spark is built on one main concept, which is “Resilient Distributed Data (RDD)”.

Spark Components

(Python vs Scala vs Java) with Spark?

The most popular languages that Spark associated are Python and Scala. Both languages follow a similar syntax and compared to Java they are quite easy to follow. However compared to Python, Scala seems more faster mainly Spark is written in Scala and it overcomes the delay of having to go through another set of libraries to interpret if you chose to use Python. However, in general both are capable of doing the task in almost all the use cases.

Installing Apache Spark with Python

In order to complete this task, you are required to follow the following steps one by one.

Step 1: Install Java

- I assume you already have Java Development Kit (JDK) installed in your machines. In March 2018, the Spark supported JDK version is JDK 8.

- You may verify the Java installation

$ java -version

Step 2: Install Scala

- If you do not have Scala installed in your system, use this link to install it.

- Get the “tgz” bundle and extract it to the /usrlocal/scala folder (This is as a best practice)

$ tar xvf scala-2.11.12.tgz

// Extract the scala into /usr/local folder
$ su -
$ mv scala-2.11.12 /usr/local/scala
$ exit

- Then update the .bashrc to have SCALA_HOME and $SCALA_HOME/bin to the $PATH.

- After all, verify the scala installation

$ scala -version

Step 3: Install Spark

After installing both Java and Scala, now you are ready to download the Spark version. Use this link to download the “tgz” file.

P.Note: Also when downloading Apache Spark itself, be sure to install the latest version of Spark 2.3

$ tar xvf spark-2.3.0-bin-hadoop2.7.tgz
// Extract the Spark into /usr/local/spark
$ su -
$ mv spark-2.3.0-bin-hadoop2.7 /usr/local/spark
$ exit

- Then update the .bashrc to have SPARK_HOME and $SPARK_HOME/bin to the $PATH.

- Now you may verify the Spark installation

$ spark-shell

if all goes well, you will see a Spark prompt being displayed!. (Use :quit OR Ctrl+D to exit from the shell)

Step 4: Install Python

You may install Python using Canopy

Use this link to download the binaries to your system. (Use Linux(64-bit Python 3.5 Download for this blog)

Once you installed,Canopy, you have a Python development environment to work with Spark with all the libraries including PySpark.

Once all these installed you can try PySpark by just typing “pyspark” on the terminal window.

$ pyspark

This will allow you to continue to execute your Python scripts on Spark.

VN:F [1.9.22_1171]
Rating: 10.0/10 (1 vote cast)
VN:F [1.9.22_1171]
Rating: 0 (from 0 votes)
Apache Spark with Python on Ubuntu - Part 01, 10.0 out of 10 based on 1 rating