Apache Spark with Python on Ubuntu – Part 01
Spark is a fast, general cluster computing system for Big Data. It is written in Scala. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools, including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing. Apache Spark is built on one main concept: the “Resilient Distributed Dataset (RDD)”.
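To get an intuition for what an RDD offers, here is a rough local analogy in plain Python (no Spark required): an RDD behaves like an immutable collection that you transform with operations such as map and filter, except that in Spark those transformations are distributed across a cluster.

```python
# A rough local analogy to RDD transformations, using plain Python.
# In Spark, these operations would be partitioned and run in parallel
# across the cluster instead of on a single in-memory list.
data = [1, 2, 3, 4, 5]

# map and filter are lazy transformations in Spark; here we
# materialize them eagerly with list() purely for illustration.
squared = list(map(lambda x: x * x, data))
evens = list(filter(lambda x: x % 2 == 0, squared))

print(squared)  # [1, 4, 9, 16, 25]
print(evens)    # [4, 16]
```

The key difference is that Spark records these transformations as a lineage and only computes results when an action (such as collect) is called, which is what makes RDDs resilient to node failures.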
Python vs Scala vs Java with Spark?
The most popular languages used with Spark are Python and Scala. The two follow a similar syntax and, compared to Java, are quite easy to follow. However, Scala tends to be faster than Python, mainly because Spark itself is written in Scala, which avoids the overhead of going through another layer of libraries to interpret your code when you choose Python. In general, though, both are capable of handling almost all use cases.
Installing Apache Spark with Python
To complete this task, follow the steps below one by one.
Step 1: Install Java
- I assume you already have a Java Development Kit (JDK) installed on your machine. As of March 2018, the JDK version supported by Spark is JDK 8.
- You may verify the Java installation
$ java -version
Step 2: Install Scala
- If you do not have Scala installed in your system, use this link to install it.
- Get the “tgz” bundle and extract it to the /usr/local/scala folder (this is a best practice).
$ tar xvf scala-2.11.12.tgz   # extract Scala
$ su -
$ mv scala-2.11.12 /usr/local/scala
$ exit
- Then update .bashrc to set SCALA_HOME and add $SCALA_HOME/bin to the $PATH.
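For example, the lines appended to ~/.bashrc could look like this (assuming Scala was moved to /usr/local/scala as above; run `source ~/.bashrc` afterwards to apply them):

```shell
# Scala environment (append to ~/.bashrc)
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin
```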
- Finally, verify the Scala installation
$ scala -version
Step 3: Install Spark
After installing both Java and Scala, now you are ready to download the Spark version. Use this link to download the “tgz” file.
Note: when downloading Apache Spark itself, be sure to get the latest release of Spark 2.3.
$ tar xvf spark-2.3.0-bin-hadoop2.7.tgz   # extract Spark
$ su -
$ mv spark-2.3.0-bin-hadoop2.7 /usr/local/spark
$ exit
- Then update .bashrc to set SPARK_HOME and add $SPARK_HOME/bin to the $PATH.
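As with Scala, the lines appended to ~/.bashrc could look like this (assuming Spark was moved to /usr/local/spark as above):

```shell
# Spark environment (append to ~/.bashrc)
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
```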
- Now you may verify the Spark installation
$ spark-shell
If all goes well, you will see a Spark prompt being displayed. (Use :quit or Ctrl+D to exit the shell.)
Step 4: Install Python
You may install Python using Canopy.
Use this link to download the binaries for your system (use the Linux 64-bit Python 3.5 download for this blog).
Once Canopy is installed, you have a Python development environment ready to work with Spark, with all the libraries including PySpark.
Once all of these are installed, you can try PySpark by simply typing “pyspark” in a terminal window.
This will allow you to execute your Python scripts on Spark.