
Integrating Apache Spark with PyCharm


November 27, 2015 | General, spark

There has been a lot of buzz around Apache Spark. Not only is code written for Spark concise, but the Spark developers also claim that it can be up to 100 times faster than Hadoop MapReduce. Another major difference is that Hadoop natively supports only a Java API, while Apache Spark provides APIs for Java, Scala, Python, and the recently introduced R. Spark claims that all supported languages are first-class citizens, but given the rapid pace at which new features are being added, not all languages are treated equally. For running non-Java code, Hadoop uses a feature called Streaming, and Spark offers a similar pipe operation.

By supporting multiple languages out of the box, Spark makes it easy for more developers to get started. In this blog, we will look at how to write Spark programs in one of the popular Python IDEs, PyCharm.

Option 1

Include the code below in each and every program. This is not the preferred approach, because it depends on the environment in which the program is executed, and auto-completion doesn't work.

import os
import sys

# Path to the Spark installation folder
os.environ['SPARK_HOME'] = "/home/myubuntu/Installations/spark-1.5.2-bin-hadoop1-scala2.11"

# Append pyspark and its bundled py4j to the Python path
sys.path.append("/home/myubuntu/Installations/spark-1.5.2-bin-hadoop1-scala2.11/python/")
sys.path.append("/home/myubuntu/Installations/spark-1.5.2-bin-hadoop1-scala2.11/python/lib/py4j-0.8.2.1-src.zip")
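The snippet above hard-codes both the installation folder and the Py4J version. As a rough sketch of how that could be made less brittle, the hypothetical helper below (`add_pyspark_to_path` is a name invented here, not part of Spark) takes the Spark folder as a parameter and finds the Py4J archive with a wildcard. It assumes the same `python/lib/py4j-*-src.zip` layout as the Spark 1.5.2 download used in this post.

```python
import glob
import os
import sys


def add_pyspark_to_path(spark_home):
    """Set SPARK_HOME and put PySpark and its bundled Py4J on sys.path.

    Hypothetical helper: assumes the layout <spark_home>/python/lib/py4j-*-src.zip.
    Returns the list of path entries it relies on.
    """
    os.environ["SPARK_HOME"] = spark_home
    python_dir = os.path.join(spark_home, "python")

    # The Py4J version in the file name changes between Spark releases,
    # so match it with a wildcard instead of spelling it out.
    pattern = os.path.join(python_dir, "lib", "py4j-*-src.zip")
    py4j_zips = sorted(glob.glob(pattern))
    if not py4j_zips:
        raise IOError("No py4j-*-src.zip found under %s" % python_dir)

    entries = [python_dir, py4j_zips[-1]]
    for entry in entries:
        if entry not in sys.path:
            sys.path.append(entry)
    return entries
```

Calling `add_pyspark_to_path("/home/myubuntu/Installations/spark-1.5.2-bin-hadoop1-scala2.11")` before the first `import pyspark` should have the same effect as the hard-coded lines. Like them, it still has to be included in every program, and auto-completion is not helped.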

Option 2

Another way is to create a PyCharm project and add the properties below as environment variables in the `Run -> Edit Configurations...` dialog. The advantage of this approach is that auto-completion works, but every project has to be configured manually.

PYTHONPATH=/home/myubuntu/Installations/spark-1.5.2-bin-hadoop1-scala2.11/python/:/home/myubuntu/Installations/spark-1.5.2-bin-hadoop1-scala2.11/python/lib/py4j-0.8.2.1-src.zip
SPARK_HOME=/home/myubuntu/Installations/spark-1.5.2-bin-hadoop1-scala2.11
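Because these values are typed by hand for every project, a quick sanity check can help. The sketch below is only an illustration (the function name `check_spark_env` and its messages are made up for this post). It inspects the environment the run configuration passes to the program and lists anything that looks missing, without importing Spark itself.

```python
import os


def check_spark_env(environ=None):
    """Return a list of problems found; an empty list means the setup looks OK.

    Hypothetical diagnostic: it only inspects variable values and does not
    verify that the paths exist or that pyspark can be imported.
    """
    environ = os.environ if environ is None else environ
    problems = []

    if not environ.get("SPARK_HOME"):
        problems.append("SPARK_HOME is not set")

    # Split PYTHONPATH into its entries, ignoring empty ones.
    entries = [e for e in environ.get("PYTHONPATH", "").split(os.pathsep) if e]

    # Expect one entry for <spark>/python and one for the py4j archive.
    if not any(os.path.basename(os.path.normpath(e)) == "python" for e in entries):
        problems.append("PYTHONPATH has no entry ending in 'python'")
    if not any("py4j" in os.path.basename(e) for e in entries):
        problems.append("PYTHONPATH has no py4j entry")

    return problems


if __name__ == "__main__":
    for message in check_spark_env() or ["Spark environment looks configured"]:
        print(message)
```

Running this file with the same run configuration prints either the detected problems or a single confirmation line.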

Option 3

The final approach is to modify the bin/pycharm.sh script and export the above environment variables as shown below. This is the preferred approach, since individual PyCharm projects need not be modified and auto-completion also works.

export PYTHONPATH=/home/myubuntu/Installations/spark-1.5.2-bin-hadoop1-scala2.11/python/:/home/myubuntu/Installations/spark-1.5.2-bin-hadoop1-scala2.11/python/lib/py4j-0.8.2.1-src.zip
export SPARK_HOME=/home/myubuntu/Installations/spark-1.5.2-bin-hadoop1-scala2.11
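The two export lines repeat the same installation folder three times, so it is easy for them to drift apart after a Spark upgrade. As a small illustration, the hypothetical helper below (`pycharm_exports` is not part of Spark or PyCharm) builds both lines from one Spark folder. The default Py4J file name is the one shipped with the Spark 1.5.2 build used here and is an assumption for any other release.

```python
def pycharm_exports(spark_home, py4j_zip="py4j-0.8.2.1-src.zip"):
    """Build the two export lines for bin/pycharm.sh from one Spark folder.

    Hypothetical helper; py4j_zip defaults to the archive name bundled with
    Spark 1.5.2 and must be adjusted for other Spark versions.
    """
    spark_home = spark_home.rstrip("/")
    python_path = ":".join([
        spark_home + "/python/",
        spark_home + "/python/lib/" + py4j_zip,
    ])
    return [
        "export PYTHONPATH=" + python_path,
        "export SPARK_HOME=" + spark_home,
    ]


if __name__ == "__main__":
    home = "/home/myubuntu/Installations/spark-1.5.2-bin-hadoop1-scala2.11"
    for line in pycharm_exports(home):
        print(line)
```

The printed lines can then be pasted into bin/pycharm.sh. Note that, like the lines above, they replace any PYTHONPATH that was already set rather than extending it.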

Look out for more blogs on the latest Apache Spark from us.
