Integrating Apache Spark with PyCharm

November 27, 2015 | General, Spark

There has been a lot of buzz around Apache Spark. Not only is code written for Spark concise, but the Spark developers also claim that it runs up to 100 times faster than Hadoop MapReduce. Another major difference is that while Hadoop natively supports only a Java API, Apache Spark provides APIs for Java, Scala, Python, and the recently introduced R. Spark claims that all supported languages are first-class citizens, but given the rapid pace at which new features are being added, not all languages are treated equally. To run non-Java code, Hadoop relies on a feature called Streaming, while Spark can pipe data through external programs with the `pipe` operation on RDDs.

Because Spark supports multiple languages out of the box, more developers can get started with it easily. In this blog, we will look at how to write Spark programs in one of the popular Python IDEs, called PyCharm.

Option 1

Include the code below in each and every program. This is not the preferred approach, because it depends on the environment in which the program is executed, and auto-completion doesn't work.

import os
import sys

# Path for spark source folder
os.environ['SPARK_HOME'] = "/home/myubuntu/Installations/spark-1.5.2-bin-hadoop1-scala2.11"

# Append pyspark to Python Path
sys.path.append("/home/myubuntu/Installations/spark-1.5.2-bin-hadoop1-scala2.11/python/")
sys.path.append("/home/myubuntu/Installations/spark-1.5.2-bin-hadoop1-scala2.11/python/lib/py4j-0.8.2.1-src.zip")
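If the same lines really must be repeated in several programs, one way to reduce the duplication is to wrap them in a small helper. The sketch below is only an illustration: the function name `add_pyspark_to_path` is made up for this post, and it assumes the usual layout of a Spark binary distribution, with a `python` folder and a `python/lib/py4j-*-src.zip` archive under the Spark home. It looks up the py4j archive with a wildcard so the version number does not have to be hard-coded.

```python
import glob
import os
import sys


def add_pyspark_to_path(spark_home):
    """Point this process at a Spark install and return the sys.path entries added.

    Hypothetical helper, not part of Spark or PyCharm.
    """
    os.environ["SPARK_HOME"] = spark_home
    python_dir = os.path.join(spark_home, "python")
    # Match whichever py4j source archive ships with this Spark build
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    added = []
    for entry in [python_dir] + py4j_zips:
        # Avoid appending the same entry twice if the helper is called again
        if entry not in sys.path:
            sys.path.append(entry)
            added.append(entry)
    return added
```

A program would call `add_pyspark_to_path("/home/myubuntu/Installations/spark-1.5.2-bin-hadoop1-scala2.11")` before its first `pyspark` import. The drawbacks described for this option still apply, since PyCharm cannot see paths that are only added at run time.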

Option 2

Another way is to create a PyCharm project and add the properties below in the `Run -> Edit Configurations...` menu. The advantage of this approach is that auto-completion works, but each project has to be modified manually.

PYTHONPATH=/home/myubuntu/Installations/spark-1.5.2-bin-hadoop1-scala2.11/python/:/home/myubuntu/Installations/spark-1.5.2-bin-hadoop1-scala2.11/python/lib/py4j-0.8.2.1-src.zip
SPARK_HOME=/home/myubuntu/Installations/spark-1.5.2-bin-hadoop1-scala2.11
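Typing these long values by hand for every project is error-prone, so it can help to generate them once and paste them into the run configuration. The following is a minimal sketch under the same assumptions as the values above: `pycharm_env` is a hypothetical name, and the default py4j archive name is simply the one bundled with the Spark 1.5.2 build used in this post, so it may differ for other versions.

```python
import os


def pycharm_env(spark_home, py4j_zip="py4j-0.8.2.1-src.zip"):
    """Build the two environment variables for a PyCharm run configuration.

    Hypothetical helper; the default py4j_zip is an assumption tied to Spark 1.5.2.
    """
    python_dir = os.path.join(spark_home, "python")
    py4j_path = os.path.join(python_dir, "lib", py4j_zip)
    return {
        # os.pathsep is ':' on Linux and macOS, ';' on Windows
        "PYTHONPATH": os.pathsep.join([python_dir, py4j_path]),
        "SPARK_HOME": spark_home,
    }
```

Printing each pair as `NAME=value` gives lines in the same form as the two properties shown for this option.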

Option 3

The final and preferred approach is to modify the bin/pycharm.sh script and export the above environment variables as shown below. With this approach, individual PyCharm projects need not be modified, and auto-completion also works.

export PYTHONPATH=/home/myubuntu/Installations/spark-1.5.2-bin-hadoop1-scala2.11/python/:/home/myubuntu/Installations/spark-1.5.2-bin-hadoop1-scala2.11/python/lib/py4j-0.8.2.1-src.zip
export SPARK_HOME=/home/myubuntu/Installations/spark-1.5.2-bin-hadoop1-scala2.11
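After restarting PyCharm, it is worth confirming that the exported variables actually reach the Python process. The check below is a rough sketch rather than an official tool: `check_spark_env` is a name invented here, and it only inspects `SPARK_HOME` and `PYTHONPATH` for the kind of entries used in this post (a `python` directory and a py4j archive); it does not start Spark.

```python
import os


def check_spark_env(environ=None):
    """Return a list of problems found; an empty list means the setup looks fine.

    Hypothetical helper; pass a dict to check something other than os.environ.
    """
    environ = os.environ if environ is None else environ
    problems = []

    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    elif not os.path.isdir(spark_home):
        problems.append("SPARK_HOME does not point to a directory: " + spark_home)

    # Split PYTHONPATH into its entries, ignoring empty ones
    entries = [p for p in environ.get("PYTHONPATH", "").split(os.pathsep) if p]
    if not any(os.path.basename(p.rstrip("/\\")) == "python" for p in entries):
        problems.append("PYTHONPATH has no Spark 'python' directory")
    if not any("py4j" in os.path.basename(p) for p in entries):
        problems.append("PYTHONPATH has no py4j archive")
    return problems
```

Running `print(check_spark_env())` from a scratch file inside PyCharm should print an empty list when the exports were picked up.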

Look out for more blogs on the latest Apache Spark from us.
