Fine-tuning an Apache Spark program with fewer stages


December 2, 2015 | Big Data

When an Apache Spark job runs, a DAG (Directed Acyclic Graph) of operators is created. The DAG is split into a set of stages, which are further divided into a set of tasks, as shown in the image below. A Spark job is split into stages at the points where data has to be shuffled (moved) across machines. Shuffling data is costly: the less data is shuffled, the faster the job completes.
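To make the stage split concrete, here is a minimal plain-Python sketch of the idea. It does not call Spark and is not Spark's actual scheduler: the `SHUFFLE_OPS` set and the `split_into_stages` helper are hypothetical names introduced only for illustration, and the model assumes a simple linear chain of operations in which every shuffle operation starts a new stage.

```python
# Illustrative model only; this is NOT how Spark's scheduler is implemented.
# Operations that need a shuffle start a new stage in this simplified model.
SHUFFLE_OPS = {"reduceByKey", "groupByKey", "sortByKey", "repartition"}


def split_into_stages(operations):
    """Group a linear chain of operation names into stages.

    A new stage begins at every operation that requires a shuffle.
    """
    stages = [[]]
    for op in operations:
        # Close the current stage when a shuffle operation is reached.
        if op in SHUFFLE_OPS and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    # Drop the empty placeholder stage if no operations were given.
    return [stage for stage in stages if stage]


if __name__ == "__main__":
    chain = ["parallelize", "flatMap", "map", "reduceByKey"]
    for number, stage in enumerate(split_into_stages(chain)):
        print("Stage %d: %s" % (number, ", ".join(stage)))
```

In this model a chain with one shuffle operation produces two stages, and a chain with no shuffle operations stays in a single stage, which is why removing unnecessary shuffles reduces the stage count.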

[Image: job scheduling process]

Here is a simple WordCount Python program that counts how many times each word occurs in the given input data.

from pyspark import SparkConf
from pyspark import SparkContext

# Enable event logging so that the stages of the job can be inspected later.
conf = (SparkConf()
        .setMaster("spark://myubuntu:7077")
        .set("spark.eventLog.enabled", "true")
        .set("spark.eventLog.dir", "/home/myubuntu/Installations/spark-1.5.2-bin-hadoop1-scala2.11/eventLogs"))
sc = SparkContext(conf=conf)

# Split each line into words, pair every word with 1, then sum the counts per word.
wordCount = (sc.parallelize(["spark is cool", "spark is fast", "spark is hip"])
             .flatMap(lambda line: line.split())
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b))
print(wordCount.collect())

[Image: job-scheduling-process]

For the above program, the accompanying image visualizes the different stages and tasks. Note that there are multiple stages because the reduceByKey() aggregation requires the data to be shuffled. The DAG visualization is not kept in the Spark console by default once the application has finished; for that, the ‘spark.eventLog.enabled’ and ‘spark.eventLog.dir’ properties have to be set, as shown in the above program. The properties can be set either programmatically or through the Spark configuration files.

Once the properties are set, the visualization helps in fine-tuning the Spark program. Note that the fewer the stages, the less data is shuffled between machines and the less time the Spark job takes to complete. Look out for more tips around the Apache Spark framework.