
Apache Spark Big Data Boot Camp

Live Classroom
Duration: 5 days
Live Virtual Classroom
Duration: 5 days

Overview

Apache Spark is a popular toolset for powering Big Data solutions with distributed cluster computing, owing to its speed, versatility, and powerful APIs and libraries. Spark lets applications support data science workloads with R-style DataFrames and process Big Data as streams, helping overcome the latency constraints of earlier batch-only solutions. This fast-paced five-day course provides a thorough, hands-on overview of the Apache Spark platform as well as the technologies and paradigms that form a part of Spark. The course will help participants master all the skills necessary to use Apache Spark in their own applications.

What You'll Learn

  • The origin of Apache Spark
  • Apache Spark vs. Apache Hadoop
  • Apache Spark use cases
  • Streaming architecture of Spark
  • SQL architecture in Spark
  • Apache Spark and Machine Learning
  • Machine Learning libraries
  • Apache Spark GraphX

Curriculum

  • Introduction to data analysis
  • Introduction to Big Data
  • Definition of Big Data
  • Introduce the techniques and challenges in Big Data
  • Introduce the techniques and challenges in Distributed Computing
  • Show how the functional programming approach is particularly useful in tackling these challenges
  • Short overview of previous solutions – Google’s MapReduce and Apache Hadoop
  • Introduction to Apache Spark
  • Exercise: Get exposure to Spark administration and setup

  • Spark architecture in a cluster
  • Spark ecosystem and cluster management
  • Deploying Spark on a cluster
  • Deploying Spark on a standalone cluster
  • Deploying Spark on Mesos cluster
  • Deploying Spark on YARN cluster
  • Cloud-based deployment
  • Exercise: Learn to deploy and begin using Spark

  • Dig deeper into Apache Spark
  • Introduce Resilient Distributed Datasets (RDD)
  • Apache Spark installation
  • Introduce the Spark Shell
  • Actions and Transformations (Laziness)
  • Caching
  • Loading and saving data files from the file system
  • Exercise: Get hands-on with Spark Code and RDDs

  • Tailored RDD
  • Pair RDD
  • NewHadoopRDD
  • Aggregations
  • Partitioning
  • Broadcast variables
  • Accumulators
  • Exercise: Work with expanded RDD capabilities

  • SparkSQL and DataFrames
  • DataFrame and SQL API
  • DataFrame Schema
  • Datasets and Encoders
  • Loading and saving data
  • Aggregations
  • Joins
  • Exercise: Learn to use one of Spark’s most powerful features – DataFrames, with R-style data manipulation backed by distributed clusters

  • A brief introduction to streaming
  • Spark streaming
  • Discretized streams
  • Structured streaming
  • Stateful/stateless transformations
  • Checkpointing
  • Inter-operability with streaming platforms (Apache Kafka)
  • Exercise: Use Spark’s streaming support – one of its most exciting features – to process Big Data as it arrives, beating the timeframe constraints of earlier batch-only solutions

  • Introduction to Machine Learning
  • Spark Machine Learning APIs
  • Feature extractor and transformation
  • Classification using logistic regression
  • Best practices in machine learning for the practitioners
  • Exercise: Use Spark’s machine learning APIs to build models and run predictive analysis in a production-friendly way

  • Brief introduction to Graph theory
  • GraphX
  • Vertex and Edge RDDs
  • Graph operators
  • Pregel API
  • PageRank / travelling salesman problem
  • Exercise: Get hands-on practice using GraphX

  • Testing in a distributed environment
  • Testing Spark application
  • Debugging Spark application
  • Exercise: Lab practice applying best practices for testing and debugging Spark solutions and for handling day-to-day production issues

Who should attend

This boot camp is highly recommended for:

  • Developers and team leads
  • Software engineers
  • Business analysts
  • System analysts
  • Data analysts
  • Data scientists
  • Operations and DevOps engineers
  • Java developers
  • Big Data engineers

Prerequisites

There are no mandatory prerequisites for this course; however, a basic understanding of Scala or Python would be beneficial. It is also recommended to complete the Fundamentals of DevOps course before taking the Apache Spark Big Data boot camp.

Interested in this Course?

    Ready to recode your DNA for GenAI?
    Discover how Cognixia can help.

    Get in Touch