Big Data Hadoop And Spark Developer Training

Overview

Become an expert in Hadoop by gaining hands-on knowledge of MapReduce, Hadoop architecture, Pig and Hive, Flume, and the Apache workflow scheduler Oozie. Build familiarity with HBase, ZooKeeper, and Sqoop concepts while working on industry-based use cases and projects.

Why get Big Data Hadoop Developer Certification from Cognixia?

Job opportunities in Big Data and Hadoop are multiplying for industry-savvy IT professionals. According to a recent study, the Big Data & Hadoop market is estimated to grow at a compound annual growth rate (CAGR) of 58% and surpass $16 billion by 2020.

Cognixia’s Big Data Hadoop Developer certification course covers the key concepts and skills needed to manage Big Data with Hadoop, Apache’s open source platform. It imparts in-depth knowledge of the core ideas and lets you apply them through a variety of hands-on applications. Through this course, IT professionals working in organizations of all sizes can learn to code within the MapReduce framework. The course also covers advanced modules such as YARN, ZooKeeper, Oozie, Flume and Sqoop.

What you'll learn

Why You Shouldn’t Miss this course

Learn to write complex MapReduce code on both MRv1 and MRv2 (YARN) and understand Hadoop architecture
Perform analytics and learn the high-level scripting frameworks Pig and Hive
Build an advanced understanding of the Hadoop ecosystem, including Flume and the Apache workflow scheduler Oozie
Gain familiarity with other concepts, such as HBase, ZooKeeper and Sqoop
Get hands-on expertise in various Hadoop cluster configuration environments
Learn about optimization and troubleshooting
Acquire in-depth knowledge of Hadoop architecture by learning about the Hadoop Distributed File System (HDFS 1.0 and HDFS 2.0)
Work on real-life projects built to industry standards

Project 1: “Twitter Analysis”
- To date, approximately 20% of all data is in structured form. A limitation of an RDBMS is that it can store and process only structured data. Hadoop, however, enables us to store and process all data, whether structured or unstructured.
- Today, Twitter has become a significant source of data, as well as a reliable tool for analyzing what the consumer is thinking about (sentiment analysis). This helps in figuring out which topics and discussions are trending at any given time. During this case study, we’ll aggregate data from Twitter through various means, to conduct an exploratory analysis.
Project 2: “Click Stream Analysis”
- E-commerce websites have had a tremendous impact on local economies across the globe. As part of their operation, e-commerce websites maintain a detailed record of user activity, stored as a clickstream. This record is used to analyze the browsing patterns of a particular user, which helps e-commerce platforms recommend products with high accuracy during current and future visits. It also helps e-commerce marketers, and their technology platforms, design personalized promotional emails for their users.
- In this case study, we’ll see how to analyze clickstream and user data using Pig and Hive. We’ll gather the user data from an RDBMS and capture user behavior (clickstream) data into HDFS using Flume. Next, we’ll analyze this data using Pig and Hive. We’ll also automate the clickstream analysis using the workflow engine Oozie.

Prerequisites

Recommended Experience

A basic understanding of Core Java is recommended for anyone who wants to pursue a career in Big Data and Hadoop, but it is not mandatory.

Curriculum

Introduction to Linux and Big Data Virtual Machine (VM)

Introduction to and installation of VirtualBox and the Big Data VM
Introduction to Linux
Why Linux?
Windows and the Linux equivalents
Different flavors of Linux
Unity Shell (Ubuntu UI)
Basic Linux Commands (enough to get started with Hadoop)

Understanding Big Data

3V (Volume- Variety- Velocity) characteristics
Structured and unstructured data
Application and use cases of Big Data
Limitations of traditional large scale systems
How a distributed way of computing is superior (cost and scale)
Opportunities and challenges with Big Data

HDFS (The Hadoop Distributed File System)

HDFS Overview and Architecture
Deployment Architecture
Name Node
Data Node and Checkpoint Node (aka Secondary Name Node)
Safe mode
Configuration files
HDFS Data Flows (Read/Write)

How HDFS Addresses Fault Tolerance

CRC Checksum
Data Replication
Rack awareness and block placement policy
Small file problems
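
As a rough illustration of the checksum topic above, the sketch below splits a block of data into fixed-size chunks, stores a CRC per chunk, and later reports which chunks no longer match. It is not HDFS code: the function names are hypothetical, the 512-byte chunk size is an assumption made for this example, and Python's zlib.crc32 stands in for whatever checksum a real cluster is configured to use.

```python
import zlib

CHUNK_SIZE = 512  # assumed chunk size in bytes, chosen for this sketch


def chunk_checksums(data, chunk_size=CHUNK_SIZE):
    """Return one CRC-32 value per fixed-size chunk of data."""
    return [
        zlib.crc32(data[start:start + chunk_size])
        for start in range(0, len(data), chunk_size)
    ]


def find_corrupt_chunks(data, expected, chunk_size=CHUNK_SIZE):
    """Return the indexes of chunks whose checksum differs from the stored one."""
    actual = chunk_checksums(data, chunk_size)
    return [
        index
        for index, (got, want) in enumerate(zip(actual, expected))
        if got != want
    ]


if __name__ == "__main__":
    block = bytes(range(256)) * 8           # 2048 bytes, i.e. 4 chunks
    stored = chunk_checksums(block)         # checksums kept at write time

    damaged = bytearray(block)
    damaged[600] ^= 0xFF                    # flip one byte inside chunk 1
    print(find_corrupt_chunks(bytes(damaged), stored))
```

When a reader detects a mismatch like this, a replicated file system can fall back to another copy of the block, which is where the data replication topic in the same list comes in.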

HDFS Interfaces

Command-Line Interface
File Systems
Administrative
Web Interfaces

Advanced HDFS Features

Load Balancer
DistCp (Distributed Copy)
HDFS Federation
HDFS High Availability
Hadoop Archives

MapReduce – 1 (Theoretical Concepts)

MapReduce overview
Functional Programming paradigms
How to think in a MapReduce way

MapReduce Architecture

Legacy MR v/s Next Generation MapReduce (YARN/ MRv2)
Slots v/s Containers
Schedulers
Shuffling, Sorting
Hadoop Data Types
Input and Output Formats
Input Splits – Partitioning (Hash Partitioner v/s Custom Partitioner)
Configuration files
Distributed Cache
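
To make the hash partitioner versus custom partitioner comparison above more concrete, here is a minimal sketch in plain Python. It only mimics the idea of routing each map output key to a reducer; it does not use the Hadoop API. The function names are hypothetical, zlib.crc32 is an assumed stand-in for a key's hash code, and the first-letter rule is an invented example of custom logic.

```python
import zlib


def hash_partition(key, num_reducers):
    """Default-style rule: hash the key, then take it modulo the reducer count."""
    # zlib.crc32 is used because it gives the same value on every run.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def first_letter_partition(key, num_reducers):
    """Hypothetical custom rule based on the first character of the key.

    Keys whose first character sorts at or before 'm' go to reducer 0;
    all other keys are hashed across the remaining reducers.
    """
    if num_reducers == 1 or key[:1].lower() <= "m":
        return 0
    return 1 + hash_partition(key, num_reducers - 1)


if __name__ == "__main__":
    for word in ["apple", "hadoop", "spark", "zebra"]:
        print(word, hash_partition(word, 4), first_letter_partition(word, 3))
```

A custom rule like this is useful when related keys must reach the same reducer, at the cost of possible load imbalance if the keys are unevenly distributed.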

MR Algorithm and Data Flow

Word Count
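
The word count data flow can be sketched without a cluster. The plain-Python example below imitates the three stages (map, shuffle and sort, reduce) in a single process; the function names are hypothetical and nothing here uses the Hadoop API, so treat it as a way to reason about the flow rather than as a deployable job.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum all the counts emitted for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce over the input lines in one process."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    # Keys are sorted before reducing, mirroring the sort step of the shuffle.
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(word_count(["big data hadoop", "Big Data"]))
```

In a real job the map and reduce functions run on different machines, and the shuffle moves data across the network between them.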

Alternatives to MR – BSP (Bulk Synchronous Parallel)

Ad hoc Querying
Graph Computing Engines

MapReduce – 2 (Practice) Developing, Debugging and Deploying MR Programs

Standalone mode (in Eclipse)
Pseudo Distributed mode (as in the Big Data VM)
Fully Distributed mode (as in Production)
MR API
Old and the New MR API
Java Client API
Hadoop data types
Custom Writable

WritableComparable

Different input and output formats
Saving Binary Data using Sequence Files and Avro Files
Hadoop Streaming (developing and debugging non-Java MR programs – Ruby and Python)

Optimization Techniques

Speculative execution
Combiners
JVM Reuse
Compression

MR Algorithms (Non-Graph)

Proof of Concepts and Use Cases

Click Stream Analysis using Pig and Hive
Analyzing the Twitter data with Hive
Further ideas for data analysis

Advanced HBase Features

HBase Data Modeling
Bulk loading data in HBase
HBase Coprocessors – Endpoints (similar to Stored Procedures in RDBMS)
HBase Coprocessors – Observers (similar to Triggers in RDBMS)

MR Algorithms (Graph)

PageRank
Inverted Index

Higher Level Abstractions for MR (Pig)

Introduction and Architecture
Different modes of executing Pig constructs
Data Types
Dynamic Invokers
Pig streaming
Macros
Pig Latin language Constructs (LOAD, STORE, DUMP, SPLIT, etc)
User-Defined Functions
Use Cases

Comparison of Pig and Hive

NoSQL Databases – 1 (Theoretical Concepts)

NoSQL Concepts
Review of RDBMS
Need for NoSQL
Brewer's CAP Theorem
ACID v/s BASE
Schema on Read vs. Schema on Write
Different levels of consistency
Bloom filters
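
Of the concepts above, Bloom filters are the easiest to show in a few lines. A Bloom filter answers "might this key be present?" and can return false positives but never false negatives, which lets a store skip files that definitely do not contain a key. The sketch below is a minimal illustration with assumed sizes (1024 bits, 3 hash functions) and a hypothetical class name; it does not reflect how any particular database implements it.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: a bit array plus several hash functions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        """Derive num_hashes bit positions by salting SHA-256 with a seed."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        """Set every bit position derived from the item."""
        for position in self._positions(item):
            self.bits[position] = 1

    def might_contain(self, item):
        """Return False if the item was definitely never added."""
        return all(self.bits[position] for position in self._positions(item))


if __name__ == "__main__":
    row_keys = BloomFilter()
    row_keys.add("row-1")
    print(row_keys.might_contain("row-1"))  # an added key is always reported
    print(row_keys.might_contain("row-2"))  # usually False, but may be a false positive
```

Increasing the number of bits lowers the false positive rate at the cost of memory.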

Different Types of NoSQL Databases

Key-Value
Columnar
Document
Graph

Columnar Databases Concepts

NoSQL Databases – 2 (Practice)

Interfaces to HBase (for DDL and DML Operations)

Introduction to Sqoop

Use-case of Sqoop
Sqoop Architecture
Sqoop Demo

Introduction to Flume

Use-case of Flume
Flume Architecture
Flume Demo

Introduction to Oozie

Use-case of Oozie
Oozie Architecture
Oozie Demo

Introduction to YARN

Use-case of YARN
YARN Architecture
YARN Demo

Spark

Introduction to RDD
Installation and Configuration of Spark
Spark Architecture
Different interfaces to Spark
Sample Python programs in Spark

Hadoop Ecosystem and Use Cases

Hadoop industry solutions
Importing/exporting data across RDBMS and HDFS using Sqoop
Getting real-time events into HDFS using Flume
Creating workflows in Oozie
Introduction to Graph processing
Graph processing with Neo4J
Using the Mongo Document Database
Using the Cassandra Columnar Database
Distributed Coordination with Zookeeper

SSH Configuration

Stand alone mode (Theory)
Distributed mode (Theory)
Pseudo distributed
Fully-distributed

Setting up a Hadoop Cluster using Apache Hadoop

Cloudera Hadoop cluster on the Amazon Cloud (Practice)
Using EMR (Elastic Map Reduce)
Using EC2 (Elastic Compute Cloud)

Feature

Course Duration: 36 hours of live, online, instructor-led training

24x7 Support: Technical and query support round the clock

Lifetime LMS Access: Access all the materials on the LMS anytime, anywhere

Price Match Guarantee: Guaranteed best price aligned with the quality of deliverables

FAQs

Frequently Asked Questions

Find details on duration, delivery formats, customization options, and post-program reinforcement.

Who are the instructors?

Our instructors/trainers are Cloudera and Hortonworks-certified professionals. They have more than 12 years of industry experience and are Big Data subject-matter experts (SMEs).

What internet speed is required to attend the live classes?

To attend the live virtual training, an internet connection of at least 2 Mbps is required.

Can I install Hadoop on my local machine?

Yes, Cognixia’s Virtual Machine can be installed on any local system. The training team of Collabera will assist you with this.

What are the system requirements to install the Hadoop environment?

To install the Hadoop environment, you’ll need 8 GB of RAM, a 64-bit OS, 50 GB of free hard disk space, and a processor with virtualization technology enabled.

For how long can we have access to the LMS?

The access to the Learning Management System (LMS) is lifelong. That includes class recordings, presentations, sample code, and projects.

What is the course duration?

The online live training course runs for eight weekends (15-16 sessions).

What if I miss a training session?

If you miss a session, don’t worry! All our sessions are available in recorded form on the LMS. We also have a technical support team that’s ready to assist you with any questions you may have.