
Big Data Hadoop Developer Training

Overview

Become an expert in Hadoop by getting hands-on knowledge of MapReduce, Hadoop architecture, Pig & Hive, Flume, and the Oozie workflow scheduler. Build familiarity with HBase, Zookeeper, and Sqoop concepts while working on industry-based use cases and projects.

Why get Big Data Hadoop Developer Certification from Cognixia?

Job opportunities for IT professionals skilled in Big Data & Hadoop are multiplying across industries. According to a recent study, the Big Data & Hadoop market is estimated to grow at a compound annual growth rate (CAGR) of 58%, surpassing $16 billion by 2020.

Cognixia’s Big Data Hadoop Developer certification course highlights the key ideas and proficiency needed for managing Big Data with Apache’s open-source platform – Hadoop. Not only does the course impart in-depth knowledge of the core concepts, it also lets participants apply them through a variety of hands-on exercises. Through this course, IT professionals working in organizations of all sizes can learn to code within the MapReduce framework. The course also covers advanced modules like YARN, Zookeeper, Oozie, Flume and Sqoop.

Schedule Classes

Looking for more sessions of this class?

What You'll Learn

  • Learn to write complex MapReduce code on both MRv1 & MRv2 (YARN) and understand Hadoop architecture
  • Perform analytics and learn the high-level scripting frameworks Pig & Hive
  • Build an advanced understanding of the Hadoop ecosystem, including Flume and the Oozie workflow scheduler
  • Gain familiarity with other concepts, such as HBase, Zookeeper and Sqoop
  • Get hands-on expertise in various configuration environments of a Hadoop cluster
  • Learn about optimization & troubleshooting
  • Acquire in-depth knowledge of Hadoop architecture by learning about the Hadoop Distributed File System (HDFS 1.0 & HDFS 2.0)
  • Work on real-life projects based on industry standards
  • Project 1: “Twitter Analysis”

To date, approximately 20% of all data is in structured form. The limitation of an RDBMS is that it can store and process only structured data. Hadoop, however, enables us to store and process all data – structured or unstructured.

Today, Twitter has become a significant source of data, as well as a reliable tool for analyzing what the consumer is thinking about (sentiment analysis). This helps in figuring out which topics and discussions are trending at any given time. During this case study, we’ll aggregate data from Twitter through various means, to conduct an exploratory analysis.
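As a toy illustration of the exploratory side of such a study, the sketch below counts hashtag frequencies – a simple proxy for trending topics – over a handful of inline sample records. This is a minimal local sketch, not the course's actual project code; the sample tweets and field names are invented for the example, and at course scale the same computation would run on Hadoop.

```python
from collections import Counter

# Invented sample records standing in for collected tweets.
sample_tweets = [
    {"text": "Loving the new phone! #tech #gadgets"},
    {"text": "Big Data is everywhere #tech #bigdata"},
    {"text": "Traffic again... #citylife"},
]

def extract_hashtags(text):
    """Return lowercase hashtags found in a tweet's text."""
    return [w.lower().strip("#") for w in text.split() if w.startswith("#")]

counts = Counter()
for tweet in sample_tweets:
    counts.update(extract_hashtags(tweet["text"]))

# The most common hashtags approximate "trending topics" in this toy sample.
print(counts.most_common(2))
```

The same group-and-count shape carries over directly to a MapReduce job: the mapper emits one record per hashtag, and the reducer sums the counts per key.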

  • Project 2: “Click Stream Analysis”

E-commerce websites have had a tremendous impact on local economies across the globe. As part of their operation, e-commerce websites maintain a detailed record of user activity, storing it as a clickstream. This activity is used to analyze the browsing patterns of a particular user, helping e-commerce platforms recommend products with high accuracy during current and future visits. It also helps e-commerce marketers, as well as their technology platforms, design personalized promotional emails for their users.

In this case study, we’ll see how to analyze clickstream and user data using Pig and Hive. We’ll gather the user data from an RDBMS and capture user-behavior (clickstream) data in HDFS using Flume. Next, we’ll analyze this data with Pig and Hive, and automate the clickstream analysis using the workflow engine Oozie.
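To show the shape of the computation, here is a plain-Python sketch of the two core steps – grouping clicks per user (analogous to a GROUP BY in Hive) and counting page views overall. This is only an illustration under invented data; the course itself performs these steps with Pig and Hive over Flume-ingested logs in HDFS, and the user IDs and page paths below are made up.

```python
from collections import Counter, defaultdict

# Invented clickstream records: (user_id, page).
clicks = [
    ("u1", "/home"), ("u1", "/laptops"), ("u1", "/laptops/x200"),
    ("u2", "/home"), ("u2", "/phones"), ("u2", "/home"),
]

# Per-user browsing pattern (what a GROUP BY user would produce).
sessions = defaultdict(list)
for user, page in clicks:
    sessions[user].append(page)

# Most-viewed pages overall - a simple signal for recommendations.
page_counts = Counter(page for _, page in clicks)

print(sessions["u1"])              # → ['/home', '/laptops', '/laptops/x200']
print(page_counts.most_common(1))  # → [('/home', 3)]
```

In the real pipeline the grouping and counting are expressed declaratively in Pig Latin or HiveQL, and Oozie schedules the whole flow to run repeatedly.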

Curriculum

Introduction/ Installation of Virtual Box and the Big Data VM, Introduction to Linux, Why Linux?, Windows and the Linux equivalents, Different flavors of Linux, Unity Shell (Ubuntu UI), Basic Linux Commands (enough to get started with Hadoop)

  • 3V (Volume- Variety- Velocity) characteristics
  • Structured and unstructured data
  • Application and use cases of Big Data
  • Limitations of traditional large scale systems
  • How a distributed way of computing is superior (cost and scale)
  • Opportunities and challenges with Big Data
  • HDFS Overview and Architecture
  • Deployment Architecture
  • Name Node
  • Data Node and Checkpoint Node (aka Secondary Name Node)
  • Safe mode
  • Configuration files
  • HDFS Data Flows (Read/Write)
  • CRC Checksum
  • Data Replication
  • Rack awareness and block placement policy
  • Small file problems
  • Command-Line Interface
  • File Systems
  • Administrative
  • Web Interfaces
  • Load Balancer
  • DistCp (Distributed Copy)
  • HDFS Federation
  • HDFS High Availability
  • Hadoop Archives
  • MapReduce overview
  • Functional Programming paradigms
  • How to think in a MapReduce way
  • Legacy MR v/s Next Generation MapReduce (YARN/ MRv2)
  • Slots v/s Containers
  • Schedulers
  • Shuffling, Sorting
  • Hadoop Data Types
  • Input and Output Formats
  • Input Splits – Partitioning (Hash Partitioner v/s Custom Partitioner)
  • Configuration files
  • Distributed Cache
  • Ad-hoc Querying
  • Graph Computing Engines
  • Standalone mode (in Eclipse)
  • Pseudo Distributed mode (as in the Big Data VM)
  • Fully Distributed mode (as in Production)
  • MR API
  • Old and the New MR API
  • Java Client API
  • Hadoop data types
  • Custom Writable
  • Different input and output formats
  • Saving Binary Data using Sequence Files and Avro Files
  • Hadoop Streaming (developing and debugging non Java MR programs – Ruby and Python)
  • Speculative execution
  • Combiners
  • JVM Reuse
  • Compression
  • Sorting
  • Term Frequency
  • Inverse Document Frequency
  • Student Database
  • Max Temperature
  • Different ways of joining data
  • Word Co-occurrence
  • Click Stream Analysis using Pig and Hive
  • Analyzing the Twitter data with Hive
  • Further ideas for data analysis
  • HBase Data Modeling
  • Bulk loading data in HBase
  • HBase Coprocessors – Endpoints (similar to Stored Procedures in RDBMS)
  • HBase Coprocessors – Observers (similar to Triggers in RDBMS)
  • PageRank
  • Inverted Index
  • Introduction and Architecture
  • Different modes of executing Pig constructs
  • Data Types
  • Dynamic Invokers
  • Pig streaming Macros
  • Pig Latin language Constructs (LOAD, STORE, DUMP, SPLIT, etc)
  • User-Defined Functions
  • Use Cases
  • NoSQL Databases – 1 (Theoretical Concepts)
  • NoSQL Concepts
  • Review of RDBMS
  • Need for NoSQL
  • Brewer’s CAP Theorem
  • ACID v/s BASE
  • Schema on Read vs. Schema on Write
  • Different levels of consistency
  • Bloom filters
  • Key Value
  • Columnar, Document
  • Graph
  • HBase Architecture
  • Master and the Region Server
  • Catalog tables (ROOT and META)
  • Major and Minor Compaction
  • Configuration Files
  • HBase v/s Cassandra
  • Java API
  • Client API
  • Filters
  • Scan Caching and Batching
  • Command Line Interface
  • REST API
  • Use-case of Sqoop
  • Sqoop Architecture
  • Sqoop Demo
  • Use-case of Flume
  • Flume Architecture
  • Flume Demo
  • Use-case of Oozie
  • Oozie Architecture
  • Oozie Demo
  • Use-case of YARN
  • YARN Architecture
  • YARN Demo
  • Introduction to RDD
  • Installation and Configuration of Spark
  • Spark Architecture
  • Different interfaces to Spark
  • Sample Python programs in Spark
  • Hadoop industry solutions
  • Importing/exporting data across RDBMS and HDFS using Sqoop
  • Getting real-time events into HDFS using Flume
  • Creating workflows in Oozie
  • Introduction to Graph processing
  • Graph processing with Neo4J
  • Using the Mongo Document Database
  • Using the Cassandra Columnar Database
  • Distributed Coordination with Zookeeper
  • Standalone mode (Theory)
  • Distributed mode (Theory)
  • Pseudo distributed
  • Fully-distributed
  • Cloudera Hadoop cluster on the Amazon Cloud (Practice)
  • Using EMR (Elastic Map Reduce)
  • Using EC2 (Elastic Compute Cloud)
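The Hadoop Streaming module above covers writing mappers and reducers in non-Java languages such as Python. As a hedged sketch of that style, the word-count example below simulates Hadoop's sort/shuffle step locally so it runs without a cluster; on a real cluster, the mapper and reducer would instead read lines from stdin and write "key\tvalue" lines to stdout.

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Emit (word, 1) for every word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(pairs):
    """Sum counts per word; pairs must arrive sorted by key,
    which Hadoop guarantees between the map and reduce phases."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

data = ["Hadoop streaming example", "streaming with Python"]
shuffled = sorted(mapper(data))  # local stand-in for Hadoop's shuffle/sort
for word, total in reducer(shuffled):
    print(f"{word}\t{total}")
```

The same pair of scripts, split into mapper and reducer files, could be submitted to a cluster with the `hadoop jar hadoop-streaming.jar` launcher discussed in the module.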

Prerequisites

Individuals who want to pursue a career in Big Data and Hadoop should have a basic understanding of Core Java; however, it is not mandatory.

Reach out to us for more information

Interested in this course? Let’s connect!


Course features

Course Duration

36 hours of live, online, instructor-led training

24x7 Support

Technical & query support round the clock

Lifetime LMS Access

Access all the materials on LMS anytime, anywhere

Price Match Guarantee

Guaranteed best price, aligned with the quality of deliverables

FAQs

Our instructors/trainers are Cloudera- and Hortonworks-certified professionals. They have more than 12 years of industry experience and are Big Data SMEs.

To attend the live virtual training, an internet connection of at least 2 Mbps is required.

Yes, Cognixia’s Virtual Machine can be installed on any local system. Collabera’s training team will assist you with this.

To install the Hadoop environment, you’ll need 8 GB of RAM, a 64-bit OS, 50 GB of free hard disk space, and a processor with Virtualization Technology enabled.

Access to the Learning Management System (LMS) is lifelong. That includes class recordings, presentations, sample code, and projects.

If you miss a session, don’t worry! All our sessions are available in recorded form on the LMS. We also have a technical support team that’s ready to assist you with any questions you may have.

The online live training course runs for eight weekends (15–16 sessions).