
Getting started with Big Data


August 24, 2015 | Big Data

In the previous blog, we looked at how the Apache Software Foundation (ASF) works. The software developed under the ASF, such as Hadoop, Pig, Hive, Spark and others, is becoming the foundation for building Big Data platforms. There is no denying that without the ASF the IT world would have been entirely different. The ASF is shaping not only the way we think of what software is, but also how it should be built. So far so good; the only gripe is that much of the focus in ASF projects is on building new and cool features, and less on usability.

It’s not easy for those who want to get started with Big Data to install the myriad Apache software packages and integrate them. There are entire companies (like Cloudera, Hortonworks, MapR, and others) whose main purpose is to take the different software from the ASF and make sure the pieces play nicely with each other. For those who are interested in getting started with Big Data but are stuck on installation and configuration, Cognixia has created the following.

Big Data Virtual Machine (VM): The VM runs Ubuntu, and the different Big Data software like Hadoop, Hive, Pig, Sqoop, Oozie, etc. is already installed and configured so that any Big Data enthusiast can easily get started. The VM runs on a laptop or desktop with a minimum of 3 GB of RAM, 20 GB of hard disk space and a processor bought within the last 3-4 years.
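Before downloading the VM, it can help to check a machine against the minimums mentioned above. The following is only a minimal sketch in Python and is not part of the Cognixia VM: the function names are made up for illustration, the 20 GB figure is assumed to mean free disk space, and the RAM lookup relies on `os.sysconf`, which is not available on Windows.

```python
import os
import shutil

MIN_RAM_GB = 3    # minimum RAM stated in the post
MIN_DISK_GB = 20  # minimum disk stated in the post (assumed here to mean free space)


def meets_minimum(ram_gb, disk_gb):
    """Return True when both values reach the stated minimums."""
    return ram_gb >= MIN_RAM_GB and disk_gb >= MIN_DISK_GB


def free_disk_gb(path="."):
    """Free space, in GB, on the drive that contains `path`."""
    return shutil.disk_usage(path).free / 1024**3


def total_ram_gb():
    """Physical RAM in GB, or None where os.sysconf cannot report it."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None
    return pages * page_size / 1024**3


if __name__ == "__main__":
    ram = total_ram_gb()
    disk = free_disk_gb()
    if ram is None:
        print("Could not detect RAM on this platform; check it manually.")
    else:
        print(f"RAM: {ram:.1f} GB, free disk: {disk:.1f} GB")
        print("Looks sufficient" if meets_minimum(ram, disk) else "Below the stated minimum")
```

Note that a machine sold as "3 GB" may report slightly less than that, so treat a borderline result as a hint rather than a verdict.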

[Image: Big Data virtual machine]

All the software used in the Big Data VM is free and open source, so there is no expiry time. The good thing about the VM is that it is self-contained and can work offline; there is no need to be connected to the internet. For those who are really curious about the Big Data VM, below is how the different software is stacked together to get the desired productivity.

Big Data Cluster: The VM runs on the local machine; it is like a personal edition of the different software that helps one get started very easily. Cognixia has also created a cluster (a group of machines) using the Cloudera CDH software. This cluster mimics a production environment with multiple machines. It should be possible to log in to the gateway remotely from the local machine and then run MapReduce programs, put/get data in HBase and carry out other activities.
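For readers who have not yet seen a MapReduce program, the idea behind it can be sketched in a few lines of plain Python. This is not Hadoop code and does not use any Hadoop API; the function names are hypothetical and everything runs in a single process. It only illustrates the three steps (map, shuffle, reduce) that a cluster would distribute across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["big data big cluster", "data on the cluster"]
    print(word_count(sample))
```

On a real cluster the mapper and reducer would run on different machines and the framework would handle the shuffle; the logic of each step stays the same.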

Since it is a production-like environment and multiple users would be accessing it at any point in time, only limited access will be given to the Cognixia cluster. Access to the Big Data Virtual Machine (VM) and the Cloudera Big Data Cluster is provided for those who enroll in a Big Data course with Cognixia. There is no limit on how long the VM can be used, while access to the cluster is provided to all participants for two years from the beginning of the course. Here are more details about the different course offerings from Cognixia.
