Hadoop Introduction

Summary -

In this topic, we described about the below sections -

Introduction

Why Hadoop?

The data created by internet usage are increased drastically every year by year. The large data is called as Big Data and needs to be processed to produce meaningful information. The need of processing the big data is the basic idea of creating Hadoop. Hadoop is the solution for the Bigdata to analyze and produce the meaningful information from it.

What is Hadoop?

Hadoop is an open source framework, distributed, scalable, batch processing and fault-tolerance system that can store and process the huge amount of data (Bigdata). Hadoop efficiently stores large volumes of data on a cluster of commodity hardware. Hadoop not only a storage system but also platform for processing large data along with storage.

Hadoop supports the coding in programming languages like Java, C, C++, Perl, Python, ruby etc. Hadoop is an efficient framework because of running jobs on multiple machines simultaneously which is parallel processing model.

Hadoop created by Doug cutting. Hadoop created with Inspiration by Google big table and map reduce area papers in 2004. It was derived by Google technology and added to practice by other systems. Hadoop named after a shuffled elephant and is originally built to support distribution for nutch engine.

Hadoop is apache open source frame work and a large-scale distributed batch processing infrastructure to process large amount of data. Hadoop has ability to scale to hundreds or thousands of computers and each with several processor centers. Hadoop can also efficiently distribute substantial amounts of work across a set of computers. Hadoop software frame work that includes no. of components which are specially designed to solve large scale distributed data storage, analysis and retrieval tasks.

Hadoop framework consists three main core components and those are -

HDFS - It is responsible for storing massive amount of data on the cluster and storage layer of Hadoop.
MAP Reduce - It is responsible for processing massive amount of data on the cluster and data processing layer of Hadoop.
Yarn - It is responsible for resource management.

HDFS and Map reduce are its kernel for Hadoop.

Advantages -

Scalable

Hadoop is a highly scalable storage platform as it can store and distribute very large data sets across hundreds of systems/servers that operate in parallel. Hadoop enables businesses to run applications on thousands of nodes involving thousands of terabytes of data processing. And also supports hardware horizontal scalability which can add the nodes during the processing without system downtime.

Cost effective

Hadoop offers a cost-effective storage solution for businesses exploding data sets.

Flexible

Hadoop manages data whether structured or unstructured, encoded or formatted, or any other type of data. Businesses can use Hadoop to derive valuable business insights from data sources such as social media, email conversations. Hadoop brings the value to the table where unstructured data can be useful in decision making process.

Faster

Hadoop’s unique storage method is based on a distributed file system. The tools for data processing are often on the same servers where the data is located, resulting in much faster data processing. For example, If dealing with large volumes of unstructured data, Hadoop is able to efficiently process terabytes of data in just minutes, and petabytes in hours.

Fault Tolerant

The data sent to one individual node and the same data also replicates on other nodes in the same cluster. If the individual node failed to process the data, the other nodes in the same cluster available to process the data. A key advantage of using Hadoop is its fault tolerance.

Disadvantages -

Security Concerns

Hadoop security model is disabled by default due to sheer complexity and also missing encryption at the storage and network levels.

Vulnerable By Nature

The framework is written almost entirely in Java which is the most widely used controversial programming languages in existence. Java has been heavily exploited and as a result, implicated in numerous security breaches.

Not Fit for Small Data

Due to its high capacity design, the HDFS, lacks the ability to efficiently support the reading of small files. As a result, it is not recommended for organizations with small quantities of data.

Potential Stability Issues

As open source software, Hadoop has had its stability issues.

General Limitations

History -

Hadoop was created by Doug Cutting and Michael – J. Cafarella.
Doug was working for yahoo at the time and named it after his son’s toy elephant.
Hadoop was originally developed to support Nutch search engine project.
Hadoop is designed based on the work done by Google.