Summary -

In this topic, we covered the following points -

Hadoop is an open-source, distributed, scalable, fault-tolerant framework for batch processing that can store and process huge amounts of data (Big Data). Hadoop efficiently stores large volumes of data on a cluster of commodity hardware.

Hadoop is not only a storage system but also a platform for processing large data sets. Its daemons fall into two major groups -

  1. Master Daemons
  2. Slave Daemons

These two groups consist of the following daemons.

  1. Master Daemons
    • Name Node
    • Secondary Name Node
    • Job Tracker
  2. Slave Daemons
    • Data Node
    • Task Tracker

These five daemons must run for Hadoop to be functional. HDFS provides the storage layer and MapReduce provides the computation layer in Hadoop. There is one name node and several data nodes on the storage layer (HDFS).

There is a resource manager and several node managers on the computation layer (MapReduce). (The job tracker and task tracker listed above are the Hadoop 1 names for these roles; in Hadoop 2 they were replaced by the YARN resource manager and node managers.) The name node (HDFS) and resource manager (MapReduce) run on the master, while the data nodes (HDFS) and node managers (MapReduce) run on the slaves. A client submits data and a program to Hadoop; HDFS is used to store the data and MapReduce is used to process it.
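The daemon placement described above can be summarized in a small sketch. This is not a Hadoop API, just a plain-Python summary of which daemon runs on which kind of node, using the Hadoop 1 names from this topic:

```python
# Illustrative layout of the five Hadoop 1 daemons across node types.
# The dictionary below is an assumption-free restatement of the text,
# not a real Hadoop configuration structure.
CLUSTER_LAYOUT = {
    "master": {
        "HDFS": ["NameNode", "SecondaryNameNode"],
        "MapReduce": ["JobTracker"],
    },
    "slave": {
        "HDFS": ["DataNode"],
        "MapReduce": ["TaskTracker"],
    },
}

def daemons_on(node_type):
    """Return all daemons that run on a given node type ('master' or 'slave')."""
    layers = CLUSTER_LAYOUT[node_type]
    return [d for daemons in layers.values() for d in daemons]

print(daemons_on("master"))  # ['NameNode', 'SecondaryNameNode', 'JobTracker']
print(daemons_on("slave"))   # ['DataNode', 'TaskTracker']
```

Grouping the daemons by layer as well as by node type mirrors the point made above: HDFS and MapReduce each contribute daemons to both the master and the slaves.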

Below are the two main parts of Hadoop processing -

Hadoop Data Storage: -

HDFS (Hadoop Distributed File System) is the primary storage system of Hadoop. It is a distributed file system designed to run on commodity hardware and to hold very large amounts of data, on the order of terabytes or petabytes.

HDFS stores data reliably even in the case of machine failure, provides high-throughput access to application data, and is suitable for applications with large data sets. HDFS is a highly fault-tolerant and self-healing distributed file system.

HDFS was developed specifically for large-scale data processing workloads where scalability, flexibility and throughput are critical. The data submitted for processing is broken into fixed-size chunks called blocks.

A block is the smallest unit of data that HDFS stores, and Hadoop distributes the data blocks across multiple nodes. Once all the blocks of a file are stored on the data nodes, the user can process the data.
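As a rough illustration of how a file is carved into blocks, the sketch below computes the byte ranges of each block. The 128 MB block size is an assumption for illustration: it is a common default (configurable via dfs.blocksize), and older Hadoop versions defaulted to 64 MB.

```python
# Sketch: splitting a file into HDFS-style fixed-size blocks.
# BLOCK_SIZE is an assumed default (dfs.blocksize is configurable).
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB in bytes

def split_into_blocks(file_size):
    """Return (start_offset, length) pairs for each block of a file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(BLOCK_SIZE, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file becomes two full 128 MB blocks plus a 44 MB tail block.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))                     # 3
print(blocks[-1][1] // (1024 * 1024))  # 44
```

Note that the last block holds only the remaining bytes rather than a full 128 MB, which is also how HDFS behaves: a short final block does not occupy a full block's worth of disk space.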

Hadoop Data Processing: -

Hadoop MapReduce is the data processing layer. MapReduce is a framework for writing applications that process the vast amounts of data stored in HDFS.

MapReduce processes huge amounts of data in parallel by dividing a job into a set of independent tasks. In Hadoop, MapReduce breaks the processing into two phases -

  • Map – Mapping is the first phase of processing. All the complex logic and business rules are coded in the map phase. The map takes a set of data and converts it into another set of data, breaking individual elements into tuples (key-value pairs).
  • Reduce – Reduce is the second phase of processing. Lighter-weight processing such as aggregation or summation is coded here. The output of the map is the input to the reducer, which combines the tuples by key and produces an aggregated value for each key.
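The two phases above can be sketched in plain Python using the classic word-count example. No Hadoop libraries are involved here; the real framework also shuffles and sorts the map output across the cluster between the two phases, which the small `shuffle` helper stands in for:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: break each input record into (key, value) tuples."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Group map output by key (the framework does this between the phases)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: lightweight aggregation per key - here, a sum."""
    return {key: sum(values) for key, values in grouped.items()}

lines = ["hadoop stores data", "hadoop processes data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

This mirrors the division of labor described above: the mapper emits one tuple per word (the "complex" per-record logic), and the reducer only sums the values it receives for each key.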