In this topic, we cover what MapReduce is, an overview of how it works, and its key features.

MapReduce is a parallel programming model for processing huge amounts of data. It is the data-processing layer of Hadoop: a software framework for easily writing applications that process vast amounts of structured and unstructured data stored in HDFS, in parallel, on large clusters of commodity hardware.

MapReduce can also derive structured data from unstructured data. It provides automatic parallelization and distribution, fault tolerance, I/O scheduling, and monitoring and status updates. Hadoop can run MapReduce programs written in various languages, such as Java, Ruby, and Python.
What is MapReduce?
MapReduce is a processing technique and programming model for distributed computing. It makes it easy to distribute tasks across nodes and to perform sort- or merge-based distributed computation.
The underlying system takes care of partitioning the input data, scheduling the program's execution across several machines, handling machine failures, and managing inter-machine communication.
Computation runs on both structured and unstructured data, and many implementation options are available. MapReduce is fault-tolerant, reliable, and scales to thousands of nodes.
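The model itself is easy to sketch outside Hadoop. Below is a minimal, single-machine Python sketch of the map and reduce phases using word count as the classic illustration; the function names are illustrative, not Hadoop APIs:

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    """Reduce: combine all counts emitted for one word."""
    return (word, sum(counts))

def mapreduce_word_count(documents):
    # Shuffle/sort: group all intermediate values by key.
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_phase(doc):
            grouped[key].append(value)
    # Reduce each key independently (this is what parallelizes).
    return dict(reduce_phase(k, v) for k, v in sorted(grouped.items()))

result = mapreduce_word_count(["the quick fox", "the lazy dog"])
# → {'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

Because each reduce call sees only one key and its values, the reduce step for different keys can run on different machines without coordination.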
MapReduce architecture provides –
- Automatic parallelization & distribution
- I/O Scheduling
- Monitoring & Status updates
MapReduce Overview -
Data-processing applications on Hadoop are written using the MapReduce paradigm. A MapReduce job usually splits the input data-set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which then become the input to the reduce tasks.
The framework takes care of scheduling tasks, monitoring them, and re-executing any that fail. Both the input and the output of a job are stored in the file system.
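The flow just described — independent input chunks, fully parallel map tasks, a sort of the map outputs, then the reduce phase — can be sketched in plain Python. This is a toy model, not the Hadoop framework itself; a thread pool stands in for the cluster nodes:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby
from operator import itemgetter

def map_task(chunk):
    """One map task: emit (word, 1) pairs for its chunk of the input."""
    return [(w, 1) for w in chunk.split()]

# The input data-set is split into independent chunks, one per map task.
chunks = ["to be or", "not to be"]

# Map tasks run in a completely parallel manner.
with ThreadPoolExecutor() as pool:
    map_outputs = list(pool.map(map_task, chunks))

# The framework sorts the map outputs by key before the reduce phase;
# no reduce can start until every map task has finished.
intermediate = sorted(pair for output in map_outputs for pair in output)

# Each reduce task receives one key together with all of its values.
result = {key: sum(v for _, v in group)
          for key, group in groupby(intermediate, key=itemgetter(0))}
# → {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

The `sorted(...)` step is the stand-in for the shuffle/sort the framework performs between the two phases.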
MapReduce applications specify the input/output locations and supply the map and reduce functions by implementing the appropriate Hadoop interfaces, such as Mapper and Reducer.
The Hadoop job client then submits the job and its configuration to the JobTracker, which distributes them to the slave nodes, schedules tasks, monitors them, and provides status and diagnostic information back to the job client.
MapReduce can’t control the order in which maps or reduces run. For maximum parallelism, maps and reduces should not depend on data generated within the same MapReduce job.
A database with an index will always be faster than a MapReduce job over un-indexed data. Reduce operations do not take place until all maps are complete.
MapReduce offers –
- Automatic parallelization and distribution.
- Fault tolerance, with status and monitoring tools.
- Programs normally written in Java, though other languages can be used via the Streaming API.
- A clean abstraction for programmers.
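As a sketch of the Streaming API style mentioned above, the mapper and reducer below follow the line-oriented convention Hadoop Streaming expects: each stage reads lines and writes tab-separated key/value lines. The local chaining at the end stands in for what the framework would do on a cluster:

```python
def mapper(lines):
    """Streaming mapper: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Streaming reducer: input arrives sorted by key, so equal keys
    are adjacent and can be summed with a running total."""
    current, total = None, 0
    for line in lines:
        word, count = line.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

# In a real streaming job, Hadoop pipes HDFS data through the two
# scripts (roughly: hadoop jar hadoop-streaming.jar -input ... -output ...
# -mapper mapper.py -reducer reducer.py). Chained locally for illustration:
mapped = sorted(mapper(["to be or", "not to be"]))
for line in reducer(mapped):
    print(line)
```

On a cluster the sort between the two stages is done by the framework, which is why the reducer can assume its input is grouped by key.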