MapReduce API

Summary -

This topic describes the main classes and interfaces used in MapReduce programming. The MapReduce API concepts covered are -

JobContext interface
Job class
Mapper class
Reducer class

JobContext interface -

The JobContext interface is the super-interface of the context interfaces in the MapReduce API, and the Job class is its main implementation. It gives tasks a read-only view of the job while they are running.

The JobContext sub-interfaces include -

MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> - Defines the context given to the mapper.
ReduceContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> - Defines the context passed to the reducer.

Job class -

The Job class represents the job submitter's view of the job. It is the most important class of the MapReduce API: it allows the user to configure the job, submit it, control its execution and query its state.

The set methods work only until the job is submitted; afterwards they throw an IllegalStateException. In general, the user creates the application, describes the various facets of the job via Job, then submits the job and monitors its progress.

The example below shows how to submit a job -

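     // Imports needed by this snippet
     import org.apache.hadoop.fs.Path;
     import org.apache.hadoop.mapreduce.Job;
     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
     import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

     // Create a new Job
     Job job = Job.getInstance();
     job.setJarByClass(MyJob.class);
     // Specify various job-specific parameters
     job.setJobName("myjob");
     // Input and output paths are set through the file-based format classes
     FileInputFormat.addInputPath(job, new Path("in"));
     FileOutputFormat.setOutputPath(job, new Path("out"));
     job.setMapperClass(MyJob.MyMapper.class);
     job.setReducerClass(MyJob.MyReducer.class);
     // Submit the job, then poll for progress until the job is complete
     job.waitForCompletion(true);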

Constructors of Job class -

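Job()
Job(Configuration conf)
Job(Configuration conf, String jobName)

These constructors are deprecated in Hadoop 2 and later; the Job.getInstance() factory methods are preferred.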

Some Methods of Job class -

cleanupProgress(): Gets the progress of the job's cleanup tasks
getCounters(): Gets the counters for this job
getFinishTime(): Gets the finish time of the job
getJobFile(): Gets the path of the submitted job configuration
getJobName(): Gets the job name specified by the user
getJobState(): Gets the current state of the job
getPriority(): Gets the priority of the job
getStartTime(): Gets the start time of the job
isComplete(): Checks whether the job has finished
isSuccessful(): Checks whether the job completed successfully
setInputFormatClass(Class): Sets the input format for the job
setJobName(String name): Sets the user-specified job name
setOutputFormatClass(Class): Sets the output format for the job
setMapperClass(Class): Sets the mapper for the job
setReducerClass(Class): Sets the reducer for the job
setPartitionerClass(Class): Sets the partitioner for the job
setCombinerClass(Class): Sets the combiner for the job

Mapper class (Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>) -

Maps input key/value pairs to a set of intermediate key/value pairs. Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records.

A given input pair may map to zero or many output pairs.
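
The zero-or-many behaviour can be illustrated without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: the class TokenizingMapSketch and its map method are hypothetical names that only imitate what a word-count style map() might emit for one input line. Note that the input key (a byte offset) and the output key (a word) have different types.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a word-count mapper; it does not use any Hadoop classes.
class TokenizingMapSketch {

    // Emits one (word, 1) pair per whitespace-separated token in the input line.
    // The offset plays the role of the input key and is ignored here.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // One input pair, six output pairs
        System.out.println(map(0L, "to be or not to be"));
        // One input pair, zero output pairs
        System.out.println(map(19L, "   "));
    }
}
```

A blank line produces an empty list, while a line with six words produces six pairs, so the number of output pairs depends entirely on the content of the input record.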

Constructors of Mapper class -

Mapper()

Methods of Mapper class -

setup(org.apache.hadoop.mapreduce.Mapper.Context context) - Called once at the beginning of the task.
map(KEYIN key, VALUEIN value, org.apache.hadoop.mapreduce.Mapper.Context context) - Called once for each key/value pair in the InputSplit.
run(org.apache.hadoop.mapreduce.Mapper.Context context) - Expert users can override this method for more control over the execution of the Mapper.
cleanup(org.apache.hadoop.mapreduce.Mapper.Context context) - Called once at the end of the task.
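
The order in which these methods are called can be sketched in plain Java. This is only an analogue of the default run() behaviour (setup once, map once per record, cleanup once at the end); the class LifecycleSketch, its simplified method signatures and the recorded call list are assumptions made for illustration and are not part of the Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical plain-Java analogue of the Mapper lifecycle; no Hadoop classes are used.
class LifecycleSketch {
    // Records every lifecycle call so the order can be inspected afterwards.
    final List<String> calls = new ArrayList<>();

    void setup() { calls.add("setup"); }

    void map(String record) { calls.add("map:" + record); }

    void cleanup() { calls.add("cleanup"); }

    // Mirrors the default ordering: setup once, map per record, cleanup once.
    // The finally block ensures cleanup runs even if map throws.
    void run(Iterator<String> records) {
        setup();
        try {
            while (records.hasNext()) {
                map(records.next());
            }
        } finally {
            cleanup();
        }
    }

    public static void main(String[] args) {
        LifecycleSketch task = new LifecycleSketch();
        task.run(Arrays.asList("r1", "r2").iterator());
        System.out.println(task.calls); // [setup, map:r1, map:r2, cleanup]
    }
}
```

Overriding run() in a real Mapper replaces this loop entirely, which is why it is recommended only for users who need that level of control.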

Reducer class (Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>) -

The Reducer class defines the reduce task in MapReduce. It reduces a set of intermediate values that share a key to a smaller set of values. Reducer implementations can access the Configuration for the job via the JobContext.getConfiguration() method.

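The three phases of a reducer are -

Shuffle - The reducer copies the sorted output from every mapper across the network using HTTP.
Sort - The framework merge-sorts the reducer inputs by key, since different mappers may have emitted the same key. The shuffle and sort phases occur simultaneously: outputs are merged while they are being fetched.
Reduce - The reduce(Object, Iterable, Context) method is called once for each key and its collection of values in the sorted input.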
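
The three phases can be imitated in memory to show what data the reduce step receives. The sketch below is plain Java under simplifying assumptions: there is no network transfer, a TreeMap stands in for the framework's merge sort, and the names ReducePhasesSketch, shuffleAndSort, pair and demo are hypothetical helpers, not Hadoop APIs.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the reduce side; no Hadoop classes or network I/O.
class ReducePhasesSketch {

    static Map.Entry<String, Integer> pair(String key, int value) {
        return new SimpleEntry<>(key, value);
    }

    // Shuffle + sort: gather pairs from all mapper outputs, grouped and ordered by key.
    static TreeMap<String, List<Integer>> shuffleAndSort(
            List<List<Map.Entry<String, Integer>>> mapperOutputs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapperOutputs) {
            for (Map.Entry<String, Integer> p : output) {
                grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
            }
        }
        return grouped;
    }

    // Reduce: called once per key with all of that key's values; here it sums them.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the phases over two sample mapper outputs and returns key -> reduced value.
    static TreeMap<String, Integer> demo() {
        List<Map.Entry<String, Integer>> mapper1 = Arrays.asList(pair("to", 1), pair("be", 1));
        List<Map.Entry<String, Integer>> mapper2 =
                Arrays.asList(pair("not", 1), pair("to", 1), pair("be", 1));
        List<List<Map.Entry<String, Integer>>> all = Arrays.asList(mapper1, mapper2);

        TreeMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffleAndSort(all).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // {be=2, not=1, to=2}
    }
}
```

Each key reaches reduce exactly once, together with every value emitted for it by any mapper, and the keys arrive in sorted order.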

Constructors of Reducer class -

Reducer()

Methods of Reducer class -

setup(org.apache.hadoop.mapreduce.Reducer.Context context) - Called once at the start of the task.
reduce(KEYIN key, Iterable<VALUEIN> values, org.apache.hadoop.mapreduce.Reducer.Context context) - This method is called once for each key.
run(org.apache.hadoop.mapreduce.Reducer.Context context) - Advanced application writers can override this method to control how the reduce task works.
cleanup(org.apache.hadoop.mapreduce.Reducer.Context context) - Called once at the end of the task.