Thursday, July 9, 2015

Today we discuss the MapReduce programming model in Hadoop, with reference to the Hadoop documentation.
MapReduce processes vast amounts of data in parallel on large clusters (thousands of nodes) in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input data set into independent chunks, which are then processed in a shared-nothing model by map tasks. The outputs of the maps are then input to the reduce tasks. The framework stores the input and output of a job in a filesystem, schedules the processing, monitors the activity, and re-executes failed tasks. The framework employs a master/slave model for the execution of the map and reduce tasks.
While the jobs are monitored and completed by the framework, their configuration, including the input and output locations, is specified by the applications.
At each stage of the processing, the map, combine, and reduce steps take <key, value> pairs as input and produce <key, value> pairs as output. Both the key and value classes have to implement the Writable interface, and the key classes additionally have to implement the WritableComparable interface.
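As a plain-Java sketch of what those interfaces require (the class name here is illustrative, not part of Hadoop), a custom key needs serialization methods and a sort order:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical key type mirroring the WritableComparable contract:
// write/readFields for serialization, compareTo for sorting.
public class WordKey implements Comparable<WordKey> {
    private String word = "";

    public WordKey() {}
    public WordKey(String word) { this.word = word; }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
    }

    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
    }

    @Override
    public int compareTo(WordKey other) {
        return word.compareTo(other.word);
    }

    public String getWord() { return word; }

    public static void main(String[] args) throws IOException {
        // Round-trip the key through its serialized form.
        WordKey original = new WordKey("hadoop");
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        WordKey restored = new WordKey();
        restored.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(restored.getWord());           // hadoop
        System.out.println(original.compareTo(restored)); // 0
    }
}
```

The framework uses write/readFields to move keys between the map and reduce sides, and compareTo to sort them during the shuffle.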

For example:
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    // Tokenize the line into words and emit a <word, 1> pair for each token.
    StringTokenizer tokenizer = new StringTokenizer(value.toString());
    while (tokenizer.hasMoreTokens()) {
      output.collect(new Text(tokenizer.nextToken()), new IntWritable(1));
    }
  }
}

public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    // Sum the counts for each word and emit <word, total>.
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
The Mapper splits each line into tokens and emits a key-value pair of <word, 1> for each token. The Reducer then sums up the values, which represent occurrences, for each key.
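Outside the framework, the combined effect of the tokenizing map and the summing reduce can be sketched in plain Java (class and method names here are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch of the word-count dataflow: the "map" step emits
// a (word, 1) pair per token; the "reduce" step sums the 1s per word.
public class WordCountSketch {
    public static Map<String, Integer> countWords(String line) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : line.trim().split("\\s+")) {  // map: tokenize
            counts.merge(token, 1, Integer::sum);         // reduce: sum per key
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords("to be or not to be"));
        // {to=2, be=2, or=1, not=1}
    }
}
```

In the real job the per-key grouping is done by the framework's shuffle and sort, not by a single in-memory map, which is what lets the computation scale across nodes.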

Together the mapper and the reducer form the payload to the framework and are specified by the applications.
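In the old (org.apache.hadoop.mapred) API the application wires these pieces together in a job driver, roughly as follows; this is a configuration sketch, with input and output paths taken from the command line:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Driver sketch: configures the job and submits it to the framework.
public class WordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);          // reduce output key type
        conf.setOutputValueClass(IntWritable.class); // reduce output value type

        conf.setMapperClass(Map.class);              // the Map class above
        conf.setReducerClass(Reduce.class);          // the Reduce class above

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);                      // blocks until completion
    }
}
```

Running it requires a Hadoop installation; the driver itself only describes the job, and the framework handles scheduling, monitoring, and re-execution.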
The number of maps is driven by the total size of the input data and the block size, i.e. the total number of input blocks. The Reducer has three primary phases: shuffle, sort, and reduce. In the shuffle phase, the framework fetches the relevant sorted output of all the mappers as the reducer's input. Reducer inputs are grouped by key during the sort phase. The shuffle and sort phases occur simultaneously as map outputs are fetched. The output of the reducer is not sorted.

A Partitioner partitions the key space: it controls which reducer each key of the intermediate map output is sent to. By default, a hash function is used to derive the partition. A Reporter reports progress and relevant application-level status messages, and an OutputCollector collects the data output by the Mapper or the Reducer.
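The default hash partitioning amounts to the following (a plain-Java sketch of the logic, not Hadoop's HashPartitioner class itself):

```java
// Sketch of default hash partitioning: a given key always lands in the
// same partition, so all values for that key reach the same reducer.
public class HashPartitionSketch {
    // Mask off the sign bit so the result is non-negative, then take the
    // remainder modulo the number of reduce tasks.
    public static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int p1 = getPartition("hadoop", 4);
        int p2 = getPartition("hadoop", 4);
        System.out.println(p1 == p2);          // true: same key, same partition
        System.out.println(p1 >= 0 && p1 < 4); // true: partition is in range
    }
}
```

Because the partition depends only on the key, the grouping guarantee of the reduce phase holds regardless of which mapper produced a given pair.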
