Machine Learning tools on NoSQL databases
MapReduce is inherently suited to operations on Big Data, and the statistical methods behind Machine Learning thrive on ever larger datasets. Naturally, a programmer will look to implement such methods on Big Data using MapReduce.
What is MapReduce?
MapReduce is an arrangement of tasks that enables relatively easy scaling. It includes:
- Hardware – nodes arranged in a cluster that can communicate and process in parallel
- File system – provides distributed storage across multiple disks
- Software – processes that run on the various CPUs in the cluster
- Controller – manages mapper and reducer tasks
- Mapper – an identical task assigned to multiple CPUs, each running over its local data
- Reducer – aggregates the output from several mappers to form the end product
Mappers emit key-value pairs.
The controller sorts the key-value pairs by key.
Reducers receive the pairs grouped by key.
The map and reduce functions are the only pieces the programmer needs to write; the rest of the chores are handled by the platform.
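To make this flow concrete, here is a minimal sketch in Python that simulates the whole pipeline in a single process (a word count; the records and the single-process "cluster" are illustrative assumptions, not how a real platform distributes the work):

```python
from itertools import groupby
from operator import itemgetter

def mapper(record):
    # Emit a (key, value) pair for every word seen in this record.
    for word in record.split():
        yield (word, 1)

def reducer(key, values):
    # Aggregate all values that arrived under one key.
    return (key, sum(values))

records = ["big data big models", "more data more learning"]

# Map phase: every record produces key-value pairs.
pairs = [kv for record in records for kv in mapper(record)]

# Controller phase: sort by key so that equal keys become adjacent.
pairs.sort(key=itemgetter(0))

# Reduce phase: each reducer call sees one key with all of its values.
totals = [reducer(key, [v for _, v in group])
          for key, group in groupby(pairs, key=itemgetter(0))]

print(totals)  # [('big', 2), ('data', 2), ('learning', 1), ('models', 1), ('more', 2)]
```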
Map-reduce functions can be written in Python or JavaScript, depending on the NoSQL database and its associated technology. For example, the following databases support map-reduce operations:
MongoDB – provides a mapReduce database command (see the sketch after this list)
Riak – comes with an Erlang shell that can provide this functionality
CouchDB – comes with document identifiers that can be used with simple JavaScript functions to MapReduce
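As an illustration, a map-reduce that sums a numeric field per label could be issued to MongoDB through pymongo roughly as follows. This is a sketch, not a definitive recipe: the points collection, its fields, and the connection string are assumptions, and MongoDB has deprecated the mapReduce command in recent releases in favor of the aggregation pipeline, so it needs a server version that still supports it.

```python
from pymongo import MongoClient
from bson.code import Code

# Hypothetical collection of documents like {"label": "a", "value": 2.0}.
client = MongoClient("mongodb://localhost:27017")
db = client["testdb"]

mapper = Code("function () { emit(this.label, this.value); }")
reducer = Code("function (key, values) { return Array.sum(values); }")

# The mapReduce database command; out: {inline: 1} returns results directly.
result = db.command("mapReduce", "points",
                    map=mapper, reduce=reducer, out={"inline": 1})
print(result["results"])  # e.g. [{"_id": "a", "value": 5.0}, ...]
```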
Machine Learning methods that involve statistics usually contain a summation over some expression or its component terms. For example, least squares regression computes the sum of the squared residuals, Σᵢ(yᵢ − ŷᵢ)², and tries to minimize it.
This suits map-reduce very well: each mapper computes a partial sum of squared residuals over its local data, and the reducer merely adds the partial sums into a total to complete the calculation. Many algorithms can be arranged in this Statistical Query Model form. In some cases iteration is required; each iterative step then involves one map-reduce pass.
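A minimal sketch of that decomposition for the sum of squared residuals, with the cluster simulated as a list of data shards (the shards, the weight vector w, and the pure-Python setting are illustrative assumptions):

```python
def sse_mapper(shard, w):
    # Partial sum of squared residuals over this node's local data.
    total = 0.0
    for x, y in shard:  # x: feature vector, y: target
        prediction = sum(wi * xi for wi, xi in zip(w, x))
        total += (y - prediction) ** 2
    return total

def sse_reducer(partial_sums):
    # The reducer only has to add the mappers' partial sums together.
    return sum(partial_sums)

# Two shards, as if stored on two different nodes.
shard1 = [([1.0, 2.0], 5.0), ([0.0, 1.0], 2.0)]
shard2 = [([3.0, 1.0], 5.0)]
w = [1.0, 2.0]  # current regression weights

sse = sse_reducer(sse_mapper(shard, w) for shard in (shard1, shard2))
print(sse)  # 0.0 -- this w happens to fit the toy data exactly
```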
Let us now look at a clustering technique and how to fit it onto map-reduce.
K-means clustering proceeds this way:
Initialize:
Pick K starting guesses for the centroids at random
Iterate:
Assign each point to the cluster whose centroid is closest
Recalculate each cluster's centroid
The corresponding map-reduce is as follows:
Mapper - run through the local data and, for each point, determine the closest centroid. Accumulate the vector sum of the points closest to each centroid. The combiner emits the centroid index as key and a tuple of (partial vector sum, point count) as value.
Reducer - for each old centroid, aggregate the partial sums and counts from all mappers, then divide to calculate the new centroid.
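Here is a minimal single-process Python sketch of that recipe; the shards, the starting centroids, and the fixed number of iterations are illustrative assumptions:

```python
def closest(point, centroids):
    # Index of the centroid nearest to this point (squared Euclidean distance).
    return min(range(len(centroids)),
               key=lambda j: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[j])))

def kmeans_mapper(shard, centroids):
    # Emit (centroid index, (partial vector sum, point count)) per cluster.
    sums = {}
    for point in shard:
        j = closest(point, centroids)
        s, n = sums.get(j, ([0.0] * len(point), 0))
        sums[j] = ([si + pi for si, pi in zip(s, point)], n + 1)
    return sums.items()

def kmeans_reducer(pairs, old_centroids):
    # Aggregate sums and counts from all mappers, then divide for new centroids.
    dim = len(old_centroids[0])
    totals = {j: ([0.0] * dim, 0) for j in range(len(old_centroids))}
    for j, (s, n) in pairs:
        ts, tn = totals[j]
        totals[j] = ([a + b for a, b in zip(ts, s)], tn + n)
    # Keep the old centroid if no points landed in a cluster.
    return [[si / n for si in s] if n else old_centroids[j]
            for j, (s, n) in sorted(totals.items())]

# Two shards, as if stored on two nodes, and K = 2 starting guesses.
shards = [[[1.0, 1.0], [1.2, 0.8]], [[5.0, 5.0], [4.8, 5.2]]]
centroids = [[0.0, 0.0], [6.0, 6.0]]
for _ in range(10):  # each iterative step is one map-reduce pass
    pairs = [kv for shard in shards for kv in kmeans_mapper(shard, centroids)]
    centroids = kmeans_reducer(pairs, centroids)
print(centroids)  # ~[[1.1, 0.9], [4.9, 5.1]]
```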
Courtesy: Michael Bowles, PhD – Introduction to BigData