Saturday, November 7, 2015

Machine Learning tools on NoSQL databases 
MapReduce is inherently suited to operations on BigData, and the statistical methods behind Machine Learning thrive on more and more data. Naturally, a programmer will look to implement such methods on BigData using MapReduce.
What is Map-Reduce?
MapReduce is an arrangement of tasks that enables relatively easy scaling. It includes:
  1. Hardware arrangement – nodes in a cluster that can communicate and process in parallel
  2. File system – provides distributed storage across multiple disks
  3. Software processes – run on the various CPUs in the cluster
  4. Controller – manages the mapper and reducer tasks
  5. Mapper – an identical task assigned to multiple CPUs, each running over its own local data
  6. Reducer – aggregates output from several mappers to form the end product
Mappers emit key-value pairs
The controller sorts the key-value pairs by key
Reducers receive the pairs grouped by key
The map and reduce functions are the only code the programmer needs to write; the rest of the chores are handled by the platform.
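To make the flow concrete, here is a minimal single-process sketch in Python that mimics the map, sort-by-key, and reduce phases with a word count (the documents and function names are illustrative; a real platform would run the mappers on separate nodes over their local data):

```python
from itertools import groupby
from operator import itemgetter

def mapper(document):
    # Emit a (word, 1) key-value pair for every word in the local data.
    for word in document.split():
        yield (word.lower(), 1)

def reducer(key, values):
    # Aggregate all values that share a key into the end product.
    return (key, sum(values))

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase: every mapper runs the same code over its own document.
pairs = [pair for doc in documents for pair in mapper(doc)]

# Shuffle/sort phase: the controller groups the pairs by key.
pairs.sort(key=itemgetter(0))

# Reduce phase: each reducer sees all the values for one key.
counts = [reducer(key, (v for _, v in group))
          for key, group in groupby(pairs, key=itemgetter(0))]
print(counts)  # [('brown', 1), ('dog', 1), ('fox', 2), ...]
```

Only mapper and reducer are problem-specific; the sorting and grouping in the middle is exactly the chore the platform takes off the programmer's hands.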
Map-reduce jobs can be written in Python or JavaScript depending on the NoSQL database and associated technology. For example, the following databases support map-reduce operations:
MongoDB – provides a mapReduce database command
Riak – comes with an Erlang shell that can provide this functionality
CouchDB – supports views defined by simple JavaScript map and reduce functions over documents
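As a sketch of the MongoDB route, the mapReduce command can be issued from Python through pymongo. The collection and field names below (an orders collection with customer and amount fields) are hypothetical, and a local server is assumed:

```python
from pymongo import MongoClient
from bson.code import Code

client = MongoClient()  # assumes a MongoDB server on localhost
db = client.test

# The map and reduce functions are JavaScript, shipped to the server.
mapper = Code("function () { emit(this.customer, this.amount); }")
reducer = Code("function (key, values) { return Array.sum(values); }")

# Run the mapReduce database command over a hypothetical 'orders' collection.
result = db.command("mapReduce", "orders",
                    map=mapper, reduce=reducer,
                    out={"inline": 1})
for doc in result["results"]:
    print(doc["_id"], doc["value"])
```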
Machine Learning methods that involve statistics usually contain a summation over some expression or term. For example, least squares regression requires computing the sum of squared residuals and tries to minimize it.
This suits map-reduce very well: the mappers can compute partial sums of squared residuals over their local data, and the reducer merely aggregates the partial sums into a total to complete the calculation. Many algorithms can be arranged in this Statistical Query Model form. In some cases iteration is required; each iterative step then involves one map-reduce sequence.
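A minimal Python sketch of that idea, assuming the data has already been split into per-node chunks and a fixed candidate line y = w*x + b (the chunks and numbers are made up for illustration):

```python
def mapper(chunk, w, b):
    # Partial sum of squared residuals over this node's local data.
    return sum((y - (w * x + b)) ** 2 for x, y in chunk)

def reducer(partial_sums):
    # Aggregate the partial sums into the total squared error.
    return sum(partial_sums)

# Hypothetical (x, y) data split across three "nodes".
chunks = [[(1, 2.1), (2, 3.9)], [(3, 6.2)], [(4, 8.1), (5, 9.8)]]
w, b = 2.0, 0.0  # candidate line y = 2x

partials = [mapper(chunk, w, b) for chunk in chunks]  # map phase
total_sse = reducer(partials)                          # reduce phase
print(total_sse)
```

An optimizer would drive many such map-reduce passes, adjusting w and b to make the total shrink.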
Let us now look at a clustering technique and how to fit it onto map-reduce.
K-means clustering proceeds this way:
 Initialize:
            Pick K starting guesses for the centroids at random
 Iterate:
            Assign each point to the cluster whose centroid is closest
            Recalculate each cluster centroid as the mean of its assigned points
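Here is a compact single-machine version of that loop in Python (a sketch using NumPy; the sample points and K are made up):

```python
import numpy as np

def kmeans(points, k, iterations=10, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: pick K starting guesses for the centroids at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Assign each point to the cluster whose centroid is closest.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recalculate each centroid as the mean of its assigned points.
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return centroids, labels

points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```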
The corresponding map-reduce is as follows:
 Mapper – run through the local data and, for each point, determine the closest centroid. Accumulate the vector sum and count of the points closest to each centroid. The combiner emits the centroid index as key and a tuple of (partial vector sum, number of points n) as value.
Reducer – for each old centroid, aggregate the partial sums and counts n from all mappers and calculate the new centroid as the total sum divided by the total n.
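A minimal Python sketch of one such map-reduce iteration, again simulating the nodes in-process (the chunking and function names are mine, not from any particular platform):

```python
import numpy as np

def mapper(chunk, centroids):
    # For each local point, find the closest centroid and accumulate
    # the vector sum and count of the points assigned to it.
    sums = np.zeros_like(centroids)
    counts = np.zeros(len(centroids), dtype=int)
    for p in chunk:
        j = np.linalg.norm(centroids - p, axis=1).argmin()
        sums[j] += p
        counts[j] += 1
    # Emit (centroid index, (partial sum, n)) pairs.
    return [(j, (sums[j], counts[j]))
            for j in range(len(centroids)) if counts[j] > 0]

def reducer(pairs, centroids):
    # Aggregate partial sums and counts per centroid, then divide.
    totals = {}
    for j, (s, n) in pairs:
        ts, tn = totals.get(j, (0.0, 0))
        totals[j] = (ts + s, tn + n)
    new = centroids.copy()
    for j, (s, n) in totals.items():
        new[j] = s / n
    return new

chunks = [np.array([[0.0, 0.0], [0.1, 0.2]]),
          np.array([[5.0, 5.0], [5.2, 4.9]])]
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
pairs = [p for chunk in chunks for p in mapper(chunk, centroids)]  # map
centroids = reducer(pairs, centroids)                              # reduce
print(centroids)
```

Each iterative step repeats this map-reduce sequence with the newly computed centroids until they stop moving.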
Courtesy: Michael Bowles, PhD, Introduction to BigData
