Saturday, November 14, 2015

Let us look at applying the GLMNet algorithm to big data using MapReduce, continuing our discussion from the previous posts.
This is one of the popular algorithms for machine learning. As with most regression analysis, more data is better. But this algorithm can work even when we have less data, because it gives us methods to 1) determine whether we are overfitting and
2) hedge our answers if we are overfitting.
With ordinary least squares regression, the objective is to minimize the sum of squared errors between the actual values and our linear approximation. Coefficient shrinkage methods add a penalty on the coefficients.
This yields a family of regression solutions, ranging from one that is completely insensitive to the input data to unconstrained ordinary least squares.
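To make this concrete, the penalized least-squares objective can be written (in standard notation that I am adding here, not a formula from the original discussion) as

$$
\min_{\beta}\; \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - x_i^{T}\beta\right)^2 \;+\; \lambda\, P(\beta)
$$

where $P(\beta)$ is the coefficient penalty and $\lambda$ controls its strength: $\lambda = 0$ recovers ordinary least squares, while a very large $\lambda$ drives the coefficients toward zero and makes the solution insensitive to the input data.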
First we build a regularized regression. We want to eliminate over-fitting because doing so tunes the model so that it gives the best performance on held-out data.
To avoid over-fitting in this step, we have to exert control over the degrees of freedom in the regression. We can cut back on attributes through subset selection, or we can penalize the regression coefficients with coefficient shrinkage, as in ridge regression and lasso regression (a small sketch contrasting the two follows below).
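As a rough illustration of how the two shrinkage penalties behave (this is a sketch I am adding, using scikit-learn and synthetic data, not code from Bowles' discussion), ridge shrinks all the coefficients toward zero while lasso drives some of them exactly to zero:

```python
# Sketch: contrast ridge (sum of squares penalty) with lasso (sum of absolute values).
# Synthetic data, purely illustrative.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 10)
# Only the first three attributes actually matter.
true_coef = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ true_coef + 0.5 * rng.randn(100)

ridge = Ridge(alpha=1.0).fit(X, y)   # shrinks every coefficient a little
lasso = Lasso(alpha=0.1).fit(X, y)   # sets some coefficients exactly to zero (sparse solution)

print("ridge coefficients:", np.round(ridge.coef_, 2))
print("lasso coefficients:", np.round(lasso.coef_, 2))
```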
With coefficient shrinkage, different penalties give solutions with different properties.
With more than one attribute, the coefficient penalty function has important effects on the solutions. Some choices of coefficient penalty give control over the sparseness of the solution. Two types of coefficient penalty are used most frequently: the sum of squared coefficients and the sum of absolute values of the coefficients. The GLMNet algorithm incorporates the ElasticNet penalty, which is a weighted combination of the sum of squares and the sum of absolute values.
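In the notation commonly used in the glmnet literature (my paraphrase, not a formula from the post), the ElasticNet penalty blends the two with a mixing parameter $\alpha$:

$$
\lambda\left[\alpha\,\lVert\beta\rVert_1 \;+\; \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\right]
$$

so that $\alpha = 1$ gives the lasso (sum of absolute values) and $\alpha = 0$ gives ridge regression (sum of squares), with $\lambda$ again setting the overall strength of the penalty.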
Courtesy: Michael Bowles on applying GLMNet to MapReduce, and PBWorks for GLMNet.
What I find interesting in these discussions is the use of coefficient shrinkage, a somewhat complicating factor, which seems to improve an otherwise simple regression technique. Similarly, the 'summation form' lends itself to MapReduce because different mappers can compute those sums locally for their group and the reducer can aggregate them; it is inherently simple. However, we could push MapReduce beyond the summation form to accommodate more algorithms based on other techniques of parallelization. Such techniques include not only partitioning data and computations but also time-slicing. One such example is to parallelize the iterations. Currently only the reducer iterates, based on the aggregation of the previous iteration's variables from the mappers, but this could be pushed down to the mappers, with the reducer aggregating not just by sum but by other functions such as max, min, or count.
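To make the summation form concrete, here is a minimal sketch of my own (for ridge-penalized least squares rather than the full GLMNet coordinate descent): each mapper computes the partial sums X'X and X'y over its shard, and the reducer adds them up and solves the penalized normal equations once.

```python
# Sketch of the "summation form" on sharded data: mappers emit local X'X and X'y,
# the reducer sums them and solves the ridge-penalized normal equations.
# Illustrates the idea for ridge regression, not the full GLMNet solver.
import numpy as np

def mapper(X_shard, y_shard):
    # Each mapper sees only its own rows and returns local sufficient statistics.
    return X_shard.T @ X_shard, X_shard.T @ y_shard

def reducer(partials, lam):
    # Aggregation here is a plain sum; other jobs might aggregate by max, min, or count.
    xtx = sum(p[0] for p in partials)
    xty = sum(p[1] for p in partials)
    n_features = xtx.shape[0]
    # Solve (X'X + lam*I) beta = X'y, the ridge estimate over the full data set.
    return np.linalg.solve(xtx + lam * np.eye(n_features), xty)

# Toy usage: split one data set into three "mapper" shards.
rng = np.random.RandomState(1)
X = rng.randn(300, 5)
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + 0.1 * rng.randn(300)
shards = [(X[i::3], y[i::3]) for i in range(3)]
beta = reducer([mapper(Xs, ys) for Xs, ys in shards], lam=1.0)
print(np.round(beta, 2))
```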
