Cluster computing

Wednesday, November 18, 2015

In continuation of our discussion on how to apply linear regression in an iterative manner, building it as and when the data becomes available, let us now look at the GLMNet algorithm.
As we had noted earlier, the GLMNet algorithm can work when we have little data because it gives us methods to:
1) determine if we are overfitting and
2) hedge our answers if we are overfitting
With ordinary least squares regression, the objective is to minimize the sum of squared errors between the actual values and our linear approximation. Coefficient shrinkage methods add a penalty on the coefficients.
This yields a family of regression solutions, ranging from completely insensitive to the input data to unconstrained ordinary least squares.
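To make this family of solutions concrete, here is a minimal sketch for a single predictor. It assumes the penalty is the squared slope multiplied by a strength `lam`; the function name `ridge_line` and the parameter `lam` are ours, chosen for illustration, and are not part of GLMNet. With `lam` at zero we get ordinary least squares, and as `lam` grows the slope shrinks toward zero so the line stops responding to the inputs.

```python
def ridge_line(xs, ys, lam):
    """Fit y = intercept + slope * x, penalizing slope**2 by lam.

    Illustrative only: minimizes sum((y - intercept - slope*x)**2) + lam * slope**2,
    leaving the intercept unpenalized.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / (sxx + lam)          # lam = 0 recovers ordinary least squares
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Walk the family from unconstrained to almost completely data insensitive.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
for lam in (0.0, 5.0, 1e12):
    print(lam, ridge_line(xs, ys, lam))
```

At the heavily penalized end the line is nearly flat at the mean of the observed values, which is what "insensitive to the input data" means here.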
First we build a regularized regression. We want to eliminate overfitting because doing so tunes the model to give the best performance on held-out data. To avoid overfitting, we constrain the regression with penalties such as coefficient shrinkage. Two types of coefficient shrinkage are most frequently used: the sum of squared coefficients and the sum of absolute values of the coefficients. The GLMNet algorithm incorporates the ElasticNet penalty, which is a weighted combination of the sum of squares and the sum of absolute values.
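The weighted combination can be written down in a few lines. The sketch below assumes the parametrization commonly used with glmnet, where `lam` sets the overall strength and `alpha` sets the mix (1 gives only absolute values, 0 gives only squares). The function names are ours, and this only evaluates the objective; it does not perform the fitting that GLMNet does.

```python
def elastic_net_penalty(coefs, lam, alpha):
    """Weighted mix of the sum of absolute values and the sum of squares."""
    sum_abs = sum(abs(b) for b in coefs)
    sum_sq = sum(b * b for b in coefs)
    return lam * (alpha * sum_abs + 0.5 * (1.0 - alpha) * sum_sq)


def penalized_objective(rows, ys, intercept, coefs, lam, alpha):
    """Squared error scaled by 1/(2n), plus the ElasticNet penalty.

    rows is a list of feature lists, one per observation.
    The intercept is not penalized.
    """
    n = len(ys)
    sse = 0.0
    for row, y in zip(rows, ys):
        prediction = intercept + sum(b * x for b, x in zip(coefs, row))
        sse += (y - prediction) ** 2
    return sse / (2.0 * n) + elastic_net_penalty(coefs, lam, alpha)


# The same coefficients under three different mixes.
coefs = [3.0, -4.0]
for alpha in (0.0, 0.5, 1.0):
    print(alpha, elastic_net_penalty(coefs, lam=1.0, alpha=alpha))
```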
We now look at how to use this in an incremental manner.
First, the existing data already has a regularized regression and a fitting line, shaped with coefficient shrinkage. Either the new data adjusts the existing regression, or a new regression is created for the new data and the two fitting lines are replaced by a common one.
Both methods require a slightly different computation from the first iteration because we adjust the existing line or lines instead of fitting a new line. Again, we can divide this into non-overlapping data and overlapping data. For the non-overlapping data, a new regression is the same as the first iteration. For the overlapping data, we compute the shrinkage directly against the existing line.
The shrinkage here directly gives the correction to apply. Although shrinkage was originally used to eliminate overfitting, it also seems to indicate the correction to be applied to the line. The more data points there are and the larger the shrinkages, the larger the correction applied to the line. In this way we move from a completely data-insensitive line to one that fits the data best. For the sake of convenience, we start from the existing line of the previous regression. This is the slightly different computation that we introduce.
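One way to read the overlapping case is sketched below. This is our interpretation, not a GLMNet routine: we take the residuals of the new points against the existing line, fit a penalized line to those residuals, and add the result to the existing line as the correction. The sketch assumes a squared penalty on both the intercept and slope corrections so that the result shrinks toward the existing line; `correct_line` and `lam` are hypothetical names.

```python
def correct_line(slope, intercept, xs, ys, lam):
    """Adjust an existing line using new points, shrinking toward that line.

    Minimizes sum((r - d_int - d_slope*x)**2) + lam * (d_int**2 + d_slope**2),
    where r are the residuals of the new points against the existing line.
    Needs lam > 0 or at least two distinct x values, else the system is singular.
    """
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sr = sum(residuals)
    sxr = sum(x * r for x, r in zip(xs, residuals))

    # Solve the 2x2 penalized normal equations with Cramer's rule.
    a11, a12, a22 = n + lam, sx, sxx + lam
    det = a11 * a22 - a12 * a12
    d_intercept = (sr * a22 - a12 * sxr) / det
    d_slope = (a11 * sxr - a12 * sr) / det
    return slope + d_slope, intercept + d_intercept


# Existing line y = 1 + 2x; the new points suggest y = 3x.
new_xs = [0.0, 1.0, 2.0]
new_ys = [0.0, 3.0, 6.0]
for lam in (0.0, 1.0, 1e9):
    print(lam, correct_line(2.0, 1.0, new_xs, new_ys, lam))
```

With a fixed `lam`, adding more points makes the data terms dominate the penalty, so the correction grows, which matches the behavior described above.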
If we don't want to introduce a different computation as above, we can determine a GLMNet regression for the new points exclusively and then proceed to combine the regressions from the new and existing datasets.
Combining regressions is a computation over and above the GLMNet computation for any dataset. While GLMNet works well for small datasets, this step enables the algorithm to be applied repeatedly as more and more data becomes available. It is also required when we combine regression lines, such as for non-overlapping data.
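The combining step is not specified in detail here, so the following is only one possible sketch. It assumes a single predictor and a squared penalty on the slope rather than the full ElasticNet penalty, because in that case each dataset can be summarized by a few running sums that simply add when datasets are merged. The class `LineStats` and its methods are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class LineStats:
    """Running sums that are enough to refit a penalized line later."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def add_points(self, xs, ys):
        for x, y in zip(xs, ys):
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.sxy += x * y
        return self

    def merge(self, other):
        # Sums from two datasets add, so no raw data needs to be kept.
        return LineStats(self.n + other.n, self.sx + other.sx,
                         self.sy + other.sy, self.sxx + other.sxx,
                         self.sxy + other.sxy)

    def fit(self, lam=0.0):
        # Centered sums, then the slope shrunk by lam (lam = 0 is least squares).
        sxx_c = self.sxx - self.sx * self.sx / self.n
        sxy_c = self.sxy - self.sx * self.sy / self.n
        slope = sxy_c / (sxx_c + lam)
        intercept = self.sy / self.n - slope * self.sx / self.n
        return slope, intercept


existing = LineStats().add_points([0.0, 1.0], [1.0, 3.0])
arriving = LineStats().add_points([2.0, 3.0], [5.0, 7.0])
combined = existing.merge(arriving)
print(combined.fit(lam=0.0), combined.fit(lam=5.0))
```

The merged fit is the same as fitting all the points at once with the same `lam`, which is what makes this form convenient for data that arrives in batches. A penalty on absolute values does not reduce to sums in this way, so the full ElasticNet case would need a refit or a different combination rule.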
