Friday, November 20, 2015

Continuing our discussion of applying the parallelization technique to more machine learning algorithms, let us look at a few more examples. The technique enables incremental processing of data as it becomes available.
The summation form lent itself to this technique in the examples we saw because the computations of the algorithm do not change: even when the data is not all divided and computed in parallel at once, it is still partitioned and computed the same way as it becomes available, one chunk after another.
In these cases we reuse the previous computation and never have to revisit that data. We keep summarized information which we adjust as new data arrives; the summary of the first data set is simply the result of the computation over that data set.
Now we can generalize this technique so that it applies to all summation forms.
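To make this concrete, here is a minimal sketch (the function and variable names are illustrative, not from any particular library) of keeping a running summary, in this case a count and a sum, that is adjusted as each new chunk arrives, so earlier data never has to be re-read:

def update_summary(summary, chunk):
    """Fold a new chunk of numbers into the existing (count, total) summary."""
    count, total = summary
    return count + len(chunk), total + sum(chunk)

summary = (0, 0.0)                               # summary before any data has arrived
for chunk in ([1.0, 2.0], [3.0], [4.0, 5.0]):    # data arriving one chunk at a time
    summary = update_summary(summary, chunk)

count, total = summary
print(total / count)                             # running mean (3.0) without revisiting old chunks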
In the summation form, the data is already chunked into subgroups and the computation is performed on each chunk; the reducer merely aggregates the results. In other words, the chunks do not overlap, so aggregating the results is easy. In a streaming mode, the same computations can be done as each non-overlapping chunk of data arrives. The summaries from the data sets then need to be combined, and because they represent non-overlapping data they are easy to combine: each summary contributes independently to the overall picture, and the reducer merely applies smoothing to the aggregated results.
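As an illustration, here is a sketch (using numpy, with illustrative function names) of the summation form for ordinary least squares regression: each non-overlapping chunk contributes its partial sums independently, and the reducer just adds them up before solving.

import numpy as np

def map_chunk(X, y):
    """Per-chunk summary: the two partial sums needed by the normal equations."""
    return X.T @ X, X.T @ y

def reduce_summaries(summaries):
    """Combine per-chunk summaries; since chunks do not overlap, order does not matter."""
    xtx = sum(s[0] for s in summaries)
    xty = sum(s[1] for s in summaries)
    return np.linalg.solve(xtx, xty)    # regression coefficients from the combined sums

rng = np.random.default_rng(0)
chunks = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(4)]
theta = reduce_summaries([map_chunk(X, y) for X, y in chunks])

The same reducer works whether the chunks were computed in parallel or arrived one after another in a stream.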
Now let us take another example: k-means. Here the distance between the data points and a centroid, the Euclidean distance, can be calculated on every new data set as it arrives, in just the same way as for any subgroup or for the entire data set. Moreover, combining centroids is also possible, because the data points close to either centroid will be roughly equidistant from the new midpoint between the previous and the current centroid of the same class. Thus updating the previous centroid, computed from the existing data, with the new data is also straightforward. Recall that the original k-means has an iteration step that assigns the data points to the nearest cluster. With the technique above we are already prepared for iteration, and in fact we do it incrementally. While the original iterations refined the centroids by reassigning points, we improve them with newer data, giving better answers at each stage and eliminating reprocessing over the entire data set.
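Here is a sketch of that incremental centroid update (the names are illustrative, not a particular library's API): new points are assigned to their nearest centroid, and each centroid moves to the weighted mean of its old position and the newly assigned points, so earlier data is never reprocessed.

import numpy as np

def update_centroids(centroids, counts, batch):
    """Fold a new batch of points into the existing centroids and per-cluster counts."""
    for x in batch:
        j = np.argmin(np.linalg.norm(centroids - x, axis=1))   # nearest centroid by Euclidean distance
        counts[j] += 1
        centroids[j] += (x - centroids[j]) / counts[j]          # running-mean update of that centroid
    return centroids, counts

centroids = np.array([[0.0, 0.0], [5.0, 5.0]])   # centroids computed from the data seen so far
counts = np.array([10, 10])                      # how many points shaped each centroid
batch = np.array([[0.5, -0.2], [4.8, 5.1], [5.2, 4.9]])
centroids, counts = update_centroids(centroids, counts, batch)

Each new batch nudges the centroids rather than triggering a full pass over all the data, which is exactly the incremental refinement described above.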
