Tuesday, November 17, 2015

Previously we discussed using computations that don’t span the entire data all at once albeit in parallel but compute based on as and when data is made available. We took an example of computing mean this way.  
Let us take an example of the summation form to compute the mean 
For example if we take 3,1,5, 7, 9  the mean is 25/5 = 5 
If we were to add 10,15,17,19,24, the mean would now become 85  + 25 / 10 = 11 
We could arrive at this by adding 85 /5  = 17 and taking the sum (5 + 17) / 2 for two equal ranges otherwise we take the weights based on their element count = 11 
In this case we are reusing the previous computation of the data and we don't have to revisit that data again. We make summarized information already which we adjust based on new data. In the case the summary from the first data set is the result of the computation from the first data set. 
Now let we can generalize this technique to be applied to all summation forms. 
In the summation form, the data is already chunked into subgroups and then the computation is performed. The reducer merely aggregates the results. In other words, the data doesn’t overlap so it is easier to aggregate the results.  In a streaming mode for the data when it becomes available the same computations can be done as and when the non-overlapping data arrives. Therefore the summary from each data set now need to be combined and since they represent non-overlapping data, it becomes easier to combine since each summary will contribute independently to the overall picture and the reducer in this case will merely applying smoothing of the aggregated results.  
For example in a linear regression, the lines may be drawn in chunks across different data as they arrive and then the reducer merely has to sort the summaries and apply a generalized line across the data. In this case it can take the beginning and the end points of the pieces to draw the line or find one that matches most of the slopes. Thus it is possible to aggregate results on non-overlapping data 
For the overlapping data however, the summaries are not trivial to combine because the previous summary may be completely overwritten. However, we look at the datapoints that are new in the current summary and calculate their affect on the previous slope. Since the new data points may be scattered over and beyond the previous data points, it should be easy to determine whether they affect the previous slope uniformly or non-uniformly. For the uniform case, there may not need to be a correction as in a change of the gradient and may at most require a displacement of the regression line. For the non-uniform case, depending on which end has more of the newer data points assuming all points contribute equally, a clockwise or anticlockwise rotation may be performed to get the correction to the previous slope. Thus we have an algorithm where we use the previous summary to have created the current summary. 
Furthermore, if there are two data point blocks that have a partial overlap, then they are split into two non-overlapping and one overlapping regions The regression line for the non-overlapping regions is preserved. The regression lines from the two contributing blocks in the overlap can merely be replaced by a line that is equidistant from both and passing through their intersection. In other words, summaries are additive too.  
Since we already mentioned that summation forms are all applicable this way, and now with the summaries being additive too, we have expanded the kind of algorithms that can be written this way. 

No comments:

Post a Comment