Monday, November 30, 2015

#codingexercise
From a stream of numbers, get the N smallest values, assuming no duplicates.
SortedList<int, int> GetMinStream(BinaryReader reader, int N)
{
  // Keep the N smallest values seen so far; keys double as values since there are no duplicates.
  var ret = new SortedList<int, int>();
  ret.Capacity = N + 1;
  while (reader.PeekChar() != -1)
  {
     int val = reader.ReadInt32();
     int count = ret.Count;
     if (count < N)
     {
        ret.Add(val, val);
     }
     else
     {
        int max = ret.Keys[count - 1]; // largest of the N minimums kept so far
        if (val < max)
        {
           ret.Add(val, val);
           ret.RemoveAt(ret.Count - 1); // evict the largest so only N remain
        }
     }
  }
  return ret;
}

For a running median, we can maintain two heaps: a max-heap for the lower half and a min-heap for the upper half (a sketch follows).
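Here is a minimal sketch of that running median, assuming .NET 6's PriorityQueue is available (the method and parameter names are illustrative):

double AddAndGetMedian(PriorityQueue<int, int> low, PriorityQueue<int, int> high, int val)
{
    // 'low' is a max-heap via negated priorities and holds the smaller half;
    // 'high' is a min-heap and holds the larger half.
    if (low.Count == 0 || val <= low.Peek())
        low.Enqueue(val, -val);
    else
        high.Enqueue(val, val);

    // Rebalance so the two halves differ in size by at most one.
    if (low.Count > high.Count + 1)
    {
        int moved = low.Dequeue();
        high.Enqueue(moved, moved);
    }
    else if (high.Count > low.Count)
    {
        int moved = high.Dequeue();
        low.Enqueue(moved, -moved);
    }

    return low.Count == high.Count
        ? (low.Peek() + high.Peek()) / 2.0
        : low.Peek();
}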

Sunday, November 29, 2015

Today we try to improve the stochastic gradient method. Recall that it is already better than the batch gradient method in that it provides earlier solutions without evaluating the entire data. The error function could be, say, a sum of squares and the descent could be steepest descent; in either case we apply the correction iteratively, but after every sample in the case of the stochastic gradient method as opposed to after all the samples in the case of the batch gradient method. Since the update to common parameters such as the residual happens after every sample, the stochastic gradient method is not as conducive to parallelization as the batch gradient method, which is in summation form. Each update requires holding a lock on the common parameters, which in effect serializes the parallel processing. In batch or mini-batch gradient descent, we try to get the best of both worlds by reducing the number of times we need to take the lock and by distributing the data over parallel processors. However, parallelization is not restricted to exploiting summation forms by distributing data to each processor directly, or indirectly by distributing the iterations over the processors.
We instead use a technique of leap-frogging over the data by using some statistics/metadata/summary information over the data. The summaries are generated as and when the data becomes available, while at the same time we decouple the data processing from the algorithm that finds the solution. For example, we can take the data and calculate the step-lengths in the search direction needed to compute the correction, as a data-processing summary for any or all of the data as and when it becomes available. But instead of updating the residual after one or several batches, we provide the residual updates only when we want to find the solution, and this is independent of the data processing. The new residual is used in the data processing the next time around, which is what tightly couples the algorithm to the data. But if we could come up with an algorithm that works off the data summaries instead of the data itself, then we decouple the data-processing push operation from the solution-finding pull operation of the algorithm. One way to do this would be to generate different summaries by 'guessing' different residuals for the data and then have the algorithm use the summary closest to the newly computed residual. The trick is to have the guesses closely align with those that cause convergence (something we do currently by taking feedback from the previous loop and keeping the loop iterations small). Another way would be to perform the entire algorithm as-is on the local data of each processor to find the local minimum of the model and then find the global minimum over all the data. Then, if we have the luxury of iterating globally, we can update the residual with the weighted averages of the residuals found so far.
The key idea here is to evaluate the data once to find the optimal solution for that data, as with the conventional algorithm. But when more data becomes available, we combine the previous solution/summary with the current data to find the new solution/summary. This is what we call leap-frogging. In other words, while both stochastic and batch gradient descent require data to be processed again, we require data to be processed incrementally, so that data once processed doesn't need to be processed again. The equation therefore changes from applying a newer residual to the same data to one that applies corrections on the current data based on previously computed partial results. This way the processing of newer data can be done in parallel while the computation of the result is done serially and only once. One variation of SGD that comes close along these lines is momentum learning, where a factor of the previous gradient is carried into the weight update for the current iteration.
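A minimal sketch of that momentum update; the learning rate and the momentum factor beta are illustrative parameters, not values from this post:

void MomentumStep(double[] weights, double[] gradient, double[] velocity, double learningRate, double beta)
{
    for (int j = 0; j < weights.Length; j++)
    {
        // velocity accumulates a decaying sum of past gradients
        velocity[j] = beta * velocity[j] + learningRate * gradient[j];
        weights[j] -= velocity[j]; // descend along the accumulated direction
    }
}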

Saturday, November 28, 2015

The stochastic gradient method mentioned in the earlier post differs from the batch gradient method in the following ways:
1) The batch gradient method operates on all the data before each update, whereas the stochastic gradient method updates after reading each training example.
2) If the number of training samples is very large, the batch gradient method can be very slow. The stochastic gradient method, on the other hand, will be faster because it starts improving from the first training example.
3) The error function is not as well minimized in SGD as it is in batch GD. However, it does converge, though not to the exact minimum; it oscillates around the optimal value. In fact it has been found to be empirically better than GD.
4) The batch gradient method performs the following (a C# sketch of both update schedules follows this list):
Repeat until convergence {
  Theta-j = Theta-j + correction (for every j)
}
whereas the stochastic gradient method performs the following:
Loop {
  for i = 1 to m {
    Theta-j = Theta-j + correction from sample i (for every j)
  }
}
5) The batch gradient method finds the global minimum, and because it evaluates the entire data it is not susceptible to local minima should they occur.
6) The latter method is called stochastic because the approximate direction computed in every step can be thought of as a random variable of a stochastic process.
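Here is a minimal C# sketch of the two schedules for least squares, assuming the usual LMS correction alpha.(y(i) - theta-transposed.x(i)).x(i); alpha and the array layout are illustrative:

static void BatchGradientStep(double[][] x, double[] y, double[] theta, double alpha)
{
    var correction = new double[theta.Length];
    for (int i = 0; i < x.Length; i++)              // accumulate the error over ALL samples first
    {
        double error = y[i] - Dot(theta, x[i]);
        for (int j = 0; j < theta.Length; j++)
            correction[j] += alpha * error * x[i][j];
    }
    for (int j = 0; j < theta.Length; j++)
        theta[j] += correction[j];                  // one update per pass over the data
}

static void StochasticGradientPass(double[][] x, double[] y, double[] theta, double alpha)
{
    for (int i = 0; i < x.Length; i++)              // update after EVERY sample
    {
        double error = y[i] - Dot(theta, x[i]);
        for (int j = 0; j < theta.Length; j++)
            theta[j] += alpha * error * x[i][j];
    }
}

static double Dot(double[] a, double[] b)
{
    double s = 0;
    for (int k = 0; k < a.Length; k++) s += a[k] * b[k];
    return s;
}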

Friday, November 27, 2015


Thoughts on parallelizing the Steepest Descent algorithm.

The Steepest Descent method is used to solve equations of the form Ax = b. The method tries to move towards the solution by sliding down to the bottom of a paraboloid. Assume the data defines a function f which is a quadratic form. It is minimized by the solution to Ax = b. f(x) is a curved surface for positive-definite A, a paraboloid. The solution x lies at the intersection of n hyperplanes, each having dimension n-1; in our two-dimensional example there are two, because the solution has two coordinates. The bottommost point of the paraboloid (gradient = 0) is our target; it is the point where the directional derivative of f is zero. We take a series of steps until we are satisfied that we are close enough to the solution x. When we take a step, we choose the direction in which f decreases most quickly. We keep track of an 'error' e, a vector that tells us how far we are from the solution. A quantity called the residual gives us the direction of steepest descent. A 'line search' is a procedure that chooses a 'step-length' alpha to minimize f along a line; alpha minimizes f along that line when the new residual is orthogonal to the current search direction. Therefore the algorithm follows a zig-zag path toward the minimum of f.

Given the inputs A, b, a starting value x, a maximum number of iterations i-max, and an error tolerance epsilon < 1, the traditional steepest descent algorithm proceeds as follows:

i = 0
r = b - Ax
delta = r-transposed.r
delta-0 = delta
while i < i-max and delta > epsilon^2.delta-0 do:
    q = A.r
    alpha = delta / (r-transposed.q)
    x = x + alpha.r
    if i is divisible by 50:
        r = b - Ax
    else:
        r = r - alpha.q
    delta = r-transposed.r
    i = i + 1
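A sketch of the listing above in C# for a small dense system; the vector/matrix helpers are spelled out as plain loops and the whole thing is illustrative rather than tuned:

static double[] SteepestDescent(double[,] A, double[] b, double[] x, int iMax, double epsilon)
{
    int n = b.Length;
    double[] r = Subtract(b, Multiply(A, x));      // residual r = b - Ax
    double delta = Dot(r, r);
    double delta0 = delta;
    int i = 0;
    while (i < iMax && delta > epsilon * epsilon * delta0)
    {
        double[] q = Multiply(A, r);
        double alpha = delta / Dot(r, q);          // step length along the residual
        for (int k = 0; k < n; k++) x[k] += alpha * r[k];
        if (i % 50 == 0)
            r = Subtract(b, Multiply(A, x));       // periodically recompute the exact residual
        else
            for (int k = 0; k < n; k++) r[k] -= alpha * q[k];
        delta = Dot(r, r);
        i++;
    }
    return x;
}

static double Dot(double[] u, double[] v) { double s = 0; for (int k = 0; k < u.Length; k++) s += u[k] * v[k]; return s; }
static double[] Multiply(double[,] A, double[] v) { var w = new double[v.Length]; for (int p = 0; p < v.Length; p++) for (int q = 0; q < v.Length; q++) w[p] += A[p, q] * v[q]; return w; }
static double[] Subtract(double[] u, double[] v) { var w = new double[u.Length]; for (int k = 0; k < u.Length; k++) w[k] = u[k] - v[k]; return w; }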

When parallelizing the algorithm, mappers could independently calculate the updates to the target along each of the components of the vector by finding the residual, the step length and the error as above, call that the summary, and then the reducer sums these calculations. Note that the operations are performed for all data points and their ordering doesn't matter, so partitioning the data into chunks and performing each step in parallel on each chunk is trivial and a speedup. The reducer can then aggregate the partial results from the mappers and choose the next residual based on the error tolerance.
The key idea here is to express the computation in summation form. The inner product of two vectors x and y, written x-transposed.y, is the scalar sum from i = 1 to n of x_i.y_i. Note that x-transposed.A.x is also a scalar.
In the algorithm above, r-transposed.q and r-transposed.r are both summation forms. Note that r, b and x are vectors represented by n x 1 matrices, while A is an n x n matrix. The summation forms yield scalars, which give delta and alpha.
The solution x is updated iteratively by the step length alpha in the search direction, which is the residual r.
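Since those inner products are plain summations, they partition cleanly; a minimal sketch of the map/reduce style aggregation using .NET's Parallel.For (requires System.Threading.Tasks and is an assumption about the execution environment, not part of the original algorithm):

static double ParallelDot(double[] x, double[] y)
{
    double total = 0;
    object gate = new object();
    Parallel.For(0, x.Length,
        () => 0.0,                                          // per-partition partial sum (the "mapper")
        (i, state, partial) => partial + x[i] * y[i],
        partial => { lock (gate) { total += partial; } });  // aggregate once per partition (the "reducer")
    return total;
}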

The above algorithm does its computations as in mini-batched gradient descent, where the updates are available after processing k training samples, 1 < k < n; in this case k = 50.
The advantage of processing the entire data set is that we get the same optimum solution after the same number of iterations, and hence it is deterministic. This is improved to getting a better answer sooner by doing it in batches of 50.
However, we could arrive at an approximate answer from the get-go by training on one sample at a time and doing more iterations, using the previous computations in the adjustments.

Thursday, November 26, 2015

We bring out the similarities between the batch gradient descent and the conjugate gradient methods, both of which we have discussed earlier.
Essentially both try to solve a linear equation of the form Ax = b.
Both of them calculate gradients for the descent, and the gradient is a vector field that, at a given point, points in the direction of steepest change of the objective.
Steepest descent finds successive orthogonal directions by stretching the descent in each direction as much as possible.
The steepest descent algorithm proceeds this way:
i = 0
r = b - Ax
delta = r-transposed.r
delta-0 = delta
while i < i-max and delta > epsilon^2.delta-0 do:
    q = A.r
    alpha = delta / (r-transposed.q)
    x = x + alpha.r
    if i is divisible by 50:
        r = b - Ax
    else:
        r = r - alpha.q
    delta = r-transposed.r
    i = i + 1

The reason for the use of the number 50 is that after a certain number of iterations, the exact residual is recalculated to remove accumulated floating point error. Of course, the number 50 is arbitrary; for large n, square_root(n) may be more appropriate.
If the tolerance is large, the residual need not be corrected at all. If the tolerance is close to the limits of the floating point precision of the machine, a test should be added after delta is evaluated to check whether delta is less than epsilon^2.delta-0, and if this test holds true the exact residual should be recomputed and delta re-evaluated. This prevents the procedure from terminating early due to floating point roundoff error.
courtesy: Conjugate Gradient Method by Jonathan Richard Shewchuk

Tuesday, November 24, 2015

#codingexercise
double GetStandardDeviation(List<int> numbers)
{
double count = numbers.Count;
if (count <= 0) return 0;
double sum = numbers.Sum(); // requires System.Linq
double mean = sum / count;
double variance = 0;
foreach (var number in numbers)
     variance += (number - mean) * (number - mean);
double stddev = Math.Sqrt(variance / count);
return stddev;
}

Parallel and Concurrent programming of Independent Component Analysis:

A famous example of Independent Component Analysis is given by the “cocktail party problem”, where a recording of people talking simultaneously in a room is studied to separate out the underlying speech signals from the sample data. This is based on two assumptions: one, that the source signals are independent of each other, and two, that the values of each source signal have non-Gaussian distributions.

The data is represented by a random vector x = (x1, x2, …, xm)-transposed and the components by the random vector s = (s1, s2, …, sn)-transposed. The task is to transform the observed data x, using a linear static transformation W, as s = Wx, into maximally independent components s, as measured by some objective function of independence on (s1, s2, …, sn).

The data as represented by x above is considered to be a sum of the independent components sk weighted by the mixing weights ai,k for each xi. This is written as x = A.s. Therefore W = A-inverse.

In ICA, the main goal is to compute the unmixing matrix W. The linear equation s = Wx can be solved by batch gradient descent to optimize W's likelihood. Recall that likelihood is another term for the conditional probability of the training vector, from the Bayes definition mentioned earlier. The per-sample contribution to the update can then be written as

the column matrix [ 1 - 2g(wi-transposed.x(i)) ] . x(i)-transposed

where wi-transposed.x(i) is the objective applied to the sample, that is, the mixing-weighted data.

The leading term of unity is the starting value and g is the sigmoid nonlinearity; the components are updated by taking the difference between that starting value and the adjustment. This batch gradient descent is performed until the components stabilize, at which point they can be used as predictors.

The original algorithm calls for the processing of the full training set before each update of the components.

In parallelization with subgroups of data, this is computed by each mapper and then aggregated by the reducer. The expression above can be calculated on each dataset separately, but the reducer performs the updates to the components by aggregating the sum.
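A minimal sketch of that aggregation pattern; the per-sample bracketed expression is left abstract as a delegate, and the names are illustrative (requires System and System.Collections.Generic):

static double[,] SumContributions(IEnumerable<double[][]> chunks, Func<double[], double[,]> contributionOf, int n)
{
    var total = new double[n, n];
    foreach (var chunk in chunks)                  // each mapper handles one chunk
        foreach (var sample in chunk)
        {
            var c = contributionOf(sample);        // the bracketed expression above, per sample
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    total[i, j] += c[i, j];        // the reducer adds the partial sums
        }
    return total;
}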

In modified parallelization, we retain the same procedure for every new piece of data as it becomes available. But for the already processed old data, where we have the components determined so far as well as the summary (the expression above), we can use the summaries to calculate the components anew, because there is no concept of overlapping and non-overlapping data here: all the data participates in the iterative update of the components, and we just use the summary instead of recomputing the sum. This is a trivial rearrangement of the subgroups of data into old and new as it becomes available, while applying the same parallelization as with partitioned subgroups of data. An alternative could be to merge the components by weights against the summary, because each component is linearly transformed into the next update.
Note that the components are batch updated so the merging will be across all components.

Modified Principal Component Analysis:

PCA can be considered as fitting an n-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. If some axis of the ellipsoid is small, then the variance along that axis is also small, and by omitting that axis the loss can be tolerated.

To find the axes of the ellipsoid, we subtract the mean from each variable to center the data around the origin. Then we compute the covariance matrix of the data and calculate the eigenvalues and the corresponding eigenvectors of this covariance matrix. Then we orthogonalize the set of eigenvectors and normalize each to a unit vector. The resulting eigenvectors become the axes of the ellipsoid fitted to the data. The proportion of the variance that each eigenvector represents can be calculated by dividing the eigenvalue corresponding to that eigenvector by the sum of all eigenvalues. The eigenvectors are ordered such that the greatest variance by some projection of the data comes to lie on the first coordinate, the second greatest variance on the second coordinate, and so on.

When we parallelize this algorithm on chunks of data, each mapper computes its piece of the covariance matrix, whose principal eigenvectors we need, as 1/m times the summation, from i = 1 to m, of the sample vector times the sample vector transposed, minus the second term comprising the mean times the mean transposed. The latter component, consisting only of means, can also be expressed in summation form. Therefore the covariance matrix is entirely in summation form.

The modified algorithm adjusts the summation forms the same way as in the earlier examples of adjusting the mean: by taking the weighted average of the summations, where the weights are the numbers of associated data points.

In other words:

For new and now available data:

   Compute the first component and the second component of the covariance matrix.

For older data:

  Adjust the covariance matrix by taking the weighted average of the components (see the sketch below).
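One way to realize this is to keep both components of the covariance matrix in raw summation form, so that chunks combine by simple addition (equivalent to the weighted average above). A minimal sketch with illustrative names:

class CovSummary
{
    public long N;             // number of samples summarized
    public double[] Sum;       // per-dimension sum of the samples
    public double[,] SumOuter; // sum of sample times sample-transposed
}

static CovSummary Combine(CovSummary a, CovSummary b)
{
    int d = a.Sum.Length;
    var c = new CovSummary { N = a.N + b.N, Sum = new double[d], SumOuter = new double[d, d] };
    for (int i = 0; i < d; i++)
    {
        c.Sum[i] = a.Sum[i] + b.Sum[i];
        for (int j = 0; j < d; j++)
            c.SumOuter[i, j] = a.SumOuter[i, j] + b.SumOuter[i, j];
    }
    return c;
}

// Covariance recovered from a summary: (1/N).SumOuter minus mean times mean-transposed.
static double[,] Covariance(CovSummary s)
{
    int d = s.Sum.Length;
    var cov = new double[d, d];
    for (int i = 0; i < d; i++)
        for (int j = 0; j < d; j++)
            cov[i, j] = s.SumOuter[i, j] / s.N - (s.Sum[i] / s.N) * (s.Sum[j] / s.N);
    return cov;
}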

Monday, November 23, 2015

In Naïve Bayes, the classic algorithm is as follows:
Estimate the probability of a training vector given that a condition exists (aka the likelihood).
Estimate the probability of a training vector given that the condition does not exist.
We also calculate the probability that the condition exists (aka the prior) and the probability that it doesn't exist. These act as weights for the above.
This is applied to the entire data.

Now we modify the previous algorithm for incremental data as follows: 

For new data: 
Calculate as above.

For old data: 
 Sum the previously calculated target directly with the new target.

This is because the denominator remains the same and the weights remain the same. 
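Since all of those estimates are ratios of counts, the incremental step amounts to adding counts; a minimal sketch for a single feature, with illustrative field names:

class NBSummary
{
    public long Positive;            // examples where the condition exists
    public long Negative;            // examples where it does not
    public long FeatureAndPositive;  // examples with the feature and the condition
    public long FeatureAndNegative;  // examples with the feature but not the condition
}

static NBSummary CombineCounts(NBSummary oldData, NBSummary newData)
{
    return new NBSummary
    {
        Positive = oldData.Positive + newData.Positive,
        Negative = oldData.Negative + newData.Negative,
        FeatureAndPositive = oldData.FeatureAndPositive + newData.FeatureAndPositive,
        FeatureAndNegative = oldData.FeatureAndNegative + newData.FeatureAndNegative
    };
}

// Likelihood P(feature | condition) = FeatureAndPositive / Positive; prior = Positive / (Positive + Negative).
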
We now discuss discriminant analysis, which uses concepts similar to Bayes. In Bayes, we expressed the conditional probabilities. In DA we use the conditional probability and the prior for each target variable. This method applies to normally distributed data only. Consequently we calculate the mean and the standard deviation. For computing the mean incrementally as and when data becomes available, we can find the mean of the new data and take the weighted average of the previous and current means. For computing the standard deviation as and when data becomes available, we use the variance from the previous data and the variance from the new data and take their weighted mean, then take the square root of the result to find the new standard deviation. In other words we do the same as what we did for finding the mean incrementally, because the support variables and their squares are both in summation form. We argued earlier that the summation form yields itself to parallelization as well as incremental updates.
Let us find out now how to build discriminant analysis incrementally. Thus far we have said that the mean, the standard deviation and the probability can each be calculated incrementally. And DA can be expressed in a form that has two components: the first is based on squared differences from the mean as the numerator and the standard deviation as the denominator, and the second is based on the natural logarithm of the probability. Both components clearly rely on parameters that can be found incrementally. Consequently, the DA can be calculated at every stage.
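A minimal sketch of that incremental combination, returning a value tuple (newer C#); note that the exact pooled variance also carries a term for the shift between the two chunk means, included here:

static (long count, double mean, double variance) CombineStats(
    long n1, double mean1, double var1, long n2, double mean2, double var2)
{
    long n = n1 + n2;
    double mean = (n1 * mean1 + n2 * mean2) / n;
    double d1 = mean1 - mean, d2 = mean2 - mean;   // how far each chunk mean sits from the combined mean
    double variance = (n1 * (var1 + d1 * d1) + n2 * (var2 + d2 * d2)) / n;
    return (n, mean, variance);
}
// The standard deviation used by the discriminant is then Math.Sqrt(variance).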







Sunday, November 22, 2015

Today we review a book called "Strategy Rules" by Yoffie and Cusumano.
Yoffie is a Harvard Business School professor and Cusumano is an MIT professor.
They offer a new take on three giants of entrepreneurship and technology - Bill Gates, Andy Grove and Steve Jobs - by bringing them together into one how-to guide on strategy.
According to the authors, the three men differed widely in their personalities but followed the same five rules for strategy and execution.
1. Look forward, Reason back. The first rule was to look forward into the future and reason back to the actions required today.  They imagined what the world could be with their vision and have made a tremendous impact on the world. All three had the ability to determine in detail what needed to happen immediately to turn vision into reality.
2. Make Big Bets, Without betting the company. Gates, Grove and Jobs were all bold leaders, but not reckless. They knew how to time or diversify their big bets so that even huge strategic bets did not put the company itself at risk.
3. Build platforms AND ecosystems. Another important rule, the authors write, was to build platforms and ecosystems as opposed to pursuing a product strategy. This, we know from our review of "The Responsible Entrepreneur", is not just game changing but self-sustaining. As an example, Gates chose not to sell DOS outright to IBM, which requested it, but retained the right to license the system to other companies. The rest is history.
4. Exploit leverage and power - According to the authors, all three men could play judo and sumo. Judo means using the opponent's strength and turning it into a weakness. As an example, Jobs could successfully negotiate with the music companies for a license to their music. Apple was tiny, with a market share of only about 2%, yet the music companies made the agreement with Apple. This laid the foundation of the iTunes revolution. At the same time, the three did not hesitate to use their power.
5. Shape the company around your personal anchor - Personally, each had his own tastes, strengths and interests: Gates the coding genius, Grove the precise engineer, and Jobs the wizard at design. The companies they built reflected these strengths. At their peak, Microsoft, Apple and Intel were together worth about 1.5 trillion dollars. And these companies are the talk of the world.





Friday, November 20, 2015

In continuation of our discussion on applying the parallelization technique on more machine learning algorithms, let us discuss a few more. The technique enables incremental processing of more and more data as they become available.
The summation form lends itself to this technique in the examples we saw, because the computations of the algorithm do not change; while the data may not all be divided and computed in parallel, it is in fact being parallelized and computed the same way as the data becomes available, one chunk after the other.
In these cases we are reusing the previous computation of the data and we don't have to revisit that data again. We keep summarized information, which we adjust based on new data. In this scheme the summary from the first data set is the result of the computation over the first data set.
Now we can generalize this technique to apply to all summation forms.
In the summation form, the data is already chunked into subgroups and then the computation is performed. The reducer merely aggregates the results. In other words, the data doesn't overlap, so it is easier to aggregate the results. In a streaming mode, the same computations can be done as and when the non-overlapping data arrives. The summaries from each data set then need to be combined, and since they represent non-overlapping data, this is easier because each summary contributes independently to the overall picture; the reducer in this case merely applies smoothing to the aggregated results.
Now let us take another example: k-means. Here the distance between the data points and a centroid, the Euclidean distance, can be calculated on every new dataset as it arrives, in just the same way as for any subgroup or the entire dataset. Moreover, combining centroids is also possible, because data points close to either centroid will be roughly equidistant to the new midpoint between the previous and the current centroid of the same cluster. Thus the update to the previous centroid from existing data is also straightforward given the new data. Recall that the original k-means has an iteration step that assigns the data points to the nearest cluster. With the technique above we are already prepared for iteration, and in fact we do it incrementally: while the original iterations refined the centroids by reassigning, we improve them with newer data, giving better answers at each stage and eliminating reprocessing over the entire data.
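A minimal sketch of that centroid combination, weighting each centroid by the number of points it summarizes (the names are illustrative):

static double[] MergeCentroid(double[] oldCentroid, long oldCount, double[] newCentroid, long newCount)
{
    var merged = new double[oldCentroid.Length];
    for (int d = 0; d < merged.Length; d++)
        merged[d] = (oldCentroid[d] * oldCount + newCentroid[d] * newCount) / (double)(oldCount + newCount);
    return merged;
}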

Thursday, November 19, 2015

#coding interview question
1) Write a program for run-length encoding
for eg:  AAABBCCCCC
is written as 3A2B5C
String RLE(String input)
{
if (String.IsNullOrEmpty(input)) return input;
int count = 0;
char prevChar = '\0';
var output = new StringBuilder();
for (int i = 0; i < input.Length; i++)
{
if (input[i] != prevChar){
       if (prevChar != '\0') {
            output.Append(count);
            output.Append(prevChar);
        }
      count = 1;
      prevChar = input[i];
}else{
      count += 1;
} // end if
} // end for
output.Append(count);
output.Append(prevChar);
return output.ToString();
}

2) And the in-place encoding, writing back into the same StringBuilder, is as follows:
void RLE(StringBuilder input) // the StringBuilder is mutated in place, so no ref is needed
{
    if (input == null || input.Length == 0) return;

    int n = input.Length;            // logical length of the original text
    int read = 0;                    // next original character to consume
    int write = 0;                   // next position to emit encoded output
    var pending = new Queue<char>(); // originals overwritten before being read

    // Reads the original character at logical index 'read'; if the writer has already
    // passed it, the character was parked in 'pending' when it was overwritten.
    Func<char> readNext = () =>
    {
        char c = read < write ? pending.Dequeue() : input[read];
        read++;
        return c;
    };

    // Writes one encoded character at 'write', first saving any unread original it overwrites.
    // A run of length 1 encodes to two characters, which is what lets the writer get ahead.
    Action<char> emit = c =>
    {
        if (write < n && write >= read) pending.Enqueue(input[write]);
        if (write < input.Length) input[write] = c; else input.Append(c);
        write++;
    };

    while (read < n)
    {
        char current = readNext();
        int count = 1;
        while (read < n)
        {
            char peek = read < write ? pending.Peek() : input[read];
            if (peek != current) break;
            readNext();
            count++;
        }
        foreach (char digit in count.ToString()) emit(digit); // count first, e.g. "3A"
        emit(current);
    }

    input.Length = write; // trim any leftover original characters
}

Wednesday, November 18, 2015

In continuation of our discussion on applying linear regression in an iterative manner, building the model as and when the data becomes available, let us now look at the GLMNet algorithm.
As we noted earlier, the GLMNet algorithm can work when we have less data, because it gives us methods to
1) determine if we are overfitting and
2) hedge our answers if we are overfitting
With ordinary least squares regression, the objective is to minimize the sum of squared errors between the actual values and our linear approximation. Coefficient shrinkage methods add a penalty on the coefficients.
This yields a family of regression solutions, from completely insensitive to the input data to unconstrained ordinary least squares.
First we build a regularized regression. We want to eliminate over-fitting because that is what tunes the model and gives the best performance on held-out data. In order to avoid overfitting, we constrain the regression with penalties such as coefficient shrinkage. Two types of coefficient shrinkage are most frequently used: the sum of squared coefficients and the sum of absolute values of the coefficients. The GLMNet algorithm incorporates the ElasticNet penalty, which is a weighted combination of the sum of squares and the sum of absolute values.
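A minimal sketch of such a penalized objective; lambda and the blend weight 'mixing' are illustrative names, the normalization follows one common convention rather than any particular library, and it requires System.Linq:

static double ElasticNetObjective(double[] residuals, double[] coefficients, double lambda, double mixing)
{
    double squaredError = residuals.Sum(r => r * r) / (2.0 * residuals.Length);
    double l1 = coefficients.Sum(c => Math.Abs(c));      // lasso part: sum of absolute values
    double l2 = coefficients.Sum(c => c * c) / 2.0;      // ridge part: sum of squares
    return squaredError + lambda * (mixing * l1 + (1 - mixing) * l2);
}
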
We now look at how to use this in an incremental manner.
First, the existing data already has a regularized regression and a fitted line, shaped with coefficient shrinkage. Either the new data affects the existing regression, or a new regression is created for the new data and the two fitted lines are replaced by a common one.
Both methods require slightly different computation from the first iteration because we adjust the existing line/s instead of fitting a new line. Again we can divide this into non-overlapping and overlapping data. For the non-overlapping data, a new regression is the same as the first iteration. For the overlapping data, we compute the shrinkage directly against the existing line. The shrinkage here directly gives the corrections to be applied: although shrinkage was originally used to eliminate overfitting, it also gives an indication of the correction to apply to the line. The higher the number of data points and the larger the shrinkage, the more the correction applied to the line. In this way we start from a completely data-insensitive line and move to one that fits best. For convenience, we use the existing line from the previous regression. This is the slightly different computation we introduce.
If we don't want to introduce a different computation as above, we can determine a GLMNet regression for the new points exclusively and then proceed to combining the regressions from the new and existing datasets.
Combining regressions is a computation over and on top of the GLMNet computation for any dataset. While GLMNet works well for a small dataset, this step enables the algorithm to be applied repeatedly as the data becomes available more and more. It is also required when we combine regression lines, such as for non-overlapping data.

Tuesday, November 17, 2015

Previously we discussed computations that don't span the entire data all at once, albeit in parallel, but instead compute based on data as and when it is made available. We took the example of computing the mean this way.
Let us take an example of the summation form to compute the mean 
For example if we take 3,1,5, 7, 9  the mean is 25/5 = 5 
If we were to add 10, 15, 17, 19, 24, the mean would now become (85 + 25) / 10 = 11.
We could arrive at this by taking the new chunk's mean 85/5 = 17 and averaging (5 + 17)/2 = 11, since the two ranges are equal in size; otherwise we weight the two means by their element counts, which again gives 11.
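A minimal sketch of that combination, keeping each chunk as a (count, sum) summary so the earlier data is never revisited (names are illustrative):

static double CombineMean(long countA, double sumA, long countB, double sumB)
{
    return (sumA + sumB) / (countA + countB);
}
// (25 + 85) / (5 + 5) = 11, matching the example above.
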
In this case we are reusing the previous computation of the data and we don't have to revisit that data again. We keep summarized information, which we adjust based on new data. Here the summary from the first data set is the result of the computation over the first data set.
Now we can generalize this technique to apply to all summation forms.
In the summation form, the data is already chunked into subgroups and then the computation is performed. The reducer merely aggregates the results. In other words, the data doesn't overlap, so it is easier to aggregate the results. In a streaming mode, the same computations can be done as and when the non-overlapping data arrives. The summaries from each data set then need to be combined, and since they represent non-overlapping data, this is easier because each summary contributes independently to the overall picture; the reducer merely applies smoothing to the aggregated results.
For example, in a linear regression the lines may be drawn in chunks across different data as they arrive, and then the reducer merely has to sort the summaries and apply a generalized line across the data. In this case it can take the beginning and end points of the pieces to draw the line, or find one that matches most of the slopes. Thus it is possible to aggregate results on non-overlapping data.
For the overlapping data, however, the summaries are not trivial to combine because the previous summary may be completely overwritten. Instead, we look at the data points that are new in the current summary and calculate their effect on the previous slope. Since the new data points may be scattered over and beyond the previous data points, it should be easy to determine whether they affect the previous slope uniformly or non-uniformly. For the uniform case, there may be no correction to the gradient at all, at most a displacement of the regression line. For the non-uniform case, depending on which end has more of the newer data points, and assuming all points contribute equally, a clockwise or anticlockwise rotation may be applied to correct the previous slope. Thus we have an algorithm where we use the previous summary in creating the current summary.
Furthermore, if two data point blocks have a partial overlap, they are split into two non-overlapping regions and one overlapping region. The regression lines for the non-overlapping regions are preserved. The regression lines from the two contributing blocks in the overlap can be replaced by a line that is equidistant from both and passes through their intersection. In other words, summaries are additive too.
Since we already mentioned that summation forms are all amenable to this treatment, and now the summaries are additive too, we have expanded the kind of algorithms that can be written this way.
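For the simple one-variable case, the additivity of summaries can be made concrete by keeping the regression itself in summation form; a minimal sketch with illustrative names, which recovers the exact combined fit rather than the geometric approximations described above:

class RegressionSummary
{
    public long N;
    public double SumX, SumY, SumXY, SumXX;

    // Summaries from two chunks add componentwise.
    public static RegressionSummary Combine(RegressionSummary a, RegressionSummary b)
    {
        return new RegressionSummary
        {
            N = a.N + b.N,
            SumX = a.SumX + b.SumX,
            SumY = a.SumY + b.SumY,
            SumXY = a.SumXY + b.SumXY,
            SumXX = a.SumXX + b.SumXX
        };
    }

    // Least-squares slope and intercept recovered from the running sums.
    public (double slope, double intercept) Fit()
    {
        double slope = (N * SumXY - SumX * SumY) / (N * SumXX - SumX * SumX);
        double intercept = (SumY - slope * SumX) / N;
        return (slope, intercept);
    }
}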