Tuesday, December 1, 2015

Today we review the paper "Optimal Distributed Online Prediction Using Mini-Batches" by Dekel, Shamir, Xiao et al. The paper notes that online prediction methods are typically presented as serial algorithms running on a single processor, but to keep up with ever-growing data sets we need to parallelize them. In this context the authors present a distributed mini-batch algorithm - a method that converts many serial gradient-based online prediction algorithms into distributed algorithms. In addition, the method takes into account communication latencies between nodes in a distributed environment. They show a linear speedup of distributed stochastic optimization on multiple processors, and they also demonstrate the merits of their approach on a web-scale prediction problem.
We will discuss this, but first let us take a short break for the interview question below.
Given a string of letters and another, smaller set of letters, find the smallest window in the larger string that contains all the letters of the smaller set.
One way to do this is with a sliding window: expand the right end until the window covers all required letters, then shrink the left end while the window remains valid, and keep track of the minimum length seen. A sketch follows below.
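Here is a minimal sketch of that sliding-window approach in C#; the method name GetSmallestWindow and its signature are assumptions for illustration, not part of the question as posed.

using System;
using System.Collections.Generic;

// Returns the smallest substring of text that contains every character of pattern
// (with multiplicity), or the empty string if no such window exists.
static string GetSmallestWindow(string text, string pattern)
{
    if (pattern.Length == 0) return string.Empty;

    // Count how many of each character the window still needs.
    var need = new Dictionary<char, int>();
    foreach (var c in pattern)
        need[c] = need.TryGetValue(c, out var n) ? n + 1 : 1;

    int missing = pattern.Length;          // characters still missing from the current window
    int bestStart = 0, bestLen = int.MaxValue;

    for (int left = 0, right = 0; right < text.Length; right++)
    {
        // Extend the window to the right.
        if (need.TryGetValue(text[right], out var cnt))
        {
            if (cnt > 0) missing--;
            need[text[right]] = cnt - 1;   // may go negative for surplus occurrences
        }

        // Shrink from the left while the window still covers the whole pattern.
        while (missing == 0)
        {
            if (right - left + 1 < bestLen)
            {
                bestStart = left;
                bestLen = right - left + 1;
            }
            if (need.TryGetValue(text[left], out var lcnt))
            {
                need[text[left]] = lcnt + 1;
                if (lcnt + 1 > 0) missing++;   // a required character just left the window
            }
            left++;
        }
    }
    return bestLen == int.MaxValue ? string.Empty : text.Substring(bestStart, bestLen);
}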

Monday, November 30, 2015

#codingexercise
From a stream of numbers, get the N minimum values, assuming no duplicates.
SortedList<int, int> GetMinStream(BinaryReader reader, int N)
{
  // Keeps the N smallest values seen so far; the numbers serve as both keys and values,
  // so the list stays sorted ascending and the last entry is the largest of the kept minima.
  var ret = new SortedList<int, int>();
  ret.Capacity = N + 1;
  while (reader.PeekChar() != -1)
  {
    int val = reader.ReadInt32();
    int count = ret.Count;
    if (count < N)
    {
      ret.Add(val, val);                  // not full yet: just insert
    }
    else
    {
      int currentMax = ret.Keys[count - 1];
      if (val < currentMax)
      {
        ret.Add(val, val);                // insert the new, smaller value
        ret.RemoveAt(ret.Count - 1);      // evict the largest to keep only N entries
      }
    }
  }
  return ret;
}

For a running median, we can maintain two heaps: a max-heap of the lower half and a min-heap of the upper half, rebalanced so their sizes differ by at most one. The median is then the top of the larger heap, or the average of the two tops when the heaps are equal in size.
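A minimal sketch of that two-heap idea, assuming .NET's PriorityQueue<TElement, TPriority> (available in .NET 6 and later) as the heap; the class and member names are my own for illustration.

using System;
using System.Collections.Generic;

class RunningMedian
{
    // Max-heap of the lower half (priorities negated) and min-heap of the upper half.
    private readonly PriorityQueue<int, int> lower = new PriorityQueue<int, int>();
    private readonly PriorityQueue<int, int> upper = new PriorityQueue<int, int>();

    public void Add(int x)
    {
        if (lower.Count == 0 || x <= lower.Peek())
            lower.Enqueue(x, -x);    // negated priority makes this behave as a max-heap
        else
            upper.Enqueue(x, x);

        // Rebalance so the two halves differ in size by at most one.
        if (lower.Count > upper.Count + 1)
        {
            int v = lower.Dequeue();
            upper.Enqueue(v, v);
        }
        else if (upper.Count > lower.Count + 1)
        {
            int v = upper.Dequeue();
            lower.Enqueue(v, -v);
        }
    }

    public double Median()   // assumes at least one value has been added
    {
        if (lower.Count == upper.Count)
            return (lower.Peek() + upper.Peek()) / 2.0;
        return lower.Count > upper.Count ? lower.Peek() : upper.Peek();
    }
}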

Sunday, November 29, 2015

Today we try to improve the stochastic gradient method. Recall that it is already better than the batch gradient method in one respect: it produces early solutions without evaluating the entire data set. The objective could be a steepest-descent or sum-of-squares error, and in either case we apply the correction iteratively - after every sample in the stochastic gradient method, as opposed to after all the samples in the batch gradient method. Since the updates to the shared parameters (such as the residual) happen after every sample, the stochastic gradient method is not as conducive to parallelization as the batch gradient method, which is in summation form. Each update requires holding a lock on the shared parameters, which in effect serializes the parallel processing. In mini-batch gradient descent, we try to get the best of both worlds by reducing the number of times we need to take the lock while still distributing the data over parallel processors. However, parallelization is not restricted to exploiting summation forms by distributing the data to each processor directly, or indirectly by distributing the iterations over the processors.
We instead use a technique of leap-frogging over the data by using statistics/metadata/summary information computed over the data. The summaries are generated as and when the data becomes available, while at the same time we decouple the data processing from the algorithm that finds the solution. For example, we can take the data and calculate the step lengths along the search direction, computing the correction as a data-processing summary for any or all data as it becomes available. But instead of updating the residual after one or several batches, we apply the residual updates only when we want to find the solution, and this is independent of the data processing. The new residual is used in the next round of data processing, which is what tightly couples the algorithm to the data. If we could instead come up with an algorithm that works off the data summaries rather than the data itself, we would decouple the data-processing 'push' operation from the solution-finding 'pull' operation of the algorithm. One way to do this would be to generate different summaries by 'guessing' different residuals for the data and then have the algorithm use the summary closest to the newly computed residual. The trick is to have the guesses align closely with those that lead to convergence (something we currently approximate by taking feedback from the previous loop and keeping the loop iterations small). Another way would be to run the entire algorithm as-is on the local data of each processor to find a local minimum of the model, and then find the global minimum over all the data. If we then have the luxury of iterating globally, we can update the residual with a weighted average of the residuals found so far.
The key idea here is to evaluate the data once to find the optimal solution for that data, as with the conventional algorithm, but when more data becomes available, to combine the previous solution/summary with the current data to find the new solution/summary. This is what we call leap-frogging. In other words, while both stochastic and batch gradient methods require data to be processed again, we require data to be processed incrementally, so that data once processed does not need to be processed again. The equation therefore changes from applying a newer residual to the same data to one that applies corrections to the current data based on previously computed partial results. This way the processing of newer data can be done in parallel, while the computation of the result is done serially and only once. One variation of SGD that comes close to this is momentum learning, where a fraction of the previous gradient update is added to the weight update for the current iteration.
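To make that last remark concrete, here is a minimal sketch of a single-sample SGD update with momentum for a linear least-squares model; the method name and the hyperparameters eta and mu are illustrative assumptions, not values from any of the papers discussed.

// Minimal sketch of SGD with momentum for a linear least-squares model.
// velocity[j] accumulates a decaying sum of past updates; mu and eta are assumed hyperparameters.
static void SgdMomentumStep(double[] theta, double[] velocity,
                            double[] x, double y,
                            double eta = 0.01, double mu = 0.9)
{
    // Prediction error for this single sample: (theta . x) - y.
    double error = 0.0;
    for (int j = 0; j < theta.Length; j++)
        error += theta[j] * x[j];
    error -= y;

    for (int j = 0; j < theta.Length; j++)
    {
        double grad = error * x[j];                  // gradient of 1/2 * error^2 w.r.t. theta[j]
        velocity[j] = mu * velocity[j] + eta * grad; // keep a fraction of the previous update
        theta[j] -= velocity[j];                     // descend
    }
}

Calling this once per sample reproduces plain SGD when mu = 0; with mu > 0 a fraction of the previous step carries over into the current one.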

Saturday, November 28, 2015

The stochastic gradient method mentioned in the earlier post differs from the batch gradient method in the following ways:
1) The batch gradient method operates on all the data before each update, whereas the stochastic gradient method updates after reading each training sample.
2) If the number of training samples is very large, the batch gradient method can be very slow. The stochastic gradient method, on the other hand, is faster because it starts improving from the very first training sample.
3) The error function is not as well minimized in SGD as it is in batch GD. It does converge, though not to the minimum; it oscillates around the optimal value. In fact, it has often been found to work better than GD empirically.
4) The batch gradient method performs the following (both loops are sketched in code after this list):
Repeat until convergence {
  Theta-j = Theta-j + correction  (for every j)
}
whereas the stochastic gradient method performs the following:
Loop {
  for i = 1 to m {
    Theta-j = Theta-j + correction  (for every j)
  }
}
5) The batch gradient method finds the global minimum, and because it evaluates the entire data set it is not susceptible to local minima should they occur.
6) The latter method is called stochastic because the approximate direction computed at each step can be thought of as a random variable of a stochastic process.
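Here is a minimal sketch of the two update loops for a linear least-squares model in C#; the learning rate alpha, the convergence criterion (left to the caller), and the method names are assumptions for illustration.

// Batch gradient descent: one update of theta uses all m samples.
static void BatchGradientStep(double[] theta, double[][] X, double[] y, double alpha)
{
    int m = X.Length, n = theta.Length;
    var gradient = new double[n];
    for (int i = 0; i < m; i++)
    {
        double error = Dot(theta, X[i]) - y[i];
        for (int j = 0; j < n; j++)
            gradient[j] += error * X[i][j];   // summation form: accumulate over all samples
    }
    for (int j = 0; j < n; j++)
        theta[j] -= alpha * gradient[j] / m;  // single correction after the full pass
}

// Stochastic gradient descent: theta is corrected after every sample.
static void StochasticGradientPass(double[] theta, double[][] X, double[] y, double alpha)
{
    for (int i = 0; i < X.Length; i++)
    {
        double error = Dot(theta, X[i]) - y[i];
        for (int j = 0; j < theta.Length; j++)
            theta[j] -= alpha * error * X[i][j];   // correction applied immediately
    }
}

static double Dot(double[] a, double[] b)
{
    double s = 0.0;
    for (int k = 0; k < a.Length; k++) s += a[k] * b[k];
    return s;
}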

Friday, November 27, 2015


Thoughts on parallelizing the Steepest Descent algorithm.

The Steepest Descent method is used to solve equations of the form Ax = b. The method moves toward the solution by sliding down to the bottom of a paraboloid. Take f to be a quadratic form; it is minimized by the solution to Ax = b, and f(x) is a curved, bowl-shaped surface for positive-definite A. The solution is the intersection of n hyperplanes, each of dimension n-1; in our two-dimensional example there are two, one per coordinate of the solution. The bottommost point of the paraboloid (where the gradient is zero) is our target; it is the point where the directional derivative of f is zero in every direction. We take a series of steps until we are satisfied that we are close enough to the solution x. At each step we choose the direction in which f decreases most quickly. We keep track of an 'error' e - a vector that tells us how far we are from the solution. A quantity called the residual gives us the direction of steepest descent. A 'line search' is a procedure that chooses a step length alpha to minimize f along a line; the step length is optimal exactly when the new gradient is orthogonal to the search direction (the previous residual). The algorithm therefore follows a zig-zag path to the minimum of f.

Given the inputs A, b, a starting value x, a maximum number of iterations i-max, and an error tolerance epsilon < 1, the traditional steepest descent algorithm proceeds as follows:

set i to 0
set residual r to b - A.x
set delta to r-transposed.r
initialize delta-0 to delta
while i < i-max and delta > epsilon^2 . delta-0 do:
    q = A.r
    alpha = delta / (r-transposed.q)
    x = x + alpha.r
    if i is divisible by 50
        r = b - A.x
    else
        r = r - alpha.q
    delta = r-transposed.r
    i = i + 1
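A minimal sketch of this loop in C# (the language of the coding exercises on this blog), assuming A is a dense, symmetric positive-definite matrix stored as a jagged array; the helper names are my own.

// Steepest descent for Ax = b with a dense, symmetric positive-definite A.
// Recomputes the exact residual every 50 iterations to limit floating-point drift,
// mirroring the pseudocode above.
static double[] SteepestDescent(double[][] A, double[] b, double[] x,
                                int maxIterations, double epsilon)
{
    int n = b.Length;
    var r = new double[n];
    for (int i = 0; i < n; i++)            // r = b - A.x
        r[i] = b[i] - Dot(A[i], x);

    double delta = Dot(r, r);
    double delta0 = delta;

    for (int iter = 0; iter < maxIterations && delta > epsilon * epsilon * delta0; iter++)
    {
        var q = new double[n];             // q = A.r
        for (int i = 0; i < n; i++)
            q[i] = Dot(A[i], r);

        double alpha = delta / Dot(r, q);  // step length along the residual direction
        for (int i = 0; i < n; i++)
            x[i] += alpha * r[i];

        if (iter % 50 == 0)
        {
            for (int i = 0; i < n; i++)    // exact residual, removes accumulated error
                r[i] = b[i] - Dot(A[i], x);
        }
        else
        {
            for (int i = 0; i < n; i++)    // cheap recurrence for the residual
                r[i] -= alpha * q[i];
        }
        delta = Dot(r, r);
    }
    return x;
}

static double Dot(double[] a, double[] b)
{
    double s = 0.0;
    for (int k = 0; k < a.Length; k++) s += a[k] * b[k];
    return s;
}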

When parallelizing the algorithm, mappers could independently calculate the contributions to the update along each component of the vectors by computing the residual, the step length and the error as above - call these the summary - and the reducer could then sum these calculations. Note that the operations are performed over all data points and their ordering does not matter, so partitioning the data into chunks and performing each step in parallel on each chunk is straightforward and yields a speedup. The reducer then aggregates the partial results from the mappers and chooses the next residual based on the error tolerance.
The key idea here is to express the computation in summation form. The inner product of two vectors x and y, written x-transposed.y, is the scalar sum over i = 1 to n of xi.yi. Note that x-transposed.A.x is also a scalar.
In the algorithm above, r-transposed.q and r-transposed.r are both summation forms. Note that r, b and x are vectors represented as n x 1 matrices, while A is an n x n matrix. The summation forms yield scalars, which are used to compute delta and alpha, both scalars. A sketch of computing such an inner product as partial sums follows below.
The solution x is updated iteratively by the step length alpha along the search direction, which in steepest descent is the residual r.
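Here is one way the summation form might be split into per-chunk partial sums and aggregated, using .NET's Parallel.For purely as an illustration; the chunking scheme and the name ParallelDot are assumptions, not part of the original algorithm.

using System;
using System.Threading.Tasks;

// Computes r-transposed.q as a sum of per-chunk partial sums ("map"),
// then aggregates the partials ("reduce"). Ordering does not matter,
// so the chunks can be processed independently.
static double ParallelDot(double[] r, double[] q, int chunks)
{
    double total = 0.0;
    var gate = new object();
    int n = r.Length;
    int chunkSize = (n + chunks - 1) / chunks;

    Parallel.For(0, chunks,
        () => 0.0,                                  // per-worker partial sum ("mapper" state)
        (c, loopState, partial) =>
        {
            int start = c * chunkSize;
            int end = Math.Min(start + chunkSize, n);
            for (int i = start; i < end; i++)
                partial += r[i] * q[i];             // local summation over this chunk
            return partial;
        },
        partial => { lock (gate) { total += partial; } });   // "reducer": aggregate partials

    return total;
}

In the loop above it would be used as, for example, alpha = delta / ParallelDot(r, q, Environment.ProcessorCount).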

The above algorithm resembles mini-batched gradient descent, where the updates become available after processing k training samples, with 1 < k < n; in this case k = 50.
The advantage of processing the entire data set is that we get the same optimal solution after the same number of iterations, so the method is deterministic. This is improved upon, in the sense of getting useful answers sooner, by working in batches of 50.
However, we could arrive at an approximate answer from the start by training on one sample at a time instead of many and doing more iterations, reusing previous computations in the adjustments.

Thursday, November 26, 2015

We bring out the similarities between the batch gradient descent and conjugate gradient methods, both of which we have discussed earlier.
Essentially both try to solve a linear equation of the form Ax = b.
Both of them calculate gradients for the descent, and the gradient is a vector field that, at a given point, points in the direction of steepest change of the objective.
Steepest descent produces orthogonal successive directions by stretching each descent step in its direction as far as possible.
The steepest descent algorithm proceeds this way:
set i to 0
set residual r to b - A.x
set delta to r-transposed.r
initialize delta-0 to delta
while i < i-max and delta > epsilon^2 . delta-0 do:
    q = A.r
    alpha = delta / (r-transposed.q)
    x = x + alpha.r
    if i is divisible by 50
        r = b - A.x
    else
        r = r - alpha.q
    delta = r-transposed.r
    i = i + 1

The reason for using the number 50 is that after a certain number of iterations the exact residual is recalculated to remove accumulated floating-point error. Of course, the number 50 is arbitrary; for large n, square_root(n) may be more appropriate.
If the tolerance is large, the residual need not be corrected at all. If the tolerance is close to the limits of the machine's floating-point precision, a test should be added after delta is evaluated to check whether delta is less than epsilon^2.delta-0, and if this test holds, the exact residual should be recomputed and delta re-evaluated. This prevents the procedure from terminating early due to floating-point round-off error.
courtesy: Conjugate Gradient Method by Jonathan Richard Shewchuk

Tuesday, November 24, 2015

#codingexercise
double GetStandardDeviation(List<int> numbers)
{
  // Population standard deviation: square root of the average squared deviation from the mean.
  double count = numbers.Count;
  if (count <= 0) return 0;
  double sum = numbers.Sum();   // requires System.Linq
  double mean = sum / count;
  double variance = 0;
  foreach (var number in numbers)
      variance += (number - mean) * (number - mean);
  double stddev = Math.Sqrt(variance / count);
  return stddev;
}

Parallel and Concurrent programming of Independent Component Analysis:

A famous example of Independent Component Analysis is the "cocktail party problem", where a recording of people talking simultaneously in a room is analyzed to separate out the underlying speech signals from the sampled data. This rests on two assumptions: first, that the source signals are independent of each other, and second, that the values of each source signal have non-Gaussian distributions.

The data is represented by a random vector x = (x1, x2, ..., xm)-transposed and the components by the random vector s = (s1, s2, ..., sn)-transposed. The task is to transform the observed data x, using a linear static transformation W, as s = Wx, into maximally independent components s, measured by some objective function of independence on (s1, s2, ..., sn).

The data x is considered to be a sum of the independent components sk weighted by the mixing weights ai,k for each xi. This is written as x = A.s, and therefore W = A-inverse.

In ICA, the main goal is to compute the unmixing matrix W. The linear relation s = Wx can be fitted by batch gradient descent to optimize the likelihood of W. Recall that likelihood is another term for the conditional probability of the training data, from the Bayes definition mentioned earlier. The update to the components can then be expressed as

the column matrix [ 1 - 2g(wi-transposed . xi) ] times xi-transposed

Here wi-transposed . xi is the i-th recovered component for the sample xi, that is, the data weighted by the current row of the unmixing matrix.

The leading term of unity is the starting point and g is the nonlinearity, typically the sigmoid function; the components are updated by taking the difference between the starting point and this adjustment. This batch gradient descent is repeated until the components stabilize, at which point they can be used as predictors.

The original algorithm calls for processing the full training set before each update of the components.

When parallelizing over subgroups of the data, this expression is computed by each mapper and then aggregated by the reducer: the expression above is calculated on each data partition separately, and the reducer applies the update to the components by aggregating the sums.
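A minimal sketch of that mapper-side summary, assuming g is the sigmoid and representing matrices as plain arrays; the method names and the partitioning are illustrative, not from the post.

using System;

// Mapper-side summary for one data partition: accumulates the sum over samples of
// (1 - 2*g(W x)) * x-transposed, an n x m matrix. A reducer would add the summaries
// from all partitions element-wise before applying the batch update to W.
static double[,] PartitionSummary(double[,] W, double[][] samples)
{
    int n = W.GetLength(0);   // number of components
    int m = W.GetLength(1);   // dimension of each observation x
    var summary = new double[n, m];

    foreach (var x in samples)
    {
        for (int i = 0; i < n; i++)
        {
            // w_i-transposed . x : the i-th recovered component for this sample
            double wx = 0.0;
            for (int j = 0; j < m; j++)
                wx += W[i, j] * x[j];

            double coeff = 1.0 - 2.0 * Sigmoid(wx);
            for (int j = 0; j < m; j++)
                summary[i, j] += coeff * x[j];   // outer-product contribution
        }
    }
    return summary;
}

static double Sigmoid(double z) => 1.0 / (1.0 + Math.Exp(-z));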

In the modified parallelization, we retain the same procedure for every new piece of data as it becomes available. But for already-processed old data, where we have both the components determined so far and the summary (the expression above), we can use the summaries to recompute the components instead of recomputing the sum. There is no notion of overlapping versus non-overlapping data here, since all the data participates in the iterative update of the components; we simply use the summary in place of recomputing the sum over the old data. This is a straightforward rearrangement of the subgroups of data into old and new as data becomes available, applying the same parallelization as with partitioned subgroups. An alternative would be to merge the components by weighting them against the summaries, since each component is linearly transformed into the next update.
Note that the components are batch-updated, so the merging will be across all components.