Thoughts on parallelizing the Steepest Descent algorithm.
The Steepest Descent method is used to solve equations of the form Ax = b. The method moves toward the solution by sliding down to the bottom of a paraboloid. Assume the data define a function f that is a quadratic form, f(x) = (1/2) x-transposed.A.x - b-transposed.x + c. It is minimized by the solution to Ax = b, and f(x) is a curved, bowl-shaped surface when A is positive-definite. The solution is the point where n hyperplanes, each of dimension n-1, intersect; in our case there are two, because we have two coordinate axes for the solution. The bottommost point of the paraboloid (where the gradient is zero) is our target; it is the point where every directional derivative of f is zero. We take a series of steps until we are satisfied that we are close enough to the solution x. When we take a step, we choose the direction in which f decreases most quickly. We keep track of an 'error' e, a vector that tells us how far we are from the solution, and a measure called the residual, r = b - Ax, which gives us the direction of steepest descent. A 'line search' is a procedure that chooses a step length alpha to minimize f along a line; f is minimized along the search line exactly when the new gradient (the next residual) is orthogonal to the previous residual. Because each new direction is orthogonal to the one before it, the algorithm follows a zigzag path toward the minimum of f.
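To make the quadratic form concrete, here is a minimal sketch in Python with numpy; the matrix A, vector b, and constant c are hypothetical example values chosen for illustration, not anything fixed by the text above.

import numpy as np

A = np.array([[3.0, 2.0], [2.0, 6.0]])   # a symmetric positive-definite example
b = np.array([2.0, -8.0])
c = 0.0

def f(x):
    # the quadratic form: a bowl-shaped paraboloid when A is positive-definite
    return 0.5 * x @ A @ x - b @ x + c

def gradient(x):
    # f'(x) = Ax - b, which is zero exactly at the solution of Ax = b
    return A @ x - b

x_star = np.linalg.solve(A, b)
print(x_star, gradient(x_star))          # [ 2. -2.] and ~[0. 0.]

The gradient vanishing at x_star is the "bottommost point" described above: the residual b - Ax is just the negative gradient, so it points straight downhill.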
Given the inputs A, b, a starting value x, a maximum number of iterations i-max, and an error tolerance epsilon < 1,
the traditional algorithm is described as follows:
set i to 0
set r to b - Ax
set delta to r-transposed.r
set delta-0 to delta
while i < i-max and delta > epsilon^2 . delta-0 do:
    q = A.r
    alpha = delta / (r-transposed.q)
    x = x + alpha.r
    if i is divisible by 50:
        r = b - Ax
    else:
        r = r - alpha.q
    delta = r-transposed.r
    i = i + 1
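As a sanity check, here is a direct transcription of that pseudocode into Python with numpy; a sketch assuming A is symmetric positive-definite, reusing the same hypothetical 2 x 2 example.

import numpy as np

def steepest_descent(A, b, x, i_max=1000, epsilon=1e-8):
    i = 0
    r = b - A @ x                    # residual: the direction of steepest descent
    delta = r @ r
    delta0 = delta
    while i < i_max and delta > epsilon**2 * delta0:
        q = A @ r
        alpha = delta / (r @ q)      # exact line search along r
        x = x + alpha * r
        if i % 50 == 0:
            r = b - A @ x            # refresh the residual from scratch every 50 steps
        else:
            r = r - alpha * q        # cheap recurrence for the residual
        delta = r @ r
        i += 1
    return x

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
print(steepest_descent(A, b, np.zeros(2)))   # ~[ 2. -2.]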
When parallelizing the algorithm, mappers could independently calculate the updates to the target along each component of the vector by finding the residual, the step length, and the error as above, emitting these as a summary; the reducer then sums the partial calculations. Note that the operations are performed over all data points and their ordering does not matter, so partitioning the data into chunks and performing each step in parallel on those chunks is straightforward and yields a speedup. The reducer can then aggregate the partial results from each of the mappers and choose the next residual based on the error tolerance.
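Here is a minimal map-reduce flavored sketch of one such parallel step, assuming numpy and a thread pool standing in for real mappers and a reducer; the names mapper and parallel_step are hypothetical, and each chunk is a slice of the rows of A.

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def mapper(args):
    # each mapper owns a slice of rows of A and emits a 'summary':
    # its slice of q = A.r and its partial contribution to r-transposed.q
    rows, A, r = args
    q_part = A[rows] @ r
    return rows, q_part, r[rows] @ q_part

def parallel_step(A, b, x, r, chunks, pool):
    q = np.empty_like(r)
    r_dot_q = 0.0
    # reducer: aggregate the partial results from each mapper
    for rows, q_part, part in pool.map(mapper, [(rows, A, r) for rows in chunks]):
        q[rows] = q_part
        r_dot_q += part
    alpha = (r @ r) / r_dot_q
    x = x + alpha * r
    r = r - alpha * q                 # next residual, to be checked against tolerance
    return x, r

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
x = np.zeros(2)
r = b - A @ x
chunks = [slice(0, 1), slice(1, 2)]   # one row per mapper in this toy example
with ThreadPoolExecutor() as pool:
    for _ in range(200):
        x, r = parallel_step(A, b, x, r, chunks, pool)
print(x)                              # ~[ 2. -2.]

In a real deployment the chunks would live on separate machines; the point of the sketch is only that each mapper's output is a self-contained summary and the reduce step is a plain sum.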
The key idea here is to express the computation in summation form. The inner product of two vectors x and y, written x-transposed.y, represents the scalar sum from i = 1 to n of xi.yi; note that x-transposed.A.x is also a scalar. In the algorithm above, r-transposed.q and r-transposed.r are both summation forms. Note that r, b, and x are vectors represented as n x 1 matrices, while A is an n x n matrix. The summation forms yield scalars, which are used to compute delta and alpha, both of which are scalars. The solution x is updated iteratively by the step length alpha along the search direction r.
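A tiny illustration that an inner product splits cleanly across partitions, which is exactly why the reduce step above is a plain sum; the vectors here are arbitrary examples.

import numpy as np

r = np.array([1.0, 2.0, 3.0, 4.0])
q = np.array([5.0, 6.0, 7.0, 8.0])
whole = r @ q                                        # sum over i of r_i * q_i
parts = sum(r[s] @ q[s] for s in (slice(0, 2), slice(2, 4)))
print(whole, parts)                                  # 70.0 70.0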
The above algorithm does its computations in the manner of mini-batched gradient descent, where the updates become available after processing k training samples, with 1 < k < n; in this case k = 50. The advantage of processing the entire data set is that we get the same optimum solution after the same number of iterations, so the method is deterministic. Working in batches of 50 improves on this by producing a better answer sooner. However, we could arrive at an approximate answer from the outset by training on one sample at a time and doing more iterations, reusing the previous computations in the adjustments.
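For contrast, here is a hedged sketch of the mini-batch idea applied to least-squares regression; the data X and y, the batch size k, and the learning rate lr are all hypothetical parameters introduced for illustration, not part of the algorithm above.

import numpy as np

def minibatch_gd(X, y, k=50, lr=0.01, epochs=100):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for start in range(0, len(y), k):
            Xb, yb = X[start:start + k], y[start:start + k]
            # an update becomes available after each batch of k samples
            theta += lr * Xb.T @ (yb - Xb @ theta) / len(yb)
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5])
print(minibatch_gd(X, y))             # ~[ 1. -2.  0.5]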
The stochastic gradient method mentioned in the earlier post differs from batch gradient descent in the following ways:
1) The batch gradient method operates on all the data before each update, whereas the stochastic gradient method updates after reading one training sample.
2) If the number of training samples is very large, the batch gradient method can be very slow. On the other hand, the stochastic gradient method will be faster because it starts improving from the first training sample.
3) The error function is not as well minimized in SGD as it is in batch GD. However, SGD does converge, though not to the exact minimum, and oscillates around the optimal value. In fact, it has often been found empirically to outperform batch GD.
4) While the batch gradient method performs the following:
repeat until convergence {
    theta-j = theta-j + correction (for every j)
}
the stochastic gradient method performs the following (a runnable sketch contrasting the two appears after this list):
loop {
    for i = 1 to m {
        theta-j = theta-j + correction (for every j)
    }
}
5) The batch gradient method finds the global minimum, and because it evaluates the entire data set it is not susceptible to local minima should they occur.
6) The latter method is called stochastic because the approximate direction computed at each step can be thought of as a random variable of a stochastic process.
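Here is the sketch promised in point 4: both update schedules applied to least squares in Python, with a hypothetical learning rate lr standing in for the unspecified scale of the 'correction' term.

import numpy as np

def batch_gd(X, y, lr=0.01, iters=1000):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        # one update uses all m samples
        theta += lr * X.T @ (y - X @ theta) / len(y)
    return theta

def sgd(X, y, lr=0.01, epochs=20):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in range(len(y)):
            # one update per training sample
            theta += lr * (y[i] - X[i] @ theta) * X[i]
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5])
print(batch_gd(X, y))    # deterministic path to ~[ 1. -2.  0.5]
print(sgd(X, y))         # noisier path, close after far fewer passes over the data

The contrast matches points 1 and 2 above: batch_gd touches all m samples before moving theta once, while sgd moves theta after every sample it reads.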