Today we try to improve the stochastic gradient method. Recall that it already improves on the batch gradient method because it produces early solutions without evaluating the entire data set. The error function could be, for example, a sum of squared errors, and the correction could follow the steepest-descent direction; in either case we apply the correction iteratively, but after every sample in the stochastic gradient method as opposed to after all the samples in the batch gradient method. Since the updates to shared parameters such as the residual happen after every sample, the stochastic gradient method is not conducive to parallelization, unlike the batch gradient method, whose summation form is. Each update requires holding a lock on the shared parameters, which in effect serializes the parallel processing. With mini-batch gradient descent we try to get the best of both worlds by reducing the number of times we need to take the lock while still distributing the data over parallel processors. However, parallelization is not restricted to exploiting summation forms, whether directly by distributing the data to each processor or indirectly by distributing the iterations over the processors.
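To make the contrast concrete, here is a minimal sketch in Python/NumPy, assuming a least-squares model on synthetic data; the names X, y, w, lr and the epoch/batch sizes are illustrative choices, not taken from the post. Batch gradient descent makes one shared-parameter update per pass, stochastic gradient descent one per sample, and mini-batch sits in between.

```python
import numpy as np

# Illustrative least-squares setup: y ~ X @ w with a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
lr = 0.01

def batch_gd(X, y, epochs=50):
    # One update per pass over *all* samples: the gradient is a sum,
    # so the per-sample terms can be computed in parallel and reduced.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def stochastic_gd(X, y, epochs=50):
    # One update per sample: each step reads and writes the shared
    # parameters, which is what makes naive parallelization hard.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            grad = (X[i] @ w - y[i]) * X[i]
            w -= lr * grad
    return w

def minibatch_gd(X, y, epochs=50, batch=20):
    # Middle ground: fewer shared-parameter updates than SGD, while each
    # batch gradient is still a sum that can be distributed.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for start in range(0, len(y), batch):
            Xb, yb = X[start:start+batch], y[start:start+batch]
            grad = Xb.T @ (Xb @ w - yb) / len(yb)
            w -= lr * grad
    return w
```

In this framing, the "lock" corresponds to the point where w is written: once per pass, once per sample, or once per batch.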
We instead use a technique of leap-frogging over the data by using statistics/metadata/summary information computed over it. The summaries are generated as and when the data becomes available, and at the same time we decouple the data processing from the algorithm that finds the solution. For example, we can take the data and calculate the step lengths in the search direction needed to compute the correction, as a data-processing summary for any or all of the data as and when it becomes available. But instead of updating the residual after one or several batches, we apply the residual updates only when we want to find the solution, and this is independent of the data processing. The new residual is then used in the next round of data processing, which is what tightly couples the algorithm to the data. If we could instead come up with an algorithm that works off the data summaries rather than the data itself, we would decouple the data-processing push operation from the solution-finding pull operation of the algorithm. One way to do this is to generate several summaries by 'guessing' different residuals for the data, and then have the algorithm use the summary whose guess is closest to the newly computed residual. The trick is to have the guesses align closely with those that lead to convergence (something we currently approximate by taking feedback from the previous loop and keeping the loop iterations small). Another way is to run the entire algorithm as-is on the local data of each processor to find the local minima of the model, and then find the global minimum over all the data. If we then have the luxury of iterating globally, we can update the residual with a weighted average of the residuals found so far.
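The following is only a rough sketch of the push/pull decoupling described above, again assuming a least-squares model; the helper names push_summaries and pull_update, the grid of guessed parameters, and the chunk sizes are hypothetical choices made for illustration, not the post's implementation.

```python
import numpy as np

def push_summaries(X_chunk, y_chunk, guesses):
    """Data side ("push"): as a chunk arrives, store its gradient
    contribution for each guessed parameter vector, with no lock on
    the shared solution."""
    return [(g, X_chunk.T @ (X_chunk @ g - y_chunk), len(y_chunk))
            for g in guesses]

def pull_update(w, all_summaries, lr=0.05):
    """Solver side ("pull"): for each chunk, use the summary whose guess
    is closest to the current parameters, and apply one combined correction."""
    grad, n = np.zeros_like(w), 0
    for chunk in all_summaries:
        _guess, contrib, count = min(chunk,
                                     key=lambda s: np.linalg.norm(s[0] - w))
        grad += contrib
        n += count
    return w - lr * grad / n

# Illustrative usage on synthetic data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.05 * rng.normal(size=200)
guesses = [np.zeros(2), np.array([1.0, 0.0]), np.array([2.0, -1.0])]

# Push: per-chunk, parallelizable, independent of the solver.
summaries = [push_summaries(X[i:i+50], y[i:i+50], guesses)
             for i in range(0, 200, 50)]

# Pull: runs whenever a solution is wanted, using only the summaries.
w = np.zeros(2)
for _ in range(20):
    w = pull_update(w, summaries)
```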
The key idea here is to evaluate the data once to find the optimal solution for that data, as with the conventional algorithm. But when more data becomes available, we combine the previous solution/summary with the current data to find the new solution/summary. This is what we call leap-frogging. In other words, while both the stochastic and batch gradient methods require the data to be processed again, we require the data to be processed incrementally, so that data once processed never needs to be processed again. The equation therefore changes from applying a newer residual to the same data to applying corrections on the current data based on previously computed partial results. This way the processing of newer data can be done in parallel, while the computation of the result is done serially and only once. One variation of SGD that comes close along these lines is momentum learning, where a fraction of the previous update is carried into the weight update for the current iteration.
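As a sketch of what "combine the previous summary with the current data" could look like, least squares happens to have exact running summaries (X^T X and X^T y) that let new chunks be absorbed without revisiting old data. The class below and the small momentum update are illustrative; all names are assumed rather than taken from the post.

```python
import numpy as np

class IncrementalLeastSquares:
    """Leap-frog sketch: keep running summaries so previously seen data
    never has to be reprocessed; only the new chunk is touched."""
    def __init__(self, n_features):
        self.XtX = np.zeros((n_features, n_features))
        self.Xty = np.zeros(n_features)

    def absorb(self, X_chunk, y_chunk):
        # "Push": fold the new chunk into the summary; old chunks untouched.
        self.XtX += X_chunk.T @ X_chunk
        self.Xty += X_chunk.T @ y_chunk

    def solve(self):
        # "Pull": compute the current solution from the summary alone.
        return np.linalg.solve(self.XtX, self.Xty)

def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    # Momentum SGD, the related variation mentioned above: a fraction
    # (beta) of the previous update v is carried into the current one.
    v = beta * v - lr * grad
    return w + v, v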