Today we cover the distributed mini-batch algorithm. Previously we discussed the serial algorithm; we now evaluate the update rule phi in a distributed environment. The technique resembles the serial mini-batch algorithm and retains most of its steps, except that it runs in parallel on each of the k nodes in the network. The overall workflow is as follows. We specify a batch size b and assume that k divides both b and mu, where mu is the number of additional inputs that arrive at the nodes while the communication completes.
This means that each batch contains b + mu consecutive inputs. During each batch j, all of the nodes use a common predictor wj. During the first b inputs, the nodes calculate and accumulate the stochastic gradients of the loss function f at wj. Once the nodes have accumulated b gradients altogether, they start a distributed vector-sum operation to compute the sum of these b gradients. While the vector sum completes in the background, mu additional inputs arrive and the system keeps processing them with the same predictor wj. These inputs are still served, but their gradients are discarded; this waste can be made negligible by choosing an appropriate value for b. As a rule of thumb, b can be set around the square root of the total number of inputs m, because that is roughly the minimum we need to process to get a feel for the data. Note that the mini-batch cannot be as large as half the full batch, because such a large batch does not offer any significant improvement per iteration; it has to be smaller, yet it cannot shrink to a single stochastic sample either, since the method is a hybrid between stochastic and full-batch gradient descent.
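To see why the discarded gradients become negligible for a long input stream, here is a rough numerical sketch; the rule b ≈ sqrt(m) and the sample values of m and mu are assumptions chosen only for illustration.

```python
import math

# Per batch, the fraction of inputs whose gradients are discarded is mu / (b + mu).
# With b chosen around sqrt(m), this fraction shrinks as the input stream m grows.
def wasted_fraction(m, mu):
    b = int(math.sqrt(m))   # assumed rule of thumb: b ~ sqrt(m)
    return mu / (b + mu)

for m in (10_000, 1_000_000, 100_000_000):
    print(m, round(wasted_fraction(m, mu=16), 4))
```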
When the vector-sum operation completes, each node holds the sum of the b gradients collected during batch j. Each node divides this sum by b to obtain the average gradient and feeds it into the update rule phi, which results in a synchronized predictor for the next batch. Therefore, during batch j each node processes (b + mu)/k inputs with the current predictor, but only the first b/k gradients per node are used to compute the next predictor. All b + mu inputs count toward the regret measure.
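To make the per-batch mechanics concrete, here is a minimal single-process simulation of one batch. The squared loss, the plain gradient-descent form of the update rule phi, and all numeric parameters are assumptions chosen for illustration; a real deployment would replace the explicit sum over nodes with a distributed vector-sum (all-reduce) and overlap it with the mu extra inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, b, mu = 5, 4, 20, 8      # dimension, nodes, batch size, extra inputs (assumed values)
eta = 0.1                      # step size for the assumed gradient-descent update phi
w = np.zeros(d)                # common predictor w_j shared by all nodes

def grad(w, x, y):
    # Gradient of the assumed example loss f(w; x, y) = 0.5 * (w.x - y)^2.
    return (w @ x - y) * x

def stream(n):
    # Synthetic stand-in for the inputs arriving at the nodes.
    for _ in range(n):
        x = rng.normal(size=d)
        yield x, x.sum() + rng.normal(0, 0.1)

# --- one batch j ---
node_sums = np.zeros((k, d))
for node in range(k):
    # Each node accumulates b/k stochastic gradients at the shared predictor w_j.
    for x, y in stream(b // k):
        node_sums[node] += grad(w, x, y)

# Distributed vector-sum (all-reduce), simulated here by a plain sum over the nodes.
total = node_sums.sum(axis=0)

# While the vector sum runs, mu more inputs are served with w_j; their gradients are
# discarded, but the predictions still count toward the regret.
for x, y in stream(mu):
    _prediction = w @ x

# Every node applies the same rule phi to the average gradient, giving a synchronized w_{j+1}.
w = w - eta * (total / b)
print(w)
```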
This is not a no-communication parallelization. In fact, we use communication precisely to minimize the degradation that the no-communication approach suffers.
#problemsolving
Given a 2D array of 1s and 0s, find the largest rectangle (not necessarily a square) that is made up entirely of 1s or entirely of 0s.
The straightforward solution is to find, for each element, the size of the largest uniform rectangle that the element is part of, and to return the maximum such size encountered. To detect a rectangle we could walk along its boundary, if one exists. An optimization is to keep track of all (top-left, bottom-right) pairs of rectangles already encountered and to skip an element when it is contained in one of them. These rectangles can be kept sorted in the order of their top-left corners, so we only check the rectangles whose top-left corner comes earlier than the current element.
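Here is a minimal brute-force sketch of this idea, assuming the input is a Python list of lists of 0/1 values. It swaps the boundary walk for a 2D prefix sum so that any candidate rectangle can be tested for uniformity in O(1) (its sum is 0 if it is all 0s and equal to its area if it is all 1s); the enumeration over corner pairs is still expensive, and the containment-skipping optimization described above is left out for clarity.

```python
def largest_uniform_rectangle(grid):
    # Return the area of the largest rectangle made up entirely of 1s or entirely of 0s.
    rows, cols = len(grid), len(grid[0])

    # prefix[i][j] = sum of grid[0..i-1][0..j-1]; gives any sub-rectangle sum in O(1).
    prefix = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            prefix[i + 1][j + 1] = (grid[i][j] + prefix[i][j + 1]
                                    + prefix[i + 1][j] - prefix[i][j])

    def rect_sum(r1, c1, r2, c2):
        # Sum of grid[r1..r2][c1..c2], corners inclusive.
        return (prefix[r2 + 1][c2 + 1] - prefix[r1][c2 + 1]
                - prefix[r2 + 1][c1] + prefix[r1][c1])

    best = 0
    # Enumerate every (top-left, bottom-right) pair and test it for uniformity.
    for r1 in range(rows):
        for c1 in range(cols):
            for r2 in range(r1, rows):
                for c2 in range(c1, cols):
                    area = (r2 - r1 + 1) * (c2 - c1 + 1)
                    if area <= best:
                        continue
                    s = rect_sum(r1, c1, r2, c2)
                    if s == 0 or s == area:   # all 0s or all 1s
                        best = area
    return best

print(largest_uniform_rectangle([[1, 1, 0],
                                 [1, 1, 0],
                                 [0, 0, 0]]))   # -> 4 (the 2x2 block of 1s)
```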