Today we continue to review the paper by Dekel, Gilad-Bachrach, Shamir, and Xiao, which describes optimal distributed online prediction using mini-batches. In this section we review the serial mini-batch algorithm for online prediction.
We saw in the analysis of the parallel DMB algorithm in the earlier post that the expected regret bound depends on the variance of the stochastic gradients. This dependence on the variance suggests that we can average the stochastic gradients over a mini-batch to reduce that variance.
The serial mini-batch algorithm does exactly this. Let us take a look.
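To see why averaging helps, here is the standard variance argument in one line. It is a minimal sketch, assuming the b stochastic gradients collected within a batch are i.i.d. (the prediction is held fixed for the whole batch) with variance bounded by sigma^2:
\[ \operatorname{Var}\!\left[\frac{1}{b}\sum_{s=1}^{b}\hat{g}_s\right] \;=\; \frac{1}{b^2}\sum_{s=1}^{b}\operatorname{Var}\!\left[\hat{g}_s\right] \;\le\; \frac{\sigma^2}{b} \]
So a larger batch size b gives a lower-variance gradient estimate, at the cost of updating the prediction less often.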
In the previous section, the update rule was applied after each input was received. In the serial mini-batch algorithm, however, it is applied only periodically, hence the notion of mini-batches. Here b is a user-defined batch size, and every b consecutive inputs are treated as one batch.
The algorithm now becomes:
for j = 1, 2, ... do
    initialize avg-gradient-j = 0
    for s = 1 to b do
        define i = (j - 1)b + s;
        predict wj;
        receive input zi sampled i.i.d. from the unknown distribution;
        suffer loss f(wj, zi);
        gradient-i = grad-w f(wj, zi);
        avg-gradient-j = avg-gradient-j + (1/b) gradient-i;
    end
    set (wj+1, aj+1) = phi(auxiliary state vector aj, avg-gradient-j, step-size)
end
The prediction remains constant for the duration of each batch and is updated only when the batch ends. While processing the b inputs of batch j, the algorithm computes and accumulates their gradients into the average gradient; each mini-batch thus produces one average gradient. The appeal of the serial mini-batch setting is that the update rule is applied less frequently, which may have computational benefits.
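To make the loop concrete, here is a minimal runnable sketch in C#. Everything beyond the pseudocode above is an assumption made for illustration: the delegate names, the nextInput callback standing in for the input stream, and the plain gradient step used in place of the paper's generic update rule phi.

using System;

// A minimal sketch of the serial mini-batch loop described above.
class SerialMiniBatch
{
    // Gradient oracle: returns the gradient of f(w, z) with respect to w.
    public delegate double[] GradientOracle(double[] w, double[] z);

    // Update rule phi: maps (current prediction, batch-average gradient, step size) to the next prediction.
    public delegate double[] UpdateRule(double[] w, double[] avgGradient, double stepSize);

    public static double[] Run(double[] w, Func<double[]> nextInput,
                               GradientOracle gradient, UpdateRule phi,
                               int b, int numBatches, double stepSize)
    {
        for (int j = 1; j <= numBatches; j++)
        {
            var avgGradient = new double[w.Length];      // avg-gradient-j = 0

            for (int s = 1; s <= b; s++)
            {
                double[] z = nextInput();                // receive input z_i, i = (j - 1)b + s
                double[] g = gradient(w, z);             // gradient of the loss at the fixed prediction w_j

                for (int k = 0; k < w.Length; k++)       // accumulate (1/b) * gradient-i
                    avgGradient[k] += g[k] / b;
            }

            // The prediction stays constant for the whole batch; update only at the batch boundary.
            w = phi(w, avgGradient, stepSize);
        }
        return w;
    }

    static void Main()
    {
        var rng = new Random(0);

        // Toy usage: minimize the expected squared loss 0.5 * (w - z)^2 for noisy samples around 3.
        double[] wFinal = Run(
            w: new[] { 0.0 },
            nextInput: () => new[] { 3.0 + rng.NextDouble() - 0.5 },
            gradient: (wj, z) => new[] { wj[0] - z[0] },
            phi: (wj, g, eta) => new[] { wj[0] - eta * g[0] },   // plain gradient step standing in for phi
            b: 10, numBatches: 100, stepSize: 0.1);

        Console.WriteLine($"final w = {wFinal[0]:F3}");
    }
}

The point of the sketch is the structure: gradients are accumulated over b inputs while the prediction stays fixed, and the update rule is invoked only once per batch.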
#codingexercise
int Max(List<int> numbers)
{
    // Returns the largest element; assumes the list is non-empty.
    int max = int.MinValue;
    for (int i = 0; i < numbers.Count; i++)
        if (numbers[i] > max)
            max = numbers[i];
    return max;
}