Today we continue to review the paper by Dekel, Gilad-Bachrach, Shamir and Xiao, which describes optimal distributed online prediction using mini-batches. In this section we discuss their analysis of the no-communication parallel solution. Using the abstract notation ψ(σ², m) for the expected regret bound, as a function of the gradient variance σ² and the number of inputs m, they proceed with the naive no-communication solution. Here each of the k nodes in the parallel system applies the same serial update rule to its own sub-stream of the high-rate input stream, and no communication takes place between them. If the total number of examples processed by the k nodes is m, then the stream is split evenly, with at most ⌈m/k⌉ inputs going to each node. As we recall from the introduction, each such input is independent and identically distributed (i.i.d.) from the original distribution, with the same variance bound on the stochastic gradients. Therefore each node suffers an expected regret of at most ψ(σ², ⌈m/k⌉) on its portion of the input stream, and the total regret bound is obtained by simply summing over the k nodes; this summation is valid because the inputs are i.i.d. The expected regret for the entire system is therefore at most k·ψ(σ², ⌈m/k⌉). The authors then mention two theorems:
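To make the setup concrete, here is a minimal sketch of the no-communication scheme in the same C# used in the coding exercises below. The round-robin split, the toy quadratic loss, and the fixed step size are my own illustration, not the paper's algorithm; the point is only that each node updates its own private predictor and never communicates.

using System;

// Sketch (not the authors' code): k nodes each run the same serial SGD rule
// on their own sub-stream of an i.i.d. input stream, with no communication.
class NoCommunicationDemo
{
    static void Main()
    {
        int k = 4, m = 10000;           // nodes and total inputs
        var rng = new Random(0);
        var w = new double[k];          // each node's private predictor

        for (int t = 0; t < m; t++)
        {
            // Round-robin split: node (t mod k) sees input t, so each node
            // processes at most ceil(m/k) i.i.d. inputs.
            int node = t % k;
            // Noisy gradient of the toy loss f(w) = (w - 1)^2 / 2.
            double gradient = (w[node] - 1.0) + 0.1 * (rng.NextDouble() - 0.5);
            double step = 0.1;          // fixed step, purely for the demo
            w[node] -= step * gradient; // serial update rule, applied locally
        }

        for (int i = 0; i < k; i++)
            Console.WriteLine($"node {i}: w = {w[i]:F4}");
    }
}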
The first states that if the loss function is convex and L-smooth in the prediction for every data point z in Z, the stochastic gradients satisfy the variance bound above, and D is the square root of half the maximal squared distance in the prediction set, then using the iteration-dependent step size α_j = 1/(L + (σ/D)√j), where j is the iteration number, the expected regret over m inputs is bounded by ψ(σ², m) ≤ D²L + 2Dσ√m. This holds for both examples cited earlier by the authors: projected stochastic gradient descent and the stochastic dual averaging method. Using this form of the bound in the no-communication analysis, the expected regret is bounded by k·ψ(σ², m/k) ≤ kD²L + 2Dσ·k·√(m/k) = kD²L + 2Dσ√(km). If we compare this bound to the one for the ideal serial solution (k = 1), we see that its leading term is approximately √k times worse. This is the price of no communication in this parallel system. However, the authors avoid the √k cost with their distributed mini-batch algorithm.
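The √k gap is easy to see numerically. The following snippet (with made-up constants D = L = σ = 1) evaluates the serial bound ψ(σ², m) and the no-communication bound k·ψ(σ², m/k) side by side:

using System;

// Numeric illustration (my own, with assumed constants) of the two bounds:
// serial:   psi(sigma^2, m)        = D^2 L + 2 D sigma sqrt(m)
// no-comm:  k * psi(sigma^2, m/k)  = k D^2 L + 2 D sigma sqrt(k m)
class RegretBounds
{
    static double Psi(double sigma, double m, double D, double L)
        => D * D * L + 2 * D * sigma * Math.Sqrt(m);

    static void Main()
    {
        double D = 1.0, L = 1.0, sigma = 1.0;   // assumed constants
        double m = 1_000_000;
        foreach (int k in new[] { 1, 4, 16, 64 })
        {
            double serial = Psi(sigma, m, D, L);
            double noComm = k * Psi(sigma, m / k, D, L);
            // For large m the ratio of the bounds approaches sqrt(k).
            Console.WriteLine($"k={k,3}: no-comm/serial = {noComm / serial:F2}, sqrt(k) = {Math.Sqrt(k):F2}");
        }
    }
}

As m grows, the ratio of the two bounds tends to √k, matching the comparison above.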
Note that, as with most mini-batch algorithms, we trade off between the full-batch and the purely stochastic method: the accumulated gradients are reset at each batch boundary. While we use the summation form over each batch, we do not go the further step of extrapolating the results of the previous mini-batch to the current one; a sketch of this pattern follows.
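Here is what that generic mini-batch pattern looks like (again my illustration, not the paper's distributed algorithm): gradients are summed within a batch, one update is applied, and the accumulator is reset, so nothing except the predictor itself carries over between batches.

using System;

// Sketch of a generic mini-batch update on a toy 1-D problem.
class MiniBatchSketch
{
    static void Main()
    {
        int b = 100, batches = 50;       // batch size and number of batches
        double w = 0.0, step = 0.1;
        var rng = new Random(0);

        for (int t = 0; t < batches; t++)
        {
            double gradSum = 0.0;        // reset at every batch boundary
            for (int i = 0; i < b; i++)
            {
                // Noisy gradient of the toy loss f(w) = (w - 1)^2 / 2.
                gradSum += (w - 1.0) + 0.1 * (rng.NextDouble() - 0.5);
            }
            w -= step * (gradSum / b);   // one update per batch; only w carries over
        }
        Console.WriteLine($"w = {w:F4}");
    }
}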
#codingexercise
int Min(List<int> numbers)
{
    // Returns the smallest element; int.MaxValue if the list is empty.
    int min = int.MaxValue;
    for (int i = 0; i < numbers.Count; i++)
        if (numbers[i] < min)
            min = numbers[i];
    return min;
}
// Prints the cost of each metered resource and the overall total.
// Each metric maps to a pair where Key = quantity and Value = rate.
void BillingProvider(Customer c, Dictionary<Metrics, KeyValuePair<double, double>> usage)
{
    double total = 0;
    foreach (var kvp in usage)
    {
        double sum = kvp.Value.Key * kvp.Value.Value; // quantity * rate
        Console.WriteLine("{0} costs {1}", kvp.Key, sum);
        total += sum;
    }
    Console.WriteLine("Total={0}", total);
}