Thursday, December 10, 2015

Today we continue to review the paper by Dekel, Gilad-Bachrach, Shamir and Xiao which described optimal distributed online prediction using mini-batches. In this section we discuss their analysis of the no-communication parallel solution. Using the abstract notation psi(sigma^2, m) for the expected regret bound as a function of the gradient variance sigma^2 and the number of inputs m, they proceed with the naive no-communication solution. Here each of the k nodes in the parallel system applies the same serial update rule to its own sub-stream of the high-rate input, and no communication takes place between them. If the total number of examples processed by the k nodes is m, then the examples are distributed evenly, with at most ceil(m/k) inputs per node. As we recall from the introduction, each such input is sampled independently and identically distributed (i.i.d.) from the original distribution, with the same variance bound on the stochastic gradients. Therefore each node suffers an expected regret of at most psi(sigma^2, ceil(m/k)) on its portion of the input stream, and the total regret bound is obtained by simply summing over the k nodes. Note that this summation is valid because the inputs are i.i.d. The expected regret for the entire system is therefore k times psi(sigma^2, ceil(m/k)). Now the authors mention two theorems:
The first states that if the loss function is convex and L-smooth in the prediction for each z in Z, the stochastic gradients have the bounded variance sigma^2 as above, and D is the square root of half the maximal squared distance across the prediction set, then using the iteration-dependent parameter alpha_j = L + (sigma/D) * sqrt(j), where j is the iteration number, the expected regret is bounded by 2 D^2 L + 2 D sigma sqrt(m). This holds for both of the update rules cited earlier by the authors - projected gradient descent and the dual averaging method. Using this form of the bound in the no-communication parallel system analysis, the expected regret is bounded by 2 k D^2 L + 2 D sigma k sqrt(m/k). Since k sqrt(m/k) = sqrt(k) sqrt(m), comparing this bound to the one for the ideal serial solution (k = 1) shows it is approximately sqrt(k) times worse in the leading term. This is the price of no communication in this parallel system. However, the authors avoid the sqrt(k) cost with their distributed mini-batch algorithm.
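As a sanity check on that sqrt(k) gap, the two bounds can be compared numerically. This is a minimal sketch assuming illustrative values for D, L and sigma, with psi written out exactly as in the theorem above:

```python
import math

def psi(sigma, m, D, L):
    # serial regret bound for an L-smooth convex loss:
    # psi(sigma^2, m) = 2*D^2*L + 2*D*sigma*sqrt(m)
    return 2 * D**2 * L + 2 * D * sigma * math.sqrt(m)

# hypothetical problem constants
D, L, sigma = 1.0, 1.0, 1.0
m, k = 1_000_000, 100

serial = psi(sigma, m, D, L)                       # ideal serial solution
no_comm = k * psi(sigma, math.ceil(m / k), D, L)   # k nodes, m/k inputs each

# the dominant terms differ by roughly a factor of sqrt(k) = 10
print(serial, no_comm, no_comm / serial)
```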
Note that, as with most mini-batch algorithms, we trade off between the full-batch and the stochastic method because we can reset after each batch of m/k inputs. While we use the summation form after every batch, we don't go the further step of extrapolating the results of the previous mini-batch to the current mini-batch.
#codingexercise
int Min(List<int> numbers)
{
int min = int.MaxValue;
for (int i = 0; i < numbers.Count; i++)
    if (numbers[i] < min)
        min = numbers[i];
return min;
}
void BillingProvider(Customer c, Dictionary<Metrics, KeyValuePair<double, double>> usage)
{
// each entry maps a metric to a (quantity, rate) pair
double total = 0;
foreach (var kvp in usage)
{
    double sum = kvp.Value.Key * kvp.Value.Value; // quantity * rate
    Console.WriteLine("{0} costs {1}", kvp.Key, sum);
    total += sum;
}
Console.WriteLine("Total={0}", total);
}

Tuesday, December 8, 2015

Today we continue discussing the paper by Dekel, Gilad-Bachrach, Shamir and Xiao which described optimal distributed online prediction using mini-batches. Their method can use any gradient-based update rule by treating the serial online prediction algorithm as a black box and converting it into a parallel or distributed online prediction algorithm. Their regret bounds for smooth loss functions are variance-based and stated in a refined form.
They provide two examples of the update rule that fits the above template. 
The first is the projected gradient descent update rule, where the next prediction is found by applying the Euclidean projection onto the set W to the current prediction minus the scaled gradient. The scaling factor 1/alpha_j is also called the decaying learning rate, where alpha_j is an iteration-dependent parameter, typically of order sqrt(j) with j the current iteration. In this example the auxiliary state that summarizes all of the necessary information about the past is trivial: we can simply set a_j to the current prediction w_j and use it as is in the update rule to get the next prediction.
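A minimal sketch of one such update step, assuming a one-dimensional quadratic loss, a unit-interval feasible set W, and illustrative constants L, sigma and D for the step parameter alpha_j = L + (sigma/D)*sqrt(j):

```python
import math

def project(w, radius=1.0):
    # Euclidean projection onto W = {w : |w| <= radius} (one-dimensional here)
    return max(-radius, min(radius, w))

def projected_gradient_step(w, g, j, L=1.0, sigma=1.0, D=1.0):
    # iteration-dependent parameter alpha_j = L + (sigma / D) * sqrt(j)
    alpha = L + (sigma / D) * math.sqrt(j)
    # move against the gradient, scaled by 1/alpha_j, then project back onto W
    return project(w - g / alpha)

# toy run on f(w, z) = (w - z)^2 with a fixed input z = 0.5
w = 0.0
for j in range(1, 100):
    g = 2 * (w - 0.5)          # gradient of the loss at the current prediction
    w = projected_gradient_step(w, g, j)
print(w)  # settles at the minimizer 0.5
```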
The other example given is the dual averaging method. Its update rule takes the form where the next prediction minimizes the inner product of the sum of past gradients with the prediction, plus a monotonically increasing sequence of positive numbers times a strongly convex auxiliary function.
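A minimal dual averaging sketch under similar assumptions, taking h(w) = w^2/2 as the strongly convex auxiliary function and beta_j = sqrt(j) as the increasing sequence; the quadratic loss and all constants are illustrative:

```python
import math

def dual_averaging_step(g_sum, j, beta0=1.0):
    # the next prediction minimizes <g_sum, w> + beta_j * h(w);
    # with h(w) = w^2 / 2 and beta_j = beta0 * sqrt(j), the unconstrained
    # minimizer is w = -g_sum / beta_j
    beta = beta0 * math.sqrt(j)
    return -g_sum / beta

# toy run on f(w, z) = (w - z)^2 with a fixed input z = 0.5
w, g_sum = 0.0, 0.0
for j in range(1, 200):
    g_sum += 2 * (w - 0.5)     # accumulated gradients: the auxiliary state
    w = dual_averaging_step(g_sum, j + 1)
print(w)  # drifts toward the minimizer 0.5
```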
#codingexercise

Given two arrays, the first one containing a number sequence and the second one containing numbers that can be found in the sequence, find the smallest window in the number sequence that has all the elements from the second array:
int FindSmallestWindow(List<int> haystack, List<int> needle)
{
// parameter validation and assume no duplicates in haystack or needle.
var indexes = new SortedList<int, int>();
for (int i = 0; i < haystack.Count; i++)
    indexes.Add(haystack[i], i);
// look up each needle element's position and track the extremes;
// with no duplicates, the window must span the smallest to the largest index
int min = haystack.Count - 1;
int max = 0;
for (int i = 0; i < needle.Count; i++)
{
    int index = indexes[needle[i]];
    if (index < min) min = index;
    if (index > max) max = index;
}
return max - min + 1;
}
int GetWindow(int[] haystack, int[] needle)
{
int minrange = haystack.Length; // default value: the whole array
for (int start = 0; start < haystack.Length; start++)
{
var found = new Dictionary<int, int>();
for (int i = start; i < haystack.Length; i++)
{
if (needle.Contains(haystack[i]))
{
found[haystack[i]] = i;
if (found.Count == needle.Length)
{
// all needle elements seen: measure this window
int min = found.Values.Min();
int max = found.Values.Max();
if (max - min + 1 < minrange)
    minrange = max - min + 1;
break;
}
}
}
}
return minrange;
}

Monday, December 7, 2015

#codingexercise
Given a binary tree, find the largest path size from one leaf to another:
// The longest leaf-to-leaf path need not pass through the root, so a
// recursive helper returns each subtree's height while tracking the best
// left-height + right-height sum seen at any node.
int height(node* root, int& best)
{
  if (!root) return 0;
  int left = height(root->left, best);
  int right = height(root->right, best);
  if (left + right > best)
    best = left + right; // path joining the deepest leaves of the two subtrees
  return 1 + max(left, right);
}

int getLargestPathSize(node* root)
{
  int best = 0;
  height(root, best);
  return best; // number of edges on the longest leaf-to-leaf path
}


Dekel, Gilad-Bachrach, Shamir and Xiao described optimal distributed online prediction using mini-batches. They focus on large-scale and high-rate prediction problems. The stochastic online prediction problem is one where a stream of inputs z1, z2, ... zm is sampled independently from a fixed unknown distribution over a sample space Z. Before observing each sample zi, a point wi is predicted from the set W. After making the prediction wi, we observe zi and suffer the loss f(wi, zi), where f is a predefined loss function. Then zi is used to improve the prediction mechanism for the future. The quality of predictions is measured using the notion of regret, which is the difference between the cumulative loss of our predictions and the cumulative loss of the fixed predictor that is optimal with respect to the underlying distribution. Since regret depends on stochastic variables, it is a random variable.

Furthermore, they bound this regret by O(sq.rt(m)), which is the same as the optimal bound for a serial gradient-based algorithm. And their method can use any gradient-based update rule by treating the serial online prediction algorithm as a black box and converting it into a parallel or distributed online prediction algorithm.

Their method is therefore succinctly put as

For j = 1, 2, … do

     predict wj;

     receive input zj sampled i.i.d from unknown distribution

     suffer loss f(wj, zj)

     define gradient gj = ∇f(wj, zj)

     compute (wj+1, aj+1) = phi (aj, gj, alpha-j);

end

Notice the last step is an unspecified update rule where we take the following three arguments:

-          an auxiliary state vector aj that summarizes all of the necessary information about the past

-          a gradient gj of the loss function evaluated at wj

-          and an iteration dependent parameter alpha-j as stepsize

to compute the next predictor and the new auxiliary state vector.
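The template above can be transcribed directly. In this sketch the loss, the input distribution, and the choice of phi (a plain gradient step, with the auxiliary state holding just the current prediction) are all illustrative assumptions, and the regret is measured against the fixed predictor w* of the toy distribution:

```python
import random

random.seed(0)

def phi(a, g, alpha):
    # one concrete update rule: a plain gradient step, where the
    # auxiliary state a is simply the current prediction itself
    w_next = a - g / alpha
    return w_next, w_next

w, a = 0.0, 0.0
regret = 0.0
w_star = 0.5                        # optimal fixed predictor for this toy setup
for j in range(1, 1001):
    z = random.gauss(0.5, 0.1)      # receive input z_j sampled i.i.d.
    regret += (w - z) ** 2 - (w_star - z) ** 2   # loss relative to w*
    g = 2 * (w - z)                 # gradient g_j of f(w, z) at w_j
    alpha = 1.0 + j ** 0.5          # iteration-dependent parameter alpha_j
    w, a = phi(a, g, alpha)
print(w, regret)
```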

This is very similar in nature to what I tried by evaluating the data once as it arrives using a stochastic-gradient-based summary, but initializing the current residual with a similar update so as to treat the new data while fully reducing the previous iteration's data. For example, we can initialize the residual for the next iteration with the weighted average of the conventional initial residual and the previous initial residual. And we can measure the loss with the sum of squares.

 

Friday, December 4, 2015

The authors of the paper on Optimal Distributed Online Prediction Using Mini-Batches provide a method that can be called a distributed mini-batch algorithm, a method of converting any serial gradient-based online prediction algorithm into a parallel or distributed algorithm. This method has two important properties:
It can use any gradient-based update rule for serial online prediction as a black box and convert it into a parallel or distributed online prediction algorithm.
If the loss function f(w,z) is smooth in w, the prediction, then this method attains an asymptotically optimal regret bound of O(sq.rt(m)). Moreover, the coefficient of the dominant term sq.rt(m) is the same as in the serial bound, and independent of k and of the network topology.
The notion of mini-batches is that the gradient is updated neither after the entire dataset nor after every single input, but after a certain number of inputs. It is not the same as the stochastic method, which can update the gradient after every iteration. Hence it is a cross between the batch and stochastic gradient methods.
They average a mini-batch of stochastic gradients computed at each predictor, and show that the standard deviation of those stochastic gradients appears in the coefficient of the regret bound.
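The role of the variance can be checked empirically: averaging a mini-batch of b independent stochastic gradients divides their variance by b. A small sketch with a synthetic noisy gradient (all numbers illustrative):

```python
import random

random.seed(1)

def noisy_gradient():
    # stochastic gradient = true gradient (here 1.0) plus zero-mean unit noise
    return 1.0 + random.gauss(0.0, 1.0)

def averaged_gradient(b):
    # average of a mini-batch of b independent stochastic gradients
    return sum(noisy_gradient() for _ in range(b)) / b

def empirical_variance(sample):
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample) / len(sample)

single = empirical_variance([noisy_gradient() for _ in range(10000)])
batched = empirical_variance([averaged_gradient(100) for _ in range(10000)])
print(single, batched)  # batched is roughly single / 100
```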
#codingexercise
Given two single-digit integer arrays where the elements of the first, taken together, form a number, maximize the number by replacing its digits with elements of the second array.
void MaximizeWithReplace(List<int> input, List<int> replace)
{
// greedily spend the largest replacement digits on the most significant digits they improve
replace.Sort();
replace.Reverse();
for (int i = 0; i < replace.Count; i++){
  bool replaced = false;
  for (int j = 0; j < input.Count; j++){
         if (input[j] < replace[i]){
              input[j] = replace[i];
              replaced = true;
              break;
         }
     }
   if (!replaced) break; // no remaining digit can be improved
   }
}

Thursday, December 3, 2015

Software to detect if a Virtual Machine is busy:
Introduction:
Servers are ubiquitous and a private cloud provider has hundreds of servers in their inventory. Most of these servers fall under one of the following categories: Initialized, Active, Suspended, and Terminated. A server is generally in the initialized state when it is being imaged and prepared for commissioning. A server is active when it is being used frequently by the user. A server is in the suspended state when there is confirmation that the user is no longer using it actively. A server is in the terminated state when it is ready to be reclaimed or repurposed. Periodically a cloud provider may want to reclaim resources such as servers. In such cases, it is important for the cloud provider to mark the servers as suspended before they can be terminated. Going through several hundred servers and marking them as suspended only on user confirmation is not very productive. Instead, if there were a way to detect whether the servers are active, then we could be more judicious about which users to target for confirmation.
Problem definition: How do we know whether a server is active or not ?
Solution: Servers come in many different OS versions and flavors. A script that might work in one may not work in another. Therefore, we may need to target each OS family with the nuances and metrics that they allow.
On Windows, we have a lot of instrumentation built in. It comes in many different forms such as WMI, perfmon counters, MOM, and tracing and logging. There is probably very little need to write sophisticated code to figure out any particular component or system. That said, a Windows task viewer may show varied levels of peaks and valleys of activity even without user input.
On Macintosh, we have plenty of scripts stashed away under the hood in Automations. In fact, a Mac OS can reliably report when the system is not doing much activity.
On Ubuntu, we have several different counters, typically under the /proc folder. In fact, the top and uptime commands often give metrics that indicate how active a VM has been. The first three fields of /proc/loadavg are load averages that give the number of jobs in the run queue, averaged over one, five, and fifteen minutes. Load average is different from waiting for the CPU. Generally, if you have one process running at 100%, then the load average can be expected to approach 1.
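A minimal sketch of reading those fields; the 0.1 idle threshold is an arbitrary assumption, and on a real VM the text would come from /proc/loadavg itself:

```python
def parse_loadavg(text):
    # /proc/loadavg looks like: "0.03 0.10 0.09 1/234 5678";
    # the first three fields are the 1-, 5-, and 15-minute load averages
    fields = text.split()
    return float(fields[0]), float(fields[1]), float(fields[2])

def looks_idle(text, threshold=0.1):
    # heuristic: call the VM idle if all three averages sit below the threshold
    return all(avg < threshold for avg in parse_loadavg(text))

# on a real VM: text = open("/proc/loadavg").read()
print(looks_idle("0.03 0.05 0.01 1/234 5678"))   # True
print(looks_idle("1.20 0.90 0.70 3/234 5678"))   # False
```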

Conclusion: Periodic cron jobs may collect statistics over time that reflect the usage of a VM server. Given this visibility, different orchestration frameworks and programs may be able to use this information for several purposes such as monitoring and resource conservation.

Wednesday, December 2, 2015

Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir and Lin Xiao's paper focuses on 'large-scale' and 'high-rate' online prediction problems, where parallel and distributed computing is critical to providing a realtime service. They define precisely the problem to which they want to apply online stochastic prediction methods. There is a stream of inputs z1, z2, ... where each zi is sampled independently from a fixed unknown distribution over a sample space Z. Before observing each zi, a point wi is predicted from a set W. After making the prediction wi, zi is observed and a loss f(wi, zi) is suffered, where f is a predefined loss function. Then zi is used to improve the prediction mechanism for the future, e.g. using a stochastic gradient method. The goal is to accumulate the smallest possible loss as we process the sequence of inputs. The quality of predictions is measured using the notion of regret, which is the sum of the differences between the loss of our prediction on each input and the loss of the fixed predictor w* which is optimal with respect to the underlying distribution. This regret is a random variable since it depends on the stochastic inputs zi, each sampled independently from a fixed unknown distribution. For simplicity the expected regret is bounded first, and then those results are used to obtain high-probability bounds on the actual regret. Also, this discussion is restricted to convex prediction problems, those with a global minimum such as the ones we have discussed earlier. For example, we get a convex surface when we take a positive definite matrix A as in the linear equation Ax = b.
The distributed computing system in this case is modified as a set of k nodes, each of which is an independent processor, and a network that enables the nodes to communicate with each other.  Each node receives an incoming stream of examples from an outside source so each participates as in a splitter of work load and each performs the same operations under the algorithm.
The performance of the algorithm can be bounded in the distributed environment between two extremes. On one hand, the ideal solution is to run a serial algorithm on a single processor that is k times faster than a single node. This is optimal because there is no communication and no partitioning of data or aggregation of results. Besides, most distributed algorithms can also run on one processor. Another paper already bounds the optimal regret that can be achieved by a gradient-based serial algorithm on an arbitrary convex loss, and that bound is of the order of square root(m). On the other hand, a trivial solution is to run the serial algorithm independently on each of the k nodes. This does not require any communication and can be called the no-communication algorithm. The only trouble is that it scales poorly with the network size k. In fact, it can be shown to be a factor of square root(k) worse than the ideal solution: each node processes m/k inputs, so the expected regret per node is of order sq.rt(m/k), and summing over the k nodes gives k times sq.rt(m/k), which is sq.rt(k) times sq.rt(m).

Tuesday, December 1, 2015

Today we review the paper Optimal Distributed Online Prediction using Mini Batches by Dekel, Shamir, Xiao et al. This paper mentions that online prediction methods are typically presented as serial algorithms running on a single processor. But to keep up with increasing data sets, we need to parallelize them. In this context the authors present a distributed mini-batch algorithm - a method that converts many serial gradient-based online prediction algorithms into distributed algorithms. In addition, this method takes into account communication latencies between nodes in a distributed environment. In fact, they show a linear speedup of distributed stochastic optimization on multiple processors. They also mention the advantages of their approach on a web-scale prediction problem.
We will discuss this but let us take a short break to discuss an interview question below.
Given a series of letters and another smaller set, find the smallest window in the larger set that has all the elements from the smaller set.
One way to do this would be to consider every starting position of a window, extend each until all the elements are covered, and then take the minimum of the resulting lengths.
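That brute-force idea can be sketched as follows (Python here for brevity; the example strings are illustrative):

```python
def smallest_window(haystack, needle):
    # try every starting position; extend until the window covers all of needle
    need = set(needle)
    best = None
    for start in range(len(haystack)):
        seen = set()
        for end in range(start, len(haystack)):
            if haystack[end] in need:
                seen.add(haystack[end])
            if seen == need:
                length = end - start + 1
                if best is None or length < best:
                    best = length
                break
    return best

print(smallest_window("helloworld", "low"))  # 3, the window "low"
```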