Monday, December 7, 2015

#codingexercise
Given a binary tree, find the largest path size from one leaf to another:
int getLargestPathSize(node root)
{
    // Collect the depth of every leaf, then sum the two largest depths.
    // Note: this assumes the two deepest leaves lie in different subtrees of
    // the root; a fully general solution computes the diameter at every node.
    var leafDepths = new List<int>();
    setdepth(root, 0, leafDepths);
    leafDepths.Sort();
    leafDepths.Reverse();
    if (leafDepths.Count >= 2)
        return leafDepths[0] + leafDepths[1];
    return 0;
}

void setdepth(node root, int level, List<int> leafDepths)
{
    if (root == null) return;
    int curLevel = level + 1;
    setdepth(root.left, curLevel, leafDepths);
    setdepth(root.right, curLevel, leafDepths);
    if (root.left == null && root.right == null)
        leafDepths.Add(curLevel);
}


Dekel, Gilad-Bachrach, Shamir and Xiao described optimal distributed online prediction using mini-batches. They focus on large-scale and high-rate prediction problems. The stochastic online prediction problem is one where a stream of inputs z1, z2, ..., zm is sampled independently from a fixed unknown distribution over a sample space Z. Before observing each sample zi, a point wi is predicted from the set W. After making the prediction wi, we observe zi and suffer the loss f(wi, zi), where f is a predefined loss function. Then zi is used to improve the prediction mechanism for the future. The quality of predictions is measured using the notion of regret, which is the difference between the cumulative loss of our predictions and the cumulative loss of the fixed predictor that is optimal with respect to the underlying distribution. Since the regret depends on the stochastic inputs, it is a random variable.
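In symbols, with w* denoting the predictor in W that is optimal for the underlying distribution, the regret after m inputs can be written (a standard formulation consistent with the description above) as:

R(m) = sum over i = 1..m of f(wi, zi)  -  sum over i = 1..m of f(w*, zi)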

Furthermore, they bound this regret by O(sq.rt(m)), which matches the optimal bound for a serial gradient-based algorithm. Their method can use any gradient-based update rule by treating the serial online prediction algorithm as a black box and converting it into a parallel or distributed online prediction algorithm.

Their method is therefore succinctly put as

For j = 1, 2, ... do
     predict wj;
     receive input zj sampled i.i.d. from the unknown distribution;
     suffer loss f(wj, zj);
     define the gradient gj = grad f(wj, zj);
     compute (wj+1, aj+1) = phi(aj, gj, alpha_j);
end

Notice that the last step is an unspecified update rule phi that takes three arguments:

- an auxiliary state vector aj that summarizes all of the necessary information about the past,

- the gradient gj of the loss function evaluated at wj, and

- an iteration-dependent parameter alpha_j that serves as the step size,

and computes the next predictor wj+1 and the new auxiliary state vector aj+1.
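As a concrete illustration (my own sketch, not code from the paper), plain stochastic gradient descent fits this template: the auxiliary state is just the current predictor, and phi subtracts the scaled gradient. The gradient delegate, the input stream, and the step-size schedule below are all assumptions made for the sake of the example.

// Sketch of the serial template with SGD as the update rule phi.
// 'gradient(w, z)' stands in for grad f(w, z); inputs arrive as a stream.
double[] RunSerial(IEnumerable<double[]> inputs,
                   Func<double[], double[], double[]> gradient,
                   double[] w0, double alpha0)
{
    var w = (double[])w0.Clone();     // predictor wj; here it doubles as the auxiliary state aj
    int j = 0;
    foreach (var z in inputs)         // receive input zj sampled from the unknown distribution
    {
        j++;
        var g = gradient(w, z);       // gj = grad f(wj, zj); the loss f(wj, zj) is suffered here
        double alpha = alpha0 / Math.Sqrt(j);   // iteration-dependent step size alpha_j
        for (int d = 0; d < w.Length; d++)
            w[d] -= alpha * g[d];     // phi: wj+1 = wj - alpha_j * gj
    }
    return w;
}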

This is very similar in nature to what I tried to do by evaluating the data once as it arrives using a stochastic gradient based summary, but by initializing the current residual with a similar update so that the new data is treated while the previous iteration's data is already fully reduced. For example, we can initialize the residual for the next iteration with the weighted average of the conventional initial residual and the previous initial residual, and we can measure the loss with the sum of squares.
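A minimal sketch of that residual initialization, assuming the residual is just a vector and lambda is a tunable weight between 0 and 1 (both the names and the blend are my own illustration):

// Hypothetical sketch: blend the conventional initial residual computed from
// the new data with the residual carried over from the previous iteration.
double[] InitResidual(double[] conventional, double[] previous, double lambda)
{
    var r = new double[conventional.Length];
    for (int i = 0; i < r.Length; i++)
        r[i] = lambda * conventional[i] + (1 - lambda) * previous[i];
    return r;
}

// Sum-of-squares measure of the loss on the current residual.
double SumOfSquares(double[] residual)
{
    double s = 0;
    foreach (var x in residual)
        s += x * x;
    return s;
}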

 

Friday, December 4, 2015

The authors of the paper on Optimal Distributed Online Prediction Using Mini-Batches provide what they call a distributed mini-batch algorithm, a method of converting any serial gradient-based online prediction algorithm into a parallel or distributed algorithm. This method has two important properties:
It can use any gradient-based update rule for serial online prediction as a black box and convert it into a parallel or distributed online prediction algorithm.
If the loss function f(w,z) is smooth in w, the prediction, then the method attains an asymptotically optimal regret bound of O(sq.rt(m)). Moreover, the coefficient of the dominant sq.rt(m) term is the same as in the serial bound, and is independent of k and of the network topology.
The idea behind mini-batches is that we do not process the entire data set at once, but in small batches of inputs. The gradient is not updated after the whole dataset has been seen, nor after every single input as in the purely stochastic method, but after a fixed number of inputs; it is therefore a cross between the batch and the stochastic gradient methods.
They average a mini-batch of stochastic gradients computed at each predictor, and they propose that the standard deviation of those stochastic gradients is what enters the coefficient of the regret bound; a sketch of the averaging step follows below.
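A minimal sketch of that averaging step, assuming each of the k workers has already computed a stochastic gradient over its share of the mini-batch (the names and shapes here are my own, not the paper's):

// Average the per-node stochastic gradients for one mini-batch, then take a
// single gradient step in the averaged direction.
double[] MiniBatchStep(double[] w, List<double[]> nodeGradients, double alpha)
{
    int dim = w.Length;
    var avg = new double[dim];
    foreach (var g in nodeGradients)
        for (int d = 0; d < dim; d++)
            avg[d] += g[d] / nodeGradients.Count;
    var next = (double[])w.Clone();
    for (int d = 0; d < dim; d++)
        next[d] -= alpha * avg[d];     // one update per mini-batch, not one per input
    return next;
}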
#codingexercise
Given two single digit integer arrays where the elements taken together form a number, maximize the number by replacing it with elements of the second array.
void MaximizeWithReplace(List<int> input, List<int> replace)
{
    // Greedy: use the largest replacement digits first, and for each one
    // replace the leftmost input digit that is smaller than it.
    replace.Sort();
    replace.Reverse();
    for (int i = 0; i < replace.Count; i++)
    {
        bool replaced = false;
        for (int j = 0; j < input.Count; j++)
        {
            if (input[j] < replace[i])
            {
                input[j] = replace[i];
                replaced = true;
                break;
            }
        }
        if (!replaced) break; // no digit can be improved any further
    }
}
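For example, with input {2, 5, 1} and replace {3, 9}, the replacements are used in the order 9, 3: the 9 replaces the leading 2 and the 3 replaces the trailing 1, so 251 becomes 953.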

Thursday, December 3, 2015

Software to detect if a Virtual Machine is busy:
Introduction:
Servers are ubiquitous, and a private cloud provider has hundreds of servers in its inventory. Most of these servers fall under one of the following categories: Initialized, Active, Suspended, and Terminated. A server is generally in the initialized state when it is being imaged and prepared for commissioning. A server is active when it is being used frequently by the user. A server is in the suspended state when there is confirmation that the user is no longer using it actively. A server is in the terminated state when it is ready to be reclaimed or repurposed. Periodically a cloud provider may want to reclaim resources such as servers. In such cases, it is important for the cloud provider to mark the servers as suspended before they can be terminated. Going through several hundred servers and marking them as suspended only on user confirmation is not very productive. If instead there were a way to detect whether the servers are active, we could be more judicious about which users to target for confirmation.
Problem definition: How do we know whether a server is active or not?
Solution: Servers come in many different OS versions and flavors. A script that might work in one may not work in another. Therefore, we may need to target each OS family with the nuances and metrics that they allow.
On Windows, we have a lot of instrumentation built in. It comes in many different forms such as WMI, perfmon counters, MOM, and tracing and logging. There is probably very little need to write sophisticated code to figure out any particular component or system. That said, a Windows task viewer may show varied peaks and valleys of activity even without user input.
On Macintosh, we have plenty of dscripts stashed away under the hood in Automations. In fact, a Mac OS can reliably say when the system is not doing much activity.
On Ubuntu, we have several different counters, typically under the /proc folder. The top and uptime commands often give metrics that indicate how active a VM has been. The first three fields of /proc/loadavg are load averages that give an idea of the number of jobs in the run queue. Load average is not the same as CPU utilization; generally, if you have one process running at 100%, the load average can be expected to approach 1.
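A minimal sketch of reading those load averages from a Linux guest (assuming /proc/loadavg is present and readable):

// Read the 1-, 5- and 15-minute load averages from /proc/loadavg,
// which looks like: "0.12 0.08 0.05 1/123 4567".
double[] ReadLoadAverages()
{
    var fields = System.IO.File.ReadAllText("/proc/loadavg").Split(' ');
    var culture = System.Globalization.CultureInfo.InvariantCulture;
    return new[]
    {
        double.Parse(fields[0], culture),
        double.Parse(fields[1], culture),
        double.Parse(fields[2], culture)
    };
}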

Conclusion: Periodic cron jobs can collect statistics over time that reflect the usage of a VM. Given this visibility, different orchestration frameworks and programs may be able to use this information for several purposes such as monitoring and resource conservation.
#codingexercise
int FindSmallestWindow(List<int> haystack, List<int> needle)
{
    // Parameter validation omitted; assume no duplicates in haystack or needle
    // and that every needle element occurs somewhere in the haystack.
    var indexes = new Dictionary<int, int>();   // value -> position in haystack
    for (int i = 0; i < haystack.Count; i++)
        indexes.Add(haystack[i], i);
    // With no duplicates, the smallest window is bounded by the smallest and
    // largest positions at which the needle elements occur.
    int min = haystack.Count - 1;
    int max = 0;
    for (int i = 0; i < needle.Count; i++)
    {
        int index = indexes[needle[i]];
        if (index < min) min = index;
        if (index > max) max = index;
    }
    int range = max - min + 1;
    return range;
}
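For example, with haystack {1, 5, 2, 8, 3} and needle {2, 8}, the needle elements sit at positions 2 and 3, so the smallest window has size 2.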

Wednesday, December 2, 2015

Ofer, Ran, Ohad and Lin's paper focuses on 'large-scale' and 'high-rate' online prediction problems, where parallel and distributed computing is critical to providing a realtime service. They define precisely the problem to which they want to apply online stochastic prediction methods. There is a stream of inputs z1, z2, ... where each zi is sampled independently from a fixed unknown distribution over a sample space Z. Before observing each zi, a point wi is predicted from a set W. After making the prediction wi, zi is observed and a loss f(wi, zi) is suffered, where f is a predefined loss function. Then zi is used to improve the prediction mechanism for the future, e.g. using a stochastic gradient method. The goal is to accumulate the smallest possible loss as we process the sequence of inputs. The quality of predictions is measured using the notion of regret, which is the sum of the differences between the loss on each prediction and the loss of the fixed predictor w* that is optimal with respect to the underlying distribution. This regret is a random variable since it relies on the stochastic inputs zi, each of which is sampled independently from a fixed unknown distribution. Hence we are free to apply Markov chains wherever appropriate. The expected regret is bounded first for simplicity, and then these results are used to obtain high-probability bounds on the actual regret. Also, the discussion is restricted to convex prediction problems, those with a global minimum such as the ones we have discussed earlier. For example, we get a convex surface when the matrix A in the linear system Ax = b is positive definite.
The distributed computing system in this case is modeled as a set of k nodes, each of which is an independent processor, and a network that enables the nodes to communicate with each other. Each node receives its own incoming stream of examples from an outside source, so the workload is split across the nodes and each node performs the same operations under the algorithm.
The performance of the algorithm can be bounded on the distributed environment. On one hand, an ideal solution is to run a serial algorithm on a single processor that is k times faster than a single node. This solution is optimal because there is no communication and no partitioning or aggregation of data; besides, most distributed algorithms can run on one processor. There is another paper that already bounds the optimal regret that can be achieved by a gradient-based serial algorithm on an arbitrary convex loss, and this bound is of the order of sq.rt(m). On the other hand, a trivial solution is to run the serial algorithm independently on each of the k nodes. This requires no communication and can be called the no-communication algorithm. The trouble is that it scales poorly with the network size k. In fact, it can be shown to be a factor of sq.rt(k) worse than the ideal solution: each node processes m/k inputs, so its expected regret is of the order of sq.rt(m/k), and summing over the k nodes gives k * sq.rt(m/k) = sq.rt(km), which is sq.rt(k) times the ideal sq.rt(m) bound.

Tuesday, December 1, 2015

Today we review the paper Optimal Distributed Online Prediction Using Mini-Batches by Dekel, Gilad-Bachrach, Shamir and Xiao. This paper notes that online prediction methods are typically presented as serial algorithms running on a single processor, but to keep up with ever-increasing data sets we need to parallelize them. In this context they present a distributed mini-batch algorithm, a method that converts many serial gradient-based online prediction algorithms into distributed algorithms. In addition, the method takes into account communication latencies between nodes in the distributed environment. In fact, they show a linear speedup of distributed stochastic optimization on multiple processors. They also mention the advantages of their approach on a web-scale prediction problem.
We will discuss this but let us take a short break to discuss an interview question below.
Given a series of letters and another, smaller set, find the smallest window in the larger series that contains all the elements from the smaller set.
One way to do this is to slide a window over the larger series, keeping track of which of the required letters the window currently covers, and take the minimum length over all windows that cover them all; a sketch follows below.
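A minimal sketch of that sliding-window approach, assuming the smaller set has no duplicate letters (the function and variable names are my own):

// Returns the length of the smallest window of 'text' containing every
// letter of 'letters', or 0 if no such window exists.
int SmallestWindow(string text, string letters)
{
    var needed = new HashSet<char>(letters);
    var counts = new Dictionary<char, int>();
    int covered = 0, best = int.MaxValue, left = 0;
    for (int right = 0; right < text.Length; right++)
    {
        char c = text[right];
        if (needed.Contains(c))
        {
            counts.TryGetValue(c, out int n);
            counts[c] = n + 1;
            if (n == 0) covered++;          // first occurrence of this letter in the window
        }
        while (covered == needed.Count)      // window covers every required letter; shrink it
        {
            best = Math.Min(best, right - left + 1);
            char d = text[left++];
            if (needed.Contains(d) && --counts[d] == 0) covered--;
        }
    }
    return best == int.MaxValue ? 0 : best;
}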

Monday, November 30, 2015

#codingexercise
From a stream of numbers, get N minimum assuming no duplicates
SortedList<int, int> GetMinStream(BinaryReader reader, int N)
{
    // Keep the N smallest values seen so far, sorted ascending by key.
    // Assumes no duplicates, so Add never throws on an existing key.
    var ret = new SortedList<int, int>(N + 1);
    while (reader.PeekChar() != -1)
    {
        int val = reader.ReadInt32();
        if (ret.Count < N)
        {
            ret.Add(val, val);
        }
        else if (val < ret.Keys[ret.Count - 1])  // smaller than the current largest kept
        {
            ret.Add(val, val);
            ret.RemoveAt(ret.Count - 1);         // evict the largest to keep only N
        }
    }
    return ret;
}

For medians, we can maintain two heaps: a max-heap for the lower half of the values and a min-heap for the upper half; a sketch follows below.
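A minimal sketch of that two-heap approach, assuming a recent .NET with PriorityQueue<TElement, TPriority> (the class and member names here are my own):

// lower holds the smaller half as a max-heap (priorities negated);
// upper holds the larger half as a min-heap.
class StreamingMedian
{
    private readonly PriorityQueue<int, int> lower = new PriorityQueue<int, int>();
    private readonly PriorityQueue<int, int> upper = new PriorityQueue<int, int>();

    public void Add(int val)
    {
        if (lower.Count == 0 || val <= lower.Peek())
            lower.Enqueue(val, -val);
        else
            upper.Enqueue(val, val);
        // rebalance so the two halves differ in size by at most one
        if (lower.Count > upper.Count + 1)
        {
            int v = lower.Dequeue();
            upper.Enqueue(v, v);
        }
        else if (upper.Count > lower.Count + 1)
        {
            int v = upper.Dequeue();
            lower.Enqueue(v, -v);
        }
    }

    public double Median() // assumes at least one value has been added
    {
        if (lower.Count == upper.Count)
            return (lower.Peek() + upper.Peek()) / 2.0;
        return lower.Count > upper.Count ? lower.Peek() : upper.Peek();
    }
}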

Sunday, November 29, 2015

Today we try to improve the stochastic gradient method. Recall that it is already more responsive than the batch gradient method because it yields interim solutions without evaluating the entire data set. Whether the error is measured by steepest descent on some objective or by a sum of squares, in either case we apply the correction iteratively: after every sample in the stochastic gradient method, as opposed to after all the samples in the batch gradient method. Since the updates to common parameters such as the residual happen after every sample, the stochastic gradient method is not conducive to parallelization, unlike the batch gradient method, which is in summation form. Each update requires holding a lock on the common parameters, which in effect serializes the parallel processing. With mini-batch gradient, we try to get the best of both worlds by reducing the number of times we need to take the lock and by distributing the data over parallel processors. However, parallelization is not restricted to exploiting summation forms, whether by distributing the data to each processor directly or indirectly by distributing the iterations over the processors.
We instead use a technique of leap-frogging over the data by using statistics, metadata, or summary information computed over it. The summaries are generated as and when the data becomes available, while at the same time we decouple the data processing from the algorithm that finds the solution. For example, we can take the data and calculate the step lengths in the search direction, computing the correction as a data-processing summary for any or all of the data as and when it becomes available. But instead of updating the residual after one or several batches, we apply the residual updates only when we want to find the solution, and this is independent of the data processing. The new residual is used in the data processing the next time, which is what tightly couples the algorithm to the data. If we could instead come up with an algorithm that works off the data summaries rather than the data processing, we would decouple the data-processing push operation from the solution-finding pull operation of the algorithm. One way to do this would be to generate different summaries by 'guessing' different residuals for the data and then have the algorithm use the summary closest to the newly computed residual. The trick is to have the guesses align closely with those that cause convergence (something we do currently by taking feedback from the previous loop and keeping the loop iterations small). Another way would be to perform the entire algorithm as-is on the local data of each processor to find the local minima of the model, and then find the global minimum over all the data. Then, if we have the luxury of iterating globally, we can update the residual with the weighted averages of the residuals found so far.
The key idea here is to evaluate the data once to find the optimal solution for it, as with the conventional algorithm, but when more data becomes available, to combine the previous solution or summary with the current data to find the new solution or summary. This is what we call leap-frogging. In other words, while both stochastic and batch gradient methods require data to be processed again, we require data to be processed incrementally, such that data once processed does not need to be processed again. The update therefore changes from applying a newer residual to the same data to one that applies corrections to the current data based on previously computed partial results. This way the processing of newer data can be done in parallel, while the computation of the result is done serially and only once. One variation of SGD that comes close along these lines is momentum learning, where a factor of the previous update (an accumulated gradient) is added to the weight update for the current iteration; a sketch follows below.
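A minimal sketch of that momentum update, assuming a simple vector model, a fixed learning rate, and a momentum factor mu (all names here are my own):

// Classical momentum: the velocity accumulates a decaying sum of past
// gradients, and the weights move along the velocity instead of the raw gradient.
void MomentumStep(double[] w, double[] gradient, double[] velocity,
                  double learningRate, double mu)
{
    for (int d = 0; d < w.Length; d++)
    {
        velocity[d] = mu * velocity[d] + learningRate * gradient[d];
        w[d] -= velocity[d];
    }
}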