Thursday, December 3, 2015

Software to detect if a Virtual Machine is busy:
Introduction:
Servers are ubiquitous, and a private cloud provider may have hundreds of servers in its inventory. Most of these servers fall under one of the following categories: Initialized, Active, Suspended, and Terminated. A server is generally in the initialized state when it is being imaged and prepared for commissioning. A server is active when it is being used frequently by the user. A server is in the suspended state when there is confirmation that the user is no longer using it actively. A server is in the terminated state when it is ready to be reclaimed or repurposed. Periodically a cloud provider may want to reclaim resources such as servers. In such cases, it is important for the cloud provider to mark the servers as suspended before they can be terminated. Going through several hundred servers and marking them as suspended only on user confirmation is not very productive. Instead, if there were a way to detect whether the servers are active or not, then we could be even more judicious about which users to target for confirmation.
Problem definition: How do we know whether a server is active or not?
Solution: Servers come in many different OS versions and flavors. A script that works on one may not work on another. Therefore, we may need to target each OS family with the nuances and metrics it allows.
On Windows, we have a lot of instrumentation built in. It comes in many different forms such as WMI, perfmon counters, MOM, and tracing and logging. There is probably very little need to write sophisticated code to figure out any particular component or system. That said, a Windows task viewer may show varied levels of peaks and valleys of activity even without user input.
On Macintosh, we have plenty of scripts stashed away under the hood in Automator. In fact, a Mac OS can reliably say when the system is not doing much activity.
On Ubuntu, we have several different counters, typically under the /proc folder. In fact, the top and uptime commands often give metrics that convey how active a VM has been. The first three fields of /proc/loadavg are load averages that give an idea of the number of jobs in the run queue. Load average differs from CPU utilization: it counts jobs in the run queue, including those waiting for the CPU. Generally, if you have one process running at 100%, then the load average can be expected to approach 1.
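As a minimal sketch, a collector along these lines could parse /proc/loadavg directly (the file layout is standard on Linux; the 0.05 threshold below is an arbitrary illustration, not a recommendation):

using System;
using System.Globalization;
using System.IO;

class LoadAverage
{
    // Reads the three load averages from /proc/loadavg,
    // e.g. "0.03 0.10 0.08 1/123 4567".
    static double[] ReadLoadAverages()
    {
        string[] fields = File.ReadAllText("/proc/loadavg").Split(' ');
        return new[]
        {
            double.Parse(fields[0], CultureInfo.InvariantCulture), // last 1 minute
            double.Parse(fields[1], CultureInfo.InvariantCulture), // last 5 minutes
            double.Parse(fields[2], CultureInfo.InvariantCulture)  // last 15 minutes
        };
    }

    static void Main()
    {
        double[] load = ReadLoadAverages();
        // An illustrative threshold on the 15-minute average.
        Console.WriteLine(load[2] < 0.05 ? "likely idle" : "likely active");
    }
}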

Conclusion: Periodic cron jobs may collect statistics over time that reflect the usage of a VM server. Given this visibility, different orchestration frameworks and programs may be able to use this information for several purposes such as monitoring, resource conservation, etc.
#codingexercise
int FindSmallestWindow(List<int> haystack, List<int> needle)
{
    // Parameter validation elided; assume needle is non-empty, there are
    // no duplicates in haystack or needle, and every needle element
    // occurs in haystack.
    var indexes = new SortedList<int, int>();
    for (int i = 0; i < haystack.Count; i++)
        indexes.Add(haystack[i], i);
    // The smallest window must span the least and the greatest haystack
    // positions over all needle elements, so track both in one pass.
    int min = haystack.Count - 1;
    int max = 0;
    for (int i = 0; i < needle.Count; i++)
    {
        int index = indexes[needle[i]];
        if (index < min) min = index;
        if (index > max) max = index;
    }
    int range = max - min + 1;
    return range;
}

Wednesday, December 2, 2015

Ofer, Ran, Ohad and Lin's paper focuses on 'large-scale' and 'high-rate' online prediction problems, where parallel and distributed computing is critical to providing a realtime service. They define precisely the problem to which they want to apply online stochastic prediction methods. There is a stream of inputs z1, z2, ..., where each zi is sampled independently from a fixed unknown distribution over a sample space Z. Before observing each zi, a point wi is predicted from a set W. After making the prediction wi, zi is observed and a loss f(wi, zi) is suffered, where f is a predefined loss function. Then zi is used to improve the prediction mechanism for the future, e.g. using a stochastic gradient method. The goal is to accumulate the smallest possible loss as we process the sequence of inputs. The quality of predictions is measured using the notion of regret, which is merely the sum of the differences between the loss on each prediction and the loss of the fixed predictor w* that is optimal with respect to the underlying distribution. This regret is a random variable, since it relies on the stochastic inputs zi, each of which is sampled independently from a fixed unknown distribution. Hence we are free to apply Markov chains wherever appropriate. Moreover, the expected regret is bounded first for simplicity, and then these results are used to obtain high-probability bounds on the actual regret. Also, this discussion is restricted to convex prediction problems, those with a global minimum such as the ones we have discussed earlier. For example, we get a convex surface when we take a positive definite matrix A for use in the linear equation Ax = b.
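Written out, the regret after m inputs in this setting is

    Regret(m) = sum from i = 1 to m of [ f(wi, zi) - f(w*, zi) ]

where w* is the minimizer of the expected loss E[f(w, z)] over w in W.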
The distributed computing system in this case is modeled as a set of k nodes, each of which is an independent processor, and a network that enables the nodes to communicate with each other. Each node receives an incoming stream of examples from an outside source, so the workload is effectively split across the nodes, and each node performs the same operations under the algorithm.
The performance of the algorithm can be bounded in the distributed environment. On one hand, an ideal solution is to run a serial algorithm on a single processor that is k times faster than a single node. This solution is optimal because there is no communication and no partitioning of data and aggregation. Besides, most distributed algorithms can work on a single processor. For serial algorithms, there is another paper that already bounds the optimal regret that can be achieved by a gradient-based serial algorithm on an arbitrary convex loss, and this bound is of the order of the square root of m. On the other hand, a trivial solution is to run the serial algorithm independently on each of the k nodes. This does not require any communication and can be called the no-communication algorithm. The only trouble is that it scales poorly with the network size k. In fact, it can be shown to be a factor of sqrt(k) worse than the ideal solution: if each node processes m/k inputs, the expected regret per node is of the order of sqrt(m/k).
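As a quick check of that factor, summing the per-node regret over the k nodes of the no-communication algorithm gives

    k * sqrt(m/k) = sqrt(k * m) = sqrt(k) * sqrt(m)

which is indeed a factor of sqrt(k) larger than the ideal serial bound of order sqrt(m).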

Tuesday, December 1, 2015

Today we review the paper 'Optimal Distributed Online Prediction Using Mini-Batches' by Dekel, Shamir, Xiao et al. This paper mentions that online prediction methods are typically presented as serial algorithms running on a single processor. But to keep up with increasing data sets, we need to parallelize them. In this context the authors present a distributed mini-batch algorithm - a method that converts many serial gradient-based online prediction algorithms into distributed algorithms. In addition, this method takes into account communication latencies between nodes in a distributed environment. In fact, they show a linear speedup of distributed stochastic optimization on multiple processors. They also mention the advantages of their approach on a web-scale prediction problem.
We will discuss this, but let us first take a short break for an interview question.
Given a series of letters and another, smaller set, find the smallest window in the larger set that contains all the elements of the smaller one.
One way to do this would be to slide a window over the larger series: expand it to the right until it covers all elements of the smaller set, then shrink it from the left while it still covers them, keeping track of the minimum length seen, as in the sketch below.
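A minimal sliding-window sketch along those lines (assuming, for simplicity, that the smaller set has no duplicate letters; per-letter counts would generalize it):

using System;
using System.Collections.Generic;

static int SmallestWindow(string haystack, string needle)
{
    var needed = new HashSet<char>(needle);
    var counts = new Dictionary<char, int>();
    int covered = 0, best = int.MaxValue, left = 0;
    for (int right = 0; right < haystack.Length; right++)
    {
        char c = haystack[right];
        if (!needed.Contains(c)) continue;
        counts[c] = counts.ContainsKey(c) ? counts[c] + 1 : 1;
        if (counts[c] == 1) covered++;
        // Shrink from the left while every needle letter stays covered.
        while (covered == needed.Count)
        {
            best = Math.Min(best, right - left + 1);
            char l = haystack[left++];
            if (needed.Contains(l) && --counts[l] == 0) covered--;
        }
    }
    return best == int.MaxValue ? -1 : best; // -1: no window covers needle
}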

Monday, November 30, 2015

#codingexercise
From a stream of numbers, get the N smallest, assuming no duplicates:
SortedList<int, int> GetMinStream(BinaryReader reader, int N)
{
    // Keeps the N smallest values seen so far.
    var ret = new SortedList<int, int>();
    ret.Capacity = N + 1;
    while (reader.PeekChar() != -1)
    {
        int val = reader.ReadInt32();
        int count = ret.Count;
        if (count < N)
        {
            ret.Add(val, val);
        }
        else
        {
            int max = ret.Keys[count - 1]; // largest of the N kept so far
            if (val < max)
            {
                ret.Add(val, val);
                ret.RemoveAt(ret.Count - 1); // evict the largest to stay at N
            }
        }
    }
    return ret;
}

For medians, we can maintain two heaps: a max-heap of the lower half and a min-heap of the upper half, so the median is always available at the tops.
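A minimal running-median sketch along those lines, using .NET's PriorityQueue (available in .NET 6 and later; earlier frameworks would need a hand-rolled heap; the max-heap is simulated by negating priorities):

using System.Collections.Generic;

class RunningMedian
{
    // Lower half with its maximum on top (negated priorities),
    // upper half with its minimum on top.
    private readonly PriorityQueue<int, int> lower = new PriorityQueue<int, int>();
    private readonly PriorityQueue<int, int> upper = new PriorityQueue<int, int>();

    public void Add(int val)
    {
        if (lower.Count == 0 || val <= lower.Peek())
            lower.Enqueue(val, -val);
        else
            upper.Enqueue(val, val);
        // Rebalance so the halves differ in size by at most one.
        if (lower.Count > upper.Count + 1)
        {
            int v = lower.Dequeue();
            upper.Enqueue(v, v);
        }
        else if (upper.Count > lower.Count + 1)
        {
            int v = upper.Dequeue();
            lower.Enqueue(v, -v);
        }
    }

    public double Median() // caller ensures at least one value was added
    {
        if (lower.Count == upper.Count)
            return (lower.Peek() + upper.Peek()) / 2.0;
        return lower.Count > upper.Count ? lower.Peek() : upper.Peek();
    }
}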

Sunday, November 29, 2015

Today we try to improve the stochastic gradient method. We can recall that this is already better than the batch gradient method because it provides earlier solutions than evaluating the entire data. The error function could be steepest descent or sum of squares, and in either case we apply the correction iteratively, but after every sample in the case of the stochastic gradient method, as opposed to after all the samples as in the case of the batch gradient method. Since the updates to common parameters such as the residual happen after every sample, the stochastic gradient method is not conducive to parallelization, as opposed to the batch gradient method, which is in summation form. Each update requires holding a lock on the common parameters, which in effect serializes the parallel processing. In mini-batch gradient descent, we try to get the best of both worlds by reducing the number of times we need to take the lock and distributing the data over parallel processors. However, parallelization is not restricted to exploiting summation forms by distributing data to each processor directly, or indirectly by distributing the iterations over the processors.
We instead use a technique of leap-frogging over the data by using some statistics/metadata/summary information over the data. The summaries are generated as and when the data becomes available, while at the same time we decouple the data processing from the algorithm that finds the solution. For example, we can take the data and calculate the step-lengths in the search direction as a data processing summary for any or all data as and when it becomes available. But instead of updating the residual after one or several batches, we provide the residual updates only when we want to find the solution, and this is independent of the data processing. The new residual is used in the data processing the next time, which is what tightly integrates the algorithm with the data. But if we could come up with an algorithm that works off the data summaries instead of the raw data, then we decouple the data processing push operation from the solution finding pull operation of the algorithm. One way to do this would be to generate different summaries by 'guessing' different residuals for the data and then have the algorithm use the summary closest to the newly computed residual. The trick is to have the guesses closely align with those that cause convergence (something we do currently by taking feedback from the previous loop and keeping the loop iterations smaller). Another way to do this would be to perform the entire algorithm as-is on the local data of each processor to find the local minima of the model and then find the global minima over all the data. Then, if we have the luxury of iterating globally, we can update the residual with the weighted averages of the residuals found so far.
The key idea here is to evaluate the data once to find the optimal solution for that data, as with the conventional algorithm. But when more data becomes available, we combine the previous solution/summary with the current data to find the new solution/summary. This is what we call leap-frogging. In other words, while both stochastic and batch gradient methods require data to be processed again, we require data to be processed incrementally such that data once processed doesn't need to be processed again. The equation therefore changes from applying a newer residual to the same data to one that applies corrections on the current data based on previously computed partial results. This way the processing of newer data can be done in parallel while the computation of the result is done serially and one time. One variation of SGD that comes close along these lines is the use of momentum learning, where a factor of the previous gradient is added to the weight update for the current iteration.
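A minimal, illustrative sketch of that momentum update (the names velocity, learningRate and mu are mine, not from a particular library):

static void MomentumStep(double[] theta, double[] velocity,
                         double[] gradient, double learningRate, double mu)
{
    for (int j = 0; j < theta.Length; j++)
    {
        // Carry forward a factor mu of the previous update, add the
        // fresh gradient step, then apply the result to the weights.
        velocity[j] = mu * velocity[j] - learningRate * gradient[j];
        theta[j] += velocity[j];
    }
}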

Saturday, November 28, 2015

The stochastic gradient method mentioned in the earlier post is different from the batch gradient method in the following ways:
1) The batch gradient method operates on all the data before each update, whereas the stochastic gradient method updates after reading one training example.
2) If the number of training samples is very large, then the batch gradient method can be very slow. On the other hand, the stochastic gradient method will be faster because it improves from the first training example itself.
3) The error function is not as well minimized in SGD as it is in GD. However, it does converge, though not to the minimum, and oscillates around the optimal value. In fact, it has been found to be empirically better than GD.
4) While the batch gradient method performs the following:
Repeat until convergence {
    Theta-j = Theta-j + correction (for every j)
}
the stochastic gradient method performs the following (see the sketch after this list):
Loop {
    for i = 1 to m {
        Theta-j = Theta-j + correction (for every j)
    }
}
5) The batch gradient method finds the global minimum and, because it evaluates the entire data, is not susceptible to local minima should they occur.
6) The latter method is called stochastic because the approximate direction computed in every step can be thought of as a random variable of a stochastic process.
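To make the contrast in items 1) and 4) concrete, here is a minimal LMS-style sketch of the two update rules (illustrative only; x holds the feature vectors, y the targets, alpha the learning rate):

static void BatchGradientStep(double[][] x, double[] y, double[] theta, double alpha)
{
    // One update per full pass: sum the correction over ALL samples first.
    var correction = new double[theta.Length];
    for (int i = 0; i < x.Length; i++)
    {
        double error = y[i] - Dot(theta, x[i]);
        for (int j = 0; j < theta.Length; j++)
            correction[j] += alpha * error * x[i][j];
    }
    for (int j = 0; j < theta.Length; j++)
        theta[j] += correction[j];
}

static void StochasticGradientPass(double[][] x, double[] y, double[] theta, double alpha)
{
    // One update per sample: theta improves from the first example onward.
    for (int i = 0; i < x.Length; i++)
    {
        double error = y[i] - Dot(theta, x[i]);
        for (int j = 0; j < theta.Length; j++)
            theta[j] += alpha * error * x[i][j];
    }
}

static double Dot(double[] a, double[] b)
{
    double sum = 0;
    for (int j = 0; j < a.Length; j++) sum += a[j] * b[j];
    return sum;
}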

Friday, November 27, 2015


Thoughts on parallelizing Steepest Descent algorithm.

The Steepest Descent method is used to solve equations of the form Ax = b. The method moves towards the solution by sliding down to the bottom of a paraboloid. Assume the data defines a function f which is a quadratic form; it is minimized by the solution to Ax = b. f(x) is a curved surface for positive-definite A. The paraboloid is the intersection of n hyperplanes, each having dimension n-1. In our case we have two, because we have two axis coordinates for the solution. The bottommost point of the paraboloid (gradient = 0) is our target. It happens to be the point where the directional derivative of the function f on x is zero. We take a series of steps until we are satisfied that we are close enough to the solution x. When we take a step, we choose the direction in which f decreases most quickly. We keep track of an 'error' e - a vector that tells us how far we are from the solution. A measure called the residual gives us the direction of steepest descent. A 'line search' is a procedure that chooses a 'step-length' alpha to minimize f along a line; f is minimized along the search line exactly when the new residual is orthogonal to the current search direction. Therefore the algorithm follows a zig-zag path to the minimum of f.
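As a brief worked step (standard for this method): with residual r = b - Ax, setting the derivative of f(x + alpha.r) with respect to alpha to zero gives

    alpha = (r-transposed.r) / (r-transposed.A.r)

which matches the delta / (r-transposed.q) computed in the algorithm below, since q = A.r and delta = r-transposed.r.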

Given the inputs A, b, a starting value x, a maximum number of iterations i-max, and an error tolerance epsilon < 1, the traditional algorithm is described as follows:

The steepest descent algorithm proceeds this way:

set i to 0
set r to b - Ax
set delta to r-transposed.r
initialize delta-0 to delta
while i < i-max and delta > epsilon^2 * delta-0 do:
    q = A.r
    alpha = delta / (r-transposed.q)
    x = x + alpha.r
    if i is divisible by 50
        r = b - Ax
    else
        r = r - alpha.q
    delta = r-transposed.r
    i = i + 1
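For concreteness, a minimal serial C# sketch of the above (illustrative only; dense arrays, no parallelism; the helpers MatVec, Residual and Dot are mine):

static double[] SteepestDescent(double[][] A, double[] b, double[] x,
                                int iMax, double epsilon)
{
    int n = b.Length;
    double[] r = Residual(A, b, x);
    double delta = Dot(r, r);
    double delta0 = delta;
    for (int i = 0; i < iMax && delta > epsilon * epsilon * delta0; i++)
    {
        double[] q = MatVec(A, r);
        double alpha = delta / Dot(r, q);
        for (int j = 0; j < n; j++) x[j] += alpha * r[j];
        if (i % 50 == 0)
            r = Residual(A, b, x);  // exact residual, per the pseudocode
        else
            for (int j = 0; j < n; j++) r[j] -= alpha * q[j];
        delta = Dot(r, r);
    }
    return x;
}

static double[] MatVec(double[][] A, double[] v)
{
    var result = new double[A.Length];
    for (int i = 0; i < A.Length; i++) result[i] = Dot(A[i], v);
    return result;
}

static double[] Residual(double[][] A, double[] b, double[] x)
{
    double[] ax = MatVec(A, x);
    var r = new double[b.Length];
    for (int i = 0; i < b.Length; i++) r[i] = b[i] - ax[i];
    return r;
}

static double Dot(double[] a, double[] b)
{
    double s = 0;
    for (int j = 0; j < a.Length; j++) s += a[j] * b[j];
    return s;
}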

When parallelizing the algorithm, mappers could independently calculate the updates to the target along each of the components of the vector by finding the residual, the step-length and the error as above, call these the summary, and then the reducer sums these calculations. Note that the operations are to be performed for all data points and their ordering doesn't matter, so partitioning the data into chunks and performing each step in parallel on each chunk is trivial and yields a speedup. The reducer can then aggregate the partial results from each of the mappers and choose the next residual based on the error tolerance.
The key idea here is to express the computation in summation form. The inner product of two vectors x and y, written as x-transposed.y, represents the scalar sum from i = 1 to n of xi.yi. Note that x-transposed.A.x is also a scalar.
In the algorithm above, r-transposed.q and r-transposed.r are both summation forms. Note that r, b and x are vectors represented by n x 1 matrices, while A is an n x n matrix. The summation forms yield scalars, which are used to compute delta and alpha.
The solution x is updated iteratively by step-length alpha in the search direction r.
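As a small illustration of the map-reduce flavor of those summation forms, a sketch of a chunked dot product using .NET's Parallel.For (the integer partition arithmetic covers the whole range; the chunk count is arbitrary):

using System.Linq;
using System.Threading.Tasks;

static double ParallelDot(double[] x, double[] y, int chunks)
{
    var partials = new double[chunks];
    Parallel.For(0, chunks, c =>
    {
        int start = c * x.Length / chunks;
        int end = (c + 1) * x.Length / chunks;
        double s = 0;
        for (int j = start; j < end; j++) s += x[j] * y[j];
        partials[c] = s;       // "map": per-chunk partial sum
    });
    return partials.Sum();     // "reduce": aggregate into one scalar
}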

The above algorithm does its computations somewhat like mini-batched gradient descent, where updates become available after processing k training samples, with 1 < k < n; in this case k = 50.
The advantage of processing the entire data set is that we get the same optimum solution after the same number of iterations, and hence it is deterministic. This is improved to having a better answer sooner by doing it in batches of 50.
However, we could arrive at an approximate answer from the get-go by training on one sample at a time and doing more iterations, using the previous computations in the adjustments.