Friday, November 20, 2015

In continuation of our discussion on applying the parallelization technique to more machine learning algorithms, let us discuss a few more. The technique enables incremental processing of more and more data as it becomes available.
The summation form lends itself to this technique in the examples we saw because the computation of the algorithm does not change: while the data may not all be divided and computed in parallel, it is in fact partitioned and computed the same way as the data becomes available, one chunk after the other.
In these cases we reuse the previous computation of the data and do not have to revisit that data again. We keep summarized information which we adjust based on new data; here the summary from the first data set is simply the result of the computation over the first data set.
Now we can generalize this technique so that it applies to all summation forms.
In the summation form, the data is already chunked into subgroups and the computation is performed on each chunk; the reducer merely aggregates the results. In other words, the data does not overlap, so it is easy to aggregate the results. In a streaming mode, the same computations can be done as and when the non-overlapping data arrives. The summary from each data set then needs to be combined, and since the summaries represent non-overlapping data they are easy to combine: each summary contributes independently to the overall picture, and the reducer merely applies smoothing to the aggregated results.
Now let us take another example: k-means. Here the distance between the data points and the centroid, the Euclidean distance, can be calculated on every new dataset as it arrives in just the same way as for any subgroup or the entire dataset. Moreover, combining centroids is also possible because the data points that are close to either centroid will be roughly equidistant from the new midpoint between the previous and the current centroid of the same class. Thus the update to the previous centroid, given the new data, is also straightforward. Recall that the original k-means had an iteration step that assigns the data points to the nearest cluster. With the technique above we are already prepared for iteration, and in fact we do it incrementally. While the original iterations refined the centroids by reassigning points, we improve the centroids with newer data, giving better answers at each stage and eliminating reprocessing over the entire data. A minimal sketch of this incremental centroid update is shown below.
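The following is a minimal sketch of the incremental centroid update described above, assuming the points in a new batch have already been assigned to this cluster. The class name and the count-weighted merge rule are illustrative assumptions; the point is only that a centroid together with its point count is a summary that can absorb a new batch without revisiting old data.
// Illustrative sketch: a centroid summary that is updated incrementally.
// Weighting by point counts is an assumed merge rule, not a prescribed one.
public class CentroidSummary
{
    public double[] Center;   // current centroid
    public long Count;        // number of points summarized so far

    public CentroidSummary(int dimensions)
    {
        Center = new double[dimensions];
        Count = 0;
    }

    // Fold a new batch of points (assigned to this cluster) into the centroid
    // without touching any of the previously seen data.
    public void Update(double[][] newPoints)
    {
        if (newPoints.Length == 0) return;
        long newCount = Count + newPoints.Length;
        for (int d = 0; d < Center.Length; d++)
        {
            double batchSum = 0;
            foreach (var p in newPoints) batchSum += p[d];
            // count-weighted average of the old centroid and the new batch
            Center[d] = (Center[d] * Count + batchSum) / newCount;
        }
        Count = newCount;
    }
}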

Thursday, November 19, 2015

#coding interview question
1) Write a program for run-length encoding
for example:  AAABBCCCCC
is written as 3A2B5C
static string RLE(string input)
{
    if (string.IsNullOrEmpty(input)) return input;
    int count = 0;
    char prevChar = '\0';
    var output = new StringBuilder();
    for (int i = 0; i < input.Length; i++)
    {
        if (input[i] != prevChar)
        {
            // flush the previous run before starting a new one
            if (prevChar != '\0')
            {
                output.Append(count);
                output.Append(prevChar);
            }
            count = 1;
            prevChar = input[i];
        }
        else
        {
            count += 1;
        }
    } // end for
    // flush the last run
    output.Append(count);
    output.Append(prevChar);
    return output.ToString();
}
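Using the sample input from the problem statement:
// prints 3A2B5C
Console.WriteLine(RLE("AAABBCCCCC"));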

2) An in-place encoding is as follows:
<work-in-progress>
// Work-in-progress sketch: assumes helper methods such as SubString, Resize and
// TrimLeft on StringBuilder that are not part of the standard API.
void RLE(ref StringBuilder input)
{
if (string.IsNullOrEmpty(input.ToString())) return;

int count = 0;
int prev = -1; // previous index
int prevCount = 0;
char prevChar = '\0';
string readOnlyCopy = input.SubString(0, 1); // just the first byte read, to get started
int countLen = count.ToString().Length;
int i = 0; // index to span reading the length of the string
bool readAhead = false;
while (i < input.Length && readOnlyCopy.Length > 0)
{
  if (prev == -1 || input[i] != input[prev]){
       if (prev != -1) {
            prevChar = input[prev];
            prevCount = count;
            countLen = count.ToString().Length;  // actually we need fixed-size packing, say 4 bytes for the number and one byte for the char, but the processing is similar
            if (countLen >= count){
             readOnlyCopy += input.SubString(i + 1, countLen); // to include count and char
             readAhead = true;
            }

            int w = prev;
            if (prev == -1) w++;
            if (countLen > input.Length - i) input.Resize(input.Length * 2);
            for (int j = 0; j < countLen; j++){
                   input[w] = count.ToString()[j];
                   w++;
             }
             input[w] = prevChar;
             // i = i + countLen;
             readOnlyCopy.TrimLeft(0, countLen);

        }
      count = 1;
      prev = i;
      prevChar = input[prev];
      countLen = 1;
}else{
      count += 1;
} // end if
i++;
if (readAhead == false)
      readOnlyCopy += input.SubString(i, 1); // next byte read
else
      readAhead = false;
} // end while
// one last write for the trailing run
            int wLast = prev;
            if (wLast + count.ToString().Length + 2 > input.Length) input.Resize(input.Length * 2);
            countLen = count.ToString().Length;
            for (int j = 0; j < countLen; j++){
                  input[wLast] = count.ToString()[j];
                  wLast++;
            }
            input[wLast] = prevChar;
            i = i + countLen;
            wLast++;
            input[wLast] = '\0';
return;
}

Wednesday, November 18, 2015

In continuation of our discussion on how to apply linear regression in an iterative manner, building it as and when the data becomes available, let us now look at the GLMNet algorithm.
As we noted earlier, the GLMNet algorithm can work when we have less data because it gives us methods to
1) determine if we are overfitting and
2) hedge our answers if we are overfitting
With ordinary least squares regression, the objective is to minimize the sum of squared errors between the actual values and our linear approximation. Coefficient shrinkage methods add a penalty on the coefficients.
This yields a family of regression solutions, from completely insensitive to the input data to unconstrained ordinary least squares.
First we build a regularized regression. We want to eliminate over-fitting because doing so tunes the model and gives the best performance on held-out data. In order to avoid this overfitting, we constrain the regression with penalties such as coefficient shrinkage. Two types of coefficient shrinkage are most frequently used: the sum of squared coefficients and the sum of absolute values of the coefficients. The GLMNet algorithm incorporates the ElasticNet penalty, which is a weighted combination of the sum of squares and the sum of absolute values. A small sketch of this penalty follows.
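As a minimal sketch, the ElasticNet penalty on a coefficient vector can be written as a weighted mix of the two shrinkage terms. The parameter names lambda and alpha below are illustrative assumptions: alpha = 1 gives the pure absolute-value (lasso) penalty and alpha = 0 the pure squared (ridge) penalty.
// Illustrative ElasticNet penalty: lambda scales the overall shrinkage,
// alpha mixes the absolute-value and squared terms (names are assumptions).
static double ElasticNetPenalty(double[] coefficients, double lambda, double alpha)
{
    double sumAbs = 0, sumSq = 0;
    foreach (var c in coefficients)
    {
        sumAbs += Math.Abs(c);
        sumSq += c * c;
    }
    return lambda * (alpha * sumAbs + (1 - alpha) * 0.5 * sumSq);
}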
We now look at how to use this in an incremental manner.
First, the existing data already has a regularized regression and a fitted line, shaped with coefficient shrinkage. Either the new data affects the existing regression, or a new regression is created for the new data and the two fitted lines are replaced by a common one.
Both methods require a slightly different computation from the first iteration because we adjust existing lines instead of fitting a new one. Again we can divide this into non-overlapping data and overlapping data. For the non-overlapping data, a new regression is computed the same way as in the first iteration. For the overlapping data, we compute the shrinkage directly against the existing line. The shrinkage here directly gives the correction to be applied. Although shrinkage was originally used to eliminate overfitting, it also seems to give an indication of the correction to apply to the line: the higher the number of data points and the larger the shrinkage, the more correction is applied. In this way we go from a line that is completely insensitive to the data to one that fits it best. For convenience, we start from the existing line of the previous regression. This is the slightly different computation we introduce.
If we don't want to introduce a different computation as above, we can determine a GLMNet regression for the new points exclusively and then proceed to combine the regressions from the new and existing datasets.
Combining regressions is a computation over and on top of the GLMNet computation for any dataset. While GLMNet works well for a small dataset, this step enables the algorithm to be applied repeatedly as more and more data becomes available. It will also be required when we combine regression lines, such as for non-overlapping data. A hedged sketch of one way to combine two fitted lines follows.
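One simple way to combine two fitted lines, assuming each comes with the number of points it was fit on, is a count-weighted average of their coefficients. This combining rule is an illustrative assumption for the step described above, not part of GLMNet itself.
// Illustrative combination of two fitted lines (slope and intercept), weighted by
// how many points each was fit on. The weighting rule is an assumption, not GLMNet's.
static void CombineFits(double slopeA, double interceptA, long countA,
                        double slopeB, double interceptB, long countB,
                        out double slope, out double intercept)
{
    double total = countA + countB;
    slope = (slopeA * countA + slopeB * countB) / total;
    intercept = (interceptA * countA + interceptB * countB) / total;
}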

Tuesday, November 17, 2015

Previously we discussed computations that do not span the entire data all at once, albeit in parallel, but instead compute as and when data is made available. We took the example of computing the mean this way.
Let us take an example of the summation form to compute the mean.
For example, if we take 3, 1, 5, 7, 9, the mean is 25/5 = 5.
If we were then to add 10, 15, 17, 19, 24, the mean would become (85 + 25)/10 = 11.
We could also arrive at this by computing 85/5 = 17 for the new batch and taking (5 + 17)/2 = 11 for two equal-sized ranges; otherwise we weight the two means by their element counts. A small sketch of this running mean is shown below.
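As a minimal sketch, carrying a running sum and count is enough to fold in each new batch without revisiting earlier data; the class and method names here are illustrative.
// Illustrative running mean: the (sum, count) pair is the summary that gets
// updated as each new batch of data arrives.
public class RunningMean
{
    private double sum = 0;
    private long count = 0;

    public void AddBatch(double[] batch)
    {
        foreach (var x in batch) { sum += x; count++; }
    }

    public double Mean { get { return sum / count; } }
}

// Usage with the numbers from the example above:
// var m = new RunningMean();
// m.AddBatch(new double[] { 3, 1, 5, 7, 9 });      // mean is 5
// m.AddBatch(new double[] { 10, 15, 17, 19, 24 }); // mean is now 11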
In this case we reuse the previous computation of the data and do not have to revisit that data again. We keep summarized information which we adjust based on new data; here the summary from the first data set is simply the result of the computation over the first data set.
Now we can generalize this technique so that it applies to all summation forms.
In the summation form, the data is already chunked into subgroups and the computation is performed on each chunk; the reducer merely aggregates the results. In other words, the data does not overlap, so it is easy to aggregate the results. In a streaming mode, the same computations can be done as and when the non-overlapping data arrives. The summary from each data set then needs to be combined, and since the summaries represent non-overlapping data they are easy to combine: each summary contributes independently to the overall picture, and the reducer merely applies smoothing to the aggregated results.
For example, in a linear regression, the lines may be drawn in chunks across the different data as they arrive, and the reducer merely has to sort the summaries and apply a generalized line across the data. In this case it can take the beginning and end points of the pieces to draw the line, or find one that matches most of the slopes. Thus it is possible to aggregate results on non-overlapping data.
For the overlapping data, however, the summaries are not trivial to combine because the previous summary may be completely overwritten. Instead, we look at the data points that are new in the current summary and calculate their effect on the previous slope. Since the new data points may be scattered over and beyond the previous data points, it should be easy to determine whether they affect the previous slope uniformly or non-uniformly. In the uniform case there may be no need for a correction to the gradient, and at most a displacement of the regression line is required. In the non-uniform case, depending on which end has more of the newer data points and assuming all points contribute equally, a clockwise or anticlockwise rotation may be applied to correct the previous slope. Thus we have an algorithm where we use the previous summary to create the current summary.
Furthermore, if there are two data point blocks that have a partial overlap, then they are split into two non-overlapping regions and one overlapping region. The regression lines for the non-overlapping regions are preserved. The regression lines from the two contributing blocks in the overlap can merely be replaced by a line that is equidistant from both and passes through their intersection. In other words, summaries are additive too, as in the sketch below.
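One way regression summaries become literally additive is to carry the running sums (count, sum of x, sum of y, sum of xy, sum of x squared) per block; merging two blocks then just adds the sums, and the line is recomputed from the merged totals. This is a minimal sketch under that assumption, not the exact merging rule described above.
// Illustrative additive summary for simple least squares: two blocks are merged
// by adding their running sums, and the fitted line is recomputed from the totals.
public class LineSummary
{
    public long N;
    public double SumX, SumY, SumXY, SumXX;

    public void Add(double x, double y)
    {
        N++; SumX += x; SumY += y; SumXY += x * y; SumXX += x * x;
    }

    public static LineSummary Merge(LineSummary a, LineSummary b)
    {
        return new LineSummary
        {
            N = a.N + b.N,
            SumX = a.SumX + b.SumX,
            SumY = a.SumY + b.SumY,
            SumXY = a.SumXY + b.SumXY,
            SumXX = a.SumXX + b.SumXX
        };
    }

    public double Slope
    {
        get { return (N * SumXY - SumX * SumY) / (N * SumXX - SumX * SumX); }
    }

    public double Intercept
    {
        get { return (SumY - Slope * SumX) / N; }
    }
}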
Since we already mentioned that summation forms are all amenable to this approach, and now that the summaries are additive too, we have expanded the kinds of algorithms that can be written this way.

Monday, November 16, 2015

#codingquestion
A list of stations and the distances between them are given; find the all-pairs shortest distances.
Solution:
A straightforward implementation could be the Floyd-Warshall algorithm. This algorithm computes the shortest path weights in a bottom-up manner. It exploits the relationship between a pair of vertices and the intermediary vertices that shortest paths pass through. If there is no intermediary vertex, then such a path has at most one edge and the weight of that edge is the minimum. Otherwise, the minimum weight is the minimum of the existing path from i to j and the path from i to k plus the path from k to j. Thus the algorithm iterates over each intermediary vertex k for every pair (i, j) of the given N*N matrix to compute the shortest path weights, as in the pseudocode and the C# sketch below.
Algorithm FloydWarshall(W[1..n, 1..n])
{
   D-0 is initialized with the weight matrix W; // D-0[i, j] is the direct edge weight from i to j (infinity if no edge, 0 on the diagonal)
   for k = 1 to n do
      for i = 1 to n do
        for j = 1 to n do
            D-k[i, j] = min( D-(k-1)[i, j],
                             D-(k-1)[i, k] + D-(k-1)[k, j] )
  return D-(n)

}
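A C# rendering of the same idea might look like the following; the method name GetAllPairsShortestDistances and the use of int.MaxValue to mean "no edge" are assumptions for illustration.
// Illustrative Floyd-Warshall over an adjacency matrix where int.MaxValue means "no edge".
static int[,] GetAllPairsShortestDistances(int[,] weights, int numVertex)
{
    var dist = new int[numVertex, numVertex];
    for (int i = 0; i < numVertex; i++)
        for (int j = 0; j < numVertex; j++)
            dist[i, j] = weights[i, j];

    for (int k = 0; k < numVertex; k++)
        for (int i = 0; i < numVertex; i++)
            for (int j = 0; j < numVertex; j++)
            {
                // skip if either leg is unreachable, to avoid overflow
                if (dist[i, k] == int.MaxValue || dist[k, j] == int.MaxValue) continue;
                if (dist[i, j] > dist[i, k] + dist[k, j])
                    dist[i, j] = dist[i, k] + dist[k, j];
            }
    return dist;
}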
Another method could be to use the Bellman-Ford algorithm to find the shortest distances from a given source to every other vertex (running it from each source for all pairs), and also to enumerate all paths.
      public static bool GetShortestPath(int[,] graph, int numVertex, int start, ref List<int> distances, ref List<int> parents)
        {
            // initialize Single Source
            for (int i = 0; i < numVertex; i++)
            {
                distances[i] = int.MaxValue;
                parents[i] = -1;
            }

            distances[start] = 0;

            var allEdges = GetAllEdges(graph, numVertex);
            for (int k = 0; k < numVertex - 1; k++)
            {
                for (int i = 0; i < allEdges.Count; i++)
                {
                    // relax edge Item1 -> Item2 with weight Item3; keep int.MaxValue when the source end is still unreachable
                    int sum = distances[allEdges[i].Item1] == int.MaxValue ?
                        distances[allEdges[i].Item1] :
                        distances[allEdges[i].Item1] + allEdges[i].Item3;
                    if (distances[allEdges[i].Item2] > sum)
                    {
                        distances[allEdges[i].Item2] = sum;
                        parents[allEdges[i].Item2] = allEdges[i].Item1;
                    }
                }
            }

            // one more pass over the edges: any further improvement implies a negative-weight cycle
            for (int i = 0; i < allEdges.Count; i++)
            {
                if (distances[allEdges[i].Item1] != int.MaxValue &&
                    distances[allEdges[i].Item2] > distances[allEdges[i].Item1] + allEdges[i].Item3)
                {
                    return false; // negative-weight cycle exists
                }
            }

            return true;
        }
        // Enumerate paths from source to destination by depth-first search over outbound
        // edges, bounded by a path-length threshold to keep the recursion finite.
        public static void GetAllPaths(int[,] graph, int numVertex, int source, int destination, int threshold, ref List<int> candidatePath, ref List<int> candidateDist, ref List<List<int>> pathList, ref List<List<int>> distanceList)
        {
            if (candidatePath.Count > threshold) return;
         
            var path = new List<int>();
            var distances = new List<int>();
            GetOutboundEdges(graph, numVertex, source, ref path, ref distances);
            if (path.Contains(destination) && (candidatePath.Count == 0  || (candidatePath.Count > 0 && candidatePath.Last() != destination)))
            {
                candidatePath.Add(destination);
                candidateDist.Add(distances[path.IndexOf(destination)]);
                if (pathList.Contains(candidatePath) == false)
                    pathList.Add(new List<int>(candidatePath));
                if (distanceList.Contains(candidateDist) == false)
                    distanceList.Add(new List<int>(candidateDist));
                candidatePath.RemoveAt(candidatePath.Count - 1);
                candidateDist.RemoveAt(candidateDist.Count - 1);
            }

            for (int i = 1; i < path.Count; i++)
            {
                // if (i != source)
                {
                    candidatePath.Add(path[i]);
                    candidateDist.Add(distances[i]);
                    GetAllPaths(graph, numVertex, path[i], destination, threshold, ref candidatePath, ref candidateDist, ref pathList, ref distanceList);
                    candidatePath.RemoveAt(candidatePath.Count - 1);
                    candidateDist.RemoveAt(candidateDist.Count - 1);
                }
            }

        }