Wednesday, April 9, 2014

Today we talk about a Splunk app that reads installation logs on a machine. The app monitors and targets installation logs; optionally, it could detect when an application is being installed and turn on verbose logging. The app itself can be packaged with WixSharp. It builds on Windows Installer technologies and is to be implemented in C#.
In addition, the application will send all classes of messages from these files to the Splunk server as events to be indexed, so it could be exposed as a modular input for the machine.
As with the Octopus tool, this app could read from multiple machines.
The Octopus tool is known for facilitating MSI installations on datacenter machines because it can do so consistently for any subset of the machines in the data center, repeating the same process on each machine in turn.
Octopus also has a web interface, which makes it convenient to choose the machines it deploys to. For example, we can choose the package we want to deploy and the target machines, and Octopus deploys to all of them. Should the configuration need to change per machine, settings can be parameterized and passed to the application. This is very convenient.
The application could follow the same model as Octopus in the sense that it reads from multiple machines and collects the data together with the hostname. The hostname has to be granular enough to resolve to a physical machine in, say, a cluster. This granularity matters because we want to associate the logs with the machines.
The application could also look for different levels of detail: whether there were errors in the logs, registries altered, files touched, settings changed, custom actions run, font files installed, and so on. Many installations leave behind log files, and these are handier than the installation information retrieved by tools like msiinv and msiinfo. Those tools can be run against any target machine to get the current state of all the applications and their installations, which is also valuable input for feeding events into Splunk. The tools can be run at periodic intervals or on demand, although the former is recommended, and only the deltas may need to be fed into Splunk.
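For the log-scanning piece, here is a minimal sketch of the classification step. The pattern strings and sample lines are illustrative stand-ins, not the actual Windows Installer message formats, and the category names are my own:

```python
import re

# Hypothetical patterns for MSI-style log lines; the real strings vary
# by installer version, so treat these as illustrative only.
CATEGORIES = {
    "error": re.compile(r"error", re.IGNORECASE),
    "registry": re.compile(r"WriteRegistryValues|RemoveRegistryValues"),
    "file": re.compile(r"InstallFiles|CopyFile"),
    "custom_action": re.compile(r"CustomAction"),
}

def classify_lines(lines):
    """Bucket log lines by category so each bucket can be forwarded to
    Splunk as a separate class of events."""
    buckets = {name: [] for name in CATEGORIES}
    for line in lines:
        for name, pattern in CATEGORIES.items():
            if pattern.search(line):
                buckets[name].append(line)
    return buckets

sample = [
    "Action start: InstallFiles.",
    "Action start: WriteRegistryValues.",
    "CustomAction MyAction returned actual error code 1603",
]
print(classify_lines(sample))
```

A line can land in more than one bucket (the last sample line is both an error and a custom action), which is fine since Splunk indexes each event independently.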

Tuesday, April 8, 2014

We continue our discussion of the Ford-Fulkerson algorithm today. We talked about improving the efficiency of the basic Ford-Fulkerson algorithm with a data structure corresponding to the directed graph G' = (V, E'), where (u,v) belongs to E' whenever (u,v) or (v,u) belongs to E. Given a flow f on G, the edges of the residual network Gf consist of all edges (u,v) of G' such that c(u,v) - f[u,v] <> 0. The time to find a path in the residual network is now O(V + E') = O(E) if we use depth-first or breadth-first search. This is a significant improvement: each iteration of the while loop now takes O(E) time, making the overall complexity O(E |f*|).
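A minimal sketch of this residual-edge test, with capacities and flows keyed by edge pair (a representation chosen here for brevity); since f[u,v] <= c(u,v), the condition c(u,v) - f[u,v] <> 0 is the same as requiring it to be positive:

```python
def residual_network(capacity, flow):
    """Edge set of the residual network Gf, built over G' (which contains
    both orientations of every edge of G): keep (u, v) whenever
    c(u, v) - f[u, v] > 0."""
    g_prime = set(capacity) | {(v, u) for (u, v) in capacity}
    return {(u, v) for (u, v) in g_prime
            if capacity.get((u, v), 0) - flow.get((u, v), 0) > 0}

# One saturated edge (s, u): only its reversal and the untouched
# edge (u, t) remain usable in Gf.
capacity = {("s", "u"): 2, ("u", "t"): 2}
flow = {("s", "u"): 2, ("u", "s"): -2}
print(residual_network(capacity, flow))
```

Note how the saturated forward edge drops out while its reversal gains residual capacity, which is exactly what lets later iterations undo earlier flow.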
Since the algorithm works in incremental units of flow, we look at whether it scales for large networks. For simple networks, we know that it works efficiently since |f*| is small.
In a large network, the maximum flow can have a value on the order of 10^6, with a million units of flow traversing a path s->u->t while another million traverse the path s->v->t. If the first augmenting path found by the Ford-Fulkerson algorithm is s->u->v->t, then the flow has value 1 after the first iteration. If the second iteration takes the path s->v->u->t, then the flow has value 2. If we choose the former on odd iterations and the latter on even iterations, we perform a total of two million augmentations, increasing the flow value by only one unit each time.
We could improve the bound on the Ford-Fulkerson algorithm if we compute the augmenting path p with a breadth-first search, that is, if the augmenting path is a shortest path from s to t in the residual network where each edge has unit distance (weight). This implementation of the Ford-Fulkerson method is called the Edmonds-Karp algorithm. The Edmonds-Karp algorithm runs in O(V E^2) time, and the analysis depends on the distances to vertices in the residual network Gf.
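A runnable sketch of Edmonds-Karp, using a dict-of-dicts capacity representation of my own choosing; the example is the pathological network from above, where BFS finds the two length-2 paths first and finishes in just two augmentations:

```python
from collections import deque

def edmonds_karp(capacity, s, t):
    """Ford-Fulkerson with breadth-first (shortest) augmenting paths.
    capacity[u][v] is the capacity of edge (u, v); a missing entry means 0."""
    vertices = set(capacity) | {v for u in capacity for v in capacity[u]}
    flow = {u: {v: 0 for v in vertices} for u in vertices}

    def residual(u, v):
        return capacity.get(u, {}).get(v, 0) - flow[u][v]

    value = 0
    while True:
        # BFS for a shortest s-t path in the residual network Gf.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in vertices:
                if v not in parent and residual(u, v) > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return value  # no augmenting path left: f is maximum
        # Collect the path edges by walking parents back from t.
        edges = []
        v = t
        while parent[v] is not None:
            edges.append((parent[v], v))
            v = parent[v]
        cf_p = min(residual(u, v) for u, v in edges)
        for u, v in edges:
            flow[u][v] += cf_p
            flow[v][u] = -flow[u][v]  # skew symmetry
        value += cf_p

# The network from the discussion above: two augmentations suffice.
capacity = {"s": {"u": 10**6, "v": 10**6}, "u": {"v": 1, "t": 10**6},
            "v": {"t": 10**6}}
print(edmonds_karp(capacity, "s", "t"))  # 2000000
```

Because BFS always returns a shortest path, the cross edge (u,v) is never taken while the two direct routes still have residual capacity, which is the whole point of the improvement.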
This improvement comes from a property of shortest-path distances: the shortest-path distance in a residual network increases monotonically with each flow augmentation. We prove this lemma as follows:

Let us take a vertex v belonging to V - {s,t}, where s is the source and t is the sink. We will assume the shortest-path distance from s to v decreases with a flow augmentation and then derive a contradiction. Let f be the flow just before the first such augmentation and f' the flow just afterwards.
Let v be the vertex with the minimum new shortest-path distance among those whose distance was decreased by the augmentation, so that the shortest-path distance of v under the new flow is less than under the old.
Let p = s ~> u -> v be a shortest path from s to v in Gf', so that (u,v) belongs to Ef' and
    dist(s,u) under f' = dist(s,v) under f' - 1.
Because of how we chose v, we know that u's distance did not decrease:
    dist(s,u) under f' >= dist(s,u) under f.
We now claim that the edge (u,v) did not belong to the previous Ef. For if (u,v) were in Ef, the triangle inequality would give
    dist(s,v) under f  <= dist(s,u) under f  + 1
                       <= dist(s,u) under f' + 1     (u's distance did not decrease)
                        = dist(s,v) under f'          (from the shortest path p in Gf')
which contradicts our assumption that the distance to v decreased. So (u,v) is in Ef' but not in Ef, which means the augmentation must have increased the flow from v to u.
The Edmonds-Karp algorithm always augments flow along shortest paths, and therefore the shortest path from s to u in Gf has (v,u) as its last edge.
Therefore, we say that
    dist(s,v) under f   = dist(s,u) under f  - 1     (since (v,u) is the last edge of a shortest path to u in Gf)
                       <= dist(s,u) under f' - 1     (u's distance did not decrease)
                        = dist(s,v) under f' - 2     (since (u,v) is the last edge of a shortest path to v in Gf')
which contradicts our assumption that the distance to v decreased.
Hence such a vertex cannot exist, and the shortest-path distance increases monotonically with each flow augmentation.

Sunday, April 6, 2014

The basic Ford Fulkerson algorithm is given by the following:
Ford-Fulkerson(G, s, t)
   for each edge (u,v) belonging to E[G]
       do f[u,v] = 0
          f[v,u] = 0
   while there exists a path p from s to t in the residual network Gf
       do cf(p) = min{ cf(u,v) : (u,v) is in p }
          for each edge (u,v) in p
              do f[u,v] = f[u,v] + cf(p)
                 f[v,u] = -f[u,v]


The Ford-Fulkerson algorithm expands on the Ford-Fulkerson method mentioned in the previous post. The while loop repeatedly finds an augmenting path p in Gf and augments the flow f by the residual capacity cf(p). When no augmenting path exists, the flow f is a maximum flow.
How the augmenting path is chosen determines how well the algorithm performs.
If a breadth first traversal is chosen, the algorithm runs in polynomial time.
However let's first take the case when the augmentation path is chosen arbitrarily and the capacities are integral.
The complexity of the initialization steps is O(E).

The complexity of the while loop is O(E |f*|), where f* is a maximum flow of the network: with integral capacities each iteration raises the flow value by at least one unit, so the loop executes at most |f*| times. The efficiency of the loop body can be increased with the choice of a suitable data structure.
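The pseudocode above can be turned into a runnable sketch. The dict-of-dicts capacity representation and the depth-first path search below are my own choices, standing in for the "arbitrarily chosen augmenting path" case:

```python
def ford_fulkerson(capacity, s, t):
    """Basic Ford-Fulkerson with an arbitrarily chosen (here: depth-first)
    augmenting path; O(E * |f*|) for integral capacities. Every vertex must
    appear as a key of `capacity` (a sink maps to an empty dict)."""
    # Initialization step: f[u,v] = 0 and f[v,u] = 0 for every pair.
    flow = {u: {v: 0 for v in capacity} for u in capacity}

    def dfs(u, visited):
        """Return some s-t path in the residual network, or None."""
        if u == t:
            return [t]
        visited.add(u)
        for v in capacity:
            residual = capacity.get(u, {}).get(v, 0) - flow[u][v]
            if v not in visited and residual > 0:
                rest = dfs(v, visited)
                if rest is not None:
                    return [u] + rest
        return None

    while True:
        path = dfs(s, set())
        if path is None:
            # No augmenting path left; the flow value is what leaves s.
            return sum(flow[s][v] for v in capacity)
        edges = list(zip(path, path[1:]))
        cf_p = min(capacity.get(u, {}).get(v, 0) - flow[u][v] for u, v in edges)
        for u, v in edges:
            flow[u][v] += cf_p
            flow[v][u] = -flow[u][v]  # maintain skew symmetry

cap = {"s": {"u": 4, "v": 2}, "u": {"t": 3}, "v": {"t": 5}, "t": {}}
print(ford_fulkerson(cap, "s", "t"))  # 5
```

The reverse edges get residual capacity through the skew-symmetric negative flow entries, so a later iteration can cancel flow pushed earlier.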



Saturday, April 5, 2014

Today we discuss the Ford-Fulkerson method for solving the maximum-flow problem. A maximum-flow problem is one where we interpret a directed graph as a flow network in which each directed edge has a stated capacity. In the maximum-flow problem, we wish to find the greatest rate at which material can flow through the graph without violating any capacity constraints.
The Ford Fulkerson method is defined as follows:
Ford-Fulkerson-Method(G, s, t)
   initialize flow f to 0
   while there exists an augmenting path p
       do augment flow f along p
   return f
As we can see, this method is iterative. At each iteration, we find an augmenting path, which is simply a path along which we can send more flow, and then augment the flow along this path. We repeat this process until no augmenting path can be found. There is a way to show that this yields a maximum flow: the max-flow min-cut theorem. Because the Ford-Fulkerson method admits different implementations that attain the maximum flow, it is called a method rather than an algorithm.
In a graph with a flow network and a flow, some edges can admit more flow; this additional flow that we can push through is called residual capacity. The Ford-Fulkerson method repeatedly augments the flow along augmenting paths until a maximum flow is found. The max-flow min-cut theorem tells us that a flow is maximum if and only if its residual network contains no augmenting path. To prove this theorem, the notion of a cut of the flow network is used.
A cut (S, T) of a flow network is a partition of V into S and T = V - S such that s belongs to S and t belongs to T. If f is the flow, then the net flow across the cut is f(S,T), and the capacity of the cut is c(S,T). A minimum cut of a network is a cut whose capacity is minimum over all cuts of the network.
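These definitions are easy to compute directly. A small sketch, with capacities and flows keyed by edge pair and flows stored skew-symmetrically (conventions of my own for illustration):

```python
def cut_capacity(capacity, S):
    """c(S, T): total capacity of edges going from S into T = V - S."""
    return sum(c for (u, v), c in capacity.items() if u in S and v not in S)

def net_flow_across(flow, S):
    """f(S, T): net flow from S to T; for any cut this equals |f|."""
    return sum(f for (u, v), f in flow.items() if u in S and v not in S)

capacity = {("s", "u"): 3, ("s", "v"): 2, ("u", "t"): 2, ("v", "t"): 3}
# A maximum flow of value 4, stored with skew symmetry f[v,u] = -f[u,v].
flow = {("s", "u"): 2, ("u", "s"): -2, ("s", "v"): 2, ("v", "s"): -2,
        ("u", "t"): 2, ("t", "u"): -2, ("v", "t"): 2, ("t", "v"): -2}
print(cut_capacity(capacity, {"s"}))       # 5
print(cut_capacity(capacity, {"s", "u"}))  # 4  (a minimum cut)
print(net_flow_across(flow, {"s"}))        # 4
print(net_flow_across(flow, {"s", "u"}))   # 4  (same value across every cut)
```

The skew-symmetric storage is what lets a single sum over edges leaving S compute the net flow: a flow entering S appears as a negative term.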
The max-flow min-cut theorem states that if f is a flow in a flow network G = (V, E) with source s and sink t, then the following conditions are equivalent:
1. f is a maximum flow in G
2. The residual network Gf contains no augmenting paths
3. |f| = c(S,T) for some cut (S,T) of G
The first condition implies the second. This we can show by proof of contradiction: if f is a maximum flow and there is still an augmenting path p, then the flow sum f + fp is a flow in G with value strictly greater than |f|, contradicting the assumption we just made.
The second condition implies the third: if Gf contains no augmenting path, take S to be the set of vertices reachable from s in Gf and T = V - S; every edge crossing from S to T must be saturated, so |f| = f(S,T) = c(S,T). The third condition implies the first because the value of any flow in the network is bounded from above by the capacity of any cut of G; the condition |f| = c(S,T) thus implies that f is a maximum flow.
In the basic Ford-Fulkerson algorithm, each iteration finds some augmenting path p and increases the flow f on each edge of p by the residual capacity cf(p). Along the way we use the fact that the net flow across any cut (S,T) satisfies f(S,T) = |f|. Given that edges carry both a capacity and a flow, and that the absence of an edge implies zero flow and zero capacity, we update the flow f[u,v] between each pair of vertices connected by an edge: we compute the residual capacity of the path in a temporary variable and augment the flow by that amount.
The residual capacity of the path is the minimum of the residual capacities of its edges. We repeat the update until no augmenting path remains.

Friday, April 4, 2014

We will continue our discussion of Prim's algorithm. We discussed that Prim's algorithm is a greedy method: we look for the minimum-cost edge we can add to our tree. We determined that for a vertex v with a min-cost edge e at v, there is a minimum-cost spanning tree containing e. We also said that for any vertex v, some min-cost edge at v is included in every minimum-cost spanning tree. We then generalized this with partitions and said that some min-cost edge across the partition must be included in every minimum-cost spanning tree.
With the lemma above, we choose the minimum-weight edge that crosses between the set of vertices already in the tree and those we have not explored yet, and keep growing our tree. Thus Prim's algorithm finds the minimum-cost spanning tree.

We continue our discussion of the proof of correctness for Prim's algorithm.
To prove Lemma 1, suppose the min-cost edge e at v is not included in the tree. Adding e to the tree creates a cycle, which returns to v via some other edge f. Since e is a min-cost edge at v, cost(e) <= cost(f). If we remove f, the cost of the new tree must be less than or equal to the original; since the original is a min-cost spanning tree, it has to be equal. Thus we see that we can swap min-cost edges at a vertex.
We can generalize Lemma 2. Take the set of vertices and split it into two parts, S and V - S. Some min-cost edge from V - S to S must be included in every min-cost spanning tree.
Taking S = {v} and V - S = V - {v} recovers Lemma 2.
Any spanning tree has an edge with ends in both partitions. By a proof similar to the earlier one, where we take a returning edge and swap it with our edge, we can show that the cost is unchanged, so this generalization works.
Consider Lemma 3 in the special case where all edge weights are distinct. Among the edges that cross from S to V - S, the unique min-cost edge must be included in every min-cost spanning tree.
At any point in Prim's algorithm, we have the tree vertices S and the rest V - S, and the algorithm picks, for some i in V - S, a minimum-cost edge from i to near(i) in S. By Lemma 3, Prim's algorithm always picks a valid edge to add, which gives us the proof of correctness.
We now adapt Lemma 3 and the proof of correctness for Prim to allow repeated edge weights, which resembles the real case, to show that it works there as well.
Here we argue that every copy of a repeated edge weight still crosses the partition between S and V - S, so by picking one of minimum cost we add a valid edge. The order in which edges are added is irrelevant because we know we can swap one valid edge for another. In other words, having proved that we pick a valid edge when all weights are distinct, we pick a valid edge again for an equal-valued pair (i, near(i)). This gives us the proof of correctness for the real case, where the graph can have arbitrary weights.


Wednesday, April 2, 2014

We will now look at the proof of correctness for Prim's algorithm. Prim's algorithm can produce different minimum spanning trees depending upon which edge is chosen when equal-weight edges tie. In this algorithm, we repeatedly compute the smallest edge extending the tree.
We begin by picking two vertices i and j and an edge between them. We maintain a two-dimensional data structure that records the endpoints of the edges, and we keep track of the minimum cost. For each of the remaining vertices, if there is no edge to either i or j, we set the cost to -1; otherwise we record the edge.
We then proceed with the rest of the tree. We now look at a lemma. Let e be some minimum-cost edge attached to vertex v. The lemma states that there is some min-cost spanning tree that contains e. We use this lemma in this way: we need not start the tree at the vertex with the smallest edge. If edges are not connected to the starting vertex, we set the distances to a maximum value. So the only variation from the original algorithm is the initialization step.
As for complexity, the outer loop runs O(n) times. Inside each iteration, the dominant cost is scanning the pairs (i, near(i)) to find the minimum. To improve the efficiency, we need a data structure to keep track of near(i); the red-black tree is suitable for this need. It allows insert, finding the minimum, deleting an arbitrary value, and searching for an arbitrary value, each in O(log n) time. We find the minimum once per vertex, which is O(n log n) overall, and if there are m edges we do O(m log n) operations to update near(). The total complexity is O((m + n) log n). If m is on the order of n^2, that is, when there are a lot of edges, the red-black tree is not suitable. In summary, we have one greedy implementation of the min-cost spanning tree.
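As a concrete sketch of the above, here is Prim's algorithm with a binary heap standing in for the red-black tree (the vertex numbering and edge-list format are my own); each heap operation is O(log n), matching the O((m + n) log n) bound:

```python
import heapq

def prim_mst(n, edges):
    """Weight of a minimum spanning tree of a connected graph on
    vertices 0..n-1, where edges is a list of (weight, u, v)."""
    adj = [[] for _ in range(n)]
    for w, u, v in edges:
        adj[u].append((w, v))
        adj[v].append((w, u))
    in_tree = [False] * n
    heap = [(0, 0)]  # start the tree from vertex 0 at cost 0
    total = 0
    picked = 0
    while heap and picked < n:
        w, u = heapq.heappop(heap)  # cheapest edge crossing the partition
        if in_tree[u]:
            continue  # stale entry: u was already added to the tree
        in_tree[u] = True
        total += w
        picked += 1
        for wv, v in adj[u]:
            if not in_tree[v]:
                heapq.heappush(heap, (wv, v))
    return total

# Square 0-1-2-3 with one diagonal: the MST picks weights 1 + 2 + 3.
edges = [(1, 0, 1), (2, 1, 2), (3, 2, 3), (4, 3, 0), (5, 0, 2)]
print(prim_mst(4, edges))  # 6
```

Instead of deleting decreased keys as a red-black tree would, the heap keeps stale entries and skips them on pop, which preserves the same asymptotic bound.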
We will now look at the proof of correctness for this algorithm. We list two lemmas; we have stated the first one already. This is where we say that when we pick a vertex v with a min-cost edge e at v, there is a min-cost spanning tree containing it. The second lemma is that for a vertex v with a min-cost edge e, some such edge is included in every min-cost spanning tree. To prove the second lemma, we take a vertex v and suppose there is a min-cost tree t with no min-cost edge at v. We add a min-cost edge e at v, creating a cycle; let f be the other edge at v on this cycle. Since t contains no min-cost edge at v, cost(e) < cost(f) strictly. If we delete f, we restore a spanning tree: everything is still connected, it includes e, and its cost is strictly less than that of t, which contradicts the assumption that the t we started out with is a min-cost spanning tree.