Friday, April 4, 2014

We will continue our discussion of Prim's algorithm. We discussed that Prim's algorithm is a greedy method: we look for the minimum cost edge that we can add to our tree. We determined that for a vertex v with a minimum cost edge e at v, there is a minimum cost spanning tree containing e. We also said that for any vertex v, some minimum cost edge at v is included in every minimum cost spanning tree. We then generalized this with partitions and said that some minimum cost edge across the partition must be included in every minimum cost spanning tree.
With the lemmas above, we choose a minimum weight edge between the set of vertices already in the tree and those we have not explored yet, and keep growing our tree. Thus, Prim's algorithm finds a minimum cost spanning tree.

We continue our discussion of the proof of correctness for Prim's algorithm.
To prove lemma 1, suppose the minimum cost edge e at v is not included in some minimum cost spanning tree. Adding e creates a cycle, which returns to v via some edge f. We add e to the tree and remove f. Since e is a minimum cost edge at v, cost of e <= cost of f, so the cost of the new tree is less than or equal to that of the original. In fact it has to be equal, since the original tree was already minimum cost. Thus we see that we can swap in a minimum cost edge at a vertex.
We can generalize lemma 2. Take the set of vertices and split it into two parts, S and V - S. Some minimum cost edge from V - S to S must be included in every minimum cost spanning tree.
S = {v}, V - S = V - {v}  => Lemma 2
Such an edge has one end in each part of the partition. By a proof similar to the earlier one, where we take a returning edge on the cycle and swap it with our edge, we can show that the cost is unchanged and this generalization works.
Consider lemma 3 in the special case where all edge weights are distinct. Among the edges that cross from S to V - S, there is a unique minimum cost edge, and it must be included in every minimum cost spanning tree.
At any point in Prim's algorithm, we have the tree vertices t and the remaining vertices V - t, and the algorithm picks, for some i, a minimum cost edge from i in V - t to near(i) in t. By lemma 3, Prim's algorithm always picks a valid edge to add, which gives us the proof of correctness.
We now adapt lemma 3 and the proof of correctness for Prim's algorithm to allow repeated edge weights, which resembles the real case, and show that the argument works there as well.
Here we argue that every copy of a repeated edge weight still crosses the partition between t and V - t, and by picking one of minimum cost we add a valid edge. The order in which tied edges are added is irrelevant, because we know that we can swap one valid edge for another without changing the cost. In other words, having proved that we pick a valid edge for distinct weights, we pick a valid edge again for an equal-cost pair (i, near(i)). This gives us the proof of correctness for the real case, where the graph can have arbitrary weights.
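The greedy step the lemmas justify can be sketched directly in Python: grow S one vertex at a time, always adding a minimum cost edge across the cut (S, V - S). The graph and weights below are illustrative.

```python
# Prim's algorithm driven directly by the cut property:
# at each step, scan every edge crossing the cut (S, V - S)
# and add a minimum cost one (lemma 3 says this is always safe).

def prim_by_cuts(n, edges, start=0):
    """n vertices 0..n-1, edges given as (u, v, w). Returns MST cost."""
    S = {start}
    total = 0
    while len(S) < n:
        # All edges with exactly one endpoint in S.
        crossing = [(w, u, v) for (u, v, w) in edges
                    if (u in S) != (v in S)]
        w, u, v = min(crossing)       # a minimum cost crossing edge
        total += w
        S.add(v if u in S else u)     # pull the outside endpoint into S
    return total

# Example: a square with one diagonal.
edges = [(0, 1, 1), (1, 2, 2), (2, 3, 3), (3, 0, 4), (0, 2, 5)]
print(prim_by_cuts(4, edges))  # MST uses weights 1, 2, 3 -> 6
```

This scans all edges on every step, so it is quadratic; the point here is only to mirror the cut argument, not to be efficient.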


Wednesday, April 2, 2014

We will now look at the proof of correctness for Prim's algorithm. Prim's algorithm can produce different minimum spanning trees depending upon the edges chosen when equal weight edges tie. In this algorithm, we repeatedly pick the smallest cost edge extending the tree.
We begin by picking a minimum cost edge and its two endpoint vertices. We maintain a two dimensional data structure t that records the endpoints of the chosen edges. We also keep track of the running mincost. For each of the remaining vertices, if there is no edge to either endpoint, we set its near value to -1; otherwise we record the nearer endpoint.

We then proceed with the rest of the tree. We now look at a lemma. Let e be some minimum cost edge attached to vertex v. The lemma states that there is some minimum cost spanning tree that contains e. We use this lemma in this way: we need not start the tree with the smallest edge in the graph; we may start at any vertex v with a minimum cost edge at v. If a vertex has no edge to the starting vertices, we set its distance to a maximum (a large value). So the only variation from the original algorithm is the initialization step.

As for complexity, the outer loop runs O(n) times. Inside each iteration, the dominant cost is scanning the pairs (i, near(i)) to find the minimum. To improve the efficiency, we need a data structure to keep track of near(i); the red-black tree is suitable for this need. It allows insert, finding the minimum, deleting an arbitrary value, and searching for an arbitrary value, each in O(log n) time. We do O(n log n) operations to find minima, once per vertex, and if there are m edges we do O(m log n) operations to update near(). The total complexity is O((m + n) log n). If m is of order n^2, i.e., there are a great many edges, the red-black tree is not suitable and the simple scan is preferable. In summary, we have one greedy implementation of minimum cost spanning tree.
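Python has no built-in red-black tree, so the sketch below uses heapq with lazy deletion as a stand-in; the bound is still O((m + n) log n). The adjacency list is illustrative.

```python
import heapq

def prim_heap(n, adj, start=0):
    """adj[u] = list of (v, w) pairs. O((m + n) log n) with a binary heap;
    lazy deletion (skipping stale entries) stands in for the red-black
    tree's delete of an arbitrary value."""
    best = {start: 0}          # best[i] plays the role of cost[i, near(i)]
    heap = [(0, start)]
    in_tree = [False] * n
    mincost = 0
    while heap:
        w, u = heapq.heappop(heap)
        if in_tree[u] or w > best.get(u, float('inf')):
            continue           # stale heap entry; skip it
        in_tree[u] = True
        mincost += w
        for v, wv in adj[u]:
            if not in_tree[v] and wv < best.get(v, float('inf')):
                best[v] = wv   # near(v) improved; push the updated key
                heapq.heappush(heap, (wv, v))
    return mincost

adj = {0: [(1, 1), (3, 4), (2, 5)], 1: [(0, 1), (2, 2)],
       2: [(1, 2), (3, 3), (0, 5)], 3: [(2, 3), (0, 4)]}
print(prim_heap(4, adj))
```

Instead of deleting an outdated (i, near(i)) entry, we push a fresh one and discard the stale copy when it surfaces; this keeps the same asymptotic bound at the price of a larger heap.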
We will now look at the proof of correctness for this algorithm. We list two lemmas. We have stated the first one already: when we pick a vertex v with a minimum cost edge e at v, there is a minimum cost spanning tree containing e. The second lemma is that for any vertex v, some minimum cost edge at v is included in every minimum cost spanning tree. To prove the second lemma, take a vertex v and suppose there is a minimum cost tree t containing no minimum cost edge at v. Add a minimum cost edge e at v, creating a cycle; let f be the other edge of that cycle at v. Then cost of e < cost of f, since t contains no minimum cost edge at v. If we delete f, we restore a tree: everything is still connected, it includes e, and its cost is strictly smaller, which contradicts the assumption that the t we started out with is a minimum cost spanning tree.
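The second lemma can be sanity-checked by brute force on a tiny example (the graph below is illustrative, with distinct weights; enumerating spanning trees is only feasible for very small graphs):

```python
from itertools import combinations

def spanning_trees(n, edges):
    """Yield every spanning tree (as an (u, v, w) edge subset) of a small graph."""
    for subset in combinations(edges, n - 1):
        parent = list(range(n))
        def find(x):
            while parent[x] != x:
                x = parent[x]
            return x
        is_tree = True
        for u, v, w in subset:
            ru, rv = find(u), find(v)
            if ru == rv:
                is_tree = False    # this edge would close a cycle
                break
            parent[ru] = rv
        if is_tree:                # n-1 acyclic edges on n vertices: a tree
            yield subset

edges = [(0, 1, 1), (1, 2, 2), (2, 3, 3), (3, 0, 4), (0, 2, 5)]
trees = list(spanning_trees(4, edges))
mcost = min(sum(w for (_, _, w) in t) for t in trees)
mcsts = [t for t in trees if sum(w for (_, _, w) in t) == mcost]
# Edge (0, 1) with weight 1 is the minimum cost edge at vertex 0; with
# distinct weights it appears in every minimum cost spanning tree.
print(all((0, 1, 1) in t for t in mcsts))  # True
```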
Today we look at Prim's algorithm. This is also a greedy method. Here we grow the tree by adding the minimum cost edge that connects to it: we look at all the edges with exactly one endpoint in the existing tree and choose the lowest cost one, growing the tree one edge at a time. The algorithm is described below.
Algorithm Prim(E, cost, n, t)
{
  Let (k,l) be an edge of minimum cost in E;
  t[1,1] = k; t[1,2] = l;
  mincost = cost[k,l];
  for i = 1 to n do
    if (cost[i,l] < cost[i,k]) then near[i] = l;
    else if (cost[i,k] is not large) then near[i] = k;
    else near[i] = -1;
  near[k] = near[l] = 0;

  for i = 2 to n-1 do
  {
    Choose j such that near[j] > 0 and cost[j, near[j]] is minimum;
    t[i,1] = j; t[i,2] = near[j];
    mincost = mincost + cost[j, near[j]];
    near[j] = 0;
    for k = 1 to n do
      if ((near[k] > 0) and (cost[k, near[k]] > cost[k,j])) or
         ((near[k] < 0) and (cost[k,j] is not large)) then
        near[k] = j;
  }
  return mincost;
}
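A direct transliteration of the pseudocode above into runnable Python (arrays are kept 1-indexed to match, with row and column 0 unused; the example graph is illustrative):

```python
LARGE = float('inf')

def prim(cost, n):
    """cost: (n+1) x (n+1) matrix, 1-indexed as in the pseudocode;
    cost[i][j] = LARGE when there is no edge. Returns (mincost, t)."""
    # Let (k, l) be an edge of minimum cost in E.
    k, l = min(((i, j) for i in range(1, n + 1) for j in range(1, n + 1)
                if i != j and cost[i][j] < LARGE),
               key=lambda e: cost[e[0]][e[1]])
    t = [(k, l)]
    mincost = cost[k][l]
    near = [0] * (n + 1)
    for i in range(1, n + 1):
        if cost[i][l] < cost[i][k]:
            near[i] = l
        elif cost[i][k] < LARGE:
            near[i] = k
        else:
            near[i] = -1              # no edge to either k or l yet
    near[k] = near[l] = 0             # 0 marks vertices already in the tree
    for _ in range(2, n):             # add the remaining n - 2 edges
        # Choose j with near[j] > 0 and cost[j][near[j]] minimum.
        j = min((x for x in range(1, n + 1) if near[x] > 0),
                key=lambda x: cost[x][near[x]])
        t.append((j, near[j]))
        mincost += cost[j][near[j]]
        near[j] = 0
        for x in range(1, n + 1):
            if (near[x] > 0 and cost[x][near[x]] > cost[x][j]) or \
               (near[x] < 0 and cost[x][j] < LARGE):
                near[x] = j
    return mincost, t

# Square with a diagonal, vertices 1..4.
X = LARGE
cost = [[X, X, X, X, X],
        [X, X, 1, 5, 4],
        [X, 1, X, 2, X],
        [X, 5, 2, X, 3],
        [X, 4, X, 3, X]]
print(prim(cost, 4))
```

The O(n) inner scans over near[] give the O(n^2) running time discussed earlier; the heap or red-black tree variant replaces those scans.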

Tuesday, April 1, 2014

Today we will look at Kruskal's algorithm. This algorithm is a greedy method. It enumerates the edges in order of ascending weight. The chosen edges need not be connected at intermediate stages; at any given time we have a collection of disjoint trees, i.e., a forest. The algorithm proceeds like this.
We have two helper methods:
Find(u), which finds the component containing u, and
Union(u, v), which combines two components.
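A minimal sketch of these two helpers in Python (path compression and union by size are standard optimizations, assumed here rather than taken from the lecture):

```python
# Disjoint set (union-find) helpers for Kruskal's algorithm.

parent, size = {}, {}

def make_set(u):
    parent[u] = u
    size[u] = 1

def find(u):
    """Return the representative of the component containing u."""
    root = u
    while parent[root] != root:
        root = parent[root]
    while parent[u] != root:          # path compression
        parent[u], u = root, parent[u]
    return root

def union(j, k):
    """Combine the components whose representatives are j and k."""
    if size[j] < size[k]:
        j, k = k, j                   # union by size: attach smaller to larger
    parent[k] = j
    size[j] += size[k]

for x in range(4):
    make_set(x)
union(find(0), find(1))
print(find(0) == find(1))  # True: the two components are now one
```

With both optimizations, a sequence of m operations runs in nearly linear time, so the heap operations dominate Kruskal's cost.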
Algorithm Kruskal(E, cost, n, t)
{
  Organize the edges of the graph in a min heap H, based on cost;
  for i = 1 to n do component[i] = i;
  i = 0; mincost = 0;
  while (i < n - 1) and (H is not empty) do
  {
    (u,v) = delete-min(H);
    j = Find(u); k = Find(v);
    if (j != k) then
    {
      i = i + 1;
      t[i,1] = u; t[i,2] = v;
      mincost = mincost + cost[u,v];
      Union(j, k);
    }
  }
  return mincost;
}
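A runnable Python sketch of the algorithm above, with heapq standing in for the min heap and a simple component[] array playing the role of Find and Union (the graph is illustrative):

```python
import heapq

def kruskal(n, edges):
    """edges: list of (u, v, cost) triples. Returns (mincost, tree_edges)."""
    heap = [(w, u, v) for (u, v, w) in edges]
    heapq.heapify(heap)               # min heap keyed on cost
    component = list(range(n))        # component[i] = i initially

    def find(u):                      # walk up to the component representative
        while component[u] != u:
            u = component[u]
        return u

    t, mincost = [], 0
    while heap and len(t) < n - 1:
        w, u, v = heapq.heappop(heap) # delete-min
        j, k = find(u), find(v)
        if j != k:                    # different components: no cycle formed
            t.append((u, v))
            mincost += w
            component[j] = k          # Union(j, k)
    return mincost, t

edges = [(0, 1, 1), (1, 2, 2), (2, 3, 3), (3, 0, 4), (0, 2, 5)]
print(kruskal(4, edges))
```

Heapify is O(m), each delete-min is O(log m), and with the optimized union-find from above the overall cost is dominated by the O(m log m) heap work.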



Monday, March 31, 2014

I want to talk about Splunk saved searches and their visibility, aka permissions. Splunk has the following permission levels:
A saved search can be private (local), visible to the current app, or visible to all applications (global).
The roles that Splunk defines are 'Everyone', 'admin', 'can_delete', 'power', 'splunk-system-role' and 'user'.
Each of these roles can have read or write capability.
Saved searches are searches that can be scheduled for runs. The permissions can also be specified via the configuration file with the following format:
[object_type/object_name]
access = read : [ comma-separated list of roles ], write: [comma-separated list of roles ]
We can also set permissions for all objects of a type
[eventtypes]
access = read : [ * ], write : [ admin, power ]
Objects can also be made globally available by adding an entry for the object in default.meta
as
export = system
As an example, to make all event types in the business analytics app viewable in every app in the Splunk installation, we use the following:
[eventtypes]
access = read : [ * ], write : [admin, power ]
export = system
Every app and its objects are governed by a set of permissions, and these can be specified for every role within Splunk.
Every user has their own user directory. An object's visibility can be promoted from user level to app level, and to global across all apps. This is called sharing, and Splunk implements it by moving that object from the user directory to the app directory.




In today's post, we continue our discussion on Splunk. In addition to the discussion on search earlier, I wanted to bring up that search artifacts are maintained in the dispatch folder. We can view all the processing information in these folders, including events, commands, arguments, preview, and results.
We now look at alerts. Splunk Enterprise can be configured to send alert messages to anyone when real time or historical search results meet a condition. The conditions can cover a variety of threshold and trend based scenarios.
There are three Splunk alert categories: per-result alerts, scheduled alerts, and rolling-window alerts. Alerts are based on searches that run either on a regular interval over a historical time range or in real time (if the search is a real time search). When alerts are triggered, different actions are executed. There are several alert actions, such as e-mail notifications, that we will cover later.
The per-result alerts are real-time searches that trigger every time the base search returns a result.
This alert type is usually authored to be invoked when a matching result comes in and is generally used in workflow oriented applications. These alerts can be throttled to ensure they don't fire too often.
Alerts based on historical searches usually run on a regular schedule. This alert type triggers whenever a scheduled run of a historical search returns results that meet a particular condition. These are generally lower priority alerts, used more for monitoring over time. For example, trigger an alert whenever the number of 404 errors in any one hour interval exceeds 100.
Real time searches can also have alerts that monitor events within a rolling time window. These trigger when their conditions are met by events as they pass through the window in real time. For example, trigger an alert when there are three consecutive failed logins.
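As a concrete illustration, a scheduled alert of the historical kind is just a saved search stanza in savedsearches.conf. The stanza below is a hedged sketch for the 404 example: the stanza name, search string, recipient, and threshold are made-up values, and the key names should be checked against the Splunk documentation for your version.

```ini
[Too many 404 errors]
search = sourcetype=access_combined status=404
dispatch.earliest_time = -1h
enableSched = 1
cron_schedule = 0 * * * *
counttype = number of events
relation = greater than
quantity = 100
action.email = 1
action.email.to = ops@example.com
```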