Friday, April 4, 2014

We continue our discussion on the proof of correctness for Prim's algorithm.
To prove Lemma 1, take the minimum cost edge e at vertex v and suppose e is not included in some minimum cost spanning tree. Adding e to that tree creates a cycle, which returns to v via some other edge f. Since e is a minimum cost edge at v, we know that cost of e <= cost of f. The new tree obtained by adding e and removing f must cost less than or equal to the original; since the original is already a minimum cost spanning tree, the two costs must in fact be equal. Thus we see that we can replace one minimum cost edge at a vertex with another without changing the cost.
We can generalize Lemma 2. We take the set of vertices and split it into two parts, S and V - S. Every minimum cost spanning tree must include some minimum cost edge crossing from V - S to S.
S = {v}, V - S = V - {v}  =>  Lemma 2
We have an edge with one end in each partition. By a proof similar to the earlier one, where we take a returning edge of the cycle and swap it with our edge, we can show that the cost is unchanged, so this generalization holds.
Now consider Lemma 3 in the special case where all edge weights are distinct. Among the edges that cross from S to V - S, the unique minimum cost edge must be included in every minimum cost spanning tree.
At any point in Prim's algorithm, we have the tree vertices S and the remaining vertices V - S, and the algorithm picks, for some i in V - S, a minimum cost edge from i to near(i) in S. By Lemma 3, Prim's algorithm always picks a valid edge to add, which gives us the proof of correctness.
We now adapt Lemma 3 and the proof of correctness for Prim to the case where edge weights may repeat, which resembles the real case, to show that the algorithm works there as well.
Here we argue that every copy of a repeated edge weight also crosses the partition between S and V - S, so by picking one of minimum cost we still add a valid edge. The order in which such edges are added is irrelevant because we know we can swap one valid edge for another without changing the cost. In other words, having proved that we pick a valid edge when values are distinct, we also pick a valid edge for an equal-valued pair (i, near(i)). This gives us the proof of correctness for the real case, where the graph can have arbitrary weights.
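To make the cut property concrete, here is a small brute-force sketch in Python (the graph, the cut S and all names below are made up for illustration): it enumerates the spanning trees of a tiny graph with distinct weights and checks that the cheapest edge crossing a chosen cut appears in every minimum cost spanning tree.

from itertools import combinations

def spanning_trees(vertices, edges):
    # Yield every spanning tree as a set of edges (brute force, tiny graphs only).
    n = len(vertices)
    for subset in combinations(edges, n - 1):
        parent = {v: v for v in vertices}
        def find(v):
            while parent[v] != v:
                v = parent[v]
            return v
        is_tree = True
        for u, v, w in subset:
            ru, rv = find(u), find(v)
            if ru == rv:
                is_tree = False      # these n-1 edges contain a cycle
                break
            parent[ru] = rv
        if is_tree:                  # acyclic with n-1 edges => spanning tree
            yield set(subset)

vertices = [1, 2, 3, 4]
edges = [(1, 2, 1), (2, 3, 2), (3, 4, 3), (4, 1, 4), (1, 3, 5)]   # distinct weights

trees = list(spanning_trees(vertices, edges))
min_cost = min(sum(w for _, _, w in t) for t in trees)
msts = [t for t in trees if sum(w for _, _, w in t) == min_cost]

# For the cut S = {1, 2} versus V - S = {3, 4}, the cheapest crossing edge is (2, 3, 2);
# Lemma 3 says it must appear in every minimum cost spanning tree.
S = {1, 2}
crossing = [e for e in edges if (e[0] in S) != (e[1] in S)]
cheapest = min(crossing, key=lambda e: e[2])
print(all(cheapest in t for t in msts))    # prints True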


Wednesday, April 2, 2014

We will now look at the proof of correctness for Prim's algorithm. Prim's algorithm can produce different minimum spanning trees depending upon which edges are chosen when there are ties between equal weight edges. In this algorithm, we repeatedly add the smallest cost edge extending the tree.
We begin by picking a minimum cost edge and its two endpoints, k and l. We maintain a two-dimensional data structure t that stores the endpoints of the tree edges, and we also keep track of mincost. For each of the remaining vertices, we record in near[] the closer of the two endpoints; if there is no edge to either k or l, we set near[] to -1.
We proceed with the rest of the tree. We now look at a lemma. Let e be some minimum cost edge attached to vertex v. The lemma states that there is some minimum cost spanning tree that contains e. We use this lemma in this way: we need not start the tree at the vertex with the smallest edge. If some vertices have no edge to the chosen start vertex, we set their distances to a maximum value, so the only variation from the original algorithm is the initialization step.
The complexity of this algorithm is as follows. The outer loop runs O(n) times, and inside each iteration the dominant cost is scanning the pairs (i, near(i)) to find the minimum, so the straightforward implementation is O(n^2). To improve the efficiency, we need a data structure to keep track of near(i); the red-black tree is suitable for this need. It allows insert, finding the minimum, deleting an arbitrary value and searching for an arbitrary value, each in O(log n) time. We find the minimum once for each of the n tree edges, about n log n operations, and if there are m edges we do about m log n operations to update near(). The total complexity is O((m + n) log n). If m is of the order of n^2, i.e. there are a lot of edges, the red-black tree is not suitable and the simple scan is preferable. In summary, we have one greedy implementation of the minimum cost spanning tree.
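As a sketch of this idea (Python's heapq stands in for the red-black tree here; it gives the same O(log n) insert and delete-min, with stale entries skipped lazily instead of deleted, and the small graph at the bottom is made up for illustration), a priority-queue version of Prim runs in roughly O((m + n) log n):

import heapq

def prim_mincost(n, adj):
    # adj[u] is a list of (v, w) pairs for an undirected graph on vertices 0..n-1.
    in_tree = [False] * n
    heap = [(0, 0, 0)]          # (edge cost, vertex, its near() endpoint in the tree)
    mincost, added = 0, 0
    while heap and added < n:
        w, u, near_u = heapq.heappop(heap)
        if in_tree[u]:
            continue            # stale entry; a cheaper edge already brought u in
        in_tree[u] = True
        mincost += w
        added += 1
        for v, wv in adj[u]:
            if not in_tree[v]:
                heapq.heappush(heap, (wv, v, u))
    return mincost

# Tiny made-up example: triangle with costs 1, 2, 4.
adj = {0: [(1, 1), (2, 4)], 1: [(0, 1), (2, 2)], 2: [(0, 4), (1, 2)]}
print(prim_mincost(3, adj))     # prints 3: edges (0,1) and (1,2)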
We will now look at the proof of correctness for this algorithm. We list two lemmas; we've stated the first one already. This is where we say that when we pick a vertex v with a minimum cost edge e at v, there is a minimum cost spanning tree containing e. The second lemma is that for the vertex v with a minimum cost edge e, the edge e is included in every minimum cost spanning tree. To prove the second lemma, we take a vertex v and suppose there is a minimum cost tree t that contains no minimum cost edge at v. We add the minimum cost edge e to t, creating a cycle; e and f are then two edges of this cycle incident at v, with cost of e < cost of f. If we delete f, we restore a tree, and we can show that this tree is a minimum cost spanning tree, since everything is still connected and its cost is no greater, and it includes e, which contradicts the assumption that the t we started out with is a minimum cost spanning tree.
Today we look at Prim's algorithm. This is also a greedy method. Here we grow the tree by adding the minimum cost edge that connects to it. That is, we look at all the edges that connect to the existing tree and choose the one of lowest cost; we grow the tree one edge at a time. The algorithm is described below.
Algorithm Prim(E, cost, n, t)
{
    Let (k, l) be an edge of minimum cost in E;
    t[1,1] = k; t[1,2] = l;
    mincost = cost[k, l];
    for i = 1 to n do        // initialize near[]
        if (cost[i, l] < cost[i, k]) then near[i] = l;
        else if (cost[i, k] is not large) then near[i] = k;
        else near[i] = -1;   // i has no edge to the tree yet
    near[k] = near[l] = 0;   // k and l are already in the tree

    for i = 2 to n - 1 do    // add the remaining n - 2 edges
    {
        Choose j such that near[j] > 0 and cost[j, near[j]] is minimum;
        t[i,1] = j; t[i,2] = near[j];
        mincost = mincost + cost[j, near[j]];
        near[j] = 0;         // j is now in the tree
        for k = 1 to n do    // update near[] for vertices outside the tree
            if ((near[k] > 0) and (cost[k, near[k]] > cost[k, j])) or
               ((near[k] < 0) and (cost[k, j] is not large))
            then near[k] = j;
    }
    return mincost;
}
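Here is a rough, runnable Python translation of the pseudocode above, kept close to the array-based near[] scheme; 1-based indexing is preserved, "large" is represented by math.inf, and the small cost matrix at the bottom is made up for illustration.

import math

def prim(cost, n):
    # cost is an (n+1) x (n+1) matrix, 1-based; cost[i][j] = math.inf if no edge.
    k, l = min(((i, j) for i in range(1, n + 1) for j in range(i + 1, n + 1)),
               key=lambda e: cost[e[0]][e[1]])        # minimum cost edge (k, l)
    t = [(k, l)]
    mincost = cost[k][l]
    near = [0] * (n + 1)
    for i in range(1, n + 1):                         # initialize near[]
        if cost[i][l] < cost[i][k]:
            near[i] = l
        elif cost[i][k] < math.inf:
            near[i] = k
        else:
            near[i] = -1                              # no edge to the tree yet
    near[k] = near[l] = 0                             # k and l are in the tree
    for _ in range(2, n):                             # add the remaining n - 2 edges
        j = min((v for v in range(1, n + 1) if near[v] > 0),
                key=lambda v: cost[v][near[v]])
        t.append((j, near[j]))
        mincost += cost[j][near[j]]
        near[j] = 0
        for v in range(1, n + 1):                     # update near[] outside the tree
            if (near[v] > 0 and cost[v][near[v]] > cost[v][j]) or \
               (near[v] == -1 and cost[v][j] < math.inf):
                near[v] = j
    return t, mincost

# Tiny made-up example: triangle with costs 1, 2, 4.
INF = math.inf
cost = [[INF] * 4 for _ in range(4)]
for a, b, w in [(1, 2, 1), (2, 3, 2), (1, 3, 4)]:
    cost[a][b] = cost[b][a] = w
print(prim(cost, 3))     # prints ([(1, 2), (3, 2)], 3)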

Tuesday, April 1, 2014

Today we will look at Kruskal's algorithm. This algorithm is a greedy method. It enumerates the edges in order of ascending weight. The edges selected so far need not be connected; at any given time we have disjoint components that form a forest. The algorithm proceeds like this.
We have two helper methods:
Find(u), which finds the component containing u, and
Union(u, v), which combines two components.
Algorithm Kruskal(E, cost, n, t)
{
    Organize the edges of the graph in a min heap H, based on cost;
    for i = 1 to n do component[i] = i;   // each vertex starts in its own component
    i = 0; mincost = 0;
    while (i < n - 1) and (H is not empty) do
    {
        (u, v) = delete-min(H);
        j = Find(u); k = Find(v);
        if (j != k) then
        {
            i = i + 1;
            t[i,1] = u; t[i,2] = v;
            mincost = mincost + cost[u, v];
            Union(j, k);
        }
    }
    return mincost;
}
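Below is a minimal runnable Python sketch of Kruskal's algorithm together with the Find and Union helpers (union-find with path compression and union by rank); the edge list at the bottom is made up for illustration.

import heapq

def find(component, u):
    while component[u] != u:
        component[u] = component[component[u]]   # path compression (halving)
        u = component[u]
    return u

def union(component, rank, j, k):
    if rank[j] < rank[k]:
        j, k = k, j
    component[k] = j                             # attach the shorter tree under the taller
    if rank[j] == rank[k]:
        rank[j] += 1

def kruskal(n, edges):
    # edges is a list of (cost, u, v) with vertices numbered 1..n.
    heap = list(edges)
    heapq.heapify(heap)                          # min heap on cost
    component = list(range(n + 1))
    rank = [0] * (n + 1)
    t, mincost, i = [], 0, 0
    while i < n - 1 and heap:
        w, u, v = heapq.heappop(heap)            # delete-min
        j, k = find(component, u), find(component, v)
        if j != k:                               # u and v are in different components
            i += 1
            t.append((u, v))
            mincost += w
            union(component, rank, j, k)
    return t, mincost

edges = [(1, 1, 2), (2, 2, 3), (3, 3, 4), (4, 4, 1), (5, 1, 3)]
print(kruskal(4, edges))   # prints ([(1, 2), (2, 3), (3, 4)], 6)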



Monday, March 31, 2014

I want to talk about Splunk saved searches and their visibility, a.k.a. permissions. Splunk has the following permission levels:
A saved search can be private to its owner (local), visible to the current app, or visible to all applications (global).
The roles that Splunk defines are 'Everyone', 'admin', 'can_delete', 'power', 'splunk-system-role' and 'user'.
Each of these roles can have read or write capability.
Saved searches are searches that can be scheduled to run. The permissions can also be specified via the configuration file in the following format:
[object_type/object_name]
access = read : [ comma-separated list of roles ], write: [comma-separated list of roles ]
We can also set permissions for all objects of a type
[eventtypes]
access = read : [ * ], write : [ admin, power ]
Objects can also be made globally available by adding an entry for the object in default.meta, as in:
export = system
As an example, to make all event types in the business analytics app viewable in every app in the Splunk installation, we use the following:
[eventtypes]
access = read : [ * ], write : [admin, power ]
export = system
Every app and its objects are governed by a set of permissions, and these can be specified for every role within Splunk.
Every user has their own user directory. An object's visibility can be promoted from user level to app level and then to global across all apps. This is called sharing, and Splunk implements it by moving that object from the user directory to the app directory.
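As a hypothetical example (the saved search name errors_last_hour is made up), to let every app read a particular saved search while only admins can edit it, we could add a stanza of the same form to default.meta:
[savedsearches/errors_last_hour]
access = read : [ * ], write : [ admin ]
export = system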




In today's post, we continue our discussion on Splunk.  In addition to the discussion on search earlier, I wanted to bring up that search artifacts are maintained in the dispatch folder. We can view all the processing information in these folders including events, commands, arguments, preview and results.
We now look at alerts. Splunk Enterprise can be configured to send alert messages to anyone when real time or historical search results have met a condition.  The conditions can cover a variety of threshold and trend based scenarios.
There are three Splunk alert categories. These are per-result alerts, scheduled alerts and rolling-window alerts. Alerts are based on reports that run either on a regular interval over a historical time range or in real time (if the report is a real-time search). When alerts are triggered, different actions are executed. There are several alert actions, such as e-mail notifications, that we will cover later.
The per-result alerts are real-time searches that trigger every time the base search returns a result.
This is usually authored to be invoked when a matching result comes in and is generally used in workflow-oriented applications. These alerts can be throttled to ensure they don't fire too often.
Alerts based on historical searches usually run on a regular schedule. This alert type triggers whenever a scheduled run of a historical search returns results that meet a particular condition. These are generally lower priority alerts and more for monitoring over time. For example, trigger an alert whenever the number of 404 errors in any 1 hour interval exceeds 100.
Real-time searches can also have alerts that monitor events within a rolling time window. These trigger when their conditions are met by events as they pass through this window in real time. For example, trigger an alert when there are three consecutive failed logins.
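As a rough sketch of the scheduled-alert case in savedsearches.conf (the search string, schedule and recipient address below are made up for illustration, not taken from any real deployment), an alert on more than 100 404 errors per hour could be configured along these lines:
[Hourly 404 alert]
search = sourcetype=access_combined status=404
dispatch.earliest_time = -1h
cron_schedule = 0 * * * *
enableSched = 1
counttype = number of events
relation = greater than
quantity = 100
actions = email
action.email.to = ops@example.com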


Sunday, March 30, 2014

We talk about distributed search in Splunk today. As we know from the discussion on distributed deployment, Splunk Enterprise performs three major functions as it moves data through the data pipeline. First, it consumes data from files, the network or elsewhere. Then it indexes the data. Finally, it runs interactive or scheduled searches on the indexed data. Each of these functions can be split out into dedicated instances of Splunk that can number from a few to thousands.
The dedicated instances that consume data are called forwarders, and they can be lightweight or universal. The indexers do the heavy indexing, and they are usually the ones with the higher performance configurations. Typically, there is one indexer for every 150 to 200 GB of daily indexing, and a 12 to 24 core server for each indexer instance. Indexers can also be clustered and configured to replicate each other's data. This process is known as index replication. Replication helps prevent data loss and improves data availability. Clusters also have a built-in distributed search capability, which we will get to shortly. The other type of dedicated instance is the search head. The search heads coordinate searches across the set of indexers, consolidate their results and present them to the user. For the largest environments, several search heads sharing a single configuration set can be pooled together. With search head pooling, we can coordinate simultaneous searches across a large number of indexers. With these logical definitions, we can now mix and match the components on the same or different physical machines and configurations to suit our needs. Often, if several instances share the same machine, it will be a high-end machine with several cores and many GB of RAM. Deployment decisions are based on the amount of incoming data, the amount of indexed data, the number of concurrent users, the number of saved searches, the types of searches employed and the choice to run Splunk apps.
Note that distributed search is not about deployment; it is built-in functionality for Splunk.
In distributed search we look at use cases such as the following:
As a Splunk administrator, I want to distribute indexing and searching loads across multiple Splunk instances so that it is possible to search and index large quantities of data in reasonable time.
As a Splunk administrator, I want to use distributed search to control access to indexed data, such that admins and security personnel have full access while others have access only to their own data.
As a Splunk administrator, I want to have different Splunk instances in different geographical offices, such that local offices have access to their own data while the centralized headquarters has access at the corporate level.
In distributed search, the Splunk Enterprise instance that does the searching is referred to as the search head, and the instances that participate in the search are called search peers. Typically, the search head will run its search across all of its search peers unless limited otherwise, for example by the splunk_server field in the query. If we want to pick a subset of peers for searches, we can do this with search head pooling.
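For instance (the peer names here are hypothetical), a query can be limited to particular peers with the splunk_server field:
index=main error (splunk_server=peer01 OR splunk_server=peer02)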
 When a distributed search is initiated, the search head replicates and distributes its knowledge objects to its search peers. Knowledge objects include saved searches, event types and other entities used in searching across indexes. This is packaged in a knowledge bundle and distributed to the peers.
Knowledge bundles typically reside under the $SPLUNK_HOME/var/run folder and have a .bundle extension. Bundles can contain almost all the contents of the search head's apps. Instead of copying the bundles over to the different peers, it is sometimes advisable to mount the knowledge bundles. Along with the bundle, user authorization flows to the peers: the user running the search, that user's role, and the location of the distributed authorize.conf file are included with the bundle.