Thursday, March 27, 2014

We look at Warshall's algorithm for transitive closure. In a directed or undirected graph, we ask whether there is a path from node i to node j; that question defines the transitive closure. As we may recall from previous posts, we represent a graph with an adjacency matrix. For a graph with n nodes, the adjacency matrix answers whether there is an edge between node i and node j; for both directed and undirected graphs, a value of 0 indicates there is no edge. The transitive closure tells us whether there is a path, of any length, between node i and node j. We could apply a brute force method of enumerating all n^2 pairs of vertices and searching for a path for each pair. Fortunately, Warshall's algorithm does better by using dynamic programming, in particular by saving the previously found paths so that we can compute the transitive closure efficiently. For nodes i and j, we first ask whether there is a path through node 1 only. Next we ask whether there is a path through nodes 1 and 2, then through nodes 1, 2 and 3, and so on.
Warshall's algorithm has applications in network routing and reachability problems. It is also used to eliminate states.
The algorithm is elaborated as follows:
Algorithm Warshall(A[1..n, 1..n])
{
   R(0) = A  // R(0) is the adjacency matrix: paths with no intermediate vertices
   for k = 1 to n do
      for i = 1 to n do
         for j = 1 to n do
            R(k)[i, j] = R(k-1)[i, j] OR
                         (R(k-1)[i, k] AND R(k-1)[k, j])
                         // path from i to k through 1, 2, ..., k-1
                         // and path from k to j through 1, 2, ..., k-1
   return R(n)
}
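
For a concrete illustration, here is a minimal Python sketch of the same algorithm; the 0-indexed matrix representation and the names are my own choices:

def warshall(adj):
    # adj: n x n 0/1 adjacency matrix (0-indexed here, unlike the
    # 1-indexed pseudocode above); returns the transitive closure
    n = len(adj)
    r = [row[:] for row in adj]      # R(0) is a copy of the adjacency matrix
    for k in range(n):               # allow vertex k as an intermediate
        for i in range(n):
            for j in range(n):
                # a path i -> j exists if it already did, or if there is
                # a path i -> k and a path k -> j
                r[i][j] = r[i][j] or (r[i][k] and r[k][j])
    return r

# edges 0 -> 1 and 1 -> 2 imply a path 0 -> 2
print(warshall([[0, 1, 0], [0, 0, 1], [0, 0, 0]]))  # row 0 becomes [0, 1, 1]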

Wednesday, March 26, 2014

I came across an interesting design pattern for web applications that I hadn't seen before. Essentially, the model and the view are decoupled: the model can vary for the same view, so the view is reused while the models differ. The routes and the models are registered together, varying by purpose, and are maintained in a configuration file; the same view can load any of these models.
There is a single dispatcher that takes care of finding the handler for the model. The handler has methods to generate the content, or model, which is then converted to a JSON document and rendered by the web interface. The web interface is written in JavaScript, so all the rendering of the view and the highlighting is taken care of on the client side.
Compared with the conventional model-view-controller with explicit routes, this is a fairly short way to get server data to the client without sacrificing the form or functionality of the web interface. And it's extensible too, given the types of models supported. Data is packaged in an Atom entry, with one entry for each configuration item.
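
As a rough sketch of the shape of this pattern in Python (all the names and the config structure here are hypothetical, just to illustrate registration plus dispatch, not the actual implementation):

import json

# hypothetical configuration: each route is registered with a model handler
CONFIG = {
    "/users": lambda: {"title": "Users", "items": ["alice", "bob"]},
    "/jobs":  lambda: {"title": "Jobs", "items": ["indexer", "search"]},
}

def dispatch(route):
    # the single dispatcher looks up the handler for the route, builds the
    # model, and returns it as a JSON document; the JavaScript view on the
    # client renders whichever model it receives
    handler = CONFIG.get(route)
    if handler is None:
        return json.dumps({"error": "no handler for " + route})
    return json.dumps(handler())

print(dispatch("/users"))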
In terms of operations, GET and POST are handled as actions requiring different capabilities. These are managed by an access control mechanism called the admin manager.
The access control is exercised based on users and resources. The web interface makes purely REST-based calls. In this way the interface is a separate layer and the server does most or all of the processing.
Dijkstra's algorithm for single-source shortest paths is a greedy algorithm. Here we find the shortest paths from a single source in a weighted directed graph where the weights are positive. (An undirected graph is handled by replacing each undirected edge with two directed edges.) There can be multiple paths with different weights between two vertices, but the shortest path is the one with the smallest total weight. To find the shortest path, we relax the edges: for two vertices u and v, we take the minimum cost of the edges joining them and set that as the edge weight. If the shortest path consists of edges v1, v2, v3, ..., vn, any contiguous subset of these is also a shortest path. We then look for the possibility of a shorter path.
Each new shortest path extends an earlier one, so we build the shortest paths in ascending order of cost. When we add a vertex, we look at its cost; if there are no edges to that vertex, the cost is infinity.
In the algorithm, we first initialize the costs. Having picked a source vertex, we do the following for each of the other vertices.
We choose a vertex u that is not already in the set and has the minimum cost, and we mark it as included in the shortest path. For each neighbor w of u, we determine the cost of reaching w from the set S through the edge from u to w, and then relax the edge by keeping the minimum of the two costs.
In this way we include all vertices, one by one, in ascending order of cost.
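
Here is a minimal Python sketch of this procedure, using a heap to pick the minimum-cost vertex not yet in the set; the dict-of-adjacency-lists representation and the names are my own choices:

import heapq

def dijkstra(graph, source):
    # graph: {u: [(v, weight), ...]} with non-negative weights;
    # returns a dict of shortest-path costs from source
    dist = {source: 0}
    heap = [(0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)        # vertex with minimum cost so far
        if u in done:
            continue
        done.add(u)                       # mark u as included in the set
        for v, w in graph.get(u, []):     # relax each edge (u, v)
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist

g = {"a": [("b", 4), ("c", 1)], "c": [("b", 2)], "b": []}
print(dijkstra(g, "a"))  # {'a': 0, 'b': 3, 'c': 1}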

Tuesday, March 25, 2014

While we discussed topological sort in the post before last, I wanted to take a minute to discuss the relative advantages of breadth first traversal and depth first traversal.
As we know, DFS uses an implicit agenda: the recursion stack. It recurses from the root to the leaves, and it typically uses less memory. A BFS, on the other hand, is exploratory: it covers all the nodes at a given level before moving deeper.
In depth first search, the recursion termination is specified first. This takes care of covering the end case. Once the end case is specified, the progress can be specified in terms of a smaller subspace.
Take the example of finding the factorial of a number. This has a recursive solution, but let us study its complexity; we can write the complexity as a recurrence as well.
If we take the complexity M(n) as the number of basic operations on input n,
M(n) = M(n-1) + 1 for n > 0
and
M(0) = 1, which is the initial condition and base case.
We use backward substitution:
M(n) = M(n-1) + 1 = (M(n-2) + 1) + 1
     = M(n-2) + 2 = (M(n-3) + 1) + 2
     = M(n-3) + 3 = ... = M(n-i) + i
     = M(n-n) + n = M(0) + n
     = 1 + n = O(n) multiplications
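
For reference, here is the recursive factorial that this recurrence counts, with one multiplication per recursive call:

def factorial(n):
    # base case: 0! = 1 (the initial condition M(0) above)
    if n == 0:
        return 1
    # one multiplication plus whatever factorial(n - 1) costs: M(n) = M(n-1) + 1
    return n * factorial(n - 1)

print(factorial(5))  # 120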
In general we can study the complexity of recursive algorithms such as DFS by following these steps:
Identify the input size.
Identify the basic operations.
Find the dependence on the input size (the worst case, the best case).
Write the recurrence relation with initial conditions to compute the total number of basic operations.
Then solve the recurrence.
In breadth first search, we rely on a collection or container such as a queue.
We determine the complexity based on the size of and the operations on this container.
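
Here is a minimal Python sketch of BFS with a queue as the container; each vertex enters the queue once and each edge is examined once, which is where the complexity comes from:

from collections import deque

def bfs(graph, start):
    # graph: {u: [v, ...]} adjacency lists; returns vertices in BFS order
    visited = {start}
    queue = deque([start])           # the explicit container BFS relies on
    order = []
    while queue:
        u = queue.popleft()          # O(1) per vertex, n dequeues overall
        order.append(u)
        for v in graph.get(u, []):   # each edge looked at once: O(m) total
            if v not in visited:
                visited.add(v)
                queue.append(v)
    return order

print(bfs({"a": ["b", "c"], "b": ["d"], "c": [], "d": []}, "a"))  # ['a', 'b', 'c', 'd']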

Monday, March 24, 2014

I want to continue my discussion of MSMQ trigger integration in Splunk monitoring. I've looked at the sample code for the MSMQ Trigger. We use the methods to add rules and add triggers. To construct a rule, we initialize it first and then specify the condition and action pair. We do this with parameters to the add-rule method: the name of the rule, the description of the rule, the condition (which evaluates to true if left empty), the action (which is specified with the keyword EXE for an executable and its arguments), as well as a reference parameter to take the GUID of the resulting rule.
Similarly, the trigger object takes the queue name, the rule object, and a flag to denote whether it is a receive or a peek operation.
This trigger invocation can take message properties in both the condition and the action. As we saw earlier, this means we can use the message labels, the message body, the priority, the source machine ID and even any app-specific labels.
Note that the matching of the text is handled by the MSMQ trigger, thus reducing the onus on the executable that is invoked. I plan to try out this service, but even if we specify a single rule to forward everything, we get called per message. So we can determine which queue a message belongs to and how to process it, including selective actions based on filtering. That is, we can move the post-peek message processing between the trigger and the executable.
Lastly, the trigger may or may not be available in all instances, so having the MSMQ monitor in Splunk be able to read the queues itself is desirable. This means that we can have a layered approach in the monitor where the messages from both the monitor's reader and the executable are processed by a common base layer for modular input to Splunk.
Also, I wanted to add that the rules in the MSMQ trigger can have a many-to-one relationship: a host can have many queues, a single queue can have many triggers, a trigger may have more than one rule, and a rule may have more than one condition. The processing of the queues and the rules should scale.
Today we look at another example of a decrease and conquer algorithm: topological sort, which we will cover in two posts. The first algorithm is based on depth first search. Topological sort works on a directed graph, one where each edge can be traversed in one direction but not the other. As usual we represent these with an adjacency matrix; bidirectional travel requires explicit edges in both directions. A directed edge from u to v will have the exit number of v less than that of u. We justify this claim with the following two cases. In the first case, say v was unmarked when DFS explores the edge from u; then v comes off the stack first, so its exit number is smaller than that of u. In the second case, the depth first search visits a marked vertex v; then v is already off the stack and finished, so its exit number is again less than that of u.
This gives us a DFS-based algorithm for topological sort:
We run DFS(G) and compute exit numbers.
If DFS(G) discovers a back edge, then G is not a DAG, so we report this as an error and exit.
We list the vertices in decreasing order of exit number. This puts a source vertex first and a sink vertex at the end.
The worst case complexity of this DFS-based algorithm for topological sort is the cost of
DFS + sorting.
The DFS takes O(n^2) with an adjacency matrix and O(n+m) with adjacency lists.
Sorting would take O(n log n), but DFS already generates the vertices in order of exit, so we don't really need to sort; we just reverse the list.
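
Here is a minimal Python sketch of this DFS-based version, assuming a dict-of-adjacency-lists representation (the names are my own). Vertices are appended as they finish, and reversing that list gives the decreasing exit-number order:

def topological_sort_dfs(graph):
    # graph: {u: [v, ...]} adjacency lists of a directed graph
    WHITE, GRAY, BLACK = 0, 1, 2     # unvisited / in progress / finished
    color = {}
    order = []

    def dfs(u):
        color[u] = GRAY
        for v in graph.get(u, []):
            if color.get(v, WHITE) == GRAY:   # back edge found: a cycle
                raise ValueError("graph is not a DAG")
            if color.get(v, WHITE) == WHITE:
                dfs(v)
        color[u] = BLACK
        order.append(u)              # u comes off the stack: record its exit

    for u in graph:
        if color.get(u, WHITE) == WHITE:
            dfs(u)
    return order[::-1]               # decreasing exit number = topological order

print(topological_sort_dfs({"a": ["b", "c"], "b": ["c"], "c": []}))  # ['a', 'b', 'c']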
We now look at the decrease and conquer algorithm.
Every DAG has a source vertex, that is, a vertex with no incoming edges. If there is a cycle, then the graph is no longer a DAG.
So we can enumerate a source vertex first. Then we remove the source vertex, and what remains should still be a DAG. This gives us the construct for a decrease and conquer algorithm. We write it this way:

while graph is not empty {
   identify a source vertex v
   append v to the topologically sorted list
   remove v from G
}

We find the source vertices by computing the indegree of each vertex:
First we set the indegree to zero for all vertices.
Then, for each edge into a vertex, we increment that vertex's indegree by 1.
Then we identify each source vertex v, where the indegree is zero, and append v to a queue.

while (queue is not empty)
{
   v = head of queue; enumerate v; delete head of queue;
   for each (v, w) in E // as many outbound edges
   {
      indegree[w] = indegree[w] - 1;
      if (indegree[w] == 0) { append w to queue; }
   }
}

Notice that the initialization steps take O(n) and O(m).
The while loop takes O(n) overall, while the inner for loop takes O(deg(v)) time per vertex.
If we use adjacency lists, the complexity comes to O(n+m), down from O(n^2) with an adjacency matrix.
Topological sorting has a practical application in scheduling tasks while respecting dependencies, and we have now seen two different algorithms for it.
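
Here is a minimal Python sketch of this queue-based, decrease and conquer version (often called Kahn's algorithm); the dict-of-adjacency-lists representation and the names are my own choices:

from collections import deque

def topological_sort_kahn(graph):
    # graph: {u: [v, ...]} adjacency lists of a directed graph
    indegree = {u: 0 for u in graph}
    for u in graph:
        for w in graph[u]:
            indegree[w] = indegree.get(w, 0) + 1   # count inbound edges
    # every vertex with indegree zero is a source
    queue = deque(u for u in indegree if indegree[u] == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)                  # enumerate the source
        for w in graph.get(v, []):       # removing v drops its outbound edges
            indegree[w] -= 1
            if indegree[w] == 0:
                queue.append(w)          # w has become a source
    if len(order) < len(indegree):
        raise ValueError("graph is not a DAG")  # a cycle blocked some vertices
    return order

print(topological_sort_kahn({"a": ["b"], "b": ["c"], "c": []}))  # ['a', 'b', 'c']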

Sunday, March 23, 2014

I want to take a short break to discuss the MSMQ implementation and how to read queues for data input to Splunk. I looked at the MSMQ .NET library implementation to see if there's a way to use the native MSMQ trigger notification from the .NET library and, if so, how to use it. I want to discuss the MSMQ trigger and System.Messaging separately. The first thing to note is that System.Messaging does not support the MSMQ trigger, and the MSMQ Trigger does not expose any programmable components. The MSMQ trigger can, however, invoke an executable (hint hint) to relay messages to. This is useful for notification-based message processing and reduces the effects of polling: when the rules of a trigger are evaluated, the executable is invoked. I'm thinking of an executable that forwards messages to a modular input for Splunk. The modular input could itself read the messages from the queues directly via the native MQReceiveMessage API, or process the messages forwarded by an executable that the MSMQ trigger invokes. In both cases it can serve messages to Splunk, and in the latter case it would be more performant. Please note that MSMQ Trigger is standalone and must be installed as a service by the administrator.
The System.Messaging library transparently exposes the underlying Windows Message Queuing APIs. For example, it provides a GetPublicQueues method that enumerates the public message queues. It takes message queue criteria as a parameter; the criteria can be specified with parameters such as category and label, and can also take the machine name or cluster name and the created and modified times as filter parameters. GetPublicQueuesEnumerator is available to provide an enumerator to iterate over the results.
The message queues can take rules for each of the triggers, and we can specify these in ways that Splunk users would like to denote. A rule is written as a condition that tests the properties of the message in terms of the attributes mentioned, and an action that describes an executable and its parameters, which can include message properties. For example, a rule might look like this:
SampleRule This_is_a_sample  $MSG_LABEL_DOES_NOT_CONTAIN="Redacted"  EXE c:\foo.exe MSMQTriggerObjects.MSMQRuleHandler 0, RuleId
Other condition parameters include:
$MSG_LABEL_CONTAINS
$MSG_LABEL_DOES_NOT_CONTAIN
$MSG_BODY_CONTAINS
$MSG_BODY_DOES_NOT_CONTAIN
$MSG_PRIORITY_EQUALS
$MSG_PRIORITY_NOT_EQUAL
$MSG_PRIORITY_GREATER_THAN
$MSG_PRIORITY_LESS_THAN
$MSG_APPSPECIFIC_EQUALS
$MSG_APPSPECIFIC_NOT_EQUAL
$MSG_APPSPECIFIC_GREATER_THAN
$MSG_APPSPECIFIC_LESS_THAN
$MSG_SRCMACHINEID_EQUALS
$MSG_SRCMACHINEID_NOT_EQUAL
Triggers can be created with a queue path and peek/receive flags.
A rule consists of a name, description, condition, action, and a reference parameter. Passing an empty string for the condition means it is true for all cases. For details, we can refer to some examples on CodeProject.
This gives the executable a notification mechanism for invocations, along with the message from the last time the condition was satisfied.