Monday, March 24, 2014

Today we look at another example of a decrease-and-conquer algorithm: topological sort, which we will cover in two posts. The algorithm is based on depth first search. Topological sort works on a directed acyclic graph (DAG). In a directed graph, each edge can be traversed in one direction but not the other; a bidirectional connection needs explicit edges in both directions. As usual we can represent the graph with an adjacency matrix. The key claim is that for a directed edge from u to v in a DAG, the exit number of v is less than the exit number of u. We justify this claim with the following two cases. In the first case, v is unmarked when the DFS from u explores the edge (u, v); then v goes on the stack after u and comes off the stack first, so its exit number is smaller than that of u. In the second case, the depth first search visits a vertex v that is already marked; since a DAG has no back edges, v is already off the stack and finished, so its exit number is again less than that of u.
This gives us a DFS based algorithm for topological sort.
We run DFS(G) and compute exit numbers.
If DFS(G) discovers a back edge, then G is not a directed acyclic graph, so we report this as an error and exit.
We list the vertices in decreasing order of exit number. This puts the root of the DFS tree first and guarantees that every edge points from an earlier vertex in the list to a later one.
The worst case complexity of this DFS based algorithm for topological sort is based on
DFS + sorting.
The DFS takes O(n^2) with an adjacency matrix and O(n+m) with an adjacency list.
The sorting would take O(n log n), but the DFS generates the exit numbers in increasing order, so we don't really need to sort; we just collect the vertices as they finish and reverse the list.
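As a concrete illustration, here is a minimal runnable Python sketch of the DFS based topological sort; the dict-of-adjacency-lists representation and the names are my own choices for the example, and the cycle check stands in for the back edge detection mentioned above.

def topological_sort_dfs(graph):
    """Topologically sort a DAG {vertex: [neighbours]} by reverse exit order."""
    visited, on_stack, order = set(), set(), []

    def dfs(u):
        visited.add(u)
        on_stack.add(u)                    # u has entered but not yet exited
        for v in graph[u]:
            if v in on_stack:              # back edge: the graph is not a DAG
                raise ValueError("graph has a cycle")
            if v not in visited:
                dfs(v)
        on_stack.remove(u)
        order.append(u)                    # u exits; earlier exits come first

    for u in graph:
        if u not in visited:
            dfs(u)
    return order[::-1]                     # reverse = decreasing exit number

# Example: edges a->b, a->c, c->b give the order a, c, b.
print(topological_sort_dfs({'a': ['b', 'c'], 'c': ['b'], 'b': []}))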
We now look at a decrease-and-conquer algorithm.
A source vertex is one that has no incoming edges.
Every DAG has a source vertex: if every vertex had an incoming edge, we could walk backwards forever and would eventually repeat a vertex, giving a cycle, and the graph would no longer be a DAG.
So we can enumerate a source vertex first.
Then we remove the source vertex, and what is left is still a DAG. This gives us the construct for a decrease-and-conquer algorithm. We write it this way:

while graph is not empty {
    identify a source vertex v
    append v to the topologically sorted list
    remove v from G
}

We find the sources by computing the indegree of each vertex.
First we set the indegree to zero for all vertices.
Then, for each edge, we increment the indegree of its destination vertex by 1.
Then we identify the source vertices, those whose indegree is zero, and append them to a queue.

while (queue is not empty)
{
    v = head of queue; we enumerate v; delete head of queue;
    for each (v, w) in E // one step per outbound edge of v
    {
        indegree[w] = indegree[w] - 1;
        if (indegree[w] == 0) { append w to queue; }
    }
}

Notice that the initialization steps take O(n) and O(m) time.
The while loop runs O(n) times, and for each vertex v the inner for loop takes O(deg(v)) time.
If we use an adjacency list, the complexity comes to O(n+m), since every edge is examined exactly once; with an adjacency matrix it would be O(n^2).
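To make this concrete, here is a minimal runnable Python sketch of the indegree based algorithm above; the dict-of-adjacency-lists representation and the function name are my own choices for illustration, not from the lecture.

from collections import deque

def topological_sort(graph):
    """Source-removal topological sort of a DAG given as {vertex: [neighbours]}."""
    # Initialization: indegree starts at zero for every vertex ...
    indegree = {v: 0 for v in graph}
    # ... and is incremented once per incoming edge.
    for v in graph:
        for w in graph[v]:
            indegree[w] += 1
    # Seed the queue with the source vertices (indegree zero).
    queue = deque(v for v in graph if indegree[v] == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)                # enumerate v
        for w in graph[v]:             # "remove" v by decrementing its neighbours
            indegree[w] -= 1
            if indegree[w] == 0:
                queue.append(w)
    if len(order) != len(graph):       # a leftover vertex means a cycle, not a DAG
        raise ValueError("graph has a cycle")
    return order

# Example: edges a->b, a->c, b->d, c->d give the order a, b, c, d.
print(topological_sort({'a': ['b', 'c'], 'b': ['d'], 'c': ['d'], 'd': []}))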
This topological sort has a practical application in scheduling tasks while respecting dependencies, and we have now seen two different algorithms for it.

Sunday, March 23, 2014

I want to take a short break to discuss the MSMQ implementation and how to read queues for data input to Splunk. I looked at the MSMQ .NET library implementation to see if there's a way to use the native MSMQ trigger notification from the .NET library and, if so, how to use it. I want to discuss MSMQ Trigger and System.Messaging separately. The first thing to note is that System.Messaging does not support MSMQ Trigger, and MSMQ Trigger does not expose any programmable components. What the MSMQ trigger can do is invoke an executable (hint hint) to relay messages to. This is useful for notification based message processing and reduces the effects of polling. When the rules of a trigger are satisfied, the executable is invoked. I'm thinking of an executable that forwards messages to a modular input for Splunk. The modular input could itself read the messages from the queues directly via the native MQReceiveMessage API, or it could process the messages forwarded by an executable that the MSMQ trigger invokes. In both cases it can serve messages to Splunk, and the latter case would be more performant. Please note that MSMQ Trigger is standalone and must be installed as a service by the administrator.
The System.Messaging library transparently exposes the underlying Windows Message Queuing APIs. For example, it provides the GetPublicQueues method that enumerates the public message queues. It takes a message queue criteria object as a parameter. The criteria can be specified with properties such as category and label. It can also take machine name or cluster name and created and modified times as filter parameters. The GetPublicQueuesEnumerator method is available to provide an enumerator to iterate over the results.
The message queues can take rules for each of the triggers, and we can specify these in ways that Splunk users would like to denote. A rule is written as
a condition that tests the properties of the message in terms of the attributes mentioned, and an action that describes an executable and its parameters, which can include message properties. For example:
SampleRule This_is_a_sample  $MSG_LABEL_DOES_NOT_CONTAIN="Redacted"  EXE c:\foo.exe MSMQTriggerObjects.MSMQRuleHandler 0, RuleId
Other rule parameters include
$MSG_LABEL_CONTAINS
$MSG_LABEL_DOES_NOT_CONTAIN
$MSG_BODY_CONTAINS
$MSG_BODY_DOES_NOT_CONTAIN
$MSG_PRIORITY_EQUALS
$MSG_PRIORITY_NOT_EQUAL
$MSG_PRIORITY_GREATER_THAN
$MSG_PRIORITY_LESS_THAN
$MSG_APPSPECIFIC_EQUALS
$MSG_APPSPECIFIC_NOT_EQUAL
$MSG_APPSPECIFIC_GREATER_THAN
$MSG_APPSPECIFIC_LESS_THAN
$MSG_SRCMACHINEID_EQUALS
$MSG_SRCMACHINEID_NOT_EQUAL
Triggers can be created with queue path and peek/receive flags.
A rule consists of a name, description, condition, action and a reference parameter. Passing an empty string for the condition means it evaluates to true for all messages. For details, we can refer to some examples on CodeProject.
This gives the executable a notification mechanism for invocations, along with the message from the last time the condition was satisfied.

Saturday, March 22, 2014

Having discussed the Bellman-Ford algorithm, let us quickly revisit the classical depth first search. We can apply it to model computational problems using graphs.
Graph representations usually involve an adjacency matrix, where the matrix rows and columns are vertices and there is an entry in the cell corresponding to (u,v) if and only if u and v are connected by a direct edge.
An adjacency list adj(v), on the other hand, comprises the list of neighbours of v.
In a depth first traversal, we start from a vertex v. For each neighbour w of v that is not explored yet, we recursively explore the graph from w. Depth first means we explore one branch as far as possible before moving to the next one.
That is why this algorithm can be considered a decrease-and-conquer approach.
In a DFS forest, there are tree edges and back edges. Back edges are the graph edges that do not appear in the DFS trees.
One other thing we can do in a depth first traversal is keep track of when we enter and exit each vertex. We do this with the help of a stack.
Every back edge leads to an ancestor. A question is: in a DFS forest, can there be a non-tree edge to a vertex that is not an ancestor? The answer (in an undirected graph) is no, because while one endpoint of such an edge was on the stack the other was still unexplored, and DFS would have followed the edge and added the unexplored vertex below it rather than going up. So all non-tree edges lead back to ancestors.
If we were to write pseudo code for marking the entry and exit points, it would look something like this:
DFS(G)
{
    for each v in V do { entryno[v] = 0; exitno[v] = 0; }
    entrycount = 0; exitcount = 0;
    for each v in V do
    {
        if (entryno[v] == 0) { dfs(v); }
    }
}

dfs(v)
{
    entrycount = entrycount + 1; entryno[v] = entrycount;
    for each (v,w) in E do
    {
        if (entryno[w] == 0) { dfs(w); }
    }
    exitcount = exitcount + 1; exitno[v] = exitcount;
}
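Here is a small runnable Python version of the same idea, assuming the graph is given as a dict of adjacency lists; recursion supplies the stack, and the names entryno and exitno follow the pseudocode above.

from itertools import count

def dfs_numbering(graph):
    """Return (entryno, exitno) for every vertex of graph {v: [neighbours]}."""
    entryno, exitno = {}, {}
    entry_counter, exit_counter = count(1), count(1)

    def dfs(v):
        entryno[v] = next(entry_counter)   # mark entry as v goes on the stack
        for w in graph[v]:
            if w not in entryno:           # unmarked neighbour: recurse
                dfs(w)
        exitno[v] = next(exit_counter)     # mark exit as v comes off the stack

    for v in graph:                        # cover every tree of the DFS forest
        if v not in entryno:
            dfs(v)
    return entryno, exitno

# Example: path a->b->c; a enters first and exits last.
print(dfs_numbering({'a': ['b'], 'b': ['c'], 'c': []}))

Listing the vertices in decreasing order of exitno is exactly what the DFS based topological sort needs.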

Friday, March 21, 2014

We look at Bellman-Ford algorithm today.
Dijkstra's algorithm requires us to have non-negative weights.
This algorithm lets us compute shortest paths even when some of the weights are negative. Dijkstra's algorithm does not work with negative edge weights because it does not recompute the costs between vertices as we discover intermediary nodes.
We now allow negative edges, with the caveat that negative cycles are not OK.
Without negative cycles, shortest paths are well defined.
The idea here is that for increasing lengths k, we compute shortest paths with at most k edges.
Our observation is that a shortest path can have at most n-1 edges: a path with n or more edges must contain a cycle, and since every cycle has non-negative weight, the cycle can be eliminated without increasing the cost.
The recurrence is described this way. A shortest path of at most k edges from v to u either uses at most k-1 edges, or it breaks up as a path of at most k-1 edges from v to some in-neighbour w of u followed by the edge from w to u.
We use the notation dist_k[u] for the length of the shortest path from v to u using at most k edges.
dist_k[u] = min( dist_(k-1)[u], min over each incoming edge (w,u) of ( dist_(k-1)[w] + cost[w,u] ) )
We know that dist_0[v] = 0 for the source vertex v and dist_0[u] = infinity for every other vertex u.
So we create a grid to compute the distances. We grow the path length from the source vertex to all the others, so for a seven vertex graph we have path lengths varying from 0 to 6. For each path length we compute the costs and fill in one row of the grid.
The pseudo code is like this:
BellmanFord(v, cost, dist, n) {
    for i = 1 to n do
        dist[i] = infinity;
    dist[v] = 0;

    for k = 1 to n-1 do {
        for each vertex u != v do {
            distnew[u] = dist[u];                       // best using up to k-1 edges
            for each incoming edge (i,u) do
                distnew[u] = min(distnew[u], dist[i] + cost(i,u));
        }
        for each vertex u != v do
            dist[u] = distnew[u];                       // replace for the next round
    }
}

The complexity is O(n^3) if we use an adjacency matrix
and O(nm) if we use an adjacency list.
Optimizations include the following:
At round k+1 the only distances that can change are those of vertices with an incoming edge from a vertex whose distance changed in the previous round. So we maintain a queue of those vertices whose distance was reduced in iteration k.
  We only need to update the neighbours of these vertices.
We can also terminate early if no change is seen for some k.
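To make this concrete, here is a minimal runnable Python sketch of Bellman-Ford with the early termination optimization; the edge-list representation and names are assumptions for the example. It relaxes edges in place rather than keeping a separate distnew array, a common variant that only lets distances improve sooner.

def bellman_ford(n, edges, source):
    """Shortest distances from source over vertices 0..n-1 with edges given
    as (u, v, cost) triples; negative edges allowed, negative cycles reported."""
    INF = float('inf')
    dist = [INF] * n
    dist[source] = 0
    for _ in range(n - 1):                 # a shortest path has at most n-1 edges
        changed = False
        for u, v, cost in edges:
            if dist[u] + cost < dist[v]:   # relax edge (u, v)
                dist[v] = dist[u] + cost
                changed = True
        if not changed:                    # early termination: nothing changed
            break
    # One more pass: any further improvement implies a negative cycle.
    for u, v, cost in edges:
        if dist[u] + cost < dist[v]:
            raise ValueError("graph has a negative cycle")
    return dist

# Example: 0->1 cost 4, 0->2 cost 5, 2->1 cost -3 gives distances [0, 2, 5].
print(bellman_ford(3, [(0, 1, 4), (0, 2, 5), (2, 1, -3)], 0))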
Today we look at some more lectures from mecr.org. We continue our discussion on dynamic programming. We organize the subproblems to our advantage: there are 2^N subsets, while the optimal solution involves only N of them. In dynamic programming we maintain the principle of optimality by breaking the problem into subproblems and combining their solutions. The subproblems may not be disjoint, so we cannot apply divide and conquer; further, there may not be a single strategy to use, so we cannot apply a greedy algorithm. We choose an appropriate reordering to reorganize the subproblems from exponentially many to polynomially many. We use a memo table to avoid wasteful recomputation, and we iteratively fill the table by analyzing dependencies.

Thursday, March 20, 2014

Today we cover some more discussion on dynamic programming. When we arrange the jobs in order of their finish times and use dynamic programming, we have a binary tree of jobs to traverse: binary because at each job we either include it or exclude it. Some of the nodes will appear in other places in this tree as well. This is what we call overlapping subproblems, and it costs us re-evaluation, which we try to avoid. So the strategy is to store each computed value in a table and look up the table before making a recursive call. This lookup table is called a memo table.
So we improve the algorithm we described earlier with one that uses lookups of this table.
The algorithm looks like this:
function MaxProfitMemo(n) {
    if (n == 1) {
        return (Profit(1));
    }
    let j be the last job that ends before job n begins
    if MaxProfitTable[n-1] is not defined
        MaxProfitTable[n-1] = MaxProfitMemo(n-1)
    if MaxProfitTable[j] is not defined
        MaxProfitTable[j] = MaxProfitMemo(j)
    finally return the max of MaxProfitTable[n-1] and
                              Profit(n) + MaxProfitTable[j]
}
We can actually do a further improvement: we can analyze the dependency of entries in the table.

We iterate over the entries this way:

function MaxProfitIterative(n) {
    MaxProfitTable[1] = Profit(1);

    for i = 2 to n {
        j = last job that ends before job i begins
        MaxProfitTable[i] = max(MaxProfitTable[i-1], Profit(i) + MaxProfitTable[j]);
    }
    return (MaxProfitTable[n]);
}

Note that we don't have to check whether i-1 and j are already in the table.
The complexity is O(n), and the assignment of j takes only constant time since we keep the jobs in sorted order. However, it takes O(n log n) to sort them.
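As a hedged, runnable Python sketch of this iterative algorithm (the job representation as (start, finish, profit) triples and the bisect based lookup of j are my own choices):

import bisect

def max_profit(jobs):
    """Max total profit from non-overlapping (start, finish, profit) jobs."""
    jobs = sorted(jobs, key=lambda job: job[1])     # sort by finish time
    finishes = [job[1] for job in jobs]
    # table[i] = best profit using the first i jobs; table[0] = 0 is the base case.
    table = [0] * (len(jobs) + 1)
    for i, (start, finish, profit) in enumerate(jobs, start=1):
        j = bisect.bisect_right(finishes, start)    # jobs ending by the time i starts
        table[i] = max(table[i - 1],                # exclude job i
                       profit + table[j])           # include job i
    return table[len(jobs)]

# Example: (1,3,5), (2,5,6), (4,6,5) -> take the first and third for profit 10.
print(max_profit([(1, 3, 5), (2, 5, 6), (4, 6, 5)]))

Here bisect finds j in O(log n) per job rather than constant time, but the overall O(n log n) bound from sorting is unchanged.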
We now look at an important class of algorithms called dynamic programming. These use the principle of optimality which says:
there are subproblems possible for the overall problem
These subproblems can be solved efficiently
we can combine these solutions
This can be done recursively
Note that divide and conquer algorithms were based on the assumption that the subproblems are disjoint.
Also, greedy algorithms differ from the above in that there can be different subproblems possible, but they commit to one strategy that does not change with the input.
Generally, the subproblems overlap. This is why we need to find the subproblem structure.
Secondly, there is no fixed strategy, so we may not be able to find a greedy strategy.
This means we have to exhaust all possibilities.
In dynamic programming, we take these constraints as given and try to work within them efficiently.
We do this with the following three steps:
1) we come up with a recursive formulation in terms of subproblems
2) we organize the subproblems to make the search efficient.
3) we solve subproblems and keep track of the answers to avoid recomputation.
We will now see an example with the problem of Interval scheduling.
In this problem, we have tasks with predefined start and finish times.
We look for subsets of non-overlapping tasks and
try to choose the largest such subset.
A twist to the above problem is referred to as the weighted scheduling.
In this problem, each task fetches a profit and we try to maximize not the number of jobs but the profit from those jobs.
As we may see, there is no known greedy strategy for this problem, so we will have to exhaust all combinations and choose the best; naively, that is exponentially many combinations.
We will now re-organize the combinations to make the search more efficient.
We will sort the jobs by their finish time, so that each job ends no later than the next.
We search by deciding on jobs one at a time. If a job is included, we look to see which others must be excluded. If we exclude it, we look at the remaining sub-problem recursively. The recursion terminates when there is one job, and the profit is the profit from that job.
Let MaxProfit(n) denote the max profit for jobs 1 .. n
Let Profit(i) denote the actual profit of job i
Our recursive formulation is as follows:
MaxProfit(k) = max( MaxProfit(k-1), Profit(k) + MaxProfit(j) )
where j is the latest job that ends before job k starts,
and MaxProfit(1) = Profit(1).
The first component in the recursive formulation above comes from excluding job k, and the second component comes from including job k together with the best schedule among the first j jobs.
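The recursive formulation translates almost line for line into a memoized Python sketch, again assuming (start, finish, profit) triples; lru_cache plays the role of the memo table described above.

from functools import lru_cache

def max_profit_recursive(jobs):
    """MaxProfit(k) = max(MaxProfit(k-1), Profit(k) + MaxProfit(j)) in Python."""
    jobs = sorted(jobs, key=lambda job: job[1])    # jobs 1..n by finish time

    def latest_compatible(k):
        # j = the latest job that ends before job k starts (0 if none).
        start = jobs[k - 1][0]
        return max((i + 1 for i in range(k - 1) if jobs[i][1] <= start),
                   default=0)

    @lru_cache(maxsize=None)                       # the memo table
    def max_profit(k):
        if k == 0:
            return 0                               # no jobs, no profit
        exclude = max_profit(k - 1)                # first component: drop job k
        include = jobs[k - 1][2] + max_profit(latest_compatible(k))
        return max(exclude, include)

    return max_profit(len(jobs))

print(max_profit_recursive([(1, 3, 5), (2, 5, 6), (4, 6, 5)]))  # prints 10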