Sunday, March 16, 2014

We look at a job scheduling problem on a single machine. Each job i takes unit time, must be finished by a deadline di, and has a profit pi; we earn the profit only if job i is completed before di. The machine can do only one job at a time. The goal is to select a feasible subset of jobs that maximizes the total profit, where a feasible set is a subset whose jobs can all be finished within their deadlines.
Take an example with jobs as follows:

Job        1      2      3      4
Profit     100    10     15     27
Deadline   2      1      2      1
When we enumerate the possible solutions, we see that the candidate {1, 4}, with processing order 4, 1 and value 127, is the best solution.
The greedy strategy is to consider jobs in order of decreasing profit, i.e., 1, 4, 3, 2.
The algorithm is as follows
GreedyJob(d, p, n)
{
    J = {1};
    for i = 2 to n do
    {
        if all deadlines in J U {i} are feasible then
            J = J U {i};
    }
    return J;
}

We start J with the most profitable job and then grow J one job at a time.
The algorithm relies on the following theorem:
if J = {1, 2, ..., k} with deadlines d1 <= d2 <= ... <= dk,
then J is feasible iff the order 1, 2, ..., k is feasible.
This gives a simple test: to check that J is feasible, sort it by deadline and check that di >= i for every position i, since position i is the time by which the i-th job must finish.
To extend the current solution J = j1, j2, ..., jk with d1 <= d2 <= ... <= dk to J U {i}, we find the position where i would be inserted and check that di is greater than that position. The jobs after the insertion point were feasible in their old positions, but since each of them is pushed one slot later, they need to be checked again.
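Here is a minimal Python sketch of the algorithm above, using the sorted-deadline feasibility test (the function and variable names are my own, and jobs are assumed pre-sorted by decreasing profit):

def greedy_jobs(profits, deadlines):
    # jobs indexed 0..n-1, already sorted by decreasing profit
    selected = [0]                    # start with the most profitable job
    for i in range(1, len(profits)):
        candidate = selected + [i]
        # feasibility test: sort by deadline, check d >= finishing position
        candidate.sort(key=lambda j: deadlines[j])
        if all(deadlines[j] >= pos + 1 for pos, j in enumerate(candidate)):
            selected = candidate      # keep jobs in deadline (processing) order
    return selected

# the example above, reordered by decreasing profit: jobs 1, 4, 3, 2
print(greedy_jobs([100, 27, 15, 10], [2, 1, 2, 1]))   # [1, 0]: job 4 then job 1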

Here we looked at a single machine. Comparing single-machine and multi-machine scheduling: the former uses durations while the latter uses (start, end) pairs, and the former uses 0/1 profit while the latter uses penalties for delays. In general, job scheduling is not necessarily solved greedily; it can be solved with dynamic programming as well. Depending on the parameters of the scheduling problem, different algorithms may be required, and if we use a greedy solution, we must give a proof of optimality.
Today I will take a short break to discuss another modular input for Splunk. We will look at how to tie RSS feeds into Splunk. This is another input that makes subscriptions easier, and the same is true for any kind of notification service. The advantage of this kind of input is that the application doesn't have to spin to poll.
This saves CPU. If you look at the CPU usage of something like the Microsoft Outlook application, it is very low even under high load. I'm not saying that's enabled merely by a push model, but it certainly contributes to it.
Another example of the event notification model is the .NET API for MSMQ. These APIs were designed later than the Windows ones and come with a better model for sending to and receiving from message queues. The event notification model of the C# language is also very helpful in this regard.
Consider the problem of reading the current message from multiple queues, and ask which API model best suits this need: we would like an event notification for any of the queues. This model serves us very well; otherwise we have to wake up, poll the queues, and service those that have messages. Since the queues can vary in size and in traffic, we want a solution that is both scalable and resource-savvy. In this case we can treat all the queues and all the messages uniformly, whether they are transactional or non-transactional, binary or XML messages, etc., and the task of reading a message from a queue can be delegated to different workers.
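This is not MSMQ itself, but the blocking-receive pattern can be sketched in Python; the queue names and handler below are illustrative assumptions, not a real messaging API:

import queue, threading

queues = {name: queue.Queue() for name in ("orders", "alerts", "logs")}

def handle(name, message):
    print(f"servicing {message!r} from {name}")

def worker(name, q):
    while True:
        message = q.get()     # blocks until a message arrives; no polling loop
        handle(name, message)
        q.task_done()

for name, q in queues.items():
    threading.Thread(target=worker, args=(name, q), daemon=True).start()

queues["orders"].put("order #1")   # producers push; a worker wakes on arrival
queues["orders"].join()            # wait until the message has been serviced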

Friday, March 14, 2014

In today's post we look at some more greedy algorithms, again based on the mecr.org lectures. The knapsack problem is a well-known problem for discussing greedy algorithms. It is stated this way: there are bags of chemicals, each with a specific weight and price. The goal is to fill the knapsack so as to maximize the price. Weights are measured in kilograms and prices in dollars, and if you have more chemicals than can fit in the knapsack, you have to make choices.
We do this by arranging the bags in order of priority, measured as the price-to-weight ratio, and we fill the knapsack by picking items in decreasing order of this ratio. There may be a point where we cannot add a chemical without splitting it into a smaller portion; all the chemicals beyond that one cannot be considered, since our knapsack is full.
I wanted to finish this post earlier but got sidetracked; I will do so now.
We have taken items by unit cost: we sorted our bags so that P1/W1 >= P2/W2 >= ... >= PN/WN and go item by item in that order. We pick up as much of the first bag as possible, typically all of it, and continue down the line. We stop at some bag j, possibly taking only a fraction of it, and ignore the rest, bags j+1 to N.
That is, we have a sequence of 1s for all the bags we selected in full, possibly a fraction for bag j, and a sequence of 0s for all the bags we ignored.
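A minimal Python sketch of this greedy strategy, assuming prices and weights are parallel lists (the names and sample numbers are mine, not from the lecture):

def fractional_knapsack(prices, weights, capacity):
    # consider bags in decreasing order of price-to-weight ratio
    order = sorted(range(len(prices)),
                   key=lambda i: prices[i] / weights[i], reverse=True)
    total, fractions = 0.0, [0.0] * len(prices)
    for i in order:
        take = min(weights[i], capacity)   # all of the bag, or whatever fits
        fractions[i] = take / weights[i]   # 1 for full bags, else a fraction
        total += prices[i] * fractions[i]
        capacity -= take
        if capacity == 0:                  # knapsack full; the rest stay at 0
            break
    return total, fractions

# e.g. prices 60, 100, 120, weights 10, 20, 30, capacity 50:
# total 240.0 with fractions 1, 1, 2/3
print(fractional_knapsack([60, 100, 120], [10, 20, 30], 50))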
We now show the proof of optimality. We use a standard technique: transform any optimal solution into the greedy one without losing profit.
Let y = y1, y2, ..., yN be an optimal solution, and let x = x1, x2, ..., xN be the greedy choices, with y != x.
Let k be the smallest index such that yk != xk. The claim is that yk < xk. Let j be the index of the last bag the greedy picks, the one possibly taken only in part. There are three cases:
For k < j we have xk = 1, so yk != xk implies yk < 1 = xk.
For k = j, xj is the maximum amount that still fits given the earlier choices, on which x and y agree; yj > xj would therefore be infeasible, so yj < xj.
For k > j, a difference is impossible: up to j, y and x agree, and after j, x is 0; but x has already filled the knapsack, so y must also be 0 after that point.
Our goal is to transform y into x. We have yk < xk: increase yk to xk, and reduce yk+1, ..., yN just enough to restore feasibility, so the total weight comes back to the capacity.
Call the resulting feasible solution z = z1, z2, ..., zN. For 1 <= i <= k we have zi = xi, and the weight added at position k equals the weight removed after it:
wk (xk - yk) = sum over i = k+1 to N of wi (yi - zi)
Since pk/wk >= pi/wi for every i > k, z earns at least as much profit as y. Repeating this exchange turns y into x without ever decreasing the profit, so the greedy solution is optimal.
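A small numeric check of this exchange step on made-up data (the instance and the index k below are only illustrative):

p = [60, 100, 120]        # prices, already sorted by decreasing p/w
w = [10, 20, 30]          # weights; capacity is 50
x = [1.0, 1.0, 2/3]       # the greedy fractions
y = [1.0, 0.5, 1.0]       # another feasible solution of total weight 50

k = 1                                  # first index where y differs from x
z = y[:]
z[k] = x[k]                            # raise y[k] up to x[k]
excess = w[k] * (x[k] - y[k])          # weight added at position k
for i in range(k + 1, len(p)):         # shave the same weight off later bags
    cut = min(z[i] * w[i], excess)
    z[i] -= cut / w[i]
    excess -= cut

profit = lambda f: sum(pi * fi for pi, fi in zip(p, f))
print(profit(y), profit(z))            # 230.0 240.0 -- z is never worse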



Thursday, March 13, 2014

Today we look at the greedy method for job scheduling based on deadlines, with an optimization. We first sort by profit, which is O(n log n), and then insert each job into the current solution J; this insertion is the main concern. We maintain J in sorted order of deadline. The worst case happens when each new job is feasible and has a smaller deadline than everything in J.
To check the feasibility of job i, we insert it into a J of size i - 1.
This takes i - 1 steps.
Therefore the total cost of insertion is 1 + 2 + ... + (n - 1) = O(n^2).
The final complexity, sorting plus insertion, comes out to O(n^2). The bottleneck is inserting each job into J.
This can be addressed by assigning a once-and-for-all slot to each new job. The addition of a job i with deadline di then takes less time, because we find the rightmost free slot before di; call this slot alpha-i. We schedule i at slot alpha-i, and since it is the rightmost free slot we never need to move the job afterwards: every job is scheduled as late as possible given the current constraints, no job ever moves to the right, and we never relax the constraints. Time slot i is actually the range [i - 1, i]. Let ni be the first free slot at or to the left of i, so ni <= i. Slots i and j belong to the same set exactly when ni = nj, so a whole run of slots {i, i + 1, ..., j} with the same first free slot forms one set. For a set k in general, F(k) is the common value ni, the latest slot free for all of k. No job ever needs to be scheduled beyond the largest deadline, so the time slots are 1, 2, ..., D, where D is the maximum deadline over all jobs. Initially all slots are free, so ni = i for all i, the sets are {1}, {2}, ..., {D}, and F({j}) = j. To add a job with deadline d, we compute nd: find the set k such that d belongs to k, and
nd = F(k). If F(k) is zero the job is not feasible. Otherwise we schedule the new job at time F(k). Now we have to update F(k): the next free slot must be further to the left, so the new value must be at most F(k) - 1, and we know some set contains that value. So we just have to merge the sets: merge set k with the set containing F(k) - 1. With n jobs we perform O(n) find and union operations on this disjoint-set structure.
We can explain this with an example. We have six jobs with deadlines 2, 1, 2, 6, 1, 6. We create a line of time slots and add a zero slot to it, so we have slots 0, 1, 2, 3, 4, 5, 6, initial sets K = {0}, {1}, {2}, {3}, {4}, {5}, {6}, and F values 0, 1, 2, 3, 4, 5, 6 respectively.
The first job has deadline 2, and 2 is in its own set {2} with F({2}) = 2, so implicitly we assign job 1 to time slot 2. We must record that slot 2 is no longer available, which we do by merging {2} with the set containing F({2}) - 1 = 1, namely {1}. The number of sets collapses by one, giving {0}, {1, 2}, {3}, {4}, {5}, {6} with F values 0, 1, 3, 4, 5, 6. Next, job 2 has deadline 1, which belongs to {1, 2} with F = 1, so we schedule it at slot 1 and merge {1, 2} with the set containing F({1, 2}) - 1 = 0, giving {0, 1, 2}. Continuing, d[3] = 2 and 2 belongs to {0, 1, 2}, but F of this set is 0, so job 3 is not feasible and we discard it. d[4] = 6, which belongs to {6} with F = 6, so we assign it to slot 6 and merge {6} with the set containing F({6}) - 1 = 5, namely {5}, resulting in {5, 6}.
We now have sets {0, 1, 2}, {3}, {4}, {5, 6} with F values 0, 3, 4, 5 respectively. d[5] = 1 and 1 belongs to {0, 1, 2} with F = 0, so it is not feasible. Finally, d[6] = 6 and 6 belongs to {5, 6} with F = 5, so we implicitly schedule job 6 at slot 5 and merge {5, 6} with the set containing 4. There are no more jobs; we have picked jobs 1, 2, 4, and 6, and we have a schedule.
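A sketch of this disjoint-set scheme in Python, assuming unit-time jobs already sorted by decreasing profit and given only by their deadlines:

def schedule(deadlines, max_deadline):
    # parent[s] points toward the latest free slot at or before s
    parent = list(range(max_deadline + 1))

    def find(s):                       # returns F of the set containing s
        while parent[s] != s:
            parent[s] = parent[parent[s]]   # path compression
            s = parent[s]
        return s

    chosen = []
    for i, d in enumerate(deadlines):
        slot = find(min(d, max_deadline))
        if slot == 0:                  # F = 0 means no free slot: infeasible
            continue
        chosen.append((i + 1, slot))   # job numbers are 1-based
        parent[slot] = slot - 1        # merge with the set containing F - 1
    return chosen

# the example above: deadlines 2, 1, 2, 6, 1, 6
print(schedule([2, 1, 2, 6, 1, 6], 6))   # [(1, 2), (2, 1), (4, 6), (6, 5)]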

Wednesday, March 12, 2014

We will cover some more discussion of the traveling salesman problem. We have discussed memoization and branch and bound so far; we will consider NP-completeness today. Problems in the class P are yes/no questions for which a deterministic polynomial-time algorithm is known. Problems in the class NP are yes/no questions for which a non-deterministic polynomial-time algorithm is known. The power of a non-deterministic algorithm is that it can guess among many different choices for the solution, and if one of the choices is correct, it has the magical ability to guess that particular choice.
Note that the traveling salesman problem is not phrased as a yes/no problem. Instead it is stated as: given a weighted undirected graph, find a tour visiting each vertex exactly once, returning to the start, with minimum total weight. But we can rephrase it as a yes/no problem by asking whether there is a tour of weight w <= k.
Moreover, we can
guess the order in which to tour the vertices, and
check that the total weight is <= k.
Note that the answer depends on the value of k. If no choice of tour has total weight <= k, then no matter how we guess, the total weight will be greater than k and we will answer no.
Now recall that P is a subset of NP: every problem in P is also in NP. We have not been able to find deterministic polynomial-time algorithms for all problems in NP, which begs the question whether P = NP. What we do know in addition is that there is a class of problems believed to be genuinely hard to solve; these are called NP-complete problems, and they lie on the perimeter of the NP space.
We will now see whether the TSP problem is NP-complete.
We answer the question: is there a TSP tour of weight <= k in G?
We ask a different question for NP-completeness:
Given a graph G with n vertices, is there a path in G of length n - 1? This is a well-known NP-complete problem, referred to as the Hamiltonian path problem.
Suppose there is a deterministic polynomial-time algorithm TSP(G, k).
Then we show that there is a deterministic polynomial-time algorithm HAMPATH(G). If we can accomplish that, then TSP is at least as hard as the Hamiltonian path problem, which we stated as NP-complete; together with TSP being in NP, this makes TSP NP-complete.
We will now see how to use TSP(G, k) to solve the Hamiltonian path problem. We have to answer the question: is there a path of length n - 1 in a given n-vertex graph G?
We add an extra vertex and connect it to each of the other vertices with a new edge; this does not take long. We call the resulting graph G', where every edge has weight 1 and missing edges have weight infinity.
We now write the HAMPATH(G) steps as
   construct G' as above
   return TSP(G', n + 1)
Now G has a path of length n - 1 if and only if G' has a tour of weight <= n + 1.
In other words, the tour spends two edges to reach and leave the new vertex, and the remaining n - 1 edges form the path in the graph G.
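The reduction can be sketched in Python; the brute-force tsp_at_most below is only a stand-in for the hypothetical polynomial-time TSP oracle:

from itertools import permutations

INF = float("inf")

def tsp_at_most(w, k):
    # stand-in for TSP(G', k): is there a tour of total weight <= k?
    n = len(w)
    for perm in permutations(range(1, n)):
        tour = [0] + list(perm) + [0]
        if sum(w[a][b] for a, b in zip(tour, tour[1:])) <= k:
            return True
    return False

def hampath(edges, n):
    # build G': n + 1 vertices, the extra vertex n joined to every other;
    # present edges get weight 1, missing edges weight infinity
    w = [[INF] * (n + 1) for _ in range(n + 1)]
    for u, v in edges:
        w[u][v] = w[v][u] = 1
    for v in range(n):
        w[n][v] = w[v][n] = 1
    return tsp_at_most(w, n + 1)

# the path 0-1-2-3 is a Hamiltonian path, so this prints True
print(hampath([(0, 1), (1, 2), (2, 3)], 4))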

We continue to take a look at the traveling salesman problem today. We can improve the memoization with branch and bound. The idea is that the recursive call in the earlier memoization step can be very expensive, so we make it faster: we first look up a table to see if the value was previously computed, and use it directly if so. If not, instead of immediately making the expensive recursive call, we compute a cheap lower bound for the subproblem. Only if this lower bound plus the weight of the edge connecting the collected vertices with the new vertex is less than the best value computed so far do we perform the regular recursive call, as in the pseudocode and the Python sketch that follow.
for all m != 1 in S - {k}
    if M[S - {k}, m] is defined
        Best = min(Best, M[S - {k}, m] + w(m, k))
    else
        temp = LB(S - {k}, m)
        if temp + w(m, k) < Best
            Best = min(Best, Memo(S - {k}, m) + w(m, k))
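In Python the pruned loop might look like this sketch, where M is the memo table, lb a cheap lower-bound estimate, and memo the full recursion (the names follow the pseudocode; lb itself is left abstract):

def best_over(S, k, M, w, lb, memo, best):
    for m in S - {k}:
        if m == 1:
            continue
        key = (frozenset(S - {k}), m)
        if key in M:                        # previously computed: cheap lookup
            best = min(best, M[key] + w(m, k))
        else:
            bound = lb(S - {k}, m)          # cheap lower bound, no recursion
            if bound + w(m, k) < best:      # recurse only if it could improve
                best = min(best, memo(S - {k}, m) + w(m, k))
    return best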

Tuesday, March 11, 2014

We discuss the traveling salesman problem. In this problem, we have an undirected weighted graph where the vertices represent cities and the weights represent the cost to travel between cities; since the graph is undirected, the cost to travel from one city to another is the same in both directions. The goal of the problem is to minimize the cost of visiting each city exactly once in a tour: we start from a city, visit every other city, and return to the start.
We solve this problem by dynamic programming.

  • For any set S ⊆ V such that 1 ∈ S and k ∈ S, we compute a quantity M[S, k], the minimum cost of starting at vertex 1, visiting each vertex in S, and ending at k.
  • M[S, k] = w(1, k) if S = {1, k}; otherwise M[S, k] = min over m in S - {k}, m != 1, of M[S - {k}, m] + w(m, k).

We could also write an equivalent version using memoization; a runnable Python sketch follows the pseudocode.

minTSP(V):
Best = infinity
for k = 2 to n
    Best = min(Best, Memo(V, k) + w(k, 1))
return Best

Memo(S, k):

If M(S, k) is defined
    Return M(S, k)

If |S| == 2
    Return M(S, k) = w(1, k)

Best = infinity
For all m != 1 in S - {k}
    Best = min(Best, Memo(S - {k}, m) + w(m, k))
Return M(S, k) = Best
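A runnable Python version of this memoized recurrence, with cities numbered from 1 (the helper names and the sample matrix are my own choices):

from functools import lru_cache

def min_tsp(w, n):
    @lru_cache(maxsize=None)
    def memo(S, k):              # S is a frozenset containing 1 and k
        if len(S) == 2:          # base case: S = {1, k}
            return w[1][k]
        return min(memo(S - {k}, m) + w[m][k]
                   for m in S - {k} if m != 1)

    V = frozenset(range(1, n + 1))
    # close the tour: best path through all of V ending at k, plus w(k, 1)
    return min(memo(V, k) + w[k][1] for k in range(2, n + 1))

# a 4-city example; row/column 0 are unused so cities are 1-indexed
w = [[0] * 5,
     [0, 0, 10, 15, 20],
     [0, 10, 0, 35, 25],
     [0, 15, 35, 0, 30],
     [0, 20, 25, 30, 0]]
print(min_tsp(w, 4))   # 80 for this classic instance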