Wednesday, October 16, 2013

Non-token-based solutions to mutual exclusion in distributed computing include Lamport's algorithm. In the centralized approach there was a single pending-request queue which allowed processes to enter their critical section on a first-come, first-served basis. This queue is just a data structure, and like any other data structure, say an integer supporting increment or double operations, it can be implemented in a distributed fashion. We can use this same strategy to distribute the pending-request queue.
Each process P keeps a queue (reqQ) of time-stamped requests for the critical section, sorted in ascending order, together with the latest known time of every other process.
To request entry to its critical section, P adds <req, t> to its own reqQ and broadcasts it to all other processes.
When Pj receives a request, it adds the request to its own reqQ and returns a timestamped acknowledgement.
To enter the critical section, P must have its request at the head of its reqQ, and the request's timestamp must be smaller than the known times of all other processes.
On the other hand, to release the critical section, P removes its request from its reqQ and broadcasts a release message to all the other processes.
When a process receives a release message, it removes P's request from its reqQ. Note that this may allow the receiving process to enter its critical section, since its own request may now carry the minimum timestamp.
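A minimal sketch of one such process, in Python, might look like the following. The broadcast and send helpers are hypothetical stand-ins for whatever messaging layer is available, ties between equal timestamps are broken by process id, and the logical clock follows Lamport's usual update rule.

import heapq

class LamportMutex:
    # Sketch of one process in Lamport's mutual exclusion algorithm.
    def __init__(self, pid, peers, broadcast, send):
        self.pid = pid
        self.peers = peers                        # ids of all other processes
        self.broadcast = broadcast                # broadcast(msg): send msg to every peer
        self.send = send                          # send(dest, msg)
        self.clock = 0                            # Lamport logical clock
        self.req_queue = []                       # heap of (timestamp, pid) requests
        self.known = {p: 0 for p in peers}        # latest timestamp seen from each peer

    def request_cs(self):
        self.clock += 1
        heapq.heappush(self.req_queue, (self.clock, self.pid))
        self.broadcast(("req", self.clock, self.pid))

    def on_request(self, ts, sender):
        self.clock = max(self.clock, ts) + 1
        self.known[sender] = max(self.known[sender], ts)
        heapq.heappush(self.req_queue, (ts, sender))
        self.send(sender, ("ack", self.clock, self.pid))    # timestamped acknowledgement

    def on_ack(self, ts, sender):
        self.clock = max(self.clock, ts) + 1
        self.known[sender] = max(self.known[sender], ts)

    def can_enter(self):
        # Own request must head the queue and precede every known time of the peers.
        if not self.req_queue or self.req_queue[0][1] != self.pid:
            return False
        return all(self.req_queue[0][0] < t for t in self.known.values())

    def release_cs(self):
        heapq.heappop(self.req_queue)             # own request is at the head while in the CS
        self.clock += 1
        self.broadcast(("release", self.clock, self.pid))

    def on_release(self, ts, sender):
        self.clock = max(self.clock, ts) + 1
        self.known[sender] = max(self.known[sender], ts)
        self.req_queue = [(t, p) for (t, p) in self.req_queue if p != sender]
        heapq.heapify(self.req_queue)             # the head may now be our own request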
We can prove that this works by contradiction. Suppose two processes Pa and Pb were both in their critical sections, with <reqa, ta> and <reqb, tb> respectively at the head of their sorted reqQs.
Assume without loss of generality that ta < tb. Since Pb is in its critical section, tb must be smaller than the known time of Pa at Pb, so Pb has already received a message from Pa timestamped later than tb. Messages from Pa arrive in timestamp order, so reqa, with its even smaller timestamp ta, must already be in Pb's reqQ and has not yet been released. Therefore reqb is not at the head of Pb's reqQ, a contradiction.
Thus at most one process can be in its critical section at a time, meeting our goal.

Tuesday, October 15, 2013

When arguing the correctness of a program, it is helpful to do it in steps. First, identify an invariant of the program and a fixed point. If the program reaches the fixed point, it terminates, and at termination we know that both conditions are satisfied: the invariant and the fixed-point condition.
The second step is to establish that the program indeed terminates. This step involves finding a metric. A metric must be guaranteed to change eventually, otherwise the program does not progress. Further, a metric should be non-decreasing and bounded above, or non-increasing and bounded below, for it to indicate that the fixed point will be reached.
Skilled programmers can easily relate this to recursive functions. The termination condition (base case) of the recursion is specified first so that the function can terminate, and the quantity that shrinks toward that base case on every recursive call plays the role of the metric, enforcing convergence to the fixed point.
Take the example of the Earliest meeting time. This problem consists of finding the first time at which three people are all simultaneously available. The three people are named A, B and C. With each person, we associate a function f, g and h respectively. These functions represent the earliest time each is available. For example, f.t is the earliest time at or after time t at which A is available.
f.t = t means that A is available at time t.
The earliest that all three can meet is M, the minimum time t at which t = f.t = g.t = h.t.
To calculate M, we maintain a variable r denoting a candidate time; it also serves as the metric. We initialize r to zero and repeatedly perform one of the assignments r := f.r, r := g.r, or r := h.r.
The goal is to reach a fixed point, which we define as
r = f.r = g.r = h.r
This fixed point implies that r >= M. The value of r is guaranteed not to decrease because f.t >= t for every t.
We define the invariant as r <= M.
Therefore at termination r = M
The steps above follow the order stated earlier for arguing correctness: we have already noted the fixed point and the invariant, and the metric is r. We must also guarantee that r changes whenever r < M. In that case at least one of the persons is not available at time r, otherwise all three would be available and r would already equal M (proof by contradiction). Therefore f.r > r for one of the functions, and the corresponding action r := f.r strictly increases the metric.
Since r increases and is bounded above by M, the program terminates with r = M.
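A short sketch in Python makes the metric argument concrete. The availability functions below are made up for illustration; any functions satisfying f.t >= t would do.

def earliest_meeting(avail_fns, start=0):
    # r is the metric: it never decreases (since f(r) >= r) and never exceeds M.
    r = start
    while True:
        for f in avail_fns:
            if f(r) > r:          # this person is not available at time r
                r = f(r)          # the action r := f.r strictly increases the metric
                break
        else:
            return r              # fixed point: r = f.r for every f, so r = M

# Hypothetical availability functions (times in minutes).
f = lambda t: t if t % 60 == 0 else (t // 60 + 1) * 60    # A is free on the hour
g = lambda t: t if t % 15 == 0 else (t // 15 + 1) * 15    # B is free every quarter hour
h = lambda t: t if t % 90 == 0 else (t // 90 + 1) * 90    # C is free every ninety minutes

print(earliest_meeting([f, g, h], start=1))               # 180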
In the previous post we discussed the conservative approach to handling event queues, together with its refinements and extensions. By conservative we meant that we handled only events that are safe. Safe events are those that are guaranteed not to be preceded by any still-to-arrive event with a smaller timestamp.
In today's post, we discuss the optimistic approach, which does not require that condition to be met. The processes simulate the events in their queue in the hope that events with an earlier timestamp will not arrive. If an event with an earlier timestamp does arrive, the process must undo part of the simulation. This going back in time is called time warp, and an event that arrives with an earlier timestamp is called a straggler. The optimistic algorithms deal with stragglers.
To handle a straggler that arrives with timestamp t, the process locates the last valid state it saved before time t and rolls back to that state.
However, events may have been sent to other processes after time t. These events must be undone as well, so the process sends a so-called anti-event to each of those processes; the anti-event cancels the corresponding event.
An anti-event can arrive in two different ways. It may arrive with a timestamp greater than the current local time; in that case the corresponding event has not yet been processed, so the cancellation simply removes that event from the queue.
The second case is when the anti-event arrives with a timestamp less than the current time. The anti-event is itself a straggler now; to handle it the process rolls back again, and more anti-events may need to be sent, so cancellations can cascade. How do we know this does not continue indefinitely? We use the concept of a global virtual time (GVT), the minimum timestamp among the unprocessed events and messages in the system. There are two advantages now.
The algorithm never rolls back to before GVT.
The event scheduled at time GVT can be executed without causing a rollback.
In the approach we discussed so far, we need not undo all actions. As an optimization, we can skip cancelling events that would have happened regardless: if a process proceeded optimistically on an assumption that later turned out to be false, it may be that some of the events it sent were going to be generated anyway. So lazy cancellation is introduced. After a process rolls back, it resumes the simulation without sending anti-events; an anti-event is sent only for an event that the optimistic run generated but the new simulation does not.
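A rough sketch of this machinery is below. The snapshot-per-event bookkeeping, the toy state, and the send hook are illustrative choices rather than any particular simulator's API, and eager (not lazy) cancellation is shown.

from collections import namedtuple

Event = namedtuple("Event", "time payload")

class OptimisticProcess:
    # Sketch of straggler handling in a time-warp process.
    def __init__(self, send):
        self.send = send                     # send(dest, msg): hook into the messaging layer
        self.now = 0
        self.state = 0                       # toy state: count of processed events
        self.snapshots = [(0, 0)]            # (time, state) saved before each event
        self.sent_log = []                   # (time, dest, event) sent so far

    def on_event(self, ev, dest=None):
        if ev.time < self.now:               # a straggler: roll back before processing it
            self.rollback(ev.time)
        self.snapshots.append((self.now, self.state))
        self.now = ev.time
        self.state += 1                      # "simulate" the event
        if dest is not None:                 # toy rule: every event forwards one message
            out = Event(ev.time + 1, ev.payload)
            self.sent_log.append((self.now, dest, out))
            self.send(dest, out)

    def rollback(self, t):
        # Restore the last state saved before time t ...
        while len(self.snapshots) > 1 and self.snapshots[-1][0] >= t:
            self.snapshots.pop()
        self.now, self.state = self.snapshots[-1]
        # ... and cancel everything sent at or after t with anti-events.
        for (ts, dest, out) in [s for s in self.sent_log if s[0] >= t]:
            self.send(dest, ("anti", out))
        self.sent_log = [s for s in self.sent_log if s[0] < t]

# Usage: process events at times 5 and 9 optimistically, then a straggler at time 7.
# (A full simulator would also re-execute the rolled-back event at time 9.)
outbox = []
p = OptimisticProcess(send=lambda dest, msg: outbox.append((dest, msg)))
p.on_event(Event(5, "a"), dest="q")
p.on_event(Event(9, "b"), dest="q")
p.on_event(Event(7, "c"), dest="q")          # rolls back past time 9 and sends an anti-event
print(outbox)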

Monday, October 14, 2013

Today we will continue our discussion of DES and see what happens if we consider each source and server to be a separate process. Clients are passed as messages between the processes. We assume each channel queues the messages in transit so that they arrive in order.
Now, since every process maintains a single event queue, it has to decide when to process the next event. If it processed an event right away, another message might still arrive from a different source that should have been processed earlier. So the server has to wait for incoming messages on all its inbound channels before deciding which one to execute. But this waiting can lead to deadlock, because two processes could each be waiting on the other's inbound channel. In order to advance, we must ensure that such messages arrive.
One solution is to send time-update events called null events. They don't do anything other than break the impasse and ensure progress, but even null events need to be timestamped properly.
Null events take no time to process, so by themselves they don't advance the clock. If every process in a waiting cycle merely reported its current minimum time, each empty queue would still hold that minimum time and the deadlock would remain. So a null event has to be timestamped strictly larger than the current time at the sending process. This null event is then sent out to all of the sender's out-neighbors.
The null events break the impasse because each one represents a guarantee that no event will arrive on that channel with an earlier timestamp. The amount by which the null message's timestamp exceeds the current time is called the lookahead.
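A sketch of this rule for a single process might look like the following; the channel names and the send callback are placeholders for whatever transport the simulator actually uses.

class ConservativeProcess:
    # Sketch of the null-message rule: advance only when every inbound channel has a
    # message, and keep out-neighbors unblocked by promising that nothing earlier than
    # now + lookahead will follow.
    def __init__(self, in_channels, out_channels, send, lookahead):
        self.inbox = {ch: [] for ch in in_channels}   # FIFO queue per inbound channel
        self.out_channels = out_channels
        self.send = send                              # send(channel, (timestamp, payload))
        self.lookahead = lookahead
        self.now = 0

    def receive(self, channel, msg):
        self.inbox[channel].append(msg)               # messages arrive in timestamp order

    def can_step(self):
        # Safe to pick an event only when every inbound channel has at least one message.
        return all(self.inbox[ch] for ch in self.inbox)

    def step(self):
        # Process the message with the smallest timestamp across all inbound channels.
        ch = min(self.inbox, key=lambda c: self.inbox[c][0][0])
        ts, payload = self.inbox[ch].pop(0)
        self.now = ts
        # A null message (payload None) carries only a time guarantee; a real event would
        # be simulated here.  Either way, send null messages so the neighbors can progress.
        for out in self.out_channels:
            self.send(out, (self.now + self.lookahead, None))
        return payload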
Several refinements of this basic algorithm are available. Some of these are:
1) Pre-compute service times. If the service time is independent of the particular client, the length of time required for the next service can be calculated before the next event has arrived, and this typically yields a larger lookahead than the original one.
2) Request time-updates explicitly - Rather than sending null events, processes can explicitly request null events. This occurs when a queue for a particular in-neighbor becomes empty.
3) Allow deadlock - Yet another approach is to not use null messages at all. Instead a deadlock detection scheme is used.
4) Increased synchronization -  A tighter synchronization between processes can be used to ensure that no process gets too far ahead of any other; a safety window is maintained.
All of these variants behave differently, and choosing among them often involves some experimentation. Generally speaking, the magnitude of the lookahead affects performance, and this is highly application specific.


In the time-driven simulation approach we mentioned earlier, the question is how to "discretize" continuous state changes. The simulation is advanced by small increments in physical time. This time interval must be chosen carefully: if it is too large, it will not detect the events of interest, and if it is too small, it will make the simulation very slow.
Together with the overall amount of physical time being simulated, the time granularity determines the computational cost of the simulation. Hence both factors deserve attention so that we simulate just what we want.
In a true DES, the absolute sizes of the time intervals between events do not affect how quickly the simulation can proceed. Only the number of events matters.
In the event queue example we mentioned earlier, we used an arrival event to seed the queue. The event queue contained a single event with the description arrival c1 and timestamp 1, represented as <arrival c1, 1>. This event is dequeued, the "current time" is updated to 1, and the event is simulated. Simulating it sends the first client to the server, where there is no wait, so the only time taken is the time to service the client; the client begins service immediately.
Simulating the arrival immediately schedules a future event: the completion of service of c1 at this server, represented as, say, <complete c1, 10>. It also generates a second event, the arrival of the next client, say <arrival c2, 4>. Both events are inserted into the now-empty event queue in increasing order of time.
Thus we set up a loop for events to be processed at this queue. Note that we keep track of timestamps only, so from our earlier discussion this state is local; when we want to capture state across different systems we use snapshots.
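The loop can be sketched with an ordinary priority queue. The service time and inter-arrival time below are made-up constants, chosen only so that the timestamps match the example above.

import heapq

SERVICE_TIME = 9        # assumed service time for every client
INTERARRIVAL = 3        # assumed gap between client arrivals
MAX_CLIENTS = 3

event_queue = [(1, "arrival", "c1")]                  # seed: <arrival c1, 1>
arrivals = 1

while event_queue:
    clock, kind, client = heapq.heappop(event_queue)  # advance "current time" to the event
    if kind == "arrival":
        # The client begins service immediately (no waiting is modelled in this toy version),
        # so schedule its completion, e.g. <complete c1, 10>.
        heapq.heappush(event_queue, (clock + SERVICE_TIME, "complete", client))
        if arrivals < MAX_CLIENTS:                    # also schedule the next arrival
            arrivals += 1
            heapq.heappush(event_queue, (clock + INTERARRIVAL, "arrival", "c%d" % arrivals))
    print("t=%d: %s %s" % (clock, kind, client))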
Notice that the events are queued in a single queue. To alleviate this bottleneck, we could add more queues, depending on how many the processor can handle; typically one additional queue per processor helps.
If we would like to keep track of the priority of the clients, that could be maintained as separate queues per priority or with another data structure.
In the parallelized case, we still honor the arrival events and mark them with appropriate timestamps.
This is how DES systems advance simulation time.

Sunday, October 13, 2013

Discrete Event Simulation
This is yet another interesting topic in distributed computing. Consider a network of nodes with directed links between them. There are three categories of nodes: sources, servers, and sinks.
Sources generate clients that travel through this network along the links. Clients get service at the server nodes; a server can serve at most a single client at a time, so clients wait in a queue at the server. Finally, clients disappear at the sinks.
The behavior of the system can be described as a series of discrete events over time.
There are four different kinds of events:
A client arrives in the system
A client arrives at a server
A client completes service
A client leaves the system
With sufficiently precise timestamps, we can assume no two events happen at the same time.
DES is therefore characterized by a list of the time-stamped events described above, occurring in increasing order of time. One use for this list is that we can gather metrics such as average queue length.
These systems can be simulated to gather the relevant measurements and statistics. The system can then be tuned to optimize performance.
A client's service is generally not affected by previous clients at the same server; therefore clients are independent.
The data structure used in DES is an event queue. This queue maintains a list of scheduled future events in increasing order of time. Simulating an event from this queue could generate new events that are inserted into the event queue.
Clients are generated at the source by an initial arrival event. Simulating this arrival event involves both inserting the new client into the system and scheduling the next arrival.
One limitation of the approach mentioned here is that the single event queue can be a bottleneck. Since events have to wait for earlier events to be simulated, one way to remove this bottleneck is to process events in parallel.
There are also other simulations, such as time-driven simulations. In discrete event simulation, the absolute sizes of the time intervals between events do not affect how quickly the simulation can proceed; only the number of events matters.
For continuous systems, this is not suitable. Take, for example, the cueing of billiard balls or their collisions: if we discretize the state changes too coarsely we may miss the next event, so we have to take finer time slices.
In general, the computational complexity is proportional to the amount of physical time being simulated and the granularity of the time steps.

In the previous post, we discussed synchronous networks with faults. In today's post, we will continue the discussion and look at specific faults. One specific failure mode is called the Byzantine fault. Here the processes must agree even though a faulty process may send arbitrary, conflicting messages. If there were n rounds of messages, each round could contain a failure, so processes that each start out with a binary value may, at the end of each round, still hold opposing binary values; consequently, there is no convergence as the rounds decrease from n to 1. Such faults are generally more difficult to tolerate than crash faults, where the outcome is simply to stop execution. Without authentication, there is no solution when there are only three processes and one of them may be faulty.
Authenticated messages alleviate this problem, because the processes can now know that a message was signed by the sender and no one else. The message is also tamper-proof, so the receiver knows it is exactly the message the sender intended.
The same idea can be used to address the Byzantine fault with just two rounds. The signed messages are compared, and a proposed value is discarded if the same value was not proposed to all the parties.
So each process takes the following steps (sketched in code after the list):
1) broadcast signed value to all other processes
2) broadcast set of received (signed) values
3) compare received values for each of the other processes
4) if they agree, include the process that sent the value, else exclude it
5) choose the minimum value among the included processes
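A toy version of this exchange is sketched below. Signatures are modelled as unforgeable (value, signer) pairs, only first-round equivocation by the faulty process is simulated, and the second-round reports are assumed to be relayed faithfully.

def agree(first_round):
    # first_round[j][i] is the signed value (v, i) that process i sent to process j.
    # After the second-round broadcast, every process can see every other process's view.
    n = len(first_round)
    decisions = []
    for me in range(n):                               # every process runs the same comparison
        included = []
        for i in range(n):
            # Compare what each process reports having received from process i.
            reported = {first_round[j][i] for j in range(n)}
            if len(reported) == 1:                    # i sent the same signed value to all
                included.append(reported.pop()[0])
            # otherwise exclude i: it provably sent conflicting signed values
        decisions.append(min(included))               # everyone picks the same minimum
    return decisions

# Processes 0 and 1 are honest with values 5 and 7; process 2 is faulty and
# sends 3 to process 0 but 9 to everyone else.
first_round = [
    {0: (5, 0), 1: (7, 1), 2: (3, 2)},    # received by process 0
    {0: (5, 0), 1: (7, 1), 2: (9, 2)},    # received by process 1
    {0: (5, 0), 1: (7, 1), 2: (9, 2)},    # received by process 2 (its own view)
]
print(agree(first_round))                 # [5, 5, 5]: process 2's value is excluded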
Compared to the crash failure, where the fault could occur in either the first or the second round, here the faulty process continues to execute, so there may be faulty messages in both rounds. After the second round, the processes know whether any faulty message was sent in the first round, regardless of any faulty messages in the second round.
This problem becomes more interesting when more than one process can fail. In this case, two rounds do not suffice; instead a consensus tree is constructed. In this tree, the nodes at level i represent message chains of length i; a message at depth 3, for instance, contains signed messages from three processes nested one within the other. The root has N children representing the messages received in the broadcast of the first round. Each of those nodes has N-1 children from the messages received in the second round, each of those has N-2 children from the third round, and so on.
At a given level, say N-3, the messages look like (l,k,j,1), (l,k,j,2), (l,k,j,3) ... (l,k,j,N). In this case Pj relayed to all the processes the same message regarding Pk, which in turn had relayed Pl's message, so the messages all agree. The decision value of these children can then be bubbled up to the parent.
In order for this tree to yield the same value at all processes, every message chain at the leaf nodes must include at least one non-faulty process. Thus, in order to tolerate f faults, the leaf nodes must represent message chains of length f + 1; that is, a consensus tree of height f + 1 is required.
The consensus tree also helps when authenticated signatures are not available. In that case, the value from the children is bubbled up to the parent only when a majority of them agree. This is based on the assumption that the majority is non-faulty, so a quorum suffices and the correct value is bubbled up.
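A small sketch of this bubbling-up rule (just the majority step, not the full oral-messages algorithm) might look like this:

from collections import Counter

class Node:
    # One node of the consensus tree: `value` is the value this message chain reported,
    # `children` are the chains extended by one more relay.
    def __init__(self, value, children=()):
        self.value = value
        self.children = list(children)

def resolve(node, default=0):
    # A leaf keeps its reported value; an internal node takes the majority of its
    # resolved children, falling back to a default value when there is no majority.
    if not node.children:
        return node.value
    votes = Counter(resolve(c, default) for c in node.children)
    value, count = votes.most_common(1)[0]
    return value if count > len(node.children) // 2 else default

# Toy example: two relays agree that the value was 1, one (faulty) relay claims 0.
root = Node(None, [Node(1), Node(1), Node(0)])
print(resolve(root))                      # 1 - the majority value is bubbled up to the parent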