Cluster computing

Sunday, October 13, 2013

Discrete Event simulation
This is yet another interesting topic in distributed computing. Consider a network of nodes and directed links between the nodes. There are three categories of nodes. source servers and sinks.
Source generates clients that travel through this network along the links. The clients get service at the server node. Servers can provide service to at most a single client at a time. Clients are in a queue at the server. Finally, the clients disappear at the sink.
The behavior of the system over time can be defined as a series of discrete events over time.
There are four different kinds of events:
A client arrives in the system
A client arrives at a server
A client completes service
A client leaves the system
With a precise event, we can assume no two events happen at the same time.
DES is therefore characterized by a list of time-stamped events described above that occur in increasing order of time. One use for this list is that we can gather metrics such as average queue length etc.
These systems can be simulated to gather the relevant measurements and statistics. System can then be tuned to optimize performance.
Events are generally not affected in their service by previous clients at the same server. Therefore clients are independent.
The data structure used in DES is an event queue. This queue maintains a list of scheduled future events in increasing order of time. Simulating an event from this queue could generate new events that are inserted into the event queue.
Clients are generated at the source using an initial arrival event. Now simulating this arrival event involves both the insertion of new clients and the scheduling the next arrival.
One limitation with the approach mentioned here is that we have a single event queue that can be a bottleneck. Since events have to wait for earlier events to be simulated, one way to remove this bottleneck would be to parallelize the events.
There are also other simulations such as time driven simulations. In discrete event simulation, the absolute sizes of the time intervals between events does not affect how quickly the simulation can proceed. Only the number of events matter.
For continuous systems, this is not suitable. For example, take the cueing of billiard balls or when they collide. If we discretize the state changes we cannot predict the next event. So we have to take finer time slices.
In general, the computational complexity is proportional to the amount of physical time being simulated and the granularity of the time steps.

In the previous post, we discussed synchronous network with faults. In today's post, we will continue the discussion and look at specific faults/ One specific failure is called the Byzantine fault. Here two processes must agree but their messages may get lost in transit. If there were n rounds of messages, each could have a failure so the processes may each start out with a binary value and at the end of each round, may still have opposing binary values. Consequently, there is no convergence as the rounds decrease from n to 1. Such faults are generally more difficult to tolerate than crash faults because the outcome is to simply stop execution . There is no solution when there are three processes.
Authenticated messages alleviate this problem. Because now the processes can now that the message was signed by the sender and no one else. This message is also tamper proof and the receiver can know that this was the intended message by the sender.
Now the same can be used to address the Byzantine fault with just two rounds. The signed messages are compared and the proposed value is discarded if the same value was not proposed to all the parties.
So each process takes the following steps:
1) broadcast signed value to all other processes
2) broadcast set of received (signed) values
3) compare received values for each of the other processes
4) if they agree, include the process that sent the value, else exclude it
5) choose minimum value of the included process
As compared to the crash failure where the fault could occur in the first or second round, in this fault the faulty process continues to execute so that there may be faulty messages in both rounds. And after the second round, the processes know that whether any faulty message was sent in the first round regardless of any faulty messages in the second round.
This problem becomes more interesting when there are more than one processes that can fail. In this case, rounds don't suffice. Instead a consensus tree is constructed. In this tree, the nodes at level i represent message chains of length i. The message at depth 3 has signed messages contained one within the other from three processes. The root has n children representing the messages received in the broadcast of the first round. Each of those nodes have N-1 children from the messages received in the second round and each of those nodes have N-2 children from the messages received in the third round and so on.
At a given level say N-3 the messages look like (l,k,j,1), (l,k,j,2), (l,k,j,3)...(l,k,j,N). In this case, Pj relayed the same message regarding Pk which in turn relayed Pl to all the processes. So the messages all agree. The decision value of these children can then be bubbled up to the parent.
In order for this tree to calculate the same value for all processes, every message chain in the leaf nodes must include at least one non-faulty process. Thus in order to tolerate f faults, the leaf nodes must represent message chains of length f + 1. That is a consensus tree of height f+1 is required.
The consensus tree also helps in cases when the authenticated signatures are not available. In this case, the value from the children will be bubbled up to the parent only when a majority is there. This is based on the assumption that the majority will be non-faulty and so a quorum suffices and the correct value will be bubbled up.

Saturday, October 12, 2013

Asynchronous consensus with process failures.
When there is one fault, it is impossible to solve the consensus problem in a distributed system that involves asynchronous model. Unfortunately, asynchronous design is common. Faults also happen such as failures of disk. And it only takes one fault to disrupt consensus.
This has been proved based on the observation that a protocol begins in a decision state value is either 0 or 1. And it continues in which either result is still possible.
In transaction systems, for commit/abort we used the two-phase locking semantics. In group membership, leader election and reliable multicast too , this problem is common Hence, some suggestions to modify the assumptions of the problem include:
1) Synchrony - If a system can provide an upper bound on the delay or recovery, then this problem can be solved.
2) Randomness - Some probabilistic algorithms can help
3) Failure Detectors - a delayed process is difficult to differentiate from a crashed process without a failure detector.
4) Approximation - Even if all the process cannot agree on the same value, some arbitrary value can suffice.
Let us look at synchronous agreement with process failures.
Every process broadcasts to all other processes its initial value and this can be done in a single round of messages. All of them can then agree on say the minimum.
If there were to be a crash before the first round completes, then they may not have all the values. So a second round of messages could be initiated with all the values from the first round so that everyone has the same set.
Notice that if there are more failures, then there must be more rounds before the agreement. In such a case, there must be an upper bound on the number of rounds otherwise this may not converge.
Also there could be authentication and encryption involved in the messages so that it is not tampered and that it can be shared effectively. We don't focus on those at this time because we are addressing a different issue.
When we discussed the solution for synchronous systems, we may see how the assumption modifications come useful for asynchronous networks.

We will continue our discussion on garbage collection. The process has some steps that are not obvious. We will go over them in this post.
First, we look at the propagator step we added. The propagator tests a condition for each edge. This condition marks an edge between x and y as ok if x is a root and with an edge between x and y, they are connected with y as root. That is the propagator takes each garbage and connects them to food. It continues execution until a termination condition has been reached where the root is connected with all the vertices visited such that each edge x,y is ok.
Between two edges there is only one condition when the edge is not ok. This is when x is not a root. If x is not a root, it is manure. A manure is now easy to collect.
The fact that x is manure and is not a root is maintained as an invariant with the progress because we add edges. Notice that in the ok condition, we checked and maintained that y is root. This is a refinement from the earlier definition where we considered y is food. This let us come up with two separate connected graph where one is manure and the other is food. Everything else is unconnected and garbage.
To be able to prove that this is the case, we considered our actions and maintain the invariant that if x is manure it is not a root. Said differently, each vertex is neither manure or not a root for the marker to propagate. We show that this invariant holds throughout the progress.
Initially there is a root for food and manure is part of garbage. This is easy to establish because the food is mutually exclusive. By definition, food is any vertex that is reachable to the root. So our initial condition holds.
Next in our actions, we look at the two that change a vertex from being or not being a root.
The first is when we add the edge between x and y and y is food. In this case since y is food, it is not manure. Since the initial invariant holds for all vertex that is not y, adding an edge from that vertex to y, maintains the invariant.
The other action we do is when x is a root and now there exists an edge between x and y. In this case we proceed from the left. If x is not manure and we connect y, then y is not manure. if y is not manure, then y is root. This re-establishes our invariant.
Thus by maintaining our invariant and our progress, we are able to propagate. we detect the termination condition when the root is chosen and all the edges reaching to it are ok.

Friday, October 11, 2013

In the previous post we described how the garbage collection works however we did not explain the second attempt or provide proof of correctness. We will look into that now. Remember that for our jargon a vertex is food if it can be reached from the root. If it is not a food, it is garbage. Now onto garbage collection.
The mutator and the marker program superimpose their actions. The mutator actions are not affected and it can continue to add and remove the edges. However we cannot say the same for the marker because the mutator actions can frustrate the marker. For example, marking a single vertex as food could depend on the edges added or removed. To fix this difficulty, we may have to take another step.
We modify the mutator program from the original where it maintained that a vertex that is food is connected to the root and a vertex that is not food is garbage. In the original program, it assigned an edge between x and y if y was food and set the edge to true. If this was not the case, it would delete the edge.
Now we add an extra step that we set y as root when y is food. Also, if x is root and adding an edge to x,y implies only y is root. This is called the propagator step How do we know that this program is correct ? Because we now work off of a root. Two vertices will not satisfy the propagator step only when x is manure. We know the marker has terminated when we mark the root and for all edges we have completed the propagator.
Note that we mentioned that manure vertices need not be marked. Since they were marked earlier and we will collect them in the next iteration. As an aside, some of the properties that remain true are as follows: a garbage remains a garbage, a food remains a food, a manure is a garbage. Each of the category can only be reached from themselves.
Now we discuss the proof of correctness. Here we have to prove the invariant that if x is manure then x cannot be root. For all vertex, each is neither a manure nor a root.
There are two actions that can modify the root property.
First is the action step we discussed. That is, if x is added to y and y is food, then edge exists between x and y and y is root. Since y is food, y cannot be manure. For all x that is not y, if x is not manure or x is not root and y is not manure, we maintain the invariant.
For the other action where x is root and the edge between x,y is added with y becoming root, the invariant is maintained because x is not a manure. That and y not being a manure implies y can remain a root. Thus maintaining the invariant.

Another interesting application in distributed computing is garbage collection. When a program runs there are objects created on the heap. These objects are cleaned up by a managed runtime. This is called a garbage collector. All the objects on the heap derive from a base object. So the objects have a common root. During a collection, the entire set of objects on the heap then translates to an object graph which is finite and directed with a fixed set of vertices representing the objects but not the edges. The edges are added and deleted by a mutator process that modifies the graph objects that are connected can be called food and those that are not can be called garbage so we know which objects are to be collected. The mutator only adds edges if it has been marked as food.
The task is to define a marker process that runs concurrently with the mutator and is responsible for marking things as food. The marker need not be exact. It can be conservative and mark things that are not food. At the same time, we do not want the marker to be weak to mark at least all the food. i.e. we want to mark as few non-food vertices as possible. we can skip a few objects in the current iteration because they may change and we will call this set manure.
The root vertex in a garbage collector when applied to managing runtime memory, keeps track of the segment header and the free store list. The edges are merely the references to memory. Each vertex keeps track of the users of its memory with a reference count. When the reference count is zero, it is garbage. We reclaim useless memory and add them to the free store list.
We use the marker algorithm to mark the things that are food and this can keep running in the background. When the mutator deletes an edge, the reference count decreases. In a way, the marker algorithm is like the termination detection problem in that it has to know that the marking process has terminated. So we add a special node that every active process has a reference to. When a process goes idle, it deletes its reference to the special node.
As earlier, we have two sets of vertex, one that is food and the other that is garbage. The garbage may have some special vertices called the manure.
The marker program begins by adding a marked field to each vertex. It also adds a boolean to say that the marker has completed. The safety of the program is that there is a final invariant that all vertices have the completed set to true. The progress is in marking each vertex boolean for completed as true.
The mutator and the marker do not interfere with each others running because they work on basis of superposition. That is they touch data that are not related to each other.
We begin with the root and simply mark vertices that can be reached from the already marked ones. This behaves just like the gossip program mentioned earlier. However, the mutator can still frustrate the marker. So we add a second logic to say that we add an edge to x,y when y is food and set both x and y to completed, we delete an edge between x and y when either of them is false
and if x is connected to root, then we set y to completed.

Thursday, October 10, 2013

In distributed computing, termination detection is another interesting topic and one which has an application for snapshot we discussed earlier. Consider a topology of processes with unidirectional channels for communication. The graph need not be connected. Here each process can be in one of two states - an idle state or active state. A process that is in the active state can either send a message, receive a message or switch to idle. An idle process can only receive messages.
Initially all processes are active and all channels are empty. The system has come to termination when the there are no active processes and there are no messages in the channels.
To detect the termination, we introduce a special process called the termination detector. The detector is connected to all processes. When a process is idle, it sends a message to the detector. So the detector knows when all the processes are idle. However is this a sufficient condition for the termination detection. No, a process could send a message to another and both may have gone idle after the send and before the receive. So the detector gets to know that both are idle but the detector has not yet seen the message in transit.
Therefore, something more needs to be sent from each process. This information is the number of messages on each of its outgoing channels and the number of messages on each of its incoming channels.
The detector then detects whether all the processes are idle and for each channel, whether the number of messages put into the channel is equal to the number of messages taken out of the channel.
The detector works well when the termination is a valid snapshot. The state seen by the detector should be a valid snapshot.
As before, the marker algorithm comes useful to take a valid snapshot. Here we send out markers to flush the messages in the channels prior to taking a snapshot. Each process receives a marker, records its local state, and records the incoming channel used by the marker as empty and then sends the markers along each outgoing channel. When a process receives a subsequent marker, it records the state of that incoming channel as completed. The process is complete when all the process have received markers along all incoming channels.
This way the snapshot is determined to be valid and the detector can safely perform termination detection.