Cluster computing

Friday, October 11, 2013

In the previous post we described how the garbage collection works however we did not explain the second attempt or provide proof of correctness. We will look into that now. Remember that for our jargon a vertex is food if it can be reached from the root. If it is not a food, it is garbage. Now onto garbage collection.
The mutator and the marker program superimpose their actions. The mutator actions are not affected and it can continue to add and remove the edges. However we cannot say the same for the marker because the mutator actions can frustrate the marker. For example, marking a single vertex as food could depend on the edges added or removed. To fix this difficulty, we may have to take another step.
We modify the mutator program from the original where it maintained that a vertex that is food is connected to the root and a vertex that is not food is garbage. In the original program, it assigned an edge between x and y if y was food and set the edge to true. If this was not the case, it would delete the edge.
Now we add an extra step that we set y as root when y is food. Also, if x is root and adding an edge to x,y implies only y is root. This is called the propagator step How do we know that this program is correct ? Because we now work off of a root. Two vertices will not satisfy the propagator step only when x is manure. We know the marker has terminated when we mark the root and for all edges we have completed the propagator.
Note that we mentioned that manure vertices need not be marked. Since they were marked earlier and we will collect them in the next iteration. As an aside, some of the properties that remain true are as follows: a garbage remains a garbage, a food remains a food, a manure is a garbage. Each of the category can only be reached from themselves.
Now we discuss the proof of correctness. Here we have to prove the invariant that if x is manure then x cannot be root. For all vertex, each is neither a manure nor a root.
There are two actions that can modify the root property.
First is the action step we discussed. That is, if x is added to y and y is food, then edge exists between x and y and y is root. Since y is food, y cannot be manure. For all x that is not y, if x is not manure or x is not root and y is not manure, we maintain the invariant.
For the other action where x is root and the edge between x,y is added with y becoming root, the invariant is maintained because x is not a manure. That and y not being a manure implies y can remain a root. Thus maintaining the invariant.

Another interesting application in distributed computing is garbage collection. When a program runs there are objects created on the heap. These objects are cleaned up by a managed runtime. This is called a garbage collector. All the objects on the heap derive from a base object. So the objects have a common root. During a collection, the entire set of objects on the heap then translates to an object graph which is finite and directed with a fixed set of vertices representing the objects but not the edges. The edges are added and deleted by a mutator process that modifies the graph objects that are connected can be called food and those that are not can be called garbage so we know which objects are to be collected. The mutator only adds edges if it has been marked as food.
The task is to define a marker process that runs concurrently with the mutator and is responsible for marking things as food. The marker need not be exact. It can be conservative and mark things that are not food. At the same time, we do not want the marker to be weak to mark at least all the food. i.e. we want to mark as few non-food vertices as possible. we can skip a few objects in the current iteration because they may change and we will call this set manure.
The root vertex in a garbage collector when applied to managing runtime memory, keeps track of the segment header and the free store list. The edges are merely the references to memory. Each vertex keeps track of the users of its memory with a reference count. When the reference count is zero, it is garbage. We reclaim useless memory and add them to the free store list.
We use the marker algorithm to mark the things that are food and this can keep running in the background. When the mutator deletes an edge, the reference count decreases. In a way, the marker algorithm is like the termination detection problem in that it has to know that the marking process has terminated. So we add a special node that every active process has a reference to. When a process goes idle, it deletes its reference to the special node.
As earlier, we have two sets of vertex, one that is food and the other that is garbage. The garbage may have some special vertices called the manure.
The marker program begins by adding a marked field to each vertex. It also adds a boolean to say that the marker has completed. The safety of the program is that there is a final invariant that all vertices have the completed set to true. The progress is in marking each vertex boolean for completed as true.
The mutator and the marker do not interfere with each others running because they work on basis of superposition. That is they touch data that are not related to each other.
We begin with the root and simply mark vertices that can be reached from the already marked ones. This behaves just like the gossip program mentioned earlier. However, the mutator can still frustrate the marker. So we add a second logic to say that we add an edge to x,y when y is food and set both x and y to completed, we delete an edge between x and y when either of them is false
and if x is connected to root, then we set y to completed.

Thursday, October 10, 2013

In distributed computing, termination detection is another interesting topic and one which has an application for snapshot we discussed earlier. Consider a topology of processes with unidirectional channels for communication. The graph need not be connected. Here each process can be in one of two states - an idle state or active state. A process that is in the active state can either send a message, receive a message or switch to idle. An idle process can only receive messages.
Initially all processes are active and all channels are empty. The system has come to termination when the there are no active processes and there are no messages in the channels.
To detect the termination, we introduce a special process called the termination detector. The detector is connected to all processes. When a process is idle, it sends a message to the detector. So the detector knows when all the processes are idle. However is this a sufficient condition for the termination detection. No, a process could send a message to another and both may have gone idle after the send and before the receive. So the detector gets to know that both are idle but the detector has not yet seen the message in transit.
Therefore, something more needs to be sent from each process. This information is the number of messages on each of its outgoing channels and the number of messages on each of its incoming channels.
The detector then detects whether all the processes are idle and for each channel, whether the number of messages put into the channel is equal to the number of messages taken out of the channel.
The detector works well when the termination is a valid snapshot. The state seen by the detector should be a valid snapshot.
As before, the marker algorithm comes useful to take a valid snapshot. Here we send out markers to flush the messages in the channels prior to taking a snapshot. Each process receives a marker, records its local state, and records the incoming channel used by the marker as empty and then sends the markers along each outgoing channel. When a process receives a subsequent marker, it records the state of that incoming channel as completed. The process is complete when all the process have received markers along all incoming channels.
This way the snapshot is determined to be valid and the detector can safely perform termination detection.

In the previous post, we talked about snapshots. However, we did not give the rigor for the correctness of the marker approach. In the current post, I want to show just that. The snapshot is the accurate picture of the global state.
The goal here is to make sure that each computation occurred. We do this by labeling each action in the computation as either pre or post. This helps us know which actions occured before the local state was saved and which actions occurred after the state was saved. A pre action is the action that was recorded before the local state was saved. And the post action is the action that was taken after the state was saved. The actual computation occurred with a interleaved sequence of pre and post actions. We will swap adjacent pre and post actions in the computation. We will retain the order of all the pre's as they appear in the original. We will also retain the order of all the post's as they appear in the original. This way we have ordered all the pres to occur before the state was saved and all the posts to occur after the state was saved. By swapping and retaining the order, we want to have an equivalent computation as the original where each computation can be acertained to have occurred.
To complete the equivalency, let us look at the swapping of a pair of actions that are out of order. Such a pair has the form <apost, bpre>. The a and b are different processes.
There are three cases for bpre:
bpre is a local action. In this case, local actions can be swapped since the two does not affect each other.
bpre is a send action. In this case too the swap can be done.
bpre is a receive action. The only case we cannot swap with apost is when apost is the corresponding send. We show that it is impossible for apost to be a corresponding send because the markers must have been sent and since the channels are a queue, the marker must be delivered before the message sent in action a. So b must be a post and they cannot be out of order.
Thus in all cases we are able to swap. By continuing with the swapping until the desired sequence of all pres and all posts, we establish equivalency and the correctness that the snapshot is accurately taken by the markers approach.

Wednesday, October 9, 2013

In today's post, we will see an interesting use for the logical time we discussed earlier. and we will discuss snapshots.
Snapshots are a capture of a global state of a system. The global state is the combination of all the local states of processes and the channels that connect them. We treated the channels earlier as message queues where processes send messages to each other. For our discussion, processes can do one of three things, they can change their internal state, they can send a message or they can receive a message. The snapshot is this global state capture without affecting the process computations. Looking back at the event sequence diagram and the directed graph we saw earlier with processes, this snapshot is a record of the state on each of the processes, this snapshot appears as a boundary across processes in a wave like manner with regard to the real time. If all the processes shared the same time from a shared clock, this snapshot would have been a straight line at that time. This is where the concept of logical time helps us. We can take a snapshot at the same logical time across processes.
But next we will discuss whether this guarantees correctness.
Take an example of two bank accounts where one has a balance of 1000 dollars and the other has no balance and one sends half the amount to the other. Now if a snapshot is taken where the banks record their state right after send at the sender and right after the receive at the receiver, the snapshot has a net worth of 1500 which is an inconsistent state. Hence a consistent cut is required where the cut is a valid snapshot, cut is consistent and cut has no incoming edges.
How do we make a consistent cut? Use logical time because an inconsistency occurs only because the message between two process is in flight and the source and destination have not had a chance to synchronize their logical time. The logical time is updated with the max between the local time and the max of the message received.
This is made possible by recording the snapshot at the same logical time as one solution.

Another solution to the taking consistent snapshots is to send out markers through the channels to flush the messages.The initiator records its local state and sends out marker along each outgoing channel. Similarly a process receives a marker, records its local state, and records the incoming channel used by marker as empty and then sends markers along each outgoing channel. When a process receives a subsequent marker, it records the state of that incoming channel as completed. The process is complete when all the process have received markers along all incoming channels.

Time, Clocks and synchronization
In a distributed world, the concept of time is very important for the progress of operations. Time is available on a local processor because it relies on a clock or counter. In the distributed world, there is no shared clock. Yet the shared notion of time is critical. Instead of maintaining a shared central counter for the the time, let us look at how this time is used.
The processors want to know about time because they are looking to order their events. Time is used to place events such that some events "happen before" others. In a distributed system, there are three ways in which a certain event affects another.
1) A and B are on the same processor and A is earlier in computation before B
2) A and B are on different processors and A sends a message to B
3) A can affect a third event C and C can affect B
Let us look at how to represent these events. We know that events occurs on different processors on different time. So we can structure a diagram where all events that occur on a processor P are are arranged in a straight lines and there are parallel lines for different processors. The events are represented by the vertices on these lines and the "happens-before" relation is represented by directed edges between the vertices. Time proceeds to the right so the edges are directed up or down or towards the right. A horizontal axis represents the "true time"
An abstraction of a monotonic counter is used for a clock. The values read from this counter is the timestamp. But there is no central clock. Each processor maintains its own clock.
Timestamps are issued such that the timestamp is greater than all the events that happened before.
There are three cases :
Local event: the clock is merely incremented
Send event: the clock is incremented and the updated value is the timestamp for the event and the message
Receive evnet: : the clock is updated by the maximum of the current clock and the timestamp of the message.
The invariant here is that the clock shows the most recently assigned timestamp. This is the concept of the logical clock.

Tuesday, October 8, 2013

In distributed computing, there is a very interesting technique called gossip. We will describe it in this post. By the way my next several posts are going to be in distributed computing just like the one this Sunday. This is because I'm reading that book I cited earlier. Gossip refers to the mechanism of diffusing computations from one process to all the other processes. The computations are forwarded as messages and although the mechanism may not be obvious, the goal is that all the processes will have received the computations.
First let us consider the topology of processes as an undirected graph such that processes are able to communicate to their neighbor. By undirected I mean neighbors can send messages to each other in any order. We will shortly see why the ordering is not important but if we try to focus on the exact order in which the messages will be sent, there are so many that we can easily lose ourselves. Instead let us focus on the tenets of this algorithm.
The safety of the algorithm is that there is a final invariant. Each process will have computations.
The progress of the algorithm is satisfied by some forwarding of messages from processes to their neighbors such that the computations spread to everyone. When one process forwards a message, it informs the others and it wants to be informed that the others received theirs. So each process forwards the messages onto its immediate neighbor but sends an acknowledgement only to its parent. So there seems to be two different messages, the computations and the acknowledgements. For the sake of efficiency, we can eliminate the acknowledgements and say that a process received the computations the first time and everything else is an acknowledgement now that it has received it's own. When the sender gets back the computation from the receiver, the same message acts as an acknowledgement. However there is a possibility to confuse the computations and acknowledgements due to race conditions. Assume that X and Y send computations to each other. X may not know if Y was sending the computation to it before X sent or whether Y was sending an acknowledgement after it received.
Therefore convincing ourselves that the algorithm works requires some rigor. And there are great applications to this technique such as barrier synchronization.
And this is how we present the proof of correctness. A process can only be in one of three different states - idle, active and, complete. A process starts out in the idle state. Once it hears the gossip and passes it to its children, it becomes active. Once it receives the acknowledgements and sends the acknowledgement to its parent, it becomes complete.
We divide the graph into two directed graphs- the first consists of the active or completed nodes and the second consists of the nodes that are active. Both include the initiator and the edges that connect. The second starts out as a proper subset of first. We notice that the directed graphs are trees where the first tree grows and the second tree grows and shrinks. We determine that the second tree shrinks only because a node moves into the first tree as completed.