Cluster computing

Wednesday, October 9, 2013

Time, Clocks and synchronization
In a distributed world, the concept of time is very important for the progress of operations. Time is available on a local processor because it relies on a clock or counter. In the distributed world, there is no shared clock, yet a shared notion of time is critical. Instead of maintaining a shared central counter for the time, let us look at how this time is used.
The processors want to know about time because they are looking to order their events. Time is used to place events such that some events "happen before" others. In a distributed system, there are three ways in which an event A can affect another event B.
1) A and B are on the same processor and A occurs earlier in the computation than B
2) A and B are on different processors, A is the sending of a message and B is the receipt of that message
3) A affects a third event C and C affects B
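The three cases above can be sketched as reachability in a graph of events. This is only an illustration: the function name happens_before, the event labels, and the input shapes (a per-processor list of events in local order, plus a list of send/receive pairs) are assumptions made for the example, not something taken from the book.

```python
def happens_before(events_per_process, messages):
    """Return a function hb(a, b) that is True when event a happens before b.

    events_per_process: dict mapping a processor name to its events in local order.
    messages: list of (send_event, receive_event) pairs.
    Both shapes are assumptions made for this sketch.
    """
    edges = {}
    # Case 1: consecutive events on the same processor.
    for events in events_per_process.values():
        for earlier, later in zip(events, events[1:]):
            edges.setdefault(earlier, set()).add(later)
    # Case 2: a send event precedes the matching receive event.
    for send, receive in messages:
        edges.setdefault(send, set()).add(receive)

    def hb(a, b):
        # Case 3: transitivity, found by a depth-first search from a.
        stack, seen = [a], set()
        while stack:
            node = stack.pop()
            for nxt in edges.get(node, ()):
                if nxt == b:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    return hb


events = {"P": ["p1", "p2"], "Q": ["q1", "q2"], "R": ["r1", "r2"]}
messages = [("p1", "q2"), ("q2", "r2")]
hb = happens_before(events, messages)
```

Here hb("p1", "r2") holds only through the intermediate event q2, while q1 and p2 are unordered: neither happens before the other.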
Let us look at how to represent these events. We know that events occur on different processors at different times. So we can structure a diagram where all events that occur on a processor P are arranged on a straight line, with parallel lines for the different processors. The events are represented by the vertices on these lines and the "happens-before" relation is represented by directed edges between the vertices. Time proceeds to the right, so every edge points toward the right, whether it runs along a processor's line or goes up or down between lines. The horizontal axis represents the "true time".
An abstraction of a monotonic counter is used for a clock. The value read from this counter is the timestamp. But there is no central clock. Each processor maintains its own clock.
Timestamps are issued such that the timestamp of an event is greater than the timestamps of all the events that happened before it.
There are three cases:
Local event: the clock is incremented and the updated value is the timestamp for the event
Send event: the clock is incremented and the updated value is the timestamp for the event and the message
Receive event: the clock is set to one more than the maximum of the current clock and the timestamp of the message, and the updated value is the timestamp for the event
The invariant here is that the clock shows the most recently assigned timestamp. This is the concept of the logical clock.
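A minimal sketch of such a logical clock follows. The class and method names are made up for illustration. The receive rule adds one after taking the maximum, which is the usual formulation, so that a receive is always stamped later than the matching send.

```python
class LogicalClock:
    """Per-processor logical clock (illustrative names, not from the book)."""

    def __init__(self):
        # Invariant: self.time is the most recently assigned timestamp.
        self.time = 0

    def local_event(self):
        # Local event: increment and use the new value as the timestamp.
        self.time += 1
        return self.time

    def send_event(self):
        # Send event: increment; the new value stamps the event and the message.
        self.time += 1
        return self.time

    def receive_event(self, message_timestamp):
        # Receive event: jump past both the local clock and the message timestamp.
        self.time = max(self.time, message_timestamp) + 1
        return self.time


p, q = LogicalClock(), LogicalClock()
t1 = p.local_event()        # 1
t2 = p.send_event()         # 2, carried on the message to q
t3 = q.receive_event(t2)    # 3, later than the send
t4 = q.send_event()         # 4, carried on the message back to p
t5 = p.receive_event(t4)    # 5
```

Note that the timestamps only respect "happens-before": two events that do not affect each other may still receive timestamps in either order.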

Tuesday, October 8, 2013

In distributed computing, there is a very interesting technique called gossip. We will describe it in this post. By the way, my next several posts are going to be on distributed computing, just like the one this Sunday. This is because I'm reading that book I cited earlier. Gossip refers to the mechanism of diffusing computations from one process to all the other processes. The computations are forwarded as messages and, although the mechanism may not be obvious, the goal is that all the processes will have received the computations.
First let us consider the topology of processes as an undirected graph such that processes are able to communicate with their neighbors. By undirected I mean neighbors can send messages to each other in either direction and in any order. We will shortly see why the ordering is not important, but if we try to focus on the exact order in which the messages will be sent, there are so many possible orderings that we can easily lose ourselves. Instead let us focus on the tenets of this algorithm.
The safety property of the algorithm is a final invariant: on completion, every process has received the computations.
The progress of the algorithm is satisfied by some forwarding of messages from processes to their neighbors such that the computations spread to everyone. When one process forwards a message, it informs the others and it wants to be informed that the others received theirs. So each process forwards the message to its immediate neighbors but sends an acknowledgement only to its parent. So there seem to be two different messages, the computations and the acknowledgements. For the sake of efficiency, we can drop the separate acknowledgement message: the first copy a process receives is the computation, and every later copy counts as an acknowledgement, since the process already has its own copy. When the sender gets back the computation from the receiver, the same message acts as an acknowledgement. However, it is possible to confuse the computations and the acknowledgements due to race conditions. Assume that X and Y send computations to each other. X may not know whether Y sent its computation before X's arrived or whether Y was sending an acknowledgement after it received X's.
Therefore convincing ourselves that the algorithm works requires some rigor. And there are great applications of this technique, such as barrier synchronization.
And this is how we present the proof of correctness. A process can only be in one of three different states: idle, active, and complete. A process starts out in the idle state. Once it hears the gossip and passes it to its children, it becomes active. Once it receives the acknowledgements and sends the acknowledgement to its parent, it becomes complete.
We divide the graph into two directed graphs: the first consists of the active or complete nodes and the second consists of the nodes that are active. Both include the initiator and the edges that connect their nodes. The second starts out as a subset of the first. We notice that the directed graphs are trees, where the first tree only grows and the second tree grows and shrinks. We determine that the second tree shrinks only because a node becomes complete, in which case it stays in the first tree.
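To make the state machine concrete, here is a small simulation sketch. It assumes a connected undirected graph given as an adjacency dict, and it delivers in-flight messages in a random order to mimic the lack of ordering. The function name gossip, the seed parameter, and the example topology are all assumptions for illustration. A single message type is used: the first copy a process receives sets its parent, and the copy it finally sends to its parent serves as the acknowledgement.

```python
import random


def gossip(graph, initiator, seed=0):
    """Simulate gossip on a connected undirected graph (adjacency dict)."""
    rng = random.Random(seed)
    state = {p: "idle" for p in graph}
    parent = {p: None for p in graph}
    heard_from = {p: set() for p in graph}
    in_flight = []  # (sender, receiver) pairs not yet delivered

    # The initiator starts the gossip by sending to all its neighbors.
    state[initiator] = "active"
    for q in graph[initiator]:
        in_flight.append((initiator, q))

    while in_flight:
        # Deliver an arbitrary pending message.
        sender, receiver = in_flight.pop(rng.randrange(len(in_flight)))
        heard_from[receiver].add(sender)
        if state[receiver] == "idle":
            # First copy: adopt the sender as parent, forward to everyone else.
            parent[receiver] = sender
            state[receiver] = "active"
            for q in graph[receiver]:
                if q != sender:
                    in_flight.append((receiver, q))
        if state[receiver] == "active" and heard_from[receiver] == set(graph[receiver]):
            # Heard from every neighbor: acknowledge the parent and finish.
            if parent[receiver] is not None:
                in_flight.append((receiver, parent[receiver]))
            state[receiver] = "complete"
    return state, parent


topology = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"], "d": ["b"]}
final_state, tree = gossip(topology, "a")
```

Whatever the delivery order, every process should end up complete, and the parent pointers should form a tree rooted at the initiator. Messages that cross on an edge, as in the X and Y example, are simply counted as having heard from that neighbor.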

In the previous post we talked about cshtml and aspx pages. Both may require database access, and this can be done with the same enterprise data block mentioned earlier.

Monday, October 7, 2013

In today's post, I want to talk about the differences between aspx and cshtml pages. The main difference between the two is that an aspx page is a server-side dynamic page and a cshtml page is a Razor view engine page. The aspx page is subject to the page life cycle events that ASP.NET is known for. The cshtml page works with a model data structure for the data to be edited or rendered in the page. Both aspx and cshtml pages can use controllers and models as well as viewmodels.
In an aspx page, the fields of the model are populated with the <% ... %> syntax, whereas a cshtml page uses the @Html helpers. The @ notation comes from the Razor engine.
The page life cycle events in aspx allow for fine-grained control of the server page and its controls. This means that some of the controls can process their state independently, which leads to code organized around the controls on the page. The cshtml page can also be organized into partial and full pages. However, each view is associated with a viewmodel or model.
Razor is a view engine whose syntax enables HTML generation with terse code. It is better suited to stateless HTTP requests and responses.
Web Forms, on the other hand, enable state to be maintained between requests.
In terms of testing, it is easier to test cshtml than aspx because the view is separated from the other concerns.
Similarly, the idea of the code-behind occurs only in aspx pages, where the logic is moved out of the view declaration. However, it is still a tightly coupled architecture and requires integration testing for the pages to work correctly. This has often been troublesome.
The aspx page and the cshtml page can both include client-side scripts and content. However, aspx pages clutter the view with near-duplicate server-side controls and notation, whereas cshtml is cleaner in the sense that it has only HTML5 notation. Much of this is evident in the absence of any server-side controls in cshtml.
Similarly, the aspx page allows writing custom controls and logic that add to the variety and complexity of the view, whereas with the separation of model and view and its HTML-only syntax, cshtml is far cleaner.
Lastly, aspx offers a convenience for storing and rendering state that is unparalleled in cshtml. If anything, the state can be consolidated across application instances and web farms. In cshtml this is done, to some degree, via models and model state.

Combining user accounts with client registration pages for an API implementation web site integrates the developer and the user. Development requires an account to test the user page, and registration combines the steps for the two. Besides, both are the company's assets. A client registration page has details such as an account for registering, a display name, which is the name that others will see, and an e-mail address to which the validations are sent. Validations are required to confirm that the data entry is not automated. Other details include the password for the account, with enforcement of password security requirements, the name of the application, the web site, and a description of the application. The application description is important because it states what the application intends to do. We use this to validate the client application when it is being approved.
A callback URL is also registered. This callback URL is important for all OAuth client validations and hence is critical to the OAuth logins from the client. The OAuth spec demands that the client redirect URLs be validated, so this is important.
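As a rough sketch of what validating the redirect URL could look like, the check below compares the redirect URL on an incoming request with the callback URL stored at registration. The registry dict, the client id, the URLs and the function name are all hypothetical, and exact string matching is an assumption chosen because it is the strictest policy, not something the registration page above prescribes.

```python
# Hypothetical registry: client id -> callback URL captured at registration.
REGISTERED_CALLBACKS = {
    "example-client-id": "https://client.example.com/oauth/callback",
}


def is_valid_redirect(client_id, redirect_uri):
    """Accept a redirect only if it exactly matches the registered callback."""
    registered = REGISTERED_CALLBACKS.get(client_id)
    if registered is None:
        # Unknown client: nothing was registered, so nothing is allowed.
        return False
    return redirect_uri == registered


ok = is_valid_redirect("example-client-id", "https://client.example.com/oauth/callback")
bad = is_valid_redirect("example-client-id", "https://attacker.example.net/callback")
```

A request whose redirect URL fails this check would be rejected before any authorization code or token is issued.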

Sunday, October 6, 2013

I'm listing the various data mining algorithms.
These are as follows:
A classification algorithm predicts one or more discrete variables, based on the other attributes in the dataset.
A regression algorithm predicts one or more continuous variables, such as profit or loss, based on other attributes in the dataset.
A segmentation algorithm divides data into groups, or clusters, of items that have similar properties.
An association algorithm finds correlations between different attributes in a dataset. This is used for creating association rules.
A sequence analysis algorithm summarizes frequent sequences or episodes in data.
An algorithm can create a mining model that comprises
a set of clusters that describe groupings within the data set
a decision tree that predicts an outcome
a mathematical model that forecasts sales
a set of rules that describe how items are grouped together.
Algorithms can be picked based on the purpose at hand.
A decision tree algorithm can be used to predict a discrete or continuous attribute. It can also be used to find groups of common items in transactions.
A Naive Bayes algorithm works best to predict a discrete attribute. A neural network algorithm could be used too. A clustering algorithm also works well to predict a discrete attribute. However, it is better suited for grouping similar items. A sequence clustering algorithm can be used to find groups of similar items as well as to predict a sequence.
A time series algorithm and a linear regression algorithm work best to predict a continuous attribute.
An association algorithm works well to find groups of common items by establishing correlations between attributes.
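The guidance above can be summarized as a simple lookup from purpose to candidate algorithms. The dictionary and function names are made up for illustration, and the task labels are paraphrases of the sentences above rather than terms from any particular tool.

```python
# Candidate algorithms per purpose, as described in the paragraphs above.
ALGORITHMS_BY_TASK = {
    "predict a discrete attribute": [
        "decision tree", "naive bayes", "neural network", "clustering",
    ],
    "predict a continuous attribute": [
        "decision tree", "time series", "linear regression",
    ],
    "predict a sequence": ["sequence clustering"],
    "find groups of common items in transactions": [
        "association", "decision tree",
    ],
    "find groups of similar items": ["clustering", "sequence clustering"],
}


def candidates(task):
    """Return the candidate algorithms for a purpose, or an empty list."""
    return ALGORITHMS_BY_TASK.get(task, [])


continuous = candidates("predict a continuous attribute")
```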