Sunday, October 20, 2013

Search algorithms come in different types. We discuss four:
1) Unordered Linear Search: this is very similar to a student searching for an exam score in a stack of unordered exams.
2) Ordered Linear Search: sorting helps exclude portions of the list during the search.
3) Chunk Search: this is the search typically used in a phone book. A few pages are examined at a time; these are called chunks. Each chunk can then be searched using an ordered linear search.
4) Binary Search: a divide-and-conquer approach where we split the data set into two halves and proceed along only one half at each level (a small sketch follows below).
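To make the binary search concrete, here is a minimal sketch in Python (my own illustration, not taken from any book):

def binary_search(items, target):
    # items must be sorted; returns the index of target or -1 if absent
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        elif items[mid] < target:
            lo = mid + 1          # discard the lower half
        else:
            hi = mid - 1          # discard the upper half
    return -1

print(binary_search([2, 5, 8, 12, 16, 23, 38], 16))   # prints 4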
I mentioned the search algorithms as an introduction to a book I want to pick up for readings on this topic. Meanwhile, today I'm experimenting with an HTTPS-based web service deployment to the cloud that will exercise OAuth redirections.
This is very similar to the usual deployment of any .NET web application to the cloud but differs in the following steps:
1) Obtain an SSL certificate. This starts by generating a certificate signing request (CSR) with IIS Manager to send to a certificate authority.
2) After submitting the CSR to the certificate authority, an SSL certificate is obtained.
3) The certificate request is then completed in IIS Manager with the certificate provided by the certificate authority.
4) The certificate is then exported from IIS Manager along with its private key.
5) It is then added to the certificate store for the hosted service.
6) The thumbprint of the certificate is then added to the service configuration file (a small script to compute it follows below).
7) The service definition file is updated to identify the certificate to the hosted service.
8) The HTTPS endpoint is then configured in the service definition file.
9) A CNAME record is then set up for the domain name the SSL certificate is issued against, redirecting traffic to the hosted service.
The HTTPS endpoint can be tested in a standalone environment where all authentication occurs against a self-signed certificate provided for localhost.
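For step 6, the thumbprint is, as far as I understand, just the SHA-1 hash of the DER-encoded certificate, so a small Python script can compute it from an exported .cer file (the file path below is hypothetical):

import hashlib

def certificate_thumbprint(cer_path):
    # assumes a DER-encoded .cer file; a Base64/PEM file would need decoding first
    with open(cer_path, "rb") as f:
        der_bytes = f.read()
    return hashlib.sha1(der_bytes).hexdigest().upper()

# print(certificate_thumbprint(r"C:\certs\mysite.cer"))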

With the Cassini web server or, better, IIS Express, it is easier to develop the web service against HTTPS locally. In the Properties window of the project, one can set SSL Enabled to true.

Intermediate tutorial on data mining
We continue our discussion of using Analysis Services to provide an integrated environment for creating and working with data mining models. We mentioned how we bind to data sources, create and test models, and use them for predictive analysis. In this tutorial, we build on the previous review to include new scenarios such as forecasting and market basket analysis.
In forecasting, we create a time series model to forecast the sales of products in different regions around the world. We develop individual models for each region and learn how to use cross-prediction.
In market basket analysis, we create an association model to analyze groupings of products purchased together online so that we can recommend products to customers.
A forecasting structure is created from an existing relational database or data warehouse. The Microsoft Time Series algorithm is selected, and SalesByRegion is chosen as the source in this case.
The input and predictable columns can then be selected followed by specifying column contents and data types. The columns can be dragged and dropped into the forecasting mining structure.
The mining models can then be customized. The forecasting model takes the FORECAST_METHOD and PREDICTION_SMOOTHING parameters. The forecast method parameter controls whether the time series algorithm is optimized for short-term or long-term predictions. The prediction smoothing parameter controls how short-term and long-term predictions are weighted when they are mixed; typically a fifty-fifty split works well.
Missing data can also be handled.
The forecasting model can be explored with a mining model viewer. Predictions and deviations can both be viewed here.
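To make the smoothing idea concrete, here is a toy illustration in Python of weighting a short-term and a long-term forecast; it only conveys the parameter's intent, not the actual Microsoft Time Series algorithm (which mixes ARTXP and ARIMA models internally):

def blended_forecast(history, steps=3, smoothing=0.5):
    # hypothetical short-term forecast: repeat the last observed value
    short_term = [history[-1]] * steps
    # hypothetical long-term forecast: extend the average trend of the series
    trend = (history[-1] - history[0]) / (len(history) - 1)
    long_term = [history[-1] + trend * (i + 1) for i in range(steps)]
    # weight the two; smoothing = 0.5 is the fifty-fifty split mentioned above
    return [smoothing * s + (1 - smoothing) * l
            for s, l in zip(short_term, long_term)]

print(blended_forecast([10, 12, 13, 15, 18], steps=3, smoothing=0.5))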
The market basket analysis proceeds similarly. The mining structure is created using the wizard by specifying association rules and the available data sources. We select the option to allow drill-through and display the mining structure we create.
In this case, the algorithm parameters are MINIMUM_PROBABILITY and MINIMUM_SUPPORT. Typical values are 0.1 and 0.01 respectively.
The association viewer contains three tabs: Rules, Itemsets, and Dependency Network. The dependency network allows us to investigate the interaction of the different items in the model. Each node in the viewer represents an item, while a line between nodes represents a rule; a line connecting two items means these items are likely to appear in a transaction together.
Associations from the model can then be used to make predictions. This is done by building prediction queries against the mining models.
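To see how the two thresholds act as filters, here is a small Python illustration over made-up transactions; Analysis Services does all of this internally, so this is only to fix the intuition for support and probability:

from itertools import combinations

transactions = [
    {"helmet", "tire", "bottle"},
    {"helmet", "tire"},
    {"tire", "bottle"},
    {"helmet", "bottle"},
    {"helmet", "tire", "bottle"},
]

MINIMUM_SUPPORT = 0.01      # fraction of transactions containing the itemset
MINIMUM_PROBABILITY = 0.1   # confidence of the rule antecedent -> consequent

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

items = set().union(*transactions)
pairs = [frozenset(p) for p in combinations(items, 2)
         if support(frozenset(p)) >= MINIMUM_SUPPORT]

for pair in pairs:
    for a in pair:
        b = next(iter(pair - {a}))
        probability = support(pair) / support(frozenset({a}))
        if probability >= MINIMUM_PROBABILITY:
            print(a, "->", b,
                  "support=%.2f" % support(pair),
                  "probability=%.2f" % probability)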

Saturday, October 19, 2013

This post is about the Microsoft basic data mining tutorial on MSDN. The goal is to create a mining database, set up several data mining models, and use the models to make predictions. This feature is provided as part of Analysis Services. I wanted to try out the Microsoft SQL Server 2012 Data Mining Add-ins together with Excel, but it turns out I'm running into an installation bug with the add-in that keeps asking for database permissions for my login even when they are all there. It doesn't occur with any other database connections. Nevertheless, I will first lay out the interpretations from the text and then attempt to work around the bug. Meanwhile, back to the data mining tutorial.
The tutorial teaches us how to create and work with different types of data mining models. It also teaches us how to create a copy of a mining model and apply a filter to it. After the model is complete, we can use it to drill through to results.
At the time of creating the model, we can split the data into training and test sets. This holdout lets us measure the model's accuracy on data it has not seen.
Model filters can be applied to both the training and test sets.
Drill-through results trace a pattern identified by the mining model back to the underlying cases in the data source.
The first step is to use the Analysis Services Multidimensional and Data Mining Project template. We can change the instance where the data mining objects are stored.
The data source is specified using the connection manager, selecting the Native OLE DB\SQL Server Native Client provider. The tables and views can then be selected from the source database.
The second step is to build a targeted mining model structure. This can be done by selecting the definition method "from existing relational database or data warehouse" and then choosing, say, Microsoft Decision Trees as the mining algorithm. The table types and the training data will then need to be specified, followed by the data types and content types of the columns. There is a wizard option that detects these automatically, but the results may need to be reviewed.
The test data set is then carved out from the sample. The default value is thirty percent and this could be retained as-is.
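The same 70/30 holdout idea can be reproduced outside the wizard; here is a minimal Python sketch (my own illustration, the tutorial itself performs the split inside the mining structure):

import random

rows = list(range(100))            # made-up case identifiers
random.seed(42)
random.shuffle(rows)
holdout = int(0.30 * len(rows))    # thirty percent held out for testing
test_set, training_set = rows[:holdout], rows[holdout:]
print(len(training_set), len(test_set))   # 70 30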
Adding new models is easy: switch to the Mining Models tab in SSDT, right-click the structure column, and select New Mining Model. A Naive Bayes model and a clustering model can be added in the same way.
Each of the models can then be explored. For example in the decision tree, we can view all the nodes.
The models can then be tested using lift charts. A lift chart helps show how well each model makes predictions and lets us compare the results of each model directly against the results of the others.
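As a rough idea of what the lift chart measures, here is a small Python illustration with made-up scores and labels: the model ranks test cases by predicted probability, and lift compares how many positives are captured at the top of that ranking against picking cases at random:

scores = [0.9, 0.8, 0.75, 0.6, 0.55, 0.4, 0.3, 0.2, 0.15, 0.05]
labels = [1,   1,   0,    1,   0,    1,   0,   0,   0,    0]   # actual outcomes

ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
total_positives = sum(labels)

for k in (2, 5, 10):               # top 20%, 50%, 100% of the test cases
    found = sum(label for _, label in ranked[:k])
    lift = (found / total_positives) / (k / len(ranked))
    print("top", k, "captured", found, "of", total_positives, "lift=%.2f" % lift)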
Once the accuracy is found satisfactory, the Prediction Query Builder can be used to design and run prediction queries.
In this post, we will talk about fairness, both weak and strong. Every program has multiple actions, one of which is a skip action. In an informal model, we reach into a hat and pick an action. If we are very unlucky, we pick the skip action over and over again; none of the others are ever chosen and the program does nothing. To prevent this kind of selection, we impose a fairness requirement on the selection of actions. There are two kinds of fairness: weak and strong.
Under weak fairness, every action is guaranteed to be selected infinitely often. Said another way, between any two selections of the same action there are a finite number of selections of other actions. There is, however, no guarantee of how many actions may be selected before all actions have been selected at least once.
Take an example program over a boolean b and an integer counter n: one action increments n modulo four whenever b is true, and another action assigns false to b whenever n is zero.
A directed graph can be drawn to represent the program, with the vertices as states of the program and the directed edges as the actions. Weak fairness then says that each edge label is guaranteed to repeat infinitely often along an infinite path.
If the program begins execution in the state <false, 1>, every possible computation leaves the state unchanged. If, on the other hand, the initial state is <true, 1>, then there is no guarantee the state ever reaches a fixed point where b is false: the computation may cycle repeatedly through <true, 1>, <true, 2>, <true, 3> and <true, 0>. The action that assigns false to b must be selected an infinite number of times, but it may be selected only when n = 3, where it has no effect. In such a computation every action is selected infinitely often, so the weak fairness requirement is satisfied.
Under strong fairness, in addition to the weak fairness requirement, if an action is enabled infinitely often, it is selected infinitely often while enabled. Note that in the previous example the selection was extraordinarily unlucky, because one action was only chosen at the particular times when it happened to be disabled. Under strong fairness, the cycle would not be allowed to repeat forever without selecting the action that assigns false to b while it is enabled, so the program would be guaranteed to reach its fixed point and terminate.
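Here is a small Python simulation of the example above (my own reading of the program: one action increments n mod 4 when b is true, the other sets b to false when n is 0), contrasting an adversarial weak-fair schedule with a strong-fair one:

def increment(state):
    b, n = state
    return (b, (n + 1) % 4) if b else state

def reset_b(state):
    b, n = state
    return (False, n) if n == 0 else state

# Adversarial weak-fair schedule: both actions are selected infinitely often,
# but reset_b is only ever selected when n == 3, where it has no effect.
state = (True, 1)
for _ in range(100):
    state = increment(state)
    if state[1] == 3:
        state = reset_b(state)        # selected, but disabled: no effect
print("weak-fair adversary:", state)  # b is still True

# Strong-fair schedule: reset_b must be taken while it is enabled (n == 0),
# so the program reaches the fixed point with b False.
state = (True, 1)
while state[0]:
    state = increment(state)
    if state[1] == 0:
        state = reset_b(state)
print("strong fairness:", state)      # (False, 0)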
In general, a property proved assuming only weak fairness holds regardless of whether the actual execution is weakly or strongly fair. However, if a property relies on strong fairness, it may or may not hold under weak fairness. It is therefore safer to assume only weak fairness when reasoning about a program.


Friday, October 18, 2013

In today's post we will talk about one more distributed computing topic before we move on to the next subject: time synchronization of physical clocks. We mentioned that virtual clocks and logical time can capture a partial order of events in a distributed system. In this post, however, we talk about real clocks. Such a clock advances at an appropriate rate, e.g. one tick per millisecond, and each process may have its own clock.
The issue here is that the current time displayed by a clock is not a shared state. Moreover, a watch that ticks accurately is not enough: it cannot by itself be used to tell time, though it can be used to measure the duration of some interval. The clocks need to be synchronized to tell time, and even then they drift over time, requiring synchronization again.
Consider a process receiving a time notification from another process. The receiving process has no way of knowing the delay incurred in transmission. One solution is for the process to send a request first and then wait for an echo; it can then measure the elapsed round-trip time.
Since the echo carries the other process's timestamp, the requester can split the elapsed time to estimate the one-way delay, and, more usefully, it can bound the current time at the other process to an interval no wider than the round-trip time.
Sending a request and receiving an echo can be repeated in the hope of improving accuracy. If the delays are the same each time, there is no improvement; if the delays vary, the repetitions narrow the possible interval and improve the accuracy of the estimate.
As an example, let's say the timestamp on the first echo is 3:00 and that on the second echo is 3:30. If the elapsed time for the first exchange was ten minutes and the elapsed time for the second was twelve minutes, we have two intervals for the current time at the other process. The intersection of these two intervals is much smaller than either one, improving the accuracy with which we can predict the time at the other process.
When this is repeated with several servers, each server contributing its own interval, the intersection of all these intervals is even narrower, further increasing the accuracy of the time.
It might happen that the intersection of the intervals comes out empty. This means that one or more of the servers has drifted too far, so we treat each interval as a vote and go with the majority.
One way to do such a tally is to see how many intervals contain a given time: sweeping through the interval endpoints, we increment a counter whenever we enter an interval and decrement it whenever we leave one, updating the minimum and maximum bounds at each entry and exit. At the maximum count of overlapping intervals, we know the intersection that the majority agrees on.
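Here is a rough sketch of that counting sweep in Python, treating each server's interval as a vote (this is essentially the idea behind Marzullo's algorithm; the intervals below are made up):

def best_overlap(intervals):
    # (time, +1) events for interval starts and (time, -1) events for ends
    events = []
    for lo, hi in intervals:
        events.append((lo, +1))
        events.append((hi, -1))
    events.sort()

    count = best = 0
    best_lo = best_hi = None
    for i, (time, delta) in enumerate(events):
        count += delta
        if count > best:                 # a new maximum number of overlaps
            best = count
            best_lo = time
            best_hi = events[i + 1][0]   # the overlap extends to the next event
    return best, (best_lo, best_hi)

# three servers' estimates of the current time, one of them drifted
intervals = [(3.00, 3.10), (3.05, 3.20), (3.50, 3.60)]
print(best_overlap(intervals))           # (2, (3.05, 3.1)): two servers agree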

Thursday, October 17, 2013

In some of the previous posts we discussed distributed computing. My next book reading is going to be on search.
Meanwhile here is a quick summary of things that we discussed.
1) We talked about reasoning of programs and how to give proof of correctness for the program.
2) We talked about some sample programs such as Earliest Meeting time and Greatest Common Divisor of two integers X and Y
3) We talked about time, clocks and synchronization. In particular we discussed what a logical time is and how to maintain a logical clock. There are vector clocks  that keep track of time at all other processes. We talked about synchronizing clocks.
4) We talked about diffusing computations or gossip and how we know that it progresses and terminates. We mentioned interesting applications of gossip.
5) We discussed mutual exclusion as a layer that resolves conflicts between processes.
6) We discussed solutions for mutual exclusion that relied on distributed atomic variables and non-token based solutions including Lamport's algorithm of sending acknowledgements and optimizations.
7) We discussed token-based solutions including a simple token ring, a token ring with requests, and a token tree.
8) The token tree approach let us generalize to a token graph solution. The key idea was that tokens are neither created nor destroyed, which guarantees safety.
9) The tree is directed and the token is always at the root. A process sends only one request at a time and may maintain a list of pending requests. The tree is generalized to a partial order; to maintain the partial order, make all the edges incoming for the node with the token.
10) We discussed dining philosophers problem and the partial order solution aka hygienic solution. The key idea was to break the symmetry by forming a partial order. To lower a philosopher in the partial order, make all the edges outgoing. Tokens called forks are used for mutual exclusion. Priority is encoded in clean and dirty forks. Request tokens are used to determine whether a  neighbor is hungry.
11) We discussed snapshots and how to cut through timeline for various processes.  We discussed solutions to taking snapshot including logical time based solution and marker algorithm. We included a proof of correctness. The marker algorithm was meant to flush the channel of messages.
12) We discussed termination detection and the use of special detector process. We applied snapshot to help with the detection.
13) We discussed garbage collection and its relationship to termination detection. We discussed the principle of superposition and the use of a marker with the mutator. We made first and second attempts with a propagator to spread the marks in a gossip-like manner and to check which garbage is not reachable from a root, i.e., it's manure.
14) We talked about tolerating faults in a distributed system both for byzantine failures and crashes. We talked about the use of authenticated messages and building a consensus tree.
15) We also talked about discrete event simulation with sequential and time-driven simulation. We talked about conservative approaches using null events and lookaheads. We also talked about optimistic approaches where we allow mistakes that can be undone, i.e., allow stragglers and do roll-backs.

Wednesday, October 16, 2013

In addition to the previous post, we can consider some optimizations as well. In the basic algorithm, when a request enters the queue it is acknowledged immediately. One optimization is that not all acknowledgements are needed: for example, if this process has already sent a request with a later timestamp, then that request acts as the acknowledgement, because the timestamp in that message already tells the other process of a time at this process greater than the request time.
Another optimization, from Ricart and Agrawala, eliminates acknowledgements for even more requests. If this process has already sent a request with an earlier timestamp than a received request, and that request is still pending, there is no need to send an acknowledgement; when our request is granted and completed, we will send a release message and that will serve as the acknowledgement. So when the process Pt receives a (request, timestamp) from a process Pj, it defers the acknowledgement if Pt is in the critical section or if Pt's own pending request has an earlier timestamp than the one received from Pj. When Pt leaves the critical section, it sends out the deferred acknowledgements.
This completes the non-token-based approach for mutual exclusion that we wanted to discuss.
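Here is a minimal Python sketch of the deferral rule described above (the function and variable names are my own, not from any particular implementation):

def should_defer_ack(in_critical_section, my_pending_request_ts, my_id,
                     incoming_request_ts, sender_id):
    # defer the acknowledgement until we leave the critical section?
    if in_critical_section:
        return True
    if my_pending_request_ts is None:
        return False                     # we are not competing: acknowledge now
    # defer if our own pending request is earlier (ties broken by process id)
    return (my_pending_request_ts, my_id) < (incoming_request_ts, sender_id)

print(should_defer_ack(False, 5, 1, 8, 2))   # True: our request at time 5 wins
print(should_defer_ack(False, 9, 1, 8, 2))   # False: the incoming request wins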
Reverting to the program proof of correctness topic, here is one more:
Take the program to find the greatest common divisor of two given integers. The program is described with the following:
Let X and Y be two positive integers that are given. The program maintains two variables x and y and terminates with x = y = gcd(X, Y).
Initially the program sets x = X and y = Y, both greater than zero.
At each step the program assigns x := x - y if x > y, or y := y - x if y > x. By doing this repeatedly, the program converges to the gcd.
The fixed point of the program, read off from the actions above, is the predicate (x > y implies y = 0) and (y > x implies x = 0); together with the invariant below, which keeps x and y positive, this means the program is at a fixed point exactly when x = y.
The invariant adds the condition that gcd(x, y) equals gcd(X, Y), in addition to requiring that both x and y remain positive integers.
We can use x + y as the metric: it remains bounded below because x and y stay positive, and it decreases on every effective step since a positive number is subtracted from either x or y. The program guarantees the metric will change because whenever x and y differ, the next step decreases their sum. Thus the program terminates.
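To close, here is a small Python sketch of the program above, with the invariant and the metric checked at every step (the assertions mirror the proof; they are not part of the program itself):

from math import gcd

def gcd_by_subtraction(X, Y):
    assert X > 0 and Y > 0
    x, y = X, Y
    while x != y:                        # not yet at the fixed point
        metric_before = x + y
        if x > y:
            x = x - y
        else:                            # y > x
            y = y - x
        # invariant: the gcd is preserved and both variables stay positive
        assert gcd(x, y) == gcd(X, Y) and x > 0 and y > 0
        assert x + y < metric_before     # the metric strictly decreases
    return x                             # x == y == gcd(X, Y)

print(gcd_by_subtraction(24, 36))        # 12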