Monday, October 21, 2013

In today's post we describe the automatic discovery of similar words in a dictionary. In particular, we discuss the method proposed by the same authors as in the previous post. This is a refinement of the generalized method provided by Kleinberg for searching the web. These methods may have been improved upon by others since, but the reference here is a good read nevertheless.
First we construct a dictionary graph G where each word is a vertex and there is an edge from u to v if v appears in the definition of u. Then, for a given query word, we construct a neighborhood graph, which is the subgraph of G induced by the query vertex together with the vertices that point to it or are pointed to by it.
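As a rough sketch of this construction (not the paper's code; it assumes the dictionary is available as a mapping `definitions` from each word to the list of words in its definition):

```python
# Minimal sketch of the dictionary graph and neighborhood graph construction.
# The names and data layout here are illustrative assumptions.

def build_graph(definitions):
    """Edge u -> v whenever v appears in the definition of u."""
    return {u: set(words) for u, words in definitions.items()}

def neighborhood(graph, query):
    """Subgraph induced by the query word, its successors, and its predecessors."""
    successors = graph.get(query, set())
    predecessors = {u for u, vs in graph.items() if query in vs}
    keep = {query} | successors | predecessors
    return {u: graph.get(u, set()) & keep for u in keep}
```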
The problem is thus reduced to searching for authoritative pages in a subgraph focused on the query page, much like the hubs and authorities of Kleinberg's structure graph. An authority is a page that is pointed to by many good hubs, for example the home pages of Ford, Toyota, and other car makers on the topic of automobiles, whereas a page that lists links to these home pages is a hub.
The method for finding hubs and authorities proceeds by initially setting both the hub and the authority weight of every vertex to 1 and then updating them simultaneously according to a mutually reinforcing rule: the hub score of a vertex i is set to the sum of the authority scores of the vertices pointed to by i, and the authority score of a vertex j is set to the sum of the hub scores of the vertices pointing to j. The hub and authority scores can be seen as measures of similarity between the vertices of the neighborhood graph and the two vertices of the hub-authority structure graph.
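As a concrete sketch of that update (not taken from the paper), the iteration on an adjacency-set graph could look like the following; the normalization step is an assumption added to keep the scores bounded:

```python
def hits(graph, iterations=50):
    """Hub/authority iteration on a graph given as {vertex: set of successors}."""
    hub = {v: 1.0 for v in graph}
    auth = {v: 1.0 for v in graph}
    for _ in range(iterations):
        # Authority score of j: sum of hub scores of vertices pointing to j.
        new_auth = {j: sum(hub[i] for i in graph if j in graph[i]) for j in graph}
        # Hub score of i: sum of authority scores of vertices pointed to by i.
        new_hub = {i: sum(new_auth[j] for j in graph[i]) for i in graph}
        # Normalize so the scores do not grow without bound (an added assumption).
        a_norm = sum(v * v for v in new_auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in new_hub.values()) ** 0.5 or 1.0
        auth = {k: v / a_norm for k, v in new_auth.items()}
        hub = {k: v / h_norm for k, v in new_hub.items()}
    return hub, auth
```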
However, neither hub scores nor authority scores by themselves are good at discovering synonyms.
This can be improved by maintaining three scores per vertex, or, more generally, as many scores as there are vertices in the structure graph. We then iteratively and simultaneously update the scores according to a mutually reinforcing rule: the first score of a vertex i is set to the sum of the second scores of the vertices pointed to by i; the second score of i is set to the sum of the first scores of the vertices pointing to i plus the sum of the third scores of the vertices pointed to by i; and the third score of i is set to the sum of the second scores of the vertices pointing to i. Let M be the adjacency matrix of the neighborhood graph. The process converges, and the synonyms of the query word are the vertices listed in decreasing order of their entries in the principal eigenvector of MM^T + M^TM.
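A compact way to compute that ranking, assuming M is the adjacency matrix of the neighborhood graph given as a NumPy array (a sketch of the idea, not the paper's exact procedure):

```python
import numpy as np

def central_scores(M):
    """Rank vertices by their entries in the principal eigenvector of M M^T + M^T M."""
    S = M @ M.T + M.T @ M                 # symmetric, so eigh applies
    eigvals, eigvecs = np.linalg.eigh(S)
    principal = np.abs(eigvecs[:, -1])    # eigenvector for the largest eigenvalue
    return np.argsort(-principal)         # vertex indices in decreasing score order

# Example: a tiny 3-vertex neighborhood graph with edges 0 -> 1 and 2 -> 1.
M = np.array([[0, 1, 0],
              [0, 0, 0],
              [0, 1, 0]], dtype=float)
print(central_scores(M))
```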
I came across a paper on Automatic Discovery of Similar Words by Senellart and Blondel. It discusses algorithms for extracting similar words from a large corpus of documents. The paper also refers to a partial syntactic analysis from another paper, which is interesting. That method is called SEXTANT (Semantic Extraction from Text via Analyzed Network of Terms) and uses the steps mentioned below:
Lexical analysis:
Words in the corpus are separated using a simple lexical analysis. Proper names are also recognized. Then each word is looked up in a lexicon and is assigned a part of speech. If a word has several possible parts of speech, a disambiguator is used to choose the most probable one.
Noun and verb phrase bracketing:
Noun and verb phrases are then detected in the sentences of the corpus. This is done with starting, ending, and continuation rules, for example: a determiner can start a noun phrase, a noun can follow a determiner in a noun phrase, an adjective can neither start nor end a verb phrase, and so on.
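To illustrate how such bracketing rules might be encoded (a sketch under my own assumptions, not SEXTANT's actual rule set or tag names):

```python
# Hypothetical noun-phrase bracketing with start/continue rules keyed on part of speech.
START = {"DET", "ADJ", "NOUN"}              # tags that may open a noun phrase
CONTINUE = {("DET", "ADJ"), ("DET", "NOUN"),
            ("ADJ", "ADJ"), ("ADJ", "NOUN"),
            ("NOUN", "NOUN")}               # allowed (previous, current) transitions

def bracket_noun_phrases(tagged):
    """Group a list of (word, tag) pairs into noun-phrase chunks."""
    phrases, current = [], []
    for word, tag in tagged:
        if current and (current[-1][1], tag) in CONTINUE:
            current.append((word, tag))
        else:
            if current:
                phrases.append([w for w, _ in current])
            current = [(word, tag)] if tag in START else []
    if current:
        phrases.append([w for w, _ in current])
    return phrases

# Example: brackets "the civil unrest" and "the table".
print(bracket_noun_phrases([("the", "DET"), ("civil", "ADJ"), ("unrest", "NOUN"),
                            ("shook", "VERB"), ("the", "DET"), ("table", "NOUN")]))
```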
Using multiple passes over the text, several syntactic relations (or contexts) are then extracted from the bracketed sentences. These are denoted as follows:
ADJ: an adjective modifies a noun (e.g. civil unrest)
NN: a noun modifies a noun (e.g. animal rights)
NNPREP: a noun that is the object of a preposition modifies a preceding noun (e.g. measurements along the chest)
SUBJ: a noun is the subject of a verb (e.g. the table shook)
DOBJ: a noun is the direct object of a verb (e.g. shook the table)
IOBJ: a noun in a prepositional phrase modifies a verb (e.g. the book was placed on the table)

This kind of parsing is computationally intensive, and the multiple passes over a large corpus add to the cost. Hence only the few relations above were used.
The paper focuses only on the similarity between nouns; other parts of speech are not discussed. After the parsing step, each noun has a number of attributes: all the words that modify it, together with the kind of syntactic relation involved. This leads to high dimensionality; for example, a noun that appears 83 times may have 67 unique attributes.
Each attribute is assigned a weight based on the probability of that attribute among all the attributes of the noun.
These attributes are then used with a similarity measure such as a weighted Jaccard similarity.
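A minimal sketch of a weighted Jaccard similarity between two nouns' attribute-weight maps (the attribute encoding and the weights below are illustrative assumptions, not necessarily the paper's exact formula):

```python
def weighted_jaccard(a, b):
    """Weighted Jaccard similarity between two {attribute: weight} maps."""
    attributes = set(a) | set(b)
    numer = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in attributes)
    denom = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in attributes)
    return numer / denom if denom else 0.0

# Example: attribute weights for two nouns (hypothetical values).
cat = {("ADJ", "black"): 0.4, ("SUBJ", "sleep"): 0.6}
dog = {("ADJ", "black"): 0.3, ("SUBJ", "bark"): 0.7}
print(weighted_jaccard(cat, dog))   # ~0.18
```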
An interesting observation from such similarity measures is that the corpus has a great impact on which words come out as similar, since the sense of a word that gets captured is the sense used in that corpus.

Sunday, October 20, 2013

Search algorithms come in different types. We discuss four:
1) Unordered Linear Search: this is very similar to a student searching for an exam score in a collection of unordered exams.
2) Ordered Linear Search: sorting helps exclude portions of the list during the search.
3) Chunk Search: this is the search typically used in a phone book. A few pages are looked at a time; these are called chunks. Each chunk can then be searched with an ordered linear search.
4) Binary Search: we use a divide and conquer approach on sorted data, splitting the set into two halves and proceeding with only one half at each level (as sketched below).
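A small sketch of the binary search idea on a sorted list (the standard textbook version, not tied to the book mentioned next):

```python
def binary_search(items, target):
    """Return the index of target in the sorted list items, or -1 if absent."""
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        elif items[mid] < target:
            lo = mid + 1        # discard the lower half
        else:
            hi = mid - 1        # discard the upper half
    return -1

print(binary_search([2, 5, 8, 13, 21], 13))   # 3
```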
I mentioned the search algorithms as an introduction to a book I want to pick up for readings on this topic. Meanwhile, today I'm experimenting with an HTTPS-based web service deployment to the cloud that will try out OAuth redirections.
This is very similar to a usual deployment of any .NET web application to the cloud but differs in the following steps:
1) Obtain an SSL certificate. This is done by generating a certificate signing request (CSR) with IIS Manager to send to the certificate authority.
2) After submitting the CSR to the certificate authority, an SSL certificate is obtained.
3) The certificate request is then completed in IIS Manager with the certificate provided by the certificate authority.
4) The certificate is then exported from the IIS Manager.
5) It is then added to the certificate store for the hosted service.
6) The thumbprint of the certificate is then added to the service configuration file.
7) The service definition file is updated to identify the certificate to the hosted service.
8) The HTTPS endpoint is then configured in the service definition file.
9) A CNAME record is then set up for the domain name the SSL certificate was issued against, so that traffic to that name is redirected to the hosted service.
The HTTPS endpoint can be tested in a standalone environment where all authentication occurs using a self-signed certificate provided for localhost.

With the development of the Cassini web server and IIS Express, it is easier to develop the web service for HTTPS. In the properties view of the project, one can set SSL Enabled to true.

Intermediate tutorial on data mining
We continue our discussion on using Analysis Services to provide an integrated environment for creating and working with data mining models. We mentioned how to bind to data sources, create and test models, and use them for predictive analysis. In this tutorial, we build on the previous review to include new scenarios such as forecasting and market basket analysis.
In forecasting we create a time series model to forecast the sales of products in different regions around the world. You will develop individual models for each region and learn how to use cross prediction.
In market basket analysis, we will create an association model to analyze groupings of products that are purchased online so that we can recommend products to customers.
A forecasting structure is created from an existing relational database or data warehouse. The Microsoft Time Series algorithm is selected, and the SalesByRegion table is selected as the source in this case.
The input and predictable columns can then be selected, followed by specifying the column contents and data types. The columns can be dragged and dropped into the forecasting mining structure and the mining models.
The forecasting model can be customized with the FORECAST_METHOD and PREDICTION_SMOOTHING parameters. The forecast method parameter controls whether the time series algorithm is optimized for short-term or long-term predictions. The prediction smoothing parameter controls how short-term and long-term predictions are mixed; typically a fifty-fifty split works.
Missing data can also be handled.
The forecasting model can be explored with a mining model viewer. Predictions and deviations can both be viewed here.
Market basket analysis proceeds in a similar way. The mining structure is created using the wizard by specifying the association rules algorithm and the available data sources. We select the option to allow drill-through and display the mining structure we create.
In this case, the algorithm parameters are MINIMUM_PROBABILITY and MINIMUM_SUPPORT. Typical values are 0.1 and 0.01 respectively.
The association viewer contains three tabs: Rules, Itemsets, and Dependency Network. The Dependency Network tab lets us investigate the interaction of the different items in the model. Each node in the viewer represents an item, while a line between two nodes represents a rule; a line connecting two items means that these items are likely to appear in a transaction together.
Associations from the model can then be used to make predictions. This is done by building prediction queries against the mining models.

Saturday, October 19, 2013

This post is about the Microsoft basic data mining tutorial on MSDN. The goal is to create a mining database, set up several data mining models, and use the models to make predictions. This feature is provided as part of Analysis Services, and I wanted to try out the Microsoft SQL Server 2012 Data Mining Add-ins together with Excel, but it turns out that I'm running into an installation bug with the add-in that keeps asking for database permissions for my login even when they are all there. It doesn't occur with any other database connections. Nevertheless, I will first lay out the interpretations from the text and then attempt to work around the bug. Meanwhile, back to the data mining tutorial.
The tutorial teaches us how to create and work with different types of data mining models. It also teaches us how to create a copy of a mining model and apply a filter to it. After a model is complete, we can use it to drill through to the results.
At the time of creating the model, we can split the data into training and test sets. This makes it possible to measure the model's accuracy.
Model filters can be applied to both the training and the test sets.
The drill-through results follow from the patterns identified by the mining model and lead back to the underlying cases in the data source.
The first step is to use the Analysis Services Multidimensional and Data Mining Project template. We can change the instance where the data mining objects are stored.
The data source can be specified using the connection manager by selecting the Native OLE DB\SQL Server Native Client provider. The tables and views can then be selected from the source database.
The second step is to build a targeted mining model structure. This can be done by selecting the definition method "from existing relational database or data warehouse" and then choosing, say, Microsoft Decision Trees as the data mining technique. The table types and training data will then need to be specified, followed by the data types and content types. There is a wizard option that detects these automatically, but they may need to be reviewed.
The test data set is then carved out from the sample. The default value is thirty percent, and this can be retained as-is.
Adding new models is easy and can be done by switching to the Mining Models tab in SSDT, right-clicking the structure column, and selecting New Mining Models. A Naive Bayes model and a clustering model can be added in the same way.
Each of the models can then be explored. For example in the decision tree, we can view all the nodes.
The models can then be tested using lift charts. A lift chart helps compare how well each model makes predictions and lets us compare the results of each model directly against the results of the other models.
After the accuracy has been found satisfactory,  the prediction query builder can then be used to design and run prediction queries.
In this post, we will talk about fairness, both weak and strong. Every program has a skip action and, in general, several other actions. In an informal model, we reach into a hat and pick an action. If we are very unlucky, we pick the skip action over and over again; none of the others are ever chosen, and the program does nothing. To prevent this kind of selection, we impose a fairness requirement on the selection of actions. There are two kinds of fairness: weak and strong.
Under weak fairness, every action is guaranteed to be selected infinitely often. Said another way, between any two selections of the same action, there is a finite number of selections of other actions. There is, however, no guarantee of how many actions may be selected before all actions have been selected at least once.
Let us take an example where the program maintains a state consisting of a boolean variable b and a counter n. One action, enabled when b is true, increments n modulo four; another action, enabled when n is zero, assigns false to b.
A directed graph can be drawn representing the program, with the vertices as states of the program and the directed edges representing the actions. Weak fairness then says that each edge label is guaranteed to repeat infinitely often in an infinite path.
If the program begins execution in a state satisfying the predicate n = 1 with b false, the initial state is <false, 1> and every possible computation leaves the state unchanged. If, on the other hand, the initial state is <true, 1>, then there is no guarantee the state reaches the fixed point <false, 1>; it can cycle repeatedly through <true, 1>, <true, 2>, <true, 3> and <true, 0>. The action that assigns false to b must be selected an infinite number of times, but it may be selected only when n = 3, where it has no effect. In such a computation, every action is selected infinitely often, so the weak fairness requirement is satisfied.
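A small simulation of this scenario, using my own reconstruction of the two guarded actions (so treat the guards as assumptions rather than the original program):

```python
# Reconstruction of the example program: two guarded actions over state (b, n).
def increment(b, n):
    return (b, (n + 1) % 4) if b else (b, n)     # enabled when b is true

def reset(b, n):
    return (False, n) if n == 0 else (b, n)      # enabled when n is zero

def unlucky_weakly_fair_run(steps=20):
    """Both actions are selected infinitely often, but `reset` is only ever
    selected when n == 3, where it has no effect, so the cycle through the
    <true, n> states never terminates."""
    b, n = True, 1
    trace = [(b, n)]
    for _ in range(steps):
        if n == 3:
            b, n = reset(b, n)       # selected, but disabled here: no effect
        b, n = increment(b, n)
        trace.append((b, n))
    return trace

print(unlucky_weakly_fair_run())
```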
Under strong fairness, in addition to the weak fairness requirement, any action that is enabled infinitely often is selected infinitely often while enabled. Note that in the previous example the selection was extraordinarily unlucky because one action was chosen only at particular times when it happened to be disabled. Under strong fairness, the cycle would not be allowed to repeat forever without selecting the action that assigns false to b while it is enabled. The program would then be guaranteed to reach the fixed point and terminate.
In general, a property established for a program under weak fairness also holds under strong fairness. However, if a property relies on strong fairness, it may or may not hold under weak fairness. A program design could therefore choose to assume only weak fairness.


Friday, October 18, 2013

In today's post we will talk about one more distributed computing topic before we move on to the next subject: time synchronization of physical clocks. We mentioned that virtual clocks and logical time can capture a partial order of events in a distributed system. In this post, however, we will talk about real clocks. Such a clock advances at an appropriate rate, e.g. one tick per millisecond, and each process may have its own clock.
The issue here is that the current time displayed by a clock is not shared state. Moreover, a watch that ticks accurately is not enough: on its own it cannot be used to tell the time, though it can be used to time the duration of some interval. The clocks need to be synchronized to tell time, and even then they drift over time, requiring synchronization again.
Consider a process receiving a time notification from another process. The receiving process has no way of knowing the transmission delay. One solution is to send a request first and then wait for an echo; the requesting process then knows the total time elapsed for the round trip.
Since the process knows the timestamp on the echo, it can split the elapsed round-trip time to estimate the delay in receiving the echo, which bounds the other clock's time to an interval.
Sending a request and receiving an echo can be repeated in the hope of improving accuracy. If the delays are all the same, there is no improvement; if the delays vary, then the repetitions narrow the possible interval, improving the accuracy of the clock estimate.
As an example, let's say the timestamp of the first echo is 3:00 and that of the second echo is 3:30. Since the round trip for the first was ten minutes and the round trip for the second was twelve minutes, we have two interval estimates for the time at the other process. The intersection of these two intervals is much smaller, thus improving the accuracy in predicting the time at the other process.
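One way to make such estimates comparable across round trips is to express each one as an interval on the offset between the two clocks (remote minus local). This is my own framing of the example rather than the post's, and the numbers below are illustrative (times in minutes):

```python
def offset_interval(local_send, local_receive, remote_timestamp):
    """The remote clock read remote_timestamp somewhere between local_send and
    local_receive, so the offset (remote - local) lies in this interval."""
    return (remote_timestamp - local_receive, remote_timestamp - local_send)

def intersect(a, b):
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None   # None means the estimates disagree

# Two round trips: the first took 10 minutes, the second 12.
first = offset_interval(local_send=0, local_receive=10, remote_timestamp=3)
second = offset_interval(local_send=20, local_receive=32, remote_timestamp=30)
print(intersect(first, second))   # a narrower offset interval, if consistent
```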
When this is repeated with several servers, and each server yields its own interval, the intersection of all these intervals will be even narrower, thus increasing the accuracy of the time estimate.
It might happen that the intersection of the intervals comes out empty. This means that one or more of the servers has drifted too far. So we treat each interval as a vote and go with the majority.
One way to do such a tally is to see how many intervals contain a given time. Sweeping through the interval endpoints, we increment a counter whenever we enter an interval and decrement it whenever we leave one, and we track the bounds at each entry and exit. At the maximum count of overlapping intervals, we know the intersection that the majority agree on.
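A sketch of that tally (essentially Marzullo's algorithm; the data layout and tie-breaking here are simplifying assumptions):

```python
def majority_interval(intervals):
    """Return the tightest interval contained in the largest number of the
    given (low, high) intervals, along with that count."""
    events = []
    for low, high in intervals:
        events.append((low, +1))    # entering an interval
        events.append((high, -1))   # leaving an interval
    events.sort()
    best_count, count, best = 0, 0, None
    for i, (point, delta) in enumerate(events):
        count += delta
        if count > best_count and i + 1 < len(events):
            best_count = count
            best = (point, events[i + 1][0])   # from this endpoint to the next
    return best, best_count

# Three consistent estimates and one outlier: the majority agree on (-2, 3).
print(majority_interval([(-7, 3), (-2, 10), (-3, 4), (8, 12)]))
```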