Wednesday, October 23, 2013

In the previous post, we introduced SKWIC from the paper by Frigui and Nasraoui. In this post, we take a look at how they update keyword weights and their clustering.
The SKWIC algorithm is given as:
Fix the number of clusters C.
Initialize the centers by randomly selecting C documents.
Initialize the partitions X, using equal feature weights (1/n), where n is the number of terms in the vocabulary of the collection of N documents.
REPEAT
Compute the cosine-based distance along each term k between document xj and the ith cluster center: Dk_wcij = 1/n - xjk * cik, where cik is the kth component of the ith cluster center vector.
Compute the weighted aggregated sum of these distances, Dwc(xj, ci) = sum over k of vik * Dk_wcij.
Update the cluster partitioning.
Update the centers.
Update the relevance weights vik and delta-i.
UNTIL the centers stabilize.
We will discuss the steps above. Let xi and xj be document frequency vectors defined over a vocabulary of n terms. The cosine similarity measure between xi and xj, used to categorize the corresponding documents, is the dot product (the sum over k of xik * xjk) divided by the product of their magnitudes.
They also introduce a dissimilarity measure between document xj and the ith cluster center vector: the weighted aggregated sum of cosine-based distances along the individual dimensions.
The distance along dimension k is Dk_wcij = 1/n - xjk * cik, i.e. the difference between 1/n and the product of xjk and the kth component of cluster i; each of these terms is bounded by 1/n, the inverse of the total number of terms.
The weighted aggregated sum is not normalized by the magnitudes because the document vectors and the cluster centers are already normalized to unit length.
The weights vik in the weighted aggregated sum are obtained dynamically as the relevance weight of keyword k in cluster i.
Thus we have both a similarity measure and a dissimilarity measure to use with the documents.
SKWIC is designed to search for the optimal cluster centers C and the optimal set of feature weights V simultaneously. Each cluster is allowed to have its own Vi = [vi1,...,vin], and an objective function is minimized to reach the optimum.
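To make the loop concrete, here is a minimal sketch in Python/NumPy, assuming the documents are rows of a unit-normalized tf-idf matrix. The crisp assignment and the placeholder weight update are simplifications for illustration; the paper's actual update equations for the partition, the relevance weights vik, and delta-i should be substituted in their place.

import numpy as np

def skwic_sketch(X, C, n_iter=50, tol=1e-4):
    # X: (N, n) matrix of unit-normalized document vectors; C: number of clusters
    N, n = X.shape
    rng = np.random.default_rng(0)
    centers = X[rng.choice(N, size=C, replace=False)]    # centers = C randomly chosen documents
    V = np.full((C, n), 1.0 / n)                          # equal feature weights 1/n
    labels = np.zeros(N, dtype=int)
    for _ in range(n_iter):
        # cosine-based distance along each term k: D[i, j, k] = 1/n - x_jk * c_ik
        D = 1.0 / n - X[None, :, :] * centers[:, None, :]
        # weighted aggregated sum over the terms, using each cluster's own weights
        dist = np.einsum('ik,ijk->ij', V, D)              # shape (C, N)
        labels = dist.argmin(axis=0)                      # crisp partition, for simplicity
        new_centers = np.vstack([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(C)
        ])
        new_centers /= np.linalg.norm(new_centers, axis=1, keepdims=True) + 1e-12
        # here the relevance weights V and the delta_i terms would be updated
        # using the paper's closed-form expressions (omitted in this sketch)
        if np.linalg.norm(new_centers - centers) < tol:   # until the centers stabilize
            centers = new_centers
            break
        centers = new_centers
    return centers, V, labels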

Tuesday, October 22, 2013


While we discussed graph based algorithms, we mentioned that there have been improvements since. We will discuss some of these today. We will use the survey by Frigui and Nasraoui, who come up with their own technique as well. One improvement has been to use clustering. Clustering is efficient for finding the nearest neighbors, be it terms or documents. We use a vector space model where each document is represented by a vector over the terms. The document is considered a bag of words. The vector components can be assigned based on term frequencies or other statistical or probabilistic metrics. If term frequencies are used, frequent words with little discriminating power can be discounted with their inverse document frequency (IDF) in the document collection. Since documents vary a lot, TF, IDF, or their combination is not sufficient on its own. So one approach is to rely on a large corpus where the text has been manually classified, which provides a rich base for classifying terms or documents. However, both the categories and their respective keyword sets often need to be discovered simultaneously. Frigui and Nasraoui mention that selecting and weighting subsets of keywords in text documents is similar to the problem of feature selection and weighting in pattern recognition and data mining. Selecting the best set of features is important for real world tasks because noise from irrelevant features can severely degrade performance. Even if the data samples have already been classified into known classes, the authors suggest that it is preferable to model each complex class by several simple sub-classes or clusters and to use a different set of feature weights for each cluster. This can help in classifying documents into one of the preexisting categories. Since clustering and feature selection have typically been dealt with independently, the authors propose a scheme that does both simultaneously. They call this scheme SCAD, which stands for simultaneous clustering and attribute discrimination.
They also improve on weighting by using a continuous feature weighting method, so the feature relevance representation is richer than a binary or other discrete basis. Also, SCAD learns a different feature relevance representation for each cluster in an unsupervised manner. Hence the keyword weighting is both dynamic and category-dependent while being simultaneous with document clustering. One thing to note is that the distance metrics we described earlier can be Euclidean, but that is not appropriate for text. For text, we rely on cosine similarity or the Jaccard index to assess the similarity between documents. Moreover, a text document can belong to several categories to different degrees. This is handled by assigning fuzzy or soft labels instead of assigning each document to a single category. There are also noise magnet clusters that take away the outliers from the regular clusters. So SCAD is refined to work on text documents, and the result is called Simultaneous Keyword Identification and Clustering of text documents (SKWIC).

We briefly discuss other synonym extraction methods. One is the distance method and another is the ArcRank method. In the distance method, one possible way of defining a synonym distance is to see whether the words appear in the definitions of many common words or have many common words in their definitions. A way of formalizing this is to count the number of words that appear in one of the definitions but not in both, in which case the distance can be expressed as |Ai - Aj| + |Aj - Ai|, where Ai is the ith row of the adjacency matrix A.
Unlike the other methods presented in this chapter, we can apply this algorithm directly to the entire dictionary graph rather than the neighborhood graph. However, it has been found that better results are obtained when the neighborhood graph is used.
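As a quick sketch of this distance, assume a 0/1 adjacency matrix A of the dictionary graph with A[u, v] = 1 when v appears in the definition of u; counting the words that appear in one definition but not the other then reduces to an L1 difference of the rows:

import numpy as np

def definition_distance(A, i, j):
    # words appearing in the definition of i or of j, but not in both
    return int(np.abs(A[i] - A[j]).sum())

def closest_words(A, w, top=10):
    # candidate synonyms of word w: the words at smallest distance from its row
    d = np.abs(A - A[w]).sum(axis=1).astype(float)
    d[w] = np.inf                     # exclude the word itself
    return np.argsort(d)[:top]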
ArcRank is a method for building a thesaurus. The intent of the approach was not to find synonyms but related words. Both the ArcRank method and the method discussed in the previous post used the 1913 Webster dictionary as their source. Here a PageRank value is associated with each vertex of the dictionary graph in the following way: all vertices start with an identical initial ranking and then iteratively distribute it to the vertices they point to, while receiving the sum of the ranks distributed by the vertices pointing to them. Here too the normalized ranking converges to the principal eigenvector of the adjacency matrix of the graph, so there is a natural order in the ranking. The algorithm is slightly modified so that the edge cases, i.e. nodes with no incoming or no outgoing edges, do not take extreme values. If as is the number of outgoing edges of vertex s and pi the PageRank of vertex i, the relevance of the edge from s to t is defined as rs,t = (ps / as) / pt.
These edge relevances are then converted to rankings, which are computed only once. Related words for a word w can then be found by selecting the edges with the best rankings and extracting the incident vertices.
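A rough sketch of the edge relevance and the extraction step, assuming the PageRank vector p has already been computed and ignoring the special handling of nodes without incoming or outgoing edges; treating smaller relevance values as better rankings is an assumption of this sketch, not a statement of the original method's convention:

import numpy as np

def arcrank_relevances(A, p):
    # A: 0/1 adjacency matrix of the dictionary graph, p: PageRank scores
    # r[(s, t)] = (p_s / a_s) / p_t for every edge s -> t, with a_s the outdegree of s
    outdeg = A.sum(axis=1)
    return {(s, t): (p[s] / outdeg[s]) / p[t] for s, t in zip(*np.nonzero(A))}

def related_words(A, p, w, top=10):
    r = arcrank_relevances(A, p)
    # keep edges incident to w, sort by relevance, and extract the vertex at the other end
    edges = sorted((rel, s, t) for (s, t), rel in r.items() if s == w or t == w)
    return [t if s == w else s for rel, s, t in edges[:top]]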

Monday, October 21, 2013

In today's post we describe the automatic discovery of similar words in a dictionary. In particular, we discuss the method proposed by the same authors as in the previous post. It is a refinement of the generalized method provided by Kleinberg for searching the web. These methods may have been improved upon by others since, but the reference is a good read nevertheless.
First we construct a dictionary graph G, where each word is a vertex and there is an edge from u to v if v appears in the definition of u. Then, for a given query word, we construct a neighborhood graph, which is the subgraph of G whose vertices are those that point to, or are pointed to by, the vertex of the query word.
The problem is simplified to searching for authoritative pages in a subgraph focused on the original page; these correspond to the hubs and authorities of Kleinberg's structure graph. For a query such as automobile makers, the home pages of Ford, Toyota and other car makers are the authorities, whereas web pages that list links to these home pages are the hubs.
The method to find hubs and authorities proceeds by initially setting a weight of 1 for both the hub and authority score of every vertex and then simultaneously updating them according to a mutually reinforcing rule: the hub score of vertex i is set equal to the sum of the authority scores of the vertices pointed to by i, and similarly the authority score of vertex j is set equal to the sum of the hub scores of all vertices pointing to j. The hub and authority scores can also be seen as measures of similarity between each vertex and the two vertices of a small structure graph (a hub pointing to an authority).
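A compact version of this mutually reinforcing update, assuming M is the 0/1 adjacency matrix of the neighborhood graph with M[i, j] = 1 when i points to j:

import numpy as np

def hits(M, n_iter=100):
    n = M.shape[0]
    hub = np.ones(n)
    auth = np.ones(n)
    for _ in range(n_iter):
        new_auth = M.T @ hub    # authority of j: sum of hub scores of vertices pointing to j
        new_hub = M @ auth      # hub of i: sum of authority scores of vertices pointed to by i
        auth = new_auth / np.linalg.norm(new_auth)
        hub = new_hub / np.linalg.norm(new_hub)
    return hub, auth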
However, neither the hub nor the authority scores alone turn out to be good at discovering synonyms.
This can be improved by maintaining three scores per word, or, more generally, as many scores as there are vertices in the structure graph, which here is the path 1 -> 2 -> 3. We then iteratively and simultaneously update the scores according to a mutually reinforcing rule: the first score of vertex i is set to the sum of the second scores of the vertices pointed to by i, the second score of i is set to the sum of the first scores of the vertices pointing to i plus the sum of the third scores of the vertices pointed to by i, and the third score of i is set to the sum of the second scores of the vertices pointing to i. Let M be the adjacency matrix of the neighborhood graph. The process converges, and the synonyms of the query word are the words listed in decreasing order of their second (central) score, which corresponds to the principal eigenvector of M M^T + M^T M.
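The same idea extends to the three scores; a sketch under the same assumptions as above, where the central (second) score is what ranks the candidate synonyms:

import numpy as np

def central_scores(M, n_iter=100):
    n = M.shape[0]
    x1, x2, x3 = np.ones(n), np.ones(n), np.ones(n)
    for _ in range(n_iter):
        n1 = M @ x2                 # second scores of vertices pointed to by i
        n2 = M.T @ x1 + M @ x3      # first scores of in-neighbors plus third scores of out-neighbors
        n3 = M.T @ x2               # second scores of vertices pointing to i
        norm = np.linalg.norm(np.concatenate([n1, n2, n3]))
        x1, x2, x3 = n1 / norm, n2 / norm, n3 / norm
    return x2

def central_scores_eig(M):
    # equivalently, rank the words by the principal eigenvector of M M^T + M^T M
    vals, vecs = np.linalg.eigh(M @ M.T + M.T @ M)
    return np.abs(vecs[:, -1])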
I came across a paper on Automatic Discovery of Similar Words by Senellart and Blondel. It talks about algorithms to extract similar words from a large corpus of documents. The paper refers to a partial syntactic analysis from another work, which is interesting. That method is called SEXTANT (Semantic Extraction from Text via Analyzed Network of Terms) and uses the steps mentioned below:
Lexical Analysis:
Words in the corpus are separated using a simple lexical analysis. Proper names are also recognized. Then each word is looked up in a lexicon and is assigned a part of speech. If a word has several possible parts of speech, a disambiguator is used to choose the most probable one.
Noun and verb phrase bracketing:
Noun and verb phrases are then detected in the sentences of the corpus. This is done with starting, ending, and continuation rules, such as: a determiner can start a noun phrase, a noun can follow a determiner in a noun phrase, an adjective can neither start, end, nor continue a verb phrase, and so on.
Using multiple passes over the text, several syntactic relations (or contexts) are then extracted from the bracketed sentences.
These are denoted as follows:
ADJ: an adjective modifies a noun (e.g. civil unrest)
NN: a noun modifies a noun (e.g. animal rights)
NNPREP: a noun that is the object of a preposition modifies a preceding noun (e.g. measurements along the chest)
SUBJ: a noun is the subject of a verb (e.g. the table shook)
DOBJ: a noun is the direct object of a verb (e.g. shook the table)
IOBJ: a noun in a prepositional phrase modifies a verb (e.g. the book was placed on the table)

This kind of parsing is computationally intensive, and multiple passes over a large corpus add to the cost; hence only the few relations above were used.
Only the similarity between nouns is considered, and other parts of speech are not discussed. After the parsing step, a noun has a set of attributes: all the words that modify it, each tagged with the kind of syntactic relation. This leads to high dimensionality; for example, a noun that appears 83 times may have 67 unique attributes.
Each attribute is assigned a weight based on the probability of that attribute among all the attributes of the noun.
These weighted attributes are then used with a similarity measure such as a weighted Jaccard similarity.
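A small sketch of such a weighted Jaccard measure over attribute dictionaries; the weighting used here (an attribute's count divided by the noun's total attribute count) is just one simple probability-style choice, not necessarily the one used by SEXTANT:

def attribute_weights(counts):
    # counts: dict mapping a (relation, word) attribute to its frequency for one noun
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

def weighted_jaccard(w1, w2):
    # sum of minimum weights over sum of maximum weights, across all attributes
    keys = set(w1) | set(w2)
    num = sum(min(w1.get(a, 0.0), w2.get(a, 0.0)) for a in keys)
    den = sum(max(w1.get(a, 0.0), w2.get(a, 0.0)) for a in keys)
    return num / den if den else 0.0

noun_a = attribute_weights({("ADJ", "civil"): 3, ("SUBJ", "shook"): 1})
noun_b = attribute_weights({("ADJ", "civil"): 1, ("DOBJ", "shook"): 2})
print(weighted_jaccard(noun_a, noun_b))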
An interesting observation from such similarity measures is that the corpus has a great impact on which sense of the word the selected similar words reflect.

Sunday, October 20, 2013

Search algorithms are of different types. We discuss four:
1) Unordered Linear Search: this is very similar to a student searching for an exam score in a collection of unordered exams.
2) Ordered Linear Search: sorting helps exclude portions of the list during the search.
3) Chunk Search: this is the search typically used in a phone book. A few pages may be looked at a time; these are called chunks. Each chunk can then be searched using an ordered linear search.
4) Binary Search: we use a divide and conquer approach, splitting the data set into two halves and proceeding with only one half at each step; a short sketch follows below.
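A minimal binary search over a sorted list, as a sketch:

def binary_search(items, target):
    # items must be sorted; returns the index of target or -1 if absent
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        elif items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

print(binary_search([2, 5, 8, 13, 21], 13))   # prints 3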
I mentioned the search algorithms as an introduction to a book I want to pick up for readings on this topic. Meanwhile, today I'm experimenting with an HTTPS-based web service deployment to the cloud that will try out OAuth redirections.
This is very similar to the usual deployment of any .NET web application to the cloud but differs in the following steps:
1) Obtain an SSL certificate. This is done by generating a certificate signing request (CSR) with IIS Manager to send to a certificate authority.
2) After submitting the CSR to the certificate authority, an SSL certificate is obtained.
3) The certificate request is then completed in IIS Manager with the certificate provided by the certificate authority.
4) The certificate is then exported from the IIS Manager.
5) It is then added to the certificate store for the hosted service.
6) The thumbprint of the certificate is then added to the service configuration file.
7) The service definition file is updated to identify the certificate to the hosted service.
8) The HTTPS endpoint is then configured in the service definition file.
9) A CNAME record is then set up for the domain name the SSL certificate was issued against, so that traffic to that name is redirected to the hosted service.
The HTTPS endpoint can be tested in a standalone environment where all authentication occurs using a self-signed certificate provided for localhost.
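One way to exercise such an endpoint from a script is sketched below, assuming the self-signed certificate has been exported to a file named localhost.crt and the service is listening locally on port 443; both names are placeholders for the actual local setup:

import http.client
import ssl

# trust only the local self-signed certificate exported for localhost
context = ssl.create_default_context(cafile="localhost.crt")

conn = http.client.HTTPSConnection("localhost", 443, context=context)
conn.request("GET", "/")              # hypothetical path on the local web service
response = conn.getresponse()
print(response.status, response.reason)
conn.close()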

With the Cassini web server or IIS Express, it is easier to develop the web service for HTTPS. In the Properties view of the project, one can set SSL Enabled to true.

Intermediate tutorial on data mining
We continue our discussion on using Analysis Services to provide an integrated environment for creating and working with data mining models. We mentioned how we bind to data sources, create and test models, and use them for predictive analysis. In this tutorial, we build on the previous review to include new scenarios such as forecasting and market basket analysis.
In forecasting, we create a time series model to forecast the sales of products in different regions around the world. We develop individual models for each region and learn how to use cross prediction.
In market basket analysis, we create an association model to analyze groupings of products that are purchased together online, so that we can recommend products to customers.
A forecasting structure is created using an existing relational database or data warehouse. The Microsoft Time Series algorithm is selected, and the SalesByRegion data is used in this case.
The input and predictable columns can then be selected, followed by specifying column content and data types. The columns can be dragged and dropped into the forecasting mining structure and its mining models.
The forecasting model can be customized with the FORECAST_METHOD and PREDICTION_SMOOTHING parameters. The forecast method parameter controls whether the time series algorithm is optimized for short term or long term predictions. The prediction smoothing parameter controls how short term and long term predictions are mixed; typically a fifty-fifty split works.
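As an illustration of what such a mix means (this is not the algorithm's internal formula), a prediction smoothing weight of 0.5 simply gives equal say to a short-term and a long-term forecast for the same future periods:

def blend_forecasts(short_term, long_term, prediction_smoothing=0.5):
    # prediction_smoothing = 0 relies on the short-term series, 1 on the long-term series
    return [
        (1 - prediction_smoothing) * s + prediction_smoothing * l
        for s, l in zip(short_term, long_term)
    ]

print(blend_forecasts([100, 102, 104], [98, 101, 107]))   # fifty-fifty split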
Missing data can also be handled.
The forecasting model can be explored with a mining model viewer. Predictions and deviations can both be viewed here.
The market basket analysis proceeds similarly. The mining structure is created using the wizard by specifying the association rules algorithm and the available data sources. We select the option to allow drill through and display the mining structure we create.
In this case, the algorithm parameters are MINIMUM_PROBABILITY and MINIMUM_SUPPORT. Typical values are 0.1 and 0.01 respectively.
The association viewer contains three tabs: Rules, Itemsets, and Dependency Network. The Dependency Network tab allows us to investigate the interaction of the different items in the model. Each node in the viewer represents an item, while a line between two nodes represents a rule. A line connecting two items means that these items are likely to appear in a transaction together.
Associations from the model can then be used to make predictions. This is done by building prediction queries against the mining models.
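A hand-rolled sketch of the idea behind such a prediction query: given rules mined under the support and probability thresholds, recommend the consequents of rules whose antecedents are contained in the customer's basket. The rule data here is made up purely for illustration:

# each rule: (antecedent item set, consequent item, probability)
rules = [
    ({"helmet"}, "gloves", 0.40),
    ({"helmet", "gloves"}, "water bottle", 0.25),
    ({"tire"}, "tube", 0.60),
]

def recommend(basket, rules, minimum_probability=0.1):
    # return consequents of applicable rules, best probability first
    hits = {}
    for antecedent, consequent, prob in rules:
        if antecedent <= basket and consequent not in basket and prob >= minimum_probability:
            hits[consequent] = max(prob, hits.get(consequent, 0.0))
    return sorted(hits, key=hits.get, reverse=True)

print(recommend({"helmet", "gloves"}, rules))   # ['water bottle']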