Thursday, October 24, 2013

We saw that the objective function is expressed in terms of cluster center errors and feature weights. Each cluster i is allowed to have its own set of feature weights Vi = [vi1, vi2, ..., vin]. Each weight lies between 0 and 1, and the weights are constrained so that their sum vi1 + vi2 + ... + vin = 1. If we denote the objective function by J, then by setting the gradient of J to zero we get the partial derivative equations we want to solve: the first equation is the partial derivative with respect to the Lagrange multiplier, which enforces the sum-to-one constraint, and the second is the partial derivative with respect to the feature weight vik.
The objective function has reached an optimum when it no longer changes with respect to changes in the multiplier or the feature weights. That is why we set the gradient to zero and then solve for the feature weight vik.
Solving these equations gives a closed-form expression for vik, and this needs to be done only once per iteration. The resulting expression has two parts. The first part is a constant 1/n, which is the default value when all attributes/keywords are treated equally and no discrimination is performed. The second part is a bias that can be either positive or negative: it is the aggregated difference between the average distance and the actual cosine-based distance along the individual dimension. The bias is positive for compact attributes, where the distance along that dimension is, on average, less than the total distance computed over all dimensions. A very compact attribute is much more relevant to the cluster and could even characterize its center. Since the cosine-based distance along an individual term-wise dimension can be negative, dissimilar terms show up with negative values, and the resulting bias only emphasizes that dimension further. If the total aggregate dissimilarity for a document becomes negative, it no longer forms part of the cluster, since we partition the data based on minimum distance. Thus we have obtained the attribute weight vik and accounted for the second component of the objective function.
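As a sketch, my reading of the resulting closed-form update is the following (reconstructed from the description above rather than quoted from the paper; X_i is the set of documents currently assigned to cluster i, tau_i is the tuning parameter discussed next, and D_{wc_{ijk}} is the cosine-based distance along dimension k):

v_{ik} = \frac{1}{n} + \frac{1}{2\tau_i} \sum_{x_j \in X_i} \left[ \frac{1}{n} \sum_{l=1}^{n} D_{wc_{ijl}} - D_{wc_{ijk}} \right],
\qquad D_{wc_{ijk}} = \frac{1}{n} - x_{jk}\, c_{ik}

The first term is the default 1/n, and the bracketed quantity is the positive or negative bias: the average distance over all dimensions minus the actual distance along dimension k.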
The choice of the tuning parameter tau with the objective function is important because it reflects the importance of the second term with respect to the first term. If the tau for the ith cluster is very small then only one keyword in that cluster will be relevant. The rest will receive zero weights. If it is large, then all the terms in that cluster will be relevant and may receive equal weights. The tau could be chosen such that both the terms of the objective function have the same order of magnitude.
We compute this tuning parameter tau numerically in each iteration. In iteration t, we take the aggregated cluster center errors from iteration t-1 and divide them by the sum of the squared feature weights, also from iteration t-1. The resulting fraction is multiplied by a constant to adjust the magnitude of tau.
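A minimal sketch of this computation, assuming numpy arrays: dist holds the aggregated cosine-based distances of each document to each center from the previous iteration, v holds the previous feature weights, clusters lists the document indices assigned to each cluster, and K is a constant chosen to balance the magnitudes (all names here are my own, not from the paper):

import numpy as np

def update_tau(dist, v, clusters, K=1.0):
    # dist: (C, N) weighted distances from the previous iteration
    # v: (C, n) feature weights from the previous iteration
    # clusters: list of C index arrays, the documents assigned to each cluster
    C = v.shape[0]
    tau = np.zeros(C)
    for i in range(C):
        error = dist[i, clusters[i]].sum()       # aggregated cluster center errors (t-1)
        denom = np.sum(v[i] ** 2)                # sum of squared feature weights (t-1)
        tau[i] = K * error / denom if denom > 0 else K
    return tau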
Note that during the iterations the feature relevance values could change to values in excess of 1. We may also find that the updates change a lot in the early iterations, with little or no change towards the later iterations.

Wednesday, October 23, 2013

In this post we will take an in-depth look at all the formulas behind each of the steps mentioned in the previous post. Our goal is to understand not only how SKWIC works but also to describe it in stages that enable an implementation. We will assume that, given distance vectors with precomputed distances, we know how to cluster them.
In our case we begin with the similarity measure between document frequency vectors xi and xj.
This similarity is measured as the dot product of the two vectors divided by the product of their magnitudes. By that we mean we take each term k between 1 and n, where n is the number of terms, and multiply the corresponding components xik and xjk. Repeating this for k = 1 to n, we take the aggregated sum of the products xik * xjk and normalize it by dividing by the magnitudes of the two vectors to compute the cosine similarity. Intuitively, cosine similarity could be positive or negative since the vectors may point in different directions. The steps involved are iteration, summation and normalization, so this is easy to do.
Computing it for all pairs of documents has a computational complexity of O(N^2). We will call this the first step.
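As a minimal sketch of the first step (using numpy; the function name is mine):

import numpy as np

def cosine_similarity(xi, xj):
    # dot product of the two document frequency vectors over their magnitudes
    denom = np.linalg.norm(xi) * np.linalg.norm(xj)
    return float(np.dot(xi, xj) / denom) if denom > 0 else 0.0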
Now let's look at the next formula. Here we said we need to find the dissimilarity measure across the different attribute directions. We defined the dissimilarity between a document frequency vector xj and the ith cluster center vector using the cosine measure. This is the second equation, and it requires a third formula for the distance along the individual dimensions.
The third formula is easy because the distance along an individual dimension k is 1/n minus the product of xjk and cik, so it is at most 1/n.
Note that cik is the kth component of the ith cluster center vector. Since the centers are initialized from documents and remain vectors over the same vocabulary of terms, cik plays the role of a specially marked xik. That is why the third step is easy to compute given that we did the first step.
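Putting the second and third formulas together, the definitions as I read them are (a sketch, not a verbatim quote from the paper):

D_{wc_{ijk}} = \frac{1}{n} - x_{jk}\, c_{ik},
\qquad
\bar{D}_{wc_{ij}} = \sum_{k=1}^{n} v_{ik}\, D_{wc_{ijk}}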
Next we look at the objective function. This has several components so we will take a look at each of them one at a time. The objective function searches for optimal cluster centers and the optimal set of feature weights simultaneously.
The first part of the objective function is the error to the cluster centers. It is minimized when only one keyword in each cluster is completely relevant and all other keywords are irrelevant. So we sum over each of the clusters.
The second part of the objective function is the sum of the squared keyword weights. This component reaches its global minimum when all the keywords have equal weights. We also have a tuning parameter that we can set to scale the second component. Together, the objective function minimizes the sum of intra-cluster weighted distances, where the keyword weights are optimized for each cluster.
Instead of using the objective function directly, it is decomposed into C independent problems. This way we remove the summation over all the clusters, and we use a Lagrange multiplier technique with partial derivatives. To get the equations for the partial derivatives, we set the gradient to zero. We will look at these equations in detail shortly.
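As a sketch, the objective and its per-cluster Lagrangian can be written as follows (my reconstruction from the description above; tau_i is the per-cluster tuning parameter and lambda_i the multiplier enforcing that the weights of cluster i sum to one):

J = \sum_{i=1}^{C} \sum_{x_j \in X_i} \sum_{k=1}^{n} v_{ik}\, D_{wc_{ijk}}
  + \sum_{i=1}^{C} \tau_i \sum_{k=1}^{n} v_{ik}^{2},
\qquad \text{subject to } \sum_{k=1}^{n} v_{ik} = 1

J_i = \sum_{x_j \in X_i} \sum_{k=1}^{n} v_{ik}\, D_{wc_{ijk}} + \tau_i \sum_{k=1}^{n} v_{ik}^{2}
    - \lambda_i \Big( \sum_{k=1}^{n} v_{ik} - 1 \Big)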
In the previous post, we introduced SKWIC from the paper by Frigui and Nasraoui. In this post, we take a look at how they update keyword weights and their clustering.
The SKWIC algorithm is given as:
Fix the number of clusters C
Initialize the centers by randomly selecting C documents
Initialize the partitions X, using equal feature weights (1/n), where n is the total number of terms in the collection of N documents.
REPEAT
Compute cosine based distance between document xj and ith cluster center vector using a formula that involves the kth component of the ith cluster center vector
Compute the aggregated sum of the above
Update the cluster partitioning
Update the centers
Update the delta-i
UNTIL centers stabilize
We will discuss the steps above. Let xi and xj be document frequency vectors defined on a vocabulary of n terms. Then the cosine similarity measure between xi and xj, used to categorize the corresponding documents, is taken as the sum of the products xik * xjk divided by the magnitudes of the two vectors.
They also introduce a dissimilarity measure between a document xj and the ith cluster center vector as the weighted aggregated sum of cosine-based distances along the individual dimensions.
The distance along an individual dimension is bounded by 1/n, the inverse of the total number of terms, and is measured as the difference between that inverse and the product of xjk and the kth component of the ith cluster center.
The weighted aggregated sum is not normalized with the magnitude because both the weights and the distances are normalized to unit length.
The weight used in the weighted aggregated sum is dynamically obtained as the relevance weight of keyword k in cluster i.
Thus we have both the similarity measure and the dissimilarity measure for use with the document.
SKWIC is designed to search for the optimal cluster centers C and the optimal set of feature weights V simultaneously. Each cluster is allowed to have its own Vi = [vi1,...,vin], and there is an objective function to reach the optimum.
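Putting the algorithm outline and these formulas together, the following is a minimal sketch of how the pieces might be wired up in code. It is my own illustration, not the authors' implementation: the clipping of negative weights followed by renormalization, the guards against division by zero, and the center update that zeroes out irrelevant terms are assumptions on my part.

import numpy as np

def skwic(X, C, K=1.0, max_iter=50, tol=1e-4, seed=0):
    # X: (N, n) document frequency vectors, assumed already normalized to unit length
    rng = np.random.default_rng(seed)
    N, n = X.shape
    centers = X[rng.choice(N, size=C, replace=False)].copy()  # randomly selected documents
    v = np.full((C, n), 1.0 / n)                              # equal feature weights
    tau = np.full(C, K)
    for _ in range(max_iter):
        # cosine-based distance along each dimension: 1/n - x_jk * c_ik
        d = 1.0 / n - X[None, :, :] * centers[:, None, :]     # shape (C, N, n)
        D = np.einsum('ck,cjk->cj', v, d)                     # weighted aggregated distances
        labels = D.argmin(axis=0)                             # partition on minimum distance
        new_centers = centers.copy()
        for i in range(C):
            members = np.flatnonzero(labels == i)
            if members.size == 0:
                continue
            # tuning parameter from the previous iteration's errors and weights
            tau[i] = K * max(D[i, members].sum(), 1e-12) / max(np.sum(v[i] ** 2), 1e-12)
            # feature weights: default 1/n plus a positive or negative bias
            total = d[i, members, :].sum(axis=1, keepdims=True)  # unweighted total distance
            v[i] = 1.0 / n + (total / n - d[i, members, :]).sum(axis=0) / (2.0 * tau[i])
            v[i] = np.clip(v[i], 0.0, None)
            s = v[i].sum()
            v[i] = v[i] / s if s > 0 else np.full(n, 1.0 / n)
            # center: mean of member documents, zeroed for irrelevant terms
            c = X[members].mean(axis=0)
            c[v[i] == 0.0] = 0.0
            new_centers[i] = c
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return labels, centers, v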

Tuesday, October 22, 2013


While we discussed graph-based algorithms, we mentioned that there have been improvements since. We will discuss some of these today. We will use the survey by Frigui and Nasraoui, who come up with their own technique as well. One improvement has been to use clustering, which is efficient for finding nearest neighbors, be they terms or documents. We use a vector space model where each term is represented by a vector in the document space and the document is considered a bag of words. The vectors can be assigned based on term frequencies or other statistical or probabilistic metrics. If term frequencies are used, frequent words with little discriminating power can be discounted by their Inverse Document Frequency in the document collection. Since documents vary a lot, TF and IDF, alone or in combination, are not sufficient. So we go by a large corpus where the text has been manually classified, and this provides a rich base for classifying terms or documents. However, both the categories and their respective keyword sets need to be discovered simultaneously. Frigui and Nasraoui mention that selecting and weighting subsets of keywords in text documents is similar to the problem of feature selection and weighting in pattern recognition and data mining. Selecting the best set of features is important for real-world tasks because noise from irrelevant features can severely degrade performance. Even if the data samples have already been classified into known classes, the authors suggest that it is preferable to model each complex class by several simple sub-classes or clusters and to use a different set of feature weights for each cluster. This can help in classifying documents into one of the preexisting categories. Since clustering and feature selection have usually been dealt with independently, the authors suggest a scheme that does both simultaneously. They call this scheme SCAD, which stands for Simultaneous Clustering and Attribute Discrimination.
They also improve on weighting by using a continuous feature weighting method, so the feature relevance representation is richer than a binary or other discrete basis. Also, SCAD learns a different feature relevance representation for each cluster in an unsupervised manner. Hence the keyword weighting is both dynamic and category-dependent while being simultaneous with document clustering. One thing to note is that the distance metrics we described earlier can be Euclidean, but that is not appropriate for text. For text, we rely on cosine similarity or the Jaccard index to assess the similarity between documents. Moreover, a text document can belong to several categories with different degrees; this is handled by assigning fuzzy or soft labels instead of assigning each document to a single-label category. There are also noise-magnet clusters that take away the outliers from the other clusters. SCAD refined to work on text documents in this way is called Simultaneous Keyword Identification and Clustering of text documents (SKWIC).

We briefly discuss other synonym extraction methods. One is the distance method and another is the ArcRank method. In the distance method, one possible way of defining a synonym distance is to check whether the words appear in the definitions of many common words or have many common words in their definitions. A way of formalizing this is to count the number of words that appear in one of the definitions but not in both, in which case the distance can be expressed as |Ai - Aj| + |Aj - Ai|, where A is the adjacency matrix of the dictionary graph and Ai can be read as the set of words in the definition of word i (the ith row of A).
Unlike other methods presented in this chapter, we can apply this algorithm directly to the entire dictionary graph rather than the neighborhood graph. However, it has been found that better results are obtained when the neighborhood graph is used.
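A minimal sketch of the distance count itself, on a binary adjacency matrix and looking only at outgoing edges, i.e. the words in each definition (the function name is mine, and the paper's exact definition may also account for incoming edges):

import numpy as np

def definition_distance(A, i, j):
    # A: binary adjacency matrix of the dictionary graph, A[u, v] = 1 if word v
    # appears in the definition of word u. Counts words in one definition but not both.
    row_i, row_j = A[i].astype(bool), A[j].astype(bool)
    return int(np.sum(row_i & ~row_j) + np.sum(row_j & ~row_i))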
ArcRank is a method for building a thesaurus. The intent of the approach was not to find synonyms but related words. Both the ArcRank method and the method discussed in the previous post used the 1913 Webster dictionary as the source. Here a pagerank is associated with each vertex of the dictionary graph in the following way: all vertices start with an identical initial ranking and then iteratively distribute it to the vertices they point to, receiving in turn the sum of the ranks of the vertices that point to them. The normalized ranking converges to the principal eigenvector of the adjacency matrix of the graph, so there is a natural order in the ranking. The algorithm is slightly modified so that the edge cases, i.e. nodes with no incoming or no outgoing edges, don't take extreme values. If as is the number of outgoing edges of vertex s and pi is the pagerank of vertex i, the relevance of the edge from s to t is defined as rs,t = (ps/as)/pt.
These edge relevances are then converted to rankings, which are computed only once. Related words for a word w can then be found by selecting the edges with the best rankings and extracting the incident vertices.
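A rough sketch of this computation (my own code; the paper's exact handling of vertices without incoming or outgoing edges differs, and I only guard against division by zero here):

import numpy as np

def arcrank_edge_relevance(A, num_iter=100):
    # A: binary adjacency matrix; A[s, t] = 1 if there is an edge s -> t.
    # Compute a simple pagerank-style score, then r[s, t] = (p[s] / outdeg[s]) / p[t].
    n = A.shape[0]
    out = A.sum(axis=1)
    p = np.full(n, 1.0 / n)
    for _ in range(num_iter):
        share = np.where(out > 0, p / np.maximum(out, 1), 0.0)  # rank distributed along edges
        p = A.T @ share                                         # received from pointing vertices
        total = p.sum()
        p = p / total if total > 0 else np.full(n, 1.0 / n)
    r = np.zeros_like(A, dtype=float)
    s_idx, t_idx = np.nonzero(A)
    r[s_idx, t_idx] = (p[s_idx] / np.maximum(out[s_idx], 1)) / np.maximum(p[t_idx], 1e-12)
    return r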

Monday, October 21, 2013

In today's post we describe the automatic discovery of similar words in a dictionary. In particular, we discuss the method proposed by the same authors as in the previous post. It is a refinement of the generalized method provided by Kleinberg for searching the web. These methods have since been improved by others, but the reference here is a good read nevertheless.
First we construct a dictionary graph G, where each word is a vertex in G and there is an edge from u to v if v appears in the definition of u. Then, for a given query word, we construct a neighborhood graph, which is the subgraph of G whose vertices are those that point to or are pointed to by the query vertex.
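As a minimal sketch, with the dictionary represented as a mapping from each word to the set of words in its definition (the representation and names are my own):

def neighborhood_graph(G, w):
    # G: dict mapping each word to the set of words appearing in its definition.
    # Returns the subgraph around query word w: w, the words it points to,
    # the words that point to it, and the edges among them.
    nodes = {w} | set(G.get(w, set())) | {u for u, defs in G.items() if w in defs}
    return {u: set(G.get(u, set())) & nodes for u in nodes}

# Example: G = {"cat": {"small", "animal"}, "kitten": {"young", "cat"}}
# neighborhood_graph(G, "cat") keeps "cat", "small", "animal" and "kitten".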
The problem is then similar to searching for authoritative pages in a subgraph focused on the original page, in the spirit of hubs and authorities in a structure graph. An authority page is one that is pointed to by many pages on a topic, for example the home pages of automobile makers such as Ford and Toyota, whereas web pages that list links to these home pages are hubs.
The method to find hubs and authorities proceeds by initially setting both the hub and the authority weight of every vertex to 1 and then simultaneously updating them according to a mutually reinforcing rule: the hub score of a vertex i is set to the sum of the authority scores of the vertices pointed to by i, and the authority score of a vertex j is set to the sum of the hub scores of all vertices pointing to j. The hub and authority scores of a vertex can be seen as measures of its similarity to the hub vertex and the authority vertex of the small structure graph.
However, neither hub scores nor authority scores are good at discovering synonyms.
This can be improved by maintaining three scores per vertex, or, more generally, as many scores as there are vertices in the structure graph, which here is the path 1 -> 2 -> 3 with the query word playing the role of the central vertex. We iteratively update the scores simultaneously according to a mutually reinforcing rule: the first score of a vertex i is set to the sum of the second scores of the vertices pointed to by i; the second score of i is set to the sum of the first scores of the vertices pointing to i plus the sum of the third scores of the vertices pointed to by i; and the third score of i is set to the sum of the second scores of the vertices pointing to i. Let M be the adjacency matrix associated with the neighborhood graph. The process converges, and the synonyms are listed in decreasing order of the central (second) score, which converges to the principal eigenvector of M M^T + M^T M.
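A minimal sketch of this iteration (my own code, with a single global normalization per step rather than the paper's exact normalization scheme):

import numpy as np

def central_similarity_scores(M, num_iter=50):
    # M: adjacency matrix of the neighborhood graph around the query word.
    # Three scores per vertex, corresponding to the path structure graph 1 -> 2 -> 3.
    n = M.shape[0]
    x1, x2, x3 = np.ones(n), np.ones(n), np.ones(n)
    for _ in range(num_iter):
        n1 = M @ x2                 # sum of second scores of vertices pointed to by i
        n2 = M.T @ x1 + M @ x3      # first scores pointing to i plus third scores pointed to by i
        n3 = M.T @ x2               # sum of second scores of vertices pointing to i
        norm = np.linalg.norm(np.concatenate([n1, n2, n3]))
        if norm == 0:
            break
        x1, x2, x3 = n1 / norm, n2 / norm, n3 / norm
    return x2   # central scores; candidate synonyms ranked by decreasing value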
I came across a paper on Automatic Discovery of Similar Words by Senellart and Blondel. It talks about algorithms to extract similar words from a large corpus of documents. The paper refers to a partial syntactic analysis from another paper, which is interesting. That method is called SEXTANT (Semantic Extraction from Text via Analyzed Network of Terms) and uses the steps mentioned below:
Lexical Analysis :
Words in the corpus are separated using a simple lexical analysis. Proper names are also recognized. Then each word is looked up in a lexicon and is assigned a part of speech. If a word has several possible parts of speech, a disambiguator is used to choose the most probable one.
Noun and verb phrase bracketing:
Noun and verb phrases are then detected in the sentences of the corpus. This is done with starting, ending and continuation rules, such as: a determiner can start a noun phrase, a noun can follow a determiner in a noun phrase, an adjective cannot start, end or follow any kind of word in a verb phrase, and so on.
Using multiple passes over the text, several syntactic relations or contexts are then extracted from the bracketed phrases.
These are denoted as follows:
ADJ : an adjective modifies a noun eg. civil unrest
NN :  a noun modifies  a noun eg. animal rights
NNPREP : a noun that is the object of a preposition modifies a preceding noun eg. measurements along the chest
SUBJ: a noun is the subject of a verb. eg. the table shook
DOBJ:  a noun is the direct object of a verb. (e.g. shook the table )
IOBJ: a noun in a prepositional phrase modifying a verb. (eg. the book was placed on the table)

This kind of parsing is computationally intensive, and multiple passes over a large corpus add to the cost. Hence only the few relations above were used.
Only the similarity between nouns is focused on; other parts of speech are not discussed. After the parsing step, a noun has a number of attributes: all the words that modify it, along with the kind of syntactic relation. This leads to high dimensionality; for example, a noun that appears 83 times may have 67 unique attributes.
Each attribute is assigned a weight using the probability of the attribute among all those for the noun.
These attributes are then used with a similarity measure such as a weighted Jaccard similarity.
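A minimal sketch of a weighted Jaccard similarity over such attribute-weight maps (a common formulation; the paper's exact weighting may differ):

def weighted_jaccard(a, b):
    # a, b: dicts mapping attribute -> weight, e.g. the relative frequency of the
    # attribute among all attributes observed for each noun
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    den = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return num / den if den else 0.0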
An interesting observation from such similarity measures is that the corpus has a great impact on the meaning of the word according to which similar words are selected.