Friday, October 25, 2013

When we discussed how SKWIC works in the previous post, we described a cluster partitioning method that assigns each document to a single cluster. However, it is rarely the case that a document deals with a single subject only. Most documents tend to straddle the subjects of two or more different clusters, and even manual classification into a single category is difficult and error-prone. When a document is classified into a single group, it hurts retrieval once a classification model is built. K-means and SKWIC are both hard partitioning models, and by assigning each document to exactly one category they have limited use on real, large document collections.
Instead, if we assign soft labels to documents, that is, if the documents are not confined to a single category, then this can be improved. When we use fuzzy or soft partition models, the data is modeled better than with their hard partitioning counterparts. This is because fuzzy memberships are richer than crisp memberships: they describe the degrees of association of data points lying in overlapping clusters.
If we denote by uij the fuzzy membership that a data point xj can have, to varying degrees, in each cluster Xi, then the fuzzy partition is described as a C x N matrix U = [uij]. This matrix is subject to the following constraints (a small check of these appears after the list):
1. The degrees of association uij can range only between 0 and 1. This means that we assign them continuous values that are bounded by 0 and 1.
2. For each cluster i, the sum of the degrees of association uij over all N documents must lie strictly between 0 and N, so no cluster is empty and no cluster absorbs everything.
3. For each document j, the sum of the degrees of association uij over all the clusters must equal 1.
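As a quick illustration, here is a small Python/numpy sketch (not from the paper; the tolerance handling is my own) that checks whether a candidate C x N matrix U satisfies these constraints:

import numpy as np

def is_valid_fuzzy_partition(U, eps=1e-9):
    """Check the constraints on a C x N fuzzy partition matrix U = [u_ij]."""
    C, N = U.shape
    # 1. every membership u_ij lies between 0 and 1
    in_range = np.all((U >= -eps) & (U <= 1 + eps))
    # 2. for each cluster i, the memberships summed over the N documents lie
    #    strictly between 0 and N (no empty and no all-absorbing cluster)
    row_sums = U.sum(axis=1)
    rows_ok = np.all((row_sums > 0) & (row_sums < N))
    # 3. for each document j, the memberships summed over the C clusters equal 1
    cols_ok = np.allclose(U.sum(axis=0), 1.0)
    return bool(in_range and rows_ok and cols_ok)

# example: 2 clusters, 3 documents
U = np.array([[0.7, 0.2, 0.5],
              [0.3, 0.8, 0.5]])
print(is_valid_fuzzy_partition(U))   # True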
We define a new objective function in which these fuzzy memberships, together with the sum of the errors to the cluster centers, form the first component. This component is minimized when only one keyword in each cluster is completely relevant and all other keywords are irrelevant. The other component is the sum of the squared keyword weights; the global minimum of this component is achieved when all the keywords are equally weighted. The goal of the objective function is to reduce the sum of the intra-cluster weighted distances.
We obtain similar partial derivative equations for this objective function and solve them in the same manner as for the previous one.
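To make the shape of this objective concrete, here is a rough sketch of evaluating its two components. The fuzzifier exponent m, the array names, and the vectorization are my own assumptions, not the paper's notation:

import numpy as np

def fuzzy_skwic_objective(X, centers, U, V, tau, m=2.0):
    """Sketch of the fuzzy objective: fuzzified weighted distances plus squared weights.

    X       : (N, n) unit-length document frequency vectors
    centers : (C, n) unit-length cluster center vectors
    U       : (C, N) fuzzy memberships u_ij
    V       : (C, n) keyword relevance weights v_ik (each row sums to 1)
    tau     : (C,)   tuning parameters, one per cluster
    m       : fuzzifier exponent (an assumption, as in fuzzy c-means)
    """
    n = X.shape[1]
    # per-dimension cosine-based distance D_ijk = 1/n - x_jk * c_ik
    D = 1.0 / n - X[None, :, :] * centers[:, None, :]      # shape (C, N, n)
    weighted = np.einsum('ik,ijk->ij', V, D)                # sum_k v_ik * D_ijk
    first = np.sum((U ** m) * weighted)                     # fuzzified cluster center errors
    second = np.sum(tau * np.sum(V ** 2, axis=1))           # sum_i tau_i * sum_k v_ik^2
    return first + second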

Thursday, October 24, 2013

In the previous post, we talked about finding tau and adjusting it so that the two components of the objective function have the same order of magnitude. What we will see next is that the cluster partition that minimizes J is the one that assigns each data sample to the cluster with the nearest center. In other words, document xj is placed in cluster i when its weighted aggregated cosine-based distance Dwcij is smaller than Dwckj for every other cluster k. Ties among candidate clusters are resolved arbitrarily.
Since the objective function cannot be minimized analytically with respect to the centers, we compute new cluster centroids directly and normalize them to unit length.
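A minimal sketch of these two steps, assuming the weighted aggregated distances Dwc have already been computed as a C x N array:

import numpy as np

def update_partition(Dwc):
    """Assign each document to the cluster with the smallest weighted aggregated
    cosine-based distance (argmin breaks ties by picking the first, i.e. arbitrarily)."""
    return np.argmin(Dwc, axis=0)                # one cluster label per document

def update_centers(X, labels, C):
    """Recompute each center as the mean of its documents, normalized to unit length."""
    n = X.shape[1]
    centers = np.zeros((C, n))
    for i in range(C):
        docs = X[labels == i]
        if len(docs) > 0:
            c = docs.mean(axis=0)
            norm = np.linalg.norm(c)
            centers[i] = c / norm if norm > 0 else c
    return centers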
Notice that this results in two cases, depending on the feature weight vik, which, as we may remember from the previous post, is the sum of the default value (1/n) and a bias. The latter can be either positive or negative.
So the first case is when vik = 0.
This means that the kth feature is completely irrelevant relative to the ith cluster. Hence, regardless of the value of cik, the value of this feature will not contribute to the overall weighted distance computation. Therefore, in this situation, any arbitrary value can be chosen for cik and is generally set to 0.
The second case is when vik != 0
In this case the center component cik is recomputed as the aggregated sum of the kth components of the cluster's document frequency vectors, normalized by the number of documents in the cluster. In other words, when the kth feature has some relevance to the ith cluster, we simply pick (nearly) the mid-point, i.e. the mean, along that dimension.
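The two cases can be written out per component roughly as follows (a sketch; the trailing unit-length normalization follows the previous paragraph, and the variable names are mine):

import numpy as np

def update_center_components(X, labels, V, i):
    """Recompute the ith center component-wise, honoring the two cases for v_ik."""
    docs = X[labels == i]                         # documents currently in cluster i
    n = X.shape[1]
    c_i = np.zeros(n)
    for k in range(n):
        if V[i, k] == 0:
            # case 1: keyword k is irrelevant to cluster i; any value works, use 0
            c_i[k] = 0.0
        elif len(docs) > 0:
            # case 2: keyword k has some relevance; take the mean ("mid-point")
            # of the kth components of the cluster's documents
            c_i[k] = docs[:, k].mean()
    norm = np.linalg.norm(c_i)
    return c_i / norm if norm > 0 else c_i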
Now, let us summarize the above steps for clustering a collection of N normalized document vectors defined over a vocabulary of n keywords (a code sketch of the full loop follows the listing):
Fix the number of clusters C
Initialize the centers by randomly selecting C documents
Initialize the partition X, using equal feature weights (1/n), where n is the number of keywords in the vocabulary.
REPEAT
Compute the cosine-based distance between document xj and the ith cluster center vector along each dimension, using the formula for the kth component: 1/n - (xjk * cik). Repeat this for every cluster i from 1 to C, every document j from 1 to N, and every dimension k from 1 to n
Update the relevance weights vik as the default value (1/n) plus a bias computed, for each dimension k, from the difference between the average distance across all dimensions and the distance along dimension k
Compute the weighted aggregated sum of cosine-based distances along the individual dimensions (the dot product of the feature weights and the per-dimension cosine-based distances)
Update the cluster partitioning with the method discussed above
Update the cluster centers
Update the tau
UNTIL centers stabilize
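Stringing the listing together, here is a compact sketch of the whole loop in Python/numpy. This is not the authors' reference implementation: the weight and tau updates follow the formulas discussed in the next paragraphs, while K_tau, the clipping of negative weights, and the convergence test are my own choices.

import numpy as np

def skwic(X, C, max_iter=50, K_tau=0.1, tol=1e-4, seed=0):
    """Sketch of the SKWIC loop on rows of X (unit-length document frequency vectors)."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    centers = X[rng.choice(N, size=C, replace=False)].astype(float)  # C random documents
    V = np.full((C, n), 1.0 / n)                                     # equal feature weights
    tau = np.ones(C)
    labels = np.zeros(N, dtype=int)

    for _ in range(max_iter):
        # per-dimension cosine-based distances D_ijk = 1/n - x_jk * c_ik
        D = 1.0 / n - X[None, :, :] * centers[:, None, :]            # (C, N, n)

        # relevance weights: default 1/n plus a bias (average minus actual distance)
        for i in range(C):
            members = labels == i
            if not members.any():
                continue
            Di = D[i, members, :]                                    # (Ni, n)
            avg = Di.sum(axis=1, keepdims=True) / n                  # average per-dimension distance
            V[i] = 1.0 / n + (avg - Di).sum(axis=0) / (2.0 * tau[i])
        V = np.clip(V, 0.0, None)                                    # assumed: drop negative weights
        V = V / V.sum(axis=1, keepdims=True)                         # keep each row summing to 1

        # weighted aggregated distances and crisp partition update
        Dwc = np.einsum('ik,ijk->ij', V, D)                          # (C, N)
        labels = np.argmin(Dwc, axis=0)

        # recompute centers as cluster means, normalized to unit length
        new_centers = centers.copy()
        for i in range(C):
            docs = X[labels == i]
            if len(docs) > 0:
                c = docs.mean(axis=0)
                c[V[i] == 0] = 0.0                                   # irrelevant keywords get 0
                norm = np.linalg.norm(c)
                new_centers[i] = c / norm if norm > 0 else c

        # update tau from this iteration's errors and weights (used next iteration)
        for i in range(C):
            members = labels == i
            err = Dwc[i, members].sum() if members.any() else 0.0
            tau[i] = max(K_tau * err / max(np.sum(V[i] ** 2), 1e-12), 1e-12)

        if np.linalg.norm(new_centers - centers) < tol:              # centers stabilized
            centers = new_centers
            break
        centers = new_centers

    return labels, centers, V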
The difference between some similar feature-weighting approaches in pattern recognition and the one here is that the former assume the data has a multivariate Gaussian distribution, whereas this one makes no such assumption and treats the attribute directions independently.
We saw that the objective function is expressed in terms of cluster center errors and feature weights. Each cluster i is allowed to have its own set of feature weights Vi = [vi1, vi2, ..., vin]. The feature weight vectors are of at most unit length, and after clustering the feature weights of each cluster sum to one: vi1 + vi2 + ... + vin = 1. If we denote the objective function by J, then by setting the gradient of J to zero, we get the partial derivative equations we want to solve. The partial derivatives are taken with respect to the Lagrange multiplier in the first equation and with respect to the feature weight in the second equation.
The objective function has reached its final state when it does not change with respect to changes in the multiplier or the feature weight. That is why we set the gradient to zero and then solve for the feature weight vik.
We solve these equations analytically to obtain an expression for vik; this derivation needs to be done only once. The resulting equation for vik has two parts. The first part is a constant, 1/n, and is the default value when all attributes/keywords are treated equally and no discrimination is performed. The second term is a bias that can be either positive or negative. It comprises the aggregated difference between the average and the actual cosine-based distances along the individual dimensions. It is positive for compact attributes where the distance along this dimension is, on average, less than the average distance over all the dimensions. If an attribute is very compact, it is much more relevant to the cluster and receives an even larger weight. Since the individual cosine-based distance along a term-wise dimension can be negative, we allow this; such a dimension is even more similar than average, and the resulting bias only increases, emphasizing that dimension further. If the total aggregate dissimilarity becomes negative, the document is simply drawn even more strongly to that cluster, since we partition the data based on minimum distance. Thus we have obtained the attribute weight vik and have accounted for the second component of the objective function.
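In isolation, the update for a single cluster might look like the following sketch, where Di holds the per-dimension cosine-based distances of the documents currently assigned to cluster i (the same rule used in the loop sketched above):

import numpy as np

def update_relevance_weights(Di, tau_i):
    """Relevance weights for one cluster: the default 1/n plus a bias that compares
    the average per-dimension distance with the actual distance along each dimension.

    Di    : (Ni, n) per-dimension cosine-based distances of the cluster's documents
    tau_i : tuning parameter for this cluster
    """
    n = Di.shape[1]
    avg = Di.sum(axis=1, keepdims=True) / n           # average distance for each document
    bias = (avg - Di).sum(axis=0) / (2.0 * tau_i)     # aggregated over the cluster's documents
    return 1.0 / n + bias                             # note: may stray outside [0, 1]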
The choice of the tuning parameter tau with the objective function is important because it reflects the importance of the second term with respect to the first term. If the tau for the ith cluster is very small then only one keyword in that cluster will be relevant. The rest will receive zero weights. If it is large, then all the terms in that cluster will be relevant and may receive equal weights. The tau could be chosen such that both the terms of the objective function have the same order of magnitude.
We compute this tuning parameter tau numerically, in iterations. In iteration t, for each cluster, we take the aggregated cluster center errors from iteration t-1 and divide them by the sum of the squared feature weights, also from iteration t-1. The resulting fraction is multiplied by a constant to adjust the magnitude of tau.
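A sketch of that update, with the quantities carried over from iteration t-1 (the constant K_tau is an assumed knob, not a value from the paper):

import numpy as np

def update_tau(V_prev, Dwc_prev, labels_prev, K_tau=0.1):
    """tau_i for iteration t: a constant times the aggregated weighted distances of
    cluster i's documents at t-1, divided by the sum of squared feature weights at t-1."""
    C = V_prev.shape[0]
    tau = np.zeros(C)
    for i in range(C):
        members = labels_prev == i
        err = Dwc_prev[i, members].sum()              # aggregated cluster center errors
        tau[i] = K_tau * err / max(np.sum(V_prev[i] ** 2), 1e-12)
    return tau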
Note that with this update the feature relevance values can occasionally take values in excess of 1. In our iterations we may find that the values change a lot initially, but there is little or no change towards the later iterations.

Wednesday, October 23, 2013

In this post we will take an in-depth look at the formulas behind each of the steps mentioned in the previous post. Our goal is to understand not only how SKWIC works but also to describe it in stages that enable an implementation. We will assume that, given distance vectors with precomputed distances, we know how to cluster them.
In our case we begin with the similarity measure between document frequency vectors xi and xj.
This similarity is measured as the dot product divided by the product of the vectors' magnitudes. By that we mean we take an arbitrary term k between 1 and n, where n is the number of terms, and multiply the corresponding components xik and xjk. This has to be repeated for k = 1 to n, so we take the aggregated sum of the products xik * xjk over this range and normalize it by dividing by the magnitudes to compute the cosine similarity. Intuitively, cosine similarity could be positive or negative since the vectors may point in different directions. The steps involved are iteration, summation and normalization, so this is easy to do.
Computing it for all pairs of documents has a computational complexity of O(N^2). We will call this the first step.
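A direct transcription of this first step in Python/numpy (the toy documents are made up for illustration); computing all pairs is what gives the O(N^2) cost:

import numpy as np

def cosine_similarity_matrix(X):
    """Cosine similarity between every pair of document frequency vectors.
    X is an (N, n) matrix; entry (i, j) is sum_k x_ik * x_jk over the magnitudes."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                 # guard against empty documents
    Xn = X / norms
    return Xn @ Xn.T                        # N x N similarities

# example with 3 tiny documents over a 4-term vocabulary
X = np.array([[2., 1., 0., 0.],
              [1., 1., 1., 0.],
              [0., 0., 1., 3.]])
print(cosine_similarity_matrix(X))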
Now let's look at the next formula. Here we said we need to find a dissimilarity measure across the different attribute directions: we define the dissimilarity between document frequency vector xj and the ith cluster center vector using the cosine measure. This is the second equation, and it requires a third formula for the distance along the individual dimensions.
The third formula is easy because the distance along an individual dimension k is 1/n minus the product of xjk and cik, and is therefore never larger than 1/n.
Note that cik is the kth component of the ith cluster center vector, and the cluster centers start out as documents themselves, so cik is just a specially marked xik. That is why the third step is easy to compute given that we did the first step.
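The per-dimension distance and its weighted aggregate then take only a few lines (a sketch with assumed array shapes):

import numpy as np

def per_dimension_distance(x_j, c_i):
    """D_ijk = 1/n - x_jk * c_ik for every keyword k; each value is at most 1/n."""
    n = len(x_j)
    return 1.0 / n - x_j * c_i

def weighted_aggregated_distance(x_j, c_i, v_i):
    """Weighted aggregated cosine-based distance between document x_j and center c_i."""
    return float(np.dot(v_i, per_dimension_distance(x_j, c_i)))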
Next we look at the objective function. This has several components so we will take a look at each of them one at a time. The objective function searches for optimal cluster centers and the optimal set of feature weights simultaneously.
The first part of the objective function is the error to the cluster centers. It is minimized when only one keyword in each cluster is completely relevant and all other keywords are irrelevant. So we sum over each of the clusters.
The second part of the objective function is the sum of the squared keyword weights. When all the keywords have equal weights, we get the global minimum of this component. We also have a tuning parameter for the second component that we can set. Together, the objective function minimizes the sum of intra-cluster weighted distances, where the keyword weights are optimized for each cluster.
Instead of using the objective function directly, it is decomposed into C independent per-cluster problems. This removes the summation over all the clusters, and we then use a Lagrange multiplier technique with partial derivatives. To get the equations for the partial derivatives, we set the gradient to zero. We will also look at these equations in detail shortly.
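The decomposition is easy to see in code: the total objective is just the sum of C per-cluster terms, each of which can be minimized on its own. A sketch, writing the tuning parameter as tau_i (this post's listing below calls it delta-i):

import numpy as np

def cluster_objective(Xi, c_i, v_i, tau_i):
    """Objective restricted to one cluster: weighted per-dimension distances of its
    documents to the center, plus tau_i times the sum of squared keyword weights."""
    n = Xi.shape[1]
    D = 1.0 / n - Xi * c_i                        # per-dimension distances, shape (Ni, n)
    return float(np.sum(D @ v_i) + tau_i * np.sum(v_i ** 2))

def total_objective(X, labels, centers, V, tau):
    """The full objective is the sum of the C independent per-cluster problems."""
    return sum(cluster_objective(X[labels == i], centers[i], V[i], tau[i])
               for i in range(len(centers)))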
In the previous post, we introduced SKWIC from the paper by Frigui and Nasraoui. In this post, we take a look at how they update keyword weights and their clustering.
SKWIC algorithm is given as :
Fix the number of clusters C
Initialize the centers by randomly selecting C documents
Initialize the partition X, using equal feature weights (1/n), where n is the number of keywords in the vocabulary.
REPEAT
Compute cosine based distance between document xj and ith cluster center vector using a formula that involves the kth component of the ith cluster center vector
Compute the aggregated sum of the above
Update the cluster partitioning
Update the centers
Update the delta-i
UNTIL centers stabilize
We will discuss the steps above. Let xi and xj be document frequency vectors defined on a vocabulary of n terms. Then the cosine similarity measure between xi and xj, used to categorize the corresponding documents, is taken as the sum of the products xik * xjk divided by the product of their magnitudes.
They also introduce a dissimilarity measure between a document xj and the ith cluster center vector as the weighted aggregated sum of cosine-based distances along the individual dimensions.
This per-dimension distance is bounded above by 1/n, the inverse of the total number of terms, and is measured as the difference between 1/n and the product of xjk and cik, the kth component of cluster i's center.
The weighted aggregated sum is not normalized by the magnitudes because both the documents and the cluster centers are already normalized to unit length.
The weight in the weighted aggregated sum is obtained dynamically as the relevance weight of keyword k in cluster i.
Thus we have both the similarity measure and the dissimilarity measure for use with the document.
SKWIC is designed to search for the optimal cluster centers C and the optimal set of feature weights V simultaneously. Each cluster is allowed to have its own Vi = [vi1, ..., vin], and there is an objective function to find the optimum.

Tuesday, October 22, 2013


While we discussed graph based algorithms, we mentioned that there have been improvements since. We will discuss some of these today, using the survey by Frigui and Nasraoui, who come up with their own technique as well. One improvement has been to use clustering. Clustering is efficient for finding the nearest neighbors, be they terms or documents. We use a vector space model where each term is represented by a vector in the document space, and each document is considered a bag of words. The vectors can be assigned based on term frequencies or other statistical or probabilistic metrics. If term frequencies are used, frequent words with little discriminating power can be discounted with their Inverse Document Frequency (IDF) in the document collection. Since documents vary a lot, TF, IDF, or their combination is not sufficient on its own. So we could rely on a large corpus where the text has been manually classified, which provides a rich base for classifying terms or documents. However, both the categories and their respective keyword sets need to be discovered simultaneously.

Frigui and Nasraoui mention that selecting and weighting subsets of keywords in text documents is similar to the problem of feature selection and weighting in pattern recognition and data mining. Selecting the best set of features is important for real world tasks because noise from irrelevant features can severely degrade performance. Even if the data samples have already been classified into known classes, the authors suggest that it is preferable to model each complex class by several simple sub-classes or clusters and to use a different set of feature weights for each cluster. This can help in classifying documents into one of the preexisting categories. Since clustering and feature selection have usually been dealt with independently, the authors propose a scheme that does both simultaneously. They call this scheme SCAD, which stands for Simultaneous Clustering and Attribute Discrimination.
They also improve on weighting by using a continuous feature weighting method, so the feature relevance representation is richer than a binary or otherwise discrete one. Also, SCAD learns a different feature relevance representation for each cluster in an unsupervised manner. Hence the keyword weighting is both dynamic and category-dependent while being simultaneous with document clustering. One thing to note is that the distance metrics we described earlier can be Euclidean, which is not appropriate for text; for text, we rely on cosine similarity or the Jaccard index to assess the similarity between documents. Moreover, a text document can belong to several categories with different degrees. This is handled by assigning fuzzy or soft labels instead of assigning each document to a single-label category. There are also noise magnet clusters that take away the outliers from the regular clusters. SCAD refined in this way to work on text documents is called Simultaneous KeyWord Identification and Clustering of text documents (SKWIC).

We briefly discuss other synonym extraction methods. One is the distance method and another is the ArcRank method. In the distance method, one possible way of defining a synonym distance is to see whether the words appear in the definitions of many common words or have many common words in their definitions. A way of formalizing this is to count the number of words that appear in one of the definitions but not in both, in which case the distance can be expressed as |Ai - Aj| + |Aj - Ai|, where Ai is the row of the dictionary graph's adjacency matrix A for word i, i.e. the set of words used in its definition.
Unlike the other methods presented in this chapter, we can apply this algorithm directly to the entire dictionary graph rather than the neighborhood graph. However, it has been found that better results are obtained when the neighborhood graph is used.
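With a binary adjacency matrix of the dictionary graph, where row i marks the words used in the definition of word i, this distance is just a symmetric-difference count. A small sketch (the toy matrix is made up):

import numpy as np

def definition_distance(A, i, j):
    """|Ai - Aj| + |Aj - Ai|: words appearing in one definition but not in both."""
    return int(np.sum(A[i].astype(bool) ^ A[j].astype(bool)))

# tiny example: 4 words, A[i][k] = 1 if word k appears in the definition of word i
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 0],
              [1, 0, 0, 0]])
print(definition_distance(A, 0, 1))   # 2
print(definition_distance(A, 0, 2))   # 0 -> identical definitions, likely synonyms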
ArcRank is a method for building a thesaurus. The intent of the approach was not to find synonyms but related words. Both the ArcRank method and the method discussed in the previous post used the 1913 Webster dictionary as source. Here a PageRank value is associated with each vertex of the dictionary graph in the following way: all vertices start with an identical initial ranking and then iteratively distribute it to the vertices they point to, while receiving the sum of the ranks of the vertices that point to them. Here too the normalized ranking converges to the principal eigenvector of the adjacency matrix of the graph, so there is a natural order in the ranking. The algorithm is slightly modified so that the edge cases do not take extreme values, i.e. nodes with no incoming or no outgoing edges. If as is the number of outgoing edges of vertex s and pi the PageRank of vertex i, the edge relevance between s and t is defined as rs,t = (ps / as) / pt.
These edge relevances are then converted to rankings, which are computed only once. Related words for a word w can then be found by selecting the edges with the best ranking and extracting the incident vertices.
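A rough sketch of the ArcRank pipeline on a toy directed dictionary graph. The power-iteration PageRank with damping, the dangling-node handling, the assumption that a larger r marks a more significant arc, and the graph itself are my own simplifications rather than the original algorithm's details:

import numpy as np

def pagerank(adj, iters=100, damping=0.85):
    """Power-iteration PageRank on an adjacency matrix (adj[s, t] = 1 for an edge s -> t).
    The damping factor and the even spreading of rank from dangling nodes are assumptions."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    p = np.full(n, 1.0 / n)
    for _ in range(iters):
        new_p = np.zeros(n)
        for s in range(n):
            if out_deg[s] > 0:
                new_p[adj[s] > 0] += p[s] / out_deg[s]   # distribute rank along outgoing edges
            else:
                new_p += p[s] / n                        # dangling node: spread evenly
        p = (1 - damping) / n + damping * new_p
    return p / p.sum()

def arcrank_related(adj, w, top=2):
    """Score each edge with r_{s,t} = (p_s / a_s) / p_t and return the vertices incident
    to the best-scoring edges touching word w (here I take larger r as more significant)."""
    p = pagerank(adj)
    out_deg = adj.sum(axis=1)
    scored = [((p[s] / out_deg[s]) / p[t], s, t)
              for s in range(adj.shape[0]) for t in range(adj.shape[1]) if adj[s, t]]
    touching = sorted((item for item in scored if w in (item[1], item[2])), reverse=True)
    return [t if s == w else s for r, s, t in touching[:top]]

# toy dictionary graph over four words
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 0, 0]])
print(arcrank_related(adj, w=0))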