Wednesday, October 23, 2013

In the previous post, we introduced SKWIC (Simultaneous Keyword Identification and Clustering) from the paper by Frigui and Nasraoui. In this post, we take a look at how they update keyword weights and cluster assignments.
The SKWIC algorithm is given as:
Fix the number of clusters C
Initialize the centers by randomly selecting C documents
Initialize the partitions X, using equal feature weights (1/n), where n is the number of terms in the vocabulary and N is the number of documents in the collection
REPEAT
Compute the cosine-based distance along each term k between document xj and the ith cluster center: Dk_wij = 1/n - xjk * cik
Update the keyword relevance weights vik
Compute the weighted aggregated sum of these per-term distances using the updated weights
Update the cluster partitioning
Update the centers
Update the delta-i
UNTIL centers stabilize
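
To make the control flow concrete, here is a minimal sketch of this loop in Python over unit-normalized document vectors. The names (skwic, K, deltas) and several simplifications are ours: weights are clipped and renormalized rather than adjusted exactly as in the paper, the weights are updated after the assignment step, and the centers are plain renormalized means, so treat it as an illustration of the iteration rather than a faithful implementation.

import numpy as np

def skwic(X, n_clusters, n_iters=20, K=0.1, seed=0):
    """Sketch of SKWIC: X is an (N, n) matrix of document vectors,
    each row normalized to unit length."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    centers = X[rng.choice(N, n_clusters, replace=False)]  # C random documents
    V = np.full((n_clusters, n), 1.0 / n)                  # equal weights 1/n
    deltas = np.full(n_clusters, 1.0)

    for _ in range(n_iters):
        # Per-term distances D[i, j, k] = 1/n - x_jk * c_ik
        D = 1.0 / n - X[None, :, :] * centers[:, None, :]
        # Weighted aggregated distance of document j to cluster i
        Dwc = np.einsum('ik,ijk->ij', V, D)
        # Assign each document to its nearest cluster
        labels = Dwc.argmin(axis=0)
        for i in range(n_clusters):
            members = X[labels == i]
            if len(members) == 0:
                continue
            Di = 1.0 / n - members * centers[i]
            # Relevance weights: 1/n plus a bias toward terms whose
            # distance to the center is below the per-document average
            bias = (Di.sum(axis=1, keepdims=True) / n - Di).sum(axis=0)
            V[i] = 1.0 / n + bias / (2.0 * deltas[i])
            V[i] = np.clip(V[i], 0.0, None)   # simplification: clip negatives
            V[i] /= V[i].sum()                # keep weights summing to one
            # Update the center and renormalize to unit length
            centers[i] = members.mean(axis=0)
            centers[i] /= np.linalg.norm(centers[i]) + 1e-12
            # Re-estimate delta_i to balance the two terms of the objective
            deltas[i] = K * (V[i] * Di).sum() / (V[i] ** 2).sum()
    return labels, centers, V

Calling skwic(tfidf_matrix, 3) on row-normalized TF-IDF vectors would return the cluster labels, the centers, and one keyword-weight vector per cluster.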
We will now discuss the steps above. Let xi and xj be document frequency vectors defined over a vocabulary of n terms. The cosine similarity between xi and xj, used to decide whether the corresponding documents belong together, is the sum of the componentwise products xik * xjk divided by the product of the two vectors' magnitudes.
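
Written out, this is the standard cosine similarity:

\[
\cos(x_i, x_j) \;=\; \frac{\sum_{k=1}^{n} x_{ik}\, x_{jk}}{\lVert x_i \rVert \,\lVert x_j \rVert}
\]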
They also introduce a dissimilarity measure between document xj and the ith cluster center, defined as the weighted aggregated sum of cosine-based distances along the individual dimensions.
Each per-term distance is bounded above by 1/n, the inverse of the total number of terms, and is computed as the difference between 1/n and the product of xjk with cik, the kth component of the ith cluster center.
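
In symbols, with ci the ith cluster center and vik the relevance weight of term k in cluster i, the per-term distance and its weighted aggregate are (up to notation):

\[
D^{k}_{w_{ij}} \;=\; \frac{1}{n} \;-\; x_{jk}\, c_{ik},
\qquad
\widetilde{D}_{wc_{ij}} \;=\; \sum_{k=1}^{n} v_{ik}\, D^{k}_{w_{ij}}
\]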
The weighted aggregated sum is not normalized by the vector magnitudes because the document and center vectors are already normalized to unit length, and the weights themselves sum to one.
The weights in the weighted aggregated sum are obtained dynamically, in every iteration, as the relevance weight vik of keyword k in cluster i.
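
The update itself comes from minimizing the objective given at the end of this post with a Lagrange multiplier enforcing that each cluster's weights sum to one. Solving gives a closed form that we believe matches the paper's update up to notation: a default of 1/n plus a bias toward terms whose distance to the center is below the average over all terms:

\[
v_{ik} \;=\; \frac{1}{n} \;+\; \frac{1}{2\,\delta_i} \sum_{x_j \in X_i}
\left[ \frac{1}{n} \sum_{l=1}^{n} D^{l}_{w_{ij}} \;-\; D^{k}_{w_{ij}} \right]
\]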
Thus we have both a similarity measure between pairs of documents and a dissimilarity measure between a document and a cluster center.
SKWIC is designed to search for the optimal cluster centers C and the optimal set of feature weights V simultaneously. Each cluster is allowed to have its own weight vector Vi = [vi1, ..., vin], and an objective function is minimized to reach this optimum.
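
As we recall it (up to notation), the objective sums the weighted within-cluster distances and adds a quadratic penalty on the weights, subject to each cluster's weights summing to one:

\[
J(\mathcal{C}, V; \mathcal{X}) \;=\; \sum_{i=1}^{C} \sum_{x_j \in X_i} \sum_{k=1}^{n} v_{ik}\, D^{k}_{w_{ij}}
\;+\; \sum_{i=1}^{C} \delta_i \sum_{k=1}^{n} v_{ik}^{2},
\qquad \text{subject to } \sum_{k=1}^{n} v_{ik} = 1 \text{ for each } i
\]

The delta-i trade the two terms off against each other: a small delta-i concentrates the weight on a few discriminating keywords, while a large delta-i keeps the weights near the uniform 1/n, which is why the algorithm re-estimates delta-i in every iteration.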
