Sunday, May 19, 2013

Review : Paper on comparision of document clustering techniques by Steinbach, Karypis and Kumar from Univerity of Minnesota.
This paper compares k-means clustering methods to hierarchical clustering methods. The paper suggests that bisecting k-means technique is better than the standard k-means technique which is in turn better than hierarchical techniques.
Hierarchical clustering has quadratic time complexity where as K-means have a time complexity that's linear. There are mixed approaches too.
There are two metrics used for cluster quality analysis. Entropy is one which provides a measure of goodness for single level clusters. F-measure is the other which measures the effectiveness of hierarchical clustering.
The bisecting k-means clustering is explained as follows:
Step 1 pick a cluster to split
Step 2 find two clusters using the basic k-means algorithm
Step 3 Repeat step 2 for a fixed number of times and take the split that produces the clustering with the highest overall similarity
Step 4 Repeat step 1, 2 and 3 until the desired number of clusters are reached.
Splitting the largest cluster also works.

Agglomerative hierarchical clustering have the following variations:
1) Intra cluster similarity : This hierarchical clustering looks at the similarity of all documents in the cluster to the centroid where the similarity distance is given as the sum of cosines. The pair of clusters that when merged leads to smallest decrease in similarity.

2) Centroid similarity technique: This works in a similar way but it takes the similarity distance as the cosine between the centroids of the two clusters.

3) The UPGMA scheme is based on the cluster similarity measure which takes the sum of the cosine distances between two documents of different clusters divided by the product of the sizes of their clusters,.

An explanation is given for why the agglomerative hierarchical clustering performs poorly when compared with bisecting k-means that mentions that the former puts documents of the same class in the same cluster and this is done early on and generally not reversed.
 

No comments:

Post a Comment