Sunday, November 24, 2013

I want to talk about using the K-means clustering algorithm to categorize documents based on Kullback-Leibler (KLD) distance. We have discussed finding the Kullback-Leibler distance between any two documents based on the probabilities of the terms from the vocabulary. Next we normalize this distance and use the normalized KLD distance for clustering.
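As a refresher, here is a minimal sketch of the distance itself, assuming each document has already been turned into a term-probability vector over a shared vocabulary. The epsilon smoothing for terms missing from one document, and the symmetrized variant, are my assumptions, not a fixed recipe from the earlier posts.

import math

def kl_divergence(p, q, epsilon=1e-9):
    """Kullback-Leibler divergence D(p || q) between two term-probability
    vectors over the same vocabulary. Epsilon avoids log(0) when a term
    appears in one document but not the other."""
    return sum(pi * math.log((pi + epsilon) / (qi + epsilon))
               for pi, qi in zip(p, q))

def symmetric_kld(p, q):
    """A symmetrized variant, more convenient as a document distance."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))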
Today we talk about how to cluster these documents. Let us take an example with K-means clustering. We take a sample set of documents from three categories, say news, reviews and editorials. We initially set the centroids to be representatives of their clusters. These can be picked from any of the documents that represent their categories, and they should ideally be as far apart from each other in terms of their cluster memberships as possible. Then, for each document, we find the KLD distance between the document and the centroids and assign it to the nearest cluster. After a round of assignments, we update the centroids. We do this by finding the average of all normalized distances within the cluster and picking the new centroid as the document closest to that average. Once the centroids are updated, we assign the labels again and re-adjust the cluster centers. We repeat this process until the cluster centers stabilize, as sketched below.
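Here is one way the loop above could look. I am reading the centroid update as medoid-like: compute the per-term average of a cluster's members and then pick the member closest to that average, so centroids remain actual documents. That reading, and the function names, are assumptions for illustration.

def kmeans_kld(docs, initial_centroids, distance, max_iter=20):
    """K-means style loop over term-probability vectors with a pluggable
    distance (e.g. symmetric_kld from the sketch above)."""
    centroids = list(initial_centroids)
    assignments = [0] * len(docs)
    for _ in range(max_iter):
        # Assignment step: each document goes to its nearest centroid.
        new_assignments = [min(range(len(centroids)),
                               key=lambda k: distance(doc, centroids[k]))
                           for doc in docs]
        if new_assignments == assignments:
            break  # cluster centers have stabilized
        assignments = new_assignments
        # Update step: pick the member closest to the cluster's per-term average.
        for k in range(len(centroids)):
            members = [d for d, a in zip(docs, assignments) if a == k]
            if not members:
                continue
            mean = [sum(col) / len(members) for col in zip(*members)]
            centroids[k] = min(members, key=lambda d: distance(d, mean))
    return assignments, centroids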
Clustering is independent of how the distance is calculated, so we can be as articulate about the distance as we want. In this case the distance was chosen to be KLD, but it could just as well be cosine, Jensen-Shannon or another distance. The distance computations can also be materialized ahead of time if we know the vocabulary or features beforehand, for example for keyword extraction. That is an optimization for known samples. In general, we pair a distance measure with each clustering run.
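For example, either of the following could be passed as the `distance` argument in the K-means sketch above in place of KLD. Both work on the same term-probability vectors; the epsilon smoothing in the Jensen-Shannon variant is again an assumption.

import math

def cosine_distance(p, q):
    """1 minus cosine similarity between two term vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm if norm else 1.0

def jensen_shannon_distance(p, q, epsilon=1e-9):
    """Square root of the Jensen-Shannon divergence: a symmetric, bounded
    relative of KLD measured against the midpoint distribution."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    kl = lambda x, y: sum(xi * math.log((xi + epsilon) / (yi + epsilon))
                          for xi, yi in zip(x, y))
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))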
Similarly, we can vary the clustering algorithm among the choices we listed earlier; several were mentioned in earlier posts. We can visualize different clustering algorithms, such as K-means and agglomerative, for the same distances. In the implementation, letting the clustering algorithm vary independently from the distance computation makes the code more flexible and lets us compare the algorithms.
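To show the same distance driving a different algorithm, here is a sketch of agglomerative clustering over a precomputed pairwise matrix, using SciPy's hierarchical clustering. The choice of average linkage and of SciPy itself are assumptions for illustration; any symmetric distance (such as symmetric_kld above) will do.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def agglomerative_with_distance(docs, distance, n_clusters=3):
    """Hierarchical clustering driven by the same pluggable distance."""
    n = len(docs)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = distance(docs[i], docs[j])
    tree = linkage(squareform(dist), method='average')
    return fcluster(tree, t=n_clusters, criterion='maxclust')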
When we take the sample data for the three categories, we will need to extract the vocabulary to use as features for the documents. This works with KLD as well. The documents can be selected to be representatives of their categories.
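A minimal sketch of that feature extraction step might look like this. The whitespace tokenization and lowercasing are simplifying assumptions; a real pipeline would also drop stop words and rare terms.

from collections import Counter

def term_probability_vectors(documents):
    """Build a shared vocabulary from the sample documents and turn each
    one into a normalized term-probability vector over that vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted(set(term for tokens in tokenized for term in tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append([counts[term] / total for term in vocabulary])
    return vocabulary, vectors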
Initially, to test the clusterer, we could select a small sample set and pre-label it. Then we can look at precision and recall. However, the sample sets we choose for unsupervised runs should be at least five to ten times the number of features.
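One simple way to score such a pre-labeled run is to map each cluster to its majority true label and compute per-category precision and recall, along these lines. The majority-label mapping is my assumption about how the evaluation would be done.

from collections import Counter

def precision_recall(true_labels, cluster_ids):
    """Map each cluster to its majority label, then report per-category
    (precision, recall) for a small pre-labeled sample."""
    majority = {c: Counter(t for t, cl in zip(true_labels, cluster_ids) if cl == c)
                   .most_common(1)[0][0]
                for c in set(cluster_ids)}
    predicted = [majority[c] for c in cluster_ids]
    scores = {}
    for label in set(true_labels):
        tp = sum(1 for t, p in zip(true_labels, predicted) if p == label and t == label)
        fp = sum(1 for t, p in zip(true_labels, predicted) if p == label and t != label)
        fn = sum(1 for t, p in zip(true_labels, predicted) if p != label and t == label)
        scores[label] = (tp / (tp + fp) if tp + fp else 0.0,
                         tp / (tp + fn) if tp + fn else 0.0)
    return scores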
