Thursday, November 21, 2013

In the unsupervised category, MCMM was compared with K-means clustering on the same DS2 dataset. Three of the four clusters align very well with the labels given by Reuters and achieve very high precision scores, especially considering that label information was unavailable during training. The algorithm stops once it has added a cluster center that does not increase the gain function, which is why the first three clusters are more clearly delineated than the last one. The fourth cluster found by the MCMM is a general subtopic within the dataset, almost like a miscellaneous group.
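The stopping rule can be illustrated with a minimal sketch. This is not the MCMM itself: `fit_and_score` is a hypothetical callback standing in for "fit a model with k cluster centers and return its gain", and the gain values in `toy_gain` are made up purely to show the control flow.

```python
def grow_until_no_gain(fit_and_score, max_clusters=10):
    """Add one cluster center at a time and stop at the first k
    whose gain does not improve on the best gain seen so far.

    fit_and_score is a hypothetical callback: k -> gain (float).
    Returns the list of (k, gain) pairs that were tried.
    """
    history = []
    best = float("-inf")
    for k in range(1, max_clusters + 1):
        gain = fit_and_score(k)
        history.append((k, gain))
        if gain <= best:
            break  # the newest cluster center did not increase the gain
        best = gain
    return history


# Made-up gain values, for illustration only.
toy_gain = {1: 0.40, 2: 0.62, 3: 0.71, 4: 0.71, 5: 0.90}
history = grow_until_no_gain(lambda k: toy_gain[k])
print(history)
```

With these toy values the loop stops at the fourth center, because it fails to raise the gain over three centers. Whether that last candidate is kept or discarded is a design choice the sketch leaves to the caller.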
In the earlier supervised run, we found clusters that did not match the Reuters labels. A closer inspection of those clusters showed that they were well formed and that some documents had originally been mislabeled.
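One way to carry out this kind of inspection is to score each cluster against the existing labels and list the documents that disagree with their cluster's majority label. The sketch below assumes precision means the fraction of a cluster's members carrying its majority label, which may differ from the measure used in the evaluation. The cluster assignments and labels are toy values, not Reuters data.

```python
from collections import Counter


def cluster_precision(cluster_ids, labels):
    """Map each cluster to (majority label, fraction of members with that label)."""
    members = {}
    for c, y in zip(cluster_ids, labels):
        members.setdefault(c, []).append(y)
    result = {}
    for c, ys in members.items():
        label, count = Counter(ys).most_common(1)[0]
        result[c] = (label, count / len(ys))
    return result


def disagreements(cluster_ids, labels):
    """Indices of documents whose label differs from their cluster's majority label."""
    majority = {c: lab for c, (lab, _) in cluster_precision(cluster_ids, labels).items()}
    return [i for i, (c, y) in enumerate(zip(cluster_ids, labels)) if majority[c] != y]


# Toy data for illustration only.
clusters = [0, 0, 0, 1, 1, 1, 1]
labels = ["grain", "grain", "trade", "trade", "trade", "trade", "grain"]

print(cluster_precision(clusters, labels))
print(disagreements(clusters, labels))
```

The documents returned by `disagreements` are only candidates for review. A mismatch can mean a mislabeled document or a poorly formed cluster, so each one still needs a manual look.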
Thus MCMM appears to be a practical machine learning method for text categorization. The ability to run essentially the same algorithm in both supervised and unsupervised modes helps in evaluating existing labels. There is also the flexibility to include a document in more than one category, along with a ranked list of its degree of inclusion in each cluster.
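As a rough sketch of the multi-category output, suppose each document comes with a degree of inclusion between 0 and 1 for every cluster. The function name, the 0.5 threshold, and the membership values below are all assumptions made for illustration; the MCMM's actual output format may differ.

```python
def ranked_categories(memberships, threshold=0.5):
    """Return (category, degree) pairs at or above threshold, strongest first.

    memberships: dict mapping a category name to a degree of inclusion in [0, 1].
    """
    kept = [(cat, deg) for cat, deg in memberships.items() if deg >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical membership degrees for a single document.
doc = {"cluster_1": 0.91, "cluster_2": 0.08, "cluster_3": 0.64, "cluster_4": 0.35}
print(ranked_categories(doc))
```

Lowering the threshold admits weaker memberships, so the same document can appear in more categories while the ranking still shows where it fits best.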
