After the discussion on the results of simulations conducted by the authors of Fuzzy SKWIC, here are their observations on their simulation using the newsgroups data. Since the results showed that the inconsistently classified documents came from four miscellaneous classes, the run was repeated without these documents. Specifically, these included documents that lie in areas of overlap or fuzziness between distinct categories, documents that are outliers, and documents that degrade the purity of the resulting partition. In effect, they studied where the clustering did not perform so well. After discarding these documents, the clusters were more uniform and pure, and the keywords were richer in terms of their relevance weights and partitions.
This implies that without the noise, the clusters were homogeneous, compact, and pure, and the cluster-dependent keywords were more relevant and representative.
In order to measure the performance of the clustering in any scenario, they came up with a metric: the average entropy over all C clusters. The average weights each cluster's entropy by the ratio of the number of documents in that cluster to the overall number of documents. The entropy of the ith cluster is in turn the sum, over the K document classes, of the fraction of the cluster's documents belonging to each class times the log of that fraction, negated and scaled by a normalizing constant, the inverse of the log of K, so that each cluster's entropy lies between 0 and 1.
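As a sketch in code (assuming hard cluster assignments and my own variable names, since the paper's notation is not reproduced here), the measure works out to:

```python
import numpy as np

def average_cluster_entropy(labels, assignments, C, K):
    """Average entropy over C clusters, given true class labels (0..K-1)
    and cluster assignments (0..C-1) for N documents.

    E_i = -(1 / log K) * sum_j p_ij * log(p_ij), where p_ij is the fraction
    of cluster i's documents belonging to class j; the overall score weights
    each E_i by that cluster's share of all N documents.
    """
    labels = np.asarray(labels)
    assignments = np.asarray(assignments)
    N = len(labels)
    total = 0.0
    for i in range(C):
        in_cluster = labels[assignments == i]
        n_i = len(in_cluster)
        if n_i == 0:
            continue
        p = np.bincount(in_cluster, minlength=K) / n_i
        p = p[p > 0]                        # convention: 0 * log 0 = 0
        e_i = -(p * np.log(p)).sum() / np.log(K)
        total += (n_i / N) * e_i            # weight by cluster size
    return total

# toy usage: two clusters, three classes
# average_cluster_entropy([0, 0, 1, 2, 1, 2], [0, 0, 0, 1, 1, 1], C=2, K=3)
```

A value near 0 means each cluster is dominated by a single class; a value near 1 means the classes are maximally mixed within the clusters.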
By using this metric, they arrive at the conclusion that real unlabeled text data is the most challenging. Moreover, any manually labeled benchmark document set contains plenty of labeling errors that add to the noise, and documents that straddle more than one topic end up with an inadequate single label. So there is no clean baseline against which to compare the unsupervised clustering, but the results showed that automatic labeling can be superior to manual labeling.
The authors also suggest that an improvement to this kind of clustering would be to use some form of probabilistic latent semantic indexing, which would allow contexts to be differentiated further. Fuzzy SKWIC does simultaneous partitioning in two different hyperspaces: the document space, to capture the spatial organization of documents, and the keyword space, to capture context. Context can be inferred because it comes not from one keyword but from co-occurring relevant keywords. By giving documents fuzzy memberships in different clusters to varying degrees, the cluster-dependent keywords become richer and better suited to classify the documents in each cluster, and the documents within a cluster are more uniform, compact, and homogeneous.
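To make the fuzzy-membership idea concrete, here is a minimal sketch. It assumes a standard fuzzy C-means-style membership update over a TF-IDF matrix and a simplified, membership-weighted notion of keyword relevance; the real Fuzzy SKWIC algorithm learns the keyword weights jointly inside a weighted distance, so the names and formulas below are illustrative assumptions, not the authors' update equations.

```python
import numpy as np

def fuzzy_memberships(dist, m=2.0):
    """Fuzzy C-means-style memberships from a (C, N) matrix of
    cluster-to-document distances: u[i, j] in [0, 1], columns sum to 1."""
    d = np.maximum(dist, 1e-12)
    # u_ij = 1 / sum_k (d_ij / d_kj)^(2 / (m - 1))
    ratio = (d[None, :, :] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=0)

def keyword_relevance(X, u):
    """Illustrative cluster-dependent keyword weights: membership-weighted
    keyword mass per cluster, normalized to sum to 1 in each cluster
    (a stand-in for the relevance weights Fuzzy SKWIC learns jointly)."""
    w = u @ X                          # (C, n_terms)
    return w / w.sum(axis=1, keepdims=True)

# toy usage: X stands in for an (N, n_terms) TF-IDF matrix
rng = np.random.default_rng(0)
X = rng.random((6, 4))
centers = X[[0, 3]]                    # two seed documents as cluster centers
dist = np.linalg.norm(centers[:, None, :] - X[None, :, :], axis=2)
u = fuzzy_memberships(dist)            # (2, 6) soft assignments
print(keyword_relevance(X, u))         # (2, 4) per-cluster keyword weights
```

The point of the sketch is only the shape of the outputs: a soft membership matrix instead of hard assignments, and a separate keyword weight vector per cluster instead of one global weighting.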