Some observations from the experiments reported by the authors of Fuzzy SKWIC. We cover their runs with the plain (hard) SKWIC as well as the fuzzy improvement.
Their first simulation used hard clustering. The collection consisted of 200 documents spread equally across four categories. Stop words were filtered and the terms were stemmed. The terms' IDF values were sorted in descending order to pick out 200 keywords. Each document was then represented by a vector of the frequencies of these keywords, normalized to unit length. With four clusters, SKWIC converged in five iterations. Six cluster-dependent keywords were chosen for each cluster and, on inspection, they correctly represented the cluster's category. Most of the errors were confined to a single partition, which contained the documents that straddled two or more topics.
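A minimal sketch of that preprocessing pipeline, assuming scikit-learn and NLTK are available; `docs` is a placeholder for the 200-document collection, and the IDF-based keyword selection follows the description above rather than the authors' exact code.

```python
# Sketch only: stop-word filtering, stemming, IDF-ranked keyword selection,
# and unit-length document vectors, as described in the notes above.
import numpy as np
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

docs = ["..."]  # placeholder for the 200-document collection

stemmer = PorterStemmer()

def stem_analyzer(doc, base=CountVectorizer(stop_words="english").build_analyzer()):
    # tokenize, drop stop words, then stem each remaining term
    return [stemmer.stem(t) for t in base(doc)]

vectorizer = CountVectorizer(analyzer=stem_analyzer)
counts = vectorizer.fit_transform(docs).toarray()        # term counts per document
terms = np.array(vectorizer.get_feature_names_out())

# sort terms by IDF in descending order and keep the top 200 as keywords
df = np.count_nonzero(counts, axis=0)                    # document frequency of each term
idf = np.log(len(docs) / df)
top = np.argsort(-idf)[:200]
keywords = terms[top]

# represent each document over the 200 keywords, normalized to unit length
X = counts[:, top].astype(float)
norms = np.linalg.norm(X, axis=1, keepdims=True)
X = np.divide(X, norms, out=np.zeros_like(X), where=norms > 0)
```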
One observation worth highlighting: regardless of how the keywords were extracted, the number of keywords was kept on the order of a few hundred. The ratio of keywords to documents is incidental here, but the choice to keep the keyword count small is deliberate. We should follow a similar practice.
Now on to the run with fuzzy labels: the results were again with four clusters and m = 1.1. The cluster-center updates converged after 27 iterations. The results showed a significant improvement over the earlier run, particularly for the partition that had contained the misclassified documents; in this run those documents were classified correctly. A closer look at them suggested that they should have had soft labels from the start, i.e., they should not have been assigned a single label. The results also indicate that the labels generated by Fuzzy SKWIC were richer and more descriptive of their categories, indicating a good fit for this application.
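To make the role of the fuzzifier m concrete, here is a small sketch of an FCM-style membership update of the kind Fuzzy SKWIC builds on. It assumes a non-negative aggregate (keyword-weighted) distance D between each cluster and document has already been computed; the distance and center/weight updates are not shown. With m = 1.1 the exponent 1/(m-1) is 10, so memberships stay close to hard assignments while still allowing documents to be shared between clusters.

```python
import numpy as np

def fuzzy_memberships(D, m=1.1):
    """FCM-style membership update.
    D: (n_clusters x n_documents) matrix of non-negative aggregate distances.
    Returns u with u[i, j] = 1 / sum_k (D[i, j] / D[k, j]) ** (1 / (m - 1))."""
    eps = 1e-12                                            # guard against zero distances
    ratio = (D[:, None, :] + eps) / (D[None, :, :] + eps)  # ratio[i, k, j] = D_ij / D_kj
    return 1.0 / np.sum(ratio ** (1.0 / (m - 1.0)), axis=1)

# usage sketch with made-up distances: 4 clusters x 200 documents
D = np.random.rand(4, 200)
U = fuzzy_memberships(D, m=1.1)
assert np.allclose(U.sum(axis=0), 1.0)   # each document's memberships sum to 1
```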
A third simulation was done with much more varied data: the 20 Newsgroups collection of 20,000 messages. Here the messages were stripped of their structural data, in addition to being stemmed and stop-word filtered. The number of clusters was 40. The feature weights varied considerably from cluster to cluster, and many documents migrated between clusters during the run. Ten relevant keywords were chosen for each cluster; again, they reflected their clusters well. Some indicated that the discussions were about an event, while others indicated that they pertained to an issue the participants found interesting.
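A brief sketch of how this setup could be reproduced today, assuming scikit-learn's copy of 20 Newsgroups (which removes headers, footers, and quotes as an approximation of "stripping structural data" and holds slightly fewer than 20,000 messages). The cluster keyword-weight matrix `V` is assumed to come from a SKWIC/Fuzzy SKWIC run, which is not reproduced here.

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups

# strip headers, footers, and quoted replies (the messages' structural data)
news = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))

def top_keywords(V, terms, k=10):
    """Return the k highest-weighted terms per cluster.
    V: (n_clusters x n_terms) keyword-weight matrix from the clustering run.
    terms: array of term strings aligned with V's columns."""
    return [terms[np.argsort(-V[i])[:k]] for i in range(V.shape[0])]
```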