Tuesday, October 29, 2013

In the previous post, we mentioned a few caveats, and we can elaborate on some of them here. For example, a document may contain terms that are not in the corpus, terms may be missing from both, and some document terms may need to be discounted; computing the KLD for these can yield an infinite value, so we treat them the same as unknown terms and cap them with an epsilon value. We also mentioned that the KLD is better suited to pre-processing for clustering than to the clustering itself. This is because the KLD has the following shortcomings: it does not work for all the data, and the choice of which terms to include changes the end result. For the clustering we instead want a measure that is independent of the order of selections, even when items can belong to different clusters to varying degrees. Furthermore, the distance metric used for clustering often has several components, including aggregated cosine distances.
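To make the epsilon capping concrete, here is a minimal sketch in Python. The function names and the epsilon value are our own illustration and not taken from the Bigi paper; it simply assumes unigram relative-frequency distributions for the document and the corpus.

```python
import math
from collections import Counter

EPSILON = 1e-6  # hypothetical floor for unknown or zero-probability terms


def distribution(tokens):
    """Relative term frequencies for a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def kld(doc_tokens, corpus_tokens, epsilon=EPSILON):
    """KL divergence of the document distribution from the corpus distribution.

    Terms that are missing from the corpus would make the ratio infinite,
    so they are treated like unknown terms and capped with epsilon, keeping
    every document term's contribution finite.
    """
    p = distribution(doc_tokens)
    q = distribution(corpus_tokens)
    divergence = 0.0
    for term, p_t in p.items():
        q_t = q.get(term, epsilon)  # unknown corpus term: fall back to epsilon
        divergence += p_t * math.log(p_t / max(q_t, epsilon))
    return divergence
```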
On the other hand, we can better tune our clustering when we try it out on a smaller dataset. So while the KLD fits in very well as a way to reduce the dataset, the clustering itself works well on the smaller dataset that results. The clustering technique should work regardless of size, and as the dataset improves, the clustering can be refined to the point where we can identify the different groups.
One way to test the clustering would be to use a smaller sample. If it works for a smaller sample, it should work for more data, because the clustering relies on a metric that is independent of the number of data points. In the degenerate case of a single word, the cluster center is the word itself.
When we talk of smaller datasets for clustering, we will find it easier to verify the results with manual classification and categorization. This will help with measuring precision and recall. Note that, from our previous examples, precision is the ratio of the true positives to the total positive cases, i.e. the sum of true positives and false positives. Recall, on the other hand, is the ratio of the true positives to the sum of true positives and false negatives. In other words, recall is how many of the relevant items were retrieved, while precision is how many of the retrieved items were relevant.
With smaller samples, we can easily measure the precision and recall and fine tune our clustering.
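As a small illustration of these definitions, the sketch below compares a retrieved set (say, the documents a cluster assigns to a topic) against a manually labelled relevant set. The helper name and the example documents are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a hand-labelled relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Example: the cluster retrieves 4 documents, 3 of them correct, while the
# manual labels list 5 relevant documents: precision = 3/4, recall = 3/5.
print(precision_recall({"d1", "d2", "d3", "d7"}, {"d1", "d2", "d3", "d4", "d5"}))
```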
One suggestion is that, regardless of the sample we work with, we try text from different categories and not just from the corpus. Web data can be used for the trials since it has a lot more variety.
The discussion on the KLD above draws from the paper on the same topic by Brigitte Bigi, and the discussion on clustering draws from the paper on topic detection by Christian Wartena and Roger Brussee. The paper by Wartena and Brussee describes the clustering and the pre-processing we refer to.
