Monday, November 18, 2013

Wartena mentions in this paper that keyword selection was done by filtering out general terms, requiring that a keyword be sufficiently different from the background distribution. The technique uses Kullback-Leibler divergence to check whether a term is different enough, with a cutoff of the form KLD(p_tk, q) > 1. In the Kullback-Leibler measure, the distances are normalized against an empty document whose probability distribution consists entirely of epsilon values, and the term probability is measured as P(tk | q) = tf(tk, q) / sum over all terms tx in q of tf(tx, q). When we take the KLD for a term in a document containing just that term and weigh it against the KLD for the same term in the distribution Q, some terms show higher ratios. Applying a threshold to this score selects terms that are neither too frequent nor overly general. The threshold also reduces the terms in Q, which may be quite numerous, to the few with higher discriminative power, and choosing terms this way gives us representative keywords for any sample document.
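A rough sketch of this selection step, under my own interpretation rather than code from the paper: the score below is the per-term contribution to the KL divergence between the document's term distribution and the background distribution Q, with the epsilon smoothing standing in for the empty-document normalization. The function names, the epsilon value, and the exact form of the score are assumptions made for illustration.

```python
import math
from collections import Counter

EPSILON = 1e-9  # stands in for the "empty document" of all-epsilon probabilities

def term_distribution(tokens):
    """P(tk | q) = tf(tk, q) / sum over all terms tx in q of tf(tx, q)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def kld_contribution(term, p, q):
    """Per-term contribution to KLD(p || q), smoothed with EPSILON so terms
    missing from either distribution do not blow up the logarithm."""
    p_t = p.get(term, EPSILON)
    q_t = q.get(term, EPSILON)
    return p_t * math.log2(p_t / q_t)

def select_keywords(doc_tokens, background, threshold):
    """Keep terms whose divergence from the background exceeds the cutoff,
    ordered from most to least discriminative."""
    p_doc = term_distribution(doc_tokens)
    scored = {t: kld_contribution(t, p_doc, background) for t in p_doc}
    return sorted((t for t, s in scored.items() if s > threshold),
                  key=scored.get, reverse=True)
```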
From all the words in a sample document, each term can be weighted this way, which gives us a keyword selection technique. The number of terms extracted can be varied by adjusting the threshold, as in the usage example below.
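Continuing the sketch above, a hypothetical usage that treats the cutoff as that adjustable dial; the toy corpus, sample sentence, and cutoff values are made up for illustration.

```python
# Pool the background distribution Q from a small corpus of token lists.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stocks fell as the market slid on rate fears".split(),
]
background = term_distribution([t for doc in corpus for t in doc])

sample = "the central bank raised the interest rate again".split()

# Raising the cutoff keeps fewer, more discriminative terms.
for cutoff in (0.02, 0.05, 0.10):
    print(cutoff, select_keywords(sample, background, cutoff))
```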
By increasing the size of the distribution Q, or by choosing a category similar to the sample document, we could further improve the selection.
Thereafter, we could use the document categories and sample document clustering for simultaneous feature extraction and document categorization.
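A loose sketch of that last idea, assuming standard scikit-learn components rather than anything from the paper: the KLD-selected keywords could serve as the feature vocabulary when clustering the sample documents, so feature extraction and grouping happen in one pass.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative documents and an illustrative keyword list, e.g. the output
# of select_keywords() from the sketch above.
docs = [
    "interest rates and inflation outlook",
    "central bank raises interest rates",
    "league final ends in penalty shootout",
    "striker scores twice in cup final",
]
keywords = ["interest", "rates", "inflation", "final", "striker", "penalty"]

# Restrict the document-term matrix to the selected keywords, then cluster.
X = CountVectorizer(vocabulary=keywords).fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # cluster assignments driven only by the keyword features
```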
