Saturday, November 16, 2013

KLD and clustering
A paper by Pinto, Benedi & Rosso on clustering narrow-domain short texts with KLD gives a good example of this kind of clustering. KLD is normally used to calculate the distance between two probability distributions; here the authors use it to measure how similar or dissimilar two documents are so that the documents can be clustered. They deliberately work with narrow-domain texts to show that the method holds up even in that setting. The challenge with narrow-domain short texts is that term frequencies are very low and term selection has to be done carefully. If we take the vocabulary across all the texts, only a small percentage of those terms will appear in any one short text, and even those may occur only once or twice. Moreover, a small change in a frequency can change the clustering results considerably, so this is a good stress test for the clustering. A real-life example of this problem space is the categorization of scientific paper abstracts by repositories. Authors do provide keywords, but the keywords alone are not sufficient to classify the text, and some are misleading, which results in poor classification. Moreover, the number of categories is not known beforehand.
In their study, the authors use hierarchical clustering methods with a symmetric KLD distance as the divergence measure. There are various ways of expressing KLD.
We have seen the asymmetric version as D(P||Q) = Sum_x P(x) log(P(x)/Q(x)).
Then we have seen the symmetric one as D(P||Q) = Sum_x (P(x) - Q(x)) log(P(x)/Q(x)),
then there is D(P||Q) = 1/2 [ D(P || (P+Q)/2) + D(Q || (P+Q)/2) ],
and finally the max form D(P||Q) = max[ D(P||Q), D(Q||P) ].
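To make these variants concrete, here is a minimal sketch in Python (my own, not from the paper) of the four forms. It assumes the two distributions are NumPy arrays over the same vocabulary with no zero entries; the back-off scheme described below is what guarantees that in practice.

```python
import numpy as np

def kld(p, q):
    """Asymmetric Kullback-Leibler divergence D(P||Q)."""
    return float(np.sum(p * np.log(p / q)))

def kld_symmetric(p, q):
    """Symmetric form: Sum_x (P(x) - Q(x)) log(P(x)/Q(x))."""
    return float(np.sum((p - q) * np.log(p / q)))

def kld_midpoint(p, q):
    """Average divergence of P and Q to their midpoint (P+Q)/2."""
    m = (p + q) / 2.0
    return 0.5 * (kld(p, m) + kld(q, m))

def kld_max(p, q):
    """Max form: the larger of D(P||Q) and D(Q||P)."""
    return max(kld(p, q), kld(q, p))
```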
Since the texts differ in the terms they contain, the frequency of many vocabulary terms in a given document is zero. To avoid zero probabilities, a back-off scheme is used, similar to the one we discussed earlier. The document probability of a term tk in document dj is a scaling constant times P(tk|dj) if tk occurs in dj, and a small constant epsilon otherwise; in other words, all vocabulary words that don't appear in the document are treated the same way as unknown words and given a default value. The probability of term tk in dj is calculated as P(tk|dj) = tf(tk, dj) / Sum over tx in dj of tf(tx, dj). A document dj is then represented by a term vector of these probabilities, and the distance between two documents is the KLD measure between their vectors. Finally, each term vector is normalized so that its probabilities sum to one; the scaling constant for the terms that do occur in the document is therefore 1 - sum of the epsilons assigned to the terms that do not occur in it. Note that the convention is to use a predetermined vocabulary; we don't invert the problem by building the vocabulary from the documents as we process them. In this study, most of the corpora have several thousand terms, the vocabulary consists of a few thousand terms, and the individual documents contain about a hundred terms. A sketch of the whole scheme follows.
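Putting the back-off, the normalization, and the symmetric divergence together, here is a rough end-to-end sketch. The vocabulary, the toy documents, the epsilon value, and the choice of SciPy's average-link hierarchical clustering are all my own illustrative assumptions, not the paper's setup.

```python
from collections import Counter

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

EPSILON = 1e-5  # illustrative back-off probability for vocabulary terms absent from a document

def doc_distribution(tokens, vocabulary):
    """Probability vector over a fixed vocabulary, with epsilon back-off for absent terms."""
    tf = Counter(t for t in tokens if t in vocabulary)
    total = sum(tf.values())
    n_absent = sum(1 for t in vocabulary if t not in tf)
    beta = 1.0 - n_absent * EPSILON  # scaling constant so the vector still sums to one
    return np.array([beta * tf[t] / total if t in tf else EPSILON for t in vocabulary])

def symmetric_kld(p, q):
    """Symmetric KLD used as the document-to-document distance."""
    return float(np.sum((p - q) * np.log(p / q)))

# Toy predetermined vocabulary and tokenized documents (illustrative only).
vocabulary = ["clustering", "divergence", "kld", "protein", "short", "text"]
docs = [["kld", "clustering", "text"],
        ["short", "text", "clustering", "kld"],
        ["protein", "protein", "divergence"]]

P = [doc_distribution(d, vocabulary) for d in docs]

# Pairwise symmetric KLD distances, fed into average-link hierarchical clustering.
n = len(docs)
D = np.array([[symmetric_kld(P[i], P[j]) for j in range(n)] for i in range(n)])
Z = linkage(squareform(D, checks=False), method="average")
print(fcluster(Z, t=2, criterion="maxclust"))  # cluster labels for the three toy documents
```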
 
