Sunday, August 31, 2014

In the Kullback-Leibler divergence that we mentioned in earlier posts, we saw that the divergence was measured word by word, as the average probability of that word against the background distribution. We calculate nw as the number of occurrences of the term w, and pw as nw divided by the total number of terms in the document, and we measured P(tk|q) as [nw / Sum-for-all-terms-x-in-q (nx)]. When the divergence was greater than a threshold, we selected the term as a keyword.
It is not necessary to measure the divergence term by term against the background distribution of a document, because the metric holds for any two distributions P(x) and Q(x): their divergence is measured as the sum over all x of (P(x) - Q(x)) log(P(x)/Q(x)). The term distribution of a sample document is then compared with the distribution of each of the categories, which number C. The probability distribution of a term tk in a document dj is the ratio of the term's frequency in that document to the overall term frequency across all documents when the term appears in the document, and zero otherwise. The term probability distribution across categories is normalized to 1.
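To make the comparison of two whole distributions concrete, here is a minimal sketch in Python. It assumes both distributions are plain dictionaries over the same terms with strictly positive probabilities; the function name kl_distance and the sample numbers are made up for illustration and are not from any library.

```python
import math


def kl_distance(p, q):
    """Sum over x of (P(x) - Q(x)) * log(P(x) / Q(x)).

    Assumes p and q map the same terms to strictly positive probabilities.
    """
    if p.keys() != q.keys():
        raise ValueError("P and Q must be defined over the same terms")
    total = 0.0
    for term in p:
        if p[term] <= 0 or q[term] <= 0:
            raise ValueError("probabilities must be strictly positive")
        total += (p[term] - q[term]) * math.log(p[term] / q[term])
    return total


# Hypothetical distributions: one sample document and two categories.
doc = {"kernel": 0.5, "thread": 0.3, "recipe": 0.2}
cat_a = {"kernel": 0.4, "thread": 0.4, "recipe": 0.2}
cat_b = {"kernel": 0.1, "thread": 0.1, "recipe": 0.8}

# The category with the smallest distance is the closest match.
closest = min(("cat_a", cat_a), ("cat_b", cat_b),
              key=lambda item: kl_distance(doc, item[1]))[0]
```

Every term in the sum is non-negative, because P(x) - Q(x) and log(P(x)/Q(x)) always have the same sign, so the total is zero only when the two distributions agree on every term.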
The equation we refer to for the divergence comes in many forms.
Most equations use a default value for when a term is missing from either P or Q. This is because zero values skew the equation: a zero probability makes the log ratio undefined. The default probability, epsilon, corresponds to an unknown word.
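One way to apply such a default value is sketched below in Python. The value of EPSILON, the function name term_distribution, and the choice to renormalize after substituting epsilon are all assumptions made for illustration; the different forms of the equation handle this step differently.

```python
from collections import Counter

EPSILON = 1e-6  # assumed default probability for an unknown word


def term_distribution(tokens, vocabulary, epsilon=EPSILON):
    """Return a probability for every term in vocabulary, none of them zero."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Relative frequency for terms that occur, epsilon for terms that do not.
    raw = {
        term: (counts[term] / total if counts[term] > 0 else epsilon)
        for term in vocabulary
    }
    # Renormalize so that the probabilities sum to 1 again.
    norm = sum(raw.values())
    return {term: value / norm for term, value in raw.items()}


# Hypothetical document and vocabulary; "recipe" never occurs in the document.
dist = term_distribution(["kernel", "kernel", "thread"],
                         ["kernel", "thread", "recipe"])
```

With every probability strictly positive, the log ratio is defined for all terms in the vocabulary.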
