Friday, January 17, 2014

Today we take a short break from our discussion on Teradata.
Using Kullback-Leibler divergence to extract keywords from a single document:
This post describes a simple way to extract keywords using term weighting.
Terms are weighted by their Kullback-Leibler divergence D(p_t || q) from a background distribution q, where p_t is the probability of the term t.
The terms are evaluated one by one, and a term is selected if its divergence is above a cutoff.
Initially, we estimate the probability of a term by counting its occurrences in the document. We could later refine this to be a lookup against a dictionary of words and their expected probabilities, as determined independently from a corpus of text (see the sketch after the pseudo-code below).
We calculate n_w as the number of occurrences of the term w in the document,
and p_w as n_w divided by the total number of terms in the document,
and we measure P(t_k | q) as n_w / (sum of n_x over all terms x in q).
By setting an adjustable threshold for the cutoff, we can extract a variable number of keywords.
Pseudo-code in Python syntax for finding the KLD from a single document:
import nltk

tokens = nltk.word_tokenize(document)

# normalize surface forms: stem first, then lemmatize, as in the original sketch
porter = nltk.PorterStemmer()
wnl = nltk.WordNetLemmatizer()
normalized = [wnl.lemmatize(porter.stem(t)) for t in tokens]
lemma = sorted(set(normalized))

# term frequencies counted over the normalized tokens
fdist_document = nltk.FreqDist(normalized)
n = fdist_document.N()

def Ptk_d(term):
    # P(t_k | d) = n_t / n within the document
    return fdist_document[term] / n
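
For the refinement mentioned earlier, the expected probability of a term could come from an independent corpus instead of the document itself. A minimal sketch, assuming the NLTK Brown corpus as the background (any large corpus would do, and note the background counts here are over raw words rather than stems, purely for illustration):

import nltk
from nltk.corpus import brown  # requires nltk.download('brown')

# background distribution q estimated from an independent corpus
background_fdist = nltk.FreqDist(w.lower() for w in brown.words())

def Ptk_q(term, epsilon=1e-9):
    # expected probability of the term in the background corpus,
    # with an epsilon floor for terms never seen there
    count = background_fdist[term.lower()]
    return count / background_fdist.N() if count else epsilon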

The KLD is given as D(p || q) = sum over t from 1 to n of p_t * log(p_t / q_t),
where p_t comes from the distribution consisting of only the one keyword, and is a constant = 1 for that term,
and q_t is the document distribution n_t / n for that term.
Note that we cannot simply drop the aggregation and keep the single component p_t * log(p_t / q_t): the distributions p and q are defined over the same set of terms, and when a term doesn't appear in p or q it is given a small epsilon probability, so those components contribute as well.
We select the terms where KLD(p || q) > 1.

KLD-term = 1 * log(1 / (n_t/n)) if we considered only the one term
KLD-term = 1 * log(1 / (n_t/n)) + Sum(epsilon * log(epsilon / (n_t'/n))) over all other terms t'
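
A minimal sketch of this per-term score and cutoff, reusing Ptk_d and lemma from the pseudo-code above; the epsilon value is an assumption, and the cutoff of 1 is the one suggested earlier:

import math

def kld_term(term, epsilon=1e-9):
    # keyword component: p_t = 1, q_t = n_t / n
    score = math.log(1.0 / Ptk_d(term))
    # epsilon-mass components for every other term t'; negligible for
    # small epsilon, but included to match the formula above
    score += sum(epsilon * math.log(epsilon / Ptk_d(t))
                 for t in lemma if t != term)
    return score

keywords = [t for t in lemma if kld_term(t) > 1]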

Alternatively, we could use Bigi's equation, which normalizes the divergence:

KLD-norm(d, t) = KLD(d, t) / KLD(d, 0)

KLD(d, t) = SUM over terms w of ((P(w, d) - P(w, t)) * log(P(w, d) / P(w, t)))

= (n_t/n - 1) * log(n_t/n) if we considered only the one term
= (n_t/n - 1) * log(n_t/n) + SUM((n_t'/n - epsilon) * log((n_t'/n) / epsilon)) over all other terms t'
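
A sketch of this variant, again reusing Ptk_d and lemma. Treating KLD(d, 0) as the divergence from an "empty" distribution, in which every term sits at the epsilon floor, is my reading of the normalizer, so take it as an assumption:

import math

def bigi_component(p_d, p_t):
    # one term's contribution: (P(w,d) - P(w,t)) * log(P(w,d) / P(w,t))
    return (p_d - p_t) * math.log(p_d / p_t)

def bigi_score(term, epsilon=1e-9):
    # the distribution t puts mass 1 on the candidate keyword, epsilon elsewhere
    kld_d_t = sum(bigi_component(Ptk_d(w), 1.0 if w == term else epsilon)
                  for w in lemma)
    # normalizer KLD(d, 0): every term at the epsilon floor (an assumption)
    kld_d_0 = sum(bigi_component(Ptk_d(w), epsilon) for w in lemma)
    return kld_d_t / kld_d_0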

Note that this considers the occurrence and non-occurrence of terms, but not their density, for which we would want to add weights in the probability computation above.
