Sunday, December 15, 2013

We describe a simple tweak to Matsuo's approach to keyword extraction from a single document: we replace the unconditional probability of a term in the document with a prior Bayesian probability estimated from a corpus. These Bayesian probabilities can be pre-computed for all terms and stored in a table. We include all words from an English dictionary and assign a default value to terms not found in the corpus. We use the nltk.NaiveBayesClassifier to populate this table beforehand. Since the original approach only needs the probabilities of the top 30% most frequent terms, the number of documents whose frequent terms are missing from the corpus should be fairly small.
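As a rough sketch of the table (using a plain frequency count over the Brown corpus and NLTK's dictionary word list rather than the classifier itself; the corpus choice and the default value are placeholders):

```python
# Sketch: pre-compute a prior probability table for terms, with a default
# value for English dictionary words that never appear in the corpus.
from collections import Counter
from nltk.corpus import brown, words   # assumes the NLTK corpora are installed

corpus_counts = Counter(w.lower() for w in brown.words() if w.isalpha())
total = float(sum(corpus_counts.values()))
DEFAULT_PRIOR = 1.0 / total            # placeholder default for unseen words

prior_table = {w: c / total for w, c in corpus_counts.items()}
for w in words.words():                # include all English dictionary entries
    prior_table.setdefault(w.lower(), DEFAULT_PRIOR)

def prior_prob(term):
    """Prior probability of a term; unknown terms get the default value."""
    return prior_table.get(term.lower(), DEFAULT_PRIOR)
```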
Said another way, we don't set the term weights based only on their frequency in the document, but also take into account the overall likelihood of each term appearing in any document. In Matsuo's approach, this substitution does not alter the meaning of the expected probability very much. If anything, it improves the estimate by drawing on a wider source than the single document, as sketched below.
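Concretely, the substitution might look like this sketch of Matsuo-style chi-square weighting (the co-occurrence bookkeeping is simplified; sentences are token lists, frequent_terms is the top-30% set, and prior_prob is the hypothetical lookup from the table above):

```python
from collections import Counter

def chi_square_weights(sentences, frequent_terms, prior_prob):
    """Sketch of Matsuo-style term weights where the expected probability of
    each frequent term comes from the corpus prior instead of the document."""
    cooc = {}           # cooc[w][g]: sentences containing both w and g
    n_w = Counter()     # n_w[w]: total terms in sentences containing w
    for sent in sentences:
        uniq = set(sent)
        for w in uniq:
            n_w[w] += len(sent)
            row = cooc.setdefault(w, Counter())
            for g in uniq & frequent_terms:
                if g != w:
                    row[g] += 1

    weights = {}
    for w, row in cooc.items():
        chi2 = 0.0
        for g in frequent_terms:
            expected = n_w[w] * prior_prob(g)   # prior replaces the in-document p_g
            if expected > 0:
                chi2 += (row[g] - expected) ** 2 / expected
        weights[w] = chi2
    return weights
```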
We also aim to differentiate terms that occur a similar number of times in a document. This is especially useful for short text, where frequency does not carry as much information as it does in a larger text. Further, the original approach derived the term weights from the document itself and avoided consulting a corpus at extraction time. Here too, we avoid the corpus at extraction time by looking up the pre-computed prior probability table, which is useful in cases where we need to discriminate between such terms.
Note that this tweak is tailored to the approach we modify and may not apply in all cases. For example, KLD distance measures how different one relative probability distribution is from another, and in that calculation we want both probabilities to come from the same sample space.
Here we cluster sentences based on the KLD similarity between pairs of sentences, where each sentence is represented by its vector of term probabilities. We compute the probability that a term occurs in a sentence as the ratio of the term's frequency in the sentence to its total frequency across all the sentences. Depending on the scope, we could work with paragraphs instead of sentences, or treat them as short text documents. By clustering we categorize the text and can then select the keywords that represent each category. Both symmetric and pairwise clustering use the co-occurrence of terms; the difference is that symmetric clustering groups phrases, while pairwise clustering groups relevant terms, as in the original approach.
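A small sketch of the pairwise computation, assuming tokenized sentences and a simple epsilon smoothing for terms that appear in only one of the two sentences (the clustering step itself is left out):

```python
import math
from collections import Counter

def sentence_term_probs(sentences):
    """Per-sentence term probabilities: term frequency in the sentence
    divided by the term's total frequency across all sentences."""
    totals = Counter()
    for sent in sentences:
        totals.update(sent)
    return [{t: c / totals[t] for t, c in Counter(sent).items()}
            for sent in sentences]

def symmetric_kld(p, q, eps=1e-9):
    """Symmetrized KL divergence over the union of the two term vectors;
    eps smooths terms missing from one of the sentences."""
    terms = set(p) | set(q)
    d_pq = sum(p.get(t, eps) * math.log(p.get(t, eps) / q.get(t, eps)) for t in terms)
    d_qp = sum(q.get(t, eps) * math.log(q.get(t, eps) / p.get(t, eps)) for t in terms)
    return d_pq + d_qp
```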
KLD equations come in many forms, but all of them operate on the same probability distributions.
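For reference, the standard form and two common symmetrized variants are (the simple symmetric sum is the one assumed in the sketch above):

$$D_{KL}(P \,\|\, Q) = \sum_i P(i)\,\log\frac{P(i)}{Q(i)}$$

Symmetric sum: $D_{KL}(P \,\|\, Q) + D_{KL}(Q \,\|\, P)$; Jensen-Shannon: $\tfrac{1}{2} D_{KL}(P \,\|\, M) + \tfrac{1}{2} D_{KL}(Q \,\|\, M)$ with $M = \tfrac{1}{2}(P + Q)$.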
