Tuesday, October 29, 2013

We discussed processing text documents, and I wanted to mention the GATE framework in this regard. GATE stands for General Architecture for Text Engineering. It is open-source software, widely adopted by organizations, corporations, and universities, and has a community of developers, educators, and scientists. Different text-processing tasks are carried out by CREOLE plugins. The tokenizer splits text into very simple tokens. Grammar rules are more adaptable and can be used for interpreting the text. A gazetteer is used for identifying proper nouns and named entities in the text; annotations attach the type and sub-type from the gazetteer list to the matching spans, and these can then be referenced by grammar rules. For example, to find the day of the week mentioned in a text, the gazetteer list can be used to look up 'Thursday'.
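As a rough illustration of the gazetteer idea, here is a small Python sketch. This is not GATE's actual API, and the entries and type names are made up; it only shows how a lookup list plus type/sub-type annotations would behave.

GAZETTEER = {
    'monday': ('date', 'day'), 'tuesday': ('date', 'day'), 'thursday': ('date', 'day'),
    'january': ('date', 'month'), 'london': ('location', 'city'),
}

def annotate(tokens):
    # return (token, major type, minor type) for every gazetteer hit
    return [(t,) + GAZETTEER[t.lower()] for t in tokens if t.lower() in GAZETTEER]

print(annotate("The meeting moved from Thursday to Monday in London".split()))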
Part-of-speech tagging is also handled by this framework, although we can rely on TreeTagger for the same purpose. The tagging then lets us filter the content words by part of speech.
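A rough sketch of that filtering step, using NLTK's tagger as a stand-in for TreeTagger (this assumes the NLTK tokenizer and tagger models are already downloaded; a TreeTagger wrapper would be driven the same way):

import nltk

def content_words(text):
    # tag each token and keep only nouns, verbs, and adjectives
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [word for word, tag in tagged if tag.startswith(('NN', 'VB', 'JJ'))]

print(content_words("The quick brown fox jumps over the lazy dog"))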
We haven't yet reviewed the application of the Kullback-Leibler divergence to differentiate the keywords from the background text. The suggestion is that the background text can be treated as a separate document or bag of words. We want to differentiate the keyword candidates from the background by comparing the two probability distributions and applying a Kullback-Leibler cutoff; this measure is also called relative entropy. The more dissimilar the two distributions are, the higher the divergence, and the more useful it is for keyword selection. A term is selected only when it contributes significantly to the divergence.
Each word is assigned a probability, and each document is represented as a vector of term probabilities.

The goal is to require the keyword distribution P to be different from the words with little discerning power and from the general background distribution Q. The per-term contribution to the divergence, P(w) * log(P(w) / Q(w)), can then be compared against a cutoff; typically a value greater than 1 is a good threshold.
Something along the lines of the small Python pseudocode below:
import re, math, collections

STOPWORDS = {'the', 'a', 'an', 'of', 'to', 'and', 'in', 'is', 'for', 'on'}

def tokenizeandfilter(text):
    # tokenize, lowercase, and remove stop words (lemmatization is omitted here)
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def ProbabilityDistributions(document):
    # assign each term its relative frequency within the document
    counts = collections.Counter(document)
    total = float(sum(counts.values()))
    return {term: count / total for term, count in counts.items()}

def KLDifferentiation(PD1, PD2, cutoff=1.0):
    # keep the words whose individual contribution to the divergence of
    # document 1 against the background document 2 exceeds the cutoff
    return [w for w, p in PD1.items()
            if w in PD2 and p * math.log(p / PD2[w], 2) > cutoff]
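
A quick usage sketch with two made-up snippets, in which the candidate document mentions 'divergence' far more often than the background does:

doc_pd = ProbabilityDistributions(tokenizeandfilter(
    "divergence divergence divergence of the keyword"))
bg_pd = ProbabilityDistributions(tokenizeandfilter(
    "a keyword appears once in a much longer background text about divergence"))
print(KLDifferentiation(doc_pd, bg_pd))   # ['divergence'] with the default cutoff of 1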

A caveat with Kullback-Leibler divergence is that it doesn't work well when a term's probability is zero in either of the documents, or when words appear in both documents that could otherwise have been left out.

That is why we plan to use it for pre-processing and not for clustering.
