Monday, October 28, 2013

We talked about Kullback-Leibler divergence in the previous post. Let us now look into it more closely.
It is useful for measuring how two probability distributions P and Q differ, where P represents the true distribution and Q represents the computed statistical distribution.
The divergence is computed as an expected value, expressed in terms of information entropy.
The divergence is given as the cross entropy of P and Q minus the entropy of P. That is, D(P||Q) = -Sum(p(x) log(q(x))) + Sum(p(x) log(p(x))) = Sum(p(x) log(p(x)/q(x))). We will defer the application of the Kullback-Leibler divergence for later.
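As a quick illustration of the formula, here is a minimal sketch in Python; the two example distributions are made up for demonstration.

import math

def kl_divergence(p, q):
    # D(P || Q) = Sum(p(x) log(p(x)/q(x))).
    # Terms with p(x) == 0 contribute nothing; the divergence is
    # infinite if q(x) == 0 anywhere that p(x) > 0.
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0:
            if qx == 0:
                return float('inf')
            total += px * math.log(px / qx)
    return total

p = [0.5, 0.3, 0.2]   # true distribution P
q = [0.4, 0.4, 0.2]   # computed distribution Q
print(kl_divergence(p, q))  # small positive value; 0 only when P == Q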
Let us now discuss the next steps in the processing, namely lemmatization and tagging.
Both lemmatization, i.e. finding the lemma, and tagging with parts of speech such as noun, verb, adjective etc. can be done with TreeTagger. It can be downloaded and experimented with separately. Given a sentence of words, it can detect the parts of speech and the lemmas that go with them, producing two lists. Internally, it uses a decision tree classifier to identify the parts of speech. The decision tree is built recursively from a training set of trigrams. At each recursion step, a test is added that divides the trigram set into two subsets with maximum distinctness with respect to the probability distribution of the third tag. The test examines whether one of the two preceding tags is identical to a tag t from a known tagset. In the same recursion step, all possible tests are compared, and the one yielding the most information is attached to the current node. This node is then expanded recursively on each of the two subsets of the training set. The result is a yes-no binary decision tree.
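TreeTagger itself is a separate download, so as a stand-in here is a sketch using NLTK's off-the-shelf tagger and WordNet lemmatizer to produce the same two lists, one of part-of-speech tags and one of lemmas; the sentence is a made-up example.

import nltk
from nltk.stem import WordNetLemmatizer

# One-time downloads of the tokenizer, tagger model and WordNet data.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

def tag_and_lemmatize(sentence):
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)          # list of (word, tag) pairs
    lemmatizer = WordNetLemmatizer()
    tags, lemmas = [], []
    for word, tag in tagged:
        tags.append(tag)
        # WordNet wants a single-letter POS; map the Penn tag prefix.
        pos = {'J': 'a', 'V': 'v', 'R': 'r'}.get(tag[0], 'n')
        lemmas.append(lemmatizer.lemmatize(word.lower(), pos))
    return tags, lemmas

tags, lemmas = tag_and_lemmatize("The cats were chasing mice")
print(tags)    # e.g. ['DT', 'NNS', 'VBD', 'VBG', 'NNS']
print(lemmas)  # e.g. ['the', 'cat', 'be', 'chase', 'mouse']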
Stemmers are included with TreeTagger to help with identifying the lemma. A stemmer, for example, has the ability to detect common modifying suffixes on words. Lemmatization generally reduces the number of unique terms drastically, and this helps rein in the arbitrarily high dimensionality that terms would otherwise introduce.
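To make the suffix handling concrete, here is a sketch with NLTK's Porter stemmer showing several surface forms collapsing into one term; the word list is a made-up example.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["connect", "connected", "connecting", "connection", "connections"]

# All five surface forms reduce to the single stem 'connect',
# shrinking the vocabulary and hence the term dimensionality.
stems = {w: stemmer.stem(w) for w in words}
print(stems)
print(len(set(stems.values())), "unique stem(s) from", len(words), "words")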
Stop lists are another tool, used to filter out frequently occurring words or words that have no discernible information content. The list can be maintained externally to the processing and is quite useful for reducing the count of relevant terms.
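Applying the stop list is then a simple filter; a minimal sketch, where the small stop list shown inline would in practice be loaded from that external file.

# A toy stop list; in practice it would be read from a file, e.g.
# stop_words = set(open('stopwords.txt').read().split())
stop_words = {"the", "a", "an", "of", "and", "is", "to", "in"}

tokens = ["the", "cat", "sat", "in", "the", "hat"]
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['cat', 'sat', 'hat']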
Together, the steps above are prerequisites for clustering based on document frequency.
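As a pointer to where this leads, document frequency is just the number of documents containing each term once lemmatization and stop word removal have been applied; the two toy documents below are made up.

from collections import Counter

# Toy documents, assumed already lemmatized and stop-filtered.
docs = [
    ["cat", "chase", "mouse"],
    ["mouse", "run", "house"],
]

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))
print(df)  # e.g. Counter({'mouse': 2, 'cat': 1, 'chase': 1, ...})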
