Thursday, November 14, 2013

In the previous post, we discussed the algorithm to differentiate probability distributions. With this we have seen how to compare a sample document against document categories.
 Here we will walk through how to implement it. First we take a document collection that is already categorized. We then select a vocabulary V from these documents using lemmatization and stop-word removal. With the terms in the vocabulary we compute the probability distribution by counting the term frequencies in a document and taking the ratio with respect to the total frequency of all terms. We do the same for the categories so that we have vectors for both. Computing these vectors and normalizing them prepares us for computing the distance. The distance is measured using the four cases we discussed and normalized against the distance corresponding to the empty document vector. After computing the KLD distance to each category, we assign the document to the category with the minimum distance.
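To make this concrete, here is a minimal Python sketch of the steps above. The tokenization, the epsilon back-off for unseen terms, and the toy vocabulary are illustrative assumptions rather than the exact scheme; in particular, the single epsilon stands in for the four-case handling and the normalization against the empty document vector.

from collections import Counter
import math

# Assumed smoothing probability for a term missing from a distribution; a
# simplification of the four zero/non-zero cases discussed in the previous post.
EPSILON = 1e-6

def distribution(tokens, vocabulary):
    """Term-frequency distribution of the tokens, restricted to the vocabulary."""
    counts = Counter(t for t in tokens if t in vocabulary)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}

def kld(doc_dist, cat_dist, vocabulary):
    """Kullback-Leibler divergence of the document distribution from the
    category distribution, evaluated term by term over the vocabulary."""
    distance = 0.0
    for term in vocabulary:
        p = doc_dist.get(term, EPSILON)
        q = cat_dist.get(term, EPSILON)
        distance += p * math.log(p / q)
    return distance

def categorize(doc_tokens, category_tokens, vocabulary):
    """Assign the document to the category at minimum KLD distance."""
    doc_dist = distribution(doc_tokens, vocabulary)
    return min(category_tokens,
               key=lambda c: kld(doc_dist, distribution(category_tokens[c], vocabulary), vocabulary))

# Toy usage: two pre-categorized term lists and an uncategorized document.
vocabulary = {"ball", "goal", "market", "stock"}
categories = {"sports": ["ball", "goal", "goal", "ball"],
              "finance": ["market", "stock", "stock", "market"]}
print(categorize(["goal", "ball", "ball"], categories, vocabulary))  # -> sports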
Notice that there are no iterations involved since we are only computing distances. With the distances, we get a measure of how similar or dissimilar the vectors are.
We have said so far that the document is assigned to the category with which it has the minimum distance. This distance is computed from the difference between the document's distribution and the category's distribution. Notice that the sample for the distribution can be either the category or the document, and we evaluate it on a term-by-term basis.
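For reference, the term-by-term computation is the standard Kullback-Leibler divergence, where P is the document's distribution and Q is the category's (or background) distribution over the terms t in the vocabulary V; how zero probabilities are handled is what gives rise to the four cases mentioned above:

D_{KL}(P \| Q) = \sum_{t \in V} P(t) \, \log \frac{P(t)}{Q(t)}

For example, with a two-term vocabulary, P = (0.7, 0.3) and Q = (0.5, 0.5), the distance is 0.7·log(1.4) + 0.3·log(0.6) ≈ 0.08 using natural logarithms.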
We will also discuss a variation where, given a fairly large vocabulary, we select the keywords that most distinguish a document from a background distribution of general terms. One option is to choose subsets of the vocabulary and find the terms that repeatedly contribute the most to the divergence from the background distribution; alternatively, we could start out with an empty document and add only those terms that keep the distance above a threshold.
If we choose the latter approach, we may find it easier to build the subset of keywords we are interested in, although we may need to experiment with different threshold values. This is not a bad option given that the number of words in the subset keeps shrinking as the threshold grows. The advantage of this approach is that the background distribution stays the same, so the distance is directly related to the chosen subset of keywords.
This way we are more likely to reduce the number of selected keywords significantly.
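A minimal sketch of this threshold-based selection, reusing math and the EPSILON back-off from the earlier snippet; the candidate ordering and the threshold value are assumptions for illustration, not part of the original description.

def select_keywords(candidate_terms, doc_dist, background_dist, threshold):
    """Start from an empty selection and add a term only when the selection's
    divergence from the fixed background distribution stays above the threshold."""
    selected = set()
    for term in candidate_terms:
        trial = selected | {term}
        mass = sum(doc_dist.get(t, 0.0) for t in trial)
        if mass == 0.0:
            continue
        # Restrict the document's distribution to the trial subset and renormalize.
        trial_dist = {t: doc_dist.get(t, 0.0) / mass for t in trial}
        distance = sum(p * math.log(p / background_dist.get(t, EPSILON))
                       for t, p in trial_dist.items() if p > 0.0)
        if distance > threshold:
            selected = trial
    return selected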
In the former approach, we might have to try out different subsets; again, since the distribution we compare against remains the same, the words that repeatedly contribute the most could be chosen as candidates. The sample can be kept at a fixed size, with only those terms swapped out that don't contribute significantly to the distance, stopping when we have a selection of the desired size.
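A rough sketch of this fixed-size variant under the same assumptions as above; the per-term contribution measure and the min_contribution cutoff are illustrative choices rather than a prescribed criterion.

def refine_fixed_size_sample(sample, remaining_terms, doc_dist, background_dist, min_contribution):
    """Keep the sample at a fixed size, swapping out terms whose individual
    contribution to the divergence from the background is not significant."""
    def contribution(term):
        p = doc_dist.get(term, EPSILON)
        q = background_dist.get(term, EPSILON)
        return p * math.log(p / q)

    # Strongest remaining candidates are tried first as replacements.
    replacements = iter(sorted(remaining_terms, key=contribution, reverse=True))
    refined = list(sample)
    for i, term in enumerate(refined):
        if contribution(term) < min_contribution:
            replacement = next(replacements, None)
            if replacement is not None:
                refined[i] = replacement
    return refined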
