Saturday, November 16, 2013

KLD and clustering
In a paper by Pinto, Benedí & Rosso on clustering narrow-domain short texts with KLD, there is a good example of such a clustering. Although KLD is usually used to calculate a distance between two probability distributions, here the authors use it to compare how similar or dissimilar two documents are so that the documents can be clustered. They use narrow-domain texts to show that the method works even in this difficult case. The challenge with narrow-domain texts is that term frequencies are very low and term selection has to be done carefully. If we take the vocabulary across all the texts, only a very small percentage of those terms might appear in a given short text, and even those may occur only once or twice. Moreover, a small change in frequency affects the clustering results widely, so this is a good stress test for the clustering. A real-life example of such a problem space is the categorization of abstracts of scientific papers by repositories. Although authors provide keywords, these are not sufficient to classify the text, and some are often misleading, resulting in poor classification. Moreover, the number of categories is not known beforehand.
In their study, the authors use hierarchical clustering methods with the symmetric KLD distance as the divergence measure. There are various ways of expressing KLD.
We have seen the asymmetric version as D(P/Q) = Sum(P(x)log(P(x)/Q(x))).
Then we have seen the symmetric one as D(P/Q) = Sum((P(x)-Q(x))log(P(x)/Q(x))).
Then there is D(P/Q) = 1/2 [ D(P / (P+Q)/2) + D(Q / (P+Q)/2) ],
and finally D(P/Q) = max[ D(P/Q), D(Q/P) ].
Since the texts vary in terms, the frequency of many terms is zero. To avoid zero probabilities, a back-off scheme is used similar to what we discussed earlier. The document probability of a term is a scaling constant times the probability of term tk in document dj if tk occurs in dj, or a small constant epsilon otherwise; i.e., all the words from the vocabulary that don't appear in the document are treated the same as unknown words and given a default value. The probability of a term tk in dj is calculated as P(tk/dj) = tf(tk, dj) / Sum(tx belongs to dj)[tf(tx, dj)]. A document dj is represented by a term vector of probabilities, and the distance between two documents is the KLD measure. Finally, all the term vectors are normalized so that each document vector sums to unit length; the scaling factor for the terms occurring in the document is therefore 1 - sum(all epsilons for terms not occurring in the document). Note that the convention has been to use a predetermined vocabulary; we don't invert the problem by building the vocabulary from the documents as we process them. In this study, most of the corpora have several thousand terms, the vocabulary consists of a few thousand terms, and the document texts consist of about a hundred terms.
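A minimal sketch of this back-off scheme and the symmetric KLD distance (the function names and the epsilon value are illustrative, not from the paper):

```python
from collections import Counter
from math import log

EPSILON = 1e-6  # default probability mass for vocabulary terms absent from the document

def term_probabilities(tokens, vocabulary):
    # P(tk/dj) = tf(tk, dj) / total tf in dj, backed off to EPSILON for absent terms;
    # present terms are scaled by beta = 1 - sum of epsilons so the vector sums to 1
    counts = Counter(t for t in tokens if t in vocabulary)
    total = sum(counts.values()) or 1
    absent = sum(1 for t in vocabulary if t not in counts)
    beta = 1.0 - EPSILON * absent
    return {t: beta * counts[t] / total if t in counts else EPSILON
            for t in vocabulary}

def symmetric_kld(p, q, vocabulary):
    # symmetric KLD: Sum((P(x) - Q(x)) * log(P(x)/Q(x)))
    return sum((p[t] - q[t]) * log(p[t] / q[t]) for t in vocabulary)
```

Because of the back-off, no vocabulary term has zero probability, so the log ratio is always defined and the distance stays finite.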
 

Friday, November 15, 2013

Today let us just focus on an implementation topic for text categorizing software. How do we design a class for a clusterer? Let us take a specific example; we can generalize it later. We want a K-means clusterer that classifies text documents into categories/clusters. Text documents are represented by term vectors. Distance between documents and categories is measured by a distance function or metric. Documents can vary in number and can be quite large. The number of clusters is fixed. Clusters have centers, and the distance to the nearest center can be found for any document. The clusterer has a method called Classify. It reads the input documents, computes the distances, assigns the clusters, adjusts the cluster centers and repeats until the cluster centers stabilize. The mechanism of the Classify method is known. The class needs to be aware of the following:
cluster centers
distance function
document set
With the classify method, there will be an output where each document has been assigned a category label or cluster.
How do we test the classifier? We give it different sets of inputs and check the labels. What can we vary for this classifier? Surely we can vary the sample space: we can include a small or a large number of documents, check for the proper assignment of each document, vary the number of clusters, and check the output for different distance functions. We may have to manually classify the samples before we run the clusterer. But how do we check the correctness of the program? We want an invariant, an initial condition and a final state. How do we check the progress, especially when the Classify method takes a long time? We could add some supportability.
Next we could consider some design factors for flexibility.
We could decouple the distance-calculating function from the clusterer and use design patterns to switch different distance functions into the class, for example a strategy pattern.
We could also parameterize the input and the output so that we don't need to know what the vectors are that we cluster. We could create a base class for the clusterer and derive others from it.
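A sketch of what such a decoupled design could look like, with the distance function injected as a strategy (class and function names are illustrative):

```python
import random

class KMeansClusterer:
    """K-means over term vectors; the distance function is a pluggable strategy."""
    def __init__(self, k, distance, max_iterations=100):
        self.k = k
        self.distance = distance  # any callable taking two vectors
        self.max_iterations = max_iterations

    def classify(self, vectors):
        centers = random.sample(vectors, self.k)
        labels = [0] * len(vectors)
        for _ in range(self.max_iterations):
            # assign each vector to its nearest center
            labels = [min(range(self.k), key=lambda c: self.distance(v, centers[c]))
                      for v in vectors]
            # recompute each center as the mean of its members
            new_centers = []
            for c in range(self.k):
                members = [v for v, l in zip(vectors, labels) if l == c]
                new_centers.append([sum(dims) / len(members) for dims in zip(*members)]
                                   if members else centers[c])
            if new_centers == centers:  # centers stabilized
                break
            centers = new_centers
        return labels

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
```

Swapping in a different distance, such as a KLD-based one, only requires passing a different callable; the clusterer itself is unchanged.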
pseudo-code in python syntax for finding KLD:

import nltk
from nltk.corpus import brown
from math import log

tokens = nltk.word_tokenize(document)

words = sorted(set(tokens))

porter = nltk.PorterStemmer()
words = [porter.stem(t) for t in words]

wnl = nltk.WordNetLemmatizer()
lemma = [wnl.lemmatize(t) for t in words]
vocabulary = lemma[1:200]

# categories
news = brown.words(categories = 'news')
editorial = brown.words(categories = 'editorial')
reviews = brown.words(categories = 'reviews')
hobbies = brown.words(categories = 'hobbies')

fdist_news = nltk.FreqDist(news)
fdist_editorial = nltk.FreqDist(editorial)
fdist_reviews = nltk.FreqDist(reviews)
fdist_hobbies = nltk.FreqDist(hobbies)
fdist_document = nltk.FreqDist(tokens)

# P(tk/sample): frequency of the term over the total frequency in the sample
def Ptk_d(term): return fdist_document[term] / fdist_document.N()
def Ptk_news(term): return fdist_news[term] / fdist_news.N()
def Ptk_editorial(term): return fdist_editorial[term] / fdist_editorial.N()
def Ptk_reviews(term): return fdist_reviews[term] / fdist_reviews.N()
def Ptk_hobbies(term): return fdist_hobbies[term] / fdist_hobbies.N()

# symmetric KLD; assumes every vocabulary term has nonzero frequency in both
# distributions, otherwise apply the epsilon back-off discussed earlier
KLD_editorial = sum([(Ptk_editorial(term) - Ptk_d(term)) * log(Ptk_editorial(term) / Ptk_d(term))
                     for term in vocabulary])


[in progress]

 

Thursday, November 14, 2013

In the previous post, we discussed the algorithm to differentiate probability distributions. With this we have seen how to differentiate a sample document against document categories.
Here we will walk through how to implement it. First we take a document collection that is already categorized. We then select a vocabulary V from these documents using lemmatization and stop-word removal. With the terms in the vocabulary we compute the probability distribution by counting the term frequencies in a document and taking the ratio with respect to the total frequency. We can do the same for the categories so that we have vectors for both. Computing the vectors and normalizing them prepares us for computing the distance. This is measured using the four cases we discussed, normalized by the distance corresponding to the empty document vector. After computing the KLD distance, we assign the document to the category with the minimum distance.
Notice that there are no iterations involved since we are finding the distances only. With the distances, we get a measure of how similar or dissimilar the vectors are.
We have said so far that the document can be assigned the category with which it has the minimum distance. This distance is computed based on the difference between the document's distribution and the category's distribution. Notice that the sample for the distribution can be the category or the document, and we evaluate it on a term-by-term basis.
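A minimal sketch of this minimum-distance assignment (names are illustrative; the distributions are assumed to be smoothed so no probability is zero):

```python
from math import log

def kld_symmetric(p, q):
    # symmetric KLD between two smoothed term distributions over the same vocabulary
    return sum((p[t] - q[t]) * log(p[t] / q[t]) for t in p)

def categorize(doc_dist, category_dists, distance=kld_symmetric):
    # assign the document to the category whose distribution is at minimum distance
    return min(category_dists, key=lambda name: distance(doc_dist, category_dists[name]))
```

Since no iteration is involved, the cost is one distance computation per category.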
We will discuss a variation where, given a fairly large vocabulary, we select the keywords that most distinguish the document from a background that consists of general terms. Perhaps we can choose a subset of the vocabulary and find the terms that repeatedly contribute most to the distance from the background distribution, or we could start out with an empty document and add the terms that keep the distance above a threshold.
If we choose the latter approach, we may find it easier to build the subset of keywords we are interested in. At the same time, we may need to experiment with different threshold values. This is not a bad option, given that the number of words in the subset keeps shrinking with higher and higher thresholds. The advantage of this approach is that we keep the background distribution the same; consequently the distance is directly related to the subset of keywords.
This way we have a higher likelihood of reducing the number of selections significantly.
In the former approach, we might have to try out different subsets, and again, since the distribution to compare against remains the same, the words that repeatedly contribute most could be chosen as candidates. The sample can be of fixed size, with only those terms swapped out that don't contribute significantly to the distance, stopping when we have a selection of the desired size.
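The threshold-based variant could be sketched as follows, scoring each candidate term by its per-term contribution to the symmetric KL distance from the background (names and values are illustrative; distributions are assumed smoothed to avoid zeros):

```python
from math import log

def select_keywords(candidates, p_doc, p_background, threshold):
    # keep the terms whose per-term contribution to the symmetric KL distance
    # from the background distribution exceeds the threshold
    selected = []
    for term in candidates:
        contribution = (p_doc[term] - p_background[term]) * log(p_doc[term] / p_background[term])
        if contribution > threshold:
            selected.append(term)
    return selected
```

Raising the threshold shrinks the selection monotonically, which is what makes experimenting with different threshold values straightforward.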

Wednesday, November 13, 2013

In the paper titled "Kullback-Leibler distance for text categorization", Brigitte Bigi introduces the tf-idf classifier and the KLD classifier for text categorization. Documents are represented by their term vectors, where each term is a content term and each document is a vector of weights over the selected content terms. Each term is assumed to occur at least once in at least one document. In the tf-idf case, each weight is calculated as the combination of term frequency and inverse document frequency. If the idf of a term is low, the term appears in many documents and is consequently not a good discriminating content word. The weight is zero when the term is not assigned to document dj; for the assigned terms, it equals the count of the term in that document times the idf, calculated as the log of the total number of documents in the training set divided by the document frequency of the term. The classifier for a tf-idf scheme is based on the cosine similarity, where we sum the component-wise products over their magnitudes.
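A minimal sketch of this tf-idf weighting and cosine classifier (names are illustrative):

```python
from math import log, sqrt

def tfidf_vector(term_counts, doc_freq, total_docs, vocabulary):
    # w(tk, dj) = tf(tk, dj) * log(N / df(tk)); zero when the term is absent from the document
    return [term_counts.get(t, 0) * log(total_docs / doc_freq[t]) for t in vocabulary]

def cosine_similarity(a, b):
    # sum of component-wise products over the product of the magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Note how a term that appears in every document gets idf = log(1) = 0 and so contributes nothing to the similarity, which is exactly the "not a good discriminating content word" behavior described above.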
The KLD classifier works similarly but measures how different two probability distributions are. The KL divergence of probability distributions P, Q on a finite set X measures how well Q approximates P and is expressed as the aggregated P(x)log(P(x)/Q(x)). The author mentions that a better distance metric is the symmetric version, the aggregated (P(x) - Q(x))log(P(x)/Q(x)). With this distance measure, we can choose the categories that contribute most to this distance. Note that the document vectors may have a lot of zeros for the terms that don't appear in the document. For these, there is a backoff scheme mentioned by the author. In general, C is the number of categories and V is the number of terms from all the categories. For each category, a distribution based only on the selected terms is computed from the corpus. The term distribution of a sample document is compared with the distributions of these categories. In the backoff, term frequencies are discounted and all those terms that don't occur in the document are given a default value, treating them as if they were unknown words. This is done so that the distances don't come out infinite.
A document is represented by a term vector of probabilities dj while a category is represented by a term vector of probabilities ci. The probability of a term tk in a document dj is computed as P(tk/dj), the ratio of the term's frequency in that document to the total term frequency in the document when the term appears in the document, and the default value otherwise. Similarly, the probability of a term tk in a category ci is the ratio of the term's frequency in the category to the total term frequency over the category. The probability distributions for the documents and the categories are normalized to unit length, i.e. the default values and the scaled ratios add up to 1. The categorization method based on the KL distance computes the distance as the aggregated sum of the difference between the probability of a term in a category versus a document, taken together with the log of their ratio.
Notice that this has four cases: tk could appear in both document dj and category ci, in document dj but not category ci, in category ci but not document dj, or in neither. In the last case, the term degenerates to an unknown word. Finally, we normalize the distance based on the distance computed for the document dj and the distance computed for an empty document.
So now, to extract keywords, we add to an empty document the terms that maximize this distance. It is also clear that we need the categories and documents for this keyword extraction; we do not want to rely on the test document or its summary alone.

Tuesday, November 12, 2013

While the FuzzySKWIC works with a selected finite vocabulary for the documents in consideration, Wartena's paper treats documents as bags of words and attempts to find the topic by clustering those words. Here too we have a term collection T, each occurrence of which can be found in exactly one source document in a collection C. Wartena's approach includes lemmatization, which reduces all inflected forms of verbs, nouns and adjectives to their lexical form, substantially reducing the variation in the input data. Tagging is also used to distinguish content words from function words. This pre-processing let them select around 160 of the most frequent keywords for their term collection.
In this approach, they consider the probability distributions that measure the probability of selecting an occurrence of a term t, the probability of selecting a document from the collection, and the probability of selecting a term and a document from C x T. They have two conditional probabilities: one for the probability that a randomly selected occurrence of term t has source d, also called the source distribution of t, and the other for the probability that a randomly selected term occurrence from document d is an instance of term t, also called the term distribution of d. These help them define the distribution of co-occurring terms, with which they perform clustering to determine the topic.
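These two conditional distributions can be sketched over a toy corpus as follows (names are illustrative; the source distribution assumes the term occurs somewhere in the corpus):

```python
from collections import Counter

def source_distribution(term, docs):
    # P(d|t): probability that a randomly selected occurrence of `term` has source d
    counts = {name: tokens.count(term) for name, tokens in docs.items()}
    total = sum(counts.values())  # assumes the term occurs somewhere in the corpus
    return {name: c / total for name, c in counts.items()}

def term_distribution(doc_tokens):
    # P(t|d): probability that a randomly selected occurrence from d is term t
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}
```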
The concern here is that the vocabulary set may not be sufficient to find the topic in a test document or to extract keywords. To evaluate the available words, a different scheme may be needed; word similarities and clustering could be used.
In order to pick keywords out of the document, we must identify centers for the clusters within the document itself. Do we have a limit on the number of words we can cluster? No, not as long as the terms are considered by their weights or vectors. Typically the documents have vectors and not the terms themselves, because the terms only have weights. The document vectors suffer from high dimensionality due to the number of terms in the vocabulary set; the high dimensions are a problem for the documents but not for the keywords themselves. The weights associated with the keywords can be based on tf-idf or the Jensen-Shannon divergence. We find the most frequent content words. We can differentiate the keywords from the background based on the Kullback-Leibler divergence and a cutoff. However, documents and keywords are not treated independently. Often topics have to be categorized and keywords have to be extracted together, because independently they don't give correct information. Take the example of keyword extraction alone. If we were looking merely for the content words and we worked only with the test document, reduced to a bag of words, we would be selecting words out of this document without any knowledge of the distribution of the words in any other documents. That may or may not be satisfactory, since the document corpus gives valuable information about keywords that generally appear together in a category. Perhaps, then, the documents from the corpus could be summarized in a term-attributes lookup table that we pre-populate based on the studied corpus. Then, given the words we encounter in the document, we find the matches and, using the corresponding attributes, extract the keywords. WordNet was a start here, but we are talking about the categories the words appear in as attributes, among others. In addition, we are considering performing Kullback-Leibler divergence with a cutoff to extract keywords.
The suggestion is that we use existing knowledge as well as the differentiation in the given document, focusing on keyword extraction.

Monday, November 11, 2013

// PreOrder: A B C D E F
// InOrder:  C B D A E F

// Build the tree from its preorder and inorder traversals.
public static Node GetTree(char[] preOrder, char[] inOrder)
{
    int preIndex = 0;
    return Build(preOrder, inOrder, ref preIndex, 0, inOrder.Length - 1);
}

private static Node Build(char[] preOrder, char[] inOrder, ref int preIndex, int inStart, int inEnd)
{
    if (inStart > inEnd)
        return null;

    // The next preorder element is the root of the current inorder range.
    Node root = new Node();
    root.data = preOrder[preIndex];
    preIndex++;

    // Elements before the root in the inorder range form its left subtree,
    // elements after it form its right subtree.
    int inIndex = Array.IndexOf(inOrder, root.data, inStart, inEnd - inStart + 1);

    root.left = Build(preOrder, inOrder, ref preIndex, inStart, inIndex - 1);
    root.right = Build(preOrder, inOrder, ref preIndex, inIndex + 1, inEnd);

    return root;
}