Wednesday, November 13, 2013

In the paper titled Kullback-Leibler distance for text categorization, Brigitte Bigi describes a tf-idf classifier and a KLD classifier for text categorization. Documents are represented by term vectors, where each term is a content term and each document is represented by the weights of the content terms selected. Each term is assumed to occur in at least one document. In the tf-idf case, each weight combines the term frequency with the inverse document frequency. If the idf of a term is low, the term appears in many documents and is consequently not a good discriminating content word. The weight is zero when the term does not occur in document dj; for the terms that do occur, it is the count of the term in that document multiplied by the idf, where the idf is the log of the total number of training documents divided by the document frequency of the term. The classifier for the tf-idf scheme is based on cosine similarity: the sum of the component-wise products of the two vectors divided by the product of their magnitudes.
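As a rough sketch of that weighting and classifier (not the paper's code), something like the following would do; the class, method and parameter names here are made up for illustration.

using System;
using System.Collections.Generic;
using System.Linq;

public static class TfIdfSketch
{
    // weight(tk, dj) = tf(tk, dj) * log(N / df(tk)), zero when tk is absent from dj
    public static Dictionary<string, double> Weights(
        Dictionary<string, int> termCounts,          // tf for one document
        Dictionary<string, int> documentFrequency,   // df over the training set
        int totalDocuments)                          // N
    {
        var weights = new Dictionary<string, double>();
        foreach (var kv in termCounts)
            if (documentFrequency.TryGetValue(kv.Key, out int df) && df > 0)
                weights[kv.Key] = kv.Value * Math.Log((double)totalDocuments / df);
        return weights;
    }

    // Cosine similarity: sum of component-wise products over the product of magnitudes.
    public static double Cosine(Dictionary<string, double> a, Dictionary<string, double> b)
    {
        double dot = a.Keys.Intersect(b.Keys).Sum(t => a[t] * b[t]);
        double magA = Math.Sqrt(a.Values.Sum(v => v * v));
        double magB = Math.Sqrt(b.Values.Sum(v => v * v));
        return (magA == 0 || magB == 0) ? 0.0 : dot / (magA * magB);
    }
}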
The KLD classifier works similarly but measures how different two probability distributions are. The KL divergence of probability distributions P, Q over a finite set X is the expected log-ratio of P to Q, expressed as the sum over x of P(x)log(P(x)/Q(x)). The author mentions that a better distance measure is the symmetric form, the sum over x of (P(x) - Q(x))log(P(x)/Q(x)). With this distance measure, we can choose the categories that contribute most to the distance. Note that the document vectors may have many zeros for the terms that do not appear in the document; for these, the author describes a back-off scheme. In general, C is the number of categories and V is the vocabulary of terms drawn from all the categories. For each category, a distribution over the selected terms is computed from the corpus, and the term distribution of a sample document is compared with the distributions of these categories. With the back-off, term frequencies are discounted and the terms that do not occur in the document are given a default value, treating them as if they were unknown words. This is done so that the distances do not come out infinite.
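A small sketch of that symmetric divergence with a default value standing in for the back-off of unseen terms; the method name SymmetricKL and the epsilon default are mine, not the paper's, and the usual using directives (System, System.Collections.Generic) are assumed.

// Symmetrized divergence: sum over x of (P(x) - Q(x)) * log(P(x) / Q(x)),
// with a small default probability for unseen terms so that no log of zero
// (and hence no infinite distance) occurs.
public static double SymmetricKL(
    IDictionary<string, double> p,
    IDictionary<string, double> q,
    IEnumerable<string> vocabulary,
    double epsilon = 1e-6)   // default value for terms missing from p or q
{
    double distance = 0.0;
    foreach (var term in vocabulary)
    {
        double px = p.TryGetValue(term, out double pv) && pv > 0 ? pv : epsilon;
        double qx = q.TryGetValue(term, out double qv) && qv > 0 ? qv : epsilon;
        distance += (px - qx) * Math.Log(px / qx);
    }
    return distance;
}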
A document is represented by a term vector of probabilities dj while a category is represented by a term vector of probabilities ci. The probability of a term tk in a document dj, P(tk|dj), is the frequency of the term in that document divided by the total frequency of all terms in that document when the term appears in the document, and a default value otherwise. Similarly, the probability of a term tk in a category ci is the frequency of the term in that category divided by the total frequency of all terms in that category. The distributions for a document and for a category are normalized so that the default values and the computed ratios add up to 1. The categorization method based on the KL distance then computes the distance as the sum over terms of the difference between the term's probability in the category and its probability in the document, multiplied by the log of their ratio.
Notice that this has four cases: tk appears in both document dj and category ci; tk appears in document dj but not in category ci; tk appears in category ci but not in document dj; and tk appears in neither. In the last case, the term degenerates to an unknown word. Finally, we normalize by dividing the distance computed for the document dj by the distance computed for an empty document.
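Putting the last two paragraphs together, a simplified sketch of the categorization step could look like this. It reuses the SymmetricKL helper above, omits the paper's discounting coefficients, and the Categorize name, the counts layout and the epsilon default are all illustrative assumptions; the back-off default covers whichever of the four cases involves a missing term.

public static string Categorize(
    Dictionary<string, int> documentCounts,
    Dictionary<string, Dictionary<string, int>> categoryCounts,  // per-category term counts
    double epsilon = 1e-6)
{
    var vocabulary = categoryCounts.Values.SelectMany(c => c.Keys).Distinct().ToList();

    // P(tk|dj): term frequency in the document over the total count of terms in the document.
    double docTotal = documentCounts.Values.Sum();
    var pDoc = documentCounts.ToDictionary(kv => kv.Key, kv => kv.Value / docTotal);

    // The empty document has no known terms, so every term takes the default value.
    var pEmpty = new Dictionary<string, double>();

    string best = null;
    double bestScore = double.MaxValue;
    foreach (var category in categoryCounts)
    {
        // P(tk|ci): term frequency in the category over the total count of terms in the category.
        double catTotal = category.Value.Values.Sum();
        var pCat = category.Value.ToDictionary(kv => kv.Key, kv => kv.Value / catTotal);

        // Normalize the document-category distance by the empty-document distance.
        double score = SymmetricKL(pCat, pDoc, vocabulary, epsilon)
                     / SymmetricKL(pCat, pEmpty, vocabulary, epsilon);
        if (score < bestScore) { bestScore = score; best = category.Key; }
    }
    return best;
}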
So to extract keywords, we can add to the empty document the terms that maximize this distance. It is also clear that we need the categories and the documents for this keyword extraction; we do not want to rely on the test document or its summary alone.
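A rough sketch of that greedy idea, purely my reading of the paragraph above rather than anything from the paper: start from an empty document, and at each step add the candidate term that pushes the distance to the category distribution up the most. It reuses the SymmetricKL helper, weights the picked terms uniformly, and assumes the candidates come from the category vocabulary.

public static List<string> ExtractKeywords(
    Dictionary<string, double> categoryDistribution,
    List<string> candidateTerms,
    int keywordCount,
    double epsilon = 1e-6)
{
    var keywords = new List<string>();
    for (int i = 0; i < keywordCount; i++)
    {
        string bestTerm = null;
        double bestDistance = double.MinValue;
        foreach (var term in candidateTerms.Except(keywords))
        {
            // Trial document: the keywords picked so far plus this candidate, weighted uniformly.
            var trialTerms = keywords.Concat(new[] { term }).ToList();
            var trial = trialTerms.ToDictionary(t => t, t => 1.0 / trialTerms.Count);
            double d = SymmetricKL(categoryDistribution, trial, categoryDistribution.Keys, epsilon);
            if (d > bestDistance) { bestDistance = d; bestTerm = term; }
        }
        if (bestTerm == null) break;
        keywords.Add(bestTerm);
    }
    return keywords;
}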

Tuesday, November 12, 2013

While FuzzySKWIC works with a selected, finite vocabulary for the documents in consideration, Wartena's paper focuses on treating documents as a bag of words and attempting to find the topics by clustering those words. Here too we have a term collection T, each occurrence of which can be found in exactly one source document in a collection C. Wartena's approach includes lemmatization, which reduces all inflected forms of verbs, nouns and adjectives to their lexical form, substantially reducing the variation in the input data. Tagging is also used to distinguish content words from function words. This pre-processing let them select around 160 of the most frequent keywords for their term collection.
In this approach, they consider probability distributions that measure the probability of selecting an occurrence of a term t, the probability of selecting a document from the collection, and the probability of selecting a term and a document from C x T. They define two conditional distributions: the probability that a randomly selected occurrence of term t has source d, called the source distribution of t, and the probability that a randomly selected term occurrence from document d is an instance of term t, called the term distribution of d. These help them define the distribution of co-occurring terms, with which they perform clustering to determine the topics.
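To make those distributions concrete, here is a sketch working from a term-by-document count table: the source distribution P(d|t), the term distributions P(t'|d), and the co-occurrence distribution obtained, roughly, as the sum over d of P(t'|d) * P(d|t). The method name, the counts layout and the exact mixing are my assumptions about the paper's definitions, and the same using directives as the earlier sketches are assumed.

public static Dictionary<string, double> CoOccurringTerms(
    Dictionary<string, Dictionary<string, int>> counts,   // counts[term][document] = n(t, d)
    string term)
{
    // Source distribution of `term`: probability that an occurrence of it comes from document d.
    double termTotal = counts[term].Values.Sum();
    var sourceDist = counts[term].ToDictionary(kv => kv.Key, kv => kv.Value / termTotal);

    // Total number of term occurrences in each document, needed for the term distributions P(t'|d).
    var docTotals = new Dictionary<string, double>();
    foreach (var t in counts)
        foreach (var d in t.Value)
            docTotals[d.Key] = docTotals.TryGetValue(d.Key, out double s) ? s + d.Value : d.Value;

    // Mix the term distributions of the documents, weighted by the source distribution.
    var coOccurrence = new Dictionary<string, double>();
    foreach (var t in counts)
        foreach (var d in t.Value)
            if (sourceDist.TryGetValue(d.Key, out double pd))
            {
                double pTermGivenDoc = d.Value / docTotals[d.Key];
                coOccurrence[t.Key] = coOccurrence.TryGetValue(t.Key, out double c)
                    ? c + pTermGivenDoc * pd
                    : pTermGivenDoc * pd;
            }
    return coOccurrence;
}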
The concern here is that the vocabulary set may not be sufficient to find the topic in a test document or to extract keywords. In order to evaluate the available words, a different scheme may be needed; word similarities and clustering could be used.
In order to pick keywords out of a document, we must identify centers for the clusters within the document itself. Do we have a limit on the number of words we can cluster? No, not as long as the terms are represented by weights or vectors. Typically the documents have vectors, not the terms; the terms only have weights. The document vectors suffer from high dimensionality due to the number of terms in the vocabulary, which is a problem for the documents but not for the keywords themselves. The weights associated with the keywords can be based on tf-idf or on the Jensen-Shannon divergence. We find the most frequent content words, and we can differentiate the keywords from the background based on the Kullback-Leibler divergence and a cutoff. However, documents and keywords are not treated independently: topics often have to be categorized and keywords extracted together, because independently they do not give correct information.
Take the example of keyword extraction alone. If we were looking merely for content words and worked only with the test document, reduced to a bag of words, we would be selecting words out of this document without any knowledge of the distribution of the words in other documents. That may or may not be satisfactory, since the document corpus gives valuable information about keywords that generally appear together in a category. Perhaps, then, the documents from the corpus could be summarized in a term-attributes lookup table that we pre-populate from the studied corpus. Then, given the words we encounter in the document, we find the matches and use the corresponding attributes to extract the keywords. This is where WordNet was a start, but we are talking about the categories the words appear in as attributes, among others. In addition, we are considering a Kullback-Leibler divergence with a cut-off to extract keywords, as sketched below. The suggestion is to use the existing knowledge as well as the differentiation within the given document, focusing on keyword extraction.
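One way to realize that cut-off idea, as a sketch rather than a method from either paper: score each term of the document by its point-wise contribution p(t) * log(p(t) / q(t)) against a background distribution q over the corpus and keep the terms whose score clears a threshold. The method name, the choice of background distribution and the epsilon default are assumptions.

public static List<string> KeywordsByKLCutoff(
    Dictionary<string, double> documentDist,     // p: term distribution of the document
    Dictionary<string, double> backgroundDist,   // q: term distribution of the corpus
    double cutoff,
    double epsilon = 1e-6)
{
    var keywords = new List<string>();
    foreach (var kv in documentDist)
    {
        double q = backgroundDist.TryGetValue(kv.Key, out double qv) && qv > 0 ? qv : epsilon;
        double contribution = kv.Value * Math.Log(kv.Value / q);
        if (contribution > cutoff)
            keywords.Add(kv.Key);
    }
    return keywords;
}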

Monday, November 11, 2013

// PreOrder: A B C D E F
// InOrder:  C B D A E F
// Rebuild the tree from its preorder and inorder traversals.
// Assumes all node values are distinct.

public class Node
{
    public char data;
    public Node left;
    public Node right;
}

public static Node GetTree(char[] PreOrder, char[] InOrder, int index = 0)
{
    Node root = new Node();
    root.data = PreOrder[index];
    int inIndex = Array.IndexOf(InOrder, PreOrder[index]);

    // The element that follows the root in preorder is its left child
    // exactly when it precedes the root in inorder.
    if (index + 1 < PreOrder.Length && IsLeftSubtree(PreOrder[index + 1], InOrder, inIndex))
        root.left = GetTree(PreOrder, InOrder, index + 1);
    else
        root.left = null;

    // The root has a right subtree exactly when its inorder successor is not an
    // ancestor, i.e. does not precede it in preorder. The right child is then the
    // first element after the root in preorder that is not in the left subtree.
    if (inIndex + 1 < InOrder.Length && IsPredecessor(InOrder[inIndex + 1], PreOrder, index) == false)
    {
        int rightIndex = index + 1;
        while (rightIndex < PreOrder.Length && IsLeftSubtree(PreOrder[rightIndex], InOrder, inIndex))
            rightIndex++;
        root.right = GetTree(PreOrder, InOrder, rightIndex);
    }
    else
    {
        root.right = null;
    }

    return root;
}

private static bool IsPredecessor(char c, char[] PreOrder, int index)
{
    // true when c occurs before position index in the preorder sequence
    return Array.IndexOf(PreOrder, c) < index;
}

private static bool IsLeftSubtree(char c, char[] InOrder, int index)
{
    // true when c occurs before position index in the inorder sequence
    return Array.IndexOf(InOrder, c) < index;
}
We will pick up the implementation discussion on FuzzySKWIC from here: http://ravinote.blogspot.com/2013/11/today-im-going-to-describe-few-steps.html
and we will try to wrap up the implementation. We only need to make sure the cosine-based distances are calculated correctly; everything else follows in the iteration as discussed. The cosine-based distance is computed along components. Each document is represented by a vector of its document frequencies, normalized to unit length. The cosine-based distance along the kth component is 1/n - xjk.cik, where xjk is the kth component of the document frequency vector xj and cik is the kth component of the ith cluster center vector. xjk can be read as the component this document contributes to the total document frequency of that term, so xjk does not change across iterations. Note that both values are less than one; we are referring to cell values in the document vectors, and the cluster center is also a document vector. We start with an N x n matrix, and for each document we maintain feature weights as well as cosine distances to the cluster centers. When a cluster center changes, the per-component cosine distances change and the aggregated cosine distance changes with them; if the centers have not changed, we can skip some of the calculations. We will eventually want a fuzzy membership matrix of size C x N, but for now we will try it without one. The initial assignment of the document frequency vectors populates the N x n matrix purely from the idf, and the feature weights are initialized.
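As a sketch of just that distance calculation (not the FuzzySKWIC paper's code), the per-component distance 1/n - xjk.cik can be aggregated with the cluster's feature weights as follows; the method and parameter names are illustrative.

public static double WeightedAggregateDistance(
    double[] document,        // x_j, the document frequency vector, normalized to unit length
    double[] center,          // c_i, the ith cluster center vector
    double[] featureWeights)  // v_ik, the feature relevance weights for cluster i
{
    int n = document.Length;
    double total = 0.0;
    for (int k = 0; k < n; k++)
    {
        // Cosine-based distance along the kth component.
        double componentDistance = 1.0 / n - document[k] * center[k];
        total += featureWeights[k] * componentDistance;
    }
    return total;
}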
For test data we use the Brown corpus categories for news, government, adventure and humor. We will pick ten documents from each category.