Tuesday, November 12, 2013

While FuzzySKWIC works with a selected finite vocabulary for the documents under consideration, Wartena's paper treats documents as bags of words and attempts to find topics by clustering those words. Here too we have a term collection T, each occurrence of which can be found in exactly one source document in a collection C. Wartena's approach includes lemmatization, which reduces all inflected forms of verbs, nouns, and adjectives to their lexical form, substantially reducing the variation in the input data. Part-of-speech tagging is also used to distinguish content words from function words. This pre-processing let them select around 160 of the most frequent keywords for their term collection.
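That pre-processing pipeline can be sketched roughly as follows. This is a minimal illustration in Python, not Wartena's actual code: the tiny lemma table and function-word list are stand-ins for a real lemmatizer and tagger, and 160 is the vocabulary cutoff mentioned above.

```python
from collections import Counter

# Toy stand-ins for a real lemmatizer and a POS-based function-word filter.
LEMMAS = {"ran": "run", "running": "run", "clusters": "cluster", "documents": "document"}
FUNCTION_WORDS = {"the", "a", "of", "and", "to", "in", "is"}

def content_terms(tokens, vocab_size=160):
    """Lemmatize tokens, drop function words, keep the most frequent lemmas."""
    lemmas = [LEMMAS.get(t.lower(), t.lower()) for t in tokens]
    counts = Counter(t for t in lemmas if t not in FUNCTION_WORDS)
    return [term for term, _ in counts.most_common(vocab_size)]

tokens = "the documents running in clusters and the document ran".split()
print(content_terms(tokens, vocab_size=3))  # -> ['document', 'run', 'cluster']
```

A production pipeline would substitute an NLP library's lemmatizer and tagger for the lookup tables, but the shape of the computation is the same: normalize, filter, rank by frequency.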
In this approach, they consider probability distributions that measure the probability of selecting an occurrence of a term t, the probability of selecting a document from the collection, and the probability of selecting a pair from C x T. They define two conditional probabilities: the probability that a randomly selected occurrence of term t has source d, called the source distribution of t, and the probability that a randomly selected term occurrence from document d is an instance of term t, called the term distribution of d. These let them define the distribution of co-occurring terms, with which they perform clustering to determine the topic.
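These distributions are straightforward to compute from a term-by-document count table. The sketch below assumes a toy count matrix (the document and term names are mine, purely for illustration); the function names paraphrase the definitions above rather than Wartena's notation.

```python
# counts[d][t] = number of occurrences of term t in document d (made-up data).
counts = {
    "d1": {"cluster": 2, "topic": 1},
    "d2": {"cluster": 1, "keyword": 3},
}

def source_distribution(t):
    """Q(d|t): probability that a random occurrence of term t comes from d."""
    total = sum(c.get(t, 0) for c in counts.values())
    return {d: c.get(t, 0) / total for d, c in counts.items()}

def term_distribution(d):
    """q(t|d): probability that a random occurrence in d is an instance of t."""
    total = sum(counts[d].values())
    return {t: n / total for t, n in counts[d].items()}

def cooccurrence_distribution(t):
    """Distribution of terms co-occurring with t: sum over d of q(t'|d) * Q(d|t)."""
    q_src = source_distribution(t)
    result = {}
    for d, p_d in q_src.items():
        for t2, p_t2 in term_distribution(d).items():
            result[t2] = result.get(t2, 0.0) + p_t2 * p_d
    return result

print(source_distribution("cluster"))  # d1 gets 2/3, d2 gets 1/3
```

The co-occurrence distribution is what gets clustered: two terms whose co-occurrence distributions are close tend to belong to the same topic.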
The concern here is that the vocabulary set may not be sufficient to find the topic in a test document or to extract keywords. To evaluate the available words, a different scheme may be needed; word similarities and clustering could be used.
To pick keywords out of the document, we must identify cluster centers within the document itself. Is there a limit on the number of words we can cluster? No, not as long as the terms are represented by their weights or vectors. Typically the documents, not the terms, have vectors, because the terms only carry weights. Document vectors suffer from high dimensionality due to the number of terms in the vocabulary; this is a problem for the documents but not for the keywords themselves. The weights associated with the keywords can be based on tf-idf or on Jensen-Shannon divergence. We find the most frequent content words, and we can differentiate the keywords from the background using Kullback-Leibler divergence with a cutoff.
However, documents and keywords are not treated independently. Often topics have to be categorized and keywords extracted together, because independently they don't give correct information. Take the example of keyword extraction alone. If we looked merely for content words, worked only with the test document, and reduced it to a bag of words, we would be selecting words without any knowledge of their distribution in other documents. That may or may not be satisfactory, since the document corpus gives valuable information about keywords that generally appear together in a category. Perhaps, then, the documents from the corpus could be summarized in a term-attributes lookup table that we pre-populate from the studied corpus. Then, given the words we encounter in the document, we find the matches and use the corresponding attributes to extract the keywords. WordNet was a start here, but we are talking about the categories the words appear in as attributes, among others. In addition, we are considering Kullback-Leibler divergence with a cutoff to extract keywords.
The suggestion is that we use the existing knowledge as well as the differentiation within the given document, focusing on keyword extraction.
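The Kullback-Leibler cutoff idea above can be sketched as follows. This is an illustration under assumed inputs: each term is scored by its pointwise contribution to the KL divergence between the document's term distribution and a background distribution, and terms above a cutoff are kept. The distributions and the cutoff value are made up for the example, not taken from any paper.

```python
import math

def keyword_scores(doc_dist, background_dist):
    """Per-term contribution to KL(doc || background): p * log(p / q)."""
    return {
        t: p * math.log(p / background_dist[t])
        for t, p in doc_dist.items()
        if p > 0
    }

def extract_keywords(doc_dist, background_dist, cutoff=0.05):
    """Keep terms whose KL contribution exceeds the cutoff (cutoff is illustrative)."""
    scores = keyword_scores(doc_dist, background_dist)
    return sorted(t for t, s in scores.items() if s > cutoff)

doc = {"cluster": 0.4, "the": 0.5, "divergence": 0.1}
background = {"cluster": 0.05, "the": 0.6, "divergence": 0.01}
print(extract_keywords(doc, background))  # -> ['cluster', 'divergence']
```

Note how the function word "the" scores negatively (it is more common in the background than in the document) and is filtered out, while the terms that are over-represented in the document survive the cutoff.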

Monday, November 11, 2013

// PreOrder A B C D E F
// InOrder  C B D A E F

// Build the tree from its preorder and inorder traversals.
// The earlier version recursed into the right subtree with
// GetTree(PreOrder, InOrder, inIndex + 1), using an inorder index to
// address the preorder array; that picks the wrong node (for B's right
// child it returns C instead of D). Tracking the running preorder
// position and the inorder bounds of each subtree fixes this.

public static Node GetTree(char[] PreOrder, char[] InOrder)
{
    int preIndex = 0;
    return Build(PreOrder, InOrder, ref preIndex, 0, InOrder.Length - 1);
}

private static Node Build(char[] PreOrder, char[] InOrder, ref int preIndex, int inStart, int inEnd)
{
    if (inStart > inEnd)
        return null;

    // The next unused preorder element is the root of this subtree.
    Node root = new Node();
    root.data = PreOrder[preIndex];
    preIndex++;

    // Everything left of the root in InOrder belongs to the left subtree,
    // everything right of it to the right subtree.
    int inIndex = Array.IndexOf(InOrder, root.data, inStart, inEnd - inStart + 1);
    root.left = Build(PreOrder, InOrder, ref preIndex, inStart, inIndex - 1);
    root.right = Build(PreOrder, InOrder, ref preIndex, inIndex + 1, inEnd);
    return root;
}
We will pick up the implementation discussion on FuzzySKWIC from here: http://ravinote.blogspot.com/2013/11/today-im-going-to-describe-few-steps.html
and try to wrap up the implementation. We only need to make sure the cosine-based distances are calculated correctly; everything else follows in the iteration as we discussed. The cosine-based distance is computed along components. Each document is represented by a vector of its term frequencies, and this vector is normalized to unit length. Along the kth component, the cosine-based distance is 1/n - xjk * cik, where xjk is the kth component of the normalized frequency vector xj of document j and cik is the kth component of the ith cluster center vector. xjk can be seen as this document's contribution to the total frequency of that term, and it does not change across iterations. Note that both values are less than one; we are referring to cell values in the document vectors, and the cluster center is also such a vector. We start with an N x n matrix, and for each document we maintain feature weights as well as cosine distances to the cluster centers. When a cluster center changes, the cosine distances change and the aggregated cosine distance changes with them; if the centers have not changed, we can skip some of the calculations. We will eventually want a fuzzy partition matrix of size C x N, but for now we will first try without it. The initial document frequency vectors populate the N x n matrix purely from the IDF, and the feature weights are initialized.
For test data we use the Brown corpus categories for news, government, adventure and humor. We will pick ten documents from each category.
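The component-wise distance and its weighted aggregate can be sketched directly from the formula above. This is a minimal illustration, assuming unit-length document and center vectors; the vector values and equal initial feature weights are made up for the example.

```python
def component_distances(x_j, c_i):
    """Cosine-based distance along each component k: 1/n - x_jk * c_ik."""
    n = len(x_j)
    return [1.0 / n - x_j[k] * c_i[k] for k in range(n)]

def weighted_distance(x_j, c_i, v_i):
    """Aggregate distance using the feature weights v_ik of cluster i."""
    return sum(v * d for v, d in zip(v_i, component_distances(x_j, c_i)))

x_j = [0.6, 0.8, 0.0]      # normalized document frequency vector
c_i = [0.6, 0.8, 0.0]      # cluster center, also unit length
v_i = [1/3, 1/3, 1/3]      # equal feature weights to start

# A document identical to the center has aggregate distance 0, since the
# components sum to 1 - (x_j . c_i) = 1 - 1 = 0.
print(weighted_distance(x_j, c_i, v_i))
```

With equal weights the component distances sum to one minus the dot product of the two unit vectors, so identical vectors give distance zero and orthogonal vectors give the maximum.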

Sunday, November 10, 2013

In the previous post, we looked at HotMiner. It's an innovative tool that combines the search view and the content view to discover hot topics that match the user's perspective. Further, the tool mines case logs to extract technical content and discover hot topics. In addition, it uses a variety of techniques to deal with dirty text and generate quality output that matches user expectations.
An important distinction from other approaches is that document categorization has not worked well with these logs: both manual and automatic classification have been poor. Typically clustering is used to categorize documents, but the categories discovered this way coincide with the manual categorization, which means they are unnatural and unintuitive to the user. Since the categories are not known in advance and this work does not fit that mold, the author chooses a novel approach based on the search terms used to open the documents, or on excerpts comprising the most representative sentences of the case documents.
Further, the author introduces several techniques for preprocessing and post-filtering that we should look into. We will enumerate the ones referred to earlier first.
For example, misspelled words are normalized based on edit distance and synonyms: only words that are close by edit distance and can be looked up as synonyms are chosen. A thesaurus for the domain is derived from the words encountered, which helps identify approximate duplicates in the documents; terminology is also derived from this thesaurus. Automatic correction of words in text is done as well. Some misspelled words require the text to be tagged with a part-of-speech tagger, but that could not be applied to the dirty text found in the logs. Table-formatted text is removed by identifying tables, and even code fragments are identified and removed. Sentence boundary detection is another technique used here. Most techniques developed so far follow one of two approaches: the rule-based approach, which requires domain-specific handcrafted, lexically based rules to be compiled, and the corpus-based approach, which requires part-of-speech tagging in the model. Both were difficult with the kind of text seen in the logs, so a set of heuristic rules was identified by inspection and compiled by hand.
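The edit-distance part of that normalization can be sketched as follows. This is an illustration, not HotMiner's actual lexicon or thresholds: a word is mapped to a dictionary entry only when the Levenshtein distance is within a small bound, and the dictionary, the example words, and the threshold are assumptions.

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def normalize(word, lexicon, max_dist=1):
    """Map a word to the closest lexicon entry within max_dist, else keep it."""
    candidates = [w for w in lexicon if edit_distance(word, w) <= max_dist]
    return min(candidates, key=lambda w: edit_distance(word, w)) if candidates else word

print(normalize("instal", ["install", "installation", "stall"]))  # -> install
```

A real system would additionally check the synonym condition mentioned above before accepting a correction, so that a near-miss in spelling is not mapped to an unrelated word that happens to be close in edit distance.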
Summarization techniques are also used in this work. Typically these are applied at various levels of processing: surface, entity, or discourse. Surface-level approaches work in terms of shallow features. Entity-level approaches represent patterns of connectivity in the text by modeling text entities and their relationships, such as those based on syntactic analysis or the use of scripts. Discourse-based approaches model the global structure of the text and its relation to communicative goals; they use WordNet to compute lexical chains, from which a model of topic progression is derived. But discourse-based summarization is not applicable in a domain such as ours, where the specific lingo, bad grammar, and non-narrative nature of the text make it hard to apply. Instead, surface- and entity-level techniques are used.