Monday, November 11, 2013

We will pick up the implementation discussion on FuzzySKWIC from here: http://ravinote.blogspot.com/2013/11/today-im-going-to-describe-few-steps.html
and we will try to wrap up the implementation. We only need to make sure the cosine-based distances are calculated correctly; everything else follows in the iteration as we discussed. The cosine-based distance is computed along individual components. Each document is represented by a vector of its document frequencies, and this vector is normalized to unit length. The cosine-based distance along the kth dimension is 1/n - x_jk * c_ik, where x_jk is the kth component of the document frequency vector x_j and c_ik is the kth component of the ith cluster center vector c_i. x_jk can be read as this document's contribution to the total document frequency of that term, so x_jk does not change across iterations. Note that both components are less than one; that is, we are referring to cell values in the document vectors, and the cluster center is itself a document vector. We start with an N x n matrix, and for each document we maintain the feature weights as well as the cosine distances to the cluster centers. When a cluster center changes, the per-component cosine distances change and the aggregated cosine distance changes with them; if the centers have not changed, we can skip some of the calculations. We will eventually want a fuzzy membership matrix of size C x N, but for now we will first try it out without one. The initial assignment of the document frequency vectors populates the N x n matrix purely from the IDF, and the feature weights are initialized.
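As a minimal sketch of this step (assuming unit-normalized NumPy vectors and a feature-weight vector v_i for cluster i; the function and variable names and the sample values are illustrative, not from the original post), the per-component cosine-based distance and its feature-weighted aggregation could look like this:

    import numpy as np

    def component_cosine_distances(x_j, c_i):
        # Per-component cosine-based distance: 1/n - x_jk * c_ik
        # x_j: unit-normalized frequency vector of document j, shape (n,)
        # c_i: unit-normalized center vector of cluster i, shape (n,)
        n = x_j.shape[0]
        return 1.0 / n - x_j * c_i

    def aggregated_distance(x_j, c_i, v_i):
        # Aggregate the per-component distances with the cluster's feature weights v_i
        return float(np.dot(v_i, component_cosine_distances(x_j, c_i)))

    # Example with a 5-term vocabulary and feature weights initialized to 1/n
    x_j = np.array([0.2, 0.4, 0.0, 0.8, 0.4]); x_j /= np.linalg.norm(x_j)
    c_i = np.array([0.1, 0.5, 0.1, 0.7, 0.5]); c_i /= np.linalg.norm(c_i)
    v_i = np.full(5, 1.0 / 5)
    print(aggregated_distance(x_j, c_i, v_i))

If the cluster centers have not changed between iterations, the per-component distances for those clusters can be cached and reused, as noted above.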
For test data we use the Brown corpus categories news, government, adventure, and humor. We will pick ten documents from each category.
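A rough way to assemble this test set with NLTK (a sketch that assumes NLTK and its Brown corpus data are installed; the variable names are illustrative) would be:

    from nltk.corpus import brown

    categories = ['news', 'government', 'adventure', 'humor']

    # Keep the word lists of the first ten documents from each category
    test_docs = {}
    for cat in categories:
        for fid in brown.fileids(categories=cat)[:10]:
            test_docs[fid] = list(brown.words(fileids=fid))

    print(len(test_docs))  # number of documents actually collected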

Sunday, November 10, 2013

In the previous post, we looked at HotMiner. It is an innovative tool that combines a search view and a content view to discover hot topics that match the users' perspectives. Further, the tool mines case logs to extract technical content and discover hot topics. In addition, it uses a variety of techniques to deal with dirty text and generate quality output that matches user expectations.
An important distinction from other approaches is that document categorization has not worked well on these logs: both manual and automatic classification have performed poorly. Typically, clustering is used to categorize the documents; however, the categories discovered this way suffer from the same problems as manual categorization, meaning the categories are unnatural and unintuitive to the user. Since the categories are not known in advance and this problem does not fit standard categorization, the author chooses a novel approach based on the search terms used to open the cases or on excerpts composed of the most representative sentences of the case documents.
Further, the author has introduced several techniques for preprocessing and post-filtering that we should look into. We will first enumerate the ones we referred to earlier.
For example, misspelled words are normalized based on edit distance and synonyms: only the words that are close by edit distance and can be looked up as synonyms are chosen. A thesaurus for the domain is derived from the words encountered; this helps identify approximate duplicates in the documents, and the domain terminology is also derived from this thesaurus. Automatic correction of words in the text is done as well. Some misspelling corrections require words to be tagged with a part-of-speech tagger, but such taggers could not be applied to the dirty text found in the logs. Table-formatted text is removed by identifying tables, and even code fragments are identified and removed. Sentence boundary detection is another technique used here. Most of the techniques developed so far follow one of two approaches: the rule-based approach, which requires domain-specific, handcrafted, lexically based rules to be compiled, and the corpus-based approach, which requires part-of-speech tagging in the model. Both of these approaches were difficult with the kind of text seen in the logs, so a set of heuristic rules was identified by inspection and compiled by hand.
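As an illustration of the edit-distance part of this normalization (a sketch only; the tiny domain_vocabulary here is hypothetical, and a real system would also consult the synonym lookup described above), Python's difflib can be used to map a misspelled token to its closest known term:

    import difflib

    # Hypothetical domain vocabulary derived from the logs
    domain_vocabulary = ['printer', 'cartridge', 'driver', 'firmware', 'install']

    def normalize_token(token, vocabulary, cutoff=0.8):
        # Map a possibly misspelled token to the closest vocabulary entry,
        # using difflib's similarity ratio as an edit-distance-like measure.
        if token in vocabulary:
            return token
        matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        return matches[0] if matches else token

    print(normalize_token('cartrige', domain_vocabulary))   # 'cartridge'
    print(normalize_token('frimware', domain_vocabulary))   # 'firmware'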
Summarization techniques were also used in this work. Typically these are applied at various levels of processing, such as the surface, entity, or discourse level. Surface-level approaches represent information in terms of shallow features. Entity-level approaches represent patterns of connectivity in the text by modeling text entities and their relationships, such as those based on syntactic analysis or the use of scripts. Discourse-based approaches model the global structure of the text and its relation to communicative goals, for instance by using WordNet to compute lexical chains from which a model of topic progression is derived. But discourse-based summarization is not applicable in a domain such as ours, where the specific lingo, bad grammar, and the non-narrative nature of the text make it hard to apply. Instead, surface- and entity-level techniques are used.
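A surface-level extractive step of this kind can be sketched with simple word-frequency scoring (assuming sentences have already been split by the heuristic boundary detector; the scoring scheme below is a generic illustration, not the paper's exact method):

    from collections import Counter

    STOPWORDS = {'the', 'a', 'an', 'of', 'to', 'and', 'is', 'in', 'on', 'for'}

    def top_sentences(sentences, k=2):
        # Score each sentence by the corpus frequency of its non-stopword terms
        words = [w.lower() for s in sentences for w in s.split()
                 if w.lower() not in STOPWORDS]
        freq = Counter(words)
        scored = [(sum(freq[w.lower()] for w in s.split() if w.lower() not in STOPWORDS), s)
                  for s in sentences]
        return [s for _, s in sorted(scored, reverse=True)[:k]]

    sentences = ['The printer driver fails to install on reboot.',
                 'Customer called again about the printer driver.',
                 'Ticket closed after follow up.']
    print(top_sentences(sentences, k=1))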

 

Saturday, November 9, 2013

Both the search log and the case log mentioned in the previous post are required to perform the search. The search log keeps track of the search strings that customers formulate, and the case log maintains the history of actions, events, and dialogues while a case is open. The implementation of the approach involves searches in both of these logs. The search logs often contain noise that comes with the nature of the web browsing needed for matches of interest and availability to take place, hence they are subjected to both a pre-processing and a post-filtering technique. Further, the same content is viewed differently for different topics, i.e. the search view and the content view of the documents in a hot topic are contrasted to identify the extraneous ones. Identifying extraneous documents is not only beneficial for obtaining higher quality topics but also pinpoints documents that are being returned as noise for certain queries.
The case logs are different from the search logs, and hence their processing is somewhat indirect: excerpts are generated from the case logs and then mined. Note that the case documents are in general very long because they capture all the information on the actions taken on a case. In addition, there may be input from more than one party, and hence there is a lot of noise to deal with. In the author's approach there is both a pre-processing and a clean-up of the actions involved. Here the noise filtering is done by normalizing typos, misspellings, and abbreviations; the words are even normalized to a known jargon with the help of a thesaurus. Excerpt generation and summarization is composed of a variety of techniques, ranging from dealing with the characteristics of the text to making use of domain knowledge. Sentences are identified regardless of the tables and cryptic text they wrap around. Sentences are ranked primarily on their technical content rather than on logistics, and the individual techniques can be enabled or disabled independently.
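To make the ranking idea concrete (a hypothetical sketch: the technical and logistic term lists are invented here, and the real system relies on its domain thesaurus and more elaborate rules), sentences can be scored by rewarding technical vocabulary and penalizing logistics phrases:

    TECHNICAL_TERMS = {'driver', 'firmware', 'error', 'install', 'patch', 'configuration'}
    LOGISTIC_TERMS = {'called', 'callback', 'voicemail', 'scheduled', 'ticket', 'escalated'}

    def rank_case_sentences(sentences):
        # Higher score = more technical content, less case logistics
        def score(sentence):
            tokens = {t.strip('.,:;').lower() for t in sentence.split()}
            return len(tokens & TECHNICAL_TERMS) - len(tokens & LOGISTIC_TERMS)
        return sorted(sentences, key=score, reverse=True)

    case_log = ['Customer called and left a voicemail about the issue.',
                'Reinstalling the driver and applying the firmware patch fixed the error.',
                'Callback scheduled for Monday.']
    print(rank_case_sentences(case_log)[0])

The top-ranked sentences would then be concatenated into the excerpt that stands in for the full case document.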
The author proposes a novel approach to mining hot topics from the search logs. Here the search view and the content view are combined to obtain high-quality topics that have a higher match with the user's perspective.
There is a topic in the book Survey of Text Mining on HotMiner by Malu Castellanos. Here companies would like to find the topics that are of interest to their customers, i.e. hot problems, and make them available on their website along with links to the corresponding solution documents. The customer support centers maintain logs of their customer interactions, and these become the source for discovering hot topics. The author uses these case logs to extract relevant sentences from cases to form case excerpts. In addition, the approach deals with dirty text containing typos, ad hoc abbreviations, incorrect grammar, cryptic tables, and ambiguous or missing punctuation. The terminology is normalized with a thesaurus assistant and a sentence identifier.
The suggestion here is that the logs, rather than the documents, provide information on the topics that are of most interest to the customers; these are called hot topics. This matters because document categorization and classification, be it manual or automatic, is not sufficient to detect the topics of interest to the customers. Instead, by identifying the topics of these hot problems and providing self-help solution documents for them, organizations can better organize their site, reduce customer support costs, and improve customer satisfaction.
The author's approach to mining hot topics from the logs of customer support centers involves two kinds of logs: a search log and a case log. The search log keeps track of the search strings that customers formulate, and the case log keeps track of the cases opened by customers along with the history of actions, events, and dialogues followed while a case is open.
These two logs are complementary in providing information on all the problems encountered by the customers.