Sunday, November 10, 2013

In the previous post, we looked at HotMiner, an innovative tool that combines the search view and the content view to discover hot topics that match users' perspectives. Further, this tool mines case logs to extract technical content and discover hot topics. In addition, it uses a variety of techniques to deal with dirty text and generate quality output that matches user expectations.
An important distinction from other approaches is that document categorization has not worked well with these logs: both manual and automatic classification have given poor results. Clustering is typically used to categorize such documents, but the categories it discovers suffer from the same problems as the manual categorization; they are unnatural and not intuitive to the user. Since the categories are not known in advance and this work does not fit that mold, the author chooses a novel approach that represents cases either by the search terms used to open them or by excerpts made up of the most representative sentences of the case documents.
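As a rough illustration of the search-term side of this idea, here is a minimal sketch (in Python, with hypothetical case IDs and terms, not the paper's actual algorithm) that groups cases by the terms used to open them and treats widely shared terms as candidate hot topics.

```python
# Hypothetical sketch: represent each case by the search terms used to open it,
# then group cases that share terms into candidate hot topics.
from collections import defaultdict

cases = {
    "case-101": ["printer", "driver", "install"],
    "case-102": ["driver", "install", "error"],
    "case-103": ["backup", "restore", "failure"],
}

# Invert the mapping: each search term points to the cases opened with it.
term_to_cases = defaultdict(set)
for case_id, terms in cases.items():
    for term in terms:
        term_to_cases[term].add(case_id)

# Terms shared by many cases are candidates for "hot" topics.
hot_terms = sorted(term_to_cases.items(), key=lambda kv: len(kv[1]), reverse=True)
for term, case_ids in hot_terms[:3]:
    print(term, "->", sorted(case_ids))
```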
Further, the author has introduced several techniques for preprocessing and post-filtering that we should look into. We will enumerate the ones we referred to earlier first.
For example, misspelled words are normalized based on edit distance and synonyms: only words that are close by edit distance and also found as synonyms are chosen as corrections. A thesaurus for the domain is derived from the words encountered, which helps identify approximate duplicates in the documents, and the domain terminology is also derived from this thesaurus. Automatic correction of words in the text is done as well. Some misspelling-correction methods require the words to be tagged with a part-of-speech tagger, but those could not be applied to the dirty text found in the logs. Table-formatted text is removed by identifying tables, and even code fragments are identified and removed.

Sentence boundary detection is another technique used here. Most of the techniques developed so far follow one of two approaches: the rule-based approach, which requires domain-specific, handcrafted, lexically based rules to be compiled, and the corpus-based approach, which requires part-of-speech tagging in the model. Both of these were difficult with the kind of text seen in the logs, so a set of heuristic rules was identified by inspection and compiled by hand.
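To make the misspelling normalization concrete, here is a minimal sketch of that idea: a token is corrected only when a vocabulary word is both within a small edit distance and recorded as a known variant in a domain thesaurus. The vocabulary, thesaurus, and threshold below are assumptions for illustration only, not the paper's actual resources.

```python
# Sketch of edit-distance + synonym based normalization (hypothetical data).

def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

VOCABULARY = {"printer", "driver", "install", "network"}
# Hypothetical domain thesaurus: each canonical term maps to known variants.
THESAURUS = {"printer": {"printr", "prnter"}, "install": {"instal", "instll"}}

def normalize(token: str, max_dist: int = 2) -> str:
    if token in VOCABULARY:
        return token
    for canonical in VOCABULARY:
        close = edit_distance(token, canonical) <= max_dist
        is_variant = token in THESAURUS.get(canonical, set())
        if close and is_variant:  # both conditions must hold
            return canonical
    return token  # leave unrecognized tokens untouched

print(normalize("printr"))  # -> printer
print(normalize("netwrk"))  # -> netwrk (close by edit distance, but not a known variant)
```

The point of requiring both conditions is to avoid "correcting" domain-specific jargon that merely looks like a typo of a common word.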
Summarization techniques were also used in this work. Typically these operate at various levels of processing, such as the surface, entity, or discourse level. Surface-level approaches score text in terms of shallow features. Entity-level approaches represent patterns of connectivity in the text by modeling text entities and their relationships, such as those based on syntactic analysis or the use of scripts. Discourse-based approaches model the global structure of the text and its relation to communicative goals; they may use WordNet to compute lexical chains from which a model of topic progression is derived. But discourse-based techniques are not applicable in a domain such as ours, where the specific lingo, poor grammar, and the non-narrative nature of the text make them hard to apply. Instead, surface- and entity-level techniques are used.
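For a flavor of the surface-level style, here is a minimal, generic frequency-based extractive sketch (not HotMiner's actual summarizer): sentences are scored by the frequency of their content words, and the top-scoring ones form the excerpt.

```python
# Sketch of a surface-level extractive excerpt using shallow features only.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "to", "and", "of", "it", "on", "in"}

def excerpt(text: str, k: int = 2) -> list[str]:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence: str) -> int:
        return sum(freq[w] for w in re.findall(r"[a-z]+", sentence.lower())
                   if w not in STOPWORDS)

    # Keep the k highest-scoring sentences, in their original order.
    top = sorted(sentences, key=score, reverse=True)[:k]
    return [s for s in sentences if s in top]

log = ("Customer reports printer driver error. Reinstalled the driver twice. "
       "Error persists after reboot. Weather was nice today.")
print(excerpt(log, k=2))
```

Even this shallow scoring tends to drop off-topic sentences, which hints at why surface-level features remain usable on dirty, non-narrative text where discourse-level modeling breaks down.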

 
