Sunday, June 9, 2013

Text mining targets unstructured text, as opposed to the structured text targeted by web mining and the databases targeted by data mining. Data mining and natural language processing both find patterns, and information retrieval answers queries, while text mining finds nuggets of information. That said, text mining overlaps with all of the above.
The processing stages in text mining are text storage, text preprocessing, text transformation or attribute generation, attribute selection, data mining or pattern discovery, and interpretation or evaluation. Text need not be stored only in raw form; it can also be stored as document clusters. Text is characterized by one or more of the following:
source  : human input, automated input in different languages and formats
context : words and phrases create context.
ambiguity : words and sentences are disambiguated based on an ontology
noise : erroneous data, misspelt data, stop words, etc.
format : normal speech, interactive chat, etc.
sparseness : also called document density, given as a percentage for a typical document
Text preprocessing involves cleanup, tokenization, part-of-speech tagging, word sense disambiguation and semantic structures. Text cleanup removes junk characters, binary formats, tables, figures and formulas. Tokenization splits the text into a set of tokens. Part-of-speech tagging associates words with parts of speech and can be grammar based or statistically based. Word sense disambiguation determines which of a word's distinct senses is used in a given sentence. Semantic processing involves either chunking, which produces syntactic constructs like noun phrases and verb phrases, or full parsing, which yields a tree. Chunking is more common.
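The first two preprocessing steps can be sketched in a few lines of Python. This is only a minimal illustration, not the method from the slides: the names clean, tokenize and preprocess are made up here, the stop-word list is a tiny hypothetical one, and the cleanup assumes plain English ASCII text. Part-of-speech tagging, word sense disambiguation and chunking need a trained model or a lexical resource and are left out.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

def clean(text):
    """Replace junk characters with spaces and collapse runs of whitespace."""
    # Assumption: ASCII-only English text, so anything outside the
    # printable ASCII range is treated as junk.
    text = re.sub(r"[^\x20-\x7E]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    """Lowercase the text and split it into word tokens (keeps apostrophes)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def preprocess(text):
    """Cleanup, then tokenization, then stop-word removal."""
    return [tok for tok in tokenize(clean(text)) if tok not in STOP_WORDS]

tokens = preprocess("The\x00 mining of   text\tis fun!")
print(tokens)
```

Stop words are dropped here because they were listed as noise earlier; whether that helps depends on the task.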
Text transformation involves text representation and feature selection, which together characterize the document. A classifier then automatically generates labels (attributes) from the features fed into it.
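One common text representation is the bag of words, where each document becomes a vector of term counts over a shared vocabulary. The notes do not say which representation the slides used, so the sketch below is an assumption, and build_vocabulary and to_vector are illustrative names.

```python
from collections import Counter

def build_vocabulary(docs):
    """Sorted list of every distinct token across the tokenized documents."""
    return sorted({tok for doc in docs for tok in doc})

def to_vector(doc, vocabulary):
    """Term-count vector for one tokenized document."""
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]

# Two already-tokenized toy documents.
docs = [
    ["text", "mining", "finds", "nuggets"],
    ["data", "mining", "finds", "patterns"],
]
vocab = build_vocabulary(docs)
vectors = [to_vector(doc, vocab) for doc in docs]

print(vocab)
for vec in vectors:
    print(vec)
```

Each position in a vector is one candidate feature, which is why the vocabulary size drives the dimensionality that feature selection later has to cut down.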
Feature selection follows one of two approaches. One is to select the features before using them in a classifier, which requires a feature ranking method. The other is to select the features based on how well they work in a classifier. In the latter case the classifier is part of the feature selection method and selection is often an iterative process: the classifier needs to be trained, the evaluation is based on actual use, and the features are evaluated iteratively. In the former case there are many more choices of feature selection method, since selection is independent of the classifier, and each feature is evaluated only once, which lowers the computational cost. The attributes generated are the labels of the classes automatically produced by the classifier on the selected features.
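These two approaches are often called filter and wrapper selection. The sketch below contrasts them on a toy data set. Everything in it is an assumption made for illustration: the scoring rule (difference in per-class document frequency), the voting classifier, the greedy forward search, and the names filter_rank, train, accuracy and wrapper_select. The classifier is also scored on its own training data, which is only acceptable in a toy.

```python
def filter_rank(docs, labels, k):
    """Filter approach: score every feature once, without any classifier.

    Score = |share of class-1 docs containing the term
             - share of class-0 docs containing the term|.
    """
    pos = [d for d, y in zip(docs, labels) if y == 1]
    neg = [d for d, y in zip(docs, labels) if y == 0]
    vocab = sorted(set().union(*docs))

    def score(term):
        p = sum(term in d for d in pos) / len(pos)
        n = sum(term in d for d in neg) / len(neg)
        return abs(p - n)

    # sorted() is stable, so ties keep alphabetical order.
    return sorted(vocab, key=score, reverse=True)[:k]

def train(features, docs, labels):
    """Toy classifier: +1 per class-1 doc and -1 per class-0 doc containing a feature."""
    return {f: sum(1 if y == 1 else -1 for d, y in zip(docs, labels) if f in d)
            for f in features}

def accuracy(features, docs, labels):
    """Train on the given features and score on the same documents (toy setup)."""
    weights = train(features, docs, labels)
    preds = [1 if sum(weights[f] for f in features if f in d) > 0 else 0
             for d in docs]
    return sum(p == y for p, y in zip(preds, labels)) / len(docs)

def wrapper_select(docs, labels, k):
    """Wrapper approach: greedily add the feature that gives the best accuracy."""
    vocab = sorted(set().union(*docs))
    selected = []
    while len(selected) < k:
        candidates = [f for f in vocab if f not in selected]
        if not candidates:
            break
        # Every candidate costs one full train-and-evaluate cycle.
        best = max(candidates, key=lambda f: accuracy(selected + [f], docs, labels))
        selected.append(best)
    return selected

# Toy corpus: documents as token sets, 1 = sports, 0 = politics.
docs = [{"goal", "match", "team"}, {"team", "score", "goal"},
        {"vote", "party", "team"}, {"vote", "election", "party"}]
labels = [1, 1, 0, 0]

print(filter_rank(docs, labels, 2))
print(wrapper_select(docs, labels, 1))
```

The filter scores each term exactly once, while the wrapper retrains the classifier for every candidate in every round, which is where the extra computational cost comes from.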
Attribute selection is done because high dimensionality causes problems for machine learners, so irrelevant features are removed.
After the attributes are selected, patterns can be found with data mining techniques and the results interpreted.

Review of text mining slides from CSE634 presentation by Chiwara, Al-Ayyoub, Hossain and Gupta
 
