Sunday, January 18, 2015

Today we continue reading Hearst's "Untangling Text Data Mining" (TDM).
We saw that TDM can yield tools that indirectly aid the information access process. Aside from providing these tools, Hearst mentions that it can also provide tools for exploratory data analysis. Hearst compares TDM with computational linguistics. She notes that empirical computational linguistics computes statistics over large text collections in order to discover useful patterns. These patterns then inform the algorithms for various subproblems within natural language processing. For example, the words "prices", "prescription" and "patent" are highly likely to co-occur with the medical sense of "drug", while "abuse", "paraphernalia" and "illicit" are likely to co-occur with the illegal-use sense of the word. This kind of information can also be used to improve information retrieval algorithms. However, these uses are different from TDM. TDM can instead help with efforts to automatically augment existing lexical structures such as WordNet relations, as mentioned by Fellbaum. Hearst also gives an example of identifying lexico-syntactic patterns that help with WordNet relations. Manning gives an example of automatically acquiring subcategorization data from large text corpora.
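To make the "drug" example concrete, here is a minimal sketch in Python. It is not from Hearst's paper: the cue lists are simply the six words quoted above, hard-coded by hand, and the function name guess_drug_sense is made up for illustration. A real system would learn such cues from co-occurrence statistics over a large corpus rather than from a fixed list.

```python
import re

# Cue words copied from the example in the text. Hard-coding them is an
# assumption for illustration; a real system would derive them from
# co-occurrence counts over a large corpus.
SENSE_CUES = {
    "medical": {"prices", "prescription", "patent"},
    "illicit": {"abuse", "paraphernalia", "illicit"},
}


def guess_drug_sense(sentence):
    """Return the sense whose cue words overlap the sentence most.

    Returns "unknown" when no cue word appears or when the senses tie.
    """
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(tokens & cues) for sense, cues in SENSE_CUES.items()}
    top = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == top]
    if top == 0 or len(winners) > 1:
        return "unknown"
    return winners[0]


if __name__ == "__main__":
    print(guess_drug_sense("The patent on this prescription drug expires soon"))
    print(guess_drug_sense("Police found drug paraphernalia in the car"))
```

Counting overlaps is the crudest possible scoring. It only shows how co-occurring words can separate the two senses.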
We now review TDM and category metadata. Hearst notes that text categorization by itself is not TDM. She argues that text categorization reduces a document to a set of labels but does not add new data. However, an approach that compares distributions of category assignments within subsets of the document collection, in order to find interesting or unexpected trends, can be considered TDM. For example, the distribution of commodities in country C1 can be compared with that in country C2, for instance to ask whether the two constitute an economic unit. Another effort that can be considered TDM is the DARPA Topic Detection and Tracking initiative, in which the system receives a stream of news stories in chronological order and marks each story, on arrival, with a yes or no for whether it is the first reference to an event.
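The idea of comparing category distributions across subsets can be sketched as follows. This is only an illustration under stated assumptions: the commodity labels for C1 and C2 are invented, the function names are hypothetical, and the per-category difference in relative frequency is just one simple way to surface a gap, not the method used in the work Hearst cites.

```python
from collections import Counter


def category_distribution(labels):
    """Turn a list of category labels into relative frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


def distribution_gap(dist_a, dist_b):
    """Per-category difference (a minus b), largest absolute gap first."""
    categories = set(dist_a) | set(dist_b)
    gaps = {c: dist_a.get(c, 0.0) - dist_b.get(c, 0.0) for c in categories}
    # Sort by size of the gap, then by name so ties come out in a stable order.
    return sorted(gaps.items(), key=lambda item: (-abs(item[1]), item[0]))


# Invented category assignments for documents about two countries.
c1_labels = ["wheat", "wheat", "oil", "coffee"]
c2_labels = ["oil", "oil", "oil", "wheat"]

gaps = distribution_gap(
    category_distribution(c1_labels),
    category_distribution(c2_labels),
)

if __name__ == "__main__":
    for category, gap in gaps:
        print(f"{category}: {gap:+.2f}")
```

Categories at the top of the list are the ones whose share differs most between the two subsets, so they are the natural candidates for an "interesting or unexpected" trend worth a closer look.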
By citing these approaches, Hearst makes the point that TDM says something about the world outside the text collection.
TDM can be considered a process of exploratory data analysis that results in new, as yet undiscovered information, or in answers to questions whose answers are not currently known. The effort here is to use text for discovery in a more direct manner than manual inference. Two examples are provided. The first uses text to form hypotheses about diseases, and the second uses text to uncover social impact.
#codingexercise
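// Both wrappers assume the called extension methods on double[] are defined elsewhere.
double GetAlternateEvenNumberRangeProductOfSquares(double[] A)
{
    // Guard against a null array before calling the extension method.
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeProductOfSquares();
}
double GetAlternateEvenNumberRangeSqRtProductOfSquares(double[] A)
{
    // Guard against a null array before calling the extension method.
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeSqRtProductOfSquares();
}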
