Tuesday, January 20, 2015

Today we continue our discussion on text data mining with another example of exploratory data analysis: using text to uncover social impact. Narin et al. studied the effects of publicly financed research on industrial advances. After years of preliminary studies and building special-purpose tools, the authors found that the technology industry relies more heavily than ever on government-sponsored research results. The relationship between patent text and the published research literature was explored with the following procedure: the text from the front pages of the patents issued in two consecutive years was analyzed. There were hundreds of thousands of references, from which those published in the last 11 years were filtered. These references were linked to known journals and authors' addresses. Redundant citations were eliminated and articles with no American author were discarded. The study ended up with a core collection of 45,000 papers. For these papers, the source of funding was found from the closing lines of the research paper. From this, it was revealed that there was an extensive reliance on publicly financed science.
Patents awarded to industry were further filtered by excluding those awarded to schools and governments. From these, the study examined the peak year of literature references and found 5,217 citations to science papers; 73.3% of these were written at public institutions.
The example above illustrates a number of steps needed to perform exploratory data analysis. At that time most of these steps relied on data that was not available online and therefore had to be done by hand. Together with the previous example on forming hypotheses, it shows that exploratory data analysis uses text in a direct manner.
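If those filtering steps were carried out today over online data, a minimal sketch might look like the following. The PatentReference type, its fields, and the address-based test for an American author are hypothetical, introduced only to illustrate the sequence of filters; they are not from the study.

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical intermediate form for a reference found on a patent's front page.
record PatentReference(string CitedTitle, string Journal, string AuthorAddress, int Year);

static class CitationFilterSketch
{
    static List<PatentReference> FilterReferences(IEnumerable<PatentReference> refs, int issueYear)
    {
        return refs
            .Where(r => issueYear - r.Year <= 11)           // keep references published in the last 11 years
            .Where(r => r.AuthorAddress.Contains("USA"))    // crude stand-in for "has an American author"
            .GroupBy(r => r.CitedTitle)                     // collapse redundant citations to the same article
            .Select(g => g.First())
            .ToList();
    }
}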
We next discuss the LINDI project, which investigates how researchers can use large text collections in the discovery of important new information, and how to build software systems to support that process.
There are two ways to discover new information. First, sequences of queries and related operations are issued across text collections. Second, concepts that co-occur within the retrieved documents are statistically and visually examined for associations.
There are tools for both, and these make use of attributes associated especially with text collections and their metadata. The steps in the second approach should be tightly integrated with the analysis and integration tools needed in the first.
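A minimal sketch of the second operation, assuming the documents have already been retrieved and tokenized into word arrays (an assumption for illustration): count how often each pair of concepts of interest co-occurs in the retrieved set, so the counts can then be examined for associations.

using System;
using System.Collections.Generic;
using System.Linq;

static class CooccurrenceSketch
{
    // For every pair of terms of interest, count the retrieved documents in which both appear.
    static Dictionary<(string, string), int> CountPairs(IEnumerable<string[]> retrievedDocs, string[] terms)
    {
        var counts = new Dictionary<(string, string), int>();
        foreach (var doc in retrievedDocs)
        {
            var present = terms.Where(t => doc.Contains(t)).ToArray();
            for (int i = 0; i < present.Length; i++)
                for (int j = i + 1; j < present.Length; j++)
                {
                    var key = (present[i], present[j]);
                    counts[key] = counts.TryGetValue(key, out var c) ? c + 1 : 1;
                }
        }
        return counts;
    }
}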
#codingexercise
Double GetAlternateOddNumberRangeProductCubes(Double[] A)
{
    if (A == null) return 0;
    // Delegates to an extension method assumed to be defined elsewhere in this series.
    return A.AlternateOddNumberRangeProductCubes();
}

Monday, January 19, 2015

#codingexercise
Double GetAlternateEvenNumberRangeProduct(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeProduct();
}
Today we will continue to discuss Hearst's paper on untangling text data mining.
We had cited an example from the previous paper about using text to form hypotheses about disease. Today we look at this example straight from one of the sources. The idea that new hypotheses can be drawn from text has been alluring for a long time but remains virtually untapped. Experts can only read a small subset of what is published in their fields and are often unaware of developments in related fields. Thus it should be possible to find useful linkages between information in related literatures, even when the authors of those literatures rarely refer to one another's work.
The example of tying migraine headaches to a deficiency in magnesium comes from Swanson and Smalheiser's efforts. Swanson extracted various titles of articles in the biomedical literature and paraphrased them, as we had seen earlier:
stress is associated with migraines
stress can lead to loss of magnesium
calcium channel blockers prevent some migraines
magnesium is a natural calcium channel blocker
spreading cortical depression is implicated in some migraines
high levels of magnesium inhibit spreading cortical depression
migraine patients have high platelet aggregability
magnesium can suppress platelet aggregability

This led to the hypothesis and its confirmation by experimental means. This approach has been only partially automated; by that Hearst means that there are many more possibilities, as hinted at by the combinatorial explosion of candidate links. Beeferman explored certain links via lexical relations using WordNet. However, sophisticated new algorithms are needed to help in the pruning process. Such a process needs to take into account various kinds of semantic constraints, and therefore falls in the domain of computational linguistics.
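A toy sketch of the chaining step behind Swanson's approach, assuming the paraphrased statements have been normalized into directed links such as (magnesium, calcium channel blocker) and (calcium channel blocker, migraine): whenever A-B and B-C are asserted but A-C is not, A-C is proposed as a candidate hypothesis. The semantic pruning Hearst calls for is deliberately left out.

using System;
using System.Collections.Generic;
using System.Linq;

static class SwansonChaining
{
    // From direct links (A, B), propose indirect candidates (A, C) that are not already stated.
    static IEnumerable<(string, string)> ProposeHypotheses(IEnumerable<(string, string)> links)
    {
        var direct = new HashSet<(string, string)>(links);
        return direct
            .SelectMany(ab => direct.Where(bc => bc.Item1 == ab.Item2),
                        (ab, bc) => (ab.Item1, bc.Item2))
            .Where(ac => ac.Item1 != ac.Item2 && !direct.Contains(ac))
            .Distinct();
    }
}

Run over the statements above, such a chain links magnesium to migraine through intermediates like calcium channel blockers and spreading cortical depression, and it also produces many spurious candidates, which is exactly why pruning with semantic constraints becomes necessary.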

#codingexercise
Double GetAlternateOddNumberRangeProduct(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeProduct();
}

Sunday, January 18, 2015

Today we continue to read from Hearst: Untangling Text Data Mining. (TDM)
We saw the mention that TDM can yield tools that indirectly aid in the information access process. Aside from providing these tools, Hearst mentions that it can also provide tools for exploratory data analysis. Hearst compares TDM with computational linguistics: empirical computational linguistics computes statistics over large text collections in order to discover useful patterns, and these patterns then inform the algorithms of various subproblems within natural language processing. For example, words such as "prices", "prescription", and "patent" are highly likely to co-occur with the medical sense of "drug", while "abuse", "paraphernalia", and "illicit" are likely to co-occur with the sense of "drug" that refers to illegal use. This kind of information can also be used to improve information retrieval algorithms. However, these are different from TDM. TDM can instead help with efforts to automatically augment existing lexical structures such as WordNet relations, as mentioned by Fellbaum. Hearst also gave such an example, of identifying lexicosyntactic patterns that help identify WordNet relations. Manning gave an example of automatically acquiring subcategorization data from large text corpora.
We now review TDM and category metadata. Hearst notes that text categorization is not TDM, arguing that text categorization reduces the document to a set of labels but does not add new data. However, an approach that compares distributions of category assignments within subsets of the document collection to find interesting or unexpected trends can be considered TDM. For example, the distribution of commodities associated with country C1 can be compared with that of country C2 within an economic unit. Another effort that can be considered TDM is the DARPA Topic Detection and Tracking initiative, in which the system receives a stream of news stories in chronological order and marks, on arrival, whether each story is the first reference to an event.
In citing these approaches, Hearst's point is that TDM says something about the world outside the text collection.
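A small sketch of the distribution-comparison idea from above, with hypothetical inputs: given the category labels assigned to documents about two countries, report the categories whose relative frequencies differ the most, as candidates for interesting or unexpected trends.

using System;
using System.Collections.Generic;
using System.Linq;

static class CategoryTrendsSketch
{
    // Compare relative frequencies of category labels between two document subsets.
    static IEnumerable<(string Category, double Difference)> CompareDistributions(
        IList<string> labelsCountry1, IList<string> labelsCountry2)
    {
        var categories = labelsCountry1.Concat(labelsCountry2).Distinct();
        double n1 = labelsCountry1.Count, n2 = labelsCountry2.Count;
        return categories
            .Select(c => (Category: c,
                          Difference: labelsCountry1.Count(l => l == c) / n1
                                    - labelsCountry2.Count(l => l == c) / n2))
            .OrderByDescending(x => Math.Abs(x.Difference));
    }
}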
TDM can be considered a process of exploratory data analysis that results in new, as yet undiscovered information, or answers to questions for which the answer is not currently known. The effort here is to use text for discovery in a more direct manner than manual inferencing. Two examples are provided: the first is using text to form hypotheses about disease and the second is using text to uncover social impact.
#codingexercise
Double GetAlternateEvenNumberRangeProductOfSquares(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeProductOfSquares();
}
Double GetAlternateEvenNumberRangeSqRtProductOfSquares(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeSqRtProductOfSquares();
}

Saturday, January 17, 2015

#codingexercise
Double GetAlternateEvenNumberRangeSqRtSumOfSquares(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeSqRtSumOfSquares();
}
Today we read a paper from Hearst: Untangling Text Data Mining. Hearst reminds us that the possibilities for extracting information from text are virtually untapped, because text expresses a vast range of information but encodes it in a way that is difficult to decipher automatically. We recently reviewed the difference between Text Knowledge Mining and Text Data Mining. This paper focuses on text data mining, calls out some of the new problems encountered in computational linguistics, and outlines ideas about how to pursue exploratory data analysis over text.
Hearst differentiates between TDM and information access. The goal of information access is to help users find documents that satisfy their information needs. The standard procedure is akin to looking for needles in a needlestack; the analogy comes from the fact that the problem is about finding information that coexists with other valid pieces of information, and the difficulty lies in homing in on the information that the user is interested in. As per Hearst, the goal of data mining, on the other hand, is to derive new information from data, find patterns across datasets, and/or separate signal from noise. The fact that an information retrieval system can return a document that contains the information the user requested implies that no new discovery is made.
Hearst points out that text data mining is sometimes discussed together with search on the web. For example, the KDD-97 panel on data mining stated that the two predominant challenges for data mining are finding useful information on the web and discovering knowledge about a domain represented by a collection of web documents, as well as analyzing the transactions run in a web-based system. This search-centric view misses the point that the web can be considered a knowledge base from which new, never-before-encountered information can be extracted.
The results of certain types of text processing can yield tools that indirectly aid in the information access process. Examples include text clustering to create thematic overviews of a text collection.
#codingexercise
Double GetAlternateEvenNumberRangeSumOfSquares(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeSumOfSquares();
}
 

Friday, January 16, 2015

#codingexercise
Double GetAlternateOddNumberRangeSqRtSumOfSquares(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeSqRtSumOfSquares();
}

We continue our discussion on Text Knowledge Mining. We were discussing reasoning. Complexity is one of the important aspects of automatic inference in knowledge-based systems, and systems may trade off the expressiveness of the knowledge representation against the complexity of reasoning. While this is true for both TDM and TKM, TKM does not have as much difficulty. First, both TDM and TKM are hard problems, and there are exponential algorithms as well as many strategies for designing efficient mining algorithms. Second, TKM is not intended to be exhaustive, so it can limit the data sources.
Now let us look at how to assess the results. In data mining, there are two phases for evaluating the results. In the first phase, the statistical significance of the results is assessed. In the second, there is a subjective assessment of the novelty and usefulness of the patterns.
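As a rough illustration of the first phase only (the second, subjective phase cannot be automated), here is a standard 2x2 chi-square statistic that could be used to check whether an extracted co-occurrence pattern deviates from chance; the cell labels follow the usual contingency-table convention and are not prescribed by the paper.

using System;

static class SignificanceSketch
{
    // Chi-square statistic for a 2x2 contingency table:
    // a = documents with both terms, b = first term only, c = second term only, d = neither.
    static double ChiSquare(double a, double b, double c, double d)
    {
        double n = a + b + c + d;
        double denominator = (a + b) * (c + d) * (a + c) * (b + d);
        if (denominator == 0) return 0;
        return n * Math.Pow(a * d - b * c, 2) / denominator;
    }
}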
For TKM, we have the following:
The results are considered reliable if the texts are valid and reliable and if sound inference procedures are applied.
The user decides whether the results are non-trivial; because deductive inference only makes explicit what the texts already entail, some results may turn out to be trivial.
The non-triviality of the results is evaluated by the expert user.
The novelty of the results is evaluated against the background knowledge (BK).
The usefulness is evaluated by the experts.
The authors conclude that TKM is a particular case of knowledge mining. Knowledge mining deals with knowledge while data mining deals with data in which the knowledge is implicit. The operations are therefore deductive and abductive inference.

Thursday, January 15, 2015

We continue with our discussion on Text Knowledge Mining. We were discussing knowledge representation. Text mining requires translating text into a computationally manageable intermediate form. This step is crucial and poses several challenges to TDM and TKM, like obtaining intermediate forms, defining structures for information extraction, identifying semantic links between new concepts, relating different identifiers, etc. A key problem with obtaining intermediate forms is that the current methods require human interaction. The objective of an intermediate form is not to represent every possible aspect of the semantic content of a text, but only those aspects related to the kind of inference we are interested in. That is why, even if we are not able to fully analyze the text, the only impact is some missing pieces of knowledge, and TKM is not expected to be exhaustive in obtaining new knowledge. That said, if the analyzer obtains an inexact representation of the text in an intermediate form, this can affect the knowledge discovered, even to the point of reporting false discoveries. The knowledge in a knowledge-based system is assumed to be reliable and consistent, but the same cannot be said for a collection of text. This poses another challenge for TKM.
We now look at the role of background knowledge. The value of including background knowledge in text mining applications is widely recognized. Similarly, in TKM, text does not contain the common-sense knowledge and specific domain knowledge that are necessary in order to perform TKM. As an example, text stating that A is the father of B and that B is the father of A is not recognized as contradictory without background knowledge. Background knowledge can contribute to TKM in the following ways. First, it allows us to create new knowledge that is not fully contained in the collection of text but can be derived from a combination of text and background knowledge; this was a requirement of intelligent text mining. Another interesting application of background knowledge is in the assessment of knowledge, in aspects like novelty and importance.
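A toy sketch of the father-of example, with made-up types: the background knowledge that the relation is asymmetric is encoded as a check, so that two extracted statements which are individually plausible are flagged as contradictory when taken together.

using System;
using System.Collections.Generic;
using System.Linq;

static class BackgroundKnowledgeSketch
{
    // A fact extracted from text, e.g. ("A", "fatherOf", "B").
    record Fact(string Subject, string Relation, string Obj);

    // Background rule: the given relation is asymmetric, so X R Y together with Y R X is a contradiction.
    static bool ViolatesAsymmetry(IEnumerable<Fact> facts, string relation)
    {
        var pairs = new HashSet<(string, string)>(
            facts.Where(f => f.Relation == relation).Select(f => (f.Subject, f.Obj)));
        return pairs.Any(p => pairs.Contains((p.Item2, p.Item1)));
    }
}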
Reasoning and complexity are also important aspects of automatic inference in knowledge based systems.
#codingexercise

Double GetAlternateOddNumberRangeSumOfSquares(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeSumOfSquares();
}

Wednesday, January 14, 2015

#codingexercise
Double GetAlternateOddNumberRangeStdDev(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeStdDev();
}

We continue our discussion on Text Knowledge Mining. We discussed the algorithm for finding contradictions. Looking for contradictions is very useful: it helps check the consistency of a text collection, or assess the validity of a new text to be incorporated, in terms of the knowledge already contained in the collection. In addition, we can use it to group texts, such as when we take a collection of papers expressing opinions about topics and the opinions fall into different groups. Finally, it also lowers the overhead of reasoning with ontologies, because we can check consistency by way of non-contradiction, and this check can become a preliminary requirement for reasoning.
We also looked at the challenges of TKM. As in data mining, there are many existing techniques that can be applied, but they have to be adapted, which may not always be easy. In addition, some areas require new research. Existing techniques from areas like knowledge representation, reasoning algorithms for performing deductive and abductive inference, and knowledge-based systems could be of benefit; these are also applicable to natural language processing.
There are several differences between knowledge-based systems and TKM.
First, a knowledge-based system is built to contain as much knowledge as possible for a specific purpose. TKM treats texts as reports and does not care about any one in particular, except that together they form a dedicated information collection.
Second, a knowledge-based system tries to answer all the questions, whereas TKM assumes there are no such ready-made knowledge pieces and looks for new hypotheses.
Third, a knowledge-based system does reasoning as part of query processing; TKM does it to find new knowledge without specifying a query, though it can also choose to.
We also look at knowledge representation. Text mining requires translating text into a computationally manageable intermediate form. This step of text mining is crucial and poses several challenges common to TDM and TKM. A key problem with obtaining intermediate forms for TKM is that the techniques currently used for translating texts to intermediate forms are mainly semi-automatic and involve human interaction. On the other hand, many domains are trying to express knowledge in representation models directly. In the Semantic Web, for example, not only are ontologies used to represent knowledge, but efficient deductive inference techniques such as graph-based search are also available.
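A minimal sketch of the graph-based deduction mentioned for the Semantic Web setting, assuming the ontology's is-a edges are already in an adjacency map (the data layout is an assumption): an entailment query such as "is X a kind of Y?" then reduces to reachability in the graph.

using System;
using System.Collections.Generic;

static class GraphDeductionSketch
{
    // Does the ontology entail that 'concept' is a kind of 'ancestor', following transitive is-a edges?
    static bool Entails(Dictionary<string, List<string>> isA, string concept, string ancestor)
    {
        var seen = new HashSet<string>();
        var queue = new Queue<string>();
        queue.Enqueue(concept);
        while (queue.Count > 0)
        {
            var current = queue.Dequeue();
            if (current == ancestor) return true;
            if (!seen.Add(current)) continue;
            if (isA.TryGetValue(current, out var parents))
                foreach (var parent in parents) queue.Enqueue(parent);
        }
        return false;
    }
}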

#codingexercise

Double GetAlternateOddNumberRangeVariance(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeVariance();
}