Saturday, January 17, 2015

#codingexercise
Double GetAlternateEvenNumberRangeSqRtSumOfSquares(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeSqRtSumOfSquares();
}
Today we read a paper from Hearst: Untangling Text Data Mining. Hearst reminds us that the potential for extracting information from text is virtually untapped, because text expresses a vast range of information but encodes it in a way that is difficult to decipher automatically. We recently reviewed the difference between Text Knowledge Mining and Text Data Mining. In this paper, the focus is on text data mining. The paper calls out some of the new problems encountered in computational linguistics and outlines ideas about how to pursue exploratory data analysis over text.
Hearst differentiates between TDM and Information Access. The goal of information access is to help users find documents that satisfy their information needs. The standard procedure is akin to looking for needles in a needlestack; the analogy comes from the fact that the sought information coexists with other valid pieces of information, and the problem is homing in on exactly what the user is interested in. As per Hearst, the goal of data mining, on the other hand, is to derive new information from data, finding patterns across datasets, and/or separating signal from noise. The fact that an information retrieval system can return a document that contains the information the user requested implies that no new discovery is made.
Hearst points out that text data mining is sometimes discussed together with search on the web. For example, the KDD-97 panel on data mining stated that the two predominant challenges for data mining are finding useful information on the web and discovering knowledge about a domain represented by a collection of web documents, as well as analyzing the transactions run in a web-based system. This search-centric view misses the point that the web can be considered a knowledge base from which new, never-before-encountered information can be extracted.
The results of certain types of text processing can yield tools that indirectly aid in the information access process. Examples include text clustering to create thematic overviews of a text collection.
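Since thematic overviews come up here, a rough sketch of the kind of clustering they rely on may help. The code below (C#, illustrative only) groups documents greedily by the cosine similarity of their word-count vectors; the tokenization, the 0.3 threshold, and the greedy seed-based strategy are simplifications assumed for the example, not the method Hearst describes.

using System;
using System.Collections.Generic;
using System.Linq;

static class ThematicClustering
{
    // Bag-of-words representation of a document.
    static Dictionary<string, int> TermCounts(string doc) =>
        doc.ToLower()
           .Split(new[] { ' ', ',', '.' }, StringSplitOptions.RemoveEmptyEntries)
           .GroupBy(w => w)
           .ToDictionary(g => g.Key, g => g.Count());

    static double Cosine(Dictionary<string, int> a, Dictionary<string, int> b)
    {
        double dot = a.Keys.Intersect(b.Keys).Sum(k => (double)a[k] * b[k]);
        double na = Math.Sqrt(a.Values.Sum(v => (double)v * v));
        double nb = Math.Sqrt(b.Values.Sum(v => (double)v * v));
        return dot / (na * nb);
    }

    // Greedy clustering: put each document into the first cluster whose
    // seed document is similar enough, otherwise start a new cluster.
    static List<List<string>> Cluster(List<string> docs, double threshold = 0.3)
    {
        var clusters = new List<List<string>>();
        foreach (var doc in docs)
        {
            var home = clusters.FirstOrDefault(
                c => Cosine(TermCounts(c[0]), TermCounts(doc)) >= threshold);
            if (home != null) home.Add(doc); else clusters.Add(new List<string> { doc });
        }
        return clusters;
    }

    static void Main()
    {
        var docs = new List<string>
        {
            "text mining finds patterns in text",
            "mining text for patterns and trends",
            "stock prices and market data"
        };
        foreach (var cluster in Cluster(docs))
            Console.WriteLine(string.Join(" | ", cluster));   // two thematic groups
    }
}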
#codingexercise
Double GetAlternateEvenNumberRangeSumOfSquares(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateEvenNumberRangeSumOfSquares();
}
 

Friday, January 16, 2015

#codingexercise
Double GetAlternateOddNumberRangeSqRtSumOfSquares(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeSqRtSumOfSquares();
}

We continue our discussion on Text Knowledge Mining. We were discussing reasoning. Complexity is one of the important aspects of automatic inference in knowledge based systems, and systems may trade off expressiveness of knowledge representation against complexity of reasoning. While this is true for both TDM and TKM, TKM does not face as much difficulty. First, both TDM and TKM are hard problems with exponential algorithms, and there are many strategies for designing efficient mining algorithms. Second, TKM is not intended to be exhaustive, so it can limit the data sources.
Now let us look at how to assess the results. In data mining, there are two phases for evaluating results: a first phase in which the statistical significance of the results is assessed, and a second phase in which the novelty and usefulness of the patterns are assessed subjectively.
For TKM, we have the following:
The results are considered reliable if the texts themselves are valid and reliable, and if sound inference procedures are applied.
The user decides whether the results are non-trivial; since deduction only makes explicit what the text already entails, there is a risk that results are trivial, so non-triviality is evaluated by the expert user.
The novelty of the results is evaluated against the background knowledge (BK).
The usefulness is evaluated by the experts.
The authors conclude that TKM is a particular case of knowledge mining: knowledge mining deals with knowledge, while data mining deals with data in which the knowledge is implicit. The operations in TKM are therefore deductive and abductive inference.

Thursday, January 15, 2015

We continue with our discussion on Text Knowledge Mining. We were discussing knowledge representation. Text mining requires translating text into a computationally manageable intermediate form. This step is crucial and poses several challenges to TDM and TKM, like obtaining intermediate forms, defining structures for information extraction, identifying semantic links between new concepts, relating different identifiers, etc. A key problem with obtaining intermediate forms is that the current methods require human interaction. The objective of an intermediate form is not to represent every possible aspect of the semantic content of a text, but only those aspects related to the kind of inference we are interested in. That is why, even if we are not able to fully analyze the text, the only impact is some missing pieces of knowledge, and TKM is not expected to be exhaustive in obtaining new knowledge. That said, if the analyzer obtains an inexact representation of the text in an intermediate form, this can affect the knowledge discovered, even to the point of reporting false discoveries. The knowledge in a knowledge based system is assumed to be reliable and consistent, but the same cannot be said for a collection of text. This poses another challenge for TKM.
We now look at the role of background knowledge. The importance of including background knowledge in text mining applications is widely recognized. In TKM too, text does not contain the common-sense knowledge and specific domain knowledge that is necessary in order to perform TKM. As an example, a text containing "A is the father of B" and "B is the father of A" is not considered contradictory without background knowledge. This knowledge can contribute to TKM in the following ways. First, it allows us to create new knowledge that is not fully contained in the collection of text but can be derived from a combination of text and background knowledge; this was a requirement of intelligent text mining. Another interesting application of background knowledge is in the assessment of knowledge, in aspects like novelty and importance.
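To make the father-of example concrete, here is a minimal sketch that encodes the two extracted statements as triples and a single background-knowledge rule saying that fatherOf is asymmetric. The Fact type, the rule, and the names are all assumptions made for the illustration; the point is only that two individually harmless statements become a detectable contradiction once the rule is available.

using System;
using System.Collections.Generic;
using System.Linq;

// A fact extracted from text, e.g. ("A", "fatherOf", "B").
record Fact(string Subject, string Predicate, string Object);

static class BackgroundKnowledgeDemo
{
    // Background-knowledge rule (assumed): fatherOf is asymmetric, so
    // fatherOf(x, y) and fatherOf(y, x) cannot both hold.
    static bool ViolatesAsymmetry(IEnumerable<Fact> facts, string predicate)
    {
        var pairs = new HashSet<(string, string)>(
            facts.Where(f => f.Predicate == predicate)
                 .Select(f => (f.Subject, f.Object)));
        return pairs.Any(p => pairs.Contains((p.Item2, p.Item1)));
    }

    static void Main()
    {
        var extracted = new List<Fact>
        {
            new Fact("A", "fatherOf", "B"),   // from one sentence
            new Fact("B", "fatherOf", "A")    // from another sentence
        };
        // Without the rule the two facts look consistent; with it the
        // contradiction becomes detectable.
        Console.WriteLine(ViolatesAsymmetry(extracted, "fatherOf")
            ? "Contradiction detected using background knowledge"
            : "No contradiction found");
    }
}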
Reasoning and complexity are also important aspects of automatic inference in knowledge based systems.
#codingexercise

Double GetAlternateOddNumberRangeSumOfSquares(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeSumOfSquares();
}

Wednesday, January 14, 2015

#codingexercise
Double GetAlternateOddNumberRangeStdDev(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeStdDev();
}

We continue our discussion on Text Knowledge Mining. We discussed the algorithm for finding contradictions. Looking for contradictions is very useful. It helps check the consistency of a text collection, or assess the validity of a new text to be incorporated in terms of the knowledge already contained in the collection. In addition, we can use it to group texts, such as when we take a collection of papers expressing opinions about a topic and separate them by the opinions they express. Finally, it also lowers the overhead of reasoning with ontologies, because consistency can be checked by way of non-contradiction, and this check can become a preliminary requirement for reasoning.
We also looked at the challenges of TKM. As in data mining, there are many existing techniques that can be applied, but they have to be adapted, and this may not always be easy. In addition, some areas require new research. Existing techniques that can be leveraged include those for knowledge representation, reasoning algorithms for performing deductive and abductive inference, and knowledge based systems, as well as natural language processing.
There are several differences between knowledge based systems and TKM.
First, a knowledge based system is built to contain as much knowledge as possible for a specific purpose, whereas TKM treats each text in the collection as its own knowledge base, as with historical or normative reports, rather than as part of a purpose-built knowledge system.
Second, a knowledge based system tries to answer every question posed to it, whereas TKM makes no such assumption about the available knowledge pieces and instead derives new hypotheses.
Third, a knowledge based system reasons as part of query processing, whereas TKM reasons to find new knowledge without a specific query, though it can also be query-driven.
We also look at knowledge representation. Text mining requires translating text into a computationally manageable intermediate form. This step of text mining is crucial and poses several challenges common to TDM and TKM. A key problem in obtaining intermediate forms for TKM is that the currently used techniques for translating texts to intermediate forms are mainly semi-automatic, involving human interaction. On the other hand, many domains are trying to express knowledge representation models directly. In the Semantic Web, for example, not only are ontologies used to represent knowledge, but efficient deductive inference techniques such as graph-based search are also available.
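As a rough illustration of graph-based deductive inference over an ontology, the sketch below derives all superclasses of a concept by walking subClassOf edges with a breadth-first search. The mini-ontology and the relation name are invented for the example; real Semantic Web reasoners handle far richer constructs.

using System;
using System.Collections.Generic;

static class OntologyInference
{
    // Derive all (transitive) superclasses of a concept by BFS over direct
    // subClassOf edges -- a simple form of deductive inference on a graph.
    static HashSet<string> Superclasses(
        Dictionary<string, List<string>> subClassOf, string concept)
    {
        var result = new HashSet<string>();
        var queue = new Queue<string>();
        queue.Enqueue(concept);
        while (queue.Count > 0)
        {
            var current = queue.Dequeue();
            if (!subClassOf.TryGetValue(current, out var parents)) continue;
            foreach (var parent in parents)
                if (result.Add(parent)) queue.Enqueue(parent);
        }
        return result;
    }

    static void Main()
    {
        // Hypothetical mini-ontology: Dog -> Mammal -> Animal.
        var subClassOf = new Dictionary<string, List<string>>
        {
            ["Dog"] = new List<string> { "Mammal" },
            ["Mammal"] = new List<string> { "Animal" }
        };
        // Infers the implicit fact that Dog is also a subclass of Animal.
        Console.WriteLine(string.Join(", ", Superclasses(subClassOf, "Dog")));
    }
}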

#codingexercise

Double GetAlternateOddNumberRangeVariance(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeVariance();
}


Tuesday, January 13, 2015

Today we continue discussing TKM techniques. Let us review the discovery of contradictions. Let res(ti) be a procedure performing resolution on the set of clauses obtained from ti, returning false in case it finds a contradiction. Let ti tj denote the concatenation of the texts (that is, of their corresponding clause sets) ti and tj. Let PT be the set of all possible concatenations of subsets of T of any size, and let V represent the set of concatenations that do not contain contradictions. Then the algorithm for discovering contradictions proceeds level-wise. We initialize V1 with the individual texts that contain no contradiction and add them to V. In each of the remaining iterations, we form candidates by combining a concatenation t' already found to be contradiction-free with a text t'' disjoint from it, check all subsets of the candidate to rule out known contradictions, and add the candidate to V when its concatenation contains no contradiction. Finally, given PT, the set of all possible concatenations of subsets of T of any size, and V, the concatenations without contradictions, we can single out the ones that do contain contradictions.
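A simplified sketch of this level-wise exploration is given below. It represents texts by indices, uses a stubbed ContainsContradiction check standing in for res(), extends each contradiction-free subset by one more text at every level, and reports the candidates that turn out to be contradictory. It omits the full all-subsets pruning step, and all names are illustrative.

using System;
using System.Collections.Generic;
using System.Linq;

static class ContradictionDiscovery
{
    // Stand-in for res(): should return true if the union of the clause
    // sets of the given texts yields a contradiction via resolution.
    static bool ContainsContradiction(IEnumerable<int> texts) =>
        /* plug in a real resolution procedure here */ false;

    // Level-wise search: keep contradiction-free subsets (as with frequent
    // itemsets) and report the candidate subsets that are contradictory.
    static List<List<int>> FindContradictorySubsets(int n, int maxSize)
    {
        var contradictory = new List<List<int>>();
        // Level 1: single texts that contain no contradiction.
        var level = Enumerable.Range(0, n)
            .Where(i => !ContainsContradiction(new[] { i }))
            .Select(i => new List<int> { i })
            .ToList();
        for (int size = 2; size <= maxSize && level.Count > 0; size++)
        {
            var next = new List<List<int>>();
            foreach (var subset in level)
                for (int t = subset.Max() + 1; t < n; t++)
                {
                    var candidate = subset.Append(t).ToList();
                    if (ContainsContradiction(candidate))
                        contradictory.Add(candidate);   // contradictory subset found
                    else
                        next.Add(candidate);            // survives to the next level
                }
            level = next;
        }
        return contradictory;
    }

    static void Main()
    {
        // With the stub above (never contradictory), nothing is reported.
        foreach (var subset in FindContradictorySubsets(n: 4, maxSize: 3))
            Console.WriteLine(string.Join(",", subset));
    }
}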
We now look at the challenges of text knowledge mining. There is a good analogy with data mining: just as existing techniques could be adapted to data mining, there are existing techniques that can work for TKM. These include techniques for knowledge representation, reasoning algorithms for performing deductive and abductive inference, and knowledge based systems, as well as natural language processing techniques. As in the case of data mining, these techniques must be adapted for TKM. In addition, new techniques may be needed. We look at some of these challenges as well.
Knowledge based systems and TKM systems have very different objectives, and this affects how techniques coming from knowledge representation and reasoning can be adapted to TKM. In a knowledge based system, a knowledge base is built containing the knowledge needed by the system to solve a specific problem. In TKM, the knowledge managed by the system is collected from texts, each of which is treated as a knowledge base in its own right (as with historical or normative reports) rather than as material for building a knowledge system. Another difference is that knowledge based systems are intended to answer every possible question, or to solve any possible problem posed to the system, while in TKM, new hypotheses and potentially useful knowledge are derived from a collection of text. In fact, TKM may exclude, or simply not know about, a set of knowledge pieces.
#codingexercise
Double GetAlternateOddNumberRangeMode(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeMode();
}

Monday, January 12, 2015

Today we will continue our discussion on Text Knowledge Mining. We discussed a variety of techniques and how they differ from or are related to TKM. Some advanced data mining techniques that can be considered TKM are related to literature-based discovery. The idea behind literature-based discovery is this. Let A1, A2, A3, ..., An, B1, B2, ..., and C be a set of concepts, and suppose that we are interested in finding relations between C and some of the Ai. Suppose that no Ai and C appear simultaneously in the collection, but each can appear simultaneously with some of the Bi. If the set of shared Bi is large enough, then a transitive relation between Ai and C can be hypothesized. This can therefore be used to create new hypotheses that can later be confirmed by experimental studies.
In this technique, the identification of concepts to obtain intermediate forms of documents is a key point. The main problem with this is that a lot of background knowledge about the text is necessary, and it varies from one domain to another. In addition, the authors state that the relation between concepts is in most cases based on their simultaneous appearance in the same title or abstract. Also, the generation of hypotheses does not follow a formal procedure. That said, some formal models for conjectures, consequences, and hypotheses are available in the literature, and these can serve as tools for literature-based discovery and TKM.
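A minimal sketch of the ABC-style co-occurrence idea described above, assuming each document has already been reduced to the set of concepts it mentions: it proposes a link between the target concept C and a concept A whenever the two never co-occur directly but share sufficiently many intermediate concepts B. The helper names, the threshold, and the tiny document set echoing the classic fish oil / Raynaud example are all illustrative.

using System;
using System.Collections.Generic;
using System.Linq;

static class LiteratureBasedDiscovery
{
    // docs: each document represented as the set of concepts it mentions.
    // Returns concepts A that never co-occur with target C but share at
    // least minShared intermediate concepts B with it.
    static List<string> HypothesizeLinks(
        List<HashSet<string>> docs, string target, int minShared)
    {
        HashSet<string> CoOccurring(string c) => new HashSet<string>(
            docs.Where(d => d.Contains(c)).SelectMany(d => d).Where(x => x != c));

        var bSet = CoOccurring(target);                       // concepts B linked to C
        var allConcepts = new HashSet<string>(docs.SelectMany(d => d));

        return allConcepts
            .Where(a => a != target && !bSet.Contains(a))     // A never co-occurs with C
            .Where(a => CoOccurring(a).Intersect(bSet).Count() >= minShared)
            .ToList();
    }

    static void Main()
    {
        var docs = new List<HashSet<string>>
        {
            new HashSet<string> { "fish oil", "blood viscosity" },
            new HashSet<string> { "fish oil", "platelet aggregation" },
            new HashSet<string> { "Raynaud disease", "blood viscosity" },
            new HashSet<string> { "Raynaud disease", "platelet aggregation" }
        };
        // Suggests "fish oil" as a candidate related to "Raynaud disease".
        Console.WriteLine(string.Join(", ",
            HypothesizeLinks(docs, "Raynaud disease", minShared: 2)));
    }
}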
A text mining technique that can be considered TKM is one involving the discovery of contradictions in a collection of texts. The starting point is a set of text documents T = t1, ..., tn of arbitrary length. A first-order logic representation in clausal normal form of the content of the text is employed as the intermediate form, and a deductive inference procedure based on resolution is performed. The intermediate form is obtained by a semi-automatic procedure in two steps. First, an interlingua representation is obtained from the text. Then, the first-order clausal form is obtained by means of information extraction techniques, possibly involving background knowledge.
The mining process here is similar to the level-wise plus candidate-generation exploration used in frequent itemset discovery, but with the minimum support criterion replaced by non-contradiction of the knowledge base obtained by the union of the intermediate forms of the texts. Texts that do not contain a contradiction are identified first; texts containing a contradiction are not considered at the following level. Candidate generation for the next level is performed in the same way as in frequent itemset mining at successive levels. The final result is a collection of subsets of texts in the collection that generate a contradiction.
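As a rough illustration of the resolution-based check underlying this process, the sketch below applies propositional resolution to a set of clauses (each clause a set of literals, with negation written as a leading '-') and reports a contradiction when the empty clause is derived. It is a toy propositional stand-in for the first-order procedure described above, and the sample clauses are a hand-made encoding of the father-of example.

using System;
using System.Collections.Generic;
using System.Linq;

static class ResolutionCheck
{
    // "-p" is the negation of "p".
    static string Negate(string lit) =>
        lit.StartsWith("-") ? lit.Substring(1) : "-" + lit;

    // Saturate the clause set with resolvents; deriving the empty clause
    // means the clauses (and hence the texts they came from) contradict.
    static bool IsContradictory(List<HashSet<string>> clauses)
    {
        var all = clauses.Select(c => new HashSet<string>(c)).ToList();
        bool added = true;
        while (added)
        {
            added = false;
            for (int i = 0; i < all.Count; i++)
                for (int j = i + 1; j < all.Count; j++)
                    foreach (var lit in all[i].Where(l => all[j].Contains(Negate(l))).ToList())
                    {
                        // Resolvent: (clause i without lit) union (clause j without -lit).
                        var resolvent = new HashSet<string>(all[i]);
                        resolvent.Remove(lit);
                        var other = new HashSet<string>(all[j]);
                        other.Remove(Negate(lit));
                        resolvent.UnionWith(other);
                        if (resolvent.Count == 0) return true;          // empty clause
                        if (!all.Any(c => c.SetEquals(resolvent)))
                        {
                            all.Add(resolvent);
                            added = true;
                        }
                    }
        }
        return false;
    }

    static void Main()
    {
        // Hand-made propositional encoding: p, q, and (not p or not q).
        var clauses = new List<HashSet<string>>
        {
            new HashSet<string> { "p" },
            new HashSet<string> { "q" },
            new HashSet<string> { "-p", "-q" }
        };
        Console.WriteLine(IsContradictory(clauses)); // True
    }
}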
#codingexercise
Double GetAlternateOddNumberRangeMedian(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeMedian();
}

One of the interesting things about the discussion of TKM is that it seems applicable to domains other than text.

Sunday, January 11, 2015

We continue the discussion on Text Knowledge Mining. We now look at what knowledge mining does not mean; in other words, we look at how people are using the term. For example, knowledge mining is sometimes employed to specify that the intermediate form contains knowledge richer than simple data structures, even though inductive data mining techniques are used on the intermediate forms and the final patterns contain only those that are interesting to the user. Another example is when knowledge mining is employed to indicate that background knowledge and user knowledge are incorporated in the mining process in order to ensure that the intermediate form and the final patterns contain only those concepts interesting for the user. In yet another example, knowledge mining refers to using background knowledge to evaluate the novel and interesting patterns after an inductive process.
Sometimes another term, deductive mining, is used, but it is vague, and even this differs from the proposal made so far. Deductive mining is used for a group of text mining techniques, the best-known example of which is said to be information extraction. This is referred to as the mapping of natural language texts into predefined, structured representations, or templates, which, when filled, represent an extract of key information from the original text, or alternatively as the process of filling the fields and records of a database from unstructured text. This implies that no new knowledge is found. Therefore, it falls in the deductive category based on the starting point, but the difference is that the deductive inference performed is a translation from one representation to another. One specific application, the automatic augmentation of ontology relations, is a good example of information extraction. TKM can use such techniques.
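As a toy illustration of template filling, the sketch below maps sentences of the form "X is the father of Y" into a structured record using a single hand-written regular expression. The template, the pattern, and the sample sentence are invented for the example; real information extraction systems are far richer, but the point stands that the text is only translated into another representation and no new knowledge is created.

using System;
using System.Text.RegularExpressions;

// Hypothetical template with two slots to be filled from text.
record FatherOfTemplate(string Parent, string Child);

static class TemplateFilling
{
    static FatherOfTemplate TryFill(string sentence)
    {
        // One hand-written extraction pattern; the output is an extract of
        // key information from the original text, nothing more.
        var m = Regex.Match(sentence, @"^(\w+) is the father of (\w+)\.?$");
        return m.Success
            ? new FatherOfTemplate(m.Groups[1].Value, m.Groups[2].Value)
            : null;
    }

    static void Main()
    {
        var filled = TryFill("John is the father of Mary.");
        Console.WriteLine(filled is null ? "no match" : $"{filled.Parent} -> {filled.Child}");
    }
}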
Some have suggested that deductive data mining be performed by adding deductive capabilities to mining tools in order to improve the assessment of the knowledge obtained. The knowledge is supposed to be found mathematically, taking into account hidden data assumptions.
A deductive data mining paradigm has also been proposed, where the term refers to the fact that the discovery process is performed on the result of user queries that limit the possibility of working on corrupt data. However, in these cases, the generation of new knowledge is purely inductive.
Having discussed what TKM is different from, we now see what TKM is related to. Some have alluded to non-novel investigation like information retrieval, semi-novel investigation like standard text mining and KDD for pattern discovery, and novel investigation or knowledge creation like Intelligent Text Mining, which implies interaction between users and text mining tools and/or artificial intelligence techniques. In the last category, there is no indication of the kind of inference used to obtain the new knowledge.
Some advanced text mining techniques that can be considered TKM are related to literature-based discovery, where the techniques suggest possible relations between concepts appearing in the literature about a specific topic, although some authors prefer not to call this discovery and instead describe it as assisting human experts in formulating new hypotheses through an interactive process.
#codingexercise
Double GetAlternateOddNumberRangeMax(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeMax();
}
#codingexercise

Double GetAlternateOddNumberRangeAvg(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeAvg();
}

#codingexercise

Double GetAlternateOddNumberRangeSum(Double[] A)
{
    if (A == null) return 0;
    return A.AlternateOddNumberRangeSum();
}