Tuesday, January 13, 2015

Today we continue discussing TKM techniques. Let us review the discovery of contradictions. Let res(ti) be a procedure that performs resolution on the set of clauses obtained from ti, returning false when it finds a contradiction. Let ti.tj denote the concatenation of texts (that is, of their corresponding clause sets) ti and tj. Let PT be the set of all possible concatenations of subsets of T of any size, and let V represent the set of concatenations that do not contain contradictions. The algorithm for discovering contradictions then proceeds as follows. We first find a candidate text that has no internal contradiction and add it to V; this initializes the set. For each of the remaining n iterations, we pick a candidate text t' and an already-initialized text t'' from disjoint sets such that their combination contains no contradiction. We check the candidate against all subsets of text containing it to rule out contradictions, and take the union of the contradiction-free combinations. Finally, given PT, the set of all possible concatenations of subsets of T of any size, and V, the set of contradiction-free concatenations, we can exclude the concatenations that do contain contradictions.
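The level-wise search just described can be sketched as follows. This is a minimal sketch, not the paper's implementation: `res` is a toy stand-in for the resolution procedure res(ti) that only detects a unit literal appearing together with its negation, and texts are represented as named sets of clause strings.

```python
def res(clauses):
    """Toy stand-in for resolution over a clause set: returns False
    (contradiction found) iff a unit literal and its negation are present."""
    units = {c for c in clauses if isinstance(c, str)}
    return not any(('~' + u) in units for u in units)

def discover_contradictions(texts):
    """texts: dict name -> set of clauses. Returns V, the collection of
    subsets of texts whose combined clauses are contradiction-free."""
    # Initialization: single texts with no internal contradiction.
    V = [frozenset([name]) for name, clauses in texts.items() if res(clauses)]
    level = list(V)
    while level:
        nxt = set()
        for s in level:
            for name in texts:
                if name in s:
                    continue
                cand = s | {name}
                combined = set().union(*(texts[n] for n in cand))
                if res(combined):      # keep the union only when contradiction-free
                    nxt.add(cand)
        V.extend(nxt)                  # grow V one level (one set size) at a time
        level = list(nxt)
    return V
```

The subsets of T that do contain contradictions are then exactly the members of PT that never make it into V.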
We now look at the challenges of text knowledge mining. There is a good analogy with data mining: just as existing techniques were adapted to work for data mining, there are existing techniques that can work for TKM. These include techniques for knowledge representation, reasoning algorithms for performing deductive and abductive inference, and knowledge-based systems, as well as natural language processing techniques. As in the case of data mining, these techniques must be adapted for TKM. In addition, new techniques may be needed. We look at some of these challenges as well.
Knowledge-based systems and TKM systems have very different objectives, and this affects how techniques from knowledge representation and reasoning can be adapted to TKM. In a knowledge-based system, a knowledge base is built containing the knowledge the system needs to solve a specific problem. In TKM, the knowledge managed by the system is collected from texts, each of which is treated as a knowledge base in its own right (as in historical or normative reports), as opposed to something built expressly for a knowledge system. Another difference is that knowledge-based systems are intended to answer every possible question, or to solve any possible problem posed to the system, while in TKM new hypotheses and potentially useful knowledge are derived from a collection of texts. In fact, TKM may exclude, or simply not know about, some pieces of knowledge.
#codingexercise
Double GetAlternateOddNumberRangeMode(Double[] A)
{
if (A == null) return 0;
return A.AlternateOddNumberRangeMode();
}

Monday, January 12, 2015

Today we will continue our discussion on Text Knowledge Mining. We discussed a variety of techniques and how they differ from or are related to TKM. Some advanced data mining techniques that can be considered TKM are related to literature-based discovery. The idea behind literature-based discovery is this. Let A1, A2, A3, ..., An, B1, B2, ..., C be a set of concepts, and suppose that we are interested in finding relations between C and some of the Ai. Suppose that no Ai appears simultaneously with C in the collection, but both can appear simultaneously with some of the Bi. If the set of such Bi is large enough, then a transitive relation between Ai and C can be hypothesized. This can therefore be used to create new hypotheses that can later be confirmed by experimental studies.
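As a rough illustration of this A-B-C idea, the following sketch (my own toy formulation, not from the paper) takes a collection of documents, each reduced to its set of concepts, and hypothesizes an A-C relation whenever A and C never co-occur but share enough bridging concepts B:

```python
def abc_hypotheses(docs, target_c, candidates_a, min_bridges=2):
    """docs: list of concept sets, one per document. Suggest an A-C link
    when A and C never co-occur but share >= min_bridges B concepts."""
    def partners(concept):
        # All concepts that co-occur with `concept` in some document.
        out = set()
        for concepts in docs:
            if concept in concepts:
                out |= concepts - {concept}
        return out

    c_partners = partners(target_c)
    hypotheses = {}
    for a in candidates_a:
        if target_c in partners(a):
            continue                      # A and C already co-occur: nothing new
        bridges = partners(a) & c_partners
        if len(bridges) >= min_bridges:   # enough intermediate B concepts
            hypotheses[a] = bridges
    return hypotheses
```

The bridge threshold `min_bridges` plays the role of "if the set of Bi is large enough" in the text; real systems would weight the bridges rather than just count them.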
In this technique, the identification of concepts to obtain intermediate forms of documents is a key point. The main problem is that a lot of background knowledge about the text is necessary, and it varies from one domain to another. In addition, the authors state that the relation between concepts is based, in most cases, on their simultaneous appearance in the same title or abstract. Also, the generation of hypotheses does not follow a formal procedure. That said, some formal models for conjectures, consequences and hypotheses are available in the literature, which can then serve as a tool for literature-based discovery and TKM.
A text mining technique that can be considered TKM is one involving the discovery of contradictions in a collection of texts. The starting point is a set of text documents T = t1, ..., tn of arbitrary length. A first-order logic representation in clausal normal form of the content of the text is employed as the intermediate form, and a deductive inference procedure based on resolution is performed. The intermediate form is obtained by a semi-automatic procedure in two steps. First, an interlingua representation is obtained from the text. Then the first-order clausal form is obtained by means of information extraction techniques, possibly involving background knowledge.
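To make the resolution step concrete, here is a toy propositional version (the paper works with first-order clauses; this simplification is mine). Clauses are sets of literals, negation is a leading '~', and a contradiction is a derivation of the empty clause:

```python
def negate(lit):
    """Flip a literal: p <-> ~p."""
    return lit[1:] if lit.startswith('~') else '~' + lit

def resolve_pair(c1, c2):
    """All resolvents of two clauses (one complementary literal removed)."""
    out = []
    for lit in c1:
        if negate(lit) in c2:
            out.append((c1 - {lit}) | (c2 - {negate(lit)}))
    return out

def has_contradiction(clauses):
    """Saturation-based refutation: True iff the empty clause is derivable."""
    clauses = {frozenset(c) for c in clauses}
    while True:
        new = set()
        for c1 in clauses:
            for c2 in clauses:
                for r in resolve_pair(c1, c2):
                    fr = frozenset(r)
                    if not fr:
                        return True       # empty clause: contradiction found
                    if fr not in clauses:
                        new.add(fr)
        if not new:
            return False                  # saturated without the empty clause
        clauses |= new
```

For example, the clause set {p}, {~p or q}, {~q} resolves to the empty clause, so it is contradictory.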
The mining process here is similar to the level-wise, candidate-generation exploration used in frequent item set discovery, but with the minimum-support criterion replaced by non-contradiction in the knowledge base obtained from the union of the intermediate forms of the texts. Texts that do not contain a contradiction are identified first; texts containing a contradiction are not considered in the following level. Candidate generation for the next level is performed in the same way as in frequent item set mining in successive levels. The final result is a collection of subsets of texts in the collection that generate a contradiction.
#codingexercise
Double GetAlternateOddNumberRangeMedian(Double[] A)
{
if (A == null) return 0;
return A.AlternateOddNumberRangeMedian();
}

One of the interesting things about the discussion of TKM is that it seems applicable to domains other than text.

Sunday, January 11, 2015

We continue the discussion on Text Knowledge Mining. We now look at what knowledge mining does not mean; in other words, at how people use the term. For example, knowledge mining is sometimes employed to specify that the intermediate form employed contains knowledge richer than simple data structures, though inductive data mining techniques are still used on the intermediate forms and the final patterns contain only those that are interesting to the user. Another example is when knowledge mining is employed to indicate that background knowledge and user knowledge are incorporated in the mining process in order to ensure that the intermediate form and the final patterns contain only those concepts interesting to the user. In yet another example, knowledge mining refers to using background knowledge to evaluate the novel and interesting patterns after an inductive process.
Sometimes the term deductive mining is used, but it is vague, and even this differs from the proposal made so far. Deductive mining is used for a group of text mining techniques of which the best-known example is said to be information extraction. This is referred to as the mapping of natural language texts into predefined, structured representations, or templates, which when filled represent an extract of key information from the original text, or alternatively as the process of filling the fields and records of a database from unstructured text. This implies that no new knowledge is found. Therefore this falls in the deductive category based on the starting point, but the difference is that the deductive inference performed is a translation from one representation to another. One specific application, the automatic augmentation of ontology relations, is a good example of information extraction. TKM can use such techniques.
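As a heavily simplified illustration of information extraction as template filling, a single regular expression can map sentences matching a fixed pattern into a structured record. The pattern and field names here are illustrative assumptions, not taken from any cited system:

```python
import re

# A two-field "template": a cause phrase, a fixed connective, an effect phrase.
TEMPLATE = re.compile(
    r'(?P<cause>[\w ]+?) (?:can lead to|is associated with) (?P<effect>[\w ]+)'
)

def extract(sentence):
    """Fill the template from one sentence, or return None if it does not fit."""
    m = TEMPLATE.search(sentence)
    return m.groupdict() if m else None
```

For instance, "stress can lead to loss of magnesium" fills the record with cause "stress" and effect "loss of magnesium"; a sentence that does not match the pattern yields nothing, which is exactly the sense in which no new knowledge is created.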
Some have added that deductive data mining can be performed by adding deductive capabilities to mining tools in order to improve the assessment of the knowledge obtained. The knowledge is supposed to be found mathematically, taking into account hidden data assumptions.
A deductive data mining paradigm has also been proposed, where the term refers to the fact that the discovery process is performed on the results of user queries, which limits the possibility of working on corrupt data. However, in these cases the generation of new knowledge is purely inductive.
Having discussed what TKM differs from, we now see what TKM is related to. Some have alluded to non-novel investigation, such as information retrieval; semi-novel investigation, such as standard text mining and KDD for pattern discovery; and novel investigation or knowledge creation, such as intelligent text mining, which implies interactions between users and text mining tools and/or artificial intelligence techniques. In this last category, there is no indication of the kind of inference used to obtain the new knowledge.
Some advanced text mining techniques that can be considered TKM are related to literature-based discovery, where the techniques suggest possible relations between concepts appearing in the literature on a specific topic, although some authors prefer not to call this discovery and instead speak of assisting human experts in formulating new hypotheses through an interactive process.
#codingexercise
Double GetAlternateOddNumberRangeMax(Double[] A)
{
if (A == null) return 0;
return A.AlternateOddNumberRangeMax();
}
#codingexercise

Double GetAlternateOddNumberRangeAvg(Double[] A)
{
if (A == null) return 0;
return A.AlternateOddNumberRangeAvg();
}

#codingexercise

Double GetAlternateOddNumberRangeSum(Double[] A)
{
if (A == null) return 0;
return A.AlternateOddNumberRangeSum();
}

Saturday, January 10, 2015

We continue the discussion of the paper Text Knowledge Mining: An Alternative to Text Data Mining. We saw that the authors propose TKM as an approach based on abduction that includes techniques which generate new hypotheses. Data mining techniques are based on inductive inference. TDM can be both subjective and objective: when TDM is subjective, domain knowledge is typically involved; when TDM is objective, statistics is typically used. Together they represent simple facts in data structures from which new knowledge can be obtained. This is mining, and even though there is a strong association with inductive learning, the word induction is generally not used in reference to it. The prevalence of this approach implies that data mining has been useful. Data mining, however, is different from text mining in that the former depends on a data model while the latter depends on natural language. Data models have limited expressive power. Natural language is richer: it is able to represent not only simple facts but also general knowledge, from relation-like rules to complex procedures. The disadvantages of natural language are computationally expensive structures, vagueness, ambiguity, and so on. There is no restriction on the kind of techniques, whether deductive and/or abductive, that can be used in mining. By proposing text mining based on non-inductive inference, the authors propose a new direction. TKM is mining by reasoning using knowledge contained in text. TKM is a particular case of what the authors call knowledge mining, the latter defined as obtaining non-trivial, previously unknown and potentially useful knowledge from knowledge repositories. The main difference between the definition of knowledge mining and the definition of data mining is that the former intends to discover knowledge from knowledge repositories while the latter attempts to discover knowledge from data repositories.
The difference is the starting point, the basic material from which new knowledge is to be obtained. The starting point also makes the approach clear: the former uses deductive or abductive inference while the latter uses inductive inference. Reading this, it might feel like we are taking existing techniques and applying them at a higher layer. In fact there can be other layers as well. Knowledge and data are not the only layers; for example, actions are based on knowledge, and having actions recommended from knowledge is yet another stratum that we haven't reached yet. Coming back to text mining, the phases of the TKM process are exactly the same as those of TDM: first, text refining to obtain a computationally manageable intermediate form representing the text; then a mining procedure to obtain new knowledge; and a final step for assessing and filtering the knowledge obtained.
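The three phases can be sketched as a trivial pipeline; each function here is a placeholder standing in for a real component, and the refining, mining and filtering logic shown is deliberately simplistic (my own stand-ins, not the paper's):

```python
STOPWORDS = {'the', 'a', 'of'}

def refine(text):
    """Text refining: produce a computationally manageable intermediate
    form (here, just the set of lower-cased terms as a stand-in)."""
    return set(text.lower().split())

def mine(forms):
    """Mining: derive candidate knowledge from the intermediate forms
    (here, terms shared by every text, as a trivial stand-in)."""
    return set.intersection(*forms) if forms else set()

def assess(knowledge):
    """Assessment/filtering: keep only the pieces deemed interesting."""
    return {k for k in knowledge if k not in STOPWORDS}

def tkm_pipeline(texts):
    """Chain the three TKM phases: refine -> mine -> assess."""
    return assess(mine([refine(t) for t in texts]))
```

The point is only the shape of the process: whatever replaces each stage, the output of refinement feeds mining, and mining output is filtered before being reported.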
#codingexercise
Double GetAlternateOddNumberRangeMin(Double[] A)
{
if (A == null) return 0;
return A.AlternateOddNumberRangeMin();
}

Friday, January 9, 2015

Today we discuss a paper called Text Knowledge Mining: An Alternative to Text Data Mining. The authors draw a distinction with contemporary mining techniques, which they characterize as inductive inference. They term their own approach deductive inference, and they include some existing techniques in this category. They also discuss the application of existing theories in possible future research in this category.
They say that text mining has essentially been data mining on unstructured data, performed by obtaining structured datasets called intermediate forms. Some examples are:
A text unit of a word translates to an intermediate form of a bag of words or N-grams.
A concept translates to a concept hierarchy, conceptual graph, semantic graph, or conceptual dependence.
A phrase translates to N-phrases, multi-term text phrases, trends, etc.
A paragraph translates to a paragraph, N-phrases, multi-term text phrases, and trends.
A text unit of a document is retained as such.
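The first of these intermediate forms is easy to illustrate: a bag of words and word N-grams over a text unit (a minimal sketch, not tied to any particular toolkit):

```python
from collections import Counter

def bag_of_words(text):
    """Bag-of-words intermediate form: word -> frequency."""
    return Counter(text.lower().split())

def ngrams(text, n=2):
    """Word N-gram intermediate form (bigrams by default)."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
```

Both forms discard word order beyond a window of n, which is exactly the loss of expressive power discussed next.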
They argue that text data is not inherently unstructured. Rather, it is characterized by a very complex implicit structure, with very rich semantics, that has defeated many representation attempts. The use of intermediate forms in fact loses the semantics of the text, because the chosen text unit uses only a very small part of its expressive power.
The authors quote the study of causal relationships in medical literature, where an attempt to piece together causes from the titles in the MEDLINE database, in order to generate previously unknown hypotheses, actually produced good results. For example, the hypothesis that migraine is caused by magnesium deficiency was established from the following:
stress is associated with migraines
stress can lead to loss of magnesium
calcium channel blockers prevent some migraines
magnesium is a natural calcium channel blocker
spreading cortical depression is implicated in some migraines
high levels of magnesium inhibit spreading cortical depression
migraine patients have high platelet aggregability
magnesium can suppress platelet aggregability
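These eight statements chain together. As an illustration, if we hand-code them as directed links (the direction chosen here, "X leads to / is implicated in Y", is my own reading of the statements, not the study's formal model), a simple breadth-first search connects magnesium deficiency to migraine through the intermediates:

```python
from collections import deque

# Directed links distilled by hand from the eight statements above.
LINKS = {
    'stress': {'migraine', 'magnesium deficiency'},
    'magnesium deficiency': {'cortical depression', 'platelet aggregability',
                             'open calcium channels'},
    'cortical depression': {'migraine'},
    'platelet aggregability': {'migraine'},
    'open calcium channels': {'migraine'},
}

def path_exists(src, dst):
    """Breadth-first search over the link graph."""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in LINKS.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

Reachability here is only a crude proxy for the hypothesis generation in the study, but it shows how several independent statements combine into one previously unstated link.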
#codingexercise
Double GetAlternateEvenNumberRangeMedian(Double[] A)
{
if (A == null) return 0;
return A.AlternateEvenNumberRangeMedian();
}
#codingexercise
Double GetAlternateEvenNumberRangeStdDev(Double[] A)
{
if (A == null) return 0;
return A.AlternateEvenNumberRangeStdDev();
}

Thursday, January 8, 2015

Today we continue to discuss the Shasta distributed shared memory protocol. We review the effects of upgrades and data sharing. The runs were repeated just as in the previous discussion. The execution time for the base runs was measured for each application, and the other times were normalized to this time. The base set of runs uses upgrade requests and does not use sharing writeback messages. The former implies that no data is fetched on a store if the processor already has a shared copy. The latter implies that the home is not updated on 3-hop read operations. The base run was compared to a run that does not support upgrade messages; in this case, the processor generates a read-exclusive request whether or not there is a local shared copy of the line. The base run was also compared to a run where sharing writeback messages were added. The runs were also compared with varying block sizes.
The results showed that support for upgrade messages is important for a number of applications. On the other hand, sharing writeback messages typically hurt performance, which is why they were not used in the base set of runs. One application was an exception, however, because several processors read the data produced by another processor. It was also seen that larger block sizes can sometimes exacerbate the cost of the writebacks. In all these cases, we see that supporting a dirty-sharing protocol is important for achieving higher performance in Shasta.
We now look at the effects of the migratory optimizations. The execution time for the base runs was recorded, and the other times were normalized to this time. A comparison was made with the run involving migratory optimizations. The set of runs was then repeated for varying block sizes. It was seen that the migratory optimization did not provide an improvement, and even degraded performance in some cases. The degradations are slight, and could have been worse were it not for the revert mechanism and hysteresis built into the protocol. This is due to the fact that the migratory patterns are either not present at the granularity of 64-byte or larger block sizes, or the pattern is unstable. Only one application detected a large number of stable patterns; in fact, the number of upgrade misses is reduced by over 90% for this application. At the larger block sizes, the same application ends up having fewer and more unstable patterns, resulting in a slight loss of performance. Thus migratory optimizations in Shasta have had little or no benefit. Several other factors contribute to this. First, the use of upgrade messages reduces the cost of the store misses that may be eliminated. Second, exploiting release consistency is effective in hiding the latency of the upgrades. Finally, the batching optimization also leads to the merging of load and store misses to the same line within a single batch.
To summarize the results overall, support for variable-granularity communication is by far the most important optimization in Shasta. Support for upgrade messages and a dirty-sharing protocol is also important for achieving higher performance. The optimizations from the release consistency models provide smaller performance gains because the processors are busy handling messages while they wait for their own requests to complete. And lastly, migratory optimizations turn out not to be useful in the context of Shasta.
#codingexercise
Double GetAlternateEvenNumberRangeAvg(Double[] A)
{
if (A == null) return 0;
return A.AlternateEvenNumberRangeAvg();
}
#codingexercise
Double GetAlternateEvenNumberRangeMode(Double[] A)
{
if (A == null) return 0;
return A.AlternateEvenNumberRangeMode();
}

Wednesday, January 7, 2015

Today we review the effects of exploiting release consistency in the Shasta protocol we were discussing. The change in the execution time of eight- and sixteen-processor runs with a 64-byte block size was studied for different levels of optimization. For each application, the problem sizes were determined beforehand, and the corresponding execution time was used to normalize the other times. The base set of runs exploited all of the optimizations related to batching and release consistency, except that the release operations were blocking. The batching optimizations therefore included multiple outstanding misses and the merging of loads and stores to the same lines within a batch. The release consistency optimizations included non-blocking stores, eager exclusive replies, and the lockup-free optimization. The execution times were compared to a conservative implementation of sequential consistency. The execution times were also compared to a run with the addition of non-blocking releases to the optimizations in the base set of runs. The improvement of the base set of runs over conservative sequential consistency was nearly 10%. However, the addition of non-blocking releases produced little or no improvement in performance. The execution time was also broken down and studied. It comprised task time, read time, write time, synchronization time and message time. Task time includes the time for executing inline miss checks and the code necessary to enter the protocol. Read time and write time represent the stall time for read and write misses that are satisfied by other processors through the software protocol. There are also stalls on a store if there are non-contiguous stores to a pending line. Synchronization time represents the stall time for application locks and barriers, including both acquire and release times. Message time represents the time spent handling messages when the processor is not already stalled.
If the processors are stalled on data or synchronization, message handling is hidden by the read, write and synchronization time. Everything else was considered miscellaneous. The breakdowns indicate that the optimizations used in the base set of runs are effective in significantly reducing, and sometimes eliminating, the write stall time. When compared with the sequential consistency runs, the reduction in write stall time is accompanied by an increase in the other overhead categories. Even though the processors do not directly stall for stores, the pending store requests still require servicing, thereby increasing the time for read, synchronization, message and other overheads. This meant little or no overall improvement, due to the increases in the other categories. The time spent by the processors handling incoming protocol messages was also collected. The contribution of the message handling time to the total execution time is less than 10%. Nevertheless, a large portion of messages are handled while a processor is waiting for its own data and synchronization requests. The applications reported an average of 20-35% of time spent in this waiting. Therefore, the processors are heavily utilized while they wait for their own requests to complete. This also implies that when the stall time for some operations is hidden, it increases the stall time for other operations.
#codingexercise
Double GetAlternateEvenNumberRangeMax(Double[] A)
{
if (A == null) return 0;
return A.AlternateEvenNumberRangeMax();
}
#codingexercise
Double GetAlternateEvenNumberRangeSum(Double[] A)
{
if (A == null) return 0;
return A.AlternateEvenNumberRangeSum();
}