Friday, June 6, 2014

The previous post discussed a paper that formalized the use of graphs for describing reasoning chain relationships. It introduced the TRN graph and described two types of nodes and three types of edges. The nodes were either text nodes or section nodes, and the edges were structural, similarity and causal. The graph is constructed from the text with the help of a syntactic parser, which adds the text nodes; section nodes are attached to the sentence nodes, and all the other text nodes are added under these. Similarity measures help add similarity edges between node pairs, and only those pairs whose similarity is above a threshold get an edge. Graphs that are extracted from different sections are combined. Next, causal relations are added with the help of a discourse parser. With the structural, similarity and causal edges added, we have a complete TRN graph. With this graph we can then choose source-destination pairs and find paths. The paths are combined, post-processed and visualized.
The dataset used for this study was the aviation investigation reports from the Transportation Safety Board of Canada. Each report in this collection describes an incident and has sections titled summary, factual information, analysis and findings. The documents are available for download from the TSB of Canada, and the authors pre-processed them with the GATE NLP platform; the preprocessing steps included tokenization, sentence splitting and part-of-speech tagging. The summary section is a brief description, the factual information section gives the details of the aircraft and the circumstances, the analysis section is a discussion of what might have happened, and the findings section describes the causes. The reasoning chain was therefore chosen as the shortest path from Summary to Causes. The constraints on the reasoning chain path were defined in Backus-Naur Form as a summary-path with an edge to a factual-path, with an edge to an analysis-path, with an edge to the causes-path, where the intermediate paths are optional. Each of these paths is recursively defined as one or more nodes connected by edges. While these edges can be any of the aforementioned edge types (PartOf, Contains, Similar and Cause), only the causes-path explicitly has PartOf edges. A post-processing step shrinks the path by eliminating transitive edges, and the compressed graphs are then laid out for visualization.
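The BNF constraint can be read as a simple pattern over the sections that a path visits. Below is a toy sketch of that check (my own illustration, not the paper's code), assuming each node on a candidate path is labeled with the section it came from; the regex encodes a summary segment, optional factual and analysis segments, and a causes segment.

```python
import re

# S = summary, F = factual information, A = analysis, C = causes (findings).
# Intermediate segments are optional, per the BNF described above.
SECTION_PATTERN = re.compile(r"^S+F*A*C+$")

def is_reasoning_chain(path_sections):
    """path_sections: e.g. ['summary', 'factual', 'analysis', 'causes']."""
    labels = "".join(s[0].upper() for s in path_sections)
    return bool(SECTION_PATTERN.match(labels))

print(is_reasoning_chain(["summary", "analysis", "causes"]))  # True
print(is_reasoning_chain(["factual", "causes"]))              # False, no summary
```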

We read the paper "Automatic Extraction of Reasoning Chains from Textual Reports" by Sizov and Ozturk. They extract reasoning chains from documents such as medical patient reports and accident reports. Reasoning chains contain information useful for analyzing the problem at hand. The authors use a graph-based approach that makes the relations between textual units explicit. Reasoning chains are a sequence of transitions from one piece of information to another that connect the problem with its resolution, and are thus very valuable when triaging future problems. Moreover, they help determine the validity or fallaciousness of the arguments presented.

This paper decomposes a document into text units, discovers the connections between the text units and makes them explicit. The TRN is automatically acquired from text using a syntactic parser and a semantic similarity measure, and a reasoning chain is extracted as a path from this graph-based text representation. We now discuss the TRN in detail. The TRN is a graph with two types of nodes - text nodes and section nodes - and three types of edges - structural, similarity and causal. The graph is constructed from the text in the following manner: 1) syntax trees obtained from a syntactic parser are added to the graph, 2) section nodes are attached to the sentence nodes, 3) similarity edges are added between similar text nodes and 4) cause relations identified by a discourse parser are added.

When representing text units, graph-based approaches have often used terms or short phrases, which are too small for this approach. Another popular choice is to use a whole sentence or a few sentences; however, such text units may contain several pieces of information of which only one is relevant for the reasoning chain. In this paper, the authors use the Stanford parser to extract the syntax tree and pick out the S (sentence, clause), NP (noun phrase) and VP (verb phrase) constituents, which become the text nodes. Each text node is identified by its set of stemmed words. Along with these, structural relations such as Contains and PartOf are added as edges, and the graphs extracted from the different sentences in a document are combined into one.

To obtain similarity relations, a similarity value is computed for each pair of text nodes of the same category (S, VP, NP), and Similar edges are added to the graph for node pairs whose similarity is above a threshold. Their similarity measure finds a one-to-one alignment of the words from the two text units. The Leacock-Chodorow (LCH) similarity for word pairs is computed using the shortest path between the corresponding senses in WordNet. Then a complete bipartite graph is constructed whose nodes represent words from the two text units and whose edge weights correspond to the similarities between the words. A maximum weighted bipartite matching finds a one-to-one alignment that maximizes the sum of similarities between aligned words. This sum is then normalized and thresholded to add the corresponding Similar edges to the graph. Next, causal relations are added to the graph. Causal relations are found by a discourse parser; in this case the authors used a PDTB-style end-to-end discourse parser, and the Cause relations it finds are added to the TRN graph as Cause edges. This is how the TRN graph is built.
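To make the similarity step concrete, here is a rough sketch of the alignment-based measure, assuming NLTK's WordNet interface and SciPy's assignment solver stand in for whatever the authors actually used; the normalization at the end is my own choice and the paper's may differ.

```python
# Sketch only: align words of two text units one-to-one by maximum weighted
# bipartite matching over word-to-word LCH similarities, then normalize.
# Assumes the WordNet corpus is installed (nltk.download('wordnet')).
import numpy as np
from nltk.corpus import wordnet as wn
from scipy.optimize import linear_sum_assignment

def word_sim(w1, w2):
    """Best LCH similarity over the noun senses of the two words (0.0 if none)."""
    best = 0.0
    for s1 in wn.synsets(w1, pos=wn.NOUN):
        for s2 in wn.synsets(w2, pos=wn.NOUN):
            sim = s1.lch_similarity(s2)
            if sim is not None and sim > best:
                best = sim
    return best

def text_unit_similarity(words1, words2):
    # Complete bipartite graph: rows are words1, columns are words2.
    weights = np.array([[word_sim(a, b) for b in words2] for a in words1])
    rows, cols = linear_sum_assignment(weights, maximize=True)
    total = weights[rows, cols].sum()
    # Normalize by the size of the smaller unit (one possible normalization).
    return total / max(1, min(len(words1), len(words2)))

print(text_unit_similarity(["engine", "failure"], ["engine", "malfunction"]))
```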
Once the TRN graph is constructed, the reasoning chain is generated in a three-stage process:
1) a report is converted from text into a TRN graph,
2) given a start and an end node, several paths are extracted from the graph (see the sketch below), and
3) paths are combined, post-processed and visualized.
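As an illustration of stage 2, here is a minimal sketch (mine, not the paper's code) that extracts a few candidate paths between a chosen source and destination node; shortest simple paths are one straightforward choice.

```python
# Toy sketch of path extraction between a source and a destination node.
import networkx as nx

def extract_reasoning_paths(trn, start, end, k=3):
    """Return up to k shortest simple paths from start to end in the TRN graph."""
    paths = []
    try:
        for path in nx.shortest_simple_paths(trn, start, end):
            paths.append(path)
            if len(paths) == k:
                break
    except nx.NetworkXNoPath:
        pass
    return paths

# Usage with a hypothetical TRN fragment (section and text nodes as strings).
g = nx.Graph()
g.add_edges_from([
    ("Summary", "s1"), ("s1", "s2"), ("s2", "Analysis"),
    ("Analysis", "s3"), ("s3", "Causes"),
])
print(extract_reasoning_paths(g, "Summary", "Causes"))
```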

Thursday, June 5, 2014

In the previous post we saw that we could represent distributional similarity and contextual similarity with data structures. The first is represented by three graphs. We first create a term-context feature graph with T the set of terms, C the set of context features, and edges based on the counts of terms and context features. Based on this graph, a first-order graph is created, denoted by T, C and the significance score sig. The significance score sig is calculated for each pair (t, c) using a metric called the Lexicographer's Mutual Information, LMI(t, c) = count(t, c) * log2( (count(t, c) * N) / (count(t) * count(c)) ), where N is the total number of term-context pairs; this measure is described by Evert (2004). Edges with LMI < 0 are removed and only the p most significant pairs per term t are retained. Features that are overly general because they occur with too many terms are also removed.
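To make this concrete, here is a small sketch (my own code, not the authors') of the first-order graph construction: score each (term, context) pair with LMI, drop negatively scored pairs, keep only the p most significant features per term, and drop overly general features. The defaults for p and the generality cutoff are placeholders.

```python
import math
from collections import defaultdict

def first_order_graph(pair_counts, p=1000, max_terms_per_feature=1000):
    """pair_counts: dict mapping (term, context_feature) -> count."""
    term_count, ctx_count, total = defaultdict(int), defaultdict(int), 0
    for (t, c), n in pair_counts.items():
        term_count[t] += n
        ctx_count[c] += n
        total += n

    # Local mutual information per Evert (2004): observed count times the log
    # of observed over expected, where expected = count(t) * count(c) / N.
    def lmi(t, c, n):
        expected = term_count[t] * ctx_count[c] / total
        return n * math.log2(n / expected)

    scored = defaultdict(list)            # term -> [(sig, context_feature)]
    for (t, c), n in pair_counts.items():
        sig = lmi(t, c, n)
        if sig >= 0:                      # drop edges with negative significance
            scored[t].append((sig, c))

    graph = defaultdict(dict)             # term -> {context_feature: sig}
    feature_terms = defaultdict(set)
    for t, pairs in scored.items():
        for sig, c in sorted(pairs, reverse=True)[:p]:   # keep p best per term
            graph[t][c] = sig
            feature_terms[c].add(t)

    # Drop overly general features, i.e. those shared by too many terms.
    for c, terms in feature_terms.items():
        if len(terms) > max_terms_per_feature:
            for t in terms:
                graph[t].pop(c, None)
    return graph
```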
The second-order similarity graph between terms defines the similarity score as the number of salient features the two terms share. This approach does not count how often a feature occurs with a term but instead uses a significance ranking, which makes it more scalable to large data because the full list of features for a term pair is never needed at any one time.
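A small sketch (my own) of how the second-order graph can be computed from the pruned first-order graph by counting shared salient features:

```python
from collections import defaultdict
from itertools import combinations

def second_order_graph(first_order):
    """first_order: {term: {context_feature: sig}} from the first-order graph."""
    terms_per_feature = defaultdict(set)
    for term, features in first_order.items():
        for c in features:
            terms_per_feature[c].add(term)

    similarity = defaultdict(int)   # (term1, term2) -> number of shared features
    for c, terms in terms_per_feature.items():
        for t1, t2 in combinations(sorted(terms), 2):
            similarity[(t1, t2)] += 1
    return similarity
```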
The contextual similarity is calculated differently, again by a ranking method. Given the set of candidate expansions from the second-order graph, the goal is to rank the terms most similar in context higher and the non-compatible ones lower. First, the terms and context features for the given target word are extracted from the document as pairs by the holing operator. Then a new graph is defined over the context features CON that belong to the previous graph. For the target word, the second-order similarity graph is looked up to find the top n similar terms, and these are added alongside CON. The edges between these terms and the context features are labeled with the significance score from the first-order graph; edges not present in the first-order graph get no significance score. A ranking score for each candidate term is then calculated with a harmonic mean and a smoothing factor, and the entries are re-ordered based on this ranking score.
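Here is a hedged sketch of that ranking (my own reading; the paper's exact scoring may differ): the candidates come from the second-order graph, and each is scored by the harmonic mean of its first-order significance scores over the observed context features, with a small additive smoothing.

```python
def contextual_ranking(target, context_features, first_order, top_n_similar, eps=1e-3):
    """
    first_order:    {term: {context_feature: sig}}         (first-order graph)
    top_n_similar:  {term: [globally most similar terms]}  (from the DT)
    """
    candidates = top_n_similar.get(target, [])
    if not context_features:
        return candidates
    ranked = []
    for cand in candidates:
        sigs = [first_order.get(cand, {}).get(c, 0.0) for c in context_features]
        smoothed = [s + eps for s in sigs]                        # avoid division by zero
        score = len(smoothed) / sum(1.0 / s for s in smoothed)    # harmonic mean
        ranked.append((score, cand))
    return [cand for _, cand in sorted(ranked, reverse=True)]
```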

Wednesday, June 4, 2014


We saw that global and local scopes can matter in finding similarities. We now discuss a different paper that takes this approach: the paper on a graph-based contextualization method using distributional thesauri by Biemann and Riedl. It describes a new contextual similarity method that generates, for each term occurrence in a text, a ranked list of terms that are semantically similar and compatible with the context. The framework is instantiated by the definitions of term and context, and the dependencies are parsed to obtain the context. The contextual similarity based approach outperformed the non-contextualized baseline across all parts of speech. Moreover, this is an unsupervised generative method for similarity in context and does not rely on the existence of lexical resources as a source of candidate expansions. The approach makes use of an existing distributional thesaurus (DT) whose entries consist of a ranked list of the globally most similar terms for a target term; it uses the DT in a graph representation and moves from a global notion of similarity to a local one. The prevalent way to represent word meanings has been the vector space model built from corpora. This approach, in contrast, does not require domain knowledge and is unsupervised in the sense that it does not require training.
They introduce an operation called holing, which splits any kind of observation over the dependencies into pairs of terms and context features. These pairs are then used both for calculating the distributional similarity and for finding the contextual similarity. The operator is applied after the system's preprocessing steps, and the position of the term removed from the observation is marked with the @ symbol, which gives a more unified representation of the dependencies.
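A toy illustration of the holing operation on a single dependency observation (my example, not the paper's code):

```python
# Split one dependency observation into (term, context feature) pairs,
# replacing the position of the extracted term with the @ hole marker.
def holing(relation, head, dependent):
    return [
        (head, f"{relation}(@, {dependent})"),
        (dependent, f"{relation}({head}, @)"),
    ]

print(holing("nsubj", "landed", "aircraft"))
# [('landed', 'nsubj(@, aircraft)'), ('aircraft', 'nsubj(landed, @)')]
```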

The distributional similarity is represented using three graphs. The first is a bipartite term-context feature graph TC, with T the set of terms, C the set of context features, and edges e(t, c, f) where the frequency f = count(t, c). Based on the graph TC, a first-order graph with edges e(t, c, sig) is created, where the significance score sig is calculated for each pair (t, c) from the counts using the formula defined in Evert (2004).
All edges whose score sig(t, c) is below zero are removed, only the p most significant pairs per term t are kept, and the remaining edges are dropped. Additionally, overly general features that occur with more than a thousand terms are removed.
The second-order similarity graph between terms is defined as SO(T, E), where for t1, t2 belonging to T the similarity score is the number of salient features the two terms share. This graph gives the distributional similarity.
In the paper titled "Merging Word Senses" by Bhagwani, Satapathy and Karnick, a semi-supervised approach that learns a similarity measure over WordNet synsets using a graph-based recursive similarity definition is presented. The learnt measure provides sense similarities for all word sense pairs, and from these the method produces coarse sense inventories at arbitrary granularities.
Unlike previous approaches that generated a coarse sense inventory by merging fine-grained senses, this approach proposes a framework that learns a synset similarity metric. The problems with the earlier approaches included the following: first, they require a stopping criterion for each word, such as the number of final classes, and this cannot usually be predetermined; second, inconsistent clusters are obtained because coarse senses are generated independently for each word, and the inconsistent clustering causes transitive closure errors. The approach discussed in this paper improves the coarsening of noun synsets. To learn similarities between synset pairs that do not share a word, the authors use a variant of the SimRank framework that gives them a non-zero similarity. SimRank is a graph-based similarity measure applicable in any domain with object-to-object relationships; its intuition is that two objects are similar if they are related to similar objects. The SimRank equations can be solved to find a similarity between all pairs of objects, as shown in the paper by Jeh and Widom (2002). For a graph G(V, E) the solution is reached by iterating to a fixed point, keeping |V|^2 entries per iteration, where S_k(a, b) is the estimate of the similarity between a and b at the k-th iteration. This works well when we have complete information on the objects.
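For reference, here is a compact sketch of the basic SimRank iteration in the Jeh and Widom (2002) formulation (not the personalized variant the paper uses):

```python
def simrank(neighbors, c=0.8, iterations=10):
    """neighbors: {node: [adjacent nodes]}; every neighbor must itself be a key.
    Returns {(a, b): similarity} after a fixed number of iterations."""
    nodes = list(neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new_sim = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new_sim[(a, b)] = 1.0
                elif not neighbors[a] or not neighbors[b]:
                    new_sim[(a, b)] = 0.0
                else:
                    total = sum(sim[(x, y)] for x in neighbors[a] for y in neighbors[b])
                    new_sim[(a, b)] = c * total / (len(neighbors[a]) * len(neighbors[b]))
        sim = new_sim
    return sim
```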
In many scenarios this information may not be available, so SimRank is personalized: when the similarities of some pairs are known, this approach fixes them in the set of equations and lets the rest of the values be learnt automatically by the system. For the supervised labels, the mapping labels of the Omega ontology are used. To coarsen WordNet, an undirected graph is constructed whose vertices are the synsets of WordNet and whose edge set E comprises the edges obtained by thresholding the similarity metric learnt with the personalized SimRank model. On this graph, connected components are found, which gives a partition over the synsets, and all the senses of a word occurring in the same component are grouped into a single coarse sense. Lowering the threshold yields denser graphs with fewer connected components, and the smaller number of components translates into coarser senses; the threshold therefore provides a way to control the granularity of the coarse senses.
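A sketch of the coarsening step (my own illustration), assuming the learnt pairwise synset similarities and a word-to-senses map are available:

```python
import networkx as nx

def coarsen_senses(synset_similarities, word_senses, threshold=0.5):
    """
    synset_similarities: {(synset_a, synset_b): similarity}   (learnt metric)
    word_senses:         {word: [synsets]}
    """
    g = nx.Graph()
    for (a, b), sim in synset_similarities.items():
        g.add_node(a)
        g.add_node(b)
        if sim >= threshold:          # keep only sufficiently similar pairs
            g.add_edge(a, b)

    component_id = {}
    for i, comp in enumerate(nx.connected_components(g)):
        for synset in comp:
            component_id[synset] = i

    coarse = {}                       # word -> list of coarse sense groups
    for word, senses in word_senses.items():
        groups = {}
        for s in senses:
            groups.setdefault(component_id.get(s, s), []).append(s)
        coarse[word] = list(groups.values())
    return coarse
```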
We now look at the feature engineering. The features are broadly classified into two groups: those derived from the structure of WordNet and those derived from other corpora. The features derived from WordNet are further subdivided into similarity measures and synset/sense based features. Among the WordNet similarity measures, the authors used path-based similarity measures. The synset and sense based features include the number of lemmas common to two synsets, the maximum polysemy degree among the lemmas shared by the synsets, whether the two synsets have the same lexicographer file, the number of common hypernyms, and so on. Hypernyms/hyponyms are the super/sub-ordinate relationships in the structure of WordNet.
Polysemous means having more than one sense within the same syntactic category. Features derived from external corpora include the score of a synset with respect to the 169 hierarchically organized domain labels available from the eXtended WordNet Domains project. BabelNet is another external resource, which provides the translations of noun word senses in six languages.
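A few of the WordNet-derived features can be computed directly with NLTK's WordNet interface; the sketch below (mine, and far from the paper's full feature set) shows some of them.

```python
from nltk.corpus import wordnet as wn  # assumes the WordNet corpus is installed

def synset_pair_features(s1, s2):
    lemmas1 = {l.name() for l in s1.lemmas()}
    lemmas2 = {l.name() for l in s2.lemmas()}
    shared = lemmas1 & lemmas2
    return {
        "path_similarity": s1.path_similarity(s2) or 0.0,
        "common_lemmas": len(shared),
        "max_polysemy_of_shared_lemmas": max((len(wn.synsets(l)) for l in shared), default=0),
        "same_lexicographer_file": s1.lexname() == s2.lexname(),
        "common_hypernyms": len(set(s1.hypernyms()) & set(s2.hypernyms())),
    }

print(synset_pair_features(wn.synset("car.n.01"), wn.synset("truck.n.01")))
```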

Tuesday, June 3, 2014

We will continue our discussion on graph-based random walks today. We saw how we can establish relationships in a bipartite graph, how to perform random walks, and how to compute costs.
What we will see next is how this translates to keyword extraction. This part is more of a conjecture at this point. We want to reuse the concept of contextual similarity, but we want to define the context as something beyond just adjacent words. We don't want to use a local vector space model either, because it would be computationally expensive to do both the random walk and VSM clustering. Is there a way to define collocation other than by adjacent words and a VSM?
One suggestion is to use something like co-occurrence. Collocation is different from co-occurrence: one defines proximity by boundaries, while the other is about counts and groupings. If we can come up with different groups of words and find patterns in them, such as groups that tend to repeat, that is co-occurrence. How we discover different groups and different patterns is closely associated with clustering.
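A small sketch of the distinction, with an arbitrary window size standing in for whatever grouping we might eventually choose: collocation here counts only adjacent pairs, while co-occurrence counts pairs appearing together within a window.

```python
from collections import Counter

def adjacent_pairs(tokens):
    """Collocation in the narrow sense: counts of adjacent word pairs."""
    return Counter(zip(tokens, tokens[1:]))

def cooccurrence(tokens, window=5):
    """Co-occurrence: counts of pairs appearing within a window of tokens."""
    counts = Counter()
    for i, w in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            counts[(w, other)] += 1
    return counts

tokens = "the pilot reported an engine failure before the landing".split()
print(adjacent_pairs(tokens).most_common(3))
print(cooccurrence(tokens).most_common(3))
```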

Monday, June 2, 2014

When we discussed random walk based lexicon building in the algorithm from the earlier post, we had a few stages of computation.
First we built the bipartite graph (this was mentioned before, in an earlier post).
We built it this way (a rough sketch in code follows the steps):
     We extracted N-grams from the text corpus.
      For each N-gram, denoted by n,
      we checked whether the center word is a noisy word or a normalized word;
      if it is noisy we add it as a source node,
      else we add it as a sink node;
      we then add the context and
      the edge weight.
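Here is the rough sketch promised above (my reconstruction of the steps; the helper is_noisy and the lexicon lookup are assumptions, not the original algorithm's code):

```python
from collections import defaultdict

def is_noisy(word, lexicon):
    """Hypothetical check: a word is considered noisy if it is not in the lexicon."""
    return word not in lexicon

def build_bipartite_graph(ngrams, lexicon):
    """ngrams: iterable of token tuples of odd length; returns nodes and edge weights."""
    edges = defaultdict(int)          # (word node, context node) -> weight
    sources, sinks = set(), set()
    for ngram in ngrams:
        mid = len(ngram) // 2
        center = ngram[mid]
        context = tuple(w for i, w in enumerate(ngram) if i != mid)
        if is_noisy(center, lexicon):
            sources.add(center)       # noisy word -> source node
        else:
            sinks.add(center)         # normalized word -> sink node
        edges[(center, context)] += 1 # edge weight is the co-occurrence count
    return sources, sinks, edges
```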
Second, we perform the random walk. The goal is to identify pairs of noisy and normalized words that can be considered equivalent. The walk starts at a noisy word and ends at a normalized word, and we normalize the results for each noisy node.
Third, we calculate the costs and then prune the list to the top few.
When we calculated the cost, we factored in a component for the similarity cost. This cost function was described as the ratio of the longest common subsequence ratio (LCSR) to the edit distance between the two strings.
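A sketch of that similarity cost (my own code; the guard for identical strings, where the edit distance is zero, is my addition):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def edit_distance(a, b):
    """Levenshtein distance between a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        for j in range(len(b) + 1):
            if i == 0 or j == 0:
                dp[i][j] = i + j
            else:
                dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                               dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return dp[len(a)][len(b)]

def similarity_cost(noisy, normalized):
    lcsr = lcs_length(noisy, normalized) / max(len(noisy), len(normalized))
    ed = edit_distance(noisy, normalized)
    return lcsr / ed if ed > 0 else lcsr   # guard for identical strings

print(similarity_cost("tmrw", "tomorrow"))
```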
We had a cost for the random walk as well, called the hitting time between the nodes. Furthermore, we averaged this hitting time between two nodes and normalized it against that of all the other nodes linked to that noisy node.
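A Monte Carlo sketch of estimating and normalizing those hitting times (my own illustration; the original work may compute them differently), under the assumption that the walk simply follows edges of the bipartite graph until it reaches a normalized word:

```python
import random
from collections import defaultdict

def estimate_hitting_times(neighbors, start, sinks, walks=1000, max_steps=20):
    """neighbors: {node: [adjacent nodes]}, including context nodes;
    sinks: set of normalized-word nodes. Returns normalized average hitting times."""
    hits = defaultdict(list)
    for _ in range(walks):
        node, steps = start, 0
        while steps < max_steps:
            node = random.choice(neighbors[node])
            steps += 1
            if node in sinks:
                hits[node].append(steps)   # record steps taken to reach this sink
                break
    avg = {n: sum(s) / len(s) for n, s in hits.items()}
    total = sum(avg.values()) or 1.0
    return {n: t / total for n, t in avg.items()}   # normalize over the candidates
```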
In the algorithm above, we iterate over the lexicon list; by doing these repeated iterations we choose the most probable paths and refine the results.
This way we now have a list of the top few contextually relevant words.
This method lets us use pairwise contextual similarity to normalize social media text.