Monday, June 9, 2014

In today's post we look at some of the details of graph-based word similarity measures. While we looked at a word similarity measure in the Saluja and Navratil 2012 paper, I'm going to focus on normalization of words. We mentioned the holing system in Biemann and Riedl. The holing system is only one part of their pipeline, but it is how they capture both distributional and contextual similarity. The idea is to represent observations as <t, c> pairs, where t is the term (at a certain offset) and c is the context feature. The position of t in c is denoted by the @ hole symbol. The authors give an example where the dependency (nsubj; gave2; I1) can be transferred to <gave2, (nsubj; @; I1)> and, in turn, to <I1, (nsubj; gave2; @)>. Note how the context changes with the term and how the holing operator is used: in the second representation the @ symbol marks the position of gave2, and in the third it marks the position of I1. The last two representations are the <t, c> pairs.
This paper also uses a term-context feature bipartite graph, just like the last paper we saw. In addition, it calculates a significance score sig for each pair <t, c> using the counts of the terms, the counts of the context features, and the counts of their co-occurrence. If the significance score is negative, the corresponding edge is not added. Features that occur with more than a thousand words are also removed because they are overly general. The distributional similarity is calculated this way, while the contextual similarity is calculated from the second-order graph instead.
The authors show that for the sentence "I caught a nasty cold" the holing system produces <cold5, (amod; @; nasty4)> and <cold5, (dobj; caught2; @)>, and among the scored expansions <heat, (dobj; caught; @)> gets an LMI score of 42.0 while <weather, (amod; @; nasty)> gets 139.4. The word disease is ranked highest for this sentence.
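To make the holing operation concrete, here is a minimal sketch of how a dependency triple could be split into <t, c> pairs; the triple format and the function name are my own illustration, not the authors' code.

# Split a dependency triple (label, head, dependent) into two
# <term, context-feature> pairs, with the @ hole marking the term's position.
def hole(dependency):
    label, head, dependent = dependency
    return [
        (head, (label, "@", dependent)),    # head is the term, hole at the head position
        (dependent, (label, head, "@")),    # dependent is the term, hole at the dependent position
    ]

# Example from the paper: (nsubj; gave2; I1)
for term, context in hole(("nsubj", "gave2", "I1")):
    print(term, context)
# gave2 ('nsubj', '@', 'I1')
# I1 ('nsubj', 'gave2', '@')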

We will continue this discussion.

Saturday, June 7, 2014

I've been looking at some objective functions for graph-based similarity between words and I came across "Graph based Unsupervised Learning of Word Similarities using Heterogeneous Feature Types" by Saluja and Navratil. This seems like a simple way to express the similarity. The graph they use has two kinds of nodes, words and feature sets; edges can connect two words, two feature sets, or a word and a feature set.
To learn the graph, the following objective function is minimized:
\Omega(W_v, W_f, Z) = a_0 \sum_{f_p, f_q} \big( W_{f_p,f_q} - W^{init}_{f_p,f_q} \big)^2
                    + a_1 \sum_{v_i} \sum_{f_p} \big( Z_{v_i,f_p} - Z^{init}_{v_i,f_p} \big)^2
                    + a_2 \sum_{v_i, v_j} \sum_{f_p, f_q} Z_{v_i,f_p} Z_{v_j,f_q} \big( W_{v_i,v_j} - W_{f_p,f_q} \big)^2
                    + a_3 \sum_{v_i, v_j} \sum_{f_p, f_q} W_{v_i,v_j} W_{f_p,f_q} \big( Z_{v_i,f_p} - Z_{v_j,f_q} \big)^2
where
W_{fp,fq} is the current similarity between features fp and fq,
W_{vi,vj} is the current similarity between words vi and vj,
Z_{vi,fp} is the current importance weight of feature fp for word vi, and
a_0 to a_3 are constants that sum to 1 and set the relative importance of each term.
The first two terms minimize the squared l2-distance between the initial and the current values of W_{fp,fq} and Z_{vi,fp}.
The third term pulls the word similarity W_{vi,vj} towards the feature similarity W_{fp,fq}, weighted by the importance weights Z_{vi,fp} and Z_{vj,fq}.
The fourth term pulls the importance weights Z_{vi,fp} and Z_{vj,fq} towards each other, in proportion to the word and feature similarities.
As you can see the terms are progressive and enable smoothing over the two graphs.
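To make the objective concrete, here is a rough numpy sketch that simply evaluates Omega for given matrices. The array names, shapes, and the brute-force broadcasting are my own assumptions; this only computes the value, it does not perform the paper's optimization, and it is only feasible for tiny examples because it materializes an (n, n, m, m) array.

import numpy as np

def omega(Wv, Wf, Z, Wf_init, Z_init, a):
    # Wv: n x n word-word similarities, Wf: m x m feature-feature similarities,
    # Z: n x m word-feature importance weights, *_init: their initial values,
    # a: the four constants a0..a3.
    a0, a1, a2, a3 = a
    term1 = a0 * np.sum((Wf - Wf_init) ** 2)
    term2 = a1 * np.sum((Z - Z_init) ** 2)
    # term3: sum over vi,vj,fp,fq of Z[vi,fp] Z[vj,fq] (Wv[vi,vj] - Wf[fp,fq])^2
    diff = Wv[:, :, None, None] - Wf[None, None, :, :]
    weight = Z[:, None, :, None] * Z[None, :, None, :]
    term3 = a2 * np.sum(weight * diff ** 2)
    # term4: sum over vi,vj,fp,fq of Wv[vi,vj] Wf[fp,fq] (Z[vi,fp] - Z[vj,fq])^2
    zdiff = Z[:, None, :, None] - Z[None, :, None, :]
    wprod = Wv[:, :, None, None] * Wf[None, None, :, :]
    term4 = a3 * np.sum(wprod * zdiff ** 2)
    return term1 + term2 + term3 + term4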
But can we express this with a different formulation that enables Laplacian smoothing and matrix operations?
The Laplacian matrix, sometimes called the admittance matrix or Kirchhoff matrix, of an undirected, unweighted graph G is an n x n symmetric matrix with one row and column for each node, defined by L = D - A, where D is the degree matrix (the diagonal matrix formed from the vertex degrees) and A is the adjacency matrix.
The normalized Laplacian matrix of a graph is defined entrywise by
L(G)_ij = 1 if i == j and d_i != 0
        = -1 / sqrt(d_i * d_j) if i and j are adjacent
        = 0 otherwise.
The Laplacian matrix measures to what extent a function on the graph differs at one vertex from its values at nearby vertices.
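As a quick illustration, here is a small numpy sketch of both the plain and the normalized Laplacian for an undirected, unweighted graph given by its adjacency matrix; it is just the definitions above written out.

import numpy as np

def laplacian(A):
    D = np.diag(A.sum(axis=1))    # degree matrix
    return D - A                  # L = D - A

def normalized_laplacian(A):
    d = A.sum(axis=1)
    n = A.shape[0]
    L = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j and d[i] != 0:
                L[i, j] = 1.0
            elif A[i, j] and d[i] and d[j]:
                L[i, j] = -1.0 / np.sqrt(d[i] * d[j])
    return L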
I'm going to take a break to discuss a topic that I've been asked to look into. A Python library is not able to make an SSL connection to an application that uses libssl. The error message received is 'The handshake operation timed out'. This happens because the server does not complete the handshake in some cases.
We first look at the steps involved in the TLS handshake. TLS and SSL are different protocols that do not interoperate, and TLS 1.0 is often referred to as SSL 3.1, but the differences do not matter much for this discussion, and TLS 1.0 has a way to fall back to SSL 3.0. The handshake we describe is responsible for the authentication and the key exchange necessary to establish or resume secure sessions. Establishing a secure session involves the following steps:
cipher suite negotiation
authentication of the server and optionally the client
session key information exchange
The first step refers to the exchange of the cipher suite that both the client and the server will be using for the duration of the session.
The second step refers to the step where the server proves its identity to the client. The client might also need to prove its identity to the server.
The third step refers to the exchange of random numbers and the selection of a special number called the Pre-Master Secret, which helps create a Master Secret; the Master Secret in turn helps create the MAC secret, the session key used for hashing, and the write key, the session key used for encryption.
After the initial client hello, server hello, and server hello done messages, the pre-master secret is generated by the client and the master secret is derived by both the client and the server. The client then sends the change cipher spec notification to the server.
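To see where the reported error surfaces, here is a minimal sketch using Python's ssl module; the host name and the five-second timeout are placeholders. If the server stalls mid-handshake, the timeout fires during wrap_socket with the message quoted above.

import socket
import ssl

context = ssl.create_default_context()
sock = socket.create_connection(("example.com", 443), timeout=5)   # placeholder host and timeout
try:
    tls = context.wrap_socket(sock, server_hostname="example.com")  # the TLS handshake happens here
    print("negotiated cipher:", tls.cipher())
    tls.close()
except socket.timeout as e:
    # if the server never finishes the handshake within the timeout,
    # this is where 'The handshake operation timed out' shows up
    print("handshake failed:", e)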
Cryptographic algorithms can vary. Some fall in the bucket of the Federal Information Processing Standards, also called the FIPS standards. These can be enabled or disabled with FIPS mode, which is often used to troubleshoot SSL issues. There are also dedicated libraries designed to support cross-platform development of security-enabled client and server applications, with optional support for hardware SSL acceleration on the server side and smart cards on the client side; these are called the Network Security Services (NSS).

Friday, June 6, 2014

The previous post discussed a paper that formalized the usage of graphs for describing reasoning chain relationships. It introduced the TRN graph and described two different types of nodes and three different types of edges. The nodes were either text nodes or section nodes, and the edges were structural, similarity, and causal. The graph is constructed from the text with the help of a syntactic parser, which adds the text nodes. Section nodes are attached to the sentence nodes, and all the text nodes are added under these. Similarity measures help add similarity edges to node pairs; only pairs whose measure is above a threshold get an edge. Graphs extracted from different sections are combined. Next, causal relations are added with the help of a discourse parser. With the structural, similarity, and causal edges added, we have a complete TRN graph. With this graph, we can choose source-destination pairs and find paths. Paths are then combined, post-processed, and visualized.
The dataset used for this study was the aviation investigation reports from the Transportation Safety Board of Canada. Each report in this collection is an incident report and has sections titled summary, factual information, analysis, and findings. The documents are available for download from the TSB of Canada and were pre-processed by the authors using the GATE NLP platform. The preprocessing steps included tokenization, sentence splitting, and part-of-speech tagging. The summary section is a brief description, the factual information section gives the details of the aircraft or circumstances, the analysis section is a discussion of what might have happened, and the findings section describes the causes. The reasoning chain was therefore chosen as the shortest path from Summary to Causes. The constraints on the reasoning chain path were defined in Backus-Naur Form as a summary-path with an edge to a factual-path, with an edge to an analysis-path, with an edge to the causes-path, where the intermediate paths are optional. Each of these paths can be recursively defined as one or more nodes connected by edges. While these edges can be any of the aforementioned types (partof, contains, similar, and cause edges), only the causes-path explicitly has part-of edges. A post-processing step shrinks the path by eliminating transitive edges. The compressed graphs are then laid out for visualization.
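As a loose illustration of that constraint, the section sequence of a candidate path can be checked against a pattern in which the middle sections are optional. The regular expression below is my own rendering of the idea, not the authors' grammar.

import re

# summary, then optionally factual and analysis, then causes, each covering one or more nodes
CHAIN = re.compile(r"^(summary )+((factual )+)?((analysis )+)?(causes )+$")

def is_valid_chain(section_labels):
    return bool(CHAIN.match(" ".join(section_labels) + " "))

print(is_valid_chain(["summary", "factual", "analysis", "causes"]))  # True
print(is_valid_chain(["summary", "causes"]))                         # True
print(is_valid_chain(["analysis", "summary", "causes"]))             # False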

We read the paper "Automatic extraction of reasoning chains from textual reports" by Sizov and Ozturk. They extract reasoning chains from documents such as medical patient reports and accident reports. Reasoning chains contain information useful for analyzing the problem at hand. The authors use a graph-based approach that makes the relations between textual units explicit. Reasoning chains are a sequence of transitions from one piece of information to another that connect the problem with the resolution, and thus are very valuable in future problem triage. Moreover, they help determine the validity or the fallaciousness of the arguments presented.

This paper decomposes a document into text units, discovers the connections between the text units, and makes them explicit. The TRN is automatically acquired from text using a syntactic parser and a semantic similarity measure. We now discuss this TRN in detail. A reasoning chain is extracted as a path from the graph-based text representation. The TRN is a graph with two types of nodes, text nodes and section nodes, and three types of edges: structural, similarity, and causal. The graph is constructed from the text in the following manner: 1) syntax trees obtained from a syntactic parser are added to the graph, 2) section nodes are attached to the sentence nodes, 3) similarity edges are added between similar text nodes, and 4) cause relations identified by a discourse parser are added. When representing text units, graph-based approaches have often used terms or short phrases as text units, which are too small for this approach. Another popular choice is to use a whole sentence or a few sentences; however, such text units may contain several pieces of information of which only one is relevant for the reasoning chain. In this paper, the authors use the Stanford parser to extract the syntax tree and pick out the S (sentence, clause), NP (noun phrase), and VP (verb phrase) constituents, which are referred to as text nodes. Text nodes are identified by a set of stemmed words. Along with these, structural relations such as Contains and PartOf are added as edges. Graphs extracted from different sentences in a document are combined into one. To obtain similarity relations, a similarity value is computed for each pair of text nodes of the same category (S, VP, NP), and Similar edges are added to the graph for node pairs where the similarity is above a threshold. Their similarity measure finds a one-to-one alignment of words from the two text units. The LCH similarity for these words is computed using the shortest path between the corresponding senses in WordNet. Then a complete bipartite graph is constructed: nodes in the bipartite graph represent words from the text units, while the edges have weights that correspond to similarities between words. A maximum weighted bipartite matching finds a one-to-one alignment that maximizes the sum of similarities between aligned words, as sketched below. This sum is then normalized and thresholded to add the corresponding Similar edges to the graph. Next, causal relations are added to the graph. Causal relations are found by a discourse parser; in this case the authors used a PDTB-style end-to-end discourse parser. Cause relations found by the parser are added to the TRN graph as Cause edges. This is how the TRN graph is built.
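Here is a rough sketch of that alignment step, assuming some word_similarity(w1, w2) function (for example an LCH-based WordNet measure) is available; the normalization by the longer unit is one possible choice, not necessarily the authors'.

import numpy as np
from scipy.optimize import linear_sum_assignment

def unit_similarity(words1, words2, word_similarity):
    # weight matrix of the complete bipartite graph between the two word lists
    sim = np.array([[word_similarity(a, b) for b in words2] for a in words1])
    rows, cols = linear_sum_assignment(-sim)       # negate to get a maximum-weight matching
    total = sim[rows, cols].sum()
    return total / max(len(words1), len(words2))   # normalize by the longer unit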
Once the TRN graph is constructed, the reasoning chain is generated based on a three stage process:
1) a report is converted from text into a TRN graph
2) given a start and an end node, several paths are extracted from the graphs and
3) paths are combined, post-processed, and visualized.
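A toy sketch of stage 2 with networkx, using made-up node names: build a small TRN-like graph and read off a shortest path between the chosen source and destination nodes. This is only an illustration of path extraction, not the authors' implementation.

import networkx as nx

G = nx.Graph()
G.add_edge("Summary", "s1", kind="partof")
G.add_edge("s1", "s7", kind="similar")
G.add_edge("s7", "s9", kind="cause")
G.add_edge("s9", "Causes", kind="partof")

path = nx.shortest_path(G, source="Summary", target="Causes")
print(path)   # ['Summary', 's1', 's7', 's9', 'Causes']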

Thursday, June 5, 2014

In the previous post we saw that we could represent distributional similarity and contextual similarity with data structures. The first is represented by three graphs. We first create a term-context feature graph with T the set of terms, C the set of context features, and edges based on the counts of the terms and the context features. Based on this graph, a first-order graph is created, denoted by T, C, and the significance score sig. The significance score is calculated for each pair (t, c) using a metric called the Lexicographer's Mutual Information, LMI(t, c) = count(t, c) * log2( count(t, c) / (count(t) * count(c)) ), as described in the paper by Evert, 2004. The edges with LMI < 0 are removed and only the p most significant pairs per term t are retained. Features that are overly general because they occur with too many terms are also removed.
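Here is a small sketch of that first-order step from raw (term, context) pair counts. Note one assumption on my part: I include the total pair count N in the expected frequency, following Evert's definition, since with raw counts the formula as written above would come out negative for almost every pair. The thresholding and top-p pruning mirror the description in the post, not the authors' exact code.

import math
from collections import Counter

def significant_pairs(pairs, p=100):
    # pairs is a list of (term, context_feature) observations; p is the number of
    # salient context features kept per term
    pair_count = Counter(pairs)
    term_count = Counter(t for t, _ in pairs)
    ctx_count = Counter(c for _, c in pairs)
    N = len(pairs)
    sig = {}
    for (t, c), n in pair_count.items():
        lmi = n * math.log2(n * N / (term_count[t] * ctx_count[c]))
        if lmi > 0:                                  # drop edges with negative significance
            sig[(t, c)] = lmi
    top = {}                                         # keep only the p most significant features per term
    for (t, c), score in sorted(sig.items(), key=lambda kv: -kv[1]):
        top.setdefault(t, [])
        if len(top[t]) < p:
            top[t].append((c, score))
    return top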
The second-order similarity graph between terms is defined with the similarity score as the number of salient features the two terms share. This approach does not count how often a feature occurs with a term but instead uses a significance ranking. This makes it more scalable, as it never needs the full list of features for a term pair at any time.
The contextual similarity is calculated differently; it is also computed by a ranking method. Given the set of candidate expansions from the second-order graph, the goal is that the terms most similar in context are ranked higher and the non-compatible ones are ranked lower. First, the terms and context features are extracted as pairs by the holing operator for all occurrences of the given target word in the document. Then a new graph is defined whose context features CON come from the previous graph. For the target word, the second-order similarity graph is looked up to find the top n similar terms, and these are added to this graph. Edges between a candidate term and the context features are labeled with the significance score from the first-order graph; if an edge is not in the first-order graph, it gets no significance score. A ranking score for each term is then calculated as the harmonic mean of these scores with a smoothing factor, and the entries are re-ordered based on this ranking score.
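A rough sketch of that ranking step: each candidate expansion is scored by the smoothed harmonic mean of its significance scores over the target's context features. The smoothing constant and the handling of missing edges here are illustrative assumptions, not the authors' exact choices.

def rank_candidates(candidates, context_features, sig, smoothing=1e-5):
    # sig maps (term, context_feature) to the first-order significance score
    ranked = []
    for term in candidates:
        scores = [sig.get((term, c), 0.0) + smoothing for c in context_features]
        harmonic = len(scores) / sum(1.0 / s for s in scores)
        ranked.append((term, harmonic))
    return sorted(ranked, key=lambda kv: -kv[1])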

Wednesday, June 4, 2014


We saw that global and local scopes can matter in finding similarities. We discuss a different paper in this space: the paper on a graph-based contextualization method using distributional thesauri by Biemann and Riedl. This paper discusses a new contextual similarity method that generates, for each term occurrence in a text, a ranked list of terms that are semantically similar and compatible with the context. The framework is instantiated by the definition of the term and the context; the dependencies are parsed to get the context. The contextualized approach has trumped the non-contextualized baseline across all parts of speech. Moreover, this is an unsupervised generative method for similarity in context and does not rely on the existence of lexical resources as a source of candidate expansions. The approach makes use of an existing distributional thesaurus (DT) whose entries consist of a ranked list of the globally most similar terms for a target term. It uses the DT in a graph representation and moves from the global notion of similarity to a local one. The prevalent way to represent word meanings has been the vector space model built from corpora. This approach, however, does not require domain knowledge and is unsupervised in the sense that it does not require training.
They introduce an operation called holing, which splits any observation over dependencies into pairs of terms and context features. These pairs are then used for calculating the distributional similarity and for finding the contextual similarity. The operator is applied in a system that requires some preprocessing steps, and its notation is the @ symbol. It allows a more unified representation of dependencies.

The distributional similarity is represented using three graphs. We start with a bipartite term-context feature graph with T the set of terms, C the set of context features, and edges e(t, c, f), where the frequency f = count(t, c) is the co-occurrence count of term t and context feature c. Based on this graph TC, a first-order graph e(T, C, sig) is created and the significance score sig is calculated. The significance score is calculated for each pair (t, c) using the count-based formula defined in Evert 2004.
Then all the edges with score sig(t, c) < 0 are removed, and only the p most significant pairs per term t are kept, with the remaining edges removed. Additionally, overly general features that occur with more than a thousand words are removed.
The second-order similarity graph between terms is defined as SO(T, E) for t1, t2 belonging to T, with the similarity score being the number of salient features the two terms share. This graph gives the distributional similarity.
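Putting the pieces together, here is a tiny sketch of that second-order score: count how many salient features two terms share, where top is the per-term salient feature list from the first-order step (as in the earlier sketch); the names are my own.

def second_order_similarity(top, t1, t2):
    # top maps a term to its list of (context_feature, score) pairs
    feats1 = {c for c, _ in top.get(t1, [])}
    feats2 = {c for c, _ in top.get(t2, [])}
    return len(feats1 & feats2)   # number of shared salient features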