Thursday, June 5, 2014

In the previous post we saw that distributional similarity and contextual similarity can both be represented with graph data structures. Distributional similarity is represented by three graphs. We first create a bipartite term-context feature graph with T the set of terms, C the set of context features, and edges weighted by the counts of terms occurring with context features. From this graph, a first order graph is created, denoted by T, C and a significance score sig. The significance score sig is calculated for each pair (t, c) using a metric called the Lexicographer's Mutual Information, LMI(t,c) = count(t,c) * log2(count(t,c) / (count(t) * count(c))), as described in the paper by Evert, 2004. To prune the graph, edges with LMI < 0 are removed and only the p most significant pairs per term t are retained. Features that are overly general because they occur with too many terms are also removed.
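A minimal sketch of this scoring and pruning, assuming plain in-memory count dictionaries; the names lmi and first_order_graph and the parameters p and max_terms_per_feature are illustrative, not from the paper:

import math
from collections import Counter, defaultdict

def lmi(count_tc, count_t, count_c):
    # Lexicographer's Mutual Information for a (term, context feature) pair
    return count_tc * math.log2(count_tc / (count_t * count_c))

def first_order_graph(pair_counts, p=1000, max_terms_per_feature=1000):
    """pair_counts: dict mapping (term, feature) -> co-occurrence count."""
    term_counts = Counter()
    feature_counts = Counter()
    for (t, c), f in pair_counts.items():
        term_counts[t] += f
        feature_counts[c] += f
    # drop overly general features that occur with too many distinct terms
    terms_per_feature = Counter(c for (_, c) in pair_counts)
    scored = defaultdict(list)
    for (t, c), f in pair_counts.items():
        if terms_per_feature[c] > max_terms_per_feature:
            continue
        sig = lmi(f, term_counts[t], feature_counts[c])
        if sig > 0:                     # remove edges with LMI < 0
            scored[t].append((c, sig))
    # keep only the p most significant features per term
    return {t: sorted(feats, key=lambda x: -x[1])[:p] for t, feats in scored.items()}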
The second order similarity graph between terms is defined with the similarity score as the number of salient features the two terms share. This approach does not count how often a feature occurs with a term; instead it uses a significance ranking. This makes it more scalable to large data, as it never needs the full list of features for a term pair at any time.
Contextual similarity is calculated differently, also by a ranking method. Given the set of candidate expansions from the second order graph, the goal is to rank the terms that are most similar in the given context higher and the non-compatible ones lower. First, the terms and context features are extracted as pairs for the given target word from the document by the holing operator. Then a new graph is defined over the context features CON that also belong to the first order graph. For the target word, the second order similarity graph is looked up to find the top n similar terms, and these are added to the graph alongside CON. Edges between a candidate term and the context features are labeled with the significance score from the first order graph; edges that are not in the first order graph get no significance score. A ranking score for each candidate term is then calculated as the harmonic mean of its edge scores with a smoothing factor, and the entries are re-ordered based on this ranking score.
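A sketch of this re-ranking step; the representation of the first order graph as a dict keyed by (term, feature) pairs and the additive smoothing are assumptions, not the paper's exact formulation:

def contextual_rank(target_context_features, candidates, first_order, alpha=1e-3):
    """Re-rank candidate expansions of a target word by how well they fit
    the observed context features, using a smoothed harmonic mean."""
    ranked = []
    for term in candidates:
        scores = [first_order.get((term, c), 0.0) + alpha
                  for c in target_context_features]
        harmonic = len(scores) / sum(1.0 / s for s in scores)
        ranked.append((term, harmonic))
    return sorted(ranked, key=lambda x: -x[1])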

Wednesday, June 4, 2014


We saw that global and local scopes can matter in finding similarities. We discuss a different paper that takes this approach: the paper on a graph based contextualization method using distributional thesauri by Biemann and Riedl. This paper describes a new contextual similarity method that generates, for each term occurrence in a text, a ranked list of terms that are semantically similar and compatible with the context. The framework is instantiated by the definition of term and context; here, dependency parses supply the context. The contextualized approach outperforms the non-contextualized baseline across all parts of speech. Moreover, this is an unsupervised generative method for similarity in context and does not rely on the existence of lexical resources as a source for candidate expansions. The approach makes use of an existing distributional thesaurus (DT) whose entries consist of a ranked list of the globally most similar terms for a target term. It uses the DT in a graph representation and moves from a global notion of similarity to a local one. The prevalent way to represent word meanings has been the vector space model built from corpora. This approach, however, does not require domain knowledge and is unsupervised in the sense that it does not require training.
They introduce an operation called the holing operation, which splits any kind of observation on the dependencies into pairs of terms and context features. These pairs are then used both for calculating distributional similarity and for finding contextual similarity. The operator is used in a system that performs the required preprocessing steps. The notation uses the @ symbol as a placeholder for the term position, which allows a more unified representation of dependencies.
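As an illustration of the idea (the exact pair splitting shown here is a sketch; only the use of @ as the placeholder comes from the paper), a dependency observation could be split like this:

def holing(dependency):
    """Split a dependency triple (relation, head, dependent) into two
    (term, context feature) pairs, with @ marking the removed term."""
    rel, head, dep = dependency
    return [(head, (rel, "@", dep)), (dep, (rel, head, "@"))]

# e.g. ("nsubj", "gave", "I") ->
#   [("gave", ("nsubj", "@", "I")), ("I", ("nsubj", "gave", "@"))]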

The distributional similarity is represented using three graphs. First, a bipartite term-context feature graph is built, with T the set of terms, C the set of context features, and edges e(t, c, f), where the weight f = count(t, c) is the number of times term t occurs with context feature c. Based on this graph TC, a first order graph with edges e(t, c, sig) is created, where the significance score sig is calculated for each pair (t, c) using the count-based formula defined in Evert 2004.
Then all edges with a score sig(t, c) < 0 are removed, and only the p most significant pairs per term t are kept, removing the remaining edges. Additionally, overly general features that occur with more than a thousand words are removed.
The second order similarity graph between terms is defined as SO(T, E), with an edge between t1, t2 in T whose similarity score is the number of salient features the two terms share. This graph gives the distributional similarity.
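A sketch of how the second order graph could be derived from a pruned first order graph, assuming that graph is available as a dict mapping each term to its retained salient features:

from itertools import combinations

def second_order_graph(first_order):
    """first_order: dict mapping term -> list of (feature, sig) pairs.
    Returns a dict mapping (t1, t2) -> number of shared salient features."""
    salient = {t: {c for c, _ in feats} for t, feats in first_order.items()}
    edges = {}
    for t1, t2 in combinations(sorted(salient), 2):
        shared = len(salient[t1] & salient[t2])
        if shared > 0:
            edges[(t1, t2)] = shared
    return edges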
In the paper titled "Merging Word Senses" by Bhagwani, Satapathy and Karnick, a semi-supervised approach that learns a similarity measure over WordNet synsets using a graph based recursive similarity definition is presented. The learned similarities over all word sense pairs are then used to derive coarse sense inventories at arbitrary granularities.
Unlike previous approaches, which generated a coarse sense inventory by merging fine grained senses, this approach proposes a framework for learning a synset similarity metric. The problems with the earlier approaches include the following: first, they require a stopping criterion for each word, such as the number of final classes, which usually cannot be predetermined; second, inconsistent clusters are obtained because coarse senses are generated independently for each word, and this inconsistent clustering causes transitive closure errors. The approach discussed in this paper improves the coarsening of noun synsets. To learn similarities between synset pairs that do not share a word, the authors use a variant of the SimRank framework that gives them a non-zero similarity. SimRank is a graph based similarity measure applicable in any domain with object-to-object relationships; its premise is that two objects are similar if they are related to similar objects. The SimRank equations can be solved to find a similarity between all pairs of nodes, as shown by Jeh and Widom 2002. For a graph G(V, E) the solution is reached by iterating to a fixed point, keeping |V|^2 entries per iteration, where Sk(a, b) is the estimate of the similarity between a and b at the kth iteration. This works well when we have complete information on the objects.
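A minimal SimRank iteration over an undirected graph, following the fixed point formulation; the decay constant C and the iteration count are illustrative:

def simrank(neighbors, C=0.8, iterations=10):
    """neighbors: dict mapping each node to a set of neighboring nodes.
    Returns a dict of pairwise similarity estimates S[(a, b)]."""
    nodes = list(neighbors)
    S = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        S_new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    S_new[(a, b)] = 1.0
                    continue
                Na, Nb = neighbors[a], neighbors[b]
                if not Na or not Nb:
                    S_new[(a, b)] = 0.0
                    continue
                total = sum(S[(i, j)] for i in Na for j in Nb)
                S_new[(a, b)] = C * total / (len(Na) * len(Nb))
        S = S_new
    return S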
In many scenarios this complete information may not be known, so SimRank is personalized: by knowing the similarities of some pairs in advance, this approach fixes those values in the set of equations and lets the rest of the values be learnt automatically by the system. For the supervised labels, the mapping labels of the Omega ontology are used. To coarsen WordNet, an undirected graph is constructed whose vertices are the synsets of WordNet and whose edge set E comprises edges obtained by thresholding the similarity metric learnt with the personalized SimRank model. On this graph, connected components are found, which gives a partition over synsets. All the senses of a word occurring in the same component are grouped as a single coarse sense. Lowering the threshold gives denser graphs with fewer connected components, and fewer components translate into coarser senses. The threshold therefore provides a way to control the granularity of the coarse senses.
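A sketch of the thresholding and grouping step using networkx, assuming the learnt similarities are available as a dict over synset pairs; the threshold value is illustrative:

import networkx as nx

def coarsen(similarities, threshold=0.5):
    """similarities: dict mapping (synset_a, synset_b) -> learnt similarity.
    Returns a list of synset groups (connected components above threshold)."""
    G = nx.Graph()
    for (a, b), sim in similarities.items():
        G.add_node(a)
        G.add_node(b)
        if sim >= threshold:
            G.add_edge(a, b)
    return [set(component) for component in nx.connected_components(G)]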
We now look at the feature engineering. The features are broadly classified into two parts: those derived from the structure of WordNet and those derived from other corpora. The features derived from WordNet are further subdivided into similarity measures and synset or sense based features. Among the WordNet similarity measures, the authors used path based similarity measures. The synset and sense based features include the number of lemmas common to two synsets, the maximum polysemy degree among the lemmas shared by the synsets, whether the two synsets have the same lexicographer file, the number of common hypernyms, etc. Hypernyms/hyponyms denote super/sub ordinate relationships in the structure of WordNet.
Polysemous means having more than one sense in the same syntactic category. Features derived from external corpora include the score of a synset with respect to the 169 hierarchically organized domain labels available from the eXtended WordNet Domains project. BabelNet is another external resource that provides translations of noun word senses in six languages.
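A sketch of a few of the WordNet-derived features using NLTK's WordNet interface; the exact feature set and any weighting the paper uses are not reproduced here:

from nltk.corpus import wordnet as wn

def synset_pair_features(s1, s2):
    """Compute a handful of WordNet-based features for a synset pair."""
    lemmas1, lemmas2 = set(s1.lemma_names()), set(s2.lemma_names())
    shared = lemmas1 & lemmas2
    return {
        "path_similarity": s1.path_similarity(s2) or 0.0,
        "common_lemmas": len(shared),
        "max_polysemy_of_shared": max((len(wn.synsets(l)) for l in shared), default=0),
        "same_lexicographer_file": s1.lexname() == s2.lexname(),
        "common_hypernyms": len(s1.common_hypernyms(s2)),
    }

# e.g. synset_pair_features(wn.synset('bank.n.01'), wn.synset('bank.n.02'))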

Tuesday, June 3, 2014

We will continue with our discussion of graph based random walks today. We saw how we can establish relationships in a bipartite graph, how to perform random walks, and how to compute costs.
What we will see next is how this translates to keyword extraction. This part is more of a conjecture at this point. We want to reuse the concept of contextual similarity, but we want to define the context as something other than just adjacent words. We don't want to use a local vector space model either, because it would be computationally expensive to do both the random walk and VSM clustering. Is there a way to define collocation as something other than adjacent words and a VSM?
One suggestion is that we use something like co-occurrence. Collocation is different from co-occurrence: one defines proximity by boundaries, while the other is about counts and groupings. If we can come up with different groups of words and we find patterns in them, such as that they tend to repeat, that is co-occurrence. Discovering those groups and patterns is closely associated with clustering.
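For instance, a simple window-based co-occurrence count (the window size is arbitrary here) would relax strict adjacency:

from collections import Counter

def cooccurrence_counts(tokens, window=5):
    """Count unordered word pairs that occur within a sliding window."""
    counts = Counter()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            pair = tuple(sorted((tokens[i], tokens[j])))
            counts[pair] += 1
    return counts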

Monday, June 2, 2014

When we discussed the random walk based lexicon building algorithm from the earlier post, we had a few stages of computation.
First, we built the bipartite graph (this was described in the post before the last one).
We built it this way:
     We extracted N-grams from the text corpus.
     For each N-gram denoted by n:
       we checked whether the center word is a noisy or a normalized word;
       if it is noisy, we add it as a source node,
       else we add it as a sink node;
       we add the context node and
       then we add the edge weight.
Second, we perform the random walk. The goal is to identify pairs of noisy and normalized words that can be considered equivalences. A walk starts at a noisy word and ends at a normalized word. We then normalize the results for each noisy node.
Third, we calculate the costs and then prune the list to the top few.
When we calculated the cost, we factored in a component for the similarity cost. This cost function is described as the ratio of the longest common subsequence ratio and the edit distance between two strings.
We also had a cost for the random walk, called the hitting time between the nodes. We averaged this hitting time between two nodes and normalized it against the hitting times of all other nodes linked to that noisy node.
In the algorithm above, we iterate over the lexicon list; by repeating these iterations we choose the most probable paths and refine the results.
This way we end up with a list of the top few contextually relevant words.
This method lets us use pairwise contextual similarity to normalize social media text.



Having discussed the paper on normalizing social media text in the previous post, we will conclude reviewing it today. We mentioned that the approach in the paper uses a bipartite graph and performs a random walk to connect words to their normalized equivalences. We briefly mentioned how this helps us identify pairwise relationships between words and their contexts. We were also looking to establish contextual similarity in other ways. This requires exploiting several kinds of relationships, such as semantic relationships, part of speech relationships, categorization or ontology based groupings, etc. Perhaps we could also consider local vector space models instead of relying only on adjacent words in a sentence. That would mean using a local vector space model to establish contextual similarity and a global graph and random walk model to establish pairwise relationships.
That, however, is a different topic; for now we will just finish reviewing the paper. The number of steps taken to traverse between two nodes is called the hitting time. The average hitting time of all walks that connect the source and sink nodes is H(n,m) = (1/R) * Sum over r of h_r(n,m), where h_r(n,m) is the hitting time between the pair n, m on walk r and R is the number of walks.
The relative frequency of the average hitting time of those two nodes, H(n,m), with respect to all other normalized nodes linked to that noisy node is then L(n,m) = H(n,m) / Sum over i of H(n,m_i). This is one cost.
The other cost between a noisy word and a normalized word is the lexical similarity cost SimCost(n,m). This cost function is described as the ratio of the longest common subsequence ratio (LCSR) and the edit distance between the two strings.
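A sketch of this lexical similarity cost, assuming LCSR is the longest common subsequence length divided by the length of the longer string; that normalization and the guard against a zero edit distance are assumptions:

def lcs_length(a, b):
    # classic dynamic programming longest common subsequence
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def edit_distance(a, b):
    # Levenshtein distance via dynamic programming
    dp = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)] for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = min(dp[i-1][j] + 1, dp[i][j-1] + 1,
                           dp[i-1][j-1] + (a[i-1] != b[j-1]))
    return dp[len(a)][len(b)]

def sim_cost(noisy, normalized):
    lcsr = lcs_length(noisy, normalized) / max(len(noisy), len(normalized))
    return lcsr / max(edit_distance(noisy, normalized), 1)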
The algorithm to induce the lexicons is something like this:
The lexicon is first initialized.
for each node n in the graph
    if n is noisy
        initialize R(n)
        for i = 0 to K
        do
            R(n) = RandomWalk(n)
            L(n) = Normalize(R(n))
        calculate the lexical similarity cost:
        L(n) = SimCost(L(n))
        L(n) = Prune(L(n))
        Lexicon = ADD(L(n))
In the algorithm above, we first initialize the lexicon and add to it by performing iterations. As we do the random walks, we pick the more probable paths more often and select the normalization equivalence candidates. Then we calculate the average hitting times and normalize them. We find the similarity costs, and together with the average hitting times, we construct the lexicon from the entries with the final costs and prune it by selecting the top few.
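A minimal runnable sketch of this loop, assuming uniform transition probabilities, a fixed number of walks per noisy word, and a simple additive combination of the two costs; all of these are simplifications rather than the paper's exact formulation, and sim_cost here is a stub that could be replaced by the LCSR/edit-distance ratio sketched earlier:

import random
from collections import defaultdict

def sim_cost(noisy, normalized):
    # stub: any lexical similarity, e.g. the LCSR / edit distance ratio
    return len(set(noisy) & set(normalized)) / max(len(set(noisy) | set(normalized)), 1)

def random_walk(graph, start, normalized, max_steps=6):
    """One walk from a noisy word over the bipartite graph (graph maps each
    node, word or context, to its neighbors); returns the normalized word
    reached and the number of steps, or None."""
    node, steps = start, 0
    while steps < max_steps:
        neighbors = list(graph.get(node, []))
        if not neighbors:
            return None
        node = random.choice(neighbors)
        steps += 1
        if node in normalized:
            return node, steps
    return None

def induce_lexicon(graph, noisy_words, normalized, walks=200, top_k=5):
    lexicon = {}
    for n in noisy_words:
        hits = defaultdict(list)
        for _ in range(walks):
            outcome = random_walk(graph, n, normalized)
            if outcome:
                m, steps = outcome
                hits[m].append(steps)
        if not hits:
            continue
        # average hitting time per candidate, normalized over all candidates
        avg = {m: sum(s) / len(s) for m, s in hits.items()}
        total = sum(avg.values())
        score = {m: (h / total) + (1.0 - sim_cost(n, m)) for m, h in avg.items()}
        # lower combined cost is better; keep the top few candidates
        lexicon[n] = sorted(score, key=score.get)[:top_k]
    return lexicon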

Sunday, June 1, 2014

Today we will cover some more from the paper we mentioned yesterday. The paper notes that social media text is noisy and contains typos, abbreviations, phonetic substitutions and slang.
The paper proposed a text normalization approach that can adapt to such variations automatically. The technique can adapt to different sources and any genre of social media. This is different from maintaining a list of out-of-vocabulary words, because such a list does not suit social media text: many words and named entities that do not exist in a given vocabulary should not be considered for normalization.
An OOV word may have many appropriate normalizations depending on the context and the domain. Also, text normalization is a pre-processing step and should have high accuracy.
The proposed technique addresses these challenges by finding the best normalization sequence according to an n-gram language model. The normalization candidates are automatically picked from unlabeled text data in an unsupervised manner. The approach is scalable, accurate and adaptive to any domain and language. It begins by constructing a lattice from the normalization candidates; only one-to-one word mappings are considered. The candidates are populated by two generators: one uses dictionary based spelling correction and the other is based on trie based approximate string matching. As an aside, we mentioned the trie based string approximation used in mobile devices in an earlier post. A trie is used for prefix or suffix based representation and organization of strings and enables lengthening, letter substitution, letter-number substitutions and other such approximations. Together with the trie based approximation and the spell checker, the words can be normalized.
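A sketch of trie based approximate string matching using a bounded edit distance walk over the trie; this is a common technique and the implementation details here are not the paper's:

class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None          # set when a complete word ends here

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root

def fuzzy_lookup(root, query, max_edits=1):
    """Return dictionary words within max_edits edit distance of query,
    walking the trie with a rolling edit-distance row."""
    results = []
    first_row = list(range(len(query) + 1))

    def walk(node, ch, prev_row):
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1, prev_row[j] + 1, prev_row[j - 1] + cost))
        if node.word is not None and row[-1] <= max_edits:
            results.append((node.word, row[-1]))
        if min(row) <= max_edits:
            for next_ch, child in node.children.items():
                walk(child, next_ch, row)

    for ch, child in root.children.items():
        walk(child, ch, first_row)
    return results

# e.g. fuzzy_lookup(build_trie(["tomorrow", "today"]), "tmrrow", max_edits=2)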
Dictionary based normalization methods prove inadequate for social media text for many reasons. First, they make overly general corrections. Second, they are context-less in the sense that they don't consider the nature of the word, only the minimum edit distance to any word or named entity. Third, social media text is dynamic, with slang getting introduced on a daily basis.
We now describe the graph based random walks mentioned in this paper. Here the authors use the intuition that normalization equivalences share similar context. By similar context, they mean a pattern in, say, n-grams where the words to the left and to the right are the same between equivalences. If we take a five-gram sequence of words, then the pattern is word1 word2 * word4 word5, where * marks the shared slot. This pattern can be represented by a bipartite graph where the first partite set represents the words and the second represents the n-gram contexts shared by the words. A word node is marked as either normalized or noisy, since this decision limits the candidate noisy words to be normalized.
The selection of candidates for normalization (noisy words) works something like this: a vocabulary is constructed from a large clean corpus, and any word that does not appear in the vocabulary more than a predefined threshold, e.g. 10 times, is a candidate.
The bipartite graph is composed of W, the set of nodes representing normalized and noisy words; C, the set of shared contexts; and E, the edges connecting word nodes and context nodes. The weight on an edge is the number of occurrences of the word in the context.
The algorithm to construct the graph is something like this:
Extract N-grams from the text corpus.
For each N-gram denoted by n:
   check whether the center word is a noisy or a normalized word
   if it is noisy, add it as a source node
   else add it as a sink node
   add the context node
   add the edge weight (context, word, count)
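A sketch of this construction in Python, assuming 5-grams and a clean vocabulary; the membership test against the vocabulary stands in for the frequency-threshold criterion described above:

from collections import defaultdict

def build_bipartite_graph(ngrams, vocabulary):
    """ngrams: iterable of 5-tuples of words.
    Returns (edge_weights, noisy_words, normalized_words)."""
    edge_weights = defaultdict(int)      # (word, context) -> count
    noisy, normalized = set(), set()
    for w1, w2, center, w4, w5 in ngrams:
        context = (w1, w2, "*", w4, w5)  # shared context with the center removed
        if center not in vocabulary:     # noisy word: a source node
            noisy.add(center)
        else:                            # normalized word: a sink node
            normalized.add(center)
        edge_weights[(center, context)] += 1
    return edge_weights, noisy, normalized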
Once the bipartite graph is constructed, we can generate lexicons using a Markov random walk. The goal is to identify pairs of noisy and normalized words that can be considered equivalences. A random walk starts at a noisy word and ends at a normalized word. The walker starts from a source node Ni and moves to a connected node Mj with probability Pij, which is the normalized co-occurrence count of the word and the corresponding context. Random walks are repeated until either a normalized form is found or k attempts are exhausted. There can be many paths between the same source and the same sink; paths are selected based on the transition probabilities. In nature this is similar to moving between different forms of words by edit distance.
We note that this graph based method uses similar context. In natural language processing applications, the context is often described as a set of words to the left and right of the candidate.
Is there a better way to describe context, not just based on adjacency in a sentence? This is where the vector space model and clustering helped. However, can we define a graph based on something other than word adjacency for the same notion of context similarity? Perhaps we could consider that words can map to different parts of speech, different thesauri, or different domain ontologies. Sometimes one solution does not work for all. This is probably where we could consider multiple measures at the same time. While the vector space model attempts to do this without explicit consideration of the factors contributing to the word pair relationships, we could attempt to have graph based explicit considerations for each.