Wednesday, June 4, 2014

In the paper titled "Merging word senses" by Bhagwani, Satapathy and Karnick, a semi-supervised approach to learning similarities between WordNet synsets using a graph-based recursive similarity definition is presented. The learnt similarity provides sense-similarity scores for all word sense pairs, from which coarse sense inventories can be produced at arbitrary granularities.
Unlike previous approaches, which generated a coarse sense inventory by merging fine-grained senses, this approach proposes a framework for learning a synset similarity metric. The problems with the earlier approaches included the following: first, they require a stopping criterion for each word, such as the number of final classes, and this cannot usually be predetermined. Second, inconsistent clusters are obtained because coarse senses are generated independently for each word, and the inconsistent clustering causes transitive closure errors. The approach discussed in this paper improves on the coarsening of noun synsets. To learn similarities between synset pairs that do not share a word, the authors use a variant of the SimRank framework that gives such pairs a non-zero similarity. SimRank is a graph-based similarity measure applicable in any domain with object-to-object relationships; its intuition is that two objects are similar if they are related to similar objects. The SimRank equations can be solved to find a similarity between all pairs of objects, as shown by Jeh and Widom (2002). For a graph G(V,E) the solution is reached by iterating to a fixed point, keeping |V|^2 entries per iteration, where S_k(a,b) is the estimate of the similarity between a and b at the kth iteration. This works well when we have complete information on the objects.
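As a toy illustration of that fixed-point iteration, here is a minimal sketch of plain SimRank in Python (not the personalized variant used in the paper; the graph, decay factor C and iteration count are illustrative assumptions):

def simrank(graph, C=0.8, iterations=10):
    # graph: {node: set of neighboring nodes}; undirected for simplicity,
    # whereas SimRank is usually stated over in-neighbors of a directed graph
    nodes = list(graph)
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new_sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b or not graph[a] or not graph[b]:
                    continue
                total = sum(sim[i][j] for i in graph[a] for j in graph[b])
                # S_k+1(a,b) = C / (|N(a)||N(b)|) * sum of S_k over neighbor pairs
                new_sim[a][b] = C * total / (len(graph[a]) * len(graph[b]))
        sim = new_sim
    return sim

# hypothetical toy graph of synsets linked through a shared hypernym
g = {"dog.n.01": {"canine.n.02"}, "wolf.n.01": {"canine.n.02"},
     "canine.n.02": {"dog.n.01", "wolf.n.01"}}
print(simrank(g)["dog.n.01"]["wolf.n.01"])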
In many scenarios, this information may not be available. So SimRank is personalized: where the similarities of some pairs are known, this approach fixes them in the set of equations and lets the rest of the values be learnt automatically by the system. For the supervised labels, the mappings to the Omega ontology are used. To coarsen WordNet, an undirected graph is constructed whose vertices are the synsets of WordNet and whose edge set E comprises edges obtained by thresholding the similarity metric learnt with the personalized SimRank model. On this graph, connected components are found, which gives a partition over synsets. All the senses of a word occurring in the same component are grouped as a single coarse sense. Lowering the threshold yields denser graphs with fewer connected components, and a smaller number of components translates into coarser senses. The threshold therefore provides a way to control the granularity of the coarse senses.
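As a rough sketch of that coarsening step, assuming a learnt pairwise similarity function and using networkx for the connected components (both are illustrative choices, not the paper's code):

import networkx as nx

def coarsen(synsets, similarity, threshold):
    # connect synset pairs whose learnt similarity clears the threshold
    G = nx.Graph()
    G.add_nodes_from(synsets)
    for i, a in enumerate(synsets):
        for b in synsets[i + 1:]:
            if similarity(a, b) >= threshold:
                G.add_edge(a, b)
    # each connected component becomes one coarse sense grouping
    return list(nx.connected_components(G))

# lowering the threshold densifies the graph, yielding fewer and coarser components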
We now look at feature engineering. The features are broadly classified into two parts: those derived from the structure of WordNet and those derived from other corpora. The features derived from WordNet are further subdivided into similarity measures and synset/sense-based features. Among the WordNet similarity measures, the authors use path-based similarity measures. The synset- and sense-based features include the number of lemmas common to the two synsets, the maximum polysemy degree among the lemmas shared by the synsets, whether the two synsets have the same lexicographer file, the number of common hypernyms, etc. Hypernyms/hyponyms denote super/sub-ordinate relationships in the structure of WordNet.
Polysemous means having more than one sense within the same syntactic category. Features derived from external corpora include the score of a synset with respect to 169 hierarchically organized domain labels, as available from the eXtended WordNet Domains project. BabelNet is another external resource; it provides translations of noun word senses in six languages.
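A few of these WordNet-derived features can be sketched with NLTK's wordnet interface; the exact feature set and weighting belong to the paper, so treat this as an illustrative subset:

from nltk.corpus import wordnet as wn

def synset_pair_features(s1, s2):
    lemmas1, lemmas2 = set(s1.lemma_names()), set(s2.lemma_names())
    shared = lemmas1 & lemmas2
    return {
        "path_similarity": s1.path_similarity(s2),
        "wup_similarity": s1.wup_similarity(s2),
        "common_lemmas": len(shared),
        # polysemy degree of a lemma = number of synsets it participates in
        "max_shared_polysemy": max((len(wn.synsets(l)) for l in shared), default=0),
        "same_lexicographer_file": s1.lexname() == s2.lexname(),
        "common_hypernyms": len(s1.common_hypernyms(s2)),
    }

print(synset_pair_features(wn.synset("car.n.01"), wn.synset("truck.n.01")))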

Tuesday, June 3, 2014

We will continue our discussion on graph-based random walks today. We saw how we can establish relationships in a bipartite graph, how to perform random walks, and how to compute costs.
What we will see next is how this translates to keyword extraction. This part is more of a conjecture at this point. We want to reuse the concept of contextual similarity, but we want to define the context as something other than just adjacent words. We don't want to use a local vector space model either, because it would be computationally expensive to do both the random walk and VSM clustering. Is there a way to define collocation as something other than adjacent words or a VSM?
One suggestion is that we use something like co-occurrence. Collocation is different from co-occurrence: one defines proximity by boundaries, while the other defines counts within groupings. If we can come up with different groups of words and we find patterns in them, such as words that tend to repeat together, that is co-occurrence. How we discover the different groups and patterns is closely associated with clustering.
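To make the distinction concrete with my own toy example: collocation counts neighbors within a boundary such as adjacency, while co-occurrence counts words that show up together in the same group, e.g. the same sentence.

from collections import Counter
from itertools import combinations

sentences = [["cloud", "storage", "costs"], ["storage", "costs", "rise"]]

# collocation: pairs of adjacent words within a sentence
collocation = Counter((w1, w2) for s in sentences for w1, w2 in zip(s, s[1:]))

# co-occurrence: any pair of distinct words sharing the same sentence
cooccurrence = Counter(p for s in sentences for p in combinations(sorted(set(s)), 2))

print(collocation.most_common(2))
print(cooccurrence.most_common(2))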

Monday, June 2, 2014

When we discussed the random-walk-based lexicon building algorithm from the earlier post, we had a few stages of computation.
First we built the bipartite graph (this was mentioned in the post before the last one).
We built it this way:
     We extracted N-grams from the text corpus.
     For each N-gram denoted by n:
         we checked whether the center word is a noisy word or a normalized word;
         if it is noisy we add it to the source nodes,
         else we add it to the sink nodes;
         we add the context node, and
         then we add the edge weight (context, word, count).
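A rough Python sketch of this construction, assuming 5-grams and a simple dictionary representation (the names here are mine, not the paper's):

from collections import defaultdict

def build_bipartite_graph(five_grams, vocabulary):
    # five_grams: iterable of 5-tuples of tokens; vocabulary: set of clean words
    edges = defaultdict(int)          # (word, context) -> co-occurrence count
    noisy, normalized = set(), set()  # source word nodes and sink word nodes
    for w1, w2, center, w4, w5 in five_grams:
        context = (w1, w2, w4, w5)    # the center slot is left open
        if center in vocabulary:
            normalized.add(center)    # sink node
        else:
            noisy.add(center)         # source node
        edges[(center, context)] += 1
    return noisy, normalized, edges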
Second, we perform the random walk. The goal is to identify pairs of noisy and normalized words that can be considered equivalences. A walk starts at a noisy word and ends at a normalized word. We then normalize the results for each noisy node.
Third, we calculate the costs and then prune the list to the top few.
When we calculated the cost, we factored in a component for lexical similarity. This cost function was described as the ratio of the longest common subsequence ratio (LCSR) to the edit distance between the two strings.
We had a cost for the random walk as well, namely the hitting time between the two nodes. We averaged this hitting time over the walks and normalized it against that of all the other nodes linked to the noisy node.
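A small sketch of that averaging and normalization, assuming we have recorded the step counts of the walks that reached each candidate (my own framing of the bookkeeping):

def normalized_hitting_costs(walk_lengths):
    # walk_lengths: {normalized_word: [steps of each walk that reached it]}
    # for a single noisy word; returns the relative cost per candidate
    H = {m: sum(steps) / len(steps) for m, steps in walk_lengths.items()}
    total = sum(H.values())
    return {m: h / total for m, h in H.items()}

print(normalized_hitting_costs({"tomorrow": [2, 4], "tomato": [6]}))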
In the procedure above, we iterate to build up the lexicon list. By doing these repeated iterations we choose the most probable paths and refine the results.
This way we now have a list of the top few contextually relevant words.
This method lets us use pairwise contextual similarity to normalize social media text.



Having discussed the paper on normalizing social media text in the previous post, we will conclude reviewing it today. We mentioned that the approach in the paper uses a bipartite graph and performs random walks to connect words to their normalized equivalences. We mentioned briefly how this helps us identify pairwise relationships between words and their contexts. We were also looking to establish contextual similarity in other ways. That requires exploiting several kinds of relationships, such as semantic relationships, part-of-speech relationships, categorization or ontology-based groupings, etc. Perhaps we could also consider local vector space models instead of relying only on adjacent words in a sentence. That would mean using a local vector space model to establish contextual similarity and a global graph and random-walk model to establish pairwise relationships.
That, however, is a different topic; we will just finish reviewing the paper first. The number of steps taken to traverse between two nodes is called the hitting time. The average hitting time over all walks that connect a source node n and a sink node m is H(n,m) = Sum_r h_r(n,m) / R, where h_r(n,m) is the hitting time between the pair (n,m) on walk r and R is the number of walks.
The relative weight of the average hitting time of those two nodes against all other nodes linked to that noisy node is then L(n,m) = H(n,m) / Sum_i H(n,m_i). This is one cost.
The other cost between a noisy word and a normalized word is the lexical similarity cost SimCost(n,m). This cost function is described as the ratio of the longest common subsequence ratio (LCSR) to the edit distance between the two strings.
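Here is a hedged sketch of that lexical similarity cost with standard dynamic-programming helpers; the paper's exact definition may differ in details, so this only mirrors the ratio described above:

def lcs_length(a, b):
    # length of the longest common subsequence of strings a and b
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def edit_distance(a, b):
    # Levenshtein distance with a rolling row
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def sim_cost(noisy, normalized):
    # SimCost(n, m) = LCSR / edit distance, LCSR = LCS length / longer string length
    lcsr = lcs_length(noisy, normalized) / max(len(noisy), len(normalized))
    return lcsr / max(edit_distance(noisy, normalized), 1)

print(sim_cost("2morrow", "tomorrow"))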
The algorithm to induce the lexicons is something like this:
The lexicon is first initialized.
for each node n in the graph
     if n is a noisy node
         Initialize R(n)
         for i = 0 to K
         do
             R(n) = R(n) + RandomWalk(n)
         L(n) = Normalize(R(n))
         Calculate the lexical similarity cost
         L(n) = SimCost(L(n))
         L(n) = Prune(L(n))
         Lexicon = Add(L(n))
In the algorithm above, we first initialize the lexicon and then add to it iteratively. As we do the random walks, the more probable paths are taken more often, and this selects the normalization equivalence candidates. We then calculate the average hitting times and normalize them. We find the similarity costs and, together with the normalized hitting times, construct the lexicon entries with their final costs and prune them by selecting the top few.
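Pulling the pieces together, a hedged end-to-end sketch might look like the following; random_walk is a placeholder walker, normalized_hitting_costs and sim_cost are the helpers sketched elsewhere in these posts, and multiplying the two costs and keeping the highest-scoring candidates is my simplification of how the paper combines and prunes them:

def induce_lexicon(noisy_words, graph, normalized_words, K=100, top_n=5):
    lexicon = {}
    for n in noisy_words:
        walk_lengths = {}
        for _ in range(K):
            m, steps = random_walk(graph, n, normalized_words)  # placeholder walker
            if m is not None:
                walk_lengths.setdefault(m, []).append(steps)
        if not walk_lengths:
            continue
        L = normalized_hitting_costs(walk_lengths)   # average and normalize the hits
        scored = {m: cost * sim_cost(n, m) for m, cost in L.items()}
        # keep only the top few candidates per noisy word
        lexicon[n] = sorted(scored, key=scored.get, reverse=True)[:top_n]
    return lexicon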

Sunday, June 1, 2014

Today we will cover some more of the paper we mentioned yesterday. The paper notes that social media text is noisy and contains typos, abbreviations, phonetic substitutions and slang.
The paper proposes a text normalization approach that adapts to such variations automatically. The technique can adapt to different sources and any genre of social media. This is different from simply listing out-of-vocabulary (OOV) words, because such a list does not suit social media text: many words and named entities that do not exist in a given vocabulary should not be considered for normalization.
An OOV word may also have many appropriate normalizations depending on the context and the domain. Further, text normalization is a pre-processing step and should therefore have high accuracy.
The proposed technique addresses these challenges by finding the best normalization sequence according to an n-gram language model. The normalization candidates are picked automatically from unlabeled text data in an unsupervised manner. The approach is scalable, accurate and adaptive to any domain and language. It begins by constructing a lattice from the normalization candidates; only one-to-one word mappings are considered. The candidates are populated by two generators: one uses dictionary-based spelling correction and the second is based on trie-based approximate string matching. As an aside, we mentioned the trie-based string approximation used in mobile devices in an earlier post. A trie provides a prefix- or suffix-based representation and organization of strings that enables matching under lengthening, letter substitutions, letter-number substitutions and other such approximations. Together, the trie-based approximation and the spell checker produce candidate normalizations.
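As a side illustration of that second generator, here is a bare-bones trie with an edit-distance-bounded lookup; it is my own simplification, not the paper's generator, and the word list and thresholds are made up:

class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None                 # set on nodes that end a dictionary word

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root

def fuzzy_lookup(root, query, max_edits=2):
    # depth-first walk of the trie, carrying one row of the edit-distance table
    results = []
    def walk(node, ch, prev_row):
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            row.append(min(row[j - 1] + 1,                           # insertion
                           prev_row[j] + 1,                          # deletion
                           prev_row[j - 1] + (query[j - 1] != ch)))  # substitution
        if node.word is not None and row[-1] <= max_edits:
            results.append((node.word, row[-1]))
        if min(row) <= max_edits:                                    # prune hopeless branches
            for c, child in node.children.items():
                walk(child, c, row)
    for c, child in root.children.items():
        walk(child, c, list(range(len(query) + 1)))
    return results

trie = build_trie(["tomorrow", "borrow", "sorrow"])
print(fuzzy_lookup(trie, "2morow", max_edits=3))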
Dictionary-based normalization methods prove inadequate for social media text for several reasons. First, the corrections they make are too general. Second, they are context-free in the sense that they don't consider the nature of the word, only the minimum edit distance to some word or named entity. Third, social media text is dynamic, with new slang introduced on a daily basis.
We now describe the graph-based random walks mentioned in this paper. The authors rely on the observation that normalization equivalences share similar context. By similar context, they mean a pattern in, say, n-grams where the words to the left and to the right are the same between equivalences. If we take a five-gram sequence of words, the pattern looks like word1 word2 * word4 word5, where * stands for the center word. This can be represented by a bipartite graph where one partition holds the words and the other holds the n-gram contexts shared by the words. A word node is marked as either normalized or noisy, and this decision limits which candidate noisy words get normalized.
The selection of candidates for normalization (noisy words) goes something like this: a vocabulary is constructed from a large clean corpus, and any word that does not appear in that corpus more than a predefined threshold, e.g. 10 times, is a candidate.
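A trivially small sketch of that selection, with the threshold of 10 taken from the description above and the rest of the names being mine:

from collections import Counter

def noisy_candidates(social_tokens, clean_corpus_tokens, threshold=10):
    # words seen at most `threshold` times in the clean corpus are candidates
    counts = Counter(clean_corpus_tokens)
    return {t for t in social_tokens if counts[t] <= threshold}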
The bipartite graph is composed of W, the set of nodes representing normalized and noisy words; C, the set of nodes representing the shared contexts; and E, the edges connecting word nodes and context nodes. The weight on an edge is the number of occurrences of the word in the context.
The algorithm to construct the graph goes something like this:
Extract N-grams from the text corpus.
Foreach N-gram denoted by n
   check whether the center word is a noisy or a normalized word
   if it is noisy, add it to the source nodes
   else add it to the sink nodes
   add the context node
   add the edge weight (context, word, count)
Once the bipartite graph is constructed, we can generate lexicons using Markov random walks. The goal is to identify pairs of noisy and normalized words that can be considered equivalences. A random walk starts at a noisy word and ends at a normalized word. The walker starts from a source node Ni and moves to a connected node Mj with probability Pij, which is the normalized co-occurrence count of the word and the corresponding context. Steps are repeated until either a normalized form is found or k attempts are exhausted. There can be many paths between the same source and the same sink; the paths taken are governed by the transition probabilities. In nature, this is similar to moving between different forms of a word within a small edit distance.
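A sketch of a single such walk, with the transition probability taken as the normalized co-occurrence counts on the edges (parameter names and the step budget are my own choices):

import random

def random_walk(graph, start, normalized_words, max_steps=10):
    # graph maps each node (word or context) to {neighbor: co-occurrence count};
    # word nodes and context nodes alternate along the walk
    node, steps = start, 0
    while steps < max_steps:
        neighbors = graph[node]
        # pick the next node with probability proportional to the edge weight
        node = random.choices(list(neighbors), weights=list(neighbors.values()))[0]
        steps += 1
        if node in normalized_words:
            return node, steps          # hit a sink (normalized word)
    return None, steps                  # budget exhausted without normalization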
We note that this graph-based method relies on similar context. In natural language processing applications, the context is often described as the set of words to the left and right of the candidate.
Is there a better way to describe context that is not just based on adjacency in a sentence? This is where the vector space model and clustering helped. However, can we define a graph based on something other than word adjacency and still capture context similarity? Perhaps we could consider that words map to different parts of speech, different thesaurus entries or different domain ontologies. Sometimes one solution does not work for all cases, and this is probably where we could consider multiple measures at the same time. While the vector space model attempts this without explicit consideration of the factors contributing to the word-pair relationships, we could attempt graph-based explicit consideration of each factor.

Saturday, May 31, 2014

We read the paper "Social Text Normalization using Contextual Graph Random Walks" by Hassan and Menezes. This paper describes text normalization, which comes in useful for keyword extraction and summarization as well. Text normalization does away with noise, typos, abbreviations, phonetic substitutions and slang, so the cleaned data is uniform and more meaningful. Their approach uses random walks on a contextual similarity bipartite graph constructed from n-gram sequences over a large unlabeled text corpus.
Natural language processing applications often have to deal with noisy text. Dirty text causes many problems: it can break machine learning algorithms, skew the results and fail to be representative of the content. Text-processing results are often measured by parameters such as precision and recall, which we discussed in earlier posts and which are generally meant to determine the relevancy of the results. We saw that these measures are directly affected by the count of successful matches. If matches are missed because dirty text gets discounted or skipped, or, even worse, included where it should not be, then the results are no longer accurate.
Furthermore, we saw some elementary steps to mitigate this. For example, we considered stemming and lemmatization to discover root words and conform different representations to the same word. Such techniques, however, can be overly general: if the variations are also supposed to differentiate words, then stemming and lemmatization alone do not help.
There are other techniques, such as a lookup table relevant to the domain of the subject. Such lookup tables simplify the language a great deal in that we now have fewer and more consistent words.
Lookup tables are an interesting idea: we can associate one or more variations of a word with the same normalized term. This lets us add new variations, such as slang, into the table for lookup. Every time we see a new term, we can include it in the table for later use.
While lookup tables are great for associating terms with their normalized word, the normalized word is neither the lemma nor necessarily the meaning of the word.
To improve on that, we could extend the idea to use a thesaurus or an ontology. These organize words by their related semantics, and an ontology goes further in describing a hierarchy or even categorizing terms. As you can see, these refinements are meant to bring dirty text into the mainstream so that it counts towards the measures we have for language processing applications.


I will continue my post on the text graphs paper later today.