Monday, June 2, 2014

When we discussed the random walk based lexicon building algorithm in the earlier posts, we had a few stages of computation.
First, we built the bipartite graph (this was described in the post before the last one).
We built it this way:
     We extracted N-grams from the text corpus.
     For each N-gram n,
         we checked whether the center word is a noisy or a normalized word;
         if it is noisy, we added it as a source node,
         otherwise we added it as a sink node;
         we then added the context node and
         the weighted edge between the word and the context.
Second, we performed the random walks. The goal was to identify pairs of noisy and normalized words that can be considered equivalences. A walk starts at a noisy word and ends at a normalized word. We then normalized the results for each noisy node.
Third, we calculated the costs and then pruned the list to the top few candidates.
When we calculated the cost, we factored in a component for lexical similarity. This cost function was described as the ratio of the longest common subsequence ratio to the edit distance between the two strings.
We also had a cost for the random walk, called the hitting time between the two nodes. We averaged this hitting time over the walks between a pair of nodes and normalized it against the hitting times of all other normalized nodes linked to that noisy node.
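As a rough illustration, here is a minimal sketch of that similarity cost in Python. It assumes the longest common subsequence ratio is the LCS length divided by the length of the longer string and uses the Levenshtein edit distance; the function names are ours, not the paper's.

def lcs_len(a, b):
    # classic dynamic programming for the longest common subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def edit_distance(a, b):
    # Levenshtein distance kept to a single rolling row
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, row[0] = row[0], i
        for j, cb in enumerate(b, 1):
            prev, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1, prev + (ca != cb))
    return row[len(b)]

def sim_cost(noisy, normalized):
    # ratio of the longest common subsequence ratio (LCSR) to the edit distance
    lcsr = lcs_len(noisy, normalized) / float(max(len(noisy), len(normalized)))
    distance = edit_distance(noisy, normalized)
    return lcsr / distance if distance else float('inf')

print(sim_cost("tmrw", "tomorrow"))   # (4/8) / 4 = 0.125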
In the algorithm above, we first see that we build up the lexicon by doing iterations. By repeating the random walks we choose the most probable paths and refine the results.
This way we end up with a list of the top few contextually relevant normalization candidates.
This method lets us use pairwise contextual similarity to normalize social media text.



Having discussed the paper on normalizing social media text in the previous post, we will conclude reviewing it today. We mentioned that the approach in the paper uses a bipartite graph and performs random walks to connect words to their normalized equivalences. We briefly mentioned how this helps us identify pairwise relationships between words and their contexts. We were also looking to establish contextual similarity in other ways. This requires exploiting several kinds of relationships, such as semantic relationships, part-of-speech relationships, categorization or ontology based groupings, and so on. Perhaps we could also consider local vector space models instead of relying only on adjacent words in a sentence. This would mean using a local vector space model to establish contextual similarity and a global graph and random walk model to establish the pairwise relationships.
That is a subject for a separate discussion, however, so we will just finish reviewing the paper first. The number of steps taken to traverse between two nodes is called the hitting time. The average hitting time over all walks that connect a source node n and a sink node m is H(n, m) = Σ_r h_r(n, m) / R, where h_r(n, m) is the hitting time between the pair (n, m) on walk r and R is the number of walks.
The relative frequency of the average hitting time of those two nodes against all other normalized nodes m_i linked to that noisy node is then L(n, m) = H(n, m) / Σ_i H(n, m_i). This is one cost.
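To make that concrete, here is a small Python sketch (our own naming, not code from the paper) that averages per-walk hitting times into H(n, m) and then normalizes them across all normalized words reached from each noisy node:

from collections import defaultdict

def average_hitting_times(walks):
    # walks: (noisy, normalized, steps) tuples, one per random walk r
    totals, counts = defaultdict(float), defaultdict(int)
    for noisy, normalized, steps in walks:
        totals[(noisy, normalized)] += steps
        counts[(noisy, normalized)] += 1
    # H(n, m) = sum over r of h_r(n, m), divided by the number of walks R
    return {pair: totals[pair] / counts[pair] for pair in totals}

def normalize_per_noisy_word(H):
    # L(n, m) = H(n, m) / sum over i of H(n, m_i)
    sums = defaultdict(float)
    for (noisy, _), h in H.items():
        sums[noisy] += h
    return {(noisy, norm): h / sums[noisy] for (noisy, norm), h in H.items()}

walks = [("gr8", "great", 2), ("gr8", "great", 4), ("gr8", "grate", 6)]
print(normalize_per_noisy_word(average_hitting_times(walks)))
# {('gr8', 'great'): 0.333..., ('gr8', 'grate'): 0.666...}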
The other cost between a noisy word and a normalized word is the lexical similarity cost SimCost(n, m). This cost function is described as the ratio of the longest common subsequence ratio (LCSR) to the edit distance between the two strings.
The algorithm to induce the lexicon is something like this:
    Initialize the Lexicon
    for each node n in the graph
        if n is a noisy word
            initialize R(n)
            for i = 0 to K
            do
                R(n) = R(n) + RandomWalk(n)
            L(n) = Normalize(R(n))
            calculate the lexical similarity cost
            L(n) = SimCost(L(n))
            L(n) = Prune(L(n))
            Lexicon = Add(Lexicon, L(n))
In the algorithm above we see that we first initialize the lexicon and add to it by performing iterations. As we do the random walks, we pick the more probable paths more often and so select the normalization equivalence candidates. Then we calculate the average hitting times and normalize them. We find the similarity costs and, together with the normalized hitting times, construct the lexicon entries with their final costs and prune them by selecting the top few.
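A compact Python sketch of that loop follows. The walk and cost functions are passed in as plain callables because the paper does not spell out those internals here, and to keep the sketch short we score each normalized word by how often the walks reach it rather than by the averaged hitting time; every name below is ours.

from collections import Counter

def induce_lexicon(graph, noisy_nodes, random_walk, sim_cost, K=100, top=5):
    # graph: the bipartite graph; noisy_nodes: the words we want to normalize
    lexicon = {}
    for n in noisy_nodes:
        hits = Counter()                       # R(n): how often each normalized word is reached
        for _ in range(K):
            m = random_walk(graph, n)          # one walk from the noisy word to a normalized word
            if m is not None:
                hits[m] += 1
        total = float(sum(hits.values())) or 1.0
        scored = {m: (c / total) * sim_cost(n, m) for m, c in hits.items()}   # normalize, then fold in SimCost
        lexicon[n] = sorted(scored, key=scored.get, reverse=True)[:top]       # prune to the top few
    return lexicon

# toy usage with stand-in walk and cost functions, purely for illustration
toy_walk = lambda graph, n: "great"
toy_cost = lambda n, m: 1.0
print(induce_lexicon({}, ["gr8"], toy_walk, toy_cost, K=10))   # {'gr8': ['great']}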

Sunday, June 1, 2014

Today we will cover some more from the paper we mentioned yesterday. The paper notes that social media text is noisy and contains typos, abbreviations, phonetic substitutions and slang.
The paper proposes a text normalization approach that can adapt to such variations automatically. The technique can adapt to different sources and genres of social media. This is different from listing out-of-vocabulary words, because such a list does not suit social media text: many words and named entities that do not exist in a given vocabulary should not be considered for normalization.
An OOV word may have many appropriate normalizations depending on the context and the domain. Also, text normalization is a pre-processing step and should have high accuracy.
Given these challenges, the proposed technique addresses them by finding the best normalization sequence according to an n-gram language model. The normalization candidates are picked automatically from unlabeled text data in an unsupervised manner. The approach is scalable, accurate and adaptive to any domain and language. It begins by constructing a lattice from the normalization candidates. Only one-to-one word mappings are considered. The candidates are populated with two generators: one uses dictionary based spelling correction and the second is based on trie based approximate string matching. As an aside, we mentioned the trie based string approximation used in mobile devices in an earlier post. A trie is used for prefix or suffix based representation and organization of strings, and it enables lengthening, letter substitutions, letter-number substitutions and other such approximations. Together with the trie based approximation and the spell checker, the words can be normalized. A sketch of the trie based matching follows.
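Here is a minimal illustration of trie based approximate string matching in Python (our own construction, not the paper's code): it walks a character trie depth-first, carrying one row of the edit-distance table, and returns the dictionary words within a given distance of the noisy word.

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = w                          # end-of-word marker storing the full word
    return root

def approximate_matches(trie, word, max_dist=2):
    # depth-first search over the trie, updating one row of the edit-distance table per node
    results = []
    first_row = list(range(len(word) + 1))

    def recurse(node, ch, prev_row):
        row = [prev_row[0] + 1]
        for j in range(1, len(word) + 1):
            cost = 0 if word[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1, prev_row[j] + 1, prev_row[j - 1] + cost))
        if '$' in node and row[-1] <= max_dist:
            results.append((node['$'], row[-1]))
        if min(row) <= max_dist:               # prune branches that can no longer come back under the threshold
            for next_ch, child in node.items():
                if next_ch != '$':
                    recurse(child, next_ch, row)

    for ch, child in trie.items():
        if ch != '$':
            recurse(child, ch, first_row)
    return sorted(results, key=lambda pair: pair[1])

trie = build_trie(["tomorrow", "borrow", "sorrow", "narrow"])
print(approximate_matches(trie, "tmrw", max_dist=4))   # [('tomorrow', 4), ...] with this loose demo threshold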
Dictionary based normalization methods prove inadequate for social media text for many reasons. First, the corrections they make are too general. Second, they are context-free in the sense that they don't consider the nature of the word, only the minimum edit distance to any vocabulary word or named entity. Third, social media text is dynamic, with new slang introduced on a daily basis.
We now describe the graph based random walks mentioned in this paper. Here the authors use the observation that normalization equivalences share similar context. By similar context, they mean a pattern in, say, an n-gram where the words to the left and to the right are the same between the equivalences. If we take a five-gram sequence of words, the pattern looks like word1 word2 * word4 word5, where * stands for the center word. This pattern can now be represented by a bipartite graph where the first partite represents the words and the second partite represents the n-gram contexts shared by the words. A word node is marked as either normalized or noisy, since this decision determines which words are candidates to be normalized.
The selection of candidates for normalizing (noisy words) is something like this. There is a vocabulary constructed from a large clean corpus. Any word that does not appear in the vocabulary more than a predefined threshold, say 10 times, is a candidate, as in the sketch below.
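In Python this selection might look like the following (the names and counts are ours; the clean vocabulary counts would come from the large clean corpus):

from collections import Counter

def noisy_candidates(corpus_tokens, clean_vocab_counts, threshold=10):
    # a word is a candidate for normalization if it does not appear
    # in the clean vocabulary more than `threshold` times
    return {w for w in set(corpus_tokens) if clean_vocab_counts.get(w, 0) <= threshold}

clean = Counter({"great": 500, "tomorrow": 350, "gr8": 2})
print(noisy_candidates(["gr8", "great", "tmrw"], clean))   # {'gr8', 'tmrw'}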
The bipartite graph is composed of W which includes all nodes representing normalized words and noisy words, C that represents the shared context and E that represents the edges connecting word nodes and context nodes. The weight on the edges is the number of occurrences of a word in a context.
The algorithm to construct the graph is something like this:
    Extract N-grams from the text corpus.
    For each N-gram n
        check if the center word is a noisy or a normalized word
        if it is noisy, add it as a source node
        else add it as a sink node
        add the context node
        add the edge with its weight (context, word, count)
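A rough Python sketch of this construction, using a five-gram window and plain dictionaries for the two partites; is_noisy stands in for the vocabulary-threshold check described above, and all of the names are ours.

from collections import defaultdict

def build_bipartite_graph(tokens, is_noisy, n=5):
    # edges[(context, word)] holds the co-occurrence count; the context is the
    # n-gram with its center word replaced by '*', e.g. ('see', 'you', '*', 'at', 'noon')
    sources, sinks = set(), set()
    edges = defaultdict(int)
    half = n // 2
    for i in range(half, len(tokens) - half):
        center = tokens[i]
        context = tuple(tokens[i - half:i]) + ('*',) + tuple(tokens[i + 1:i + half + 1])
        if is_noisy(center):
            sources.add(center)        # noisy words become source nodes
        else:
            sinks.add(center)          # normalized words become sink nodes
        edges[(context, center)] += 1  # edge weight = occurrences of the word in that context
    return sources, sinks, edges

tokens = "see you tmrw at noon . see you tomorrow at noon .".split()
sources, sinks, edges = build_bipartite_graph(tokens, lambda w: w == "tmrw")
print(sources, sinks)   # 'tmrw' and 'tomorrow' end up sharing the context ('see', 'you', '*', 'at', 'noon')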
Once the bipartite graph is constructed, we can generate lexicons using Markov random walks. The goal is to identify pairs of noisy and normalized words that can be considered equivalences. A random walk starts at a noisy word and ends at a normalized word. The walker starts from a source node Ni and moves to a connected node Mj with probability Pij, which is the normalized co-occurrence count of the word and the corresponding context. Random walks are repeated until either a normalized form is found or k attempts are exhausted. There can be many paths between the same source and the same sink, and the paths are selected based on the transition probabilities. In spirit this is similar to moving between different forms of a word as with edit distances.
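Here is a minimal sketch of one such walk in Python, alternating between word and context nodes and weighting each hop by the co-occurrence counts; the data layout and the use of random.choices are our choices, not the paper's.

import random

def random_walk(word_to_contexts, context_to_words, sinks, start, max_steps=10):
    # hop word -> context -> word, each time in proportion to the edge weights,
    # and stop as soon as a normalized (sink) word is reached
    node, steps = start, 0
    while steps < max_steps:
        contexts, weights = zip(*word_to_contexts[node].items())
        ctx = random.choices(contexts, weights=weights)[0]
        words, weights = zip(*context_to_words[ctx].items())
        node = random.choices(words, weights=weights)[0]
        steps += 2
        if node in sinks:
            return node, steps         # steps is the hitting time for this walk
    return None, steps                 # gave up without reaching a normalized word

# toy graph: 'tmrw' and 'tomorrow' share one context node
ctx = ('see', 'you', '*', 'at', 'noon')
word_to_contexts = {'tmrw': {ctx: 1}, 'tomorrow': {ctx: 1}}
context_to_words = {ctx: {'tmrw': 1, 'tomorrow': 1}}
print(random_walk(word_to_contexts, context_to_words, {'tomorrow'}, 'tmrw'))   # e.g. ('tomorrow', 2)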
We note that this graph based method uses similar context. In natural language processing applications, the context is often described as a set of words to the left and right of the candidate.
Is there a better way to describe context, not just based on adjacency in a sentence? This is where the vector space model and clustering helped. However, can we define a graph based on something other than word adjacency for the same context similarity? Perhaps we could consider that words can map to different parts of speech, different thesaurus entries or different domain ontologies. Sometimes one solution does not work for all. This is probably where we could consider multiple measures at the same time. While the vector space model attempts to do this without explicit consideration of the factors contributing to the word pair relationships, we could attempt to have graph based explicit considerations for each.

Saturday, May 31, 2014

We read a paper on Social Text Normalization using Contextual Graph Random Walks by Hassan and Menezes. This paper describes text normalization, which also comes in useful for keyword extraction and summarization. Text normalization does away with noise, typos, abbreviations, phonetic substitutions and slang. This means the cleaned data will be uniform and more meaningful. Their approach uses random walks on a contextual similarity bipartite graph constructed from n-gram sequences over a large unlabeled text corpus.
Natural language processing applications often have to deal with noisy text. Text that is dirty has a lot of problems in that it can break the machine learning algorithms, skew the results and may not be representative of the content. Text results are often measured by parameters such as precision and recall. We discussed these in our earlier posts and they are generally meant to determine the relevancy of the results. We saw that these measures were directly affected by the count of successful matches. If the matches are not possible because dirty text gets discounted, skipped or even worse included where it should not be, then the results are no longer accurate.
Furthermore, we saw that we had some elementary steps to mitigate these. For example, we considered stemming and lemmatization to discover root words and conform different representations to the same word. Such techniques, however, can be overly general; if the variations are also meant to differentiate words, then such a technique alone does not help.
There are other techniques, such as a lookup table relevant to the domain of the subject. Such lookup tables simplify the language a great deal in that we now have fewer and more consistent words.
Lookup tables are an interesting idea: we can associate one or more variations of a word with the same normalized term. This lets us add new variations such as slang into the table for lookup. Every time we see a new term, we can include it in the table for later use. A trivial example follows.
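As a trivial illustration in Python (the entries here are made up):

# a tiny normalization lookup table; the entries are illustrative only
lookup = {"gr8": "great", "2moro": "tomorrow", "tmrw": "tomorrow", "b4": "before"}

def normalize(tokens, table):
    # replace a token with its normalized form when the table knows about it
    return [table.get(t.lower(), t) for t in tokens]

print(normalize("c u 2moro it will be gr8".split(), lookup))
# ['c', 'u', 'tomorrow', 'it', 'will', 'be', 'great']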
While lookup tables are great for associating terms with a normalized word, the normalized word is neither the lemma nor necessarily the meaning of the word.
To improve on that, we could extend the idea to use a thesaurus or an ontology. These organize words by their related semantics. An ontology goes further in describing a hierarchy or even categorizing the terms. As you can see, these refinements are meant to bring the dirty text into the mainstream so that it counts towards the measures we have for language processing applications.


I will continue my post on the text graphs paper later today.

Friday, May 30, 2014

In today's post we look at the paper on a graph based approach to skill extraction from text by Kivimaki, Panchenko, Dessy and Verdegem. This paper presents a system that outputs a list of professional skills, as defined on the LinkedIn social network, that are relevant to a given text. The text is compared for similarity with the texts of Wikipedia pages, and a Spreading Activation algorithm is then run on the Wikipedia graph in order to associate the input document with skills. We look at just this algorithm. The method consists of two phases: first the query document is associated with Wikipedia articles using a vector space model, in what is described as the text2wiki phase. Then, using these pages and the hyperlink graph of Wikipedia, the articles corresponding to skills are found and related in what is described as the wiki2skill phase. One caveat with this method is that it has to avoid overly general topics or skills, and this is done by biasing against hubs.
The text2wiki module relies on the Gensim library for text similarity functions based on traditional vector space models. Each text is represented as a vector in a space of the 300,000 most frequent terms in the corpus. This is then input to the wiki2skill module, where the vector of initial activations a(0) is iterated upon and the activation spreads into the neighbouring nodes; the initial activation marks the nodes of interest found in the first phase. At the end of each iteration, we propagate the activations using the formula:
a(t) = decay_factor · a(t−1) + λ · W^pulses · a(t−1) + c(t)

where the decay factor controls the conservation of activation over time, λ (Lambda) is the friction factor which controls the amount of activation that nodes can spread to their neighbours, and W is the weighted adjacency matrix.
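A small numeric sketch of one propagation step in Python with numpy, treating W as a plain weighted adjacency matrix and ignoring the pulses exponent for simplicity; the three-node graph and all of the values are made up.

import numpy as np

def spread_step(a_prev, W, c_t, decay=0.8, friction=0.5):
    # one iteration of a(t) = decay * a(t-1) + friction * W @ a(t-1) + c(t)
    return decay * a_prev + friction * (W @ a_prev) + c_t

# toy three-node graph: node 0 is the initially activated Wikipedia article,
# nodes 1 and 2 are linked articles that receive the spilled-over activation
W = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.0, 0.0]])
a = np.array([1.0, 0.0, 0.0])              # a(0): all activation starts on node 0
for t in range(3):
    a = spread_step(a, W, c_t=np.zeros(3))
    print(t + 1, np.round(a, 3))           # activation gradually leaks to the neighbours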

Thursday, May 29, 2014

In today's post I will talk about assigning weights to keywords in a text. We will use the same strategy as in the paper described earlier. Specifically, we will use the number of categories together with the words as subgraphs.