Monday, June 2, 2014

Having discussed the paper on normalizing social media text in the previous post, we will conclude the review today. We mentioned that the approach in the paper uses a bipartite graph and performs random walks to connect noisy words to their normalized equivalents, and we briefly described how this helps identify pairwise relationships between words and their contexts. We were also looking to establish contextual similarity in other ways. That requires exploiting several kinds of relationships: semantic relationships, part-of-speech relationships, categorization or ontology-based groupings, and so on. Perhaps we could also consider local vector space models instead of relying only on adjacent words in a sentence. In other words, we would use a local vector space model to establish contextual similarity and a global graph with random walks to establish pairwise relationships.
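As a rough sketch of the first part of that pipeline, the bipartite word-context graph might be assembled along the lines below in Python. The window-based definition of a context used here is an assumption for illustration, not necessarily the exact definition used in the paper.

    from collections import defaultdict

    def build_bipartite_graph(sentences, window=1):
        # Connect each word to the contexts (surrounding words) it appears in.
        # 'sentences' is a list of token lists; the window-based context is an
        # illustrative assumption, not necessarily the paper's exact definition.
        word_to_contexts = defaultdict(set)
        context_to_words = defaultdict(set)
        for tokens in sentences:
            for i, word in enumerate(tokens):
                left = tuple(tokens[max(0, i - window):i])
                right = tuple(tokens[i + 1:i + 1 + window])
                context = (left, right)
                word_to_contexts[word].add(context)
                context_to_words[context].add(word)
        return word_to_contexts, context_to_words

Words are on one side of the graph and shared contexts are on the other, so a random walk that leaves a noisy word through a context can land on a clean word that appears in the same context.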
Those alternatives are a topic for another discussion; for now, let us finish reviewing the paper. The number of steps taken to traverse between two nodes is called the hitting time. The average hitting time over all R walks that connect a source node n and a sink node m is H(n,m) = Sum_r h_r(n,m) / R, where h_r(n,m) is the hitting time between the pair (n,m) on walk r.
The relative frequency of the average hitting time of those two nodes, compared with all other normalized nodes m_i linked to the noisy node, is then L(n,m) = H(n,m) / Sum_i H(n,m_i). This is one cost.
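To make the two quantities concrete, here is a small Python sketch that follows the formulas above. The representation of the walks as (sink, steps) pairs collected per noisy source is an assumed data layout for illustration, not the paper's.

    from collections import defaultdict

    def average_hitting_times(walks):
        # H(n, m) = sum over all R walks of h_r(n, m), divided by R.
        # 'walks' is assumed to be a list of (sink, steps) pairs collected from
        # R random walks started at the same noisy node n.
        R = len(walks)
        steps_per_sink = defaultdict(list)
        for sink, steps in walks:
            steps_per_sink[sink].append(steps)
        return {m: sum(s) / R for m, s in steps_per_sink.items()}

    def relative_frequencies(H):
        # L(n, m) = H(n, m) / sum over all candidates m_i of H(n, m_i).
        total = sum(H.values())
        return {m: h / total for m, h in H.items()}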
The cost between a noisy word and a normalized word also includes a lexical similarity cost, SimCost(n,m). This cost function is described as the ratio of the longest common subsequence ratio (LCSR) to the edit distance between the two strings.
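A minimal Python sketch of that cost follows, using standard dynamic-programming implementations of the longest common subsequence and the Levenshtein edit distance; the guard against a zero edit distance for identical strings is an assumption added here for safety.

    def lcs_length(a, b):
        # Length of the longest common subsequence of strings a and b.
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, ca in enumerate(a, 1):
            for j, cb in enumerate(b, 1):
                dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
        return dp[len(a)][len(b)]

    def edit_distance(a, b):
        # Levenshtein distance between strings a and b.
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, cb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1, dp[j-1] + 1, prev + (ca != cb))
        return dp[len(b)]

    def sim_cost(noisy, normalized):
        # SimCost(n, m) = LCSR(n, m) / ED(n, m), following the description above.
        # LCSR is the LCS length divided by the length of the longer string.
        lcsr = lcs_length(noisy, normalized) / max(len(noisy), len(normalized))
        ed = edit_distance(noisy, normalized)
        return lcsr / ed if ed else lcsr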
The algorithm to induce the lexicons is something like this:
Initialize the Lexicon
for each node n in the graph
    if n is a noisy node
        Initialize R(n)
        for i = 1 to K
            R(n) = R(n) + RandomWalk(n)
        L(n) = Normalize(R(n))
        Calculate the lexical similarity cost
        L(n) = SimCost(L(n))
        L(n) = Prune(L(n))
        Lexicon = Add(L(n))
In the algorithm above, we first initialize the lexicon and then add to it over the iterations. As the random walks proceed, the more probable paths are taken more often, which selects the candidate normalization equivalences. We then calculate the average hitting times and normalize them. Finally, we compute the lexical similarity costs and, together with the normalized average hits, use the resulting final costs to construct the lexicon entries, pruning each entry by keeping only the top few candidates.
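Putting the pieces together, a hedged sketch of the induction loop might look as follows. Here random_walk is a hypothetical helper returning a (candidate, hitting_time) pair, the other functions are the sketches from the earlier snippets, and multiplying the two costs to score a candidate is an assumption made for illustration rather than the paper's stated combination.

    def induce_lexicon(graph, noisy_nodes, K=100, top_k=5):
        # Hedged sketch of the induction loop described above.  random_walk is a
        # hypothetical helper returning a (candidate, hitting_time) pair;
        # average_hitting_times, relative_frequencies, and sim_cost are the
        # sketches from the earlier snippets.  Multiplying the two costs to
        # score a candidate is an assumption made here for illustration.
        lexicon = {}
        for n in noisy_nodes:
            walks = [random_walk(graph, n) for _ in range(K)]   # R(n)
            H = average_hitting_times(walks)                    # average hits
            L = relative_frequencies(H)                         # normalize
            scored = {m: L[m] * sim_cost(n, m) for m in L}      # add lexical cost
            best = sorted(scored, key=scored.get, reverse=True)[:top_k]
            lexicon[n] = [(m, scored[m]) for m in best]         # prune to top few
        return lexicon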
