When we discussed random-walk-based lexicon building in the algorithm from the earlier post, the computation had a few stages.
First, we built the bipartite graph (this was covered in the post before the last one).
We built it this way (a short code sketch follows the list):
We extracted N-grams from the text corpus.
For each N-gram, denoted by n, we checked whether the center word was a noisy word or a normalized word.
If it was noisy, we added it as a source node; otherwise we added it as a sink node.
We then added the context, that is, the surrounding words of the N-gram, as a context node.
Finally, we added the edge weight between the word node and its context node.
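A minimal sketch of this construction in Python, assuming a hypothetical `is_noisy` predicate (for example a vocabulary lookup) and an iterable of odd-length N-grams, could look like this:

```python
from collections import defaultdict

def build_bipartite_graph(ngrams, is_noisy):
    """Build the bipartite graph between word nodes and context nodes.

    ngrams   : iterable of odd-length token tuples (center word in the middle)
    is_noisy : predicate returning True when a word is out of vocabulary
    """
    edges = defaultdict(int)   # (word, context) -> co-occurrence count
    source_nodes = set()       # noisy words
    sink_nodes = set()         # normalized (in-vocabulary) words

    for ngram in ngrams:
        mid = len(ngram) // 2
        center = ngram[mid]
        # the context is the surrounding words, i.e. the N-gram minus the center word
        context = ngram[:mid] + ngram[mid + 1:]

        if is_noisy(center):
            source_nodes.add(center)
        else:
            sink_nodes.add(center)

        # the edge weight is the number of times the word co-occurs with this context
        edges[(center, context)] += 1

    return source_nodes, sink_nodes, edges
```

With 5-grams, for example, the center word is the third token and the context is the two words on either side of it.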
Second, we performed the random walk. The goal was to identify pairs of noisy and normalized words that could be considered equivalences. Each walk starts at a noisy word node and ends at a normalized word node, and we normalized the results for each noisy node.
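As an illustrative sketch (not necessarily the exact formulation), a single walk can be simulated on the graph built above by alternating between word nodes and context nodes, with transition probabilities proportional to edge weights, stopping once a normalized word is reached:

```python
import random
from collections import defaultdict

def random_walk(start_word, edges, sink_nodes, max_steps=20):
    """Walk from a noisy word until a normalized (sink) word is reached.

    Returns (normalized_word, hitting_time), or (None, None) if the walk is cut off.
    `edges` maps (word, context) -> weight, as in the construction sketch above.
    """
    # adjacency lists with weights, in both directions
    word_to_ctx = defaultdict(list)
    ctx_to_word = defaultdict(list)
    for (word, ctx), w in edges.items():
        word_to_ctx[word].append((ctx, w))
        ctx_to_word[ctx].append((word, w))

    def step(neighbors):
        # pick a neighbor with probability proportional to edge weight
        nodes, weights = zip(*neighbors)
        return random.choices(nodes, weights=weights, k=1)[0]

    current = start_word
    for hitting_time in range(1, max_steps + 1):
        if not word_to_ctx.get(current):
            return None, None                  # dead end
        context = step(word_to_ctx[current])   # word -> context
        current = step(ctx_to_word[context])   # context -> word
        if current in sink_nodes:
            return current, hitting_time       # absorbed at a normalized word
    return None, None                          # walk too long, give up
```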
Third, we calculated the costs and then pruned the list down to the top few candidates.
When we calculated the cost, we factored in a component for the similarity cost. This cost function was described as the ratio of the longest common subsequence ratio (LCSR) to the edit distance between the two strings.
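A straightforward implementation of that similarity cost, taking LCSR as the longest common subsequence length divided by the longer string's length (the exact variant used originally may differ slightly):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def edit_distance(a, b):
    """Levenshtein edit distance between strings a and b."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, row[0] = row[0], i
        for j, cb in enumerate(b, 1):
            prev, row[j] = row[j], min(row[j] + 1,          # deletion
                                       row[j - 1] + 1,      # insertion
                                       prev + (ca != cb))   # substitution
    return row[len(b)]

def similarity_cost(noisy, clean):
    """LCSR divided by edit distance, as described above."""
    lcsr = lcs_length(noisy, clean) / max(len(noisy), len(clean))
    ed = edit_distance(noisy, clean)
    return lcsr / ed if ed > 0 else float("inf")  # identical strings
```

For example, `similarity_cost("tmrw", "tomorrow")` has an LCSR of 4/8 and an edit distance of 4, giving a score of 0.125.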
We also had a cost from the random walk itself, the hitting time between the two nodes. We averaged this hitting time between the two nodes and normalized it by the hitting times of all the other nodes linked to that noisy node.
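Putting the pieces together, a hedged sketch of the cost calculation and the pruning step could look like the following; combining the contextual and lexical components as a simple product is an assumption here, not necessarily the exact formulation:

```python
def rank_candidates(noisy_word, hitting_times, top_k=5):
    """Score and prune the normalization candidates for one noisy word.

    `hitting_times` maps candidate -> list of hitting times observed over the walks.
    """
    # average hitting time per candidate
    avg_hit = {cand: sum(ts) / len(ts) for cand, ts in hitting_times.items() if ts}

    # normalize by the hitting times of all candidates linked to this noisy node
    total = sum(avg_hit.values())
    contextual = {cand: h / total for cand, h in avg_hit.items()}

    # combine the contextual component with the lexical similarity cost (assumed product form)
    scores = {cand: contextual[cand] * similarity_cost(noisy_word, cand)
              for cand in contextual}

    # prune the list to the top few candidates
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```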
In the algorithm above, we first see that we iterate over the lexicon list repeatedly. These repeated iterations let us choose the most probable paths and refine the results.
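A small driver loop, reusing the helpers sketched above, illustrates how repeated walks can accumulate evidence per noisy word before the ranking and pruning step; `n_walks` and `top_k` are hypothetical parameters:

```python
from collections import defaultdict

def build_lexicon(source_nodes, sink_nodes, edges, n_walks=100, top_k=5):
    """Run repeated random walks per noisy word and keep the best candidates."""
    lexicon = {}
    for noisy in source_nodes:
        hitting_times = defaultdict(list)
        # repeat the walk to get more stable hitting-time estimates
        for _ in range(n_walks):
            normalized, steps = random_walk(noisy, edges, sink_nodes)
            if normalized is not None:
                hitting_times[normalized].append(steps)
        if hitting_times:
            lexicon[noisy] = rank_candidates(noisy, hitting_times, top_k)
    return lexicon
```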
This way we end up with a list of the top few contextually relevant normalization candidates for each noisy word.
This method lets us use pairwise contextual similarity to normalize social media text.