Today we will cover some more from the paper we mentioned yesterday. The paper notes that social media text is noisy and contains typos, abbreviations, phonetic substitutions and slang.
The paper proposed a text normalization approach that can adapt to such variations automatically. This technique can adapt to different sources and any genre of social media. This is different from listing out-of-vocabulary (OOV) words, because such a list does not suit social media text. Many words and named entities that do not exist in a given vocabulary should not be considered for normalization.
An OOV word may have many appropriate normalizations depending on the context and the domain. Also, text normalization is a pre-processing step and therefore should have high accuracy.
The proposed technique addresses these challenges by finding the best normalization sequence according to an n-gram language model. The normalization candidates are picked automatically from unlabeled text data in an unsupervised manner. The approach is scalable, accurate and adaptive to any domain and language. It begins by constructing a lattice from the normalization candidates. Only one-to-one word mappings are considered. The candidates are populated with two generators: one uses dictionary based spelling correction and the other is based on trie based approximate string matching. As an aside, we mentioned the trie based string approximation used in mobile devices in an earlier post. A trie is used for prefix or suffix based representation and organization of strings, which enables lengthening, letter substitution, letter-number substitution and other such approximations. Together, the trie based approximation and the spell checker produce the candidate normalizations for each word.
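As a rough illustration of the second generator only, here is a minimal Python sketch of trie based approximate string matching; the trie layout, the edit-distance bound and the toy dictionary are my own choices and not taken from the paper, which also handles lengthening and letter-number substitutions.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None          # set at the node that terminates a dictionary word

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.word = word

def fuzzy_lookup(root, query, max_edits=2):
    # walk the trie while maintaining one row of the edit-distance table per node
    results = []
    first_row = list(range(len(query) + 1))

    def walk(node, prev_row):
        if node.word is not None and prev_row[-1] <= max_edits:
            results.append((node.word, prev_row[-1]))
        if min(prev_row) > max_edits:          # prune branches that cannot recover
            return
        for ch, child in node.children.items():
            row = [prev_row[0] + 1]
            for i, qch in enumerate(query, 1):
                row.append(min(row[i - 1] + 1,                    # insertion
                               prev_row[i] + 1,                   # deletion
                               prev_row[i - 1] + (qch != ch)))    # substitution
            walk(child, row)

    walk(root, first_row)
    return sorted(results, key=lambda r: r[1])

root = TrieNode()
for w in ["tomorrow", "today", "school"]:      # toy dictionary
    insert(root, w)
print(fuzzy_lookup(root, "tomorow"))           # [('tomorrow', 1)]

In practice the candidates from this generator and from the spelling corrector are merged into the lattice, and the n-gram language model picks the best sequence.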
Dictionary based normalization methods prove inadequate for social media text for many reasons. First, the corrections they make are too general. Second, they are context-free in the sense that they don't consider the nature of the word, only the minimum edit distance to some dictionary word or named entity. Third, social media text is dynamic, with new slang introduced on a daily basis.
We now describe the graph based random walks mentioned in this paper. Here the authors rely on the observation that normalization equivalences share similar context. By similar context, they mean a pattern in, say, n-grams where the words to the left and to the right are the same between equivalences. If we take a five-gram sequence of words, then the pattern looks like word1 word2 * word4 word5, where * is the position shared by the noisy word and its normalized form. This pattern can now be represented by a bipartite graph where the first partite set represents the words and the second represents the n-gram contexts shared by the words. A word node can be either normalized or noisy, and this distinction limits which candidate noisy words get normalized.
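As a quick illustration (a toy example of my own, not from the paper), a helper that strips the center word from an odd-length n-gram yields the shared context key:

def context_of(ngram):
    # return (left words + right words, center word) of an odd-length n-gram
    mid = len(ngram) // 2
    return tuple(ngram[:mid] + ngram[mid + 1:]), ngram[mid]

# "see you tmrw at school" and "see you tomorrow at school" share the same context
print(context_of(["see", "you", "tmrw", "at", "school"]))
# (('see', 'you', 'at', 'school'), 'tmrw')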
The selection of candidates for normalization (noisy words) works something like this. A vocabulary is constructed from a large clean corpus. Any word that does not appear in that vocabulary more than a predefined threshold, e.g. 10 times, is a candidate.
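A minimal sketch of that check, assuming the vocabulary is simply a word-to-count mapping over the clean corpus (the function names and the default threshold are mine):

from collections import Counter

def build_vocabulary(clean_corpus_tokens):
    # counts over a large clean corpus
    return Counter(clean_corpus_tokens)

def is_noisy(word, vocabulary, threshold=10):
    # a word is a normalization candidate if it is rare or absent in the clean corpus
    return vocabulary[word] <= threshold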
The bipartite graph is composed of W, which includes all nodes representing normalized and noisy words, C, which represents the shared contexts, and E, which represents the edges connecting word nodes and context nodes. The weight on an edge is the number of occurrences of the word in that context.
The algorithm to construct the graph works something like this (a runnable sketch follows the pseudocode):
Extract n-grams from the text corpus.
For each n-gram n:
    check whether the center word of n is noisy or normalized
    if it is noisy, add it as a source word node
    else, add it as a sink word node
    add the context node (the n-gram with its center word removed)
    add or increment the edge weight (context, word, count)
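Here is a minimal Python sketch of that construction, under the assumption that the vocabulary is a Counter over a clean corpus as above; the variable names and the plain-dictionary graph representation are my own, not the paper's:

from collections import defaultdict

def build_bipartite_graph(ngrams, vocabulary, threshold=10):
    # W is split into noisy source words and normalized sink words, C is the set
    # of contexts, and edge weights are (context, word) co-occurrence counts
    source_words, sink_words, contexts = set(), set(), set()
    edge_weight = defaultdict(int)
    for ngram in ngrams:
        mid = len(ngram) // 2
        word = ngram[mid]
        context = tuple(ngram[:mid] + ngram[mid + 1:])    # center word removed
        if vocabulary[word] <= threshold:                 # noisy word -> source
            source_words.add(word)
        else:                                             # normalized word -> sink
            sink_words.add(word)
        contexts.add(context)
        edge_weight[(context, word)] += 1
    return source_words, sink_words, contexts, edge_weight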
Once the bipartite graph is constructed, we can generate lexicons using Markov random walks. The goal is to identify pairs of noisy and normalized words that can be considered equivalences. A random walk starts at a noisy word and ends at a normalized word. The walker moves from a source node Ni to a connected node Mj with probability Pij, which is the co-occurrence count of the word and the corresponding context, normalized over all of Ni's edges. Random walks are repeated until either a normalized form is found or k attempts are exhausted. There can be many paths between the same source and the same sink, and paths are selected according to the transition probabilities. In spirit, this is similar to moving between different surface forms of a word via small edit-distance steps.
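The following sketch approximates the lexicon generation by sampling repeated walks on the graph returned above; the sampling scheme, the walk length cap and the scoring by relative hit counts are my own simplifications rather than the paper's exact formulation:

import random
from collections import defaultdict

def random_walk_normalizations(noisy_word, edge_weight, sink_words,
                               attempts=100, max_steps=6):
    # adjacency lists with weights, for both directions of the bipartite graph
    word_to_contexts, context_to_words = defaultdict(list), defaultdict(list)
    for (context, word), count in edge_weight.items():
        word_to_contexts[word].append((context, count))
        context_to_words[context].append((word, count))
    if noisy_word not in word_to_contexts:
        return {}

    def step(choices):
        # move to a neighbor with probability proportional to co-occurrence count
        total = sum(c for _, c in choices)
        r, acc = random.uniform(0, total), 0.0
        for node, c in choices:
            acc += c
            if r <= acc:
                return node
        return choices[-1][0]

    hits = defaultdict(int)
    for _ in range(attempts):
        node = noisy_word
        for _ in range(max_steps):
            context = step(word_to_contexts[node])    # word -> shared context
            node = step(context_to_words[context])    # context -> word
            if node in sink_words:                    # reached a normalized form
                hits[node] += 1
                break
    total_hits = sum(hits.values()) or 1
    return {w: n / total_hits for w, n in hits.items()}  # relative scores

The highest scoring normalized words can then serve as the noisy word's lexicon entries.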
We note that this graph based method uses similar context. In natural language processing applications, the context is often described as a set of words to the left and right of the candidate.
Is there a better way to describe context than adjacency in a sentence? This is where the vector space model and clustering have helped. However, can we define a graph on something other than word adjacency and still capture the same context similarity? Perhaps we could consider that words map to different parts of speech, different thesauri or different domain ontologies. Sometimes one solution does not work for all cases, and this is probably where we could consider multiple measures at the same time. While the vector space model attempts this without explicit consideration of the factors contributing to word pair relationships, a graph based approach could represent each factor explicitly.