Saturday, May 31, 2014

We read a paper on Social Text Normalization using Contextual Graph Random Walks by Hassan and Menezes. This paper describes text normalization, which is also useful for keyword extraction and summarization. Text normalization does away with noise, typos, abbreviations, phonetic substitutions and slang, so the cleaned data is uniform and more meaningful. Their approach uses random walks on a contextual similarity bipartite graph constructed from n-gram sequences over a large unlabeled text corpus.
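To get a feel for the idea, here is a minimal sketch of a random walk over a contextual similarity bipartite graph. The corpus, the two-hop walk, and the vote counting are all toy assumptions for illustration, not the paper's actual construction: word nodes connect to context nodes (here, the surrounding word pair), and a noisy word is normalized to the clean word its walks most often land on.

```python
import random
from collections import defaultdict, Counter

# Hypothetical toy corpus containing both noisy and clean variants.
corpus = [
    "see you 2morrow at the meeting",
    "see you tomorrow at the meeting",
    "i will call you 2morrow morning",
    "i will call you tomorrow morning",
]

# Build the bipartite graph: words on one side, contexts
# (pairs of surrounding words) on the other.
word_to_contexts = defaultdict(set)
context_to_words = defaultdict(set)
for sentence in corpus:
    tokens = sentence.split()
    for i in range(1, len(tokens) - 1):
        context = (tokens[i - 1], tokens[i + 1])
        word_to_contexts[tokens[i]].add(context)
        context_to_words[context].add(tokens[i])

def normalize(noisy_word, steps=4, walks=200, seed=0):
    """Walk word -> context -> word repeatedly; the other word
    most often reached through shared contexts wins."""
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(walks):
        node = noisy_word
        for _ in range(steps):
            contexts = sorted(word_to_contexts[node])
            if not contexts:
                break
            ctx = rng.choice(contexts)
            node = rng.choice(sorted(context_to_words[ctx]))
        if node != noisy_word:
            hits[node] += 1
    return hits.most_common(1)[0][0] if hits else noisy_word

print(normalize("2morrow"))  # -> tomorrow
```

Because "2morrow" and "tomorrow" share the contexts ("you", "at") and ("you", "morning"), walks started at the noisy form repeatedly cross over to the clean one.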
Natural language processing applications often have to deal with noisy text. Dirty text is problematic: it can break the machine learning algorithms, skew the results, and may not be representative of the content. Text results are often measured by parameters such as precision and recall. We discussed these in earlier posts; they are generally meant to determine the relevancy of the results. We saw that these measures are directly affected by the count of successful matches. If matches fail because dirty text gets discounted, skipped or, even worse, included where it should not be, then the results are no longer accurate.
Furthermore, we saw some elementary steps to mitigate this. For example, we considered stemming and lemmatization to discover root words and conform different representations to the same word. Such techniques, however, can be overly general: if the modifications are supposed to differentiate the words, then such a technique alone does not help.
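The trade-off above can be seen in a minimal sketch. The suffix list, the three-letter stem cutoff, and the tiny lemma dictionary are all hypothetical; real stemmers like Porter's are far more careful, but the over-generalization problem is the same.

```python
# Hypothetical crude stemmer: strip a few common suffixes,
# keeping at least three characters of stem.
SUFFIXES = sorted(["ing", "ed", "es", "s", "ly"], key=len, reverse=True)

def crude_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

# A lemmatizer needs dictionary knowledge: irregular forms
# cannot be reduced by suffix rules alone.
LEMMAS = {"ran": "run", "mice": "mouse", "better": "good"}

def lemmatize(word):
    return LEMMAS.get(word, crude_stem(word))

print(crude_stem("connected"))  # -> connect
print(crude_stem("running"))   # -> runn (over-general: not a real root)
print(lemmatize("ran"))        # -> run
```

The "runn" output illustrates the point in the text: rule-based stripping conflates forms without actually recovering the root, which is why a dictionary-backed lemmatizer does better on irregular words.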
There are other techniques, such as a lookup table tailored to the domain of the subject. Such lookup tables simplify the language a great deal in that we now have fewer and more consistent words.
Lookup tables are an interesting idea: we can associate one or more variations of a word with the same normalized term. This lets us add new variations, such as slang, into the table for lookup. Every time we see a new term, we can include it in the table for later use.
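A minimal sketch of such a table, with hypothetical slang entries, might look like this. The growing table is just a dictionary keyed by the variant:

```python
# Hypothetical lookup table: variants map to one normalized term.
NORMALIZE = {
    "u": "you",
    "gr8": "great",
    "2morrow": "tomorrow",
    "tmrw": "tomorrow",
}

def normalize_token(token, table=NORMALIZE):
    """Return the normalized form if known, else the token itself."""
    return table.get(token.lower(), token)

def add_variant(variant, normalized, table=NORMALIZE):
    """Record a newly seen variant for later lookups."""
    table[variant.lower()] = normalized

print(" ".join(normalize_token(t) for t in "c u 2morrow".split()))
# -> c you tomorrow  ("c" is not yet in the table)

add_variant("c", "see")
print(" ".join(normalize_token(t) for t in "c u 2morrow".split()))
# -> see you tomorrow
```

Note how several variants ("2morrow", "tmrw") collapse to the same normalized term, which is exactly what gives us fewer and more consistent words.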
While lookup tables are great for associating terms with their normalized word, the normalized word is neither the lemma nor necessarily the meaning of the word.
To improve on that, we could now extend the idea to use a thesaurus or an ontology. These organize words by their related semantics. An ontology goes further in describing a hierarchy or even categorizing the terms. As you can see, these refinements are meant to bring dirty text into the mainstream so that it counts towards the measures we have for language processing applications.
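As a minimal sketch of the hierarchy idea, assuming a toy hand-built ontology (the terms and categories below are purely illustrative), each term points to its parent category, so related words roll up to a shared concept for counting:

```python
# Hypothetical toy ontology: term -> parent category (None at the root).
PARENT = {
    "lol": "laugh",
    "haha": "laugh",
    "laugh": "emotion",
    "grin": "smile",
    "smile": "emotion",
    "emotion": None,
}

def concept_path(term):
    """Walk up the hierarchy from a term to its root concept."""
    path = [term]
    while PARENT.get(term):
        term = PARENT[term]
        path.append(term)
    return path

print(concept_path("lol"))   # -> ['lol', 'laugh', 'emotion']
print(concept_path("grin"))  # -> ['grin', 'smile', 'emotion']
```

Counting at the shared ancestor ("emotion") lets slang like "lol" and "haha" contribute to the same tally as their clean counterparts, instead of being dropped as noise.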

