Thursday, February 7, 2013

a new way to tag words

When you want to select keywords in a text, there are several techniques such as linguistic search etc available as mentioned earlier. Usually words are tagged as various parts of speech (POS) available from say WordNet. I propose a slightly different tag that might make document indexing easier. This is a very simple metric that is based on how frequently these words are used across the internet. With this metric, I'm not suggesting a better way to find hapaxes. Instead I'm proposing to differentiate between frequently occuring not so common words in a text block.
The metric could be a simple Poisson Distribution where the formula is
and K is the number of occurrences of the word, k! is the factorial of k and
is a positive real number

 
 

No comments:

Post a Comment