Keywords can be considered local to the document they appear in. A keyword is therefore characterized not only by how often it occurs (its term frequency) but also by the fact that it appears in a given document and not in others. Yasin et al. exploit this in their paper on keyword extraction using naive Bayes, where a classifier decides whether a word belongs to the class of ordinary words or the class of keywords. The metric is called TFxIDF, which combines Term Frequency and Inverse Document Frequency: TFxIDF(W, D) = P(word in D is W) x [ -log P(W in a document) ]. The first factor is the relative frequency of W in document D; the second grows as W appears in fewer documents. Assuming the feature values are independent, the paper proposes a Naive Bayes classifier with the following model:
P(key | T, D, PT, PS) = P(key) x P(T|key) x P(D|key) x P(PT|key) x P(PS|key) / P(T, D, PT, PS), where P(key) denotes the prior probability that the word is a key, P(T|key) denotes the probability of having TFxIDF score T given that the word is a key, P(D|key) denotes the probability of having neighbor distance D to the previous occurrence of the same word given that the word is a key, P(PT|key) denotes the probability of having relative distance PT to the previous occurrence of the same word given that the word is a key, and P(PS|key) denotes the corresponding probability for the feature PS.
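To make the two formulas concrete, here is a minimal Python sketch. It is not the paper's implementation. The function names (tf_idf, keyword_posterior), the toy corpus, and every likelihood value are made up for illustration. Two assumptions go beyond the text: the prior P(key) is multiplied into the numerator, as Bayes' rule requires, and the evidence P(T, D, PT, PS) is computed by summing over the two classes (key and not key), which needs a second set of likelihoods for ordinary words.

```python
import math
from collections import Counter


def tf_idf(word, doc_tokens, corpus):
    """TFxIDF(W, D) = P(word in D is W) x [ -log P(W in a document) ]."""
    # Relative frequency of the word inside this document.
    tf = Counter(doc_tokens)[word] / len(doc_tokens)
    # Fraction of documents in the corpus that contain the word.
    df = sum(1 for doc in corpus if word in doc) / len(corpus)
    if df == 0:
        return 0.0  # the word occurs nowhere, so there is nothing to score
    return tf * -math.log(df)


def keyword_posterior(likelihoods_key, likelihoods_not_key, prior_key):
    """Naive Bayes posterior P(key | features) for a two-class problem.

    Each list holds one likelihood per feature, for example
    [P(T|key), P(D|key), P(PT|key), P(PS|key)].
    """
    numerator_key = prior_key * math.prod(likelihoods_key)
    numerator_not = (1.0 - prior_key) * math.prod(likelihoods_not_key)
    # Evidence via the law of total probability over both classes.
    evidence = numerator_key + numerator_not
    return numerator_key / evidence


# Toy corpus: each document is a list of tokens (illustrative only).
corpus = [
    ["naive", "bayes", "keyword", "extraction", "keyword"],
    ["bayes", "classifier"],
    ["document", "frequency"],
]

score_keyword = tf_idf("keyword", corpus[0], corpus)
score_bayes = tf_idf("bayes", corpus[0], corpus)
print("TFxIDF(keyword):", score_keyword)
print("TFxIDF(bayes):", score_bayes)

# Hypothetical likelihoods for the four features of a single word.
posterior = keyword_posterior(
    likelihoods_key=[0.6, 0.5, 0.4, 0.5],
    likelihoods_not_key=[0.1, 0.3, 0.3, 0.4],
    prior_key=0.1,
)
print("P(key | T, D, PT, PS):", posterior)
```

In this toy corpus "keyword" scores higher than "bayes" because it is both more frequent in the first document and present in fewer documents overall. In practice the likelihoods would be estimated from labeled training documents, typically after discretizing each feature into bins.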
 