In the previous post we mentioned that when the features are independent, we can use the naive Bayes classifier. SpamBayes, an Outlook plugin for spam filtering, does not use the Bayes classifier; it uses a method called the Fisher method. The difference is that the Fisher method calculates the probability of a category for each feature in the document and then combines those probabilities to test whether the document is a random set of features or whether it is more or less likely to belong to a category. In the Bayesian filter we combined all the Pr(feature | category) results, but this ignored the fact that there might be many more documents in one category than in the other. Here we instead take advantage of the features that distinguish the categories.

The Fisher method calculates:

clf = Pr(feature | category) for this category
freqsum = sum of Pr(feature | category) over all the categories
cprob = clf / freqsum

By this method, a mail containing the word "casino" is assigned a better-calibrated spam probability than the Bayes classifier would give it, which is useful when deciding thresholds and cutoffs. When the probabilities are independent and random, the chi-square measure can be used as a measure of the goodness of the classification. Since an item belonging to a category may have many features of that category with high probability, the Fisher method applies the inverse chi-square function to the combined probabilities to get the probability that a random set of probabilities would return such a high number.
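The calculation above can be sketched in Python. This is a minimal illustration of the technique, not SpamBayes's actual code: `cprob` normalizes Pr(feature | category) across categories, and `fisher_prob` combines the per-feature probabilities and runs the result through the inverse chi-square function. The function names and the dict-based inputs are my own assumptions for the sketch.

```python
import math

def cprob(pr_feature_given, category):
    # clf = Pr(feature | this category); freqsum = sum over all categories.
    clf = pr_feature_given[category]
    freqsum = sum(pr_feature_given.values())
    return clf / freqsum if freqsum else 0.0

def invchi2(chi, df):
    # Probability that a chi-square value this large would arise
    # from a random set of probabilities (df assumed even).
    m = chi / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_prob(feature_probs):
    # Fisher's method: multiply the per-feature probabilities, take
    # -2 * ln(product), and feed that into the inverse chi-square
    # function with 2n degrees of freedom. A result near 1 means the
    # features agree on the category far more than chance would allow.
    product = 1.0
    for p in feature_probs:
        product *= p
    score = -2.0 * math.log(product)
    return invchi2(score, len(feature_probs) * 2)
```

For example, `cprob({'spam': 0.8, 'good': 0.2}, 'spam')` gives 0.8, and a document whose features all score near 0.9 for spam produces a higher `fisher_prob` than one whose features sit at an uninformative 0.5.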
When there are many users training the classifier simultaneously, it is probably better to store the data and the counts in a database such as SQLite
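A minimal sketch of what that storage might look like, using Python's built-in sqlite3 module. The table layout (an `fc` table of feature-per-category counts) is an assumption for illustration, not a schema from any particular library.

```python
import sqlite3

# In-memory database for the sketch; use a file path so that
# multiple training processes can share one store.
con = sqlite3.connect(":memory:")
con.execute(
    "create table if not exists fc"
    "(feature text, category text, count integer)")

def incf(feature, category):
    # Increment the count of a feature appearing in a category.
    row = con.execute(
        "select count from fc where feature=? and category=?",
        (feature, category)).fetchone()
    if row is None:
        con.execute("insert into fc values (?, ?, 1)",
                    (feature, category))
    else:
        con.execute(
            "update fc set count=? where feature=? and category=?",
            (row[0] + 1, feature, category))
    con.commit()

def fcount(feature, category):
    # Read back how often a feature was seen in a category.
    row = con.execute(
        "select count from fc where feature=? and category=?",
        (feature, category)).fetchone()
    return row[0] if row else 0
```

Because the counts live in the database rather than in a per-process dictionary, every user's training calls update the same totals.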