In this post, I discuss document filtering from the book Programming Collective Intelligence. While we have already talked about keyword extraction, classifying documents with machine intelligence is fast becoming a common approach in applications such as eliminating spam from e-mail, blogs, RSS feeds, or message boards. Other applications include dividing your inbox into categories such as social or work-related e-mail. Early attempts were all rule-based classifiers; the rules were either shipped out of the box or authored by the users themselves.
Classifiers operate on features that are used to distinguish different items. Features can simply be the set of words or phrases in a document. Determining which features to use is tricky but important: the entire text, or even a single character, could be a feature. Learning the features is what training the classifier accomplishes. As classifiers are trained, they keep track of the counts of feature/category pairs, the number of times each feature has appeared in each category, the total number of items, and the list of all categories. By passing a feature-extraction function to the classifier, such as one that returns the words in a text, and by training it with associations of features to categories, the classifier can start accumulating these counts and probabilities. The probabilities are usually calculated as conditional on the category.
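The bookkeeping above can be sketched as follows. This is a minimal illustration, not the book's exact code; the names `get_words`, `Classifier`, `fc`, and `cc` are my own, and the word-length filter is an assumed heuristic.

```python
import re
from collections import defaultdict

def get_words(doc):
    # a simple feature extractor: the set of lowercase words of moderate length
    words = re.findall(r"[a-z]+", doc.lower())
    return set(w for w in words if 2 < len(w) < 20)

class Classifier:
    def __init__(self, get_features):
        self.get_features = get_features
        self.fc = defaultdict(lambda: defaultdict(int))  # feature -> category -> count
        self.cc = defaultdict(int)                       # category -> item count

    def train(self, item, category):
        # accumulate feature/category counts for one labeled item
        for f in self.get_features(item):
            self.fc[f][category] += 1
        self.cc[category] += 1

    def fprob(self, f, cat):
        # conditional probability P(feature | category)
        if self.cc[cat] == 0:
            return 0.0
        return self.fc[f][cat] / self.cc[cat]
```

For example, after training one spam-like and one work-like document, `fprob("cheap", "bad")` reflects how often "cheap" appeared in items labeled "bad".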
The classifier can make predictions based only on the training data it has seen so far. If the training data is insufficient, we get around this by starting from an assumed probability, for example 0.5, and weighting it: the weighted probability returns a weighted average of the calculated probability and the assumed probability.
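The weighted average described above can be written as a small standalone function; the parameter names here are illustrative, with `count` standing for how many training items contained the feature.

```python
def weighted_prob(basic_prob, count, weight=1.0, assumed=0.5):
    # blend the observed probability with an assumed prior,
    # giving the observation more influence as evidence (count) grows
    return (weight * assumed + count * basic_prob) / (weight + count)
```

With no evidence the result is the assumed 0.5; as the count grows, it converges toward the calculated probability.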
Now that we have the probability of a document in a category for a given word or feature, we combine the probabilities of all the features that make up the document to get the probability of the document belonging to that category. If the probabilities of the different features are treated as independent, the result is a naïve Bayes classifier. Once the classifier is trained, the final step in classifying a document is to calculate the probability of that document belonging to each category and then choose the best one.
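The combine-and-choose step can be sketched as below. This is a simplified view that assumes the per-feature conditional probabilities have already been computed; a full implementation would also fold in the category priors.

```python
def doc_prob(feature_probs):
    # under the naive independence assumption, multiply the
    # per-feature probabilities P(feature | category)
    p = 1.0
    for fp in feature_probs:
        p *= fp
    return p

def classify(per_category_feature_probs):
    # per_category_feature_probs: category -> list of P(feature | category)
    # pick the category whose combined probability is highest
    scores = {cat: doc_prob(fps) for cat, fps in per_category_feature_probs.items()}
    return max(scores, key=scores.get)
```

For instance, a document whose features are all more probable under "good" than under "bad" is assigned to "good".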
#codingexercise
def radix_sort(numbers):
    # For each column of digits from least to most significant,
    # apply a stable sort keyed on that column (Python's sorted is stable).
    exp = 1
    while numbers and max(numbers) // exp > 0:
        numbers = sorted(numbers, key=lambda n: (n // exp) % 10)
        exp *= 10
    return numbers