Yesterday we were discussing a survey of statistical techniques for document filtering in indexing. The idea of excluding sensitive documents from an index is similar to filtering spam from email. These techniques use conditional probabilities and Bayes' theorem. Naïve Bayes performs well on a small feature set, but the same cannot be said for larger feature sets. Supervised statistical filtering algorithms therefore work in two stages: a training stage, where a set of labeled documents provides the training data, and a classification stage, where the prediction is made. When the dimensions of the feature vector grow, features are pruned or selected, which has the side benefit of improving predictions. Feature selection is usually effective enough to reduce the feature set to orders of magnitude smaller than the original. Common feature selection techniques include Document Frequency, Information Gain, and Chi-Square.
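The two stages above can be sketched with a tiny multinomial Naïve Bayes classifier. This is a minimal illustration, not the survey's implementation; the function names and the toy labels "sensitive" and "legitimate" are my own, and Laplace smoothing is assumed for unseen words.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Training stage: estimate class priors and per-class word counts
    from labeled (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def classify(text, class_counts, word_counts, vocab):
    """Classification stage: pick the label with the highest posterior,
    computed in log space to avoid underflow, with Laplace smoothing."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

With a handful of labeled documents, `train` builds the model and `classify` predicts the label for a new document; with thousands of distinct words, this is exactly where feature selection would prune the vocabulary first.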
It is important to note the difference between spam filtering and document indexing. It is much more severe to misclassify a legitimate mail as spam, so that it never reaches the inbox, than to let a spam message pass the filter and show up in the inbox: the missed mail may have been personal, and losing it could be a disaster. Document filtering, on the other hand, is focused on not letting a sensitive document through to indexing, where it may find its way into public search results. It is far less severe to leave a legitimate document unindexed, since it can still be found from the file explorer.
Performance of classifiers is measured in terms of precision, recall, and F-score. Let S and L stand for sensitive and legitimate documents. NSS and NLL are the numbers of documents correctly classified in the respective categories, while the misclassifications are denoted NLS and NSL for legitimate documents classified as sensitive and sensitive documents classified as legitimate, respectively. Precision is then the ratio of correct sensitive classifications to all classifications labeled sensitive, NSS / (NSS + NLS). Recall is the ratio of correct sensitive classifications to the actual number of sensitive documents, NSS / (NSS + NSL). The F-score combines the two as twice their product over their sum, 2 × precision × recall / (precision + recall). A high precision and a high recall both indicate a better classifier, but here the emphasis is on high recall rather than high precision, since missing a sensitive document is the costlier error.
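The three measures can be computed directly from the four counts. A minimal sketch (the function name is my own):

```python
def precision_recall_f(nss, nll, nls, nsl):
    """Compute precision, recall, and F-score for the sensitive class.
    nss: sensitive correctly classified as sensitive
    nll: legitimate correctly classified as legitimate
    nls: legitimate misclassified as sensitive
    nsl: sensitive misclassified as legitimate"""
    precision = nss / (nss + nls)   # correct sensitive / all labeled sensitive
    recall = nss / (nss + nsl)      # correct sensitive / actual sensitive
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

For example, with 8 sensitive documents caught, 2 false alarms, and 5 sensitive documents missed, precision is 0.8 but recall is only 8/13, and under the high-recall emphasis above the five misses are what matter most.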
#codingexercise
Find the largest contiguous subarray sum:
We can solve this iteratively with Kadane's algorithm:
int GetMaxSubArraySum(List<int> A)
{
    // Kadane's algorithm: track the best sum of a subarray ending at the
    // current index, and reset it whenever it drops below zero.
    int max_so_far = int.MinValue, max_ending_here = 0;
    for (int i = 0; i < A.Count; i++)
    {
        max_ending_here = max_ending_here + A[i];
        if (max_so_far < max_ending_here)
            max_so_far = max_ending_here;
        if (max_ending_here < 0)
            max_ending_here = 0;
    }
    return max_so_far;
}