Thursday, March 30, 2017

A survey of statistical techniques in document filtering for indexing
The concept of excluding sensitive documents from indexing is similar to filtering spam from email. Consequently, the statistical techniques that apply to spam filtering also apply to document filtering. I discuss these techniques here:

The task of index filtering is to rule out sensitive documents automatically from a user's document stream. Statistical filtering techniques compute the probability that a document containing a given word, such as SSN, is not indexable. The determination calculates a conditional probability from the probability that the word appears in the document, the overall probability that any given document is unindexable, the probability that the word appears in sensitive documents, the probability that the given document is indexable, and the probability that the word appears in indexable documents. The Bayes probability is therefore computed as a conditional probability.
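As a minimal sketch of that computation, assuming made-up probability estimates rather than values from any real corpus, Bayes' rule gives the chance that a document is unindexable given that it contains the word:

using System;

class BayesSketch
{
    static void Main()
    {
        // Illustrative, made-up estimates that would normally come from labeled training data.
        double pUnindexable = 0.2;            // P(document is unindexable)
        double pIndexable = 1 - pUnindexable; // P(document is indexable)
        double pWordGivenUnindexable = 0.6;   // P(word appears | document is unindexable)
        double pWordGivenIndexable = 0.05;    // P(word appears | document is indexable)

        // Bayes' rule: P(unindexable | word appears in the document)
        double pUnindexableGivenWord =
            (pWordGivenUnindexable * pUnindexable) /
            (pWordGivenUnindexable * pUnindexable + pWordGivenIndexable * pIndexable);

        Console.WriteLine(pUnindexableGivenWord); // ~0.75 with these numbers
    }
}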
Naïve Bayes exhibits good performance on a small feature set, but the same cannot be said for larger feature sets. Document filtering is formally described as follows: given a document set D = {d1, d2, ..., dj, ..., dD} and a category set C = {c1 = invalid, c2 = legitimate}, where dj is the jth document in D and C is the set of possible labels, the task of Automated Document Filtering is to build a boolean categorization function Phi(dj, ci) on D x C -> {True, False}. When Phi(dj, ci) is True, document dj belongs to category ci; when Phi(dj, ci) is False, dj does not belong to ci. Each document can be assigned to only one of the two categories, not both. Supervised statistical filtering algorithms therefore work in two stages: the training stage and the classification stage. During training, a set of labeled documents provides the training data. During classification, the prediction is made on new documents.
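As a sketch of these two stages, the following assumes a bag-of-words representation with add-one (Laplace) smoothing; the class labels, the Phi method name, and the smoothing constants are illustrative assumptions, not a prescribed implementation:

using System;
using System.Collections.Generic;
using System.Linq;

class NaiveBayesFilter
{
    // Per-class document counts and per-class counts of documents containing each word.
    Dictionary<string, int> invalidCounts = new Dictionary<string, int>();
    Dictionary<string, int> legitCounts = new Dictionary<string, int>();
    int invalidDocs = 0, legitDocs = 0;

    // Training stage: accumulate counts from a labeled document.
    public void Train(IEnumerable<string> words, bool isInvalid)
    {
        var counts = isInvalid ? invalidCounts : legitCounts;
        if (isInvalid) invalidDocs++; else legitDocs++;
        foreach (var w in words.Distinct())
            counts[w] = counts.TryGetValue(w, out var c) ? c + 1 : 1;
    }

    // Classification stage: Phi(d, invalid) is True when the invalid class is more probable.
    public bool Phi(IEnumerable<string> words)
    {
        // Work in log space to avoid underflow; add-one smoothing is an assumption.
        double logInvalid = Math.Log((invalidDocs + 1.0) / (invalidDocs + legitDocs + 2.0));
        double logLegit = Math.Log((legitDocs + 1.0) / (invalidDocs + legitDocs + 2.0));
        foreach (var w in words.Distinct())
        {
            invalidCounts.TryGetValue(w, out var ic);
            legitCounts.TryGetValue(w, out var lc);
            logInvalid += Math.Log((ic + 1.0) / (invalidDocs + 2.0));
            logLegit += Math.Log((lc + 1.0) / (legitDocs + 2.0));
        }
        return logInvalid > logLegit;
    }
}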

For a classifier, the vector space model defines the feature space, often with tens of thousands of features. Loosely speaking, this treats a document as a bag of words: the words are the features, a finite but large number of features is selected, and their weights in the document make up the vector. In practice, such a large number of features does more harm than good, so features are "pruned" or "selected", which has the added benefit of improving prediction in some cases. Feature selection is usually effective enough to reduce the feature set to orders of magnitude smaller than the original. Feature selection techniques include Document Frequency, Information Gain, and Chi-Square.
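As an example of one of these techniques, here is a sketch of Chi-Square scoring for a single term from its 2x2 contingency table of term presence versus class; the counts below are placeholders, and in practice terms would be ranked by this score and only the top-scoring ones kept:

using System;

class ChiSquareSketch
{
    // Chi-square score of a term from its 2x2 contingency table:
    // a = invalid docs containing the term,  b = legitimate docs containing the term,
    // c = invalid docs without the term,     d = legitimate docs without the term.
    static double ChiSquare(double a, double b, double c, double d)
    {
        double n = a + b + c + d;
        double numerator = n * Math.Pow(a * d - b * c, 2);
        double denominator = (a + b) * (c + d) * (a + c) * (b + d);
        return denominator == 0 ? 0 : numerator / denominator;
    }

    static void Main()
    {
        // Placeholder counts for one term across a labeled corpus of 1000 documents.
        Console.WriteLine(ChiSquare(40, 5, 10, 945));
    }
}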
Information Gain is often treated as the same thing as mutual information, but it is better described as the average mutual information between a term and the class variable.
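A sketch of that average mutual information between a term and the class, computed from an assumed joint distribution of term presence and class label (the numbers are placeholders):

using System;

class InformationGainSketch
{
    // Average mutual information I(T; C) between term presence T and class C,
    // computed from the joint probabilities P(t, c).
    static double InformationGain(double[,] joint)
    {
        double ig = 0;
        for (int t = 0; t < 2; t++)       // term present (0) / absent (1)
            for (int c = 0; c < 2; c++)   // invalid (0) / legitimate (1)
            {
                double pt = joint[t, 0] + joint[t, 1]; // marginal P(t)
                double pc = joint[0, c] + joint[1, c]; // marginal P(c)
                if (joint[t, c] > 0)
                    ig += joint[t, c] * Math.Log(joint[t, c] / (pt * pc), 2);
            }
        return ig;
    }

    static void Main()
    {
        // Placeholder joint distribution; rows: term present/absent, columns: invalid/legitimate.
        var joint = new double[2, 2] { { 0.04, 0.005 }, { 0.01, 0.945 } };
        Console.WriteLine(InformationGain(joint));
    }
}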
Performance of classifiers is measured in terms of precision, recall and F-score.
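For completeness, a small sketch computing these metrics from hypothetical true-positive, false-positive, and false-negative counts:

using System;

class MetricsSketch
{
    static void Main()
    {
        // Hypothetical counts from an evaluation run.
        double tp = 80, fp = 10, fn = 20; // true positives, false positives, false negatives

        double precision = tp / (tp + fp);  // fraction of flagged documents that are truly sensitive
        double recall = tp / (tp + fn);     // fraction of sensitive documents that were flagged
        double fScore = 2 * precision * recall / (precision + recall);

        Console.WriteLine($"precision={precision}, recall={recall}, F-score={fScore}");
    }
}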

#codingexercise
We were discussing ways to count distinct subsequences. Here's another:

int GetCount(List<int> A)
{
    var dp = new int[A.Count + 1];          // dp[i]: distinct subsequences ending with A[i-1]; dp[0] counts the empty subsequence
    var sum = new int[A.Count + 1];         // sum[i]: running total of dp[0..i]
    var last = new Dictionary<int, int>();  // last index at which each value was seen
    dp[0] = 1;
    sum[0] = 1;
    for (int i = 1; i <= A.Count; i++)
    {
        var total_dp_upto_current = 0;
        var total_dp_upto_repetition = 0;
        for (int j = 0; j <= i - 1; j++)
        {
            total_dp_upto_current += dp[j];
        }
        if (last.ContainsKey(A[i - 1]))
        {
            for (int j = 0; j <= last[A[i - 1]]; j++)
            {
                total_dp_upto_repetition += dp[j];
            }
        }
        dp[i] = total_dp_upto_current - total_dp_upto_repetition;
        sum[i] = sum[i - 1] + dp[i];
        last[A[i - 1]] = i - 1;
    }
    return sum[A.Count];
}
