Saturday, April 1, 2017

Today we continue discussing a survey of statistical techniques in document filtering for indexing. Excluding sensitive documents from indexing is similar to filtering spam from email. These techniques use conditional probabilities and Bayes' theorem. Naïve Bayes performs well on a small feature set, but the same cannot be said for larger feature sets. Supervised statistical filtering algorithms work in two stages: the training stage and the classification stage. During training, a set of labeled documents provides the training data. During classification, the prediction is made. When the number of dimensions of the feature vector grows, the features are pruned or selected. This has the side benefit of improving predictions. Feature selection is usually effective enough to reduce the feature set to orders of magnitude smaller than the original. Feature selection techniques include Document Frequency, Information Gain, and Chi-Square.
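To make the two stages concrete, here is a minimal Python sketch, assuming documents are already tokenized into lists of words. It prunes the vocabulary by document frequency and then trains and applies a multinomial Naïve Bayes classifier with add-one smoothing. The function names, the `min_df` threshold, and the toy documents are all made up for illustration; the survey does not prescribe this particular implementation.

```python
import math
from collections import Counter

def select_by_document_frequency(docs, min_df):
    """Feature selection: keep terms that occur in at least min_df documents."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return {term for term, count in df.items() if count >= min_df}

def train_naive_bayes(docs, labels, vocabulary):
    """Training stage: collect per-class document counts and term counts."""
    class_doc_counts = Counter(labels)
    term_counts = {label: Counter() for label in class_doc_counts}
    for tokens, label in zip(docs, labels):
        term_counts[label].update(t for t in tokens if t in vocabulary)
    return class_doc_counts, term_counts

def classify(tokens, class_doc_counts, term_counts, vocabulary):
    """Classification stage: return the class with the highest log posterior."""
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_doc_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_terms = sum(term_counts[label].values())
        for t in tokens:
            if t in vocabulary:
                # add-one (Laplace) smoothing avoids log(0) for unseen terms
                likelihood = (term_counts[label][t] + 1) / (total_terms + len(vocabulary))
                score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled training data.
docs = [
    ["salary", "ssn", "confidential", "report"],
    ["ssn", "passport", "confidential", "memo"],
    ["lunch", "menu", "report"],
    ["team", "lunch", "memo"],
]
labels = ["sensitive", "sensitive", "legitimate", "legitimate"]

vocabulary = select_by_document_frequency(docs, min_df=2)
model = train_naive_bayes(docs, labels, vocabulary)
print(classify(["confidential", "ssn", "memo"], *model, vocabulary))
```

Terms that appear in only one document are dropped before training, which is the simplest of the three selection techniques named above; Information Gain or Chi-Square would replace only the selection function.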
Content-based filtering techniques use simple stop-word searches and format specifiers, matching patterns for sensitive data such as Social Security numbers (SSNs). Documents can also be analyzed with text analysis methods such as mutual information or pointwise mutual information against a glossary of sensitive words and their occurrences. One of the most appealing content-based techniques, in which indexing is already used in the classification steps, is k nearest neighbours. This technique states that if at least t of the k nearest neighbours of document m are sensitive, then m is a sensitive document; otherwise it is a legitimate document. In this technique there is no separate training phase, and the comparison between documents represented as vectors happens in real time. The comparison is based on the indexing we have already done for the documents, and consequently does not need the document-feature matrix. However, it does involve updating the sample before the comparisons, and the cost depends on the sample size because each document in the sample is labeled.
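The k nearest neighbours rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: documents are token lists, similarity is cosine similarity over raw term counts (the post does not name a similarity measure), and the function names and the labeled sample are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def is_sensitive(doc_tokens, labeled_sample, k, t):
    """Return True if at least t of the k nearest labeled documents are sensitive."""
    query = Counter(doc_tokens)
    # No training phase: rank the labeled sample against the query at call time.
    ranked = sorted(
        labeled_sample,
        key=lambda item: cosine_similarity(query, Counter(item[0])),
        reverse=True,
    )
    sensitive_votes = sum(1 for _, sensitive in ranked[:k] if sensitive)
    return sensitive_votes >= t

# Hypothetical labeled sample: (tokens, is_sensitive_label).
labeled_sample = [
    (["ssn", "salary", "confidential"], True),
    (["passport", "ssn", "confidential"], True),
    (["lunch", "menu"], False),
    (["team", "lunch", "offsite"], False),
    (["confidential", "merger", "memo"], True),
]

print(is_sensitive(["confidential", "ssn", "memo"], labeled_sample, k=3, t=2))
print(is_sensitive(["lunch", "menu", "team"], labeled_sample, k=3, t=2))
```

Every query scans the whole labeled sample here, which is why the cost grows with the sample size; an existing index could be used to fetch candidate neighbours instead of the full scan.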

#codingexercise
Find the largest contiguous subarray sum:
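We can solve it in a different way from the one discussed yesterday, using prefix sums.
// Alternatively,
Initialize and maintain the following variables
x = running sum of the elements seen so far (prefix sum), initially 0
y = minimum prefix sum encountered so far, initially 0
z = maximum subarray sum encountered so far, initially the first element
For each element e of the array
           x = x + e
           z = max(z, x-y)
           y = min(x, y)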
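
A runnable Python sketch of this prefix-sum idea follows. It assumes the running sum is updated on every step and that the answer must come from a non-empty subarray, so the best value is computed before the minimum prefix is updated. The function name `max_subarray_sum` is just illustrative.

```python
def max_subarray_sum(values):
    """Largest sum of a non-empty contiguous subarray, using prefix sums."""
    if not values:
        raise ValueError("values must be non-empty")
    prefix = 0          # x: sum of the elements seen so far
    min_prefix = 0      # y: smallest prefix sum before the current element
    best = values[0]    # z: best subarray sum found so far
    for v in values:
        prefix += v
        # A subarray ending here has sum prefix - (some earlier prefix);
        # subtracting the smallest earlier prefix gives the largest such sum.
        best = max(best, prefix - min_prefix)
        min_prefix = min(min_prefix, prefix)
    return best

print(max_subarray_sum([-2, 1, -3, 4, -1, 2, 1, -5, 4]))
```

Updating the best value before the minimum prefix keeps the empty subarray out of consideration, so an all-negative array returns its largest single element rather than 0.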
