Thursday, April 6, 2017

Today we continue discussing a survey of content-based techniques in document filtering for indexing. The concept of excluding sensitive documents from indexing is similar to filtering spam from email. The statistical techniques use conditional probabilities and Bayes' theorem. Naïve Bayes exhibits good performance on a small feature set, but the same cannot be said for larger feature sets. The supervised statistical filtering algorithms therefore work in two stages: the training stage and the classification stage. During training, a set of labeled documents provides the training data. During classification, the prediction is made. When the dimensionality of the feature vector grows, the features are pruned or selected, which has the side benefit of improving predictions. Feature selection is usually effective enough to reduce the feature set to orders of magnitude smaller than the original. Feature selection techniques include Document Frequency, Information Gain, and Chi-Square.
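As a rough sketch of the Bayes-rule scoring these statistical filters rely on, the fragment below labels a document from a handful of made-up word likelihoods and class priors; the words, probabilities, and method names are purely illustrative, and a real filter would estimate these quantities during the training stage.

using System;
using System.Collections.Generic;

class NaiveBayesSketch
{
    // Hypothetical per-word likelihoods, as if estimated from the labeled training set.
    static Dictionary<string, double> pWordGivenSensitive = new Dictionary<string, double>
    {
        { "salary", 0.20 }, { "confidential", 0.30 }, { "meeting", 0.05 }
    };
    static Dictionary<string, double> pWordGivenLegitimate = new Dictionary<string, double>
    {
        { "salary", 0.02 }, { "confidential", 0.01 }, { "meeting", 0.10 }
    };
    static double pSensitive = 0.3, pLegitimate = 0.7;

    // Naive Bayes: each class score is its prior times the product of the word
    // likelihoods; logs are used so the products do not underflow.
    static bool IsSensitive(IEnumerable<string> words)
    {
        double logSensitive = Math.Log(pSensitive);
        double logLegitimate = Math.Log(pLegitimate);
        foreach (var w in words)
        {
            if (pWordGivenSensitive.ContainsKey(w)) logSensitive += Math.Log(pWordGivenSensitive[w]);
            if (pWordGivenLegitimate.ContainsKey(w)) logLegitimate += Math.Log(pWordGivenLegitimate[w]);
        }
        return logSensitive > logLegitimate;
    }

    static void Main()
    {
        Console.WriteLine(IsSensitive(new[] { "confidential", "salary" })); // True
        Console.WriteLine(IsSensitive(new[] { "meeting" }));                // False
    }
}
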
Other techniques include k nearest neighbours. This technique states that if at least t documents among the k neighbours of document m are sensitive, then m is a sensitive document; otherwise it is a legitimate document. In this technique there is no separate training phase, and the comparison between documents, represented as vectors, happens in real time. The comparison is based on the indexing we have already done for the documents and consequently does not need the document-feature matrix; however, it does involve an update of the sample before the comparisons, which depends on the sample size because each document in the sample is labeled.
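A minimal sketch of this rule, assuming cosine similarity over the document vectors that are already available from indexing; the sample vectors, labels, and the values of k and t below are illustrative.

using System;
using System.Linq;

class KnnSketch
{
    static double Cosine(double[] a, double[] b)
    {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
        return dot / (Math.Sqrt(na) * Math.Sqrt(nb) + 1e-12);
    }

    // Document m is sensitive if at least t of its k nearest labeled neighbours are sensitive.
    static bool IsSensitive(double[] m, double[][] samples, bool[] sensitiveLabel, int k, int t)
    {
        var nearest = samples
            .Select((vec, idx) => new { Sim = Cosine(m, vec), Sensitive = sensitiveLabel[idx] })
            .OrderByDescending(x => x.Sim)
            .Take(k);
        return nearest.Count(x => x.Sensitive) >= t;
    }

    static void Main()
    {
        var samples = new[] { new double[] { 1, 0, 0 }, new double[] { 0.9, 0.1, 0 },
                              new double[] { 0, 1, 1 }, new double[] { 0, 0.8, 1 } };
        var labels = new[] { true, true, false, false }; // true = sensitive
        Console.WriteLine(IsSensitive(new double[] { 0.95, 0.05, 0 }, samples, labels, k: 3, t: 2)); // True
    }
}
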
Another technique is called the support vector machine, which views the training set as vectors with n attributes, corresponding to points in a hyperspace of n dimensions. In the case of binary classification such as document filtering, this technique looks for the hyperplane that separates the points of the first class from those of the second class such that their distance from the hyperplane is maximum.
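Finding that maximum-margin hyperplane is normally delegated to a quadratic-programming solver or an SVM library, so the sketch below only shows the resulting decision rule, classifying by which side of a learned hyperplane w·x + b = 0 a point falls on; the weights and bias here are made-up placeholders rather than the output of real training.

using System;

class SvmSketch
{
    // Linear SVM decision rule: the training objective chooses w and b so that the
    // margin (distance of the closest points to the hyperplane w·x + b = 0) is maximum;
    // classification then only needs the sign of w·x + b.
    static bool IsSensitive(double[] x, double[] w, double b)
    {
        double score = b;
        for (int i = 0; i < x.Length; i++) score += w[i] * x[i];
        return score > 0; // positive side of the hyperplane = first (sensitive) class
    }

    static void Main()
    {
        var w = new double[] { 2.0, -1.0 }; // illustrative values, not trained
        double b = -0.5;
        Console.WriteLine(IsSensitive(new double[] { 1.0, 0.2 }, w, b)); // True
        Console.WriteLine(IsSensitive(new double[] { 0.1, 1.0 }, w, b)); // False
    }
}
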
Another technique is the use of maximum entropy. The principle here is to find the probability distribution that maximizes the entropy, defined as the negative sum of probabilities multiplied by their logarithms, where the sum is taken over both the set of possible classes and the set of possible feature values. The maximization is constrained so that the probabilities agree with the values observed in the training set.
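In symbols, one common formulation of this principle is:

\max_{p} \; H(p) = -\sum_{c \in C} \sum_{f \in F} p(c, f)\,\log p(c, f)
\quad \text{subject to} \quad
\sum_{c, f} p(c, f)\, g_i(c, f) = \tilde{E}[g_i] \;\; \text{for every feature function } g_i,

where C is the set of classes, F the set of feature values, and the empirical expectations \tilde{E}[g_i] are measured on the training set.
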
Another technique is the use of neural networks. For example, with a technique called the perceptron, the idea is to define a linear function in terms of weights and a bias such that the function is greater than zero if the input vector belongs to the first class and negative otherwise. The perceptron classifies based on the given vector; when it misclassifies, it adjusts the weights and the bias iteratively.
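A minimal sketch of the perceptron update rule, assuming a fixed learning rate and a tiny illustrative training set; all vectors and values below are made up.

using System;

class PerceptronSketch
{
    static double[] w;
    static double b;

    // f(x) = w·x + b; predict the first (sensitive) class when f(x) > 0.
    static bool Predict(double[] x)
    {
        double f = b;
        for (int i = 0; i < x.Length; i++) f += w[i] * x[i];
        return f > 0;
    }

    // On every misclassification the weights and the bias are nudged toward the correct side.
    static void Train(double[][] X, bool[] sensitive, int epochs, double rate)
    {
        w = new double[X[0].Length];
        b = 0;
        for (int e = 0; e < epochs; e++)
            for (int n = 0; n < X.Length; n++)
            {
                int target = sensitive[n] ? 1 : -1;
                if ((Predict(X[n]) ? 1 : -1) != target)
                {
                    for (int i = 0; i < w.Length; i++) w[i] += rate * target * X[n][i];
                    b += rate * target;
                }
            }
    }

    static void Main()
    {
        var X = new[] { new double[] { 1, 0 }, new double[] { 0.9, 0.2 },
                        new double[] { 0, 1 }, new double[] { 0.1, 0.9 } };
        var labels = new[] { true, true, false, false };
        Train(X, labels, epochs: 20, rate: 0.1);
        Console.WriteLine(Predict(new double[] { 0.8, 0.1 })); // True
    }
}
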
Another technique is the use of search engines. Often the sensitive information in a document may not be directly inline but referred to through links and references. These are just as exploitable as if the sensitive information were inline. A proposed technique to overcome this challenge is to use the mechanism of search engines, where links are crawled and their content is known. Links are followed to an already classified document, and the current document is then marked as sensitive or not. Additional inputs from the user can then be used to determine whether the document should be served in the search results. If the linked content is not already classified, it is classified with a simple classifier such as a Naive Bayes classifier. This makes the model more dynamic.
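One way to sketch this link-following idea, assuming a hypothetical cache of already classified documents and a stand-in fallback classifier; ClassifyWithNaiveBayes and the example URL are placeholders, not real APIs or data.

using System;
using System.Collections.Generic;

class LinkFilterSketch
{
    // Hypothetical cache of documents that have already been classified.
    static Dictionary<string, bool> knownSensitivity = new Dictionary<string, bool>();

    // Stand-in for the simple fallback classifier mentioned above.
    static bool ClassifyWithNaiveBayes(string url) { return false; }

    // A document is marked sensitive if any of the links it refers to leads to a
    // document that is known, or newly classified, to be sensitive.
    static bool IsSensitive(IEnumerable<string> outgoingLinks)
    {
        foreach (var url in outgoingLinks)
        {
            if (!knownSensitivity.ContainsKey(url))
                knownSensitivity[url] = ClassifyWithNaiveBayes(url);
            if (knownSensitivity[url])
                return true;
        }
        return false;
    }

    static void Main()
    {
        knownSensitivity["http://example.com/salaries"] = true; // classified earlier by the crawler
        Console.WriteLine(IsSensitive(new[] { "http://example.com/salaries" })); // True
    }
}
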
Another technique is the use of genetic programming. In the design of a Bayesian filter, the characteristic vector includes the frequencies of words selected as features, generally by manual inspection. This selection has a dominant effect on the filter's performance. Instead, the filter can be represented by a syntactic tree whose nodes are numbers representing the frequencies, operations on the numbers, words, and operations on the words. Syntactic rules can then be used to check the correctness of the tree. Fitness functions are then used to evaluate a population against some criterion. For example, this criterion can be the sum of squares of the error between the filter's classification of a document and its correct classification, over both the sensitive and the allowed document sets.
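A sketch of such a fitness function, assuming each candidate filter in the population can be evaluated as a function from a document to a score between 0 and 1; that interface is an assumption made here for illustration.

using System;
using System.Collections.Generic;

class FitnessSketch
{
    // Fitness of one candidate filter (a syntactic tree) in the population: the sum of
    // squared errors between the filter's output and the true label, taken over both
    // the sensitive and the allowed document sets. Lower is fitter.
    static double Fitness(Func<string, double> filter,
                          IEnumerable<string> sensitiveDocs,
                          IEnumerable<string> allowedDocs)
    {
        double error = 0;
        foreach (var d in sensitiveDocs) error += Math.Pow(filter(d) - 1.0, 2); // true label 1
        foreach (var d in allowedDocs)   error += Math.Pow(filter(d) - 0.0, 2); // true label 0
        return error;
    }

    static void Main()
    {
        Func<string, double> constantFilter = doc => 0.5; // a trivial candidate
        Console.WriteLine(Fitness(constantFilter, new[] { "docA" }, new[] { "docB" })); // 0.5
    }
}
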
#codingexercise
Find the sum of all substrings of a string of digits, treating each substring as a number.
number = "1234"
sumofdigit[0] = 1 = 1
sumofdigit[1] = 2 + 12  = 14
sumofdigit[2] = 3 + 23  + 123 = 149
sumofdigit[3] = 4 + 34  + 234 + 1234  = 1506

Total = 1670
using System.Collections.Generic;
using System.Linq;

// Returns the sum of the values of all substrings ending at position 'index'.
// dp[j] holds the running value of the substring A[j..index].
int GetSumDP(List<int> A, int index, ref int[] dp)
{
    for (int i = 0; i <= index; i++)
    {
        dp[i] = dp[i] * 10 + A[index];
    }
    return dp.Sum(); // entries beyond 'index' are still zero
}

int GetSum(List<int> A)
{
    int result = 0;
    var dp = new int[A.Count];
    for (int i = 0; i < A.Count; i++)
        result += GetSumDP(A, i, ref dp);
    return result;
}
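As a quick check against the worked example above, GetSum(new List<int> { 1, 2, 3, 4 }) returns 1670.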
