Monday, April 3, 2017

Today we continue discussing a survey of statistical and learning techniques in document filtering for indexing. Excluding sensitive documents from indexing is similar to filtering spam from email. The statistical techniques use conditional probabilities and Bayes' theorem. Naïve Bayes performs well on a small feature set, but the same cannot be said for larger feature sets. The supervised statistical filtering algorithms work in two stages – a training stage and a classification stage. During training, a set of labeled documents provides the training data. During classification, the prediction is made. When the dimensionality of the feature vector grows, the features are pruned or selected, which has the side benefit of improving predictions. Feature selection is usually effective enough to reduce the feature set to orders of magnitude smaller than the original. Feature selection techniques include Document Frequency, Information Gain, and Chi-Square.
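As a minimal sketch (not the survey's implementation), the two stages could look as follows with scikit-learn, assuming a tiny made-up corpus with "sensitive"/"legitimate" labels; the chi-square step prunes the vocabulary before Naïve Bayes is fit:

# Sketch only: Naive Bayes filtering with chi-square feature selection.
# The corpus, labels, and k are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = [
    "quarterly salary report for employees",      # sensitive
    "company picnic scheduled for friday",        # legitimate
    "confidential merger salary negotiations",    # sensitive
    "friday newsletter and picnic photos",        # legitimate
]
train_labels = ["sensitive", "legitimate", "sensitive", "legitimate"]

# Training stage: vectorize, prune features with chi-square, fit Naive Bayes.
model = make_pipeline(
    CountVectorizer(),
    SelectKBest(chi2, k=5),   # keep only the k most discriminative terms
    MultinomialNB(),
)
model.fit(train_docs, train_labels)

# Classification stage: predict the label of an unseen document.
print(model.predict(["salary figures attached"]))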
The learning-based techniques include k nearest neighbours. This technique states that if at least t of the k nearest neighbors of document m are sensitive, then m is a sensitive document; otherwise it is legitimate. There is no separate training phase, and the comparison between documents, represented as vectors, happens in real time. The comparison is based on the indexing we have already done for the documents and consequently does not need the document-feature matrix. However, it does involve updating the sample before the comparisons, and this cost depends on the sample size because each document in the sample is labeled.
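A minimal sketch of this rule, assuming tf-idf vectors, cosine similarity, and illustrative values k = 3 and t = 2 (none of which come from the survey):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

train_docs = ["salary report", "picnic schedule", "merger terms", "newsletter draft"]
train_labels = np.array([1, 0, 1, 0])   # 1 = sensitive, 0 = legitimate (made-up sample)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_docs)

k, t = 3, 2
nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(X)

def is_sensitive(doc):
    # No separate training phase: just look up the k closest labeled documents.
    _, idx = nn.kneighbors(vectorizer.transform([doc]))
    return train_labels[idx[0]].sum() >= t   # sensitive if at least t neighbors are sensitive

print(is_sensitive("confidential salary and merger figures"))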
Another technique is the support vector machine, which views the training set as vectors with n attributes, corresponding to points in an n-dimensional space. In a binary classification task such as document filtering, this technique looks for the hyperplane that separates the points of the first class from those of the second class such that their distance from the hyperplane is maximum.
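A minimal sketch with a linear SVM, again on a made-up corpus; LinearSVC fits the maximum-margin separating hyperplane described above:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

train_docs = ["salary report", "picnic schedule", "merger terms", "newsletter draft"]
train_labels = ["sensitive", "legitimate", "sensitive", "legitimate"]

# Each document becomes a point in n-dimensional feature space; the SVM separates
# the two classes with the hyperplane that maximizes the margin.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_docs, train_labels)
print(clf.predict(["merger salary figures"]))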
Another technique is the use of maximum entropy. The principle here is to find the probability distribution that maximizes the entropy H(p) = -Σ p(c, f) log p(c, f), where the sum runs over both the set of possible classes c and the set of possible feature values f. The maximization is constrained so that the probabilities are consistent with all known values in the training set.
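A small sketch of the objective, with an illustrative joint distribution over (class, feature value) pairs; a real maximum entropy filter would search for the distribution that maximizes this quantity subject to the training constraints:

import math

# Made-up joint probabilities p(class, feature value), summing to 1.
p = {
    ("sensitive", "contains_salary"): 0.30,
    ("sensitive", "no_salary"): 0.10,
    ("legitimate", "contains_salary"): 0.05,
    ("legitimate", "no_salary"): 0.55,
}

# Entropy H(p) = -sum over classes and feature values of p * log p.
entropy = -sum(prob * math.log(prob) for prob in p.values() if prob > 0)
print(f"H(p) = {entropy:.4f} nats")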
Another technique is the use of neural networks. For example, with the perceptron, the idea is to define a linear function in terms of a weight vector and a bias such that the function is greater than zero if the input belongs to the first class and negative otherwise. The perceptron classifies the given vector with this function; when it misclassifies, it adjusts the weights and the bias iteratively.
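A minimal sketch of the perceptron rule on made-up two-dimensional vectors: predict class 1 when w·x + b > 0, and on a mistake nudge the weights and bias toward the correct side:

import numpy as np

X = np.array([[1.0, 1.0], [2.0, 1.5], [-1.0, -0.5], [-2.0, -1.0]])  # feature vectors
y = np.array([1, 1, 0, 0])                                          # class labels

w = np.zeros(X.shape[1])
b = 0.0
for _ in range(10):                     # a few passes suffice for this toy data
    for xi, yi in zip(X, y):
        pred = 1 if np.dot(w, xi) + b > 0 else 0
        if pred != yi:                  # misclassified: adjust weights and bias
            step = 1 if yi == 1 else -1
            w += step * xi
            b += step

print(w, b, [1 if np.dot(w, xi) + b > 0 else 0 for xi in X])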
#codingexercise
Sum of all substrings of digits from number:
        Get all valid combinations using the combine method above
        Sum the combinations
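Since the combine method referenced above comes from an earlier post and is not reproduced here, a direct enumeration sketch of the same exercise:

def sum_of_substrings(number: str) -> int:
    # Enumerate every contiguous substring of the digit string and sum the numbers they form.
    total = 0
    for start in range(len(number)):
        for end in range(start + 1, len(number) + 1):
            total += int(number[start:end])
    return total

print(sum_of_substrings("1234"))  # 1+2+3+4+12+23+34+123+234+1234 = 1670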

Sum of all possible numbers from a selection of digits of a number:
        Get all valid combinations of the digits as well as their permutations
        Sum the generated numbers
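A sketch of this exercise using itertools: every non-empty selection of the digits is permuted, and the resulting numbers are summed (repeated digits would produce duplicate numbers, which are counted separately here):

from itertools import permutations

def sum_of_generated_numbers(number: str) -> int:
    digits = list(number)
    total = 0
    for size in range(1, len(digits) + 1):
        for perm in permutations(digits, size):   # covers every combination and its orderings
            total += int("".join(perm))
    return total

print(sum_of_generated_numbers("12"))  # 1 + 2 + 12 + 21 = 36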
