Today we continue discussing a survey of statistical and learning-based techniques for document filtering during indexing. The idea of excluding sensitive documents from an index is similar to filtering spam from email. The statistical techniques rely on conditional probabilities and Bayes' theorem. Naïve Bayes performs well on a small feature set, but the same cannot be said for larger feature sets. Supervised statistical filtering algorithms therefore proceed in two stages: a training stage, in which a set of labeled documents provides the training data, and a classification stage, in which the prediction is made. As the dimensionality of the vectors grows, features are pruned or selected, which has the side benefit of improving predictions. Feature selection is usually effective enough to reduce the feature set to orders of magnitude smaller than the original. Feature selection techniques include document frequency, information gain, and chi-square.
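The two stages above can be sketched with a toy Naïve Bayes filter. The corpus, labels, and function names below are all illustrative, and add-one (Laplace) smoothing is assumed for words unseen in a class:

```python
import math
from collections import Counter

# Toy labeled corpus -- documents and labels are made up for illustration.
train = [
    ("salary records and personal ids", "sensitive"),
    ("confidential merger plans leaked", "sensitive"),
    ("quarterly newsletter for all staff", "legitimate"),
    ("public release notes for the product", "legitimate"),
]

def fit(docs):
    """Training stage: estimate class priors and per-class word counts."""
    priors = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in priors}
    vocab = set()
    for text, label in docs:
        words = text.split()
        word_counts[label].update(words)
        vocab.update(words)
    return priors, word_counts, vocab

def predict(text, priors, word_counts, vocab):
    """Classification stage: pick the class maximizing
    log P(c) + sum over words of log P(w | c), with add-one smoothing."""
    total = sum(priors.values())
    best, best_score = None, float("-inf")
    for label in priors:
        score = math.log(priors[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

priors, word_counts, vocab = fit(train)
print(predict("confidential salary records", priors, word_counts, vocab))  # sensitive
```

Note how the two stages are cleanly separated: `fit` only counts, and `predict` only scores, so the same counts can classify any number of new documents.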
The learning-based techniques include k-nearest neighbours (kNN). This technique states that if at least t of the k nearest neighbours of document m are sensitive, then m is classified as sensitive; otherwise it is legitimate. There is no separate training phase: the comparison between documents, represented as vectors, happens in real time. Because the comparison is based on the indexing we have already done for the documents, no document-feature matrix is needed; however, it does involve updating the sample before the comparisons, and this cost depends on the sample size, since each document in the sample must be labeled.
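The t-out-of-k voting rule can be sketched as follows. The sample of labeled term-count vectors is made up, and cosine similarity is assumed as the distance measure (any vector similarity would do):

```python
import math

# Toy labeled sample of sparse term-count vectors -- illustrative only.
sample = [
    ({"salary": 2, "ids": 1}, "sensitive"),
    ({"merger": 1, "confidential": 2}, "sensitive"),
    ({"newsletter": 2, "staff": 1}, "legitimate"),
]

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_filter(doc, sample, k, t):
    """Label doc sensitive iff at least t of its k nearest labeled
    neighbours are sensitive; there is no separate training phase."""
    ranked = sorted(sample, key=lambda item: cosine(doc, item[0]), reverse=True)
    votes = sum(1 for vec, label in ranked[:k] if label == "sensitive")
    return "sensitive" if votes >= t else "legitimate"

print(knn_filter({"salary": 1, "confidential": 1}, sample, k=2, t=1))  # sensitive
```

All of the work happens at query time: adding a newly labeled document to `sample` immediately affects future classifications, which is exactly why the sample update cost scales with the sample size.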
Another technique is the support vector machine (SVM), which views the training set as vectors with n attributes, corresponding to points in an n-dimensional space. For binary classification such as document filtering, this technique looks for the hyperplane that separates the points of the first class from those of the second class such that their distance from the hyperplane is maximized.
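The hyperplane search can be sketched with a Pegasos-style sub-gradient descent on the hinge loss, which is one of several ways to train a linear SVM. The 2-D points, labels, and hyperparameters below are made up, and the bias term is omitted for brevity:

```python
import random

# Toy 2-D training points: label +1 = sensitive, -1 = legitimate (illustrative).
data = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((2.5, 3.5), 1),
        ((-2.0, -2.0), -1), ((-3.0, -1.0), -1), ((-1.5, -3.0), -1)]

def train_svm(data, lam=0.01, epochs=200, seed=0):
    """Pegasos-style sub-gradient descent on the hinge loss: approximates
    the maximum-margin separating hyperplane w (bias omitted)."""
    rng = random.Random(seed)
    w = [0.0, 0.0]
    step = 0
    for _ in range(epochs):
        for x, y in rng.sample(data, len(data)):
            step += 1
            eta = 1.0 / (lam * step)  # decaying learning rate
            margin = y * (w[0] * x[0] + w[1] * x[1])
            # Shrink w (the regularization term of the objective) ...
            w = [(1.0 - eta * lam) * wi for wi in w]
            # ... and push it toward any point that violates the margin.
            if margin < 1.0:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

def classify(w, x):
    """Which side of the hyperplane the point falls on."""
    return 1 if w[0] * x[0] + w[1] * x[1] >= 0 else -1

w = train_svm(data)
print(all(classify(w, x) == y for x, y in data))  # True
```

The margin condition `margin < 1.0` is what drives the maximum-margin behaviour: points that are correctly classified but still close to the hyperplane keep pulling w until they sit at distance at least 1/|w| from it.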