Friday, April 7, 2017

Today we continue discussing a survey of content-based techniques in document filtering for indexing. We previously discussed statistical techniques that use conditional probabilities and Bayes' theorem. The Naïve Bayes classifier exhibits good performance on a small feature set, but the same cannot be said for larger feature sets, so a subset of features is selected first. Feature selection techniques include Document Frequency, Information Gain, and Chi-Square.
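As a quick sketch of the last of these, the chi-square score of a term for a class can be computed from a 2x2 contingency table of term presence against class membership; the method and parameter names below are illustrative, not from the survey.

using System;

static class FeatureScoring
{
    // Chi-square score of a term for a class from 2x2 contingency counts:
    // a = class docs containing the term, b = non-class docs containing the term,
    // c = class docs without the term,    d = non-class docs without the term.
    // Higher scores mean the term is more strongly associated with the class.
    public static double ChiSquare(long a, long b, long c, long d)
    {
        double n = a + b + c + d;
        double diff = (double)a * d - (double)b * c;
        double denom = (double)(a + b) * (c + d) * (a + c) * (b + d);
        return denom == 0 ? 0.0 : n * diff * diff / denom;
    }
}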
One technique is k-nearest neighbors: if at least t of the k nearest neighbors of a document m are sensitive, then m is classified as a sensitive document; otherwise it is classified as legitimate.
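A minimal sketch of this rule, assuming we already have a distance function and sensitivity labels for the training documents (all names below are illustrative):

using System;
using System.Linq;

static class Knn
{
    // Classify document m as sensitive if at least t of its k nearest
    // neighbors (by the supplied distance function) are sensitive.
    public static bool IsSensitive(double[] m, double[][] docs, bool[] sensitive,
                                   int k, int t, Func<double[], double[], double> distance)
    {
        int sensitiveNeighbors = Enumerable.Range(0, docs.Length)
            .OrderBy(i => distance(m, docs[i]))   // sort training docs by distance to m
            .Take(k)                              // keep the k nearest
            .Count(i => sensitive[i]);            // count how many are sensitive
        return sensitiveNeighbors >= t;           // threshold rule from the text
    }
}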
Another technique is the support vector machine (SVM), which views each training example as a vector of n attributes, i.e., a point in an n-dimensional space, and performs binary classification by finding a hyperplane that separates the two classes.
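Once trained, a linear SVM classifies by the sign of w·x + b, where the weight vector w and bias b come from training (the training itself is omitted here); the sketch below shows only the decision step, and its names are illustrative.

static class LinearSvm
{
    // Binary decision function of a trained linear SVM: the point x lies on
    // one side or the other of the hyperplane w·x + b = 0.
    public static int Classify(double[] w, double b, double[] x)
    {
        double score = b;
        for (int i = 0; i < w.Length; i++)
            score += w[i] * x[i];                 // dot product w·x
        return score >= 0 ? +1 : -1;              // which side of the hyperplane
    }
}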
Another technique is the maximum entropy principle. The idea is to find the probability distribution that maximizes the entropy, defined as the negative sum of the probabilities multiplied by their logarithms, where the summation ranges over both the set of possible classes and the set of possible feature values.
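For concreteness, the quantity being maximized is H(p) = -Σ p log p taken over the (class, feature value) pairs; a small helper to compute it (names are illustrative):

using System;
using System.Linq;

static class MaxEnt
{
    // Entropy H(p) = -sum_i p_i * log(p_i) of a discrete distribution,
    // where the index runs over (class, feature value) pairs.
    public static double Entropy(double[] p)
    {
        return -p.Where(x => x > 0).Sum(x => x * Math.Log(x));
    }
}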
Another technique is the use of neural networks. For example, with the perceptron, the idea is to define a linear function in terms of a weight vector and a bias such that the function is greater than zero if the example belongs to the first class and negative otherwise. The perceptron classifies each given vector with the current weights; whenever it misclassifies one, it adjusts the weights and the bias, repeating this iteratively.
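A minimal sketch of that update rule, assuming labels of +1/-1 and dense input vectors (all names below are illustrative):

static class Perceptron
{
    // Train weights w and bias so that sign(w·x + bias) matches label y (+1 or -1).
    // On each misclassification the weights are nudged toward the correct side.
    public static (double[] w, double bias) Train(double[][] x, int[] y, int epochs, double rate)
    {
        var w = new double[x[0].Length];
        double bias = 0;
        for (int e = 0; e < epochs; e++)
        {
            for (int i = 0; i < x.Length; i++)
            {
                double score = bias;
                for (int j = 0; j < w.Length; j++) score += w[j] * x[i][j];
                if (y[i] * score <= 0)            // misclassified: adjust w and bias
                {
                    for (int j = 0; j < w.Length; j++) w[j] += rate * y[i] * x[i][j];
                    bias += rate * y[i];
                }
            }
        }
        return (w, bias);
    }
}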
The technique chosen for learning is only one of the considerations in the overall design. In order to come up with the right classifier, we need to consider the following steps in sequence:
The first step is collecting the dataset. If an expert is available, we may already know which fields (attributes, features) are most informative; otherwise we measure everything available in a brute-force fashion and shortlist the candidate features afterwards.
The second step is data preparation and preprocessing. Depending on the circumstances, researchers have a number of methods to choose from for handling missing data. Preprocessing methods not only handle noise but also cope with large data volumes by working on smaller, representative subsets.
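One common and simple choice for missing numeric values is mean imputation; a sketch under that assumption (the NaN encoding and method name are assumptions):

using System.Linq;

static class Preprocessing
{
    // Replace missing values (encoded as double.NaN) in one feature column
    // by the mean of the observed values for that feature.
    public static void ImputeMean(double[] column)
    {
        var observed = column.Where(v => !double.IsNaN(v)).ToArray();
        if (observed.Length == 0) return;        // nothing observed, leave as-is
        double mean = observed.Average();
        for (int i = 0; i < column.Length; i++)
            if (double.IsNaN(column[i])) column[i] = mean;
    }
}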
Then the features are pruned: feature subset selection is used to identify and remove as many irrelevant and redundant features as possible. Sometimes features depend on one another and unduly influence the classifier; this is addressed by constructing new features from the basic feature set.
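As one concrete way to prune, terms can be dropped when their document frequency falls below a threshold; a sketch under that assumption (names are illustrative):

using System.Collections.Generic;
using System.Linq;

static class FeaturePruning
{
    // Keep only terms that appear in at least minDocs documents;
    // very rare terms are usually noise and inflate the feature space.
    public static HashSet<string> PruneByDocumentFrequency(IEnumerable<ISet<string>> documents, int minDocs)
    {
        var df = new Dictionary<string, int>();
        foreach (var doc in documents)
            foreach (var term in doc)
                df[term] = df.TryGetValue(term, out var c) ? c + 1 : 1;
        return new HashSet<string>(df.Where(kv => kv.Value >= minDocs).Select(kv => kv.Key));
    }
}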
Then the training set is defined. One rule of thumb is to split the available data, using two thirds for training and the other third for estimating performance. Another technique is cross-validation, where the data is divided into mutually exclusive, equal-sized subsets and the classifier is trained on the union of all but one of them, with the held-out subset used for evaluation; this is repeated so that each subset is held out once. Generally we have only one dataset of size N, and all estimates must be obtained from this sole dataset. A common method for comparing supervised ML algorithms is to compare the accuracies of the trained classifiers statistically. Algorithm selection depends on the feedback from this evaluation, and that feedback is also used to improve each of the steps described above.
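A sketch of the two-thirds/one-third rule of thumb (a k-fold split follows the same pattern with k partitions); the shuffling and names are assumptions:

using System;
using System.Linq;

static class Evaluation
{
    // Shuffle the N example indices and split them: two thirds for training,
    // the remaining third for estimating performance.
    public static (int[] train, int[] test) TwoThirdsSplit(int n, int seed = 0)
    {
        var rng = new Random(seed);
        var indices = Enumerable.Range(0, n).OrderBy(_ => rng.Next()).ToArray();
        int cut = 2 * n / 3;
        return (indices.Take(cut).ToArray(), indices.Skip(cut).ToArray());
    }
}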
#codingexercise
Find the sum of all subsequences of digits in a number made up of non-repeating digits.
using System.Collections.Generic;
using System.Linq;

// Recursively enumerate all subsequences (combinations) of the digits.
static void Combine(List<int> digits, List<int> current, int start, List<List<int>> combinations)
{
    for (int i = start; i < digits.Count; i++)
    {
        current.Add(digits[i]);                        // include digits[i] in the current subsequence
        combinations.Add(new List<int>(current));      // record a copy of this subsequence
        Combine(digits, current, i + 1, combinations); // extend with the digits that follow
        current.RemoveAt(current.Count - 1);           // backtrack
    }
}

// The required sum is then:
// int total = combinations.Sum(c => c.Sum());
