Friday, April 7, 2017

Today we continue discussing a survey of content-based techniques in document filtering for indexing. We discussed statistical techniques that use conditional probabilities and Bayes theorem. A Naïve Bayes classifier exhibits good performance on a small feature set, but the same cannot be said for larger feature sets. Features are therefore selected before training. Feature selection techniques include Document Frequency, Information Gain, and Chi-Square.
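As an illustration, here is a minimal sketch of document-frequency based feature selection; the whitespace tokenization and the minDocs threshold are assumptions made for the example, not prescriptions from the survey.

// using System; using System.Collections.Generic; using System.Linq;
// Keeps only the terms that appear in at least minDocs documents.
// Real systems would also apply Information Gain or Chi-Square scores.
static HashSet<string> SelectByDocumentFrequency(IEnumerable<string> documents, int minDocs)
{
    var docFrequency = new Dictionary<string, int>();
    foreach (var doc in documents)
    {
        // count each term at most once per document
        var terms = new HashSet<string>(doc.ToLower().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
        foreach (var term in terms)
            docFrequency[term] = docFrequency.TryGetValue(term, out var n) ? n + 1 : 1;
    }
    return new HashSet<string>(docFrequency.Where(kv => kv.Value >= minDocs).Select(kv => kv.Key));
}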
One technique is k-nearest neighbours. This technique states that if at least t documents among the k neighbors of document m are sensitive, then m is a sensitive document; otherwise it is a legitimate document.
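A minimal sketch of this voting rule follows, assuming the documents are already represented as feature vectors and using a simple dot product as the similarity; the names and parameters are illustrative.

// using System.Collections.Generic; using System.Linq;
// Classifies document m as sensitive if at least t of its k most similar
// labeled neighbors are sensitive.
static bool IsSensitiveKnn(double[] doc, List<(double[] Vector, bool Sensitive)> labeled, int k, int t)
{
    var neighbors = labeled
        .OrderByDescending(s => s.Vector.Zip(doc, (a, b) => a * b).Sum())   // most similar first
        .Take(k);
    return neighbors.Count(n => n.Sensitive) >= t;
}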
Another technique is called the support vector machine, which views the training set as vectors of n attributes, corresponding to points in an n-dimensional hyperspace, in order to perform a binary classification.
Another technique is the principle of maximum entropy. The principle here is to find the probability distribution that maximizes the entropy, computed as the negative sum of the probabilities multiplied by their logarithms, where the variation ranges over both the set of possible classes and the set of possible values of the features.
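In symbols (notation mine, consistent with the description above), the chosen distribution p maximizes

H(p) = - \sum_{c} \sum_{f} p(c, f) \log p(c, f)

over classes c and feature values f, subject to p matching the feature values observed in the training set.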
Another technique is the use of neural networks. For example, with a technique called the perceptron, the idea is to define a linear function in terms of a weight vector and a bias such that the function is greater than zero if the sample belongs to the first class and negative otherwise. The perceptron therefore classifies based on the given vector; when it misclassifies, it adjusts the weights and the bias iteratively.
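A minimal perceptron training sketch follows; the learning rate, the epoch count, and the +1/-1 label encoding are assumptions made for the example.

// using System.Collections.Generic;
// Predicts +1 when w·x + b > 0 and -1 otherwise, and nudges the weights and
// bias whenever a training sample is misclassified.
static (double[] w, double b) TrainPerceptron(List<(double[] X, int Y)> samples, int epochs = 10, double rate = 0.1)
{
    int n = samples[0].X.Length;
    var w = new double[n];
    double b = 0;
    for (int e = 0; e < epochs; e++)
        foreach (var (x, y) in samples)                          // y is +1 (first class) or -1
        {
            double score = b;
            for (int i = 0; i < n; i++) score += w[i] * x[i];
            if (y * score <= 0)                                  // misclassified: adjust iteratively
            {
                for (int i = 0; i < n; i++) w[i] += rate * y * x[i];
                b += rate * y;
            }
        }
    return (w, b);
}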
The technique that is chosen for the learning is only one of the considerations in the overall design. In order to come up with the right classifier, we need to consider the following steps in sequence:
The first step is collecting the dataset. If an expert is available, then we may know which fields (attributes, features) are the most informative; otherwise we try to measure everything in a brute-force manner in order to shortlist candidate features.
The second step is data preparation and preprocessing. Depending on the circumstances, researchers have a number of methods to choose from for handling missing data. Preprocessing not only handles noise but also copes with large data by working on smaller datasets.
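As one example among those methods, here is a minimal sketch of mean imputation for missing numeric features; treating NaN as the missing-value marker is an assumption for the example.

// using System.Linq;
// Replaces missing values (NaN) in each feature column with that column's mean.
static void ImputeWithMean(double[][] rows)
{
    int cols = rows[0].Length;
    for (int c = 0; c < cols; c++)
    {
        var present = rows.Select(r => r[c]).Where(v => !double.IsNaN(v)).ToList();
        double mean = present.Count > 0 ? present.Average() : 0;
        foreach (var r in rows)
            if (double.IsNaN(r[c])) r[c] = mean;
    }
}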
The features are then pruned: feature selection picks a subset of the features and is used to identify and remove as many irrelevant and redundant features as possible. Sometimes features depend on one another and unduly influence the classifier. This is addressed by constructing new features from the basic feature set.
Then the training set is defined. One rule of thumb is to split the available data by using two thirds for training and the remaining third for estimating performance. Another technique is cross-validation, where the data is divided into mutually exclusive, equal-sized subsets and, for each subset, the classifier is trained on the union of all the others. Generally we have only one dataset of size N, and all estimates must be obtained from this sole dataset. A common method for comparing supervised ML algorithms is to compare the accuracies of the classifiers statistically. Algorithm selection depends on the feedback from this evaluation. The feedback from evaluating the test set is also used to improve each of the steps described above.
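A minimal sketch of the two-thirds/one-third split and of forming k mutually exclusive folds follows; the shuffling seed and the default fold count are assumptions made for the example.

// using System; using System.Collections.Generic; using System.Linq;
// Splits the data into a two-thirds training set and a one-third test set.
static (List<T> Train, List<T> Test) HoldoutSplit<T>(List<T> data, int seed = 0)
{
    var rng = new Random(seed);
    var shuffled = data.OrderBy(_ => rng.Next()).ToList();
    int cut = (2 * shuffled.Count) / 3;                          // two thirds for training
    return (shuffled.Take(cut).ToList(), shuffled.Skip(cut).ToList());
}

// Partitions the data into k mutually exclusive, (nearly) equal-sized folds;
// each fold in turn serves as the test set while the union of the others trains the classifier.
static List<List<T>> Folds<T>(List<T> data, int k = 10)
{
    return data.Select((item, i) => (Item: item, Index: i))
               .GroupBy(p => p.Index % k)
               .Select(g => g.Select(p => p.Item).ToList())
               .ToList();
}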
#codingexercise
Find the sum of all subsequences of digits in a number made up of non-repeating digits
// using System.Collections.Generic; using System.Linq;
// Recursively builds every subsequence (combination) of the digits starting at 'start',
// recording a copy of each partial combination as it is formed.
// The digits are assumed distinct, as the exercise states.
void Combine(List<int> digits, List<int> b, int start, List<List<int>> combinations)
{
    for (int i = start; i < digits.Count; i++)
    {
        b.Add(digits[i]);
        combinations.Add(new List<int>(b));           // record a copy, not the shared buffer
        Combine(digits, b, i + 1, combinations);
        b.RemoveAt(b.Count - 1);                      // backtrack
    }
}

// Sum the numbers formed by the digits of each combination, e.g. {1,3} from 123 contributes 13.
int SumOfSubsequences(List<int> digits)
{
    var combinations = new List<List<int>>();
    Combine(digits, new List<int>(), 0, combinations);
    return combinations.Sum(c => c.Aggregate(0, (number, d) => number * 10 + d));
}

Thursday, April 6, 2017

Today we continue discussing a survey of content-based techniques in document filtering for indexing. The concept of excluding sensitive documents from indexing is similar to filtering spam from email. The statistical techniques use conditional probabilities and Bayes theorem. Naïve Bayes exhibits good performance on a small feature set, but the same cannot be said for larger feature sets. The supervised statistical filtering algorithms therefore have two stages: the training stage and the classification stage. During training, a set of labeled documents provides the training data. During classification, the prediction is made. When the dimensionality of the vector grows, the features are pruned or selected. This has the side benefit of improving predictions. Feature selection is usually effective to the degree of reducing the features to orders of magnitude smaller than the original set. Feature selection techniques include Document Frequency, Information Gain, and Chi-Square.
Other techniques include k-nearest neighbours. This technique states that if at least t documents among the k neighbors of document m are sensitive, then m is a sensitive document; otherwise it is a legitimate document. In this technique there is no separate training phase, and the comparison between documents, represented as vectors, happens in real time. The comparison is based on the indexing we have already done for the documents and consequently does not need the document-feature matrix; however, it does involve an update of the sample before the comparisons, which depends on the sample size because each document in the sample is labeled.
Another technique is called the support vector machine, which views the training set as vectors of n attributes, corresponding to points in an n-dimensional hyperspace, in order to perform a binary classification. In a binary classification task such as document filtering, this technique looks for the hyperplane that separates the points of the first class from those of the second class such that their distance from the hyperplane is maximized.
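As an illustration only, here is a minimal sketch of a linear classifier trained with the hinge loss, a simplified stand-in for a full SVM solver; the learning rate, regularization constant, and +1/-1 label encoding are assumptions made for the example.

// using System.Collections.Generic;
// Subgradient descent on the regularized hinge loss: encourages a margin of at least 1
// between the separating hyperplane w·x + b = 0 and every training point.
static (double[] w, double b) TrainLinearSvm(List<(double[] X, int Y)> samples,
                                             int epochs = 100, double rate = 0.01, double lambda = 0.01)
{
    int n = samples[0].X.Length;
    var w = new double[n];
    double b = 0;
    for (int e = 0; e < epochs; e++)
        foreach (var (x, y) in samples)                    // y is +1 or -1
        {
            double score = b;
            for (int i = 0; i < n; i++) score += w[i] * x[i];
            bool insideMargin = y * score < 1;
            for (int i = 0; i < n; i++)
                w[i] -= rate * (lambda * w[i] - (insideMargin ? y * x[i] : 0));
            if (insideMargin) b += rate * y;
        }
    return (w, b);
}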
Another technique is the use of maximum entropy. The principle here is to find the probability distribution that maximizes the entropy, computed as the negative sum of the probabilities multiplied by their logarithms, where the variation ranges over both the set of possible classes and the set of possible values of the features. The maximization should ensure the probabilities are consistent with all values known from the training set.
Another technique is the use of neural networks. For example, with a technique called the perceptron, the idea is to define a linear function in terms of a weight vector and a bias such that the function is greater than zero if the sample belongs to the first class and negative otherwise. The perceptron therefore classifies based on the given vector; when it misclassifies, it adjusts the weights and the bias iteratively.
Another technique is the use of search engines. Often the sensitive information in a document may not be directly inline but referred to through links and references. These are just as exploitable as if the sensitive information were inline. A proposed technique to overcome this challenge is to use the mechanism of search engines, where links are crawled and their content is known. Links are followed to an already classified document, and the current document is then marked as sensitive or not. Additional input from the user can then be used to determine whether the document should or should not be served in the search results. If the content is not already known, it is classified with a simple classifier such as a Naive Bayes classifier. This makes the model more dynamic.
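A minimal sketch of that link-following decision follows; the lookup table, the fallback classifier delegate, and the method names are hypothetical, introduced only to illustrate the flow.

// using System; using System.Collections.Generic;
// Marks a document sensitive if any of its outgoing links points to a document
// already known to be sensitive; otherwise falls back to a simple classifier.
static bool IsSensitiveByLinks(IEnumerable<string> outgoingLinks,
                               IDictionary<string, bool> knownClassifications,
                               Func<string, bool> fallbackClassifier)
{
    foreach (var link in outgoingLinks)
    {
        if (knownClassifications.TryGetValue(link, out bool sensitive))
        {
            if (sensitive) return true;                 // linked to a known sensitive document
        }
        else if (fallbackClassifier(link))              // e.g. a Naive Bayes classifier on the linked content
        {
            return true;
        }
    }
    return false;
}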
Another technique is the use of genetic programming. In the design of a Bayesian filter, the characteristic vector includes the frequencies of words selected as features, generally by manual inspection, and this selection has a dominant effect on the filter's performance. Instead, the filter can be represented by a syntactic tree whose nodes are the numbers that represent the frequencies, operations on the numbers, words, and operations on the words. Syntactic rules can then be used to check the correctness of the tree. Fitness functions are then used to evaluate a population against some criterion. For example, this criterion can be the sum of squares of the error between the filter's classification of a document and the correct classification, over both the sensitive and the allowed document sets.
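A minimal sketch of that fitness criterion follows, assuming each candidate filter is exposed as a delegate returning 1 for sensitive and 0 for allowed; the names are illustrative.

// using System; using System.Collections.Generic; using System.Linq;
// Fitness of a candidate filter: sum of squared errors between its classification
// (1 = sensitive, 0 = allowed) and the known label, over both document sets. Lower is fitter.
static double Fitness(Func<string, int> candidateFilter,
                      IEnumerable<string> sensitiveDocs, IEnumerable<string> allowedDocs)
{
    return sensitiveDocs.Sum(d => Math.Pow(candidateFilter(d) - 1, 2))
         + allowedDocs.Sum(d => Math.Pow(candidateFilter(d) - 0, 2));
}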
#codingexercise
Find the sum of all substrings of digits in a number.
number = "1234"
sumofdigit[0] = 1 = 1
sumofdigit[1] = 2 + 12  = 14
sumofdigit[2] = 3 + 23  + 123 = 149
sumofdigit[3] = 4 + 34  + 234 + 1234  = 1506

Total = 1670
// using System.Collections.Generic; using System.Linq;
// dp[i] holds the number formed by digits A[i..index-1] from the previous call;
// appending A[index] turns each entry into a substring ending at index,
// so the array's sum is sumofdigit[index].
int GetSumDP(List<int> A, int index, int[] dp)
{
    for (int i = 0; i <= index; i++)
        dp[i] = dp[i] * 10 + A[index];
    return dp.Sum();
}

int GetSum(List<int> A)
{
    int result = 0;
    var dp = new int[A.Count];
    for (int i = 0; i < A.Count; i++)
        result += GetSumDP(A, i, dp);
    return result;
}

Wednesday, April 5, 2017

Spam filters based on content analysis:
We referred to filtering techniques using statistical and learning methods and described a few examples here. This does not mean that we cannot have filters based on manually crafted rules. It is true that automated learning methods such as decision trees can build a classifier from a large volume of input, but some rules may be gleaned from manual inspection, and these are just as useful and should be applied during filtering.
For example, in spam messages we have:
1) Emails that solicit personal information in response are treated as spam
2) Emails that provide URLs seeking user action but are not sent over an encrypted channel of communication are treated as spam
3) Emails where the sender has an arbitrary email address that does not conform to well-known formats or to the supposed sender named in the text are considered spam
4) Emails where the text claims authority but does not present any bona fide signature are considered spam
5) Emails where there are grammatical mistakes and the sender requests money are considered spam
6) Emails where the sender includes a file attachment that is not recognized, or a link to download a file without a recognized sender, are considered spam.
The examples above show that there are many rules we can craft from manual inspection that may or may not be found by automated learning. It is important to have a rule-based classifier in the filters as well.
Writing this rule-based classifier is equivalent to executing a body of logic, because each rule is like a statement in a program. There are logical expressions within each statement, and the sequence of statements determines the order of evaluation. Therefore writing the classifier as an administrator-defined function, executed as a custom classifier alongside our statistics- and learning-based classifiers, only aids and improves the classification.
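A minimal sketch of such a rule-based classifier follows; the Email shape and the specific predicates are simplified placeholders loosely based on the rules listed above, not a definitive implementation.

// using System; using System.Collections.Generic; using System.Linq;
// Each rule is a predicate over the email; the mail is spam if any rule fires.
class Email
{
    public string Sender = "", Body = "";
    public List<string> Links = new List<string>();
    public List<string> Attachments = new List<string>();
}

static bool IsSpam(Email mail)
{
    var rules = new List<Func<Email, bool>>
    {
        m => m.Body.Contains("password") || m.Body.Contains("social security"),  // solicits personal information
        m => m.Links.Any(u => u.StartsWith("http://")),                           // action link over an unencrypted channel
        m => m.Attachments.Any(a => !a.EndsWith(".pdf") && !a.EndsWith(".txt"))   // unrecognized attachment type
    };
    return rules.Any(rule => rule(mail));
}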

Since we measure classification with precision and recall, these custom classifiers, which can be authored independently for each domain, can also be measured quantitatively for effectiveness. This can be done both as an independent execution of the classifier and in tandem with the existing classifiers, to see whether the classifiers need to be sequential or joint. The same effectiveness evaluation technique holds true for any number of filters applied in series to an input document or mail for classification.
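A minimal sketch of computing precision and recall from predicted and actual labels follows (true means spam or sensitive here); this is standard bookkeeping rather than anything specific to this post.

// using System.Collections.Generic;
// Precision = TP / (TP + FP); Recall = TP / (TP + FN).
static (double Precision, double Recall) Evaluate(IList<bool> predicted, IList<bool> actual)
{
    int tp = 0, fp = 0, fn = 0;
    for (int i = 0; i < predicted.Count; i++)
    {
        if (predicted[i] && actual[i]) tp++;
        else if (predicted[i] && !actual[i]) fp++;
        else if (!predicted[i] && actual[i]) fn++;
    }
    double precision = tp + fp == 0 ? 0 : (double)tp / (tp + fp);
    double recall = tp + fn == 0 ? 0 : (double)tp / (tp + fn);
    return (precision, recall);
}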
#codingexercise
Find the sum of all substrings of a string representing a number
For example:
number = "1234"
sumofdigit[0] = 1 = 1
sumofdigit[1] = 2 + 12  = 14
sumofdigit[2] = 3 + 23  + 123 = 149
sumofdigit[3] = 4 + 34  + 234 + 1234  = 1506
Total = 1670
// Returns sumofdigit[i]: the sum of all substrings of num that end at index i.
// Recurrence: sumofdigit[i] = (i+1) * digit(i) + 10 * sumofdigit[i-1].
int GetSumOfDigits(string num, int i) // zero based index
{
    if (i == 0) return num[i] - '0';
    return (i + 1) * (num[i] - '0') + 10 * GetSumOfDigits(num, i - 1);
}

int GetSum(string num)
{
    int result = 0;
    for (int i = 0; i < num.Length; i++)
        result += GetSumOfDigits(num, i);
    return result;
}

Tuesday, April 4, 2017

Techniques in document filtering
When we check our email inbox, we pay little attention to the filters that separated the spam from the genuine messages. However, filters serve a very useful purpose, whether they sit on the email server or in your local email application. Filters have been around for a while, going back as far as the introduction of email. Filters are also ubiquitous, appearing anywhere from clients to servers and anything in between, and they target a variety of data, whether structured or unstructured. Filters are also becoming increasingly intelligent, applying a variety of techniques from key-value matching and rules evaluation to text analysis, and they can be specified not just by administrators but also by users.
We discuss some of these techniques in the context of document filtering, where sensitive documents are excluded from the automatic indexing that supports search by words and expressions. This kind of filtering differs from spam filtering. In email, it is much more severe to misclassify a legitimate mail as spam and have it missing from the inbox than to allow a spam to pass the filter and show up in the inbox; the missed mail may have been a personal one, and not letting it through could be a disaster. Document filtering, on the other hand, is more focused on not letting a sensitive document through to indexing, where it may find its way into public search results; it is far less severe to leave a document unindexed, since it could still be found from the file explorer. We review some of the techniques involved in doing intelligent filtering.
There are primarily three types of techniques involved in filtering: statistical techniques, content-based techniques, and learning-based techniques.
First, statistical techniques are based on counting words and their probabilities, and to be precise, conditional probabilities. If a document places higher emphasis on certain words than other documents do, then it can be classified as sensitive. This emphasis can be measured quantitatively and normalized to determine the binary category of the document. A Naive Bayes classifier does just that, and there are more sophisticated versions as well. Although a variety of metrics may be used, almost all treat the document as a vector of features. Features are words in a document, and when there are too many words to grade documents with, the features are pruned. In order to use a classifier to label a document, it must first be trained, so a certain set of documents is used to train the classifier before it can be tested.
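A minimal sketch of training and applying such a Naive Bayes classifier over word counts follows; the add-one (Laplace) smoothing and the whitespace tokenization are assumptions made for the example.

// using System; using System.Collections.Generic;
// Scores log P(class) + sum of log P(word | class) for each class, with add-one
// smoothing, and labels the document with the higher-scoring class.
class NaiveBayes
{
    readonly Dictionary<string, int>[] counts = { new Dictionary<string, int>(), new Dictionary<string, int>() };
    readonly int[] docs = new int[2];
    readonly int[] totalWords = new int[2];
    readonly HashSet<string> vocabulary = new HashSet<string>();

    static IEnumerable<string> Tokens(string doc) =>
        doc.ToLower().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    public void Train(string doc, int label)          // label: 1 = sensitive, 0 = legitimate
    {
        docs[label]++;
        foreach (var w in Tokens(doc))
        {
            counts[label][w] = counts[label].TryGetValue(w, out var n) ? n + 1 : 1;
            totalWords[label]++;
            vocabulary.Add(w);
        }
    }

    public int Classify(string doc)
    {
        double best = double.NegativeInfinity; int bestLabel = 0;
        for (int label = 0; label < 2; label++)
        {
            double score = Math.Log((docs[label] + 1.0) / (docs[0] + docs[1] + 2.0));   // smoothed prior
            foreach (var w in Tokens(doc))
            {
                counts[label].TryGetValue(w, out var n);
                score += Math.Log((n + 1.0) / (totalWords[label] + vocabulary.Count + 1.0));
            }
            if (score > best) { best = score; bestLabel = label; }
        }
        return bestLabel;
    }
}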
Second, content-based techniques are used when we want to find out how similar a document is to others. Sensitive documents may share something in common, and sets of documents might be used collaboratively to rate an incoming document. We can again represent the document as a vector, but with similarity measures rather than conditional probabilities. Content-based learning also explores a variety of classification techniques that are not only based on document neighbors but also spread the documents out in a hyperspace with as many dimensions as features so as to divide them into categories.
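A minimal sketch of one such similarity measure, the cosine similarity between two feature vectors, follows; it is a common choice, though the discussion above does not prescribe a particular measure.

// using System;
// Cosine similarity: the dot product of the two vectors divided by the product of their lengths.
static double CosineSimilarity(double[] a, double[] b)
{
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return normA == 0 || normB == 0 ? 0 : dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}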
Third, learning-based techniques involve a variety of techniques from artificial intelligence, but I want to bring up the notion of a not-so-popular adaptive filtering. Documents like email messages are personal, and over time a user may receive messages or generate documents that carry traits specific to that user. We can build up these traits from a time series of content and classify what may be sensitive and what may not be. We make the filter as personal as possible, and more intelligent over time, by building a set that bears the user's distinct touch.
In conclusion, filters are like lenses. They do not change the reality but help us take actions on what is most relevant.

Monday, April 3, 2017

Today we continue discussing a survey of statistical and learning techniques in document filtering for indexing. The concept of excluding sensitive documents from indexing is similar to filtering spam from email. The statistical techniques use conditional probabilities and Bayes theorem. Naïve Bayes exhibits good performance on a small feature set, but the same cannot be said for larger feature sets. The supervised statistical filtering algorithms therefore have two stages: the training stage and the classification stage. During training, a set of labeled documents provides the training data. During classification, the prediction is made. When the dimensionality of the vector grows, the features are pruned or selected. This has the side benefit of improving predictions. Feature selection is usually effective to the degree of reducing the features to orders of magnitude smaller than the original set. Feature selection techniques include Document Frequency, Information Gain, and Chi-Square.
The learning-based techniques include k-nearest neighbours. This technique states that if at least t documents among the k neighbors of document m are sensitive, then m is a sensitive document; otherwise it is a legitimate document. In this technique there is no separate training phase, and the comparison between documents, represented as vectors, happens in real time. The comparison is based on the indexing we have already done for the documents and consequently does not need the document-feature matrix; however, it does involve an update of the sample before the comparisons, which depends on the sample size because each document in the sample is labeled.
Another technique is called the support vector machine, which views the training set as vectors of n attributes, corresponding to points in an n-dimensional hyperspace, in order to perform a binary classification. In a binary classification task such as document filtering, this technique looks for the hyperplane that separates the points of the first class from those of the second class such that their distance from the hyperplane is maximized.
Another technique is the use of maximum entropy. The principle here is to find the probability distribution that maximizes the entropy, computed as the negative sum of the probabilities multiplied by their logarithms, where the variation ranges over both the set of possible classes and the set of possible values of the features. The maximization should ensure the probabilities are consistent with all values known from the training set.
Another technique is the use of neural networks. For example, with a technique called the perceptron, the idea is to define a linear function in terms of a weight vector and a bias such that the function is greater than zero if the sample belongs to the first class and negative otherwise. The perceptron therefore classifies based on the given vector; when it misclassifies, it adjusts the weights and the bias iteratively.
#codingexercise
Sum of all substrings of digits from a number:
       Get all valid contiguous combinations using the combine method above (substrings are contiguous, so restrict the combinations to contiguous selections)
        Sum the combinations

Sum of all possible numbers from a selection of digits of a number:
        get all valid combinations as well as their permutations of the digits
        Sum the generated numbers

Sunday, April 2, 2017

Today we continue discussing a survey of statistical and learning techniques in document filtering for indexing. The concept of excluding sensitive documents from indexing is similar to filtering spam from email. The statistical techniques use conditional probabilities and Bayes theorem. Naïve Bayes exhibits good performance on a small feature set, but the same cannot be said for larger feature sets. The supervised statistical filtering algorithms therefore have two stages: the training stage and the classification stage. During training, a set of labeled documents provides the training data. During classification, the prediction is made. When the dimensionality of the vector grows, the features are pruned or selected. This has the side benefit of improving predictions. Feature selection is usually effective to the degree of reducing the features to orders of magnitude smaller than the original set. Feature selection techniques include Document Frequency, Information Gain, and Chi-Square.
The learning-based techniques include k-nearest neighbours. This technique states that if at least t documents among the k neighbors of document m are sensitive, then m is a sensitive document; otherwise it is a legitimate document. In this technique there is no separate training phase, and the comparison between documents, represented as vectors, happens in real time. The comparison is based on the indexing we have already done for the documents and consequently does not need the document-feature matrix; however, it does involve an update of the sample before the comparisons, which depends on the sample size because each document in the sample is labeled.
Another technique is called the support vector machine, which views the training set as vectors of n attributes, corresponding to points in an n-dimensional hyperspace, in order to perform a binary classification. In a binary classification task such as document filtering, this technique looks for the hyperplane that separates the points of the first class from those of the second class such that their distance from the hyperplane is maximized.

Saturday, April 1, 2017

Today we continue discussing a survey of statistical techniques in document filtering for indexing. The concept of excluding sensitive documents from indexing is similar to filtering spam from email. These techniques use conditional probabilities and Bayes theorem. Naïve Bayes exhibits good performance on a small feature set, but the same cannot be said for larger feature sets. The supervised statistical filtering algorithms therefore have two stages: the training stage and the classification stage. During training, a set of labeled documents provides the training data. During classification, the prediction is made. When the dimensionality of the vector grows, the features are pruned or selected. This has the side benefit of improving predictions. Feature selection is usually effective to the degree of reducing the features to orders of magnitude smaller than the original set. Feature selection techniques include Document Frequency, Information Gain, and Chi-Square.
The content-based filtering techniques use simple stop-word searches and format specifiers with their matches for sensitive data such as SSNs. Documents can also be analyzed with text analysis methods such as mutual information or pointwise mutual information against a glossary of sensitive words and their occurrences. One of the most appealing content-based techniques, where indexing is already utilized in the classification steps, is k-nearest neighbours. This technique states that if at least t documents among the k neighbors of document m are sensitive, then m is a sensitive document; otherwise it is a legitimate document. In this technique there is no separate training phase, and the comparison between documents, represented as vectors, happens in real time. The comparison is based on the indexing we have already done for the documents and consequently does not need the document-feature matrix; however, it does involve an update of the sample before the comparisons, which depends on the sample size because each document in the sample is labeled.

#codingexercise
Find the largest contiguous subarray sum:
We can solve it another way from the one discussed yesterday.
// Alternatively,
Initialize and maintain the following variables:
x = running sum of the elements seen so far, initially 0
y = minimum running sum encountered so far, initially 0
z = maximum subarray sum found so far, initially the smallest possible value
For each element a of the array:
           x = x + a
           z = max(z, x - y)
           y = min(y, x)
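A minimal C# sketch of this prefix-sum formulation follows; it assumes the array is non-empty and that at least one element must be taken.

// using System;
// x: running prefix sum, y: minimum prefix sum seen before the current position,
// z: best subarray sum found so far (x - y is the best sum of a subarray ending here).
static int MaxSubarraySum(int[] a)
{
    int x = 0, y = 0, z = int.MinValue;
    foreach (int value in a)
    {
        x += value;
        z = Math.Max(z, x - y);
        y = Math.Min(y, x);
    }
    return z;
}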