Cluster computing

Wednesday, April 5, 2017

Spam filters based on content analysis:

We referred to filtering techniques using statistical and learning methods and described a few examples here. This does not mean that we cannot have filters based on manually derived rules. It is true that supervised learning methods such as decision trees can build a classifier from a large volume of labeled input, but some rules can be gleaned from manual inspection; these are just as useful and should be applied during filtering.

For example, the following can be treated as spam:

1) Emails that solicit personal information in a reply

2) Emails that provide URLs for user action where the links do not use an encrypted channel of communication

3) Emails where the sender's address is arbitrary: it does not conform to well-known formats or does not match the supposed sender named in the text

4) Emails where the text claims authority but presents no bona fide signature

5) Emails that contain grammatical mistakes and request money

6) Emails that carry an unrecognized file attachment, or a link to download a file, from a sender who is not recognized

The examples above show that there are many rules we can craft from manual inspection that may or may not be discovered by a learned classifier. It is therefore important to have a rule-based classifier in the filters as well.

Writing this rule-based classifier is equivalent to writing a body of logic, because each rule is like a statement in a program. Each statement contains logical expressions, and the sequence of statements determines the order of evaluation. The classifier can therefore be written as a function defined by the system administrator and executed as a custom classifier alongside our statistical and learning-based classifiers, where it aids and improves the classification.
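The idea that each rule is a statement evaluated in order can be sketched as below. This is a minimal illustration in Python, not a production filter: the Email fields, the rule functions, and the keyword patterns are all hypothetical stand-ins for whatever an administrator would actually define, and only four of the six rules above are approximated (the grammar check in rule 5 is left out).

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Email:
    # Hypothetical minimal message model; a real filter would parse MIME.
    sender: str
    body: str
    attachments: List[str] = field(default_factory=list)


# Illustrative patterns only; real rules would be tuned per domain.
PERSONAL_INFO = re.compile(r"\b(password|ssn|social security|account number)\b", re.I)
PLAIN_HTTP_URL = re.compile(r"http://\S+", re.I)
ADDRESS_FORMAT = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")
MONEY_REQUEST = re.compile(r"\b(send|wire|transfer)\b.*\b(money|funds|payment)\b", re.I)


def asks_for_personal_info(mail: Email) -> bool:
    return bool(PERSONAL_INFO.search(mail.body))


def has_unencrypted_link(mail: Email) -> bool:
    return bool(PLAIN_HTTP_URL.search(mail.body))


def has_malformed_sender(mail: Email) -> bool:
    return not ADDRESS_FORMAT.match(mail.sender)


def requests_money(mail: Email) -> bool:
    # Simplification of rule 5: the grammar check is not modeled here.
    return bool(MONEY_REQUEST.search(mail.body))


# The list order is the order of evaluation, like statements in a program.
RULES: List[Tuple[str, Callable[[Email], bool]]] = [
    ("solicits personal information", asks_for_personal_info),
    ("link over unencrypted channel", has_unencrypted_link),
    ("sender address not well formed", has_malformed_sender),
    ("requests money", requests_money),
]


def classify(mail: Email) -> Optional[str]:
    """Return the name of the first rule that fires, or None if no rule fires."""
    for name, rule in RULES:
        if rule(mail):
            return name
    return None
```

Returning the name of the rule that fired, instead of a bare true or false, makes it easier to see which rule is responsible when a legitimate mail is flagged.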

Since we measure classification quality by precision and recall, these custom classifiers, which can be authored independently for each domain, can also be measured quantitatively for effectiveness. This can be done by running the classifier on its own or in tandem with the existing classifiers, to see whether the classifiers should be applied sequentially or jointly. The same evaluation technique holds for any number of filters applied in series to an input document or mail.
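As a rough sketch of that measurement, the Python below computes precision and recall for a single filter and for filters applied in series. Everything here is assumed for illustration: the two filters are trivial stand-ins (one hand-written rule, one pretending to be a learned classifier), the labeled sample is made up, and "in series" is read as flagging a message as soon as any filter flags it, which is only one possible way to combine them.

```python
from typing import Callable, Iterable, List, Tuple


def precision_recall(predicted: Iterable[bool], actual: Iterable[bool]) -> Tuple[float, float]:
    """Compute precision and recall for the positive (spam) class."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)        # spam flagged as spam
    fp = sum(1 for p, a in pairs if p and not a)    # legitimate mail flagged as spam
    fn = sum(1 for p, a in pairs if not p and a)    # spam that slipped through
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def in_series(filters: List[Callable[[str], bool]]) -> Callable[[str], bool]:
    """Combine filters so a message is spam as soon as any filter flags it."""
    return lambda text: any(f(text) for f in filters)


# Hypothetical stand-ins: one hand-written rule and one "learned" classifier.
def rule_filter(text: str) -> bool:
    return "http://" in text


def learned_filter(text: str) -> bool:
    return "winner" in text.lower()


# Tiny hand-labeled sample of (text, is_spam), made up for illustration.
SAMPLE = [
    ("Claim your prize at http://example.com", True),
    ("You are a WINNER, reply now", True),
    ("Meeting notes attached", False),
    ("Docs moved to http://intranet.example.com", False),
]


def evaluate(classifier: Callable[[str], bool]) -> Tuple[float, float]:
    """Run a classifier over SAMPLE and return (precision, recall)."""
    labels = [is_spam for _, is_spam in SAMPLE]
    preds = [classifier(text) for text, _ in SAMPLE]
    return precision_recall(preds, labels)
```

On this made-up sample, each filter alone reaches a recall of 0.5, while the series combination catches both spam messages at the cost of keeping the rule's one false positive. Comparing such numbers for each arrangement is how one would decide whether the custom classifier earns its place.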
#codingexercise
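Find the sum of all substrings of a string representing a number, where each substring is read as a decimal integer.
For example, with sumofdigit[i] denoting the sum of the substrings that end at index i:
number = "1234"
sumofdigit[0] = 1
sumofdigit[1] = 2 + 12 = 14
sumofdigit[2] = 3 + 23 + 123 = 149
sumofdigit[3] = 4 + 34 + 234 + 1234 = 1506
Total = 1 + 14 + 149 + 1506 = 1670
Each of the i+1 substrings ending at index i has num[i] as its last digit, and the part before that digit is either empty or a substring ending at i-1 shifted one decimal place, so sumofdigit[i] = (i+1)*num[i] + 10*sumofdigit[i-1].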
int GetSumOfDigits(string num, int i) // sum of substrings ending at zero-based index i
{
    int digit = num[i] - '0'; // convert the character to its numeric value
    if (i == 0) return digit;
    return (i + 1) * digit + 10 * GetSumOfDigits(num, i - 1);
}
int GetSum(string num)
{
    int result = 0;
    for (int i = 0; i < num.Length; i++)
        result += GetSumOfDigits(num, i);
    return result;
}
