Spam filters based on content analysis:
We referred to filtering techniques that use statistical and
learning methods and described a few examples here. This does not
mean that we cannot have filters based on manually written rules. It’s true that supervised learning methods such as
decision trees can build a classifier from a large volume of labeled input, but some
rules can be gleaned from manual inspection, and these are just as useful and
should be used during filtering.
For example, manual inspection of spam messages suggests rules such as:
1) Emails that solicit personal information in response are treated as spam.
2) Emails that provide URLs seeking user action, where the URLs do not use an encrypted channel of communication, are treated as spam.
3) Emails where the sender has an arbitrary email address that does not conform to well-known formats, or does not match the supposed sender named in the text, are treated as spam.
4) Emails where the text claims authority but does not present any bona fide signature are treated as spam.
5) Emails that contain grammatical mistakes and in which the sender requests money are treated as spam.
6) Emails that carry an unrecognized file attachment, or a link to download a file from an unrecognized sender, are treated as spam.
The examples above show that there are many rules that we
can craft from manual inspection that may or may not be discovered by a
learning algorithm. It is therefore important to include a rule-based classifier among the
filters.
Writing this rule-based classifier is equivalent to
executing a body of logic because each rule is like a statement in a program.
There are logical expressions within each statement, and the sequence of
statements determines the order of evaluation. The classifier can therefore be written
as a function defined by the system administrator and executed as a custom
classifier alongside our statistical and learning-based classifiers, where it
complements them and can improve the classification.
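The idea of rules as ordered statements can be sketched roughly as follows. This is only an illustration: the `Email`, `Rule`, and `RuleBasedClassifier` types, the field names, and the crude string checks are all assumptions made for the example, not part of any real filter.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical message type; the field names are illustrative assumptions.
public class Email
{
    public string Sender = "";
    public string Body = "";
    public bool HasUnrecognizedAttachment;
}

// One hand-written rule: a name plus a logical expression over the message.
public class Rule
{
    public string Name;
    public Func<Email, bool> Matches;
    public Rule(string name, Func<Email, bool> matches) { Name = name; Matches = matches; }
}

public static class RuleBasedClassifier
{
    // Rules are evaluated in list order, like statements in a program.
    // The checks are deliberately crude stand-ins for rules 2, 1 and 6.
    public static readonly List<Rule> Rules = new List<Rule>
    {
        new Rule("plain-http-link", m => m.Body.Contains("http://")),
        new Rule("asks-for-personal-info",
                 m => m.Body.IndexOf("password", StringComparison.OrdinalIgnoreCase) >= 0),
        new Rule("unrecognized-attachment", m => m.HasUnrecognizedAttachment),
    };

    // Returns the name of the first rule that fires, or "" if none does.
    public static string FirstMatch(Email m)
    {
        foreach (var rule in Rules)
            if (rule.Matches(m)) return rule.Name;
        return "";
    }

    public static bool IsSpam(Email m) => FirstMatch(m) != "";
}
```

Returning the name of the rule that fired, rather than only a yes/no answer, makes it easier for an administrator to see which rule is responsible for a decision and to reorder or retire rules.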
Since we measure classification by precision and recall,
these custom classifiers, which can be authored independently for each domain, can
also be measured quantitatively for effectiveness. This can be done either by executing
the custom classifier on its own or in tandem with the existing classifiers, to
see whether the classifiers should be applied sequentially or jointly. The same evaluation technique
holds for any number of filters applied in series to an input document or
mail.
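As a minimal sketch of that measurement, precision and recall can be computed from a classifier's predictions and the known labels. The `FilterMetrics` class and its `Evaluate` method are hypothetical names for this example, and reporting 0 when a denominator is 0 is an assumed convention.

```csharp
using System;
using System.Collections.Generic;

public static class FilterMetrics
{
    // predicted[i] is true when the filter marks message i as spam;
    // actual[i] is true when message i is labeled as spam.
    public static (double Precision, double Recall) Evaluate(
        IReadOnlyList<bool> predicted, IReadOnlyList<bool> actual)
    {
        if (predicted.Count != actual.Count)
            throw new ArgumentException("predicted and actual must have the same length");

        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < predicted.Count; i++)
        {
            if (predicted[i] && actual[i]) tp++;        // spam caught
            else if (predicted[i] && !actual[i]) fp++;  // good mail flagged
            else if (!predicted[i] && actual[i]) fn++;  // spam missed
        }

        // Assumed convention: report 0 when the denominator is 0.
        double precision = tp + fp == 0 ? 0.0 : (double)tp / (tp + fp);
        double recall = tp + fn == 0 ? 0.0 : (double)tp / (tp + fn);
        return (precision, recall);
    }
}
```

Running `Evaluate` once on the custom classifier's predictions alone and once on the combined predictions (for example, a message counts as spam if either classifier flags it) gives two pairs of numbers that can be compared directly.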
#codingexercise
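Find the sum of all substrings of a string representing a number.
For example:
number = "1234"
sumofdigit[0] = 1 = 1
sumofdigit[1] = 2 + 12 = 14
sumofdigit[2] = 3 + 23 + 123 = 149
sumofdigit[3] = 4 + 34 + 234 + 1234 = 1506
Total = 1670
Here sumofdigit[i] is the sum of all substrings that end at index i. Each such substring is either the digit at i alone or a substring ending at i-1 with that digit appended, so sumofdigit[i] = (i+1) * digit[i] + 10 * sumofdigit[i-1].
// Sum of all substrings of num that end at the zero-based index i.
int GetSumOfDigits(string num, int i)
{
    int digit = num[i] - '0'; // convert the character to its numeric value
    if (i == 0) return digit;
    return (i + 1) * digit + 10 * GetSumOfDigits(num, i - 1);
}
// Sum of all substrings of num, accumulated over every ending index.
int GetSum(string num)
{
    int result = 0;
    for (int i = 0; i < num.Length; i++)
        result += GetSumOfDigits(num, i);
    return result;
}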
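The recursion above recomputes the earlier terms for every index. One possible alternative, shown here only as a sketch, carries the previous term forward in a single pass. The `SubstringSums` class and `GetSumIterative` name are made up for this example, and `long` is used on the assumption that longer inputs could overflow an `int`.

```csharp
public static class SubstringSums
{
    // Single-pass form of the recurrence
    // sum[i] = (i + 1) * digit[i] + 10 * sum[i - 1].
    // Assumes num contains only the characters '0' to '9'.
    public static long GetSumIterative(string num)
    {
        long endingHere = 0; // sum of all substrings ending at the current index
        long total = 0;      // running sum over every ending index
        for (int i = 0; i < num.Length; i++)
        {
            int digit = num[i] - '0';
            endingHere = (long)(i + 1) * digit + 10 * endingHere;
            total += endingHere;
        }
        return total;
    }
}
```

For the example above, `SubstringSums.GetSumIterative("1234")` accumulates 1, 14, 149 and 1506 for a total of 1670.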