Tuesday, April 4, 2017

Techniques in document filtering
When we check our email inbox, we pay little attention to the filters that separated the spam from the genuine messages. However, filters serve a very useful purpose, whether they run on the email server or in the local email application. They have been around almost as long as email itself, and they are ubiquitous, appearing anywhere from clients to servers and everything in between, targeting data both structured and unstructured. Filters are also becoming increasingly intelligent, applying techniques ranging from key-value matching and rules evaluation to text analysis, and they can be specified not just by administrators but also by users.
We discuss some of these techniques in the context of document filtering, where sensitive documents are kept out of automatic indexing that supports search over words and expressions. This kind of filtering differs from spam filtering in how it weighs mistakes. In spam filtering, it is much more severe to misclassify a legitimate mail as spam, so that it never reaches the inbox, than to let a spam message pass the filter and show up there; the missed mail may have been personal, and losing it could be a disaster. Document filtering, on the other hand, is focused on not letting a sensitive document through to indexing, where it may find its way into public search results. It is far less severe to keep a document out of the index, since it can still be found from a file explorer. We review some of the techniques involved in doing intelligent filtering.
There are primarily three different types of techniques involved in filtering – statistical techniques, content-based techniques, and learning-based techniques.
First, statistical techniques are based on counting words and their probabilities, and to be precise their conditional probabilities. If a document places higher emphasis on certain words than other documents do, it can be classified as sensitive. This emphasis can be measured quantitatively and normalized to determine the binary category of the document. A naive Bayes classifier does just that, and there are more sophisticated versions as well. Although a variety of metrics may be used for the measurement, almost all treat the document as a vector of features. Features are words in a document, and when there are too many words to grade documents with, the features are pruned. In order to use a classifier to label a document, it must first be trained, so a certain set of documents is used to train the classifier before it can be tested.
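The idea above can be sketched with a minimal naive Bayes classifier. This is an illustrative toy, not a production filter: the training documents, the "sensitive"/"public" labels, and the word lists are all made-up examples, and the only statistical machinery is word counting with add-one smoothing, exactly as the paragraph describes.

```python
import math
from collections import Counter

def train(docs):
    """Count word frequencies and document counts per class
    from (words, label) training pairs."""
    counts = {"sensitive": Counter(), "public": Counter()}
    totals = Counter()
    for words, label in docs:
        counts[label].update(words)
        totals[label] += 1
    return counts, totals

def classify(words, counts, totals):
    """Pick the label with the highest log-posterior, using
    add-one (Laplace) smoothed conditional probabilities P(word | label)."""
    vocab = set(counts["sensitive"]) | set(counts["public"])
    best_label, best_score = None, float("-inf")
    for label in counts:
        # log prior from class frequencies in the training set
        score = math.log(totals[label] / sum(totals.values()))
        n = sum(counts[label].values())
        for w in words:
            score += math.log((counts[label][w] + 1) / (n + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled documents, already tokenized into words.
training = [
    (["salary", "report", "confidential"], "sensitive"),
    (["password", "confidential", "account"], "sensitive"),
    (["meeting", "agenda", "notes"], "public"),
    (["lunch", "schedule", "notes"], "public"),
]
counts, totals = train(training)
print(classify(["confidential", "salary"], counts, totals))  # sensitive
```

The train-then-classify split mirrors the point at the end of the paragraph: the classifier is useless until a labeled set has been folded into its counts.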
Second, content-based techniques are used when we want to find out how similar a document is to others. Sensitive documents may share something in common, and sets of documents might be used collaboratively to rate an incoming document. We can again treat the document as a vector, but with similarity measures rather than conditional probabilities. Content-based learning also explores classification techniques that are based not only on a document's neighbors but also on spreading the documents out in a hyperspace with as many dimensions as features, so as to divide them into categories.
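As one concrete reading of "similarity measures rather than conditional probabilities," here is a sketch that rates an incoming document by cosine similarity to a set of labeled documents and takes a majority vote among the nearest neighbors. The documents, labels, and the choice of k are illustrative assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_label(doc, labeled_docs, k=3):
    """Label a document by majority vote among its k most similar neighbors."""
    ranked = sorted(labeled_docs, key=lambda dl: cosine(doc, dl[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled documents as word-count vectors.
known = [
    (Counter(["salary", "confidential", "report"]), "sensitive"),
    (Counter(["password", "account", "confidential"]), "sensitive"),
    (Counter(["meeting", "agenda", "notes"]), "public"),
    (Counter(["lunch", "menu", "notes"]), "public"),
    (Counter(["team", "schedule", "agenda"]), "public"),
]
incoming = Counter(["confidential", "salary", "account"])
print(knn_label(incoming, known))  # sensitive
```

The "hyperspace with as many dimensions as features" in the paragraph is exactly the space these word-count vectors live in; neighbor voting is one simple way to carve it into categories.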
Third, learning-based techniques draw on a variety of methods from artificial intelligence, but I want to bring up the not-so-popular notion of adaptive filtering. Documents, like email messages, are personal, and over time a user may receive messages or generate documents that carry a trait specific to that user. We can build this trait from a collection of time-series content and classify what may be sensitive and what may not be. We make the filter as personal as possible, and more intelligent over time, by building a set that bears the user's distinct touch.
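One way to make that concrete is a per-user filter whose word weights decay with each new observation, so recent feedback counts more than old feedback. This is a sketch under stated assumptions: the decay factor, the threshold, and the example words are all arbitrary choices for illustration, not a standard algorithm from the post.

```python
from collections import defaultdict

class AdaptiveFilter:
    """Per-user word weights that decay over time, so the profile
    adapts as the user's documents and feedback accumulate."""

    def __init__(self, decay=0.9, threshold=1.0):
        self.decay = decay          # assumed decay factor per observation
        self.threshold = threshold  # assumed sensitivity cutoff
        self.weights = defaultdict(float)

    def update(self, words, sensitive):
        """Fold one labeled document into the profile; older evidence fades."""
        for w in list(self.weights):
            self.weights[w] *= self.decay
        delta = 1.0 if sensitive else -1.0
        for w in words:
            self.weights[w] += delta

    def score(self, words):
        return sum(self.weights[w] for w in words)

    def is_sensitive(self, words):
        return self.score(words) >= self.threshold

f = AdaptiveFilter()
f.update(["payroll", "ssn"], sensitive=True)    # user flags this as sensitive
f.update(["picnic", "rsvp"], sensitive=False)   # user flags this as harmless
print(f.is_sensitive(["payroll", "ssn"]))   # True
print(f.is_sensitive(["picnic", "rsvp"]))   # False
```

Because every update first shrinks the existing weights, the filter's notion of "sensitive" drifts with the user over time, which is the adaptive, personal quality the paragraph describes.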
In conclusion, filters are like lenses. They do not change the reality but help us take actions on what is most relevant.
