Saturday, November 2, 2013

Trend and behavior detection is applied not only in monitoring but also in information retrieval. A paper by Wang, Bownas and Berry discusses it in the context of web queries.
The authors discuss the type and nature of query characteristics that can be mined from web server logs. They studied the vocabulary of the queries submitted to a website over a long period of time and found that it did not follow a well-defined Zipf distribution. Instead, the trends could be better represented as piecewise polynomial data fits, i.e., fits based on regression analysis.
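To make the idea of a piecewise fit concrete, here is a minimal sketch in plain Python. It is not the authors' method: the post does not give their polynomial degree or breakpoints, so this assumes the simplest case (degree 1, one hand-picked breakpoint) and uses made-up log-rank and log-frequency values. The names `fit_line`, `sse` and `piecewise_fit` are hypothetical helpers.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def sse(xs, ys, a, b):
    """Sum of squared residuals of the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def piecewise_fit(xs, ys, breakpoint):
    """Fit one line to points with x < breakpoint and another to the rest.

    Returns ([(a_left, b_left), (a_right, b_right)], total_sse).
    The breakpoint is an assumption supplied by the caller.
    """
    fits, total = [], 0.0
    for keep in (lambda x: x < breakpoint, lambda x: x >= breakpoint):
        seg_x = [x for x in xs if keep(x)]
        seg_y = [y for x, y in zip(xs, ys) if keep(x)]
        a, b = fit_line(seg_x, seg_y)
        fits.append((a, b))
        total += sse(seg_x, seg_y, a, b)
    return fits, total


# Made-up data with two regimes: slope -1 up to x = 3, slope -2 afterwards.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [10, 9, 8, 7, 6, 4, 2, 0]

single_a, single_b = fit_line(xs, ys)
single_error = sse(xs, ys, single_a, single_b)
segment_fits, piecewise_error = piecewise_fit(xs, ys, breakpoint=4)

print("single line error:", single_error)
print("piecewise error:  ", piecewise_error)
```

When the data bends, as in this sample, the two-segment fit leaves a smaller residual than a single straight line, which is the sense in which a piecewise fit describes the trend better.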
Studies of traditional information retrieval systems reveal many problems that searchers encountered. These include the complexity of query syntaxes, the semantics of Boolean logic operators, and the linguistic ambiguities of natural language. Since user education was unlikely, systems were designed to facilitate self-learning. Search logs from different search engines were collected and studied. Even though the web search engines differed widely in data collection, processing, focus of search, etc., their results were comparable. Web queries were short: they averaged two words and were simple in structure, few used advanced search engine features, and many contained errors. These observations came from web logs that were quite large but covered a very short period of time. Hence the authors' study used logs from a university website spanning a longer time instead. This made it possible to study trends over time.
The log data included a date stamp, the query statement and hits. Queries with zero hits were caused by many factors. First, the stop words set by the engine excluded many words. Second, the queries contained a high percentage of misspelled words. Third, many searches entered the names of individuals. Fourth, the default Boolean operator was AND; any other Boolean operator had to be included explicitly. Fifth, empty queries also contributed to zero hits.
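The zero-hit causes listed above can be tallied with a short script. The following is only a sketch under stated assumptions: the tab-separated `date<TAB>query<TAB>hits` layout, the stop-word list, the sample records and the function names are all hypothetical, since the post does not describe the real log format. Misspellings and personal names cannot be detected without a dictionary, so they fall into a catch-all bucket here.

```python
from collections import Counter

# Hypothetical stop-word list; the engine's real list is not given in the post.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "for"}


def parse_line(line):
    """Split one assumed 'date<TAB>query<TAB>hits' record into typed fields."""
    date, query, hits = line.rstrip("\n").split("\t")
    return date, query.strip(), int(hits)


def zero_hit_reason(query):
    """Return a coarse guess at why a query could have returned zero hits."""
    terms = query.lower().split()
    if not terms:
        return "empty query"
    if all(term in STOP_WORDS for term in terms):
        return "only stop words"
    if len(terms) > 1:
        return "multiple terms (implicit AND)"
    return "other (e.g. misspelling or personal name)"


def summarize_zero_hits(lines):
    """Count guessed reasons over all records whose hit count is zero."""
    reasons = Counter()
    for line in lines:
        _, query, hits = parse_line(line)
        if hits == 0:
            reasons[zero_hit_reason(query)] += 1
    return reasons


# Made-up records for illustration only.
sample_log = [
    "2013-11-02\tlibrary hours\t12",
    "2013-11-02\t\t0",
    "2013-11-02\tthe of\t0",
    "2013-11-02\tregistar\t0",
    "2013-11-02\tjohn smith transcript\t0",
]

for reason, count in summarize_zero_hits(sample_log).items():
    print(reason, count)
```

The checks run from most to least specific, so a query made only of stop words is not also counted as an implicit-AND failure.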
The queries submitted to the search engine were very short and came from different users, so it was difficult to predict whether the vocabulary would show similar statistical distributions. A plot of the logarithmic rank-frequency distribution of terms showed lines that were neither smooth nor straight, unlike the straight line that a Zipf distribution would produce.
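A rank-frequency check of this kind can be sketched as follows. This is an illustration, not the authors' analysis: the whitespace tokenization, the function names and the sample data are assumptions. Under an ideal Zipf distribution the log-log points fall on a straight line with a slope near -1, so a slope far from that, or large residuals, suggests the vocabulary does not fit well.

```python
import math
from collections import Counter


def rank_frequency(queries):
    """Return rows of (rank, term, frequency, log10 rank, log10 frequency).

    Terms are lowercased and split on whitespace, which is an assumption.
    """
    counts = Counter(term for q in queries for term in q.lower().split())
    rows = []
    for rank, (term, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, term, freq, math.log10(rank), math.log10(freq)))
    return rows


def loglog_slope(rows):
    """Least-squares slope of log10(frequency) against log10(rank).

    Needs at least two distinct terms, otherwise the slope is undefined.
    """
    xs = [row[3] for row in rows]
    ys = [row[4] for row in rows]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Made-up queries for illustration only.
sample_queries = [
    "library hours",
    "library catalog",
    "campus map",
    "library",
    "parking permit",
    "campus parking",
]

rows = rank_frequency(sample_queries)
for rank, term, freq, _, _ in rows:
    print(rank, term, freq)
print("log-log slope:", loglog_slope(rows))
```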
