Saturday, November 2, 2013

We will continue our discussion of trend analysis of web searching. In the sample discussed, the number of queries per month showed a similar pattern. In a plot of the natural log of frequency versus the natural log of rank, one line represents the ranks of all words ordered by descending frequency, and another line represents the ranks of unique frequencies. The first line was not straight but had a fairly consistent slope. The second line overlapped the first to a large extent and dropped off at higher ranks. The first did not account for words that shared the same frequency. The second did not reflect the size of the vocabulary.
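As a rough illustration of the two lines in that plot, the points could be computed as in the sketch below. The function name, the whitespace tokenization, and the sample queries are all assumptions made for the example; they are not taken from the study.

```python
import math
from collections import Counter

def rank_frequency_points(queries):
    """Return two lists of (ln rank, ln frequency) points for a query log."""
    # Assumption: words are whitespace-separated and case-insensitive.
    counts = Counter(word for q in queries for word in q.lower().split())

    # Line 1: every word gets its own rank, ordered by descending frequency.
    all_freqs = sorted(counts.values(), reverse=True)
    all_words = [(math.log(r), math.log(f))
                 for r, f in enumerate(all_freqs, start=1)]

    # Line 2: words sharing a frequency collapse into a single rank.
    unique_freqs = sorted(set(counts.values()), reverse=True)
    unique = [(math.log(r), math.log(f))
              for r, f in enumerate(unique_freqs, start=1)]
    return all_words, unique

# Hypothetical sample queries, for illustration only.
sample_queries = [
    "weather seattle",
    "seattle news",
    "weather forecast",
    "seattle weather today",
]
all_words, unique = rank_frequency_points(sample_queries)
```

The first list has one point per word in the vocabulary, while the second has one point per distinct frequency, which is why the second line is much shorter when many words share a low frequency.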
A few more graphs were made to see whether they showed a similar pattern. Many terms had the same frequency, and the low frequencies clustered many words. Different time periods were chosen for the additional graphs, and they all showed similar trends: the two lines overlapped for the high frequency words and diverged for the lower frequency ones. Consequently, two equations were formed, one for each half. The lower frequency, higher rank half was represented by a line, while the higher frequency, lower rank half was represented by a polynomial.
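The two-equation description can be sketched as a pair of least-squares fits on the log-log points. The split point, the polynomial degree of 2, and the function names are assumptions for illustration; the discussion above does not state which values were used.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial fit; returns coefficients, lowest power first."""
    n = degree + 1
    # Normal equations (V^T V) c = V^T y, with V the Vandermonde matrix.
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = sum(a[i][k] * coeffs[k] for k in range(i + 1, n))
        coeffs[i] = (b[i] - s) / a[i][i]
    return coeffs

def fit_two_halves(points, split_log_rank, degree=2):
    """Fit a polynomial to the low-rank half and a line to the high-rank half.

    points are (ln rank, ln frequency) pairs; split_log_rank is an assumed cutoff.
    """
    head = [(x, y) for x, y in points if x <= split_log_rank]  # high frequency
    tail = [(x, y) for x, y in points if x > split_log_rank]   # low frequency
    poly = polyfit([x for x, _ in head], [y for _, y in head], degree)
    line = polyfit([x for x, _ in tail], [y for _, y in tail], 1)
    return poly, line

# Synthetic points, for illustration only: a quadratic head and a linear tail.
head_pts = [(x, 1 + 2 * x - x * x) for x in (0.0, 0.5, 1.0, 1.5, 2.0)]
tail_pts = [(x, 5 - 1.5 * x) for x in (2.5, 3.0, 3.5, 4.0)]
poly, line = fit_two_halves(head_pts + tail_pts, split_log_rank=2.0)
```

Solving the normal equations directly is adequate for low degrees like these; for real data a library routine would likely be more numerically robust.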
The size of the vocabulary increased much more slowly than the number of queries. Investigations into the vocabulary shared across years showed that the overlapping words had higher frequencies each year. Some of these words were site specific and others were more seasonal in nature. The matrix of vocabulary terms to term pairs is sparse, with only a small portion of the vocabulary words co-occurring. In terms of improvements to the search engine, the following observations were made. First, the zero hits caused by stop-words could be addressed either by automatically excluding them or by providing context-sensitive information. Second, the search engine could interpret term pairs and queries, although users may have to be informed about this advanced query construction. Third, the word associations measured by the frequencies of word pairs may be predictable from word frequencies alone. Collecting user vocabulary could help with the generation of content-based metadata.
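To make the sparsity of term pairs concrete, here is a minimal sketch that counts co-occurring pairs within queries and reports what fraction of all possible pairs actually occurs. The function name, the tokenization, and the sample queries are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(queries):
    """Count terms and unordered term pairs that appear in the same query."""
    term_counts = Counter()  # number of queries containing each term
    pair_counts = Counter()  # number of queries containing each pair
    for q in queries:
        # Assumption: whitespace tokens, case-insensitive, duplicates ignored.
        terms = sorted(set(q.lower().split()))
        term_counts.update(terms)
        pair_counts.update(combinations(terms, 2))
    v = len(term_counts)
    possible = v * (v - 1) // 2  # all unordered pairs over the vocabulary
    density = len(pair_counts) / possible if possible else 0.0
    return term_counts, pair_counts, density

# Hypothetical sample queries, for illustration only.
pair_queries = [
    "weather seattle",
    "seattle news",
    "weather forecast",
    "seattle weather today",
]
term_counts, pair_counts, density = cooccurrence(pair_queries)
```

On a real query log the density would be expected to be far lower than in this toy sample, since the vocabulary grows while most words never appear together.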
