Friday, May 10, 2013

Inverted file, TF- IDFs and Shannon information associated with terms

This is a data structure that is useful to retrieve all documents that contain the terms in a query. For each word, there is an ordered list of document identifiers maintained that has the given data. All query terms are organized in a second index structure such as a B+ tree or hash index and is referred to as the vocabulary index. With a single term from the query, the vocabulary index is looked up to find an inverted list and with that list, the rids are mapped to find the documents containing that term. For conjunction of several terms, the process is repeated for one word at a time. For a query with disjunction of terms, the inverted lists are merged.
A Term frequency - Inverse document frequency (Salton & Yang 1973) gives a weight for the given term among other terms and can be useful for such things as keyword extraction. A TF-IDF is defined as N(w)log D/D(w) where D(w) represents a number of documents containing word w in the training data, and D the total number of documents in a training data. 
Shannon information similarly gives a code for encoding the words in a set W. We can use this code to say compress or transmit the document in an efficient manner. Then l(w) = -logP(w) gives the length required for a code of a given word. We can use this length to compute Shannon information as -N(w)log P(w) where N(w) is the number of times the word w occurs in the text. Hang Yamanishi paper further compares Shannon information to TF-IDF on the basis of estimating the words with maximum likelihood of occurrence and defines it as -N(w)logF(w)/F where F(w) is the frequency of the word w in the training data and F the total frequency of word w in training data. Note that Shannon information now looks similar to TF-IDF and improves on the weights associated with words for keyword extraction.

No comments:

Post a Comment