Thursday, May 9, 2013

signature file

A signature file is data structure for text based database systems that support efficient evaluation of boolean queries. It contains an index record for each document in the database. This index record is the signature and it is has a fixed size of b bits, b is called the signature width. The bits that are set depend on the words that appear in the document.Words are mapped to bits based on a hash function. Bits are set from the  result of the hash function. The same bit could be set twice by different words because the hash function maps both words to same bit. One signature matches another if it has at least as many bits set as the other. For a query consisting of a conjunction of terms, we first generate the query signature by applying the hash function to each word in the query. Then we scan the signature file to get all the documents that match the query signature, because every such document is a potential result to the query. Note that the documents may not contain the words in which case it is called a false positive. These can be expensive since we have to retrieve the document, parse, stem and then check to determine whether it contains the query terms. For the queries that have a disjunction of terms, we generate a list of query signatures and apply a comparision rule of any signature match by the documents. Also, for each query, we have to scan the complete signature file, and there are as many records in the signature file as there are documents. To reduce the amount of data that has to be retrieved by each query, we can vertically partition the signature file into a bit of slices, and we call such an index a bit sliced signature file. The length of each bit slice is still equal to the number of documents in the database, but for q bits set, we only need to retrieve q bit slices. - as read from textbook on Database management systems

No comments:

Post a Comment