Sunday, January 20, 2013

Search over text is enabled by creating tuples of (word, documentID, position) and building a B+-tree index over the word column. Words may need to be canonicalized, and the tuples may carry additional attributes to aid in rank-ordering search results. For improved performance, some implementations store each word only once, together with a list of its occurrences, as in (word, list<documentID, position>). This is especially helpful given the skewed distribution of words in documents. However, the text search implementation described above may run an order of magnitude slower than custom text indexing engines. This is true both for full-text documents and for short textual attributes in tuples. In most cases, the full-text index is updated asynchronously ("crawled") rather than being maintained transactionally. In transactional queries, the semantics of relational queries need to be bridged with those of ranked document search results.
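To make the two layouts concrete, here is a minimal Python sketch, not how any particular database implements it. The function names (canonicalize, build_tuples, lookup_tuples, build_postings) are made up for illustration, canonicalization is assumed to be plain lowercasing, a sorted list with binary search stands in for the B+-tree, and a dictionary stands in for the index over the compressed form.

```python
import bisect
import re
from collections import defaultdict


def canonicalize(word):
    # Assumption: canonicalization is just lowercasing; real systems may also stem.
    return word.lower()


def tokenize(text):
    return [canonicalize(w) for w in re.findall(r"[A-Za-z0-9]+", text)]


def build_tuples(docs):
    """Flat layout: one (word, documentID, position) tuple per occurrence,
    kept sorted by word the way an ordered index would keep them."""
    rows = [
        (word, doc_id, pos)
        for doc_id, text in docs.items()
        for pos, word in enumerate(tokenize(text))
    ]
    return sorted(rows)


def lookup_tuples(rows, word):
    """Binary search to the first matching row, then scan forward,
    loosely imitating a range scan on an ordered index."""
    w = canonicalize(word)
    i = bisect.bisect_left(rows, (w,))
    hits = []
    while i < len(rows) and rows[i][0] == w:
        hits.append(rows[i][1:])
        i += 1
    return hits


def build_postings(docs):
    """Compressed layout: each word appears once with a list of
    (documentID, position) occurrences."""
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            postings[word].append((doc_id, pos))
    return dict(postings)


docs = {1: "The index stores the word", 2: "A word list"}
rows = build_tuples(docs)
postings = build_postings(docs)
print(lookup_tuples(rows, "The"))
print(postings.get(canonicalize("word"), []))
```

Both lookups return the same occurrences. The difference is that the flat layout repeats a frequent word once per occurrence, while the list layout stores it a single time, which is where the skewed word distribution pays off.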
