Thursday, May 9, 2013

Indexing for text search

A text search database is a collection of documents and is referred to as text database. Let us think of this database as a single relation schema with one field of type document. Each record in the document contains exactly one document. Each record in this relation contains exactly one document.
Now, let us look at a class of queries based on keyword search that enables us to ask for all documents containing a given keyword. This is the most common kind of query on the web today. A more complex query is find all documents that have keyword 1 or keyword 2  and not keyword 3. For such documents we may also want to look for synonyms such as automobile instead of car. We can rank retrieved documents by the proximity of the query keywords in the document.
There are two types of such queries for text databases: Boolean queries and Ranked queries. In a Boolean query, the user supplies a Boolean expression like the one mentioned above and is called the conjunctive normal form. This query consist of j conjuncts, each of which consists of several disjuncts. Each conjunct corresponds to one concept, and the different words within each conjunct correspond to different terms for the same concept. In a ranked query, the user also specifies a list of words, but the results of the query is a list of documents ranked by their relevance to the list of user terms. Determining when and how a document is relevant to the set of user terms is a different problem. The algorithms for that problem belong to information retrieval, which is closely related to database management. Information retrieval systems unlike database systems don't address updates, concurrency control and recovery because the data is largely static. The criteria used to evaluate such retrieval system includes precision and recall which we define in that order as percentage of retrieved documents that are relevant and the percentage of relevant documents retrieved in response to a query. For Boolean queries, there are two index schemas - the inverted file index and the signature file index. The first index is widely used due to its simplicity and good performance but it has a huge space overhead. The second index has a small space overhead and enables a quick search for eliminating non-matching documents. Words are stemmed and analyzed for a common root during index creation. - as read from textbook on Database management systems.

No comments:

Post a Comment