Wednesday, February 4, 2015

Today we read the paper the Gold project by Barbara, Molina and Mehrotra. The purpose of this project was to develop a set of tools that interoperate with unstructured and semi-structured data and services. Its a mailer software that allows users to send and receive messages using different media. (e-mail, faxes, phones), efficiently store and retrieve these messages and access a variety of sources of other useful information. It solves the problem of information overload, organization of messages, and multiple interfaces. The mailer doesn't provide a relational like query facilities but it provides a database system interface. What that means is the querying is convenient from programmability. The reason emails were chosen for the kind of tools they develop was that email are ubiquitous and it provides a convenient form of data.  Moreover, messages need to be stored or filed. Most mailers have primitive filing facilities and stored messages are seldom found. By its sheer volume, variations in size and number, and organizations, emails are often not easily retrievable. This paper was written prior to the indexing and full text search that's now common on the operating system and mailing software. At the same time, this software uses a mail database. Each stored message is viewed as a tuple with fixed fields and treated as structured and semi-structured sequences of words. The paper mentions that the unstructured model is more appropriate for message searches. and that it can also encompass a wide variety of message formats, even unknown formats. They also improved the search  by introducing operations that are different from sql and relations. Another goal was to allow the use of existing interfaces for electronic mail. The project also provided a graphical user interface but this added convenience was not just limited to search and UI. By abstracting the messages and putting them in a store, a user now could search based on words and they would appear as a set of virtual messages. When the user responds to any of the virtual cards, the corresponding email address is looked up and then the mail dispatched by the mailer.  The mailer also provides a flexible model for the stored objects. This is another reason why they favor the unstructured approach as opposed to looking them up based on customary fields. This paper details the indexing issues encountered  in designing the mailer.
#codingexercise
Double GetAlternateOddNumberRangeSumPower ()(Double [] A, int n, int m)
{
if (A == null) return 0;
Return A.AlternateOddNumberRangeSumPower(n,m);
}
We continue our discussion on the mailer. The architecture of the mailer is described as follows:
Users interact with a GUI, that issues commands to read new mail messages and retrieve others from database. When a user decides to store a new message, the user interface passes it to a preprocessor which breaks it up into tokens  and gives it to the index engine. The preprocessor can also receive messages directly for parsing and subsequent indexing.It produces a canonical file that is used by index engine to create the appropriate data structures. Messages are stored in an object store or database. The user interface also submits queries to index engine which results in a list of matching file identifiers. The user interface retrieves the objects for displaying. The index engine can receive queries and storage requests from many different user interfaces and hence implements concurrency control.  The message that is indexed has headers and body. The user sees it as a bag of words. Alternatively, the user may want to view the message as a semistructured object with separate fields. This distinction enables different type of queries for the two The first enables queries involving relative position and the second involves that only in same segments.
#codingexercise
Double GetAlternateOddNumberRangeProductPower ()(Double [] A, int n, int m)
{
if (A == null) return 0;
Return A.AlternateOddNumberRangeProductPower(n,m);

}

We will continue our discussion with the challenges encountered during indexing. We noticed that the  user interface allows messages to flow to the preprocessor which tokenizes the words in the message. In addition, news feed can directly flow into the preprocessor.  The preprocessor then gives these messages to the indexing engine.  which interacts with the user interface for querying and retrieving based on users searches. The index engine also maintains its index structures so that the querying and retrieving can be efficient. In addition, the front end has access to an object store.  The messages themselves comprise of header and body. Gold mailer handles structured fragments within a document as well as treat the document as a bag of words. The index engine maintains attributes for each keyword encountered. This is represented in a relational table as keyword, document_id, position and header. The document_id points to the indexed object while the position points to where the keyword occurs within the document. The header attribute is for referencing the header if present. This table is used only by the indexer. The query engine has no knowledge of this internal data structure. The values in any of the cell of this table of attributes doesn't have to be discrete or atomic. This comes in useful to the indexer.  At the same time the query engine can differentiate between words by their position as hinted in the query. For example, if the engine sees subject lunch and time as three words to be searched where they appear consecutively, it can do so by setting the positional difference between the words to be zero. This differentiates this particular query from other queries where the three words may co-occur at different positions in the document.
#codingexercise
Double GetAlternateOddNumberRangeProductCube ()(Double [] A, int n)
{
if (A == null) return 0;
Return A.AlternateOddNumberRangeProductCube(n);
}

Furthermore, the query engine can select from the table mentioned above where the header field is subject to look for the occurrence of the three keywords.  A mail message can be considered as a collection of labeled fragments and the processor will report the word subject as the label for the words and provide byte offsets of the words within the fragments. No matter whether the structured fields are used or not, the indexing engine works the same way. The index engine stores a variety of documents and a user may be interested only in documents of a specific type. 

The Gold Mailer supports the feature of message annotation. When a new message is to be stored, the user interface allows the user to enter additional words for indexing. For instance, the user may want to associate  the words budget and sales with a message even though these two words don't appear in the message itself. The user interface appends these added keywords to the end of the message, and the index engine does not treat them differently. Mailers usually provide the capability of defining folders. Messages flow into folders for organization. Gold mailer uses annotation to implement folders. This means that a virtual folder can also be created via message annotations but its not the same as an actual folder. An instance querying the system for all messages that container the keyword database retrieves messages that are annotated by the keyword as well as those that contain the keyword somewhere in the text. Folders and annotations give the user enough flexibility to organize their mails.
Next we look at their usages. Users may choose to use folders for redirecting the emails into folders that are classified appropriately. This gives them the flexibility to look for emails in narrower domains. But its merely a convenience for the user. The annotations are used on the other hand as labels. This allows both system and user to search.
Not all messages have the same format. The Gold mailer deals with messages that come from fax and scan. These are run through the optical character recognition server. This software with the help of an online dictionary recognizes as many words from the electronic document as possible and creates a file with the words.  Even a handful of keywords are enough to be recognized for indexing. The keyword file and the image are both stored in a fax directory that's owned by the addressee. The same folder can be shared. The Gold mailer also sends out faxes using the same media. 
Messages can be composed and received using both text and images and through a variety of media.
The query language was developed  with a simple and flexible message model. The goal was to keep it easier to specify. For example, mail friend Princeton command would search through the database for objects using those words friend and Princeton.
The Gold mailer expects most queries not to involve proximity searches. If proximity is required, it can be specified by grouping together the terms involved. The user interface allows the user to incorporate new messages, send messages using the information stored in address cards, edit address cards, and pose queries to the database to retrieve messages.

Tomorrow we wrap up our discussion on this project. 

The information stored in the index engine is viewed, as mentioned,  as a tuple of keyword, document identifier, position and header. The index engine supports the following three operations : 
insert document : this inserts all the records pertaining to a specific document 
delete document : this deletes all the records pertaining to a specific document
retrieve query : this command instructs the engine to find all documents that satisfy the query and return the list of names to the mailer. 
This structure and operations enable efficient evaluation of queries which is critical to the mailer.

The index engine cannot be considered as a warehouse where insertions are batched at low time and deletions are never done. In that sense, the mailer provides newly inserted documents immediately.  Index reconstruction is considered maintenance and not part of the workflow.

The file organization required for the mailer is dependent on the types of queries and their space requirements.  The queries are keyword based lookups and don't involve ranges between keywords.  Therefore the keywords are indexed and the records are stored as a hashed file. For each keyword, a list of positions where the keyword appears is maintained resulting in efficient disk access as opposed to keeping a record one for each position. If this reminds us of the BigData principles, it would behoove us to note that this paper was published in 1993.

The records corresponding to a document in a  hashed file are chained together forming a list, referred to as the next key list for the document.  This enables easier deletions.

When the records are inserted, they are not directly placed in the hashed files because the number of keywords in a document can be arbitrary. Instead an overflow list is maintained one per keyword in a block or more of main memory and the records are inserted at the tail. The records of the document therefore appear contiguously in these lists. Compare this to the buckets where the records are kept contiguously based on keywords and not the same document. When documents are deleted the records are not deleted one by one, instead blocks can be returned. This way the mailer avoids having to request n blocks for n keywords. 

Concurrency control has to make accommodations for the fact that the insert and delete operations may not just be user invoked but are also background tasks. This implies that operations can just be sequential because the user operation may then have to wait till the background job completes. Therefore locks are introduced and these are taken at page level.  Also failures can occur from processes and similar to concurrency, we cannot rollback all operations since background jobs may have substantial loss of work. This is overcome by restarting where the jobs were prior to the failure. Logging is supported for both physical page level operations as well as logical inserts and deletes.


This system shows how to organize the mail messages in a way that retrieval is efficient as opposed to user having to browse the previously created folders. As an extension, it could be made to index not only mail messages but library catalogs and databases. This mailer software also intended to investigate improvements in accessing data from other information sources such as file systems.

No comments:

Post a Comment