Wednesday, November 27, 2013

In the post about keyword extraction by the Kullback-Leibler technique, we talked about a sliding threshold for selecting keywords. We also talked about a viewer that lets users see the keywords highlighted in place, in a font larger than the rest of the text. As the threshold on the sliding scale is increased, more keywords appear in the viewer.
In today's post, I'm going to cover not only how to render the text but also how to make the rendering independent of the actual text.
We will use offsets, or word counts, to record the position of each keyword relative to the start of the text.
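As a minimal sketch of the idea, assuming that a "word" is any whitespace-separated token and that keywords are matched case-insensitively after trimming punctuation (both are my assumptions, and `tokenize` and `keyword_offsets` are hypothetical names), the positions could be computed like this:

```python
import re

def tokenize(text):
    """Split text into words; a word's index in the list is its offset from the start."""
    return re.findall(r"\S+", text)

def keyword_offsets(words, keywords):
    """Return the word offsets, in increasing order, at which any keyword occurs."""
    wanted = {k.lower() for k in keywords}
    return [
        i for i, w in enumerate(words)
        # Trim surrounding punctuation so "page." still matches "page".
        if w.strip(".,;:!?\"'()").lower() in wanted
    ]

words = tokenize("The page view shows a page. Each page holds words.")
offsets = keyword_offsets(words, ["page"])
print(offsets)
```

The viewer then only needs the list of offsets, not the keyword strings themselves, to decide which words to enlarge.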
With the relative positions, we are able to find the keywords dynamically. We paginate the text into blocks, where each page can have a varying word count, namely as many words as fit the page view, which is sized independently. We represent these blocks of text with an abstraction that we will refer to as pages. We represent what the user sees with a page view, which can change to accommodate text spanning one or more of the pages. Whenever that happens, re-pagination is involved and the page view is updated with the corresponding page. Thus there is a one-to-one relationship between the page and the page view. The page for each page view is expected to change as the number of keywords in the text increases and the space they occupy grows.
By decoupling the view from the page, we let both of them vary independently. The views depend on the available space on the user's screen, the overall font size, and any stretching or skewing that users do to the window in which the text is displayed. The page, on the other hand, could change whenever text is added to or removed from the overall document.
The viewer could be built with tools for flipping through the pages and for going to a specific location in the document via bookmarks. It is also possible to hand the viewing over entirely to dedicated editors, simply by marking the keywords in the text with special markup that those editors recognize. Those editors may also support custom rendering in case they don't already give keywords special treatment.
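To make the re-pagination concrete, here is a rough sketch under a simplifying assumption of mine: every ordinary word costs one unit of space, every keyword costs more because of its larger font, and a page view has a fixed capacity in those units. The names `paginate`, `page_of`, `capacity` and `keyword_cost` are hypothetical.

```python
def paginate(word_count, keyword_offsets, capacity, keyword_cost=2, word_cost=1):
    """Greedily split word offsets [0, word_count) into pages.

    Each page is a (start, end) range of word offsets, end exclusive.
    """
    keywords = set(keyword_offsets)
    pages, start, used = [], 0, 0
    for i in range(word_count):
        cost = keyword_cost if i in keywords else word_cost
        # Start a new page when this word would overflow the current one,
        # but always place at least one word on every page.
        if used + cost > capacity and i > start:
            pages.append((start, i))
            start, used = i, 0
        used += cost
    if start < word_count:
        pages.append((start, word_count))
    return pages

def page_of(pages, offset):
    """Find the page number holding a word offset, e.g. a bookmark."""
    for number, (start, end) in enumerate(pages):
        if start <= offset < end:
            return number
    raise IndexError(offset)

plain = paginate(10, [], capacity=5)
highlighted = paginate(10, [1, 5, 7], capacity=5)
print(plain, highlighted, page_of(highlighted, 5))
```

Because bookmarks and keywords are both stored as word offsets, they stay valid when the pages are recomputed; only the page number they map to changes.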
Whether a viewer is written as part of the program that selects the keywords or as a plugin for existing editors, the viewer is concerned only with drawing the user's attention to the keywords. Keyword selection need not run on every rendering of a document; it can be done either beforehand or as part of the document life cycle. That is, the scope in which keyword extraction works is determined by what is convenient for the tool. As an example, the text of each document could be sent to a centralized database where the documents are processed and keywords are selected. The viewers that reside on the devices accessed by the user could choose to download the processed document whenever necessary. The text needs to be uploaded and downloaded only when the contents change; otherwise the processed document can simply be cached. This takes the processing time out of the cost of serving the document to the user.
The page and the page views are artifacts of the viewer and have no place in the processed text that we keep in such a proposed centralized repository. The repository and the processing see only unpaginated structured text.
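One way a viewer might cache the processed document is to key it by a hash of the text, so that processing is requested again only when the contents change. This is only a sketch: `ProcessedDocumentCache` is a hypothetical class, and the `process` callable stands in for whatever the centralized service would actually do.

```python
import hashlib

class ProcessedDocumentCache:
    """Cache processed documents, keyed by a hash of their text."""

    def __init__(self, process):
        # `process` is a stand-in for the upload / keyword-selection / download round trip.
        self._process = process
        self._store = {}
        self.misses = 0

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            # The contents are new or have changed, so process them once.
            self.misses += 1
            self._store[key] = self._process(text)
        return self._store[key]

# Toy processor: pretend every word longer than six letters is a keyword.
def toy_process(text):
    words = text.split()
    return {"keyword_offsets": [i for i, w in enumerate(words) if len(w) > 6]}

cache = ProcessedDocumentCache(toy_process)
first = cache.get("keyword extraction by sliding threshold")
second = cache.get("keyword extraction by sliding threshold")  # served from the cache
print(first, cache.misses)
```

Note that the cached record holds only word offsets; nothing about pages or page views is stored with it.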
