Saturday, March 2, 2013

Page numbers for a book index

After the index words have been chosen, they are usually displayed with the page numbes. The trouble with the page numbers is that they are dependent on the rendering software. The page sizes could change for an electronic document, hence the page numbers for the index may need to be regenerated. This change in page size could be tracked with a variety of options and mostly after the index words have been selected.We will consider the trade offs here.  First, the page number for the different pages can be associated with the index words with a linear search for every occurance of the word. This is probably a naiive approach. Second, we already have word offsets from start for each of the words. Hence we can translate the offsets to page numbers if we know the last word offset of each page. So we can build an integer array where the index gives the page number and the data gives the offset of the last word on the page. Then for each word that we list in the index, we can also print the associated page numbers. Third, we can utilize any information the document rendering system provides including methods and data structures for page lookup.  From these approaches, it should be clear that the best bet for the page number data source is the rendering engine which is external to the index generator program.

For rendering of word documents or for the retrieving the page numbers, we could use either microsoft.office.interop.Word or  aspose.words library.

To determine the renderer, the program could look up the file extension type to use the appropriate library. The library can be loaded as the input file is read or used via dependency injection. Wrapper over the libraries to provide only the page information method may be sufficient or a design pattern to abstract away the implementations on a file format basis can be used.

No comments:

Post a Comment