Saturday, February 23, 2013

Document parsing

Structured text is very valuable for identifying topics or keywords in a document. Word documents provide markup for this structure, which can be useful for finding topics. A Word document can be parsed to retrieve its table of contents or heading structure, and this can be used to divide the text into sections, each of which can then be treated as unstructured text. Content that is not text but has a title or caption should have that title or caption treated the same as a section heading. These are also candidates for document indexing. Thus an improvement to indexing unstructured text is to add logic that extracts the document structure and utilizes the layout of the information as presented by the user. The data structure used to capture this layout for indexing could be a list, populated during document parsing, whose elements hold references to the sections along with their type and location. Each element of this list is then treated the same way as unstructured text of any length.
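As a rough illustration of the layout list described above, here is a minimal Python sketch. It assumes the document has already been reduced to a sequence of (style name, text) pairs, one per paragraph, with style names such as "Heading 1", "Caption", and "Normal". The names SectionElement and build_layout, and the choice to record location as paragraph indices, are hypothetical and only one of several reasonable designs.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SectionElement:
    """One entry in the layout list (hypothetical structure)."""
    kind: str        # "heading", "caption", or "body"
    title: str       # heading or caption text; empty for untitled body text
    start: int       # index of the first paragraph covered
    end: int         # index one past the last paragraph covered
    text: str = ""   # text to hand to the unstructured-text indexer


def build_layout(paragraphs: List[Tuple[str, str]]) -> List[SectionElement]:
    """Build the layout list from (style_name, text) pairs.

    Assumption: headings use styles starting with "Heading" and captions
    use the style "Caption"; everything else is body text.
    """
    elements: List[SectionElement] = []
    current: Optional[SectionElement] = None
    for i, (style, text) in enumerate(paragraphs):
        if style.startswith("Heading"):
            # A heading opens a new section.
            current = SectionElement("heading", text, i, i + 1)
            elements.append(current)
        elif style == "Caption":
            # A caption stands in for non-text content and is indexed
            # on its own; the enclosing section stays open.
            elements.append(SectionElement("caption", text, i, i + 1, text))
        else:
            if current is None:
                # Body text that appears before any heading.
                current = SectionElement("body", "", i, i + 1)
                elements.append(current)
            current.text = (current.text + " " + text).strip()
            current.end = i + 1
    return elements
```

Each element's text field can then be passed to whatever keyword or topic extraction is already used for unstructured text, with kind and title available as extra signals for the index.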
 
