Friday, May 23, 2014

We explore pairwise relationships between words in text indexing and summarization using graph theory. Documents are often treated as a bag of words and represented by term vectors with limited dimensions. These vectors are then clustered to extract co-occurrences or semantic relationships between terms in a document, which helps with keyword extraction, summarization, or classification. Such techniques have relied on latent semantic analysis and its probabilistic variant. However, the words in a text do more than represent the content of a document: they bear rankings and relationships to one another that can be represented with graphs. Graphs enable the discovery of connected salient words in a bag. While similar in nature to clustering, and occasionally used together with it, graphs can model both syntax and semantics, such as typed features and hierarchical ontologies. The focus of this article, however, is the use of graphs to connect words within a document and to retrieve the salient words based on edge weights and minimum spanning trees. The assignment of weights enables keyword extraction, and the spanning of those keywords enables summarization.
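As a small sketch of the weighted word graph described above, assuming a simple sliding-window notion of co-occurrence - the window size and the toy sentence are illustrative choices, not something prescribed in this article:

```python
from collections import defaultdict

def cooccurrence_graph(tokens, window=3):
    """Build a weighted, undirected co-occurrence graph.

    Nodes are terms; an edge's weight counts how often two distinct
    terms appear together within `window` consecutive tokens.
    """
    weights = defaultdict(int)
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[i] != tokens[j]:
                edge = tuple(sorted((tokens[i], tokens[j])))
                weights[edge] += 1
    return dict(weights)

# A toy document; repeated terms produce heavier edges.
tokens = "graph words rank graph words weights graph words".split()
graph = cooccurrence_graph(tokens, window=3)

# The heaviest edges suggest the most salient pairings.
top = sorted(graph.items(), key=lambda kv: -kv[1])
```

Here the pair ("graph", "words") accumulates the largest weight because the two terms keep appearing near each other, which is the kind of connected salience the graph is meant to surface.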
First we lay out some premises on which we will model our approach. Text can be either verbose or terse yet carry the same content in terms of its import. The same text can have more than one nuance, thereby requiring disambiguation either with context or with categorization. Text need not have meaning and can be a clutter of jargon with no equivalent synonyms. That is why, no matter how we represent the semantics of the text, the model should be anchored to the terms in the text. In other words, we take the keywords from within the document even if we represent its semantic model with alternate words. We focus exclusively on a given language and leave variations in syntax or expression out of the scope of the samples we investigate. We take simple samples at first - say, blog posts on a technical topic. We pay attention to the format of the text, that is, we look at syntax where it helps our model, but we don't rely on distribution or clustering. We continue to treat the document as a collection of terms without emphasis on where the text is tiled, segmented, or paragraphed. We ignore the layout of the text and focus instead on the weights of the terms, their densities, and the divergence of topics. Initially we begin with a single graph for the document, while recognizing the possibility of several sub-graphs within it. We will attempt to find a model, and we will have metrics for our graph. For example, if we have two sets of keywords constituting two different graphs for the same document, we will keep the one that has the smaller minimum spanning tree. We know there are uses for our graph, but first we establish the criteria for constructing one.
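The criterion of keeping the graph with the smaller minimum spanning tree can be sketched with Kruskal's algorithm. The two keyword sets and their edge distances below are made-up illustrations, not data from any real document:

```python
def mst_weight(nodes, edges):
    """Total weight of a minimum spanning tree (Kruskal's algorithm).

    `edges` maps (u, v) pairs to distances; returns None when the
    graph is not connected, since no spanning tree exists then.
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Walk to the root of n's component, halving the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    total, joined = 0.0, 0
    # Consider edges from shortest to longest, joining components.
    for (u, v), w in sorted(edges.items(), key=lambda kv: kv[1]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            total += w
            joined += 1
    return total if joined == len(nodes) - 1 else None

# Two candidate keyword graphs for the same document (toy distances).
g1 = ({"graph", "words", "weights"},
      {("graph", "words"): 1.0, ("words", "weights"): 2.0,
       ("graph", "weights"): 4.0})
g2 = ({"text", "terms", "model"},
      {("text", "terms"): 2.0, ("terms", "model"): 3.0})

# Keep the candidate whose spanning tree is smaller.
keep = min((g1, g2), key=lambda g: mst_weight(*g))
```

With these toy distances the first keyword set spans its terms at total cost 3.0 against 5.0 for the second, so it is the one we would retain under the stated criterion.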
