Tuesday, September 4, 2018

Applying regions of interest to latent topic detection in text document mining:
Regions of interest are a useful technique for detecting objects in raster data, which is data laid out in the form of a matrix. The positional information helps aggregate data within bounding boxes in the hope that one or more boxes will stand out from the rest and likely represent an object. When the data is representative of the semantic content and the aggregation can be performed, the bounding boxes become useful differentiators from the background and thus help detect objects.
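The following sketch, which assumes nothing beyond NumPy and a toy matrix, illustrates the idea: candidate bounding boxes aggregate the values they cover, and the box whose aggregate stands out from the background is the likely object.

```python
import numpy as np

def roi_scores(matrix, box_h, box_w):
    """Slide a box_h x box_w bounding box over the matrix and sum its contents."""
    rows, cols = matrix.shape
    scores = {}
    for r in range(rows - box_h + 1):
        for c in range(cols - box_w + 1):
            scores[(r, c)] = matrix[r:r + box_h, c:c + box_w].sum()
    return scores

data = np.zeros((8, 8))
data[2:4, 3:5] = 5.0                 # a small "object" on a flat background
scores = roi_scores(data, 2, 2)
best = max(scores, key=scores.get)
print(best, scores[best])            # (2, 3) 20.0: the box over the object stands out
```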
Text, by contrast, is usually considered free-flowing, with no limit to the size, number, or arrangement of sentences, the logical units of semantic content. Keywords, however, carry significant information, and their relative positions give some notion of the topic segmentation within the text. Several keywords together may represent a local topic. These topics are similar to objects, and therefore topic detection may be considered similar to object detection.
Furthermore, topics are represented by groups of keywords. These groupings are not meaningful on their own, because their number and content can vary widely while still describing the same topic, and it is very hard to treat them as any standardized representation of topics. Fortunately, words can be represented by vectors that capture their mutual information with other keywords. When we treat a group of words as a bag of word vectors, we get plenty of information about what stands out in the document as keywords.
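A minimal sketch of the bag-of-word-vectors view follows. The embedding lookup here is a stand-in (random vectors) for a real model such as word2vec or GloVe, and the centroid-distance heuristic is only one illustrative way of seeing what stands out; both are assumptions, not part of the original argument.

```python
import numpy as np

# Stand-in embeddings; in practice these would come from a trained model.
rng = np.random.default_rng(0)
vocabulary = ["market", "stock", "trade", "weather", "rain", "harvest"]
embedding = {w: rng.normal(size=50) for w in vocabulary}

tokens = ["stock", "market", "trade", "rain", "market", "harvest"]
bag = np.stack([embedding[t] for t in tokens])        # the bag of word vectors

# One crude signal of what stands out: each word's distance from the bag centroid.
centroid = bag.mean(axis=0)
distance = {t: float(np.linalg.norm(embedding[t] - centroid)) for t in set(tokens)}
print(sorted(distance, key=distance.get, reverse=True))
```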
The leap from words to groups is not as straightforward as vector addition. With vector addition, we lose the notion that some keywords represent a topic better than any combination of others. On the other hand, we know that clustering can produce groupings that are more meaningful. When we partition the keywords into a discrete set of clusters, we can represent some notion of topics in the form of those clusters.
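Here is a small sketch of that partitioning, again assuming stand-in word vectors and using k-means as one possible clustering method; the post does not prescribe a specific algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in word vectors; any real embedding lookup would do.
rng = np.random.default_rng(0)
keywords = ["market", "stock", "trade", "weather", "rain", "harvest"]
vectors = np.stack([rng.normal(size=50) for _ in keywords])

# Partition the keywords into a discrete set of clusters; each cluster,
# rather than a summed vector, stands in for a topic or subtopic.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
for label in range(kmeans.n_clusters):
    members = [w for w, l in zip(keywords, kmeans.labels_) if l == label]
    print(label, members)
```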
Therefore clusters, and not arbitrary groups, become much more representative of topics and subtopics. Positional information merely allows us to cluster only those keywords that appear within a bounding bag of words. By choosing different bags of words based on their positional information, we can aspire to detect topics just as we determined regions of interest. The candidates within a bounding bag may change as we window the bag over the text, knowing that the semantic content progresses through the text.
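The windowing itself can be sketched as follows; the window size and stride are illustrative assumptions, and each window carries no positional information beyond its offset and size.

```python
def sliding_bags(tokens, size, stride):
    """Yield (offset, bag) pairs as a fixed-size window rolls over the text."""
    for offset in range(0, max(len(tokens) - size + 1, 1), stride):
        yield offset, tokens[offset:offset + size]

text = "the market rallied while rain disrupted the harvest and trade slowed further"
for offset, bag in sliding_bags(text.split(), size=5, stride=3):
    print(offset, bag)
```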
However, clustering is not cheap, and the sizing and rolling of bags of words across the text from start to finish yields a large number of combinations. This implies that clustering is beneficial only when it is done with sufficient samples, such as the whole document, and done once over all the word vectors taken in the bag of words representing the document. How then do we view several bags of words as local topics within the overall document?
While forming vectors and clusters may be expensive, if their combinations could be made much simpler, then we would have the ability to try different combinations, enhanced with positional information, to form regions of interest. The positional information is no more than an offset and a size, but the combinations are harder to represent and compute without re-clustering selected subsets of word vectors. Solving this challenge could significantly boost the simultaneous detection of topics as well as keywords in a streaming model of the text.
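A sketch of what such a simpler combination might look like, under the assumption that the document is clustered once up front and each window is then summarized only by its offset, size, and cluster-membership counts, with no re-clustering as the window rolls forward:

```python
import numpy as np
from sklearn.cluster import KMeans

tokens = "stock market trade rally rain flood weather storm market trade".split()

# Stand-in word vectors; the embedding source is an assumption for illustration.
rng = np.random.default_rng(0)
embedding = {w: rng.normal(size=20) for w in set(tokens)}

# One-time clustering over all word vectors in the document.
words = sorted(embedding)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    np.stack([embedding[w] for w in words]))
cluster_of = dict(zip(words, labels))

# Streaming pass: each window is reduced to counts of pre-computed cluster members.
size, stride = 4, 2
for offset in range(0, len(tokens) - size + 1, stride):
    window = tokens[offset:offset + size]
    counts = {}
    for t in window:
        counts[cluster_of[t]] = counts.get(cluster_of[t], 0) + 1
    print(offset, counts)
```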
