Friday, May 24, 2013

So far in these posts we have seen that there are several tools for text mining. For example, we used machine learning with a tagged corpus and an ontology. A vast collection of text has been studied and prepared in this corpus, and a comprehensive collection of words has been included in the ontology. This gives us a great resource for working with any document. Next, we defined different distance vectors and used clustering techniques to group and extract keywords and topics. We refined the distance vectors and data points to be more representative of the content of the text. There are several ways to measure distance or similarity between words, and we have seen probability-based measures among them. We reviewed the ways to cluster these data points and found methods that we prefer over others.
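To make the idea of a probability-based measure concrete, here is a minimal sketch. It assumes each word is represented by the distribution of words that appear near it, and it uses Jensen-Shannon divergence as the distance. Both choices are illustrative assumptions rather than the specific measures from the earlier posts, and the function names are made up for this example.

```python
import math
from collections import Counter


def context_distribution(tokens, word, window=2):
    """Probability distribution of the words seen within `window` positions of `word`.

    Hypothetical helper: the window-based context is an assumption of this sketch.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != word:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions given as dicts.

    The result is 0 for identical distributions and 1 for distributions
    that share no words.
    """
    keys = set(p) | set(q)
    # Midpoint distribution between p and q.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # Kullback-Leibler divergence of a from the midpoint m.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)
```

Two words whose neighbors are distributed alike get a distance near 0, so they would tend to land in the same cluster.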
We want to remain focused on keyword extraction, even though we have seen similar uses in topic analysis and in some interesting areas such as text segmentation. We don't want to resort to a large corpus for lightweight application plugins, but we don't mind a large corpus for database searches. We don't want processing that is worse than O(N^2) when working with the data to extract keywords, and we have the luxury of a pipeline of steps to get to the keywords.
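As a rough illustration of these constraints, the sketch below runs a small pipeline that needs no external corpus and whose most expensive step is a pairwise comparison over the N distinct candidate words, so it stays within O(N^2) comparisons. The stopword list, the co-occurrence window, and the scoring rule (summed cosine similarity weighted by frequency) are all assumptions made for this example, not the method settled on in these posts.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption; a real plugin would use a fuller one).
STOPWORDS = {"the", "and", "for", "that", "with", "this", "are", "was", "have", "has"}


def tokenize(text):
    """Step 1: lowercase, split into words, drop stopwords and very short tokens."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOPWORDS and len(t) > 2]


def context_vectors(tokens, window=2):
    """Step 2: represent each word by counts of the words that occur near it."""
    vectors = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                vectors[tok][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def extract_keywords(text, top_k=5, window=2):
    """Steps 3 and 4: pairwise similarity over distinct words, then rank."""
    tokens = tokenize(text)
    vectors = context_vectors(tokens, window)
    words = sorted(vectors)
    scores = {w: 0.0 for w in words}
    # Each unordered pair of distinct words is compared exactly once.
    for i, u in enumerate(words):
        for v in words[i + 1:]:
            s = cosine(vectors[u], vectors[v])
            scores[u] += s
            scores[v] += s
    freq = Counter(tokens)
    # Assumed scoring rule: similarity to the rest of the text, weighted by frequency.
    ranked = sorted(words, key=lambda w: (-scores[w] * freq[w], w))
    return ranked[:top_k]
```

Calling `extract_keywords(document_text, top_k=5)` returns up to five candidate words. Because each step is a separate function, a corpus-backed step could be swapped in for the database search case without changing the rest of the pipeline.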
