Sunday, October 27, 2013


In the previous two posts, we looked at K-means clustering and improving its performance for large data sets. K-means is very useful when the number of clusters K is known ahead of time, and we can choose what we want to cluster. For example, we can bisect the data points into two clusters, and then bisect one of the resulting clusters further if it does not satisfy a threshold we set, based on distance or density. If we recursively bisect a topic cluster because it is larger than expected, we end up with a binary tree of clusters. In topic analysis, these clusters correspond to topics and subtopics without being restricted to the syntax or layout of the text.

Text segmentation, on the other hand, relies on the layout and generates segmentation blocks based on a stochastic topic model. Here topics are first spotted rudimentarily from the text and formed into word clusters, and word clusters sharing common words are then merged together, giving us independent topics. Next, we follow the text to see how similar each preceding block is to the succeeding block, and wherever there are valleys of low similarity we can segment the text. Contrast this with the bisecting method, where the granularity of segmentation can be left to the user's discretion, with as much or as little segmentation as desired. Thus clustering can be applied in both divisive (bisecting) and agglomerative ways.
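The bisecting idea above can be sketched in a few lines. This is a minimal illustration, not a production implementation: it uses a toy 2-means step and a hypothetical distance-based threshold (`max_spread`, the mean distance of points to their centroid) to decide whether to keep splitting.

```python
import numpy as np

def two_means(points, iters=20, seed=0):
    """A minimal 2-means step: split points into two clusters."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), 2, replace=False)]
    for _ in range(iters):
        # assign each point to the nearer center
        d = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for k in range(2):
            if (labels == k).any():
                centers[k] = points[labels == k].mean(axis=0)
    return labels

def bisect(points, max_spread=1.0):
    """Recursively bisect any cluster whose mean distance to its
    centroid exceeds max_spread, yielding a tree of clusters
    (returned here flattened, as the list of leaf clusters)."""
    spread = np.linalg.norm(points - points.mean(axis=0), axis=1).mean()
    if len(points) < 4 or spread <= max_spread:
        return [points]
    labels = two_means(points)
    if labels.min() == labels.max():  # degenerate split, stop here
        return [points]
    return (bisect(points[labels == 0], max_spread)
            + bisect(points[labels == 1], max_spread))
```

A density-based threshold would work the same way; only the `spread` test changes.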
For the purposes of keyword extraction, we rely on term weighting. Here the weights are determined by probabilistic latent semantic analysis or some other technique. These include distances based on cosine similarity, the Jensen-Shannon divergence, or the feature weights we discussed in the previous posts. We could also use Kullback-Leibler divergence against a background corpus to suppress common words and extract keywords.
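As a rough sketch of the KL-based background suppression idea, we can score each word by its contribution to the KL divergence between the document's word distribution and a background corpus: words that are much more frequent in the document than in the background score highly. The function name and the add-one smoothing are illustrative choices, not a fixed recipe.

```python
from collections import Counter
import math

def keyword_scores(doc_tokens, background_tokens):
    """Score each document word by its term in the KL divergence
    KL(doc || background); high scores mark candidate keywords."""
    doc = Counter(doc_tokens)
    bg = Counter(background_tokens)
    n_doc, n_bg = sum(doc.values()), sum(bg.values())
    vocab = len(set(doc) | set(bg))
    scores = {}
    for w, c in doc.items():
        p_doc = c / n_doc
        # add-one smoothing so words unseen in the background
        # do not cause a division by zero
        p_bg = (bg[w] + 1) / (n_bg + vocab)
        scores[w] = p_doc * math.log(p_doc / p_bg)
    return scores
```

Common function words get near-zero or negative scores because their document frequency matches the background, which is exactly the suppression effect we want.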
Meanwhile, we have yet to look at several other implementation considerations, such as the following:
Lemmatization - stemming can help collapse the various forms in which a word appears, be it in noun, verb, or adjective form.
Text tagging - part-of-speech tagging could be done with a tool such as TreeTagger.
Multiword lookup - we may have to look at collocations and words that appear in titles.
Named-entity recognition and filtering - recognizing proper nouns and filtering stop words will help improve the data.
All of the above are necessary before extracting topics or keywords and assigning term weights.
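The preprocessing steps above can be chained into a small pipeline. The stemmer below is a toy suffix stripper standing in for a real one (e.g. the Porter stemmer), and the stop-word set is a tiny sample; both are placeholders to show the shape of the pipeline.

```python
import re

# sample stop words; a real list would be much longer
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "are"}

def crude_stem(word):
    """Toy suffix stripper; substitute a real stemmer or
    lemmatizer (e.g. Porter) in practice."""
    for suffix in ("ings", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize, drop stop words, and stem — the steps
    needed before term weighting or keyword extraction."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]
```

Named-entity recognition and multiword lookup would slot in between tokenization and stemming, so that proper nouns and collocations are kept intact rather than stemmed apart.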
