Wednesday, December 4, 2013

In the previous post, we mentioned an innovative approach to extracting keywords from a single document. It uses word co-occurrences, and we wanted to combine it with Kullback-Leibler divergence, clustering, and cosine similarity. We were keen on extracting keywords while simultaneously associating them with topics, repeating the partitioning until the cluster centers stabilize. We wanted the flexibility of topic overlap through fuzzy memberships. We were also interested in a term-attribute table that spanned all the words in a dictionary, with attributes that helped us discern topics, tones, and eras. Note that we attached the relevance or term weights before we clustered the topics, but we used them together in each iteration. Finally, we wanted to evaluate the clusters with the measures we discussed earlier.
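As a rough sketch of the iterative partitioning described above, the following toy code repeats the assignment step with cosine similarity until the cluster centers stabilize, and includes a KL divergence helper for comparing term distributions. The vectors, cluster count, and iteration cap are made-up illustrations, not the actual implementation:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two term vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def kl_divergence(p, q):
    # Kullback-Leibler divergence D(P || Q) between two distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0 and qi > 0)

def cluster(vectors, centers, iterations=20):
    # Assign each vector to its most similar center, recompute the
    # centers, and repeat the partitioning until the centers stabilize.
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in vectors:
            best = max(range(len(centers)),
                       key=lambda i: cosine_similarity(v, centers[i]))
            groups[best].append(v)
        new_centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return groups, centers
```

With fuzzy memberships, the hard assignment in the loop would instead distribute each vector across clusters in proportion to its similarity to each center.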
In Matsuo's paper, the quality of the extracted keywords is improved by selecting a proper set of columns from a co-occurrence matrix. This set of columns is the set of terms or keywords; the columns are preferably orthogonal, and they are selected with clustering. The paper mentions two major approaches to clustering: similarity-based clustering and pairwise clustering. In similarity-based clustering, if terms w1 and w2 have similar distributions of co-occurrence with other terms, w1 and w2 are considered to be in the same cluster. In pairwise clustering, if terms w1 and w2 co-occur frequently, w1 and w2 are considered to be in the same cluster. The authors found similarity-based clustering to be effective in grouping paraphrases and phrases; the similarity of two distributions is measured statistically by the Jensen-Shannon divergence. Pairwise clustering, on the other hand, they found to yield relevant terms in the same cluster. Thresholds are determined by preliminary experiments. Proper clustering of frequent terms results in an appropriate chi-square value for each term. The steps involve preprocessing, selection of frequent terms, clustering of frequent terms, calculation of expected probability, calculation of the chi-dash-square value, and output of keywords. Frequent terms are selected as the top 30%, and they are clustered in pairs whose Jensen-Shannon divergence is above the threshold 0.95 * log 2.
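The steps above can be sketched as follows. This is a minimal, simplified illustration of the pipeline (frequent-term selection, similarity-based clustering against the 0.95 * log 2 threshold, and a chi-square score over co-occurrence counts), not Matsuo's actual implementation: the tokenization, the sentence-level co-occurrence window, and the unconditional top-30% cutoff are all assumptions, and the similarity is written as log 2 minus the Jensen-Shannon divergence so that larger values mean more similar, matching the "above threshold" convention used here:

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

def js_similarity(p, q):
    # log 2 minus the Jensen-Shannon divergence; ranges from 0 up to
    # log 2, so larger values mean more similar distributions.
    def h(xs):
        return -sum(x * math.log(x) for x in xs if x > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = h(m) - (h(p) + h(q)) / 2
    return math.log(2) - jsd

def extract_keywords(sentences, top_ratio=0.3,
                     threshold=0.95 * math.log(2)):
    # 1. Select frequent terms: the top 30% of the vocabulary.
    counts = Counter(w for s in sentences for w in s)
    vocab = [w for w, _ in counts.most_common()]
    frequent = vocab[: max(1, int(len(vocab) * top_ratio))]

    # 2. Build co-occurrence distributions over sentences.
    co = defaultdict(Counter)
    for s in sentences:
        for w1 in set(s):
            for w2 in set(s):
                if w1 != w2:
                    co[w1][w2] += 1

    def dist(w):
        total = sum(co[w].values()) or 1
        return [co[w][v] / total for v in vocab]

    # 3. Similarity-based clustering: merge frequent terms whose
    #    Jensen-Shannon similarity exceeds the 0.95 * log 2 threshold.
    cluster_of = {w: i for i, w in enumerate(frequent)}
    for w1, w2 in combinations(frequent, 2):
        if js_similarity(dist(w1), dist(w2)) > threshold:
            old, new = cluster_of[w1], cluster_of[w2]
            for w in cluster_of:
                if cluster_of[w] == old:
                    cluster_of[w] = new
    clusters = defaultdict(list)
    for w, c in cluster_of.items():
        clusters[c].append(w)

    # 4. Chi-square score of each term against the clusters: a large
    #    deviation of observed from expected co-occurrence counts
    #    signals a keyword.
    totals = sum(counts[w] for w in frequent) or 1
    scores = {}
    for w in vocab:
        n_w = sum(co[w][g] for c in clusters.values() for g in c)
        score = 0.0
        for c in clusters.values():
            p_g = sum(counts[g] for g in c) / totals
            expected = n_w * p_g
            observed = sum(co[w][g] for g in c)
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores[w] = score
    return sorted(scores, key=scores.get, reverse=True)
```

The chi-dash-square variant in the paper additionally subtracts the single largest per-cluster term from the sum, which robustifies the score against one dominant co-occurrence; the plain chi-square above keeps the sketch short.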
