Cluster computing

Sunday, August 31, 2014

Today I will resume the discussion on keyword extraction.
We discussed co-occurrence of terms as an indicator of keywords. Traditionally this has meant clustering keywords based on similarity, which is often measured with Jensen-Shannon divergence or Kullback-Leibler divergence. However, similarity doesn't give an indication of relevance. Pairwise co-occurrence, or mutual information, gives some indication of relevance.
Sometimes we need to use both, or prefer one over the other based on a chi-square goodness-of-fit test.
In our case, co-occurrence of a term with a cluster means co-occurrence of the term with any term in the cluster, although we could instead use the nearest term, the farthest term, or the average over the cluster.
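To make the alternatives concrete, here is a minimal sketch. It assumes co-occurrence is counted per sentence (the unit is not stated above), and it reads "nearest" and "farthest" as the cluster member with the highest and lowest pairwise co-occurrence with the term, which is one possible interpretation. The function name term_cluster_cooccurrence and the mode parameter are hypothetical.

```python
def term_cluster_cooccurrence(term, cluster, sentences, mode="any"):
    """Co-occurrence of `term` with a cluster of terms.

    sentences: list of token lists (assumed unit of co-occurrence).
    mode: "any", "nearest", "farthest" or "average" (hypothetical names).
    """
    others = set(cluster) - {term}
    if not others:
        return 0
    if mode == "any":
        # Sentences where the term appears with at least one cluster member.
        return sum(1 for s in sentences if term in s and others & set(s))
    # Pairwise co-occurrence of the term with each cluster member.
    pairwise = [
        sum(1 for s in sentences if term in s and member in s)
        for member in sorted(others)
    ]
    if mode == "nearest":   # assumed: most strongly co-occurring member
        return max(pairwise)
    if mode == "farthest":  # assumed: least strongly co-occurring member
        return min(pairwise)
    if mode == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown mode: {mode}")


sentences = [
    ["keyword", "cluster"],
    ["keyword", "cluster", "term"],
    ["keyword", "matrix"],
    ["term", "matrix"],
]
group = {"cluster", "term"}
any_count = term_cluster_cooccurrence("keyword", group, sentences, "any")
nearest = term_cluster_cooccurrence("keyword", group, sentences, "nearest")
farthest = term_cluster_cooccurrence("keyword", group, sentences, "farthest")
average = term_cluster_cooccurrence("keyword", group, sentences, "average")
```

The "any" mode needs the sentences themselves, since it cannot be recovered from pairwise counts alone; the other three modes only need the pairwise counts.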
We populated a co-occurrence matrix from the top-occurring terms and their counts. For each term, we count its co-occurrences with the frequent terms we have selected. These frequent terms are chosen using a threshold we select.
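A rough sketch of this step follows. It assumes the text is already tokenized into sentences and that two terms co-occur when they appear in the same sentence; both are assumptions, as are the names build_cooccurrence and min_count.

```python
from collections import Counter


def build_cooccurrence(sentences, min_count=2):
    """Return (frequent, cooc), where cooc[w][g] is the number of sentences
    in which term w appears together with frequent term g."""
    # Overall count of each term.
    term_counts = Counter(t for s in sentences for t in s)
    # Frequent terms: those at or above the chosen threshold (assumed form).
    frequent = {t for t, c in term_counts.items() if c >= min_count}
    cooc = {t: Counter() for t in term_counts}
    for s in sentences:
        present = set(s)  # count each pair at most once per sentence
        for w in present:
            for g in present & frequent:
                if g != w:
                    cooc[w][g] += 1
    return frequent, cooc


sentences = [
    ["cluster", "term", "keyword"],
    ["cluster", "term"],
    ["keyword", "matrix"],
]
frequent, cooc = build_cooccurrence(sentences)
```

Each row of cooc is a Counter, so a pair that never co-occurs simply reads as zero.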
When we classify, we take two terms and find the clusters they belong to. Initially, words don't belong to any cluster. Two terms are put in the same cluster based on their mutual information, which is calculated from the ratio of the probability that the terms co-occur to the product of their individual probabilities. We translate this into counts, computing each probability from the counts in the co-occurrence matrix.
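As an illustration, the ratio can be written directly in counts: with n_total counting units (sentences, in this sketch), P(a, b) / (P(a) P(b)) becomes n_ab * n_total / (n_a * n_b). The sketch below takes the logarithm of that ratio and merges any pair whose score reaches a threshold. The merge rule, the threshold value, and the names mutual_information and cluster_terms are assumptions, not the original implementation.

```python
import math


def mutual_information(n_ab, n_a, n_b, n_total):
    """log of P(a, b) / (P(a) * P(b)), with probabilities estimated from counts."""
    if n_ab == 0:
        return float("-inf")
    return math.log((n_ab * n_total) / (n_a * n_b))


def cluster_terms(terms, pair_counts, term_counts, n_total, threshold):
    """Merge terms whose mutual information is at least `threshold` (assumed rule)."""
    parent = {t: t for t in terms}  # every term starts on its own

    def find(t):
        # Follow parent links to the cluster representative.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for (a, b), n_ab in pair_counts.items():
        mi = mutual_information(n_ab, term_counts[a], term_counts[b], n_total)
        if mi >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for t in terms:
        clusters.setdefault(find(t), set()).add(t)
    return list(clusters.values())


term_counts = {"cluster": 2, "term": 2, "keyword": 2, "matrix": 1}
pair_counts = {
    ("cluster", "term"): 2,
    ("cluster", "keyword"): 1,
    ("term", "keyword"): 1,
    ("keyword", "matrix"): 1,
}
clusters = cluster_terms(
    list(term_counts), pair_counts, term_counts, n_total=3, threshold=math.log(1.2)
)
```

With so little data the grouping is only illustrative; on a real document the threshold would need tuning.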
We measure the cluster quality by calculating the chi-square statistic. We do this by summing the chi-square components measured against each word g in the set of frequent terms. Each component is the square of the difference between the observed co-occurrence frequency of w with g and the expected frequency, divided by that expected frequency. The expected frequency is in turn the product n_w * p_g, where p_g is the expected probability of the frequent term g and n_w is the total number of co-occurrences of the term w with the frequent terms.
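The calculation above can be sketched as follows. The expected probabilities p_g are passed in as given, since how they are estimated is not spelled out here; the name chi_square and the dictionary layout are assumptions.

```python
def chi_square(cooc_w, p):
    """Chi-square of a term w against the frequent terms.

    cooc_w: mapping g -> observed co-occurrence count of w with frequent term g.
    p:      mapping g -> expected probability p_g of frequent term g.
    """
    # n_w: total co-occurrences of w with the frequent terms.
    n_w = sum(cooc_w.get(g, 0) for g in p)
    if n_w == 0:
        return 0.0
    total = 0.0
    for g, p_g in p.items():
        expected = n_w * p_g
        if expected > 0:
            observed = cooc_w.get(g, 0)
            total += (observed - expected) ** 2 / expected
    return total


p = {"a": 0.5, "b": 0.3, "c": 0.2}
unbiased = chi_square({"a": 5, "b": 3, "c": 2}, p)  # matches expectation
biased = chi_square({"a": 10}, p)                   # co-occurs only with "a"
```

A term whose co-occurrences follow the expected proportions scores near zero, while a term that co-occurs selectively with a few frequent terms scores high.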
If a term has a large chi-square value, it is relatively more important; if it has a low chi-square value, it is relatively trivial. Chi-square gives a notion of how far the observed co-occurrences deviate from the expected ones, indicating the contribution each cluster makes and hence its likelihood of bringing out the salient keywords. For a look at the implementation, here it is.
We also attempted Kullback-Leibler divergence as a method for keyword extraction. Here we used one of the formulas for the divergence.
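Which formula was used is not stated, so the following is only a sketch of the standard definitions: D_KL(P || Q) = sum over i of p_i * log(p_i / q_i), and the Jensen-Shannon divergence built from it. The distributions are plain dictionaries of term probabilities, and the eps floor for terms missing from Q is an assumption for the sketch.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) over the union of keys; zero q values are floored at eps."""
    total = 0.0
    for k in set(p) | set(q):
        p_k = p.get(k, 0.0)
        if p_k > 0:
            total += p_k * math.log(p_k / max(q.get(k, 0.0), eps))
    return total


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log 2."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
js_pq = js_divergence(p, q)
```

Kullback-Leibler divergence is not symmetric, so kl_pq and kl_qp differ in general, which is one reason the Jensen-Shannon form is often preferred for similarity.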
