Thursday, December 26, 2013

I will take a short break from our readings from the book on building a data warehouse, to discuss the implementation of Matsuo algorithm. I've mentioned  that the algorithm proceeds with the following steps : preprocessing, selection of frequent terms, clustering frequent terms, calculation of expected probability as the ratio of term co-occurrences with the cluster to the total terms, calculation of the chi-square value based on the differences between expected and observed frequencies, and output the keywords.
In this post, I want to revisit the step involving  clustering frequent terms. Note that we find the term frequencies, set the cutoff to 30% to find the top frequent terms and then find the co-occurrence matrix between all terms and frequent terms where the co-occurrences are found by calculating the number of sentences where they co-occur.
Now we transpose the columns of this co-occurrence matrix and find the clusters taking two at a time. We find the mutual information measure between two terms x and y as P(x,y)/ P(x)P(y)
We can take the P(x) as the expected probability which is the sum of the total number of terms in sentences where x appears divided by the total number of terms in the document.
We could use hierarchical clustering methods. We take terms pairwise and put them in a cluster if their mutual information is above a threshold. This threshold we choose to be log(2.0). A term is said to co-occur with a cluster if it co-occurs with any term in the cluster.
Just like we populated the co-occurrence matrix, we take pairs of terms and cluster them. We maintain a term array containing cluster  labels or numbers where we start with 1 for the first cluster and increment for every new cluster. If one of the terms is already in a cluster, the other term joins the cluster. If the other terms belongs to another cluster, the clusters are merged. The highest of either cluster numbers is used for the merged cluster number When the clusters are merged the cluster matrix reassigns all members of both clusters to this new cluster and so on. If a term belongs to more than one clusters, then it is merged with the nearest cluster. Eventually, all the terms will be assigned to one cluster each. More than one terms may be assigned to a single cluster. All clusters will have distinct labels or numbers.




No comments:

Post a Comment