Wednesday, December 4, 2013

In the previous post we described the steps of the algorithm by Matsuo. We mentioned that the total number of running terms is taken as Ntotal and that the top 30% of terms by frequency are taken as the frequent terms. The frequent terms are then clustered pairwise: two terms are put in the same cluster when the Jensen-Shannon divergence between their co-occurrence distributions is above the threshold. This results in a set of clusters C. For each cluster c, the number of terms that co-occur with c, denoted nc, gives the expected probability pc = nc/Ntotal. We then compute the chi-square statistic for each term, measuring how far its observed co-occurrence with each cluster departs from this expectation, and output the requested number of terms with the largest chi-square values.
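As a rough sketch of that scoring step (not the authors' implementation), the chi-square computation might look like the following in Python. The clustering and co-occurrence counting are assumed to have been done already; cooccur, n_c, and chi_square_scores are hypothetical names, and the notation (nc, Ntotal) follows the description above.

```python
def chi_square_scores(cooccur, n_c, n_total):
    """Score each term by chi-square against expected co-occurrence.

    cooccur : dict term -> dict cluster -> observed co-occurrence count
    n_c     : dict cluster -> number of terms co-occurring with that cluster
    n_total : total number of running terms in the document
    """
    scores = {}
    for term, counts in cooccur.items():
        n_w = sum(counts.values())  # total co-occurrences of this term
        chi2 = 0.0
        for cluster, observed in counts.items():
            # expected co-occurrence under pc = nc / Ntotal
            expected = n_w * (n_c[cluster] / n_total)
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
        scores[term] = chi2
    return scores

# Toy usage with made-up counts: three clusters, a 1000-term document,
# and observed co-occurrence counts for two candidate terms.
cooccur = {
    "kernel":  {"c1": 12, "c2": 3, "c3": 1},
    "however": {"c1": 6,  "c2": 5, "c3": 5},
}
n_c = {"c1": 120, "c2": 100, "c3": 90}
scores = chi_square_scores(cooccur, n_c, n_total=1000)
for term in sorted(scores, key=scores.get, reverse=True):
    print(term, round(scores[term], 2))
```

The toy numbers illustrate the intuition: "kernel" co-occurs with one cluster far more than chance predicts and so gets a high chi-square, while "however" spreads evenly across clusters and scores low.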
Index terms are usually evaluated by precision and recall against a corpus, but this algorithm works on a single document and does not use one. Instead, the algorithm was run on data sets of papers by different authors, and each author was asked to provide five or more terms they considered indispensable keywords. Coverage of each method was calculated as the ratio of indispensable terms found among the top 15 terms returned by that method. The results were comparable with tf-idf.
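A simple way to express that coverage measure, with hypothetical names and made-up term lists:

```python
def coverage(indispensable, extracted, k=15):
    """Fraction of the author-supplied indispensable terms that appear
    among the top-k extracted terms (k = 15 in the evaluation above)."""
    top_k = set(extracted[:k])
    hits = sum(1 for term in indispensable if term in top_k)
    return hits / len(indispensable)

# Hypothetical example: 4 of the 5 indispensable terms appear in the top 15.
author_terms = ["keyword", "co-occurrence", "chi-square", "cluster", "corpus"]
ranked = ["keyword", "term", "chi-square", "cluster",
          "co-occurrence"] + ["w%d" % i for i in range(10)]
print(coverage(author_terms, ranked))  # 0.8
```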
