Sunday, December 8, 2013

While Matsuo's paper discusses an approach to extract keywords by first taking the top 30% of the frequent terms and then clustering based on pairwise co-occurrence of terms, we strive to collect the terms that differentiate from the background by automatically finding the number of clusters using DBSCAN algorithm and KLD distances based on similarity of distributions. In this case we use the KLD-distance as defined by Bigi :
D-KLD(P/Q) = Sum-x(P(x)-Q(x))log(P(x)/Q(x))
and the distribution is taken with the probabilities as
pg = the sum of the total number of terms in sentences where g appears) divided by the total number of terms in the document.

In the KLD metric, when we have a limited set of sentences, this probability is then normalized by multiplying with a scaling factor when the term occurs or set to a default value
when it doesn't.

We use the DBSCAN algorithm like this. Any two core points that are close enough i.e. within the radius of one another are put in the same cluster. Any border point that is close enough to a cluster is included within the same cluster as a core point. If there are ties between two or more clusters, they are resolved by choosing the one that's closest. Noise points are discarded.
So the steps can be enumerated as follows:
1. Label the points as core, border and noise points
2. Eliminate the noise points
3. Put an edge between all core points that are within the specified distance of one another
4. Make a cluster out of each group of connected core points that are within the specified distance of one another.
5. Assign each border point to one of the clusters of its associated core points.
For each point we find the points that are within the neighborhood of this point, so the complexity of this algorithm is O(m^2). However, there are data structures that can help reduce the complexity in retrieving the points adjacent to this center to O(mlogm). The space complexity is linear because we only persist a small amount of metadata for each of the points namely the cluster label and the classification of each point as the core, border or noise point.
The parameters for this algorithm are the radius and the minimum number of points in a cluster to distinguish the core from the border points.
By clustering points, we incrementally provide user-defined number of keywords by taking the high valued clusters first and enumerating the keywords in that cluster.

No comments:

Post a Comment