Saturday, November 16, 2013

In the previous post, we discussed using KLD (Kullback-Leibler divergence) for narrow-domain short texts. Here we will cover the feature selection techniques used in that discussion. The first technique is Document Frequency (DF), which assigns the value df_t to each term t, where df_t is the number of texts in the collection in which t occurs. The second technique is term strength, where a weight is assigned to each term t based on its occurrence in similar texts. The third is the transition point, which is found from the vocabulary frequencies of each text.
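As a rough illustration of the document frequency technique, here is a minimal sketch. The function name document_frequency, the whitespace tokenization and the sample texts are assumptions made for this example; they do not come from the paper.

```python
from collections import Counter

def document_frequency(texts):
    """Return a mapping term -> number of texts in which the term occurs."""
    df = Counter()
    for text in texts:
        # Assumption: lowercase and split on whitespace.
        # set() makes sure a term is counted at most once per text.
        df.update(set(text.lower().split()))
    return df

# Hypothetical toy collection of short texts.
texts = [
    "kld for short texts",
    "short texts in a narrow domain",
    "clustering narrow domain texts",
]
df = document_frequency(texts)
print(df["texts"], df["short"], df["kld"])
```

Terms with a very low df_t can then be dropped before clustering; where to put the cut-off is a tuning choice.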
Apart from these, there was a mention of extracting keywords based on KLD itself, i.e., we find the terms that contribute most to the distance between two probability distributions P and Q. We had proposed two approaches. The first was to select a few terms at a time and choose the ones that were consistently higher than a threshold. The second was to select one term at a time and add it to an initially empty document so that the KLD distance stays above a threshold.
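To make the idea of a term's contribution concrete: KLD(P || Q) is the sum over terms t of P(t) * log(P(t) / Q(t)), so each term has its own summand that can be compared against a threshold. The sketch below is only an illustration of that decomposition, not either of the two approaches exactly. The helper names, the epsilon smoothing for unseen terms and the threshold value are all assumptions.

```python
import math
from collections import Counter

def term_distribution(text, vocabulary, epsilon=1e-6):
    """Estimate P(t) over the vocabulary from one text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    # Assumption: unseen terms get a small epsilon so the log is defined,
    # and the result is renormalized to sum to 1.
    raw = {t: counts[t] / total if counts[t] else epsilon for t in vocabulary}
    norm = sum(raw.values())
    return {t: p / norm for t, p in raw.items()}

def kld_contributions(p, q):
    """Per-term summands of KLD(P || Q)."""
    return {t: p[t] * math.log(p[t] / q[t]) for t in p}

def select_keywords(p, q, threshold):
    """Terms whose contribution exceeds the threshold, largest first."""
    contrib = kld_contributions(p, q)
    chosen = [t for t, c in contrib.items() if c > threshold]
    return sorted(chosen, key=lambda t: -contrib[t])

# Hypothetical toy texts.
text_p = "kld kld kld distance"
text_q = "distance distance clustering"
vocabulary = set(text_p.split()) | set(text_q.split())
p = term_distribution(text_p, vocabulary)
q = term_distribution(text_q, vocabulary)
keywords = select_keywords(p, q, threshold=0.1)
print(keywords)
```

Note that individual contributions can be negative even though the total KLD is not, which is why a threshold on the per-term value is a meaningful filter.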
The clustering techniques used by the authors include the agglomerative clustering proposed by Johnson.
This is described by Borgatti as follows:
Let there be N items to be clustered and an N * N distance matrix to begin with:
Assign each item to its own cluster so that we have N clusters, each containing just one item. The distances between the clusters equal the distances between the items they contain.
The closest (most similar) pair of clusters is found and merged into a single cluster, so that we now have one fewer cluster.
The distances between the new cluster and each of the old clusters are computed.
The above two steps of merging and computing the distances are repeated until we are left with a single cluster.
Computing the distance between the new cluster and the old clusters can be done in one of the following three ways:
single link - here the distance between one cluster and another is the shortest distance from any member of one cluster to any member of the other cluster. This method is also called the connectedness or minimum method.
complete link - here the distance between one cluster and another is the longest distance from any member of one cluster to any member of the other cluster. This method is also called the diameter or maximum method.
average link - here the distance between one cluster and another is the average distance from any member of one cluster to any member of the other cluster.
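The three definitions above can be written down directly. This is a small sketch in which clusters are lists of item indices into the distance matrix; the function names and the 4 * 4 matrix are made up for illustration.

```python
def single_link(a, b, dist):
    """Shortest distance between any member of a and any member of b."""
    return min(dist[i][j] for i in a for j in b)

def complete_link(a, b, dist):
    """Longest distance between any member of a and any member of b."""
    return max(dist[i][j] for i in a for j in b)

def average_link(a, b, dist):
    """Average distance over all pairs (member of a, member of b)."""
    return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

# Hypothetical symmetric distance matrix for four items.
dist = [
    [0, 2, 6, 10],
    [2, 0, 5, 9],
    [6, 5, 0, 4],
    [10, 9, 4, 0],
]
a, b = [0, 1], [2, 3]
print(single_link(a, b, dist))    # 5
print(complete_link(a, b, dist))  # 10
print(average_link(a, b, dist))   # 7.5
```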
Notice that in the N * N matrix the distances along the diagonal, between an element and itself, are zero. As the elements are combined into clusters, the cluster names are the combinations of the names of the elements.
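Putting the steps together, here is a minimal sketch of the whole procedure. It is not Johnson's or Borgatti's own code. For simplicity it recomputes cluster distances from the original item distances in every round instead of updating the matrix in place, which gives the same result for the minimum, maximum and average methods. The function names, the "/" separator in cluster names and the sample matrix are assumptions.

```python
def mean(values):
    """Average of an iterable; used for the average-link variant."""
    values = list(values)
    return sum(values) / len(values)

def agglomerative(names, dist, linkage=min):
    """Merge clusters until one is left; return the merges as (name, distance).

    linkage is min (single link), max (complete link) or mean (average link).
    """
    # Start with every item in its own cluster.
    clusters = {name: [i] for i, name in enumerate(names)}
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        best = None
        # Find the closest pair of clusters under the chosen linkage.
        for x in range(len(keys)):
            for y in range(x + 1, len(keys)):
                d = linkage(dist[i][j]
                            for i in clusters[keys[x]]
                            for j in clusters[keys[y]])
                if best is None or d < best[0]:
                    best = (d, keys[x], keys[y])
        d, a, b = best
        # The merged cluster is named after the names it combines.
        new_name = a + "/" + b
        clusters[new_name] = clusters.pop(a) + clusters.pop(b)
        merges.append((new_name, d))
    return merges

# Hypothetical items and symmetric distance matrix with a zero diagonal.
names = ["A", "B", "C", "D"]
dist = [
    [0, 2, 6, 10],
    [2, 0, 5, 9],
    [6, 5, 0, 4],
    [10, 9, 4, 0],
]
single = agglomerative(names, dist, linkage=min)
complete = agglomerative(names, dist, linkage=max)
print(single)
print(complete)
```

With this matrix both variants merge A with B and then C with D; they differ only in the distance at which the last two clusters are joined.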
