Thursday, December 5, 2013

In today's post we discuss maximal Kullback-Leibler divergence cluster analysis by Anne Sheehy. In this paper, she presents a new procedure for performing a cluster analysis, together with a proof of a consistency result for the procedure. Given data X1, X2, ..., Xn with distribution P on a sample space X in R, and a known collection of k-partitions of X, the procedure selects a member Vk from that collection. The partition that best describes the clustering structure of the data is defined to be the one which maximizes a criterion; in this case, the criterion is a weighted sum of Kullback-Leibler divergences.
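To make the criterion concrete, here is a minimal sketch in Python. The helper names (kl_divergence, weighted_kl_criterion) are mine, and weighting each cluster by its sample proportion is an assumption for illustration, not the paper's exact definition.

```python
import numpy as np

def kl_divergence(q, p, eps=1e-12):
    # Discrete KL divergence K(Q, P) = sum q * log(q / p),
    # smoothed with eps to avoid division by zero.
    q = np.asarray(q, dtype=float) + eps
    p = np.asarray(p, dtype=float) + eps
    q, p = q / q.sum(), p / p.sum()
    return float(np.sum(q * np.log(q / p)))

def weighted_kl_criterion(data, labels, bins=10):
    # Score a partition by a weighted sum of KL divergences between
    # each cluster's histogram and the full sample's histogram.
    data, labels = np.asarray(data), np.asarray(labels)
    edges = np.histogram_bin_edges(data, bins=bins)
    p_full, _ = np.histogram(data, bins=edges)
    score = 0.0
    for k in np.unique(labels):
        cluster = data[labels == k]
        q_k, _ = np.histogram(cluster, bins=edges)
        score += (len(cluster) / len(data)) * kl_divergence(q_k, p_full)
    return score
```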
The Kullback-Leibler divergence K(Q, P), taken here between a distribution Q with mean m and the distribution P, equals 0 when m = EpX and increases as m moves away from EpX. The rate of increase depends on more than the distance of m from EpX; in particular, it depends on the dispersion of P in the direction in which m is moving.
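This behaviour is easy to see in a toy Gaussian case, where the divergence has a closed form: for Q = N(m, sigma^2) and P = N(mu, sigma^2), K(Q, P) = (m - mu)^2 / (2 sigma^2). A quick check:

```python
def kl_gaussian(m, mu, sigma):
    # K(Q, P) for Q = N(m, sigma^2), P = N(mu, sigma^2): grows
    # quadratically with |m - mu|, damped by the dispersion of P.
    return (m - mu) ** 2 / (2 * sigma ** 2)

for sigma in (0.5, 1.0, 2.0):
    print(sigma, [round(kl_gaussian(m, 0.0, sigma), 3) for m in (0.0, 1.0, 2.0)])
# Larger sigma (more dispersion in P) flattens the rate of increase.
```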
The value K(Qn-min(A), Pn) differentiates a subset A of {X1, X2, ..., Xn} from the rest of the sample: if the cluster A is far removed from the rest of the sample, K(Qn-min(A), Pn) will be very large. Here the subsets A are chosen such that P(A) > 0.
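As an illustration of how such a value flags a distant cluster, here is a sketch that uses the subset's histogram as a crude stand-in for Qn-min(A) (the paper's exact construction of Qn-min(A) is not reproduced here):

```python
import numpy as np

def subset_kl_score(data, mask, bins=10, eps=1e-12):
    # Illustrative stand-in for K(Qn-min(A), Pn): compare the histogram
    # of subset A against the full empirical histogram. A subset far
    # removed from the bulk of the sample scores high.
    edges = np.histogram_bin_edges(data, bins=bins)
    q, _ = np.histogram(data[mask], bins=edges)
    p, _ = np.histogram(data, bins=edges)
    q = (q + eps) / (q + eps).sum()
    p = (p + eps) / (p + eps).sum()
    return float(np.sum(q * np.log(q / p)))

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 90), rng.normal(8, 1, 10)])
far = data > 4                        # the outlying cluster
print(subset_kl_score(data, far))     # large
print(subset_kl_score(data, ~far))    # small
```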
The paper also references Blaschke's Selection Theorem, which states that every sequence of compact convex subsets of a bounded region has a subsequence converging in the Hausdorff metric.
If a clustering procedure produces k clusters on a given data set, and all the points in any one of those clusters are removed from the sample and the remaining points are re-clustered, the result should be the original clusters minus the one that was removed, i.e. the remaining k - 1 clusters. This "cluster omission admissibility" is a characteristic of a good clustering procedure; a rough check of the property is sketched below.
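Here is that rough check, using k-means purely as a stand-in clustering procedure (it is not the paper's procedure) and comparing partitions by the co-membership of point pairs:

```python
import numpy as np
from sklearn.cluster import KMeans

def is_omission_admissible(data, k, omit=0, seed=0):
    # data: array of shape (n_samples, n_features).
    # Cluster into k groups, drop one cluster's points, re-cluster the
    # rest into k - 1 groups, and test whether the surviving points
    # keep their original co-membership structure.
    full = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(data)
    keep = full != omit
    reduced = KMeans(n_clusters=k - 1, n_init=10,
                     random_state=seed).fit_predict(data[keep])
    pairs_before = full[keep][:, None] == full[keep][None, :]
    pairs_after = reduced[:, None] == reduced[None, :]
    return bool(np.all(pairs_before == pairs_after))
```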
To select a subset of the original sample, we can cluster the probabilities into a set of k partitions, that is, into ranges of values. Since probabilities are bounded between 0 and 1, we can divide that interval into k non-overlapping ranges and use one or more of these ranges to define candidate subsets. Because the goal is to maximize the KLD criterion, we choose among the candidate subsets by the value of K(Qn-min(A), Pn), as sketched below.
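A sketch of this selection, assuming equal-width probability bands and the same illustrative KL score as above:

```python
import numpy as np

def best_probability_band(probs, data, k=5, bins=10, eps=1e-12):
    # Divide [0, 1] into k equal non-overlapping ranges, treat each
    # band of probabilities as a candidate subset A, and keep the one
    # maximizing the illustrative KL score against the full sample.
    probs, data = np.asarray(probs), np.asarray(data)
    band_edges = np.linspace(0.0, 1.0, k + 1)
    x_edges = np.histogram_bin_edges(data, bins=bins)
    p_full, _ = np.histogram(data, bins=x_edges)
    p_full = (p_full + eps) / (p_full + eps).sum()
    best, best_score = None, -np.inf
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        # The last band is closed so probabilities equal to 1.0 are kept.
        mask = (probs >= lo) & ((probs < hi) if hi < 1.0 else (probs <= hi))
        if not mask.any():
            continue
        q, _ = np.histogram(data[mask], bins=x_edges)
        q = (q + eps) / (q + eps).sum()
        score = float(np.sum(q * np.log(q / p_full)))
        if score > best_score:
            best, best_score = (lo, hi), score
    return best, best_score
```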
The probabilities used in the KLD computation are based on term frequencies: the probability of a term can be taken as the total number of terms in the sentences where that term appears, divided by the total number of terms in the document.
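A minimal sketch of that computation, assuming a naive period/whitespace tokenization:

```python
from collections import Counter

def term_probabilities(document):
    # For each term: total number of terms in the sentences containing
    # it, divided by the total number of terms in the document.
    sentences = [s.split() for s in document.split('.') if s.split()]
    total_terms = sum(len(s) for s in sentences)
    mass = Counter()
    for sent in sentences:
        for term in set(sent):       # count each sentence once per term
            mass[term] += len(sent)
    return {t: m / total_terms for t, m in mass.items()}

print(term_probabilities("the cat sat. the dog ran. cats and dogs play."))
```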
Thus we can apply term frequencies at any scope, whether sentences relative to a document or documents relative to a collection. The selection procedure is the same regardless of scope, and the clustering can likewise be applied using measures based on different scopes; a generic version is sketched below.
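The same computation, parameterized by scope: units can be the tokenized sentences of one document, or the tokenized documents of a whole collection.

```python
def unit_probabilities(units):
    # units: a list of token lists -- sentences within a document,
    # or documents within a collection. Same formula either way.
    total = sum(len(u) for u in units)
    mass = {}
    for unit in units:
        for term in set(unit):
            mass[term] = mass.get(term, 0) + len(unit)
    return {t: m / total for t, m in mass.items()}

# Sentence scope: units = tokenized sentences of one document.
# Collection scope: units = tokenized documents of the collection.
```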
