Wednesday, December 4, 2013

In the previous post we described the steps of the algorithm by Matsuo. We mentioned that the total number of running terms is taken as Ntotal and that the top 30% of these terms are taken as the frequent terms. The frequent terms are then clustered: pairs whose Jensen-Shannon divergence is above the threshold are merged, which results in C clusters. For each cluster c, the number of running terms that co-occur with c, denoted by nc, gives us the expected probability as nc/Ntotal. We then compute the chi-square value for each term and output the given number of terms with the largest chi-square values.
Index terms are usually evaluated based on precision and recall against a corpus, but this algorithm does not use a corpus. So the algorithm was run on documents from different authors, and each author was also asked to provide five or more terms which they thought were indispensable keywords. Coverage of each method was calculated as the ratio of those indispensable terms that appeared among the top 15 terms found by the method. The results were comparable with tf-idf.
In the previous post, we mentioned an innovative approach to extract keywords from a single document. It uses word co-occurrences, and we wanted to combine it with Kullback-Leibler divergence, clustering and cosine similarity. We were keen on extracting keywords while simultaneously associating them with topics, and on repeating the partitioning until the cluster centers stabilize. We wanted the flexibility for topics to overlap via fuzzy memberships. We were also interested in a term-attribute table that spanned all the words in a dictionary, with attributes that helped us discern topics, tones and era. Note that we attached the relevance or term weights before we clustered the topics, but we used them together in each iteration. And we wanted to evaluate the clusters with the measures we discussed earlier.
In Matsuo's paper, the quality of the extracted keywords is improved by selecting a proper set of columns from the co-occurrence matrix. This set of columns is the set of terms or keywords; they are preferably orthogonal, and they are extracted with clustering. The paper mentions two major approaches to clustering - similarity-based clustering and pairwise clustering. In similarity-based clustering, if terms w1 and w2 have similar distributions of co-occurrence with other terms, w1 and w2 are placed in the same cluster. In pairwise clustering, if terms w1 and w2 co-occur frequently, w1 and w2 are placed in the same cluster. They found similarity-based clustering to be effective in grouping paraphrases and phrases; the similarity of two distributions is measured statistically by the Jensen-Shannon divergence. On the other hand, they found pairwise clustering to yield relevant terms in the same cluster. Thresholds are determined by preliminary experiments. Proper clustering of frequent terms results in an appropriate chi-square value for each term. The steps involve preprocessing, selection of frequent terms, clustering of frequent terms, calculation of expected probability, calculation of the chi-dash-square value, and output of keywords. Frequent terms are selected as the top 30%. Frequent terms are clustered as pairs whose Jensen-Shannon divergence is above a threshold of 0.95 * log 2.
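As a rough sketch of this clustering step, the following Python snippet computes the Jensen-Shannon divergence between the co-occurrence distributions of two frequent terms. The co-occurrence rows are made up, and the thresholding comment reflects just one reading of how the paper applies the 0.95 * log 2 cut-off.

import numpy as np

def js_divergence(p, q):
    """Standard Jensen-Shannon divergence (natural log) between two
    co-occurrence distributions; it is bounded above by log 2."""
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical co-occurrence rows for two frequent terms (counts of
# co-occurrence with every other term in the document).
row_w1 = [5, 0, 3, 2, 0, 1]
row_w2 = [4, 1, 2, 3, 0, 0]

# Under one reading of the paper, log 2 minus the divergence acts as a
# similarity, and a pair is merged when it exceeds 0.95 * log 2; with the
# standard divergence this is the same as requiring it to stay below
# 0.05 * log 2.
similarity = np.log(2) - js_divergence(row_w1, row_w2)
print(similarity, similarity > 0.95 * np.log(2))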

Tuesday, December 3, 2013

I came across a paper that has some similarity with my interest in keyword extraction. The paper is titled Keyword extraction from a single document using word co-occurrence statistical information by Matsuo and Ishizuka. Their algorithm works on a single document without using a corpus. They extract the frequent terms first. Then they count the co-occurrences of each term with the frequent terms. If a term appears selectively with a particular subset of frequent terms, the term is likely to have an important meaning. They measure the degree of bias of the co-occurrence distribution with a chi-square goodness-of-fit test. They show that their algorithm performs about as well as tf-idf with a corpus.
They treat a sentence as a basket of words, ignoring term order and grammatical information. They build a co-occurrence matrix by counting the frequencies of pairwise term co-occurrence. This matrix is a symmetric N x N matrix, where N is the number of different terms, which is different from the number of frequent terms G. They ignore the diagonal components. If a term w appears independently of the frequent terms G, the distribution of w against G is similar to the unconditional distribution. On the other hand, if w has a semantic relation with a subset g of G, then w and g have a biased distribution. Since the term frequency could be small, the degree of bias by itself is not reliable, so they test the significance of the bias using chi-square. In this case, the chi-square is defined as Sum-over-g((freq(w,g) - nw*pg)^2 / (nw*pg)) where
pg is the expected probability, equal to the unconditional probability of the frequent term g,
nw is the total number of co-occurrences of term w with the frequent terms G, and
nw*pg represents the expected frequency of co-occurrence.
Terms with high chi-square value are relatively more important in the document than the ones with low chi-square.
They use clustering.
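To make the bias measure concrete, here is a minimal Python sketch of the chi-square statistic described above; the co-occurrence counts and the frequent-term probabilities pg are invented for illustration.

import numpy as np

# Hypothetical co-occurrence counts of a candidate term w with each
# frequent term g in G (one column per frequent term).
freq_w_g = np.array([12, 1, 0, 7, 2], dtype=float)

# Unconditional probabilities of the frequent terms (the expected
# probabilities pg); these would normally come from the document itself.
p_g = np.array([0.30, 0.25, 0.20, 0.15, 0.10])

n_w = freq_w_g.sum()                # total co-occurrences of w with G
expected = n_w * p_g                # expected co-occurrence frequencies
chi_square = np.sum((freq_w_g - expected) ** 2 / expected)

# A large value suggests w co-occurs selectively with a subset of G,
# so it is relatively more important in the document.
print(chi_square)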

We mentioned similarity-oriented measures for the supervised evaluation of clusters. We will look into the Jaccard coefficient now. We can view this approach to cluster validity as involving the comparison of two matrices - an ideal cluster similarity matrix and an ideal class similarity matrix defined with respect to the class labels. As before, we can take the correlation of these two matrices as the measure of cluster validity. Both of these matrices are N x N, where N is the number of data points, and both have binary values 1 or 0 - 1 if the two objects belong to the same cluster or class respectively, and 0 otherwise. For all N(N - 1)/2 pairs of distinct objects, we compute the following:
f00 = number of pairs of objects that have a different class and a different cluster
f01 = number of pairs of objects that have a different class and the same cluster.
f10 = number of pairs of objects that have the same class and a different cluster
f11 = number of pairs of objects that have the same class and the same cluster.
These four counts define a two-way contingency table where the columns indicate same or different cluster and the rows indicate same or different class.
Then we can use one of the two most frequently used similarity measures as follows
1) Rand statistic = (f00 + f11)/ (f00 + f01 + f10 + f11)
and
2) Jaccard co-efficient = (f11)/(f01 + f10 + f11)
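As a small illustration, the Python snippet below counts the four pair types for some made-up class and cluster labels and computes both measures.

from itertools import combinations

# Hypothetical ground-truth class labels and cluster labels for six objects.
classes  = [0, 0, 0, 1, 1, 2]
clusters = [0, 0, 1, 1, 1, 2]

f00 = f01 = f10 = f11 = 0
for i, j in combinations(range(len(classes)), 2):
    same_class = classes[i] == classes[j]
    same_cluster = clusters[i] == clusters[j]
    if same_class and same_cluster:
        f11 += 1
    elif same_class:
        f10 += 1
    elif same_cluster:
        f01 += 1
    else:
        f00 += 1

rand = (f00 + f11) / (f00 + f01 + f10 + f11)
jaccard = f11 / (f01 + f10 + f11)
print(rand, jaccard)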
For hierarchical clustering, the key idea behind the measure is that, for each class, at least one cluster should be relatively pure and include most of the objects of that class. To evaluate this, we use the F-measure, as defined earlier, for each cluster in the cluster hierarchy.
The overall F-measure for the hierarchical clustering is computed as the weighted average of the per-class F-measures, i.e. Sum-over-j((mj/m) * max-F-measure(i, j)), where the weights are based on class sizes, the maximum is taken over all clusters i at all levels, mj is the number of objects in class j, and m is the total number of objects.
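Here is a minimal sketch of that weighted average, assuming the per-cluster precision and recall are computed against each class in the usual way; the cluster hierarchy and labels are hypothetical.

from collections import Counter

# Hypothetical ground-truth class labels and a flattened cluster hierarchy:
# every cluster at every level is listed as a set of object indices.
labels = [0, 0, 0, 1, 1, 1, 2, 2]
hierarchy = [
    {0, 1, 2, 3, 4, 5, 6, 7},        # root
    {0, 1, 2, 3}, {4, 5, 6, 7},      # level 1
    {0, 1, 2}, {3}, {4, 5}, {6, 7},  # level 2
]

class_sizes = Counter(labels)
m = len(labels)

def f_measure(cluster, cls):
    # precision and recall of one cluster with respect to one class
    hits = sum(1 for i in cluster if labels[i] == cls)
    if hits == 0:
        return 0.0
    precision = hits / len(cluster)
    recall = hits / class_sizes[cls]
    return 2 * precision * recall / (precision + recall)

overall = sum(
    (size / m) * max(f_measure(c, cls) for c in hierarchy)
    for cls, size in class_sizes.items()
)
print(overall)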
Thus we have seen different cluster measures. Their values are indicative of the goodness of the clustering. A purity of 0 is bad and 1 is good. An entropy of 0 and an SSE of 0 are both good. We can use the absolute value of a measure if we only want to tolerate a certain level of error. If that is not the case, we can use statistical measures that tell us how non-random the structure is, given how high or low the value of the measure is.

Monday, December 2, 2013

We talked about cluster validation measures and saw some details about the measures for partitional and hierarchical clustering. We continue our discussion with a way to verify the number of clusters. Let us take a dataset with ten well-formed clusters and use K-means clustering, which also yields ten clusters. Then if we plot the SSE versus the number of clusters for the same data, as well as the average silhouette coefficient versus the number of clusters, both plots show a change in trend when the number of clusters is ten. The SSE plot shows a knee and the silhouette coefficient plot shows a peak.
Another validation concern is that almost all clustering algorithms yield clusters even though the given data may or may not have any clusters. To work around this, we refine the validation to check that at least a few of the resulting clusters are well formed.
An alternate approach is something called clustering tendency, where we try to evaluate whether a data set has clusters without actually clustering it. The most common approach is to use statistical tests for spatial randomness. However, choosing the correct model, estimating the parameters and evaluating the significance that the data is non-random can be time-consuming. Still, a few are mentioned in this post.
Hopkins statistic - For this approach, we generate p points that are randomly distributed across the data space, and also sample p actual data points. For both sets of points, we find the distance to the nearest neighbor in the original data set. If ui is this nearest distance for the generated points and wi is this nearest distance for the sample points, the statistic measure is given by
H = Sum(wi) / (Sum(ui) + Sum(wi)). Notice that this has a form similar to precision.
If the two sets of nearest neighbor distances are about the same, H comes out to be 0.5, indicating that our sample just happened to look like randomly placed points. With this formula, H close to 0 indicates highly clustered data, while H close to 1 indicates data that is regularly spread across the space.
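A minimal sketch of the Hopkins statistic as defined above, using NumPy and SciPy on made-up blob data; note the k=2 nearest-neighbor query for the sampled points so that a point does not count itself.

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

# Hypothetical 2-D data set with three tight blobs.
data = np.vstack([rng.normal(loc=c, scale=0.05, size=(50, 2))
                  for c in ([0, 0], [1, 1], [0, 1])])

p = 20
tree = cKDTree(data)

# p points uniformly distributed across the data space
lo, hi = data.min(axis=0), data.max(axis=0)
random_pts = rng.uniform(lo, hi, size=(p, 2))
u = tree.query(random_pts, k=1)[0]            # distance to nearest data point

# p sampled actual data points; k=2 so we skip the point itself
sample_idx = rng.choice(len(data), size=p, replace=False)
w = tree.query(data[sample_idx], k=2)[0][:, 1]

H = w.sum() / (u.sum() + w.sum())
print(H)   # near 0 for clustered data with this formulation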
When we have external information about the data, such as labels for the data objects, we use two different kinds of approaches: the first set of techniques is based on entropy, purity and the F-measure, and the second set is based on something similar to the similarity measures for binary data, such as the Jaccard measure. The latter approaches measure the extent to which two objects of the same class end up in the same cluster and the extent to which two objects in the same cluster share the same class. They can also be named classification-oriented and similarity-oriented approaches.
Entropy is the degree to which each cluster consists of objects of a single class.
If mi is the number of objects in cluster i and mij is the number of objects of class j in cluster i, then the probability that a member of cluster i belongs to class j is pij = mij/mi, and the entropy of cluster i is -Sum-over-j(pij * log-base-2(pij)). The purity of cluster i, on the other hand, is pi = max-over-j(pij), and the overall purity is Sum-over-i((mi/m) * pi).
F-measure is 2*precision*recall/(precision + recall).
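A small sketch of the entropy and purity computations from a hypothetical cluster-by-class contingency table:

import numpy as np

# Hypothetical contingency table: rows are clusters, columns are classes,
# entry [i][j] = mij, the number of objects of class j in cluster i.
m_ij = np.array([
    [30,  2,  1],
    [ 4, 25,  0],
    [ 1,  3, 20],
], dtype=float)

m_i = m_ij.sum(axis=1)          # cluster sizes
m = m_ij.sum()                  # total number of objects
p_ij = m_ij / m_i[:, None]      # probability of class j within cluster i

# per-cluster entropy, ignoring zero probabilities
with np.errstate(divide="ignore", invalid="ignore"):
    terms = np.where(p_ij > 0, p_ij * np.log2(p_ij), 0.0)
e_i = -terms.sum(axis=1)

purity_i = p_ij.max(axis=1)

overall_entropy = np.sum((m_i / m) * e_i)
overall_purity = np.sum((m_i / m) * purity_i)
print(overall_entropy, overall_purity)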

We mentioned that lower values of the silhouette coefficient for a point indicate that it is likely an outlier. In addition, this measure can be used not just for individual points but for a cluster and for the overall clustering as well, since we just take the average over all the points involved.
We can also measure cluster validity via correlation. If we are given a similarity matrix for a data set and the cluster labels from a cluster analysis of that data set, we can measure the goodness of the clustering by looking at the correlation between the similarity matrix and the ideal version of it. An ideal matrix is one in which each point has a similarity of 1 to all points in its cluster and 0 to all other points. This means that, if we order the rows and columns by cluster label, the ideal matrix has non-zero values only in blocks along the diagonal, because those blocks correspond to pairs of points in the same cluster, i.e. intracluster similarity. Hence such a matrix is also called block diagonal.
High correlation between the ideal and the actual matrix means that the points belonging to the same cluster are close together, while low correlation means this is not the case. Note that since both matrices are symmetric, only the entries above (or below) the diagonal need to be compared.
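A minimal sketch of this correlation check on made-up globular data; the Gaussian-kernel similarity is just one convenient choice, not the only one.

import numpy as np

# Hypothetical points in three tight groups, plus their cluster labels.
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(c, 0.1, size=(20, 2)) for c in (0.0, 2.0, 4.0)])
labels = np.repeat([0, 1, 2], 20)

# Actual similarity matrix, obtained here from pairwise distances.
d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
similarity = np.exp(-d)

# Ideal matrix: 1 if two points share a cluster label, 0 otherwise.
ideal = (labels[:, None] == labels[None, :]).astype(float)

# Correlate only the entries above the diagonal, since both are symmetric.
iu = np.triu_indices(len(points), k=1)
corr = np.corrcoef(similarity[iu], ideal[iu])[0, 1]
print(corr)   # close to 1 for well-separated globular clusters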
The shortcoming of this approach is that it works well when the clusters are globular but not so well when they are interspersed.
In order to visualize the similarity matrix, we can transform the distances into similarities with the formula:
s = 1 - (d - min_d)/(max_d - min_d)
This lets us see the separation between the clusters better. When we apply this to the results of clustering the same data set with K-means, DBSCAN and complete link, we see that the K-means similarity matrix has the most uniform separation. All three have well-defined separations between the three clusters, but K-means displays it best.
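For completeness, here is the distance-to-similarity transform applied to a tiny made-up distance matrix.

import numpy as np

# Hypothetical pairwise distance matrix for three points.
d = np.array([
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.8],
    [0.9, 0.8, 0.0],
])

# s = 1 - (d - min_d) / (max_d - min_d), so the closest pair maps near 1
# and the farthest pair maps to 0.
s = 1 - (d - d.min()) / (d.max() - d.min())
print(s)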
We also referred to the cophenetic correlation co-efficient. This comes in useful for evaluating agglomerative hierarchical clustering.
The cophenetic distance between two objects is the proximity at which an agglomerative hierarchical clustering technique puts the objects in the same cluster for the first time. For example, if the smallest distance between two clusters that are merged is 0.1, then all the points in one cluster have a cophenetic distance of 0.1 with respect to the points in the other cluster. Since different hierarchical clusterings of a set of points merge the clusters differently, the cophenetic distances between pairs of objects vary from one clustering to another. In a cophenetic distance matrix, the entries are the cophenetic distances between each pair of objects. Notice that the cophenetic distance matrix has zero values along the diagonal. The cophenetic correlation co-efficient is the correlation between the entries of this matrix and the original dissimilarity matrix.
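SciPy provides this directly; a minimal sketch on made-up points, assuming single-link agglomerative clustering:

import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Hypothetical 2-D points forming two loose groups.
rng = np.random.default_rng(2)
points = np.vstack([rng.normal(0, 0.2, size=(15, 2)),
                    rng.normal(3, 0.2, size=(15, 2))])

original = pdist(points)                  # original dissimilarities
Z = linkage(original, method="single")    # agglomerative hierarchical clustering
c, coph_dists = cophenet(Z, original)     # coefficient and cophenetic distances

print(c)   # close to 1 when the dendrogram preserves the original distances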


Sunday, December 1, 2013

We talked about evaluating clusters in the previous post and briefly introduced the various categories of measures. We can look into a few details now. For partitional clustering, we have measures of cohesion and separation based on a proximity matrix. For hierarchical clustering, we have the cophenetic correlation coefficient. We can also look into entropy, purity and the Jaccard measure. In the former case, we take the proximity to be the squared Euclidean distance. For cohesion, we aggregate the distances between the points and their centroid. For separation, we take the link distances, whether minimum, maximum or average. The relationship between cohesion and separation is written as
TSS = SSE + SSB
where TSS is the total sum of squares.
SSE is the sum of squared error
and SSB is the between group sum of squares, the higher the total SSB, the more separated the clusters are.
Another thing to note here is that minimizing SSE (cohesion) automatically results in maximizing SSB (separation). This can be seen as follows. The total sum of squares is the sum of the squared distance of each point x from the overall mean c, i.e. the sum of (x - c)^2. We can write (x - c) as (x - ci) + (ci - c), where ci is the centroid of the cluster containing x. When we square and sum over all points, the cross terms cancel out, leaving the sum of (x - ci)^2, which is SSE, plus the sum of (ci - c)^2 over the points, which is SSB. Since TSS is fixed for a given data set, minimizing one term maximizes the other.
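A quick numerical check of TSS = SSE + SSB on a tiny made-up one-dimensional data set:

import numpy as np

# Hypothetical 1-D data split into two clusters.
clusters = [np.array([1.0, 2.0, 3.0]), np.array([8.0, 9.0, 10.0])]
all_points = np.concatenate(clusters)
c = all_points.mean()                       # overall mean

tss = np.sum((all_points - c) ** 2)

sse = sum(np.sum((pts - pts.mean()) ** 2) for pts in clusters)
ssb = sum(len(pts) * (pts.mean() - c) ** 2 for pts in clusters)

print(tss, sse + ssb)   # the two values agree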
We have so far talked about using these measures for all of the clusters together. In fact, these measures can also be used for individual clusters. For example, a cluster with a high degree of cohesion may be considered better than a cluster with a lower value. A cluster with a lower value is also a candidate for being split to yield better clusters. Thus clusters can be ranked and processed based on these measures. Having talked about individual clusters, we can also rank objects within a cluster based on these measures. An object that contributes more to cohesion is likely to be near the core of the cluster, while the others are likely to be near the edge.
Lastly, we talk about the silhouette co-efficient. This combines both cohesion and separation. It is computed in three steps.
For the ith object, calculate its average distance to all other objects in its cluster and call it ai.
For the ith object, and any cluster not containing the object, calculate the object's average distance to all the objects in the given cluster. Use the minimum value and call it bi
For the ith object, the silhouette coefficient is given by (bi-ai)/max(ai, bi)
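A minimal sketch of these three steps on made-up points and labels; the overall value is just the average of the per-point coefficients, as noted above.

import numpy as np

# Hypothetical points and cluster labels.
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [3.0, 3.0], [3.1, 2.9], [2.9, 3.1]])
labels = np.array([0, 0, 0, 1, 1, 1])

dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

def silhouette(i):
    # a_i: average distance to the other objects in the same cluster
    same = (labels == labels[i])
    a_i = dist[i, same & (np.arange(len(points)) != i)].mean()
    # b_i: minimum, over the other clusters, of the average distance to them
    b_i = min(dist[i, labels == c].mean() for c in set(labels) if c != labels[i])
    return (b_i - a_i) / max(a_i, b_i)

coeffs = [silhouette(i) for i in range(len(points))]
print(coeffs, np.mean(coeffs))   # per-point values and the overall average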