Sunday, October 27, 2013


In the previous two posts, we looked at K-means clustering and improving its performance on large data sets. K-means is very useful for forming a known number K of clusters, and we can also choose what we want to cluster. For example, we can bisect the data points into two clusters, and then bisect one of the clusters further if it does not satisfy some threshold we set; the threshold could be based on distance or density. If we keep bisecting a topic cluster recursively because it is larger than we expected, we end up with a binary tree of clusters. In topic analysis, these clusters correspond to topics and subtopics without being restricted to the syntax or layout of the text. Text segmentation, on the other hand, relies on the layout and generates segmentation blocks based on a stochastic topic model. Here the topics are rudimentarily spotted from the text and formed into word clusters, and word clusters sharing common words are merged together, giving us independent topics. Next, we follow the text to see how similar each preceding block is to its succeeding block, and wherever there are valleys, that is low similarity values, we can segment the text. Contrast this with the bisecting method, where the granularity of the segmentation can be left to the user's discretion, with as much or as little segmentation as desired. Thus clustering can be applied in both divisive (bisecting) and agglomerative ways.
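To make the bisecting idea concrete, here is a minimal sketch, assuming scikit-learn's KMeans is available and using an illustrative size threshold for the stopping test (a distance- or density-based test could be substituted):

import numpy as np
from sklearn.cluster import KMeans

def bisect(points, max_cluster_size=50):
    # Stop splitting once a cluster satisfies the threshold; here, its size.
    if len(points) <= max_cluster_size:
        return [points]
    km = KMeans(n_clusters=2, n_init=10).fit(points)
    left, right = points[km.labels_ == 0], points[km.labels_ == 1]
    if len(left) == 0 or len(right) == 0:   # degenerate split, keep as a leaf
        return [points]
    # The recursion tree of these splits is the binary tree of topics/subtopics.
    return bisect(left, max_cluster_size) + bisect(right, max_cluster_size)

clusters = bisect(np.random.rand(500, 20))   # e.g. 500 documents, 20 features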
For the purposes of keyword extraction, we rely on term weighting. Here the weights are determined by probabilistic latent semantic analysis or some other technique, using distances based on cosine similarity, Jensen-Shannon divergence, or the feature weights we discussed in the previous posts. We could also consider Kullback-Leibler distance based background suppression to extract keywords.
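As one illustration of the last idea, a term can be scored by its pointwise contribution to the KL divergence between the document's term distribution and a background distribution; terms that stand out against the background make good keyword candidates. This is a hedged sketch with an illustrative smoothing constant, not the specific weighting scheme of the earlier posts:

from collections import Counter
import math

def term_distribution(tokens, vocab, alpha=0.01):
    # Smoothed unigram distribution over a fixed vocabulary.
    counts = Counter(tokens)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_keyword_scores(doc_tokens, background_tokens):
    vocab = set(doc_tokens) | set(background_tokens)
    p = term_distribution(doc_tokens, vocab)         # document model
    q = term_distribution(background_tokens, vocab)  # background model
    # Each term's contribution to KL(p || q); large values mean the term is
    # far more prominent in the document than in the background.
    return {w: p[w] * math.log(p[w] / q[w]) for w in vocab}

scores = kl_keyword_scores("the cat sat on the mat cat".split(),
                           "the a on of and the".split())
keywords = sorted(scores, key=scores.get, reverse=True)[:3]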
Meanwhile, we have yet to look at several other implementation considerations, such as the following:
Lemmatization - stemming and lemmatization can help collapse the various forms in which words appear, be it noun, verb or adjective form.
Text tagging - could possibly be done with TreeTagger
MultiWord Lookup - We may have to look at collocations and words that appear in titles.
Named-entity recognition and filtering - proper-noun handling and stop-word filtering will help improve the data.
All of the above are necessary before topic/keyword extraction or term weighting; a small preprocessing sketch follows.
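Here is a rough preprocessing sketch covering lemmatization and stop-word filtering, assuming NLTK with its wordnet and stopwords corpora already downloaded; part-of-speech tagging (e.g. with TreeTagger), multiword lookup and named-entity filtering would slot in where the comment indicates:

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    tokens = re.findall(r"[A-Za-z]+", text.lower())      # crude tokenizer
    tokens = [lemmatizer.lemmatize(t) for t in tokens]   # collapse word forms
    # POS tagging, multiword/collocation lookup and NER filtering would go here.
    return [t for t in tokens if t not in stop_words]    # stop-word filtering

print(preprocess("The clusters were bisected into smaller topics."))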
We talked about summarizing data points when clustering very large data sets. This approach is common to many algorithms that improve clustering, and today we will discuss a simple summarizing technique called BFR for the K-means clustering we discussed earlier.
Data points are read into memory, and here too the initial k centroids are chosen by some simple approach. For example, we can take a sample, pick a random point, and then pick k-1 more points, each as far as possible from the previously chosen points.
We maintain three sets of data points, called the discard set, the compression set, and the retained set. The discard set contains points close enough to a centroid to be summarized. The compression set contains groups of points that are close together but not close to any centroid; they are summarized but not assigned to a cluster. The retained set contains the isolated points.
The discard set is summarized by the number of points N, together with:
The vector SUM, whose ith component is the sum of the co-ordinates of the points in the ith dimension.
The vector SUMSQ, whose ith component is the sum of the squares of the co-ordinates in the ith dimension.
So 2d + 1 values represent any number of points, where d is the number of dimensions.
The centroid in the ith dimension is calculated as SUMi / N.
And the variance in the ith dimension is calculated as (SUMSQi / N) - (SUMi / N)^2.
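A minimal sketch of this per-cluster summary, with N, SUM and SUMSQ kept incrementally (the class and method names are illustrative):

import numpy as np

class ClusterSummary:
    def __init__(self, d):
        self.n = 0
        self.sum = np.zeros(d)     # SUMi  : componentwise sum of co-ordinates
        self.sumsq = np.zeros(d)   # SUMSQi: componentwise sum of squares

    def add(self, point):
        self.n += 1
        self.sum += point
        self.sumsq += point ** 2

    def centroid(self):
        return self.sum / self.n                               # SUMi / N

    def variance(self):
        return self.sumsq / self.n - (self.sum / self.n) ** 2  # per dimension

summary = ClusterSummary(d=2)
for p in np.array([[1.0, 2.0], [3.0, 4.0], [2.0, 0.0]]):
    summary.add(p)
print(summary.centroid(), summary.variance())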
This summarization data is maintained as we process the data points from the very large data set.
The data points are processed in the following manner:
Data points that are sufficiently close to a cluster centroid are added to that cluster and its discard set.
Cluster the remaining points together with the old retained set. Clusters that form go to the compression set and outliers go to the retained set. Since these clusters are not assigned to any centroid yet, we do not add them to the discard set.
The summary data for these clusters is then updated with the inclusion of the new points. Some compression-set clusters can even be merged with each other. If this is the last round, merge everything in the compression set and the retained set into the nearest cluster.
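A sketch of the assignment pass in one such round, reusing the ClusterSummary sketch above (each summary is assumed to be seeded with at least one point); the distance threshold is an illustrative parameter, and the secondary clustering that forms the compression set is only indicated by the final comment:

import numpy as np

def process_chunk(chunk, summaries, retained, threshold=2.0):
    # Points close enough to a centroid are folded into that cluster's discard set.
    leftovers = list(retained)
    for p in chunk:
        centroids = np.array([s.centroid() for s in summaries])
        dists = np.linalg.norm(centroids - p, axis=1)
        i = int(np.argmin(dists))
        if dists[i] < threshold:
            summaries[i].add(p)          # summarized into the discard set
        else:
            leftovers.append(p)          # candidate for compression/retained set
    # Next: cluster the leftovers; tight mini-clusters become compression-set
    # summaries, and the isolated points form the new retained set.
    return summaries, leftovers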

Saturday, October 26, 2013

We discussed SKWIC and Fuzzy SKWIC, in that order. I wanted this post to lean toward implementation, starting from the basics. Actual implementation details may follow subsequently, but in this post we start with the basics.
Given a set of tuples with a fixed set of attributes, we can implement different clustering algorithms. Let us pick something really simple, like classifying each tuple into one of three categories, say X, Y and Z. Let's say the tuples represent points with (x, y) co-ordinates. On a graph with x and y axes, these points appear in the familiar Cartesian plane. Visualizing the points on this graph helps us see the clusters X, Y and Z as separate from each other. It also lets us calculate the distance between any two points as the square root of the sum of the squares of the differences in their x and y co-ordinates.
One approach is to use a centroid - the average of the points in a cluster - and represent the cluster with that centroid. This works in our case since we only need to keep track of three centroids for the three categories we want our points in. This is called K-means, and in our case K = 3.
So the steps involved are something like this (a short sketch follows the steps):
1) Initialize the clusters by picking three points as centroids. Pick one point randomly, then the other two points as far away as possible from the previously picked points.
2) For each point, place it in the nearest cluster, i.e. the cluster whose centroid is at minimum distance, and recompute the centroid of that cluster.
3) After all the points are assigned, fix the centroids of these three clusters.
4) Re-assign all the points to their closest centroid.
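Here is a bare-bones sketch of these steps in plain numpy; the iteration count and random seed are illustrative choices:

import numpy as np

def kmeans(points, k=3, iterations=20, seed=0):
    rng = np.random.default_rng(seed)
    # 1) pick one centroid at random, then repeatedly take the farthest point
    centroids = [points[rng.integers(len(points))]]
    while len(centroids) < k:
        dists = np.min([np.linalg.norm(points - c, axis=1) for c in centroids], axis=0)
        centroids.append(points[np.argmax(dists)])
    centroids = np.array(centroids)
    for _ in range(iterations):
        # 2, 4) assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # 3) recompute the centroid of each cluster
        new_centroids = []
        for i in range(k):
            members = points[labels == i]
            new_centroids.append(members.mean(axis=0) if len(members) else centroids[i])
        centroids = np.array(new_centroids)
    return labels, centroids

labels, centroids = kmeans(np.random.rand(300, 2))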
Sometimes K is not clear initially, and we may need to try different values of k and watch how the average distance to the centroid changes. As K increases, this distance drops rapidly until, beyond the right k, it changes very little.
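A quick way to eyeball the right K, reusing the kmeans sketch above, is to print the average distance to the nearest centroid for a range of K values and look for where the drop flattens out:

import numpy as np

data = np.random.rand(300, 2)
for k in range(1, 8):
    labels, centroids = kmeans(data, k=k)
    avg = np.mean(np.linalg.norm(data - centroids[labels], axis=1))
    print(k, round(avg, 3))   # the elbow marks a reasonable K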
This is K-means clustering. Next we will look at optimizing memory and CPU.
For now, we can keep initialization cheap: pick points at random and re-select if they are not far enough apart.
In general, when there are lots of tuples, we load them in batches that fit in memory. Then we maintain three classes of points and use summarization so that we can represent them with compact data structures and continue our clustering.
For each cluster, the summarization info includes the number of points N and the vectors whose ith components are the sum of the co-ordinates of the points in the ith dimension and the sum of the squares of the co-ordinates in the ith dimension.
We now summarize the Fuzzy SKWIC algorithm below:
Fix the number of clusters C;
Fix m, the fuzzifier; m can range from 1 to infinity. Since m amplifies the fuzzy memberships, a suitable value is chosen for that purpose.
Initialize the centers by randomly selecting C documents.
Initialize the fuzzy partition matrix U = C x N matrix = [uij], where uij is the fuzzy membership, subject to the following constraints: a) uij is bounded between 0 and 1, b) the sum of uij over j = 1 to N is bounded between 0 and N, and c) the sum of uij over the clusters i = 1 to C must equal 1.
REPEAT
1) Find the cosine distance along each individual dimension k as 1/n - xjk.cik, where cik is the kth component of the ith cluster center vector. Do this for i ranging from 1 to C, j ranging from 1 to N and k ranging from 1 to n (n is the total number of terms in the collection of N documents).
2) Before we find the aggregated cosine distance, update the relevance weights vik using the solution obtained from the partial derivatives, where vik is a default value of 1/n plus a bias involving the fuzzy memberships and the aggregated differences in cosine distances.
3) Now compute the aggregated cosine-based distance for each cluster i from 1 to C and each document j from 1 to N by weighting the cosine distances along the individual dimensions with the relevance weights and summing them.
4) If the aggregated cosine-based distance from above turns out to be negative, adjust it so that it does not affect the sign of the fuzzy memberships. We adjust it using the minimum magnitude of the aggregated cosine-based distances.
5) Update the partition matrix U: for each document j, take the ratio of the aggregated cosine-based distances Dwcij and Dwckj raised to the power 1/(m-1), sum this ratio over the clusters k, and invert the result.
6) Update the centers by taking the fuzzy memberships into account. Specifically, set cik to 0 when the relevance weight is zero and otherwise set it to the fuzzy-membership-weighted average of xjk. The fuzzy membership is amplified by raising it to the power m and aggregated over the documents j ranging from 1 to N in both the numerator and the denominator that normalizes the document frequency vector.
7) Update tau for the ith cluster using the cosine-based distances along the individual dimensions weighted by the feature weights and summed over all the terms, divided by the sum of the squares of the feature weights, and multiplied by the sum of the amplified fuzzy memberships and a constant K. The value of tau for the current iteration uses the values of the fuzzy memberships, the feature weights and the cluster centers cik from the previous iteration.
(Continue the iterations) UNTIL centers stabilize.
Note that we have used the fuzzy memberships in the initialization and in steps 2 and 4 through 7. So Fuzzy SKWIC improves the partition matrix, the center updates and the choice of tau in each iteration.
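Since the steps above are in prose only, here is a rough numpy sketch of the loop. The exact form of the bias, the handling of negative distances and the tau update are my reading of steps 1 through 7 (and of the Fuzzy SKWIC literature), so treat them as assumptions to verify rather than the authoritative equations:

import numpy as np

def fuzzy_skwic(X, C=3, m=2.0, K=0.1, iterations=30, seed=0):
    N, n = X.shape
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(N, size=C, replace=False)]   # C random documents
    v = np.full((C, n), 1.0 / n)                        # relevance weights
    u = rng.random((C, N))
    u /= u.sum(axis=0)                                  # fuzzy partition matrix
    tau = np.ones(C)
    eps = 1e-9
    for _ in range(iterations):
        # 1) per-dimension cosine distances: Dk[i, j, k] = 1/n - xjk * cik
        Dk = 1.0 / n - centers[:, None, :] * X[None, :, :]
        um = u ** m                                     # amplified memberships
        # 2) relevance weights: default 1/n plus a membership-weighted bias
        Dold = np.einsum('ik,ijk->ij', v, Dk)           # aggregated, old weights
        bias = np.einsum('ij,ijk->ik', um, Dold[:, :, None] / n - Dk)
        v = 1.0 / n + bias / (2.0 * tau[:, None] + eps)
        # 3) aggregated cosine-based distances with the new weights
        D = np.einsum('ik,ijk->ij', v, Dk)
        # 4) shift negative distances so the memberships keep their sign
        if D.min() < 0:
            D = D + abs(D.min())
        D = D + eps
        # 5) Fuzzy C-Means style membership update
        ratio = (D[:, None, :] / D[None, :, :]) ** (1.0 / (m - 1.0))
        u = 1.0 / ratio.sum(axis=1)
        um = u ** m
        # 6) centers: membership-weighted averages, zeroed where vik is zero
        centers = (um @ X) / (um.sum(axis=1, keepdims=True) + eps)
        centers[v <= 0] = 0.0
        # 7) tau for the next pass (the post prescribes previous-iteration values)
        tau = K * (um * D).sum(axis=1) / ((v ** 2).sum(axis=1) + eps)
    return u, centers, v

docs = np.random.default_rng(1).random((100, 50))           # toy document-term matrix
docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)   # unit-length documents
u, centers, v = fuzzy_skwic(docs, C=3)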

Friday, October 25, 2013


We will continue to solve our objective function for the overlapping clusters so that documents can have fuzzy labels. Our Fuzzy SKWIC objective function has components similar to the one we discussed earlier, i.e. the first component is the error to the clusters and the second component involves the feature weights. In addition, the first component now has the fuzzy membership degrees aggregated over the various clusters and terms. The objective function is subject to the condition that the feature weights range between 0 and 1 and that the sum of all feature weights equals 1. The second component has a tau parameter that lets us tune the function so that both components have the same order of magnitude; as we mentioned, this is required to keep both components equally relevant. Each cluster has its own set of feature weights Vi = vi1, ..., vin.
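Written out (this transcription is mine, so the exact form should be checked against the Fuzzy SKWIC paper), the objective function and its constraint look like:

J = \sum_{i=1}^{C} \sum_{j=1}^{N} u_{ij}^{m} \sum_{k=1}^{n} v_{ik} \left( \frac{1}{n} - x_{jk} c_{ik} \right) + \sum_{i=1}^{C} \tau_i \sum_{k=1}^{n} v_{ik}^{2}

subject to v_{ik} \in [0, 1] and \sum_{k=1}^{n} v_{ik} = 1 for each cluster i.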
Now we use the same Lagrange multiplier technique to optimize the objective function with respect to V. Since the rows of V are independent of each other, we reduce the objective function to C independent problems.
Then we obtain the update equations by setting the gradient of the objective function to zero, taking partial derivatives with respect to both the multiplier and the feature weights.
Solving these equations for the feature weight vik, we get two terms: the first is the default value, again 1/n in this case, and the second is the bias, which now has the fuzzy membership degree included in the aggregated dot product.
The choice of tau is very important since it reflects the importance of the second term relative to the first. Choosing a very small tau will result in only one keyword per cluster getting feature weight 1, whereas choosing a larger value will result in all the words in a cluster being relevant and assigned equal weights.
Since the second component of the objective function does not depend on the fuzzy memberships, the update equation for the fuzzy membership is similar to that of Fuzzy C-Means: the inverse of the cluster-wise sum of the ratios of the aggregated cosine-based distances raised to the power 1/(m-1).
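In symbols, writing D~wc,ij for the (adjusted) aggregated cosine-based distance, the update takes the familiar Fuzzy C-Means form (the exponent is my transcription and worth double-checking):

u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \tilde{D}_{wc,ij} / \tilde{D}_{wc,kj} \right)^{1/(m-1)}}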



When we discussed how SKWIC works in the previous post, we described a cluster partitioning method that assigns a document to a single cluster. However, it is rarely the case that a document is about a single subject only. Most documents tend to straddle the subjects of two or more different clusters, and even manual classification is difficult and error-prone. When a document is classified into a single group, it limits retrieval once a classification model is built. K-means and SKWIC are both hard partitioning models, and by confining each document to a single category, they have limited use in real, large document collections.
Instead, if we assign soft labels to documents, that is, the documents are not confined to a single category, then this can be improved. When we use fuzzy or soft partition models, the data is modeled better than with their hard partitioning counterparts. This is because fuzzy memberships are richer than crisp memberships: they describe the degrees of association of data points lying in overlapping clusters.
If we denote by uij the varying degree to which a data point xj belongs to cluster i, then the fuzzy partition is described by a C x N matrix U = [uij]. This matrix is subject to the following constraints:
1. The degrees of association uij can range only between 0 and 1. This means we assign them continuous values bounded by 0 and 1.
2. The sum of the degrees of association uij over all N documents must range between 0 and N.
3. The sum of the degrees of association uij for all the clusters must equal 1.
We define a new objective function whose first component takes these fuzzy memberships together with the sum of the errors to the cluster centers. This component is minimized when only one keyword in each cluster is completely relevant and all other keywords are irrelevant. The other component is the sum of the squared keyword weights; its global minimum is achieved when all the keywords are equally weighted. The goal of the objective function is to reduce the sum of the intra-cluster weighted distances.
We derive similar update equations, by setting the partial derivatives to zero, to solve this objective function in the same manner as the previous one.

Thursday, October 24, 2013

In the previous post, we talked about finding tau and adjusting it so that the two components of the objective function have the same order of magnitude. What we will see next is that the cluster partition that minimizes J is the one that assigns each data sample to the cluster with the nearest center. In other words, a document xj is assigned to cluster i when its weighted aggregated cosine-based distance Dwcij is no greater than the corresponding distance to any other cluster center. Ties are resolved arbitrarily among the clusters that vie for a document.
Since the objective function cannot be minimized directly with respect to the centers, we compute new cluster centroids and normalize them to unit length.
Notice that this results in two cases for the feature weight vik which, as we may remember from the previous post, is the default value (1/n) plus a bias. The bias can be either positive or negative.
So the first case is when vik = 0.
This means that the kth feature is completely irrelevant relative to the ith cluster. Hence, regardless of the value of cik, the value of this feature will not contribute to the overall weighted distance computation. Therefore, in this situation, any arbitrary value can be chosen for cik and is generally set to 0.
The second case is when vik != 0
In this case the center is recomputed: cik is set to the normalized aggregated sum of the document frequency vectors along that attribute direction. In other words, when the kth feature has some relevance to the ith cluster, we simply take (roughly) the average of the member documents along that dimension.
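A tiny illustration of the two cases, assuming numpy arrays X (N x n documents), labels (crisp cluster assignments) and v (C x n relevance weights); the names are illustrative, not from the post:

import numpy as np

def update_centers(X, labels, v, C):
    n = X.shape[1]
    centers = np.zeros((C, n))
    for i in range(C):
        members = X[labels == i]
        if len(members) == 0:
            continue                                    # leave empty clusters at zero
        mean = members.mean(axis=0)                     # aggregated document frequencies
        mean = mean / (np.linalg.norm(mean) + 1e-9)     # normalize to unit length
        centers[i] = np.where(v[i] > 0, mean, 0.0)      # case vik = 0 gives cik = 0
    return centers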
Now, let us look at summarizing the above steps for clustering a collection of N normalized document vectors defined over a vocabulary of n keywords:
Fix the number of clusters C
Initialize the centers by randomly selecting C documents
Initialize the partition X, using equal feature weights (1/n), where n is the total number of terms in the collection of N documents.
REPEAT
Compute the cosine-based distance between document xj and the ith cluster center vector using the formula 1/n - xjk.cik, where cik is the kth component of the ith cluster center vector. Repeat this for every cluster i from 1 to C, every document j from 1 to N and every dimension k from 1 to n.
Update the relevance weights vik using the default value (1/n) plus the bias, which compares the aggregated distance across all dimensions against the distance along this dimension.
Compute the weighted aggregated sum of the cosine-based distances along the individual dimensions (the dot product of the feature weights and the cosine-based distances).
Update the cluster partitioning with the method discussed above
Update the tau
UNTIL centers stabilize
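The relevance-weight update used above, written out for the crisp case (again my transcription of SKWIC, so worth verifying against the paper), with Xi denoting the set of documents currently assigned to cluster i:

v_{ik} = \frac{1}{n} + \frac{1}{2\tau_i} \sum_{x_j \in X_i} \left[ \frac{D_{wc,ij}}{n} - D^{k}_{wc,ij} \right],
\qquad D^{k}_{wc,ij} = \frac{1}{n} - x_{jk} c_{ik},
\qquad D_{wc,ij} = \sum_{k=1}^{n} v_{ik} D^{k}_{wc,ij}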
The difference between some similar approaches in pattern recognition and this one is that the former assume the data has a multivariate Gaussian distribution, whereas this treats the attributes as independent.