Saturday, October 26, 2013

We discussed SKWIC and Fuzzy SKWIC, in that order. I want this post to work toward an implementation, starting from the basics. Actual implementation details may follow in subsequent posts, but here we begin with the fundamentals.
Given a set of tuples with a fixed set of attributes, we can implement different clustering algorithms. Let us pick something really simple, like classifying each tuple into one of three categories, say X, Y and Z. Let's say the tuples represent points with (x, y) coordinates. Plotted against x and y axes, these points sit in the familiar Cartesian plane. Visualizing the points on this graph helps us see the clusters X, Y and Z as separate from each other. It also helps us calculate the distance between any two points as the square root of the sum of the squared differences of their x and y coordinates.
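In symbols, for two points p = (p_x, p_y) and q = (q_x, q_y), that distance is the usual Euclidean distance:

```latex
d(p, q) = \sqrt{(p_x - q_x)^2 + (p_y - q_y)^2}
```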
One approach is to use a centroid, a single point that is central to the points in a cluster (typically their mean), and to represent the cluster by that centroid. This works in our case since we only need to keep track of three points for the three categories we want our points in. This is called K-means, and in our case K = 3.
So the steps involved are something like this:
1) Initialize the clusters by picking three points as centroids. Pick one point randomly, then each of the other two points as far away as possible from the previously chosen points.
2) For each point, place it in the cluster whose centroid is nearest, i.e. the cluster for which the distance to the centroid is minimal. Recompute the centroid for that cluster.
3) After all the points are assigned, fix the centroids of these three clusters.
4) Re-assign all the points to their closest centroid.
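A minimal sketch of these steps in Python, assuming two-dimensional points and K = 3; the function names (euclidean, init_centroids, kmeans) are illustrative, not from any particular library.

```python
import math
import random

def euclidean(p, q):
    """Distance between two (x, y) points."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def init_centroids(points, k=3):
    """Step 1: one random point, then each next centroid as far as possible from those already chosen."""
    centroids = [random.choice(points)]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(euclidean(p, c) for c in centroids))
        centroids.append(farthest)
    return centroids

def kmeans(points, k=3, iterations=10):
    """Steps 2-4: assign each point to the nearest centroid, then recompute the centroids."""
    centroids = init_centroids(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                                   # step 2: nearest centroid wins
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [                                      # steps 3-4: fix the centroids, reassign next pass
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters
```

In practice the loop would stop when the assignments no longer change, rather than after a fixed number of iterations.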
Sometimes the right K is not clear initially, and we may need to try different values of K and observe the changes in the average distance to the centroid. As K increases, this average distance drops rapidly, until beyond the right K it changes very little.
This is K-means clustering. Next we will look at optimizing memory and CPU.
For now, we can assume the initialization cost is negligible: pick the centroids at random and re-pick if they turn out not to be far enough apart.
In general, when there are lots of tuples, we load them in batches that fit in memory. Then we maintain the three classes of points and use summarization, so that each cluster can be represented with a shorter data structure while we continue our clustering.
For each cluster, the summarization info includes the number of points N, a vector holding the sum of the coordinates of the points in the ith dimension, and a vector holding the sum of the squares of the coordinates in the ith dimension.
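A small sketch of such a summary structure, assuming plain Python lists; the class and field names are illustrative. The point is that the count, the per-dimension sums, and the per-dimension sums of squares are enough to recover the centroid and the variance without keeping the points themselves.

```python
class ClusterSummary:
    """Compact per-cluster statistics (a sketch; names are illustrative)."""

    def __init__(self, dims):
        self.n = 0
        self.sums = [0.0] * dims         # sum of coordinates in each dimension
        self.sum_squares = [0.0] * dims  # sum of squared coordinates in each dimension

    def add(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.sums[i] += x
            self.sum_squares[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.sums]

    def variance(self):
        return [sq / self.n - (s / self.n) ** 2 for s, sq in zip(self.sums, self.sum_squares)]
```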
We now summarize the Fuzzy SKWIC algorithm below:
Fix the number of clusters C;
Fix m, the fuzzifier; m can range from 1 to infinity. It amplifies the fuzzy memberships, so a suitable value can be chosen for the desired degree of fuzziness.
Initialize the centers by randomly selecting C documents.
Initialize the fuzzy partition matrix U, a C x N matrix [uij] where uij is the fuzzy membership of document j in cluster i, subject to the following constraints: a) uij is bounded between 0 and 1; b) the sum of uij for j = 1 to N lies between 0 and N; and c) the sum of uij over clusters 1 to C must equal 1.
REPEAT
1) Find the cosine-based distance along each individual dimension k as 1/n - xjk.cik, where cik is the kth component of the ith cluster center vector. Do this for i ranging from 1 to C, j ranging from 1 to N and k ranging from 1 to n (n is the total number of terms in the collection of N documents).
2) Before we find the aggregated cosine distance, update the relevance weights vik using the closed-form solution obtained from the partial derivatives: vik is a default value of 1/n plus a bias involving the amplified fuzzy memberships and the aggregated differences in cosine distances.
3) Now compute the aggregated cosine-based distance for each cluster i from 1 to C and each document j from 1 to N by weighting the cosine distances along the individual dimensions with the relevance weights and summing them.
4) If the aggregated cosine based distance from above turns out to be negative, adjust it so that it does not affect the sign of the fuzzy memberships. We adjust by updating it with the minimum magnitude of the aggregated cosine based distance
5) Update the partition matrix U: for each document j and cluster i, take the ratio of the aggregated cosine-based distances Dwcij and Dwckj, raise it to the power 1/(m - 1), sum these ratios over the clusters k, and invert the result to obtain the updated membership.
6) Update the centers by taking the fuzzy memberships into account. Specifically, we set cik to 0 when the relevance weight is zero; otherwise we set it to the fuzzy-membership-weighted average of xjk. The fuzzy membership is amplified by raising it to the power m and summed over the documents j ranging from 1 to N in both the numerator and the denominator, normalizing the document frequency vector.
7) Update the tau for the ith cluster: for each document, take the cosine-based distances along the individual dimensions weighted by the feature weights and summed over all the terms; weight each document's sum by its amplified fuzzy membership and sum over the documents; divide by the sum of the squares of the feature weights; and multiply by a constant K. The values of the fuzzy memberships, feature weights and cluster centers cik used here come from the previous iteration.
(Continue the iterations) UNTIL centers stabilize.
Note that the fuzzy memberships are used in the initialization and in steps 2, 4, 5, 6 and 7. So Fuzzy SKWIC refines the feature weights, the partition matrix, the center updates and the choice of tau in each iteration.
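As a rough illustration of steps 5 and 6, here is a numpy sketch, assuming the aggregated distances have already been computed and adjusted to be positive; the array names (D, U, X, V) and the 1/(m - 1) exponent follow the standard Fuzzy C-Means form rather than any particular implementation.

```python
import numpy as np

def update_memberships(D, m=2.0):
    """Step 5 (sketch): u_ij = 1 / sum_t (D_ij / D_tj) ** (1 / (m - 1)).

    D is a C x N array of adjusted (positive) aggregated cosine-based distances.
    """
    ratios = (D[:, None, :] / D[None, :, :]) ** (1.0 / (m - 1.0))  # [i, t, j] = D_ij / D_tj
    return 1.0 / ratios.sum(axis=1)                                # C x N partition matrix

def update_centers(U, X, V, m=2.0):
    """Step 6 (sketch): fuzzy-membership-weighted average of the document vectors.

    U is C x N (memberships), X is N x n (document frequency vectors), V is C x n (feature weights).
    """
    Um = U ** m                                    # amplified memberships
    centers = (Um @ X) / Um.sum(axis=1, keepdims=True)
    centers[V == 0] = 0.0                          # set c_ik to 0 where the relevance weight is 0
    return centers
```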

Friday, October 25, 2013


We will continue to solve our objective function for overlapping clusters so that documents can have fuzzy labels. Our Fuzzy SKWIC objective function has components similar to the one we discussed earlier, i.e. the first component is the error to the clusters and the second component is the feature weights. In addition, the first component now has the fuzzy membership degrees, aggregated over the clusters and terms. The objective function is subject to the condition that the feature weights range between 0 and 1 and that the sum of all feature weights equals 1. The second component has a tau parameter that lets us tune the function so that both components have equal orders of magnitude. This, we mentioned, is required to keep both components equally relevant. Each cluster has its own set of feature weights Vi = vi1, ..., vin.
Now we use the same Lagrange multiplier technique to optimize the objective function with respect to V. Since the rows of V are independent of each other, we reduce the objective function to C independent problems.
Then we obtain the equations to solve by setting the gradient of the objective function to zero, taking partial derivatives with respect to both the multiplier and the feature weights.
Solving these equations for the feature weight vik, we again get two terms: the first is the default value, 1/n in this case, and the second is the bias, which now has the fuzzy membership degrees included in the aggregated dot products.
The choice of tau is very important since it reflects the importance of the second term relative to the first term. Choosing a small tau results in only one keyword per cluster with feature weight 1, whereas choosing a larger value results in all the words in a cluster being relevant and assigned equal weights.
Since the second component of the objective function does not depend on the fuzzy membership, the update equation for the fuzzy membership is similar to that of Fuzzy C-Means: the inverse of the cluster-wise sum of the ratios of the aggregated cosine-based distances, each ratio raised to the power 1/(m - 1).
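Assuming the standard Fuzzy C-Means form with the aggregated distances Dwcij, the membership update would be:

```latex
u_{ij} = \left[ \sum_{t=1}^{C} \left( \frac{D_{wc_{ij}}}{D_{wc_{tj}}} \right)^{\frac{1}{m-1}} \right]^{-1}
```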



When we discussed how SKWIC works in the previous post, we described a cluster partitioning method that assigns a document to a single cluster. However, it is rarely the case that a document is about a single subject only. Most documents tend to straddle two or more different subjects, and even manual classification is difficult and error-prone. When a document is classified into a single group, it hurts retrieval once a classification model is built. K-means and SKWIC are both hard partitioning models, and by assigning each document to exactly one category they have limited use on real, large document collections.
Instead, if we assign soft labels to documents, that is, the documents are not confined to a single category, then this can be improved. In particular, fuzzy or soft partition models fit the data better than their hard-partitioning counterparts, because fuzzy memberships are richer than crisp memberships: they describe the degrees of association of data points lying in overlapping clusters.
If we denote by uij the fuzzy membership, i.e. the varying degree to which a data point xj belongs to cluster Xi, then the fuzzy partition is described by a C x N matrix U = [uij]. This matrix is subject to the following constraints:
1. The degrees of association uij can range only between 0 and 1. This means we assign them continuous values bounded by 0 and 1.
2. The sum of the degrees of association uij over all N documents must lie strictly between 0 and N.
3. The sum of the degrees of association uij for all the clusters must equal 1.
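In symbols, the three constraints read:

```latex
0 \le u_{ij} \le 1, \qquad 0 < \sum_{j=1}^{N} u_{ij} < N, \qquad \sum_{i=1}^{C} u_{ij} = 1 .
```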
We define a new objective function where the first component combines these fuzzy memberships with the sum of the errors to the cluster centers. This component is minimized when only one keyword in each cluster is completely relevant and all other keywords are irrelevant. The other component is the sum of the squared keyword weights; its global minimum is achieved when all the keywords are equally weighted. The goal of the objective function is to reduce the sum of the intra-cluster weighted distances.
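Combining the two components with the fuzzy memberships uij and the constraints above, the Fuzzy SKWIC objective can be sketched as follows; this is a sketch assembled from the description, not a verbatim transcription:

```latex
J = \sum_{i=1}^{C} \sum_{j=1}^{N} u_{ij}^{m} \sum_{k=1}^{n} v_{ik}\, D_{wc_{ij,k}}
    \;+\; \sum_{i=1}^{C} \tau_i \sum_{k=1}^{n} v_{ik}^{2},
\qquad v_{ik} \in [0,1], \quad \sum_{k=1}^{n} v_{ik} = 1 .
```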
We obtain and solve similar equations from the partial derivatives of this objective function, in the same manner as the previous one.

Thursday, October 24, 2013

In the previous post, we talked about finding tau and adjusting it so that the two components of the objective function have the same magnitude. What we will see next is that the cluster partition that minimizes J is the one that assigns each data sample to the cluster with the nearest center. In other words, document xj is assigned to cluster i when its weighted aggregated cosine-based distance Dwcij is less than Dwckj for every other cluster k. Ties between candidate centers are resolved arbitrarily.
Since the optimization function cannot be minimized with respect to the centers, we compute new cluster centroids and normalize them to unit-length.
Notice that this results in two cases, depending on the feature weight vik, which we may remember from the previous post is the sum of a default value (1/n) and a bias; the bias can be either positive or negative.
So the first case is when the vik = 0
This means that the kth feature is completely irrelevant relative to the ith cluster. Hence, regardless of the value of cik, the value of this feature will not contribute to the overall weighted distance computation. Therefore, in this situation, any arbitrary value can be chosen for cik and is generally set to 0.
The second case is when vik != 0
In this case the center component cik is recomputed as the normalized aggregated sum of the kth components of the document frequency vectors assigned to the cluster, i.e. their average along that attribute direction. In other words, since the kth feature has some relevance to the ith cluster, we simply pick (nearly) the mid-point.
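Written as a single update, with Xi denoting the set of documents currently assigned to cluster i, the two cases look like this (a sketch consistent with the description above):

```latex
c_{ik} =
\begin{cases}
0 & \text{if } v_{ik} = 0, \\[4pt]
\dfrac{\sum_{x_j \in X_i} x_{jk}}{\lvert X_i \rvert} & \text{if } v_{ik} \neq 0 .
\end{cases}
```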
Now, let us look at summarizing the above steps for clustering a collection of N normalized document vectors defined over a vocabulary of n keywords:
Fix the number of clusters C
Initialize the centers by randomly selecting C documents
Initialize the partition X and use equal feature weights (1/n), where n is the total number of terms in the collection of N documents.
REPEAT
Compute the cosine-based distance between document xj and the ith cluster center vector using the per-dimension formula 1/n - xjk.cik, where cik is the kth component of the ith cluster center vector. Repeat this for every cluster, for every document from 1 to N, and along every dimension from 1 to n.
Update the relevance weights vik using the default value plus the bias, which compares the distance along this dimension with the average distance across all dimensions.
Compute the weighted aggregated sum of the cosine-based distances along the individual dimensions (the dot product of the feature weights and the per-dimension cosine-based distances).
Update the cluster partitioning with the method discussed above
Update the tau
UNTIL centers stabilize
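To make the loop concrete, here is a skeletal numpy sketch under a few assumptions: X is an N x n matrix of unit-normalized document frequency vectors, the constant K_tau and the iteration count are parameters, and the adjustments for negative distances and negative feature weights mentioned elsewhere are omitted. The names are illustrative, not from any library.

```python
import numpy as np

def skwic(X, C, K_tau=0.1, iterations=20, seed=None):
    """Skeleton of the SKWIC loop described above (crisp partitions)."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    centers = X[rng.choice(N, size=C, replace=False)].astype(float)  # C random documents
    V = np.full((C, n), 1.0 / n)                                      # equal feature weights
    tau = np.ones(C)

    for _ in range(iterations):
        # Per-dimension cosine-based distances: D[i, j, k] = 1/n - x_jk * c_ik
        D = 1.0 / n - X[None, :, :] * centers[:, None, :]
        # Weighted aggregated distances: Dwc[i, j] = sum_k v_ik * D[i, j, k]
        Dwc = np.einsum('ik,ijk->ij', V, D)
        # Assign each document to the cluster with the smallest aggregated distance.
        labels = Dwc.argmin(axis=0)

        for i in range(C):
            members = labels == i
            if not members.any():
                continue
            # Feature weights: default 1/n plus the bias term.
            bias = (Dwc[i, members, None] / n - D[i, members, :]).sum(axis=0)
            V[i] = 1.0 / n + bias / (2.0 * tau[i])
            # Centers: mean of the member documents, zeroed where the weight is zero,
            # then normalized to unit length.
            centers[i] = np.where(V[i] == 0, 0.0, X[members].mean(axis=0))
            norm = np.linalg.norm(centers[i])
            if norm > 0:
                centers[i] /= norm
            # Tau: aggregated weighted distances over the squared feature weights, times K_tau.
            tau[i] = K_tau * (V[i] * D[i, members, :].sum(axis=0)).sum() / (V[i] ** 2).sum()

    return centers, V, labels
```

For simplicity this sketch reuses the current iteration's weights and labels in the tau update; the posts above describe using the values from the previous iteration.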
The difference between some similar approaches in pattern recognition and this one is that the former assume the data has a multivariate Gaussian distribution, whereas this approach treats the attributes as independent.
We saw that the objective function is expressed in terms of cluster-center errors and feature weights. Each cluster i is allowed to have its own set of feature weights Vi = [vi1, vi2, ..., vin]. Each feature weight lies between 0 and 1 and, for each cluster, the feature weights sum to one: vi1 + vi2 + ... + vin = 1. If we denote the objective function by J, then by setting the gradient of J to zero we get the equations we want to solve. The partial derivatives are taken with respect to the Lagrange multiplier in the first equation and with respect to the feature weight in the second equation.
The objective function has reached its final state when it does not change with the multiplier or the feature weight. That is why we set the gradient to zero and then solve for the feature weight vik.
Solving this equation gives a closed-form expression for vik; this derivation needs to be done only once. The resulting expression for vik has two parts. The first part is a constant, 1/n, and is the default value when all attributes/keywords are treated equally and no discrimination is performed. The second part is a bias that can be either positive or negative. It comprises the aggregated difference between the average and the actual cosine-based distances along the individual dimensions. It is positive for compact attributes, where the distance along this dimension is, on average, less than the total distance using all of the dimensions. If an attribute is very compact, it is much more relevant to the cluster. Since the individual cosine-based distance along a term-wise dimension can be negative, such dimensions show up as negative values, and the resulting bias only increases, emphasizing that dimension further. If the total aggregate dissimilarity becomes negative, the document no longer forms part of the cluster, since we partition the data based on minimum distance. Thus we have obtained the attribute weight vik and can turn to the second component of the objective function.
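A form of this update consistent with the description, with Xi the set of documents in cluster i and Dwcij the weighted aggregated distance, would be:

```latex
v_{ik} = \frac{1}{n} \;+\; \frac{1}{2\,\tau_i} \sum_{x_j \in X_i}
         \left[ \frac{D_{wc_{ij}}}{n} - D_{wc_{ij,k}} \right]
```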
The choice of the tuning parameter tau with the objective function is important because it reflects the importance of the second term with respect to the first term. If the tau for the ith cluster is very small then only one keyword in that cluster will be relevant. The rest will receive zero weights. If it is large, then all the terms in that cluster will be relevant and may receive equal weights. The tau could be chosen such that both the terms of the objective function have the same order of magnitude.
We compute this tuning parameter tau numerically, in iterations. In iteration t we use the aggregated cluster-center errors from iteration t-1, divided by the sum of the squared feature weights, also from iteration t-1. The resulting fraction is multiplied by a constant to adjust the magnitude of tau.
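One way to write this, with K_tau a constant and the superscripts marking the iteration:

```latex
\tau_i^{(t)} = K_\tau \,
\frac{\sum_{x_j \in X_i} \sum_{k=1}^{n} v_{ik}^{(t-1)}\, D_{wc_{ij,k}}^{(t-1)}}
     {\sum_{k=1}^{n} \left( v_{ik}^{(t-1)} \right)^{2}}
```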
The feature relevance values can also end up exceeding 1 during the iterations.
In our iterations we may find that these values change a lot initially; however, there is little or no change in the later iterations.

Wednesday, October 23, 2013

In this post we will take an in-depth look at the formulas behind each of the steps mentioned in the previous post. Our goal is to understand not only how SKWIC works but also to describe it in stages that enable an implementation. We will assume that, given distance vectors with precomputed distances, we know how to cluster them.
In our case we begin with the similarity measure between document frequency vectors xi and xj.
This similarity is measured as the dot product of the vectors over their magnitudes. By that we mean we take an arbitrary term k between 1 and n, where n is the number of terms, and multiply the corresponding components xik and xjk of the two document frequency vectors. This has to be repeated for k = 1 to n, so we take the sum of the products xik * xjk over this range and normalize it by dividing by the vectors' magnitudes to compute the cosine similarity. Intuitively, the cosine similarity can be positive or negative since the vectors may point in different directions. The steps involved are iteration, summation and normalization, so this is easy to do.
Computing it for all pairs of documents has a computational complexity of O(N^2). We will call this the first step.
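As a quick illustration, a direct translation of this step, assuming the frequency vectors are plain Python lists of equal length:

```python
import math

def cosine_similarity(xi, xj):
    """Cosine similarity between two document frequency vectors (a small sketch)."""
    dot = sum(a * b for a, b in zip(xi, xj))     # sum of x_ik * x_jk for k = 1..n
    norm_i = math.sqrt(sum(a * a for a in xi))
    norm_j = math.sqrt(sum(b * b for b in xj))
    return dot / (norm_i * norm_j)               # normalize by the magnitudes
```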
Now let's look at the next formula. Here we said we need to find the dissimilarity measure across the different attribute directions: we defined the dissimilarity between the document frequency vector xj and the ith cluster center vector using the cosine measure. This we define as the second equation, and it requires a third formula for the distance along the individual dimensions.
The third formula is easy: the distance along an individual dimension k is 1/n minus the product of xjk and cik.
Note that cik is the kth component of the ith cluster center vector, and the cluster centers are vectors of the same form as the documents, so cik plays the role of a specially marked xik. That is why the third step is easy to compute given that we did the first step.
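In symbols, and writing vik for the feature weights that are introduced with the objective function below, the per-dimension distance and the weighted aggregate that the later posts call Dwcij are:

```latex
D_{wc_{ij,k}} = \frac{1}{n} - x_{jk}\, c_{ik},
\qquad
D_{wc_{ij}} = \sum_{k=1}^{n} v_{ik}\, D_{wc_{ij,k}}
```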
Next we look at the objective function. This has several components so we will take a look at each of them one at a time. The objective function searches for optimal cluster centers and the optimal set of feature weights simultaneously.
The first part of the objective function is the error to the cluster centers. It is minimized when only one keyword in each cluster is completely relevant and all other keywords are irrelevant. So we sum over each of the clusters.
The second part of the objective function is the sum of the squared keyword weights. When all the keywords have equal weights, we get the global minimum of this component. We also have a tuning parameter to use with the second component that we can set. Together, the objective function minimizes the sum of the intra-cluster weighted distances, where the keyword weights are optimized for each cluster.
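Based on this description, the (crisp) SKWIC objective can be sketched as the sum of the two parts, with Xi the set of documents assigned to cluster i and tau_i the tuning parameter; again a sketch, not a verbatim transcription:

```latex
J = \sum_{i=1}^{C} \sum_{x_j \in X_i} \sum_{k=1}^{n} v_{ik}\, D_{wc_{ij,k}}
    \;+\; \sum_{i=1}^{C} \tau_i \sum_{k=1}^{n} v_{ik}^{2},
\qquad v_{ik} \in [0,1], \quad \sum_{k=1}^{n} v_{ik} = 1 .
```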
Instead of using the objective function directly, it is decomposed into C independent problems. This way we remove the summation over all the clusters, and we use a Lagrange multiplier technique with partial derivatives. To get the equations to solve, we set the gradient to zero. We will also look at these equations in detail shortly.