In this post we will take an in-depth look at the formulas behind each of the steps mentioned in the previous post. Our goal is not only to understand how the SWKIC works but also to describe it in stages that enable an implementation. We will assume that, given distance vectors with precomputed distances, we know how to cluster them.
In our case we begin with the similarity measure between document frequency vectors xi and xj.
This similarity is measured as the dot product divided by the product of the magnitudes. By that we mean we take an arbitrary term index k between 1 and n, where n is the number of terms, and compute the product of the components xik and xjk, which reflects how the two terms appear together. We repeat this for k = 1 to n, take the aggregated sum of the products xik * xjk over this range, and normalize by dividing by the magnitudes of the two vectors to obtain the cosine similarity. In general, cosine similarity can be positive or negative since vectors may point in different directions; for non-negative document frequency vectors it lies between 0 and 1. The steps involved are iteration, summation and normalization, so this is easy to do.
Computing all pairwise similarities has a computational complexity of O(N^2). We will call this the first step.
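The first step can be sketched directly from the definition. The vectors below are made-up illustrations, not data from the post:

```python
import numpy as np

def cosine_similarity(xi, xj):
    """Cosine similarity between two document frequency vectors:
    sum the products x_ik * x_jk for k = 1..n, then normalize by
    the product of the vector magnitudes."""
    num = float(np.dot(xi, xj))
    denom = float(np.linalg.norm(xi) * np.linalg.norm(xj))
    return num / denom if denom > 0 else 0.0

# illustrative frequency vectors over n = 4 terms
xi = np.array([1.0, 0.0, 2.0, 1.0])
xj = np.array([1.0, 1.0, 2.0, 0.0])
print(round(cosine_similarity(xi, xj), 4))  # → 0.8333
```

Running this over every pair of the N vectors gives the O(N^2) similarity matrix mentioned above.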
Now let's look at the next formula. Here we said we need to find the dissimilarity measure across the different attribute directions: we defined the dissimilarity between the document frequency vector xj and the ith cluster center vector using the cosine measure. This we define as the second equation, which requires a third formula for the distance along the individual dimensions.
The third formula is easy because the distance along an individual dimension k is 1/n minus the product of xjk and cik, so for non-negative components each per-dimension distance is at most 1/n.
Note that cik is the kth component of the ith cluster center vector. Since the cluster centers are terms themselves, cik is a specially marked xik. That is why the third step is easy to compute given that we did the first step.
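The per-dimension distances can be sketched as follows, assuming the 1/n − xjk·cik form above and unit-normalized vectors (an assumption; under it the per-dimension distances sum to 1 minus the cosine similarity):

```python
import numpy as np

def dimension_distances(xj, ci):
    """Dissimilarity along each dimension k: D_k = 1/n - x_jk * c_ik.
    Assumes xj and ci are unit-normalized, so sum_k D_k = 1 - cos(xj, ci)."""
    n = len(xj)
    return 1.0 / n - xj * ci

# illustrative unit vectors over n = 2 dimensions
xj = np.array([0.6, 0.8])
ci = np.array([0.8, 0.6])
d = dimension_distances(xj, ci)
print(d.sum())  # equals 1 - cos(xj, ci)
```

Keeping the distances separated by dimension, rather than summed, is what lets the objective function later weight each keyword direction independently.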
Next we look at the objective function. This has several components so we will take a look at each of them one at a time. The objective function searches for optimal cluster centers and the optimal set of feature weights simultaneously.
The first part of the objective function is the error to the cluster centers. It is minimized when only one keyword in each cluster is completely relevant and all other keywords are irrelevant. We sum this error over all of the clusters.
The second part of the objective function is the sum of the squared keyword weights. When all the keywords have equal weights, this component reaches its global minimum. We also have a tuning parameter for the second component that we can set. Together, the objective function minimizes the sum of intra-cluster weighted distances, where the keyword weights are optimized for each cluster.
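The two parts described above can be sketched as one function. The array shapes and names (D, V, labels, delta) are assumptions for illustration, not the post's notation:

```python
import numpy as np

def skwic_objective(D, V, labels, delta):
    """Two-part objective (a hedged sketch of the form described):

    D[j, k]  : per-dimension distance of document j to its assigned center
    V[i, k]  : keyword weight of dimension k in cluster i
    labels[j]: cluster index assigned to document j
    delta[i] : tuning parameter for cluster i's squared-weight penalty
    """
    C, n = V.shape
    total = 0.0
    for i in range(C):
        members = np.where(labels == i)[0]
        # part 1: weighted per-dimension distances of cluster i's members
        total += float(np.sum(D[members] * V[i]))
        # part 2: tuning parameter times the sum of squared keyword weights
        total += float(delta[i]) * float(np.sum(V[i] ** 2))
    return total

D = np.array([[0.1, 0.2], [0.3, 0.1]])
V = np.array([[0.5, 0.5]])
labels = np.array([0, 0])
delta = np.array([1.0])
print(skwic_objective(D, V, labels, delta))
```

With equal weights the second part is at its minimum for the given delta, while the first part pushes weight toward the dimensions with the smallest distances; delta balances the two.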
Instead of using the objective function directly, it is decomposed into C independent problems, one per cluster, which removes the summation over all the clusters. We then use a Lagrange multiplier technique with partial derivatives: to get the update equations, we set the gradient to zero. We will also look at these equations in detail shortly.
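Setting the gradient of one cluster's Lagrangian to zero yields a closed-form weight update. The form below is the one commonly given for SKWIC-style clustering and is an assumption here, since the post defers the detailed equations; note the weights automatically sum to 1, satisfying the multiplier constraint:

```python
import numpy as np

def update_weights(Di, delta_i):
    """Keyword weight update for one cluster, from setting the gradient of
    the Lagrangian to zero (assumed form: v_ik = 1/n + (1 / 2*delta_i) *
    sum_j [D_ij / n - D^k_ij]).

    Di : (m, n) per-dimension distances D^k for the m documents in cluster i
    """
    m, n = Di.shape
    total = Di.sum(axis=1)                 # D_ij = sum over k of D^k_ij
    bracket = (total / n)[:, None] - Di    # D_ij / n - D^k_ij
    return 1.0 / n + bracket.sum(axis=0) / (2.0 * delta_i)

Di = np.array([[0.1, 0.3], [0.2, 0.2]])
v = update_weights(Di, delta_i=1.0)
print(v, v.sum())  # weights shift toward the low-distance dimension; sum is 1
```

Dimensions with below-average distance gain weight above the uniform 1/n, and delta_i controls how far the weights move away from uniform.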