Friday, October 25, 2013

When we discussed how SKWIC works in the previous post, we described a partitioning method that assigns each document to exactly one cluster. However, it is rarely the case that a document covers a single subject only; most documents straddle the subjects of two or more categories, and even manual classification of such documents is difficult and error-prone. Forcing a document into a single group hurts retrieval once a classification model is built on top of the clusters. K-means and SKWIC are both hard partitioning models, and because they commit each document to exactly one category, they are of limited use for real, large document collections.
This can be improved by instead assigning soft labels, so that documents are not confined to a single category. Fuzzy (soft) partition models in particular model the data better than their hard partitioning counterparts, because fuzzy memberships are richer than crisp memberships: they describe the degrees of association of data points that lie in overlapping clusters.
If we denote by u_ij the degree to which a data point x_j belongs to cluster X_i, then the fuzzy partition is described by a C x N matrix U = [u_ij]. This matrix is subject to the following constraints (a short code sketch that checks them follows the list):
1. Each degree of association u_ij ranges between 0 and 1; that is, the memberships take continuous values bounded by 0 and 1.
2. For each cluster, the sum of the degrees of association u_ij over all N documents must lie strictly between 0 and N, so no cluster is empty and no cluster absorbs every document.
3. For each document, the sum of the degrees of association u_ij over all the clusters must equal 1.
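As a rough illustration of these constraints, here is a minimal Python/NumPy sketch that builds a random fuzzy partition matrix and verifies all three; the cluster count C and document count N are arbitrary placeholders.

```python
import numpy as np

C, N = 3, 5  # hypothetical cluster and document counts

# Build a random C x N fuzzy partition matrix U = [u_ij]: draw
# positive values, then normalize each column so that every
# document's memberships sum to 1 (constraint 3).
rng = np.random.default_rng(42)
U = rng.random((C, N))
U /= U.sum(axis=0)

# Constraint 1: every degree of association lies in [0, 1].
assert np.all((U >= 0) & (U <= 1))

# Constraint 2: each cluster's total membership over the N
# documents is strictly between 0 and N (no empty clusters).
assert np.all((U.sum(axis=1) > 0) & (U.sum(axis=1) < N))

# Constraint 3: each document's memberships sum to 1.
assert np.allclose(U.sum(axis=0), 1.0)
```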
We define a new objective function with two components. The first combines these fuzzy memberships with the keyword-weighted errors of the documents to the cluster centers; it is minimized when only one keyword in each cluster is completely relevant and all other keywords are irrelevant. The second component is the sum of the squared keyword weights, and its global minimum is achieved when all the keywords are equally weighted. By balancing the two components, the objective function reduces the sum of the intra-cluster weighted distances without letting the keyword weights collapse onto a single term.
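To make the shape of this objective concrete, here is a minimal Python/NumPy sketch, not the paper's exact formulation: it substitutes a squared-Euclidean per-keyword distance for SKWIC's cosine-based one, and the names (fuzzy_skwic_objective, delta for the balancing weights, m for the fuzzifier) are placeholders of mine.

```python
import numpy as np

def fuzzy_skwic_objective(X, centers, U, V, delta, m=2.0):
    """Sketch of the two-component objective described above.

    X:       N x n document-term matrix
    centers: C x n cluster centers
    U:       C x N fuzzy partition matrix
    V:       C x n keyword-weight matrix (each row sums to 1)
    delta:   length-C balancing weights (assumed)
    m:       fuzzifier exponent, m > 1 (assumed)
    """
    intra = 0.0
    for i in range(centers.shape[0]):
        # Per-keyword squared distances of every document to center i
        # (an assumption; SKWIC itself uses a cosine-based distance).
        d = (X - centers[i]) ** 2            # N x n
        # Weight each keyword by its relevance, then each document
        # by its fuzzy membership raised to the fuzzifier m.
        intra += np.sum((U[i] ** m) * (d @ V[i]))
    # Second component: penalizes concentrated keyword weights, with
    # its global minimum at equal weights.
    reg = np.sum(delta * np.sum(V ** 2, axis=1))
    return intra + reg
```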
As with the previous objective function, we solve this one by setting its partial derivatives to zero and deriving the update equations in the same manner.
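For instance, under the sum-to-1 constraint on the memberships, the update for u_ij comes out in the familiar fuzzy-c-means form; the sketch below assumes a C x N matrix D of squared weighted distances (strictly positive) and a fuzzifier m > 1, both hypothetical names rather than anything fixed by the post.

```python
import numpy as np

def update_memberships(D, m=2.0):
    """Membership update obtained by setting the partial derivatives
    of the objective to zero under the sum-to-1 constraint (via a
    Lagrange multiplier). D is a C x N matrix of squared weighted
    distances, assumed strictly positive."""
    # ratio[i, k, j] = (D[i, j] / D[k, j]) ** (1 / (m - 1))
    ratio = (D[:, None, :] / D[None, :, :]) ** (1.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=1)

# Example: the resulting memberships sum to 1 over the clusters
# for each document, as required by constraint 3.
D = np.array([[0.2, 1.5], [0.8, 0.3], [1.1, 0.9]])
U = update_memberships(D)
assert np.allclose(U.sum(axis=0), 1.0)
```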
