Sunday, January 26, 2014

I tested my KLD clusterer with real data. There are some interesting observations I wanted to share.
First, let's consider a frequency distribution of words from some text:
categories, 4
clustering, 3
data, 2
attribute, 1

Next let's consider their co-occurrences as follows:
categories clustering 1
data categories 1
data clustering 2
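
For concreteness, here is a minimal Python sketch (not the actual KLD clusterer) of one way to turn these counts into smoothed co-occurrence distributions and compute a symmetrized KLD between term pairs; the helper names, the smoothing constant alpha, and the symmetrization are illustrative assumptions, not taken from the original code.

import math

# Frequencies and co-occurrence counts from the example above.
freq = {"categories": 4, "clustering": 3, "data": 2, "attribute": 1}
cooc = {("categories", "clustering"): 1,
        ("data", "categories"): 1,
        ("data", "clustering"): 2}

terms = sorted(freq)

def distribution(term, alpha=0.1):
    # Smoothed co-occurrence distribution of `term` over all terms.
    # The additive smoothing constant alpha is an assumption for illustration.
    counts = {}
    for other in terms:
        pair = (term, other) if (term, other) in cooc else (other, term)
        counts[other] = cooc.get(pair, 0) + alpha
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def kld(p, q):
    # Kullback-Leibler divergence D(P || Q); smoothing keeps q positive.
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)

def symmetric_kld(a, b):
    # Symmetrized KLD used here as the pairwise measure between two terms.
    pa, pb = distribution(a), distribution(b)
    return 0.5 * (kld(pa, pb) + kld(pb, pa))

for i, a in enumerate(terms):
    for b in terms[i + 1:]:
        print(a, b, round(symmetric_kld(a, b), 3))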

Since the terms 'categories' and 'clustering' occur together, and also co-occur with other terms, they should belong to the same cluster. The cluster assignment is based on the KLD measure: each measure falls into one of several K-means clusters, each covering a different range of that measure.

Furthermore, the term 'attribute' occurs only once and does not co-occur with any other term. Therefore it must be different from the 'categories' and 'clustering' terms and should fall in a cluster different from theirs.

The cluster for the term 'data' is ambiguous. If the K-means clustering chose, say, only two partitions, it would be merged with one of the existing clusters. If the K-means clustering had many partitions, it could end up in its own cluster.

Items are clustered together based on their KLD measure: pairs of terms with similar KLD measures should end up in the same cluster, as in the sketch below.
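
As a sketch of that last step, the pairwise KLD values could be fed into a one-dimensional K-means so that similar measures fall into the same partition. The use of scikit-learn's KMeans, and the reuse of terms and symmetric_kld from the hypothetical sketch above, are assumptions for illustration.

import numpy as np
from sklearn.cluster import KMeans

# Pairwise symmetrized KLD values between terms (see the sketch above).
pairs = [(a, b) for i, a in enumerate(terms) for b in terms[i + 1:]]
values = np.array([[symmetric_kld(a, b)] for a, b in pairs])

# Partition the KLD measures into k ranges; with k = 2 the pairs involving
# 'data' merge into an existing cluster, with a larger k they may get their own.
k = 2
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(values)

for (a, b), label in zip(pairs, labels):
    print(a, b, "-> cluster", label)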
