Monday, November 11, 2013

We will pick up the implementation discussion on FuzzySKWIC from here: http://ravinote.blogspot.com/2013/11/today-im-going-to-describe-few-steps.html
We will try to wrap up the implementation in this post. The only thing left to verify is that the cosine-based distances are calculated correctly; everything else follows in the iteration as we discussed.
The cosine-based distances are computed per component. Each document is represented by a vector of its document frequencies, and this vector is normalized to unit length. The cosine-based distance along the kth component is 1/n - xjk * cik, where n is the number of terms, xjk is the kth component of the document frequency vector xj, and cik is the kth component of the ith cluster center vector. xjk can be calculated as this document's contribution to the total document frequency of that term, so it does not change with iterations. Note that both xjk and cik are at most one, since we are referring to cell values in unit-length document vectors; the cluster center is also a document vector.
We start with an N x n matrix in this case (N documents, n terms), and for each document we will maintain feature weights as well as cosine distances to the cluster centers. When a cluster center changes, the per-component cosine distances change, and the aggregated cosine distance changes with them. If the centers have not changed, we can skip some of the calculations. We will eventually want a fuzzy matrix of size C x N, but for now we will try without it. The initial assignment of the document frequency vectors populates the N x n matrix purely from the IDF values. The feature weights are initialized as well.
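To make the distance calculation concrete, here is a minimal sketch in plain Python. The function names (normalize, componentwise_cosine_distance, aggregated_distance) are made up for illustration, and the uniform feature weights of 1/n in the usage lines are an assumption, since the post does not say what the weights are initialized to.

```python
import math

def normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0:
        return list(vec)
    return [v / norm for v in vec]

def componentwise_cosine_distance(x, c):
    """Per-component distance 1/n - x_k * c_k for unit-length vectors x and c."""
    n = len(x)
    return [1.0 / n - xk * ck for xk, ck in zip(x, c)]

def aggregated_distance(x, c, weights):
    """Weighted sum of the per-component distances (weights are the feature weights)."""
    distances = componentwise_cosine_distance(x, c)
    return sum(w * d for w, d in zip(weights, distances))

# Usage: a document vector and a cluster center, both normalized first.
doc = normalize([3.0, 4.0])
center = normalize([4.0, 3.0])
uniform = [1.0 / len(doc)] * len(doc)  # assumed initial feature weights
per_component = componentwise_cosine_distance(doc, center)
total = aggregated_distance(doc, center, uniform)
```

One useful sanity check: the per-component distances add up to 1 minus the dot product of the two unit vectors, which is the ordinary cosine distance, so a document compared with itself sums to zero.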
For test data we use the Brown corpus categories news, government, adventure and humor. We will pick ten documents from each category.
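The selection of test documents could look something like the sketch below. It assumes the Brown corpus is read through NLTK (the post does not say which reader is used), and the helper names pick_documents and load_brown_sample are hypothetical. The slice takes at most ten file ids, so a category with fewer files simply contributes all it has.

```python
BROWN_CATEGORIES = ["news", "government", "adventure", "humor"]

def pick_documents(fileids_by_category, per_category=10):
    """Return (category, fileid) pairs, taking at most per_category files from each category."""
    selected = []
    for category in sorted(fileids_by_category):
        for fileid in fileids_by_category[category][:per_category]:
            selected.append((category, fileid))
    return selected

def load_brown_sample(per_category=10):
    """Load the words of the selected documents. Assumes NLTK and its Brown corpus data are installed."""
    from nltk.corpus import brown  # imported here so pick_documents works without NLTK
    fileids_by_category = {
        category: brown.fileids(categories=category)
        for category in BROWN_CATEGORIES
    }
    return [
        (category, fileid, list(brown.words(fileids=fileid)))
        for category, fileid in pick_documents(fileids_by_category, per_category)
    ]
```

Calling load_brown_sample() would then give one (category, fileid, words) entry per selected document, which is the input needed to build the N x n matrix.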
