Wednesday, November 20, 2013

In the previous post, I talked about MCMM, Zipf's law, and the assumptions the model makes.
 Zipf's law relates a word's frequency to its rank r: the r-th most frequent word occurs with frequency roughly proportional to 1/r. In this model, it is used to find a threshold for the number of times a word must occur before we mark the word as present in the binary document vector. Zipf curves are drawn over each document's word frequencies to find this threshold per document; a longer document will have a larger threshold, which eliminates spurious word occurrences from its document vector. In support of the binary representation of terms, other studies have shown that there is very little difference in results between binary and weighted terms, and that categorization can be achieved without term weights.
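The post does not spell out how the threshold is read off the Zipf curve, so here is a minimal sketch of the idea under one simple assumption: the cutoff grows with document length. Both the rule and the `alpha` parameter are illustrative, not from the original model.

```python
from collections import Counter

def binarize(doc_tokens, alpha=0.001):
    """Map a tokenized document to a binary word-presence vector.

    Hypothetical rule: the occurrence threshold is proportional to
    document length, so longer documents need more occurrences of a
    word before it counts as present.
    """
    counts = Counter(doc_tokens)
    threshold = max(1, int(alpha * len(doc_tokens)))
    return {word: int(c > threshold) for word, c in counts.items()}
```

For a short document the threshold stays at 1, so any word occurring at least twice is marked present; for a very long document the bar rises, filtering out incidental single occurrences.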
 Another assumption made is independence between the components of the input vector (i.e., the words); there are no hidden units at the input layer, so the word occurrences are observed directly.
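As I understand MCMM, this independence assumption is what lets the hidden causes combine multiplicatively: each word's predicted activity is a soft disjunction (noisy-OR) over the cause activities. A sketch under that reading; the variable names `m` and `C` are mine, not from the paper.

```python
import numpy as np

def predict_word_probs(m, C):
    # m: activities of the K hidden causes, shape (K,)
    # C: cause-to-word weights in [0, 1], shape (K, V)
    # Noisy-OR: a word is predicted present if at least one active
    # cause accounts for it; independence between the components
    # lets the per-cause terms combine as a simple product.
    m = np.asarray(m, dtype=float)
    C = np.asarray(C, dtype=float)
    return 1.0 - np.prod(1.0 - m[:, None] * C, axis=0)
```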
Because MCMM has a high computational cost, its viability was tested under different settings. For example, the model was run with reduced input dimensionality, using Zipf's law to eliminate words that occur very few or very many times across the documents in the corpus.
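A sketch of that Zipf-based pruning: compute each word's document frequency over the corpus and drop both extremes. The cutoffs `min_df` and `max_df_ratio` below are illustrative, not values from the original experiments.

```python
from collections import Counter

def prune_vocabulary(docs, min_df=2, max_df_ratio=0.5):
    """Keep words that are neither too rare nor too common.

    docs: list of token lists, one per document.
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency, not raw counts
    return {w for w, c in df.items()
            if c >= min_df and c / n_docs <= max_df_ratio}
```

Shrinking the vocabulary this way directly reduces the dimensionality of the binary input vectors, which is what addresses the computational cost mentioned above.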
