Sunday, November 24, 2013

Johnson's clustering scheme was first written up during the summer of 1965, when he was working at Bell Labs during a break from his PhD. It introduced cluster hierarchies and came with a program written in FORTRAN. Agglomerative hierarchical clustering places each object into its own cluster and gradually merges these atomic clusters into larger and larger clusters until only one remains. Divisive hierarchical clustering reverses this process: it starts with all objects in one cluster and then subdivides it into smaller clusters. A hierarchical classification is a special sequence of partitional classifications, in which each partition is nested in the next.
Serial clustering handles the patterns one at a time, whereas simultaneous clustering works with the entire set of patterns at once. A clustering algorithm can be expressed mathematically either through graph theory or through matrix algebra. The former is used to describe properties such as connectedness and completeness, and the latter to describe criteria such as mean-square error.
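To make the agglomerative process concrete, here is a minimal sketch in Python. It is not Johnson's FORTRAN program. It assumes one-dimensional points and uses the single-link (closest pair) distance between clusters, which is only one of several possible choices; the function names are made up for illustration.

```python
from itertools import combinations

def single_link_distance(a, b):
    """Smallest pairwise distance between two clusters of 1-D points (an assumed choice)."""
    return min(abs(x - y) for x in a for y in b)

def agglomerate(points):
    """Repeatedly merge the two closest clusters; return every level of the hierarchy."""
    clusters = [[p] for p in points]            # each object starts in its own cluster
    levels = [[list(c) for c in clusters]]      # level 0: all atomic clusters
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest single-link distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_link_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        levels.append([list(c) for c in clusters])
    return levels

levels = agglomerate([1.0, 1.5, 5.0, 5.2, 9.0])
for level in levels:
    print(level)
```

Each printed level is one partition of the points, and reading the levels from last to first gives the order in which a divisive scheme would split them.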
For the purposes of clustering vectors based on KLD distances, we will use a simple partitioning scheme like the K-means we talked about. This is because we already know the source categories when we pick the sample documents. Besides, when we do keyword extraction, we will build a vocabulary from these documents. We may come across a large number of lemmas, and we plan to add to an initially empty document only those terms that keep the KLD distance above a threshold, i.e. D(avg-pt/q) > 1.
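As a rough sketch of what a K-means style partition over KLD distances could look like, the Python below treats each document as a probability vector over the vocabulary and assigns it to the centroid with the smallest D(p || centroid). The smoothing constant, the toy vectors, the choice of initial centroids, and all function names are assumptions made for illustration, not part of the scheme described above.

```python
import math

def kl(p, q, eps=1e-9):
    """Kullback-Leibler divergence D(p || q); eps is an assumed smoothing term to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def mean(vectors):
    """Component-wise mean of equally long vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def kmeans_kl(vectors, initial_centroids, iterations=10):
    """K-means style partitioning with KLD in place of Euclidean distance."""
    centroids = [list(c) for c in initial_centroids]   # copy so the caller's data is untouched
    assignment = []
    for _ in range(iterations):
        # Assignment step: each vector goes to the centroid with the smallest divergence.
        assignment = [
            min(range(len(centroids)), key=lambda k: kl(v, centroids[k]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:
                centroids[k] = mean(members)
    return assignment, centroids

# Toy term distributions for four documents from two known source categories.
docs = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7], [0.1, 0.3, 0.6]]
assignment, centroids = kmeans_kl(docs, [docs[0], docs[2]])
print(assignment)
```

Since the source categories are known up front, one sample document per category is a natural seed for each centroid, which is what the toy call does.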
In addition, we mentioned varying the threshold on a sliding scale to retrieve more and more of the terms as needed. We can set the threshold to a value that gives reasonable results on the samples at hand. No single threshold value is universally right; however, larger and larger values of the threshold are certain to give more discriminative keywords. This can be built into a visual tool that renders the keywords in a slightly bigger font than the rest of the text. This way the keywords attract attention in place, within the surrounding text. Together with a control that renders the sliding scale, we can increase the number of terms shown in the bigger font, making it easier for the reader to find the keywords and to see how the number of keywords changes. Building this visual tool should be fairly straightforward, since it renders the same text in different forms. The bigger font, or a magnifying-glass display, for keywords can be enabled by giving them font and size markup that differs from the rest.
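A minimal sketch of such a renderer in Python follows. It assumes that each term already has a KLD-based score (the numbers below are invented), that the output is HTML, and that a keyword is any term whose score exceeds the threshold; the function names and the 150% font size are likewise illustrative.

```python
import html

def select_keywords(scores, threshold):
    """Terms whose (assumed precomputed) score is above the threshold."""
    return {term for term, score in scores.items() if score > threshold}

def render(text, scores, threshold, size="150%"):
    """Return the same text as HTML, with keywords wrapped in a larger font."""
    keywords = select_keywords(scores, threshold)
    out = []
    for word in text.split():
        safe = html.escape(word)
        # Compare case-insensitively and ignore trailing punctuation.
        if word.lower().strip(".,;:!?") in keywords:
            safe = f'<span style="font-size:{size}">{safe}</span>'
        out.append(safe)
    return " ".join(out)

# Hypothetical scores; in practice these would come from the KLD computation.
scores = {"clustering": 2.4, "threshold": 1.3, "the": 0.1}
text = "The clustering threshold is the key."

page_strict = render(text, scores, threshold=2.0)   # fewer, more discriminative keywords
page_loose = render(text, scores, threshold=1.0)    # sliding the scale down adds keywords
print(page_strict)
print(page_loose)
```

Moving the slider only changes the threshold argument, so the text itself is never modified, only its markup.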
