While we discussed graph based algorithms, we mentioned that there have been improvements since. We will discuss some of these today. We will use the survey by Frigui and Nasraoui who come up with their own technique as well. One improvement has been to use clustering. Clustering is efficient for finding the nearest neighbors be it terms or documents. We use a vector space model where each term is represented by a vector in the document space. The document is considered a bag of words. The vectors can be assigned based on term-frequencies or other statistical or probabilistic metrics. If term-frequencies are used, frequent words with little discriminating power can be discounted with its Inverse Document Frequency in the document collection. Since documents vary a lot, both TF and IDF or their combination is not sufficient. So we go by a large corpus where the text has been manually classified. and this provides a rich base for classifying terms or documents.  However, both the categories and their respective keyword sets need to be  discovered simultaneously. Frigui and Nasraoui mention that selecting and weighting subsets of keyword in text documents is similar to the problem of feature selection and weighting in pattern recognition and data mining. Selecting the best set of features is important for real world tasks because there could be noise from irrelevant features that can severely degrade performance. Even if the data samples have already been classified into known classes, the authors suggest that it is preferable to model each complex class by several simple sub-classes or clusters and to use different set of feature weights for each cluster. This can help in classifying documents into one of the preexisting categories. Since clustering and feature selection have been dealt with independently, here's where the authors suggest a scheme that does both simultaneously.  They call this scheme SCAD which stands for simultaneous clustering and attribute discrimination.
They also improve on weighting by using a continuous feature weighting method. So feature relevance representation is better than binary or other discrete basis. Also, SCAD learns a different feature relevance representation for each cluster in an unsupervised manner.Hence the keyword weighting is both dynamic and category-dependent while being simultaneous with document clustering. One thing to note is that we described metrics for distance vectors and these can be Euclidean however that is not appropriate for text. For text, we rely on cosine similarity or Jaccard index to assess the similarity between documents. Moreover, a text document can belong to several categories  with different degrees. This is done by assigning fuzzy or soft labels instead of assigning them to single label category. There are also noise magnet clusters that take away the outliers from the clusters. So SCAD is refined to work on text documents and this is called Simultaneous Keyword Identification and Clustering of text documents (SKWIC).
 
No comments:
Post a Comment