Thursday, November 7, 2013

We talked about cluster-based retrieval, which relies on summarization and on the premise known as the cluster hypothesis. The summarization data can include the naturally occurring overlap between clusters, which prevents clusters that overlap significantly with others from being omitted.
LSI and COV can be used to find major document clusters. During the dimension-reduction step, minor clusters are often mistaken for noise and discarded. Ando proposed an algorithm to identify small clusters in limited contexts. It prevents major themes in the documents from dominating the selection of basis vectors for the subspace-projection step of LSI. During basis-vector selection, an unfavorable weight (also known as a filter or negative bias) is applied to documents that belong to clusters already well represented by the basis vectors selected so far. This weight is computed from the residual of each document vector, that is, the portion of the document vector that cannot be represented by the basis vectors selected up to that point. Each document's residual vector is then rescaled by the magnitude of its residual raised to the power q.
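As a rough illustration (not Ando's actual code), here is a minimal sketch of how the residual of a single document vector and its rescaling might be computed, assuming for simplicity that the basis vectors selected so far are orthonormal; the function and variable names are my own.

    import numpy as np

    def rescaled_residual(d, basis, q):
        """Residual of document vector d after removing the components
        captured by the basis vectors selected so far, rescaled by |r|**q
        so that documents already well represented get an unfavorable
        (smaller) weight in the next basis-vector selection."""
        r = d.astype(float)
        for b in basis:               # basis: list of unit-length vectors
            r = r - np.dot(r, b) * b  # subtract the component along b
        return np.linalg.norm(r) ** q * r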
This process is repeated for each basis vector (roughly, one per cluster). The residual matrix R is set to A initially, so the residuals r_1, r_2, ..., r_m start out as the document vectors themselves. At each step the rescaled residual matrix R_s is formed, b_i is taken as the first eigenvector of R_s^T R_s, and the residual matrix is updated as R <- R - R b_i b_i^T so that the residuals take the new basis vector into account. After the k-th basis vector is computed, each document vector is mapped to its counterpart in the k-dimensional subspace, and relevance ranking follows.
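Putting the pieces together, the full iteration might look roughly like the following sketch (a hypothetical implementation of the steps described above, using NumPy; select_basis and its parameters are illustrative names, not from any particular library). Documents are stored as the rows of A.

    import numpy as np

    def select_basis(A, k, q):
        """Iterative basis-vector selection as described above.
        A: m x n document-attribute matrix (one document per row).
        k: number of basis vectors to select.
        q: rescaling exponent."""
        R = A.astype(float)                      # residual matrix, initially A
        basis = []
        for _ in range(k):
            lengths = np.linalg.norm(R, axis=1)  # |r_1|, ..., |r_m|
            Rs = (lengths ** q)[:, None] * R     # rescale each residual row
            # First eigenvector of Rs^T Rs = top right singular vector of Rs.
            _, _, Vt = np.linalg.svd(Rs, full_matrices=False)
            b = Vt[0]
            basis.append(b)
            R = R - np.outer(R @ b, b)           # R <- R - R b b^T
        return np.array(basis).T                 # n x k projection basis

    # Each document is then mapped to its k-dimensional counterpart,
    # e.g. A_k = A @ select_basis(A, k, q), and relevance ranking is
    # carried out in that subspace.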
However, this algorithm does not always identify all major and minor clusters. The eigenvector computation may be unstable when q is large. The basis vectors may not always be orthogonal. If the number of documents in the database is even moderately large, the intermediate data may not fit into main memory. These shortcomings arise because information is lost when the document vectors are rescaled after each computation. Also, document-attribute matrices are usually highly sparse, but after a few iterations the residual matrix may become very dense. COV-based algorithms can overcome these shortcomings.
