Thursday, November 7, 2013

In the previous post we talked about LSI and COV. These fundamentals improve our understanding of why and how to perform the cosine computations and their aggregation, when and how to reduce the dimensions and what benefits come with it, and finally how to scale the implementation for large data. The last one is particularly interesting since there is no limit to how much data can be clustered and categorized. In this regard, I want to draw attention to how the document vectors are computed. The document vectors are represented as di = [ai1, ai2, ai3, ... ain] in the m document by n attribute matrix. The entry ai1 is the relevancy weight of term 1 among the n terms chosen for the document-attribute matrix. It could be a simple inverse document frequency for that term, or it could be (1 + tf) log2(n/df). This is a scalar value. The cosine of the angle between two document vectors is their dot product, that is, the sum of the component-wise products, divided by the product of their individual vector magnitudes.
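To make the weighting and the cosine concrete, here is a minimal sketch in Python. The function names term_weight and cosine are mine, not from any library. I am also assuming that the n inside log2(n/df) is the number of documents in the collection, and that a zero-length vector should give a cosine of 0; neither is spelled out above.

```python
import math


def term_weight(tf, df, n):
    """Weight of one term in one document: (1 + tf) * log2(n / df).

    tf: term frequency in the document
    df: number of documents containing the term
    n:  assumed here to be the total number of documents
    """
    return (1 + tf) * math.log2(n / df)


def cosine(d1, d2):
    """Cosine of the angle between two document vectors."""
    # Dot product: sum of the component-wise products.
    dot = sum(a * b for a, b in zip(d1, d2))
    # Individual vector magnitudes (Euclidean lengths).
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        # Assumption: treat an all-zero vector as unrelated to everything.
        return 0.0
    return dot / (norm1 * norm2)
```

Two documents that share no terms get a cosine of 0, and two documents whose vectors point in the same direction get a cosine of 1, regardless of their lengths.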
Now we can discuss dynamic rescaling algorithms. We saw that with LSI and COV, the major themes were prevented from dominating the process of selecting basis vectors for the subspace into which the IR problem is projected. This was done with weights that decrease the relative importance of attributes that are already well represented by the basis vectors computed so far.
In dynamic rescaling of LSI, the residual matrix R and the rescaled residual matrix Rs are computed. Initially R is set to the document-term matrix A itself. We follow the same steps as in LSI until the document and query vectors are mapped into the k-dimensional subspace. The steps up to this point include updating R and Rs after each iteration by taking into account the most recently computed basis vector bi. After the kth basis vector is computed, each document vector dj in the original IR problem is mapped to its counterpart in the subspace. The query vector is also mapped to this subspace. We rescale the document vectors after computing each basis vector and the residual vectors, because the lengths of the residual vectors help guard against losing information during dimension reduction. The quantity we use is the maximum residual vector length, tmax. The rescaling factor q is a function of tmax from the previous iteration. For example, if tmax is greater than 1, we take its inverse as the rescaling factor. If tmax is approximately equal to 1, we add it to 1 to get the rescaling factor. If tmax is less than 1, we raise 10 to the power tmax^-2. This way we compute the rescaling factor dynamically after each iteration, since the maximum residual vector length can change. Thus the rescaling factor keeps vectors from being lost through overweighting or over-reduction.
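The choice of q from tmax can be sketched in a few lines of Python. This is only an illustration under some assumptions of mine: the names rescaling_factor and max_residual_length are made up, the tolerance tol that decides what "approximately equal to 1" means is not given anywhere above, and for tmax less than 1 I read the rule as 10 raised to the power tmax^-2. Since that value grows very quickly for small tmax, the sketch returns infinity when it overflows, which is also my own choice.

```python
import math


def max_residual_length(residuals):
    """tmax: the largest Euclidean length among the residual vectors."""
    return max(math.sqrt(sum(x * x for x in r)) for r in residuals)


def rescaling_factor(tmax, tol=0.05):
    """Pick the rescaling factor q from the previous iteration's tmax.

    tol is an assumed tolerance for "approximately equal to 1".
    """
    if tmax <= 0:
        raise ValueError("tmax must be positive")
    if abs(tmax - 1.0) <= tol:
        # tmax is approximately 1: add it to 1.
        return 1.0 + tmax
    if tmax > 1.0:
        # tmax is greater than 1: take the inverse.
        return 1.0 / tmax
    # tmax is less than 1: assumed reading, 10 ** (tmax ** -2).
    exponent = tmax ** -2
    try:
        return 10.0 ** exponent
    except OverflowError:
        # Assumption: report an unbounded factor instead of failing.
        return math.inf
```

In an iteration we would compute tmax with max_residual_length over the rows of R, call rescaling_factor on it, and use the result when forming Rs for the next basis vector.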
