In the previous post I mentioned Latent Semantic Indexing (LSI). In this post I want to introduce both LSI and Covariance Matrix Analysis (COV) and then compare the two.
In LSI, we are given a database with m documents and n distinguishing attributes for relevancy ranking. A is the corresponding m x n document-attribute matrix. The idea is to reduce the dimensions to k, where k is much smaller than m or n, by computing a rank-k approximation of A.
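As a rough illustration, here is a minimal sketch of the rank-k reduction via a truncated SVD, assuming numpy; the matrix sizes and the value of k are made up for the example and are not from the post:

```python
# Minimal LSI sketch via truncated SVD (numpy assumed; toy data).
import numpy as np

A = np.random.rand(100, 50)   # m x n document-attribute matrix (toy data)
k = 10                        # target rank, k << min(m, n)

# Full SVD: A = U S Vt; keep only the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Rank-k approximation of A; documents now live in a k-dim subspace.
A_k = U_k @ np.diag(s_k) @ Vt_k
docs_k = A @ Vt_k.T           # documents projected onto the k dimensions
```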
In the COV algorithm, document and query vectors are projected onto the subspace spanned by the k eigenvectors corresponding to the k largest eigenvalues of the covariance matrix of the document vectors. Thus COV is based on selecting a subset of k vectors. The covariance matrix is computed by taking averages along the individual components, or dimensions. Note that taking averages along the individual dimensions from 1 to n is something we find in Fuzzy SKWIC too, so it is particularly helpful to pay attention to the fundamentals here.

If the ith document, for i ranging from 1 to m, has attributes/components ai1, ai2, ..., ain, we define the component-wise average vector as d' = (a1', a2', ..., an'), where the average for each attribute is computed across the m documents along that attribute. In a matrix with m documents as rows and n attributes as columns, this component-wise average translates to averaging the individual columns.

The covariance matrix C can be decomposed in the form V E Vt, where V is the orthogonal matrix that diagonalizes C, E is the diagonal matrix whose diagonal values are in decreasing order from the top-left down to the bottom-right, and Vt is the transpose of V. The eigenvectors v1, v2, ..., vk corresponding to the k largest diagonal values are used to reduce the dimensions from n to k. In other words, all the document vectors and the query vectors are then projected onto the subspace defined by these k eigenvectors.

The difference between COV and LSI is that in LSI we reduce the document-attribute matrix itself to rank k, while in COV we select a subset of k eigenvectors of the covariance matrix and transform the document and query vectors into the subspace they span. By merely reducing the dimensions, the vectors continue to have the same origin, and there is less distinction between the documents in the document-attribute matrix. In covariance analysis, the origin is shifted to the centroid of the document vectors, so the document vectors are more evenly spaced in the subspace. This enables finer detection of differences between document vectors in different clusters. Both COV and LSI project the query and document vectors into the k-dimensional subspace and then measure similarity between documents.
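Here is a minimal sketch of the COV steps described above, assuming numpy and toy data; it computes the component-wise averages d', the covariance matrix C, its eigendecomposition, and the projection onto the top-k eigenvectors:

```python
# Minimal COV sketch (numpy assumed; data and k are illustrative).
import numpy as np

A = np.random.rand(100, 50)        # m documents x n attributes (toy data)
k = 10

d_bar = A.mean(axis=0)             # d' = (a1', ..., an'): column averages
C = np.cov(A, rowvar=False)        # n x n covariance matrix of documents

# Symmetric eigendecomposition C = V E Vt; eigh returns eigenvalues in
# ascending order, so reverse to get them in decreasing order.
eigvals, V = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
V_k = V[:, order[:k]]              # eigenvectors v1..vk of the k largest eigenvalues

# Center at the component-wise average and project onto the k-dim subspace.
docs_k = (A - d_bar) @ V_k
```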
Another advantage of COV is that the computation of the k eigenvectors and eigenvalues is independent of the document and query vectors. It need not be repeated for each document or query vector; since it is done only once, the document and query vectors can be processed in near real-time. This is a significant advantage over LSI, which did not scale as the size of the database grew. However, the cost of computing the eigenvectors is still significant in comparison to the cost of processing an individual vector.
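To make the one-time cost concrete, here is a sketch of the offline/online split, again with numpy and toy data; the variable names are illustrative:

```python
# One-time offline precompute vs. cheap per-query projection (numpy assumed).
import numpy as np

A = np.random.rand(100, 50)        # m documents x n attributes (toy data)
k = 10

# Offline, done once: center, covariance, top-k eigenvectors.
d_bar = A.mean(axis=0)
eigvals, V = np.linalg.eigh(np.cov(A, rowvar=False))
V_k = V[:, np.argsort(eigvals)[::-1][:k]]
docs_k = (A - d_bar) @ V_k         # documents projected once and cached

# Online, near real-time: each query costs only an O(n*k) projection
# plus similarity computations in the k-dim subspace.
q = np.random.rand(50)             # toy query vector over n attributes
q_k = (q - d_bar) @ V_k
sims = docs_k @ q_k / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k))
```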