We continue with the Vector Space Model next. In this model, we represent each document of a collection as a vector of weights, one weight per term appearing in the document. The similarity of documents d1 and d2 to a query q is measured as the cosine of the angle between each document vector and the query vector: a document is closer to the query when its cosine is larger, that is, when the angle is smaller. Each query is modeled as a vector in the same attribute space as the documents.
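The cosine ranking above can be sketched in a few lines. This is a minimal illustration assuming numpy; the three-term vocabulary and the weight values are made up for the example (in practice they would be tf-idf weights).

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two term-weight vectors.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical 3-term vocabulary; each entry is a term weight.
d1 = np.array([1.0, 2.0, 0.0])
d2 = np.array([0.0, 1.0, 1.0])
q  = np.array([1.0, 1.0, 0.0])

# The document with the larger cosine is ranked closer to the query.
sim1 = cosine_similarity(d1, q)
sim2 = cosine_similarity(d2, q)
```

Here d1 shares more weighted terms with q than d2 does, so sim1 comes out larger and d1 is ranked first.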
Groups of documents that are similar with respect to the users' information needs can be obtained by clustering: the documents form a collection of points in a multidimensional space whose dimensions are given by the selected features. There are different clustering algorithms, and almost any of them can partition the collection; their fitness is measured by the quality of the clusters they produce.
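One common clustering algorithm for points in such a feature space is k-means, sketched below with numpy. The document vectors and the naive first-k initialization are illustrative assumptions, not details from the post; production code would use a library and randomized restarts.

```python
import numpy as np

def kmeans(points, k, iters=20):
    # Naive init: take the first k points as centers (real code randomizes).
    centers = points[:k].copy()
    for _ in range(iters):
        # Assign each point to its nearest center (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of the points assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

# Two hypothetical groups of document vectors in a 2-D feature space.
docs = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 4.9]])
labels, centers = kmeans(docs, k=2)
```

After a few iterations the two nearby pairs of documents end up in separate clusters, which is the grouping-by-similarity the paragraph describes.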
Sometimes it is better to reduce the number of dimensions without significant loss of information. This transformation to a subspace helps with clustering and reduces the computation. Another technique uses principal component analysis: the document and query vectors are projected onto a subspace spanned by the k principal components, which are defined by a matrix such as the covariance matrix.
Notice that reducing the number of dimensions by itself does not shift the origin but only changes the number of dimensions, so the vectors keep the same relative spacing. Principal component analysis, on the other hand, shifts the origin to the centroid of the vectors before projecting onto the subspace, so the documents are better differentiated.
def LSI(vectors):
    # Project onto the subspace kept by the k largest singular values (SVD).
    return vectors.transform(vectors.k_largest_singular_values)

def COV(vectors):
    # Project onto the principal components with the k highest eigenvalues
    # of the covariance matrix.
    return vectors.transform(vectors.subset_principal_components_with_k_highest_values)
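The two pseudocode routines above can be made concrete with numpy. This is a sketch under stated assumptions: the term-document matrix X and the choice k=2 are invented for illustration, and `lsi`/`pca` are hypothetical names for the projections the post calls LSI and COV.

```python
import numpy as np

def lsi(X, k):
    # LSI: truncated SVD keeps the k largest singular values; no centering,
    # so the origin stays put and relative spacing is preserved.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:k].T

def pca(X, k):
    # PCA: first shift the origin to the centroid, then project onto the
    # eigenvectors of the covariance matrix with the k largest eigenvalues.
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    top = vecs[:, np.argsort(vals)[::-1][:k]] # k highest-eigenvalue components
    return Xc @ top

# Hypothetical 3-document, 3-term weight matrix (rows = documents).
X = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 1.0]])
Z_lsi = lsi(X, 2)   # shape (3, 2)
Z_pca = pca(X, 2)   # shape (3, 2), centered at the origin
```

Note how the PCA projection has zero mean, reflecting the origin shift discussed above, while the LSI projection does not center the data.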