Friday, January 24, 2014

In this post, we will begin with a way to differentiate between LSI and PLSI. Latent Semantic Indexing is an approach where we try to associate terms with the same main hidden semantic component, as found by co-occurrences in similar contexts. LSI finds these components with linear algebra, typically a truncated singular value decomposition of the term-document matrix. In PLSI, we use similar occurrence and co-occurrence of terms, but with non-negative weights whose sum is 1, i.e., probabilities. Comparing and clustering these co-occurrence distributions gives us the latent components, as represented by the centers of the clusters.
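The LSI half of this comparison can be sketched in a few lines. Below is a minimal illustration, with a made-up toy term-document matrix, of how a truncated SVD keeps only the k strongest latent components; the term labels are hypothetical and only for illustration.

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents),
# holding raw term counts. The terms are illustrative, not from the post.
A = np.array([
    [2, 0, 1, 0],   # "ship"
    [1, 0, 2, 0],   # "boat"
    [0, 3, 0, 1],   # "wood"
    [0, 1, 0, 2],   # "tree"
], dtype=float)

# LSI: factor the matrix and keep only the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k approximation

# A_k is the best rank-2 approximation of A in the least-squares sense;
# terms that co-occur in similar contexts get pulled toward the same component.
print(np.round(A_k, 2))
```

Each of the k retained singular directions plays the role of one hidden semantic component.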
We use similar clustering methods for document classification as well. Formally, we use the vector space model, which represents the documents of a collection as vectors whose weights correspond to the terms appearing in each document.
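The clustering step mentioned above can be sketched as a few k-means iterations over document vectors, where the resulting cluster centers stand in for the latent components. The data and the choice of initial centers below are illustrative, not from the post.

```python
import numpy as np

# Made-up document vectors in a 2-term vocabulary; two obvious topics.
docs = np.array([
    [0.90, 0.10], [0.80, 0.20], [0.85, 0.15],   # topic one
    [0.10, 0.90], [0.20, 0.80], [0.15, 0.85],   # topic two
])

# A few plain k-means steps; we initialize with two documents for determinism.
k = 2
centers = docs[[0, 3]].copy()
for _ in range(10):
    # Assign each document to its nearest center (squared Euclidean distance).
    dists = ((docs[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    labels = np.argmin(dists, axis=1)
    # Move each center to the mean of the documents assigned to it.
    centers = np.array([docs[labels == j].mean(axis=0) for j in range(k)])

print(labels, np.round(centers, 3))
```

The final `centers` are the cluster representatives; in the PLSI reading above, they correspond to the latent components.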
The vector space model can be either a boolean or a term-weighted model. In the boolean model, the presence or absence of each term is given a boolean value. In the term-weighted model, the same term may be assigned different weights by different weighting methods, such as tf-idf or frequency ratio. When the weights add up to one, they have been normalized; typically probabilities or conditional probabilities, which sum to one, are used.
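A minimal sketch of the term-weighted case, using tf-idf followed by the normalization just described, might look like this; the documents and the exact tf-idf variant (raw term frequency times log inverse document frequency, no smoothing) are assumptions for illustration.

```python
import math
from collections import Counter

# Illustrative toy collection.
docs = [
    "the ship sailed the sea".split(),
    "the tree grew in the wood".split(),
    "the boat and the ship".split(),
]

N = len(docs)
df = Counter()                      # document frequency of each term
for d in docs:
    for term in set(d):
        df[term] += 1

def tfidf(doc):
    counts = Counter(doc)
    # tf-idf weight: relative term frequency times log inverse document frequency.
    weights = {t: (c / len(doc)) * math.log(N / df[t]) for t, c in counts.items()}
    # Normalize so the weights add up to one, as described above.
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()} if total else weights

w = tfidf(docs[0])
```

Note that a term appearing in every document (here "the") gets idf = log(1) = 0, so normalization assigns all the weight to the discriminating terms.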
One of the advantages of the vector space model is that it enables relevance ranking of documents of heterogeneous formats with respect to user queries, as long as the attributes are well-defined characteristics of the documents.
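Relevance ranking in the vector space model is commonly done with cosine similarity between the query vector and each document vector. A minimal sketch, with hypothetical document weights, could be:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(u.get(t, 0.0) * v.get(t, 0.0) for t in set(u) | set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Illustrative term-weight vectors for two documents and a one-term query.
doc_vectors = {
    "d1": {"ship": 0.6, "sea": 0.4},
    "d2": {"tree": 0.7, "wood": 0.3},
}
query = {"ship": 1.0}

# Rank documents by similarity to the query, most relevant first.
ranked = sorted(doc_vectors, key=lambda d: cosine(query, doc_vectors[d]),
                reverse=True)
```

Because only the weight vectors matter, documents of heterogeneous formats can be ranked together once they are mapped into the same term space.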
