Monday, November 4, 2013

I will continue today's discussion on the fuzzy SKWIC implementation. We are getting closer to covering the implementation details. The computations for the feature weights, the fuzzy membership labels, and tau involve components that we could be better prepared for. Although we have discussed the sequence of these computations, we have not yet looked at the storage or data structures involved; a rough sketch follows below. And to pick up the thread where we left off in the previous post, we have not discussed whether the sample data collections are sufficient for our case, since the test document could be from any origin, whereas the samples we collected were drawn from a predefined, finite set of document categories. Hence the given test document could end up as noise in a cluster.
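Since we have not settled on the storage yet, here is one possible in-memory layout for the quantities mentioned above. The array names, dimensions, and the use of numpy are my own assumptions for illustration, not a finalized design.

import numpy as np

# Hypothetical layout for the fuzzy SKWIC quantities discussed above.
# N documents, C clusters, T terms in the vocabulary; all names are my own.
N, C, T = 1000, 5, 5000

# TF-IDF document vectors (documents as rows, terms as columns)
X = np.zeros((N, T))

# fuzzy membership of each document in each cluster; each row sums to 1
U = np.full((N, C), 1.0 / C)

# per-cluster feature (keyword) relevance weights; each row sums to 1
V = np.full((C, T), 1.0 / T)

# cluster centers in the same term space as the documents
centers = np.zeros((C, T))

# one tau (regularization) value per cluster, updated each iteration
tau = np.ones(C)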
"up as a noise in a cluster. So we have to avoid that as much as possible. One way would have been to used conditional probability. but we did not proceed with it because we will have to use PLSI techniques such as a probability that a randomly selected occurrence of term t has source document d and the randomly selected term occurrence from d is an instance of term t as Wartena put it. There is a standard to building a Vector Space Model for information retrieval. One of the advantages of the method is that it enables relevance ranking of documents of heterogeneous format (eg text, multilingual text, ) with respect to the user input queries as long as the attributes are well defined characteristics of the documents. If the attribute is present, the corresponding co-ordinate for the document vector is unity and when the attribute is absent, the coordinate is zero. This is called the binary vector model. Term weighting is another model that is a refinement over the Boolean model. When the term-weight is Term frequency  Inverse document frequency TF-IDF , the weight of the ith term in the jth document is defined as weight(i, j) = (1 + tf) log n /df when the term frequency is >= 1 and 0 if the term frequency is equal to zero and where df is the number of documents in which the term appears out of n. Given a single query term , two documents can be ranked for similarity based on the difference in the angles their vectors make with respect to q and hence the use of cosine. Cosine is simpler to use than the computing the Euclidean distance since we want relative comparision. Every query is modeled as  a vector using the same attribute space as the documents. Even this similarity measure is too time-consuming for computation in real time for large databases. This is a serious concern for massive databases.  In addition, users consistently select the most relevant features from the IR engines. Scalability of such techniques suffers simply because of the large number of dimensions involved in such computations.
One approach is to reduce the number of dimensions by transforming the space into a subspace with sufficiently few dimensions to enable a fast response time. The subspace should still be able to discriminate the contents of individual documents. This is where Latent Semantic Indexing helps, and when the weights are probabilities that add up to unity, we call it Probabilistic Latent Semantic Indexing. For our implementation, we are first concerned with a working implementation before taking up LSI or PLSI and other performance optimization techniques.
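For when we do get to that optimization stage, the following is a small sketch of how LSI could project documents into a lower-dimensional subspace using a truncated SVD of the term-document matrix. It assumes numpy and a chosen rank k, and is not part of the current implementation.

import numpy as np

def lsi_project(term_doc_matrix, k):
    """Project documents into a k-dimensional latent subspace via truncated SVD (LSI)."""
    # term_doc_matrix: terms as rows, documents as columns (e.g. TF-IDF weights)
    U, s, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    # Keep only the k largest singular values and their vectors.
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
    # Each column of doc_coords is a document in the reduced latent space.
    doc_coords = np.diag(sk) @ Vtk
    # A query q can be folded in as: q_k = np.linalg.inv(np.diag(sk)) @ Uk.T @ q
    return Uk, sk, doc_coords

# Example: a tiny random term-document matrix reduced to 2 latent dimensions.
A = np.random.rand(50, 10)
Uk, sk, docs_2d = lsi_project(A, 2)
print(docs_2d.shape)  # (2, 10)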
