Sunday, May 5, 2013

calculating distance measure

Similarity and distance measures between terms require that probabilities and conditional probabilities for the terms be computed. We rely on the corpus text to compute these. In addition, we use a naive Bayes classifier to determine the probability of term occurrences. Some of these probabilities were mentioned in an earlier post, but today we look at whether they need to be calculated on the fly as we cluster the terms. Probabilities associated with the corpus text can be calculated in advance of processing a given text. For example, the probability of selecting an occurrence of a term from a source document, given by the number of occurrences of that term in the document relative to the total number of occurrences in the corpus, is something we can calculate and keep.
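As a rough sketch of what this precomputation could look like (the toy corpus, the variable names, and the choice of denominator are my own assumptions for illustration, not code from this project):

```python
from collections import Counter

# Toy corpus: each document is a list of already tagged and parsed terms.
# Documents, terms, and counts here are illustrative only.
corpus = {
    "doc1": ["cluster", "term", "distance", "term"],
    "doc2": ["distance", "measure", "cosine", "term"],
    "doc3": ["cluster", "keyword", "measure"],
}

# n(d, t): occurrences of term t in document d; |T|: all occurrences in the corpus.
n_dt = {d: Counter(terms) for d, terms in corpus.items()}
total_occurrences = sum(len(terms) for terms in corpus.values())

def p_occurrence(term, doc):
    """Probability of selecting an occurrence of `term` from document `doc`
    when drawing uniformly from all occurrences in the corpus."""
    return n_dt[doc][term] / total_occurrences

# Depends only on the corpus, so it can be calculated in advance and kept.
print(p_occurrence("term", "doc1"))  # 2 / 11
```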
The distance measure itself is calculated once for each term that we evaluate from the document. If we choose a measure like the Jaccard coefficient, we evaluate the parts corresponding to each term in the pair. The calculation is a bit different when we use cosine similarity (Wartena 2008) between terms, because we now use the sums of the products of the respective probabilities as well as the sums of their squares. The distance measure is calculated as one minus the cosine similarity.
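A minimal sketch of both distances, assuming a term is represented either by the set of documents it occurs in (for Jaccard) or by a vector of probabilities (for cosine); the representations and example values are assumptions:

```python
import math

def jaccard_distance(docs_t, docs_z):
    """One minus the Jaccard coefficient of the document sets of two terms."""
    docs_t, docs_z = set(docs_t), set(docs_z)
    union = docs_t | docs_z
    return 1.0 - len(docs_t & docs_z) / len(union) if union else 0.0

def cosine_distance(p, q):
    """One minus the cosine similarity of two probability vectors: the sum of
    the products divided by the square roots of the sums of the squares."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm = math.sqrt(sum(pi * pi for pi in p)) * math.sqrt(sum(qi * qi for qi in q))
    return 1.0 - dot / norm if norm else 1.0

print(jaccard_distance({"doc1", "doc2"}, {"doc2", "doc3"}))  # 1 - 1/3
print(cosine_distance([0.2, 0.1, 0.0], [0.1, 0.2, 0.1]))
```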
Both the terms and the measure depend on summing probabilities over all documents. These documents come from a collection C, and each term occurrence in the occurrence collection T is found in exactly one source document. This doesn't mean that other documents cannot contain occurrences of the same term, only that this particular instance of the term cannot belong to multiple documents. So each occurrence is uniquely identified by its term, its document, and its position. When we want the number of occurrences of a term, we sum its occurrences over all the documents in the collection.
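One way to make this bookkeeping concrete is to key each occurrence by its (term, document, position) tuple; the sketch below is my own illustration of that, not code from the project:

```python
from collections import Counter

# Each occurrence is uniquely identified by a (term, document, position) tuple.
occurrences = [
    ("term", "doc1", 1), ("term", "doc1", 3),
    ("term", "doc2", 3), ("cluster", "doc1", 0),
]

# n(d, t): number of occurrences of each term in each document.
n_dt = Counter((doc, term) for term, doc, _pos in occurrences)

def n_t(term, documents):
    """Occurrences of `term`, summed over all documents in the collection."""
    return sum(n_dt[(doc, term)] for doc in documents)

print(n_t("term", ["doc1", "doc2", "doc3"]))  # 3
```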
We also consider Markov chain evolution for finding distributions of co-occurring terms. The first step is to calculate the probability of finding an occurrence in a particular document, given that the occurrences are distributed over terms according to p. We find this as a sum over all the terms.
If we have a document distribution instead of a term distribution, we similarly compute the probability of finding a particular term occurrence, summing over all the documents. This amounts to a weighted average of the term distributions of the documents.
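A sketch of these two single steps, assuming the conditional probabilities have already been estimated from the occurrence counts (the small matrices and names here are illustrative assumptions):

```python
# Conditional probabilities estimated from the occurrence counts (toy values):
# Q_d_given_t[t][d] = n(d, t) / n(t) and q_t_given_d[d][t] = n(d, t) / N(d).
Q_d_given_t = {
    "term":    {"doc1": 2 / 3, "doc2": 1 / 3},
    "cluster": {"doc1": 1.0},
}
q_t_given_d = {
    "doc1": {"term": 2 / 3, "cluster": 1 / 3},
    "doc2": {"term": 1.0},
}

def step_to_documents(p_terms):
    """One step of the chain: from a distribution over terms to the induced
    distribution over documents, summing over all terms."""
    docs = {}
    for t, pt in p_terms.items():
        for d, q in Q_d_given_t[t].items():
            docs[d] = docs.get(d, 0.0) + q * pt
    return docs

def step_to_terms(p_docs):
    """One step of the chain in the other direction: the weighted average of
    the documents' term distributions."""
    terms = {}
    for d, pd in p_docs.items():
        for t, q in q_t_given_d[d].items():
            terms[t] = terms.get(t, 0.0) + q * pd
    return terms
```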
We can combine the two steps above by evaluating the chain twice, which gives a new distribution that we use as the distribution of terms t co-occurring with a given term z. By that we mean we find the distribution over terms after first finding, for the starting term, the distribution over documents. This gives an indication of how densely a keyword is used in a document rather than the mere occurrence or non-occurrence of the keyword in a document. Otherwise it is similar to the previous model.
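Continuing the sketch above and reusing step_to_documents and step_to_terms, the distribution of terms co-occurring with a term z can be obtained by starting from the distribution concentrated on z and taking two steps of the chain:

```python
def co_occurring_terms(z):
    """Distribution of terms co-occurring with z: two steps of the chain,
    starting from the distribution concentrated entirely on z."""
    p_docs = step_to_documents({z: 1.0})
    return step_to_terms(p_docs)

print(co_occurring_terms("term"))
# with the toy numbers above: {'term': 7/9, 'cluster': 2/9}
```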
The document collection plays an important role in the evaluation of the probabilities. A good choice of documents and good processing of them will improve the results of the keyword analysis here. The corpus text is a comprehensive collection of documents and has already been tagged and parsed. While the corpus text could be improved, for example by substituting pronouns with their corresponding nouns so that the frequency and distribution of terms are better represented, the existing set of documents, their variety, and their size are sufficient for reasonable results from the term set.



 
