Sunday, April 28, 2013

probability distribution in topic analysis

Probability distributions are useful for the analysis of topics. In Wartena's paper on topic detection, the similarity between terms is defined as a distance between the distributions associated with keywords, obtained by counting co-occurrences in documents. So we look into the probability distributions useful for topic analysis. Let's say we have a set T of term occurrences, each of which can be found in exactly one source document d in a collection C. We consider the natural probability distribution Q on C x T that measures the probability to randomly select an occurrence of a term, a source document, or both. Let n(d,t) be the number of occurrences of term t in document d, and let n be the total number of term occurrences. Then the probability to randomly select an occurrence of term t is q(t) = n(t)/n on T, where n(t) is the number of occurrences of t across all documents. If N(d) is the sum of all term occurrences in d, then the conditional probability that an occurrence randomly selected from d is an instance of t is Q(t|d) = q_d(t) = n(d,t)/N(d) on T. We can then measure the similarity between two terms with a metric d(i,j) between elements i and j, that is, a function satisfying non-negativity, identity of indiscernibles, symmetry, and the triangle inequality; two elements are more similar the closer together they are. The similarity measure used is the Jensen-Shannon divergence (or information radius) between two distributions p and q, defined as JSD(p||q) = 1/2 D(p||m) + 1/2 D(q||m), where m = 1/2 (p + q) is the mean distribution and D(p||q) is the relative entropy, defined by D(p||q) = Sum over t of p_t log(p_t/q_t).
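The divergence above is straightforward to compute directly from its definition. Here is a minimal Python sketch (the function names are my own, and distributions are assumed to be given as equal-length lists of probabilities over the same support); terms with p_t = 0 contribute nothing to the relative entropy:

```python
import math

def relative_entropy(p, q):
    """D(p || q) = sum over t of p_t * log(p_t / q_t); a term with p_t == 0 contributes 0."""
    return sum(pt * math.log(pt / qt) for pt, qt in zip(p, q) if pt > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: 1/2 D(p||m) + 1/2 D(q||m) with m the mean distribution."""
    m = [(pt + qt) / 2 for pt, qt in zip(p, q)]
    return 0.5 * relative_entropy(p, m) + 0.5 * relative_entropy(q, m)
```

Note that unlike the raw relative entropy, JSD is symmetric and always finite: the mean distribution m is nonzero wherever p or q is, so the logarithms never blow up.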
For a developer, implementing a probability distribution such as the above could be broken out into the following cases:
1) Probability distribution Q(d,t) = n(d,t)/n on C X T. Here n(d,t), as mentioned earlier, is the number of occurrences of term t in d, and n is the total number of term occurrences across the collection.
2) Probability distribution Q(d) = N(d)/n on C. Here N(d) is the number of term occurrences in document d, defined as the sum of n(d,t) over all terms t in d.
3) Probability distribution  q(t) = n(t)/n on T where n(t) is the number of occurrences of term t across all documents.
Note that n(t) and N(d) are similar in that each sums n(d,t) over one of the two coordinates: n(t) over the documents in C, and N(d) over the terms in T.
These give us the respective conditional probabilities Q(d|t) = Q_t(d) = n(d,t)/n(t) on C and
Q(t|d) = q_d(t) = n(d,t)/N(d) on T.
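The three distributions and the two conditionals can all be derived in one pass over the raw counts. The sketch below is a minimal illustration under an assumed representation: the counts n(d,t) arrive as a dict keyed by (document, term) pairs, and the function name and return shape are my own choices:

```python
from collections import defaultdict

def distributions(counts):
    """Given counts[(d, t)] = n(d, t), return the five distributions described above
    as plain dicts: (Q_dt, Q_d, q_t, Q_d_given_t, Q_t_given_d)."""
    n = sum(counts.values())              # n: total number of term occurrences
    n_t = defaultdict(int)                # n(t): occurrences of t summed over documents
    N_d = defaultdict(int)                # N(d): occurrences of all terms within d
    for (d, t), c in counts.items():
        n_t[t] += c
        N_d[d] += c
    Q_dt = {(d, t): c / n for (d, t), c in counts.items()}              # Q(d,t) on C x T
    Q_d = {d: v / n for d, v in N_d.items()}                            # Q(d) on C
    q_t = {t: v / n for t, v in n_t.items()}                            # q(t) on T
    Q_d_given_t = {(d, t): c / n_t[t] for (d, t), c in counts.items()}  # Q(d|t) on C
    Q_t_given_d = {(d, t): c / N_d[d] for (d, t), c in counts.items()}  # Q(t|d) on T
    return Q_dt, Q_d, q_t, Q_d_given_t, Q_t_given_d
```

A quick sanity check on any implementation like this: Q(d,t), Q(d), and q(t) should each sum to 1, Q(d|t) should sum to 1 over documents for each fixed t, and Q(t|d) should sum to 1 over terms for each fixed d.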
As a recap of the various terms, we use the following table:
Term         Description
t            a distinct term whose occurrences are being studied (typically these are representative of topics)
d            one of the documents
C            the collection of documents d1, d2, ..., dm being studied
T            the set of term occurrences t1, t2, ..., tn, such that each occurrence can be found in exactly one source document
n(d,t)       the number of occurrences of term t in d
n(t)         the cumulative number of occurrences of term t across all documents
n            the total number of term occurrences
N(d)         the cumulative number of term occurrences in d
Q(d,t)       the distribution n(d,t)/n on C X T; pertains to both documents and terms
Q(d)         the distribution N(d)/n on C; pertains to documents
q(t)         the distribution n(t)/n on T; pertains to terms
Q(d|t)       the source distribution of t: the probability that a randomly selected occurrence of term t has source d
Q(t|d)       the term distribution of d: the probability that a randomly selected term occurrence from document d is an instance of term t
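Putting the pieces together: in Wartena's approach, the distance between two terms can be taken as the Jensen-Shannon divergence between their source distributions Q(d|t1) and Q(d|t2) over the document collection. The following self-contained sketch (again with an assumed dict-of-counts input and my own function name) illustrates the idea end to end:

```python
import math

def term_distance(counts, t1, t2):
    """Distance between two terms: JSD between their source distributions
    Q(d|t1) and Q(d|t2) over the document collection."""
    docs = sorted({d for (d, _t) in counts})

    def source_dist(t):
        # Q(d|t) = n(d,t)/n(t), as a vector indexed by document
        n_t = sum(c for (d, tt), c in counts.items() if tt == t)
        return [counts.get((d, t), 0) / n_t for d in docs]

    def rel_entropy(p, q):
        # D(p || q), with a zero p_t contributing 0
        return sum(pt * math.log(pt / qt) for pt, qt in zip(p, q) if pt > 0)

    p, q = source_dist(t1), source_dist(t2)
    m = [(pt + qt) / 2 for pt, qt in zip(p, q)]
    return 0.5 * rel_entropy(p, m) + 0.5 * rel_entropy(q, m)
```

Terms that occur in the same documents in the same proportions get distance 0, while terms whose occurrences never share a document get the maximum distance of log 2, which is what makes this usable as input to a clustering step for topic detection.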
