Probability distributions are useful for topic analysis. In Wartena's paper on topic detection, the similarity between two terms is defined as a distance between the distributions associated with the keywords, obtained by counting co-occurrences in documents. So let us elaborate on the probability distributions used for topic analysis.

Suppose we have a collection C of documents and a set T of term occurrences, where each occurrence can be found in exactly one source document d in C. We consider the natural probability distributions Q on C x T that measure the probability of randomly selecting an occurrence of a term, a source document, or both. Let n(d,t) be the number of occurrences of term t in document d, let n(t) be the number of occurrences of t across all documents, and let n be the total number of term occurrences. Then the probability of randomly selecting an occurrence of term t is q(t) = n(t)/n on T. If N(d) is the sum of all term occurrences in d, then the conditional probability of selecting term t given document d is Q(t|d) = qd(t) = n(d,t)/N(d) on T.

We can then measure the similarity between two terms with a metric d(i,j) between elements i and j, that is, a distance function satisfying non-negativity, identity of indiscernibles, symmetry, and the triangle inequality. Two elements are more similar the closer they are. A suitable similarity measure is the Jensen-Shannon divergence, or information radius, between two distributions p and q, defined as JSD(p||q) = 1/2 D(p||m) + 1/2 D(q||m), where m = 1/2 (p + q) is the mean distribution and D(p||q) is the relative entropy, defined by D(p||q) = Sum over t of pt log(pt/qt).
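The Jensen-Shannon divergence above can be sketched in Python as follows; representing a distribution as a dict from outcomes to probabilities is my own choice, not something prescribed by the paper:

```python
import math

def kl_divergence(p, q):
    """Relative entropy D(p || q) = sum_t p_t * log(p_t / q_t).

    p and q are dicts mapping outcomes to probabilities; terms with
    p_t == 0 contribute nothing to the sum.
    """
    return sum(pt * math.log(pt / q[t]) for t, pt in p.items() if pt > 0)

def jensen_shannon(p, q):
    """JSD(p || q) = 1/2 D(p || m) + 1/2 D(q || m), with m = (p + q) / 2."""
    keys = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in keys}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Note that JSD is always defined even when p and q have disjoint support, because the mean distribution m is nonzero wherever either of them is; it ranges from 0 (identical distributions) to log 2 (disjoint distributions).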
For a developer, implementing the probability distributions above breaks out into the following cases:
1) Probability distribution Q(d,t) = n(d,t)/n on C x T. Here n(d,t), as mentioned earlier, is the number of occurrences of term t in d, and n is the total number of term occurrences across the collection.
2) Probability distribution Q(d) = N(d)/n on C. Here N(d) is the number of term occurrences in document d, defined as the sum of n(d,t) over all terms t in d.
3) Probability distribution q(t) = n(t)/n on T, where n(t) is the number of occurrences of term t across all documents.
Note that n(t) on T and N(d) on C are similar in that each is a sum of n(d,t) over one coordinate: n(t) sums over all documents and N(d) sums over all terms.
These we use for the respective conditional probabilities Q(d|t) = Qt(d) = n(d,t)/n(t) on C and
Q(t|d) = qd(t) = n(d,t)/N(d) on T.
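The three distributions can be built in one pass over the counts. The sketch below assumes the corpus is given as a dict from document id to its list of term occurrences; that input layout and the function name are my own, not from the paper:

```python
from collections import Counter

def distributions(corpus):
    """corpus: dict mapping document id d -> list of term occurrences in d.

    Returns (Q_dt, Q_d, q_t, n) built from the counts n(d, t):
      Q_dt[(d, t)] = n(d, t) / n   on C x T
      Q_d[d]       = N(d) / n      on C
      q_t[t]       = n(t) / n      on T
    """
    n_dt = Counter()
    for d, terms in corpus.items():
        for t in terms:
            n_dt[(d, t)] += 1
    n = sum(n_dt.values())          # total number of term occurrences
    N_d = Counter()                 # N(d): term occurrences in document d
    n_t = Counter()                 # n(t): occurrences of t over all documents
    for (d, t), c in n_dt.items():
        N_d[d] += c
        n_t[t] += c
    Q_dt = {k: c / n for k, c in n_dt.items()}
    Q_d = {d: c / n for d, c in N_d.items()}
    q_t = {t: c / n for t, c in n_t.items()}
    return Q_dt, Q_d, q_t, n
```

Each of the three dicts sums to 1 over its own domain, which is a quick sanity check that the marginals were accumulated correctly.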
As a recap of the various terms, we use the following table:
Term - Description
t - a distinct term whose occurrences are being studied (typically these are representative of topics)
d - one of the documents in the collection
C - the collection of documents d1, d2, ..., dm being studied
T - the set of term occurrences t1, t2, ..., tn, each of which can be found in exactly one source document
n(d,t) - the number of occurrences of term t in document d
n(t) - the cumulative number of occurrences of term t across all documents
n - the total number of term occurrences
N(d) - the cumulative number of term occurrences in document d
Q(d,t) - the distribution n(d,t)/n on C x T; pertains to both documents and terms
Q(d) - the distribution N(d)/n on C; pertains to documents
q(t) - the distribution n(t)/n on T; pertains to terms
Q(d|t) - the source distribution of t: the probability that a randomly selected occurrence of term t has source d
Q(t|d) - the term distribution of d: the probability that a randomly selected term occurrence from document d is an instance of term t
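Putting the pieces together, two terms can be compared by the Jensen-Shannon divergence between their source distributions Q(d|t). The toy corpus, the function names, and the choice of comparing source distributions (rather than co-occurrence distributions as Wartena does) are my own simplifications for illustration:

```python
import math

def source_distribution(corpus, t):
    """Q(d | t) = n(d, t) / n(t): the probability that a randomly
    selected occurrence of term t has source document d."""
    n_dt = {d: terms.count(t) for d, terms in corpus.items()}
    n_t = sum(n_dt.values())
    return {d: c / n_t for d, c in n_dt.items() if c}

def jsd(p, q):
    """Jensen-Shannon divergence between two distributions (dicts)."""
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in set(p) | set(q)}
    d = lambda a: sum(v * math.log(v / m[k]) for k, v in a.items() if v > 0)
    return 0.5 * d(p) + 0.5 * d(q)

# Toy corpus: two terms are similar when their occurrences are spread
# over the documents in nearly the same way (small divergence).
corpus = {
    "d1": ["topic", "model", "topic"],
    "d2": ["topic", "graph"],
    "d3": ["graph", "model"],
}
sim = jsd(source_distribution(corpus, "topic"),
          source_distribution(corpus, "model"))
```

Smaller values of the divergence mean the two terms are distributed over the documents in a similar way, i.e. they are closer and hence more similar.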