Wednesday, November 13, 2013

In the paper titled Kullback-Leibler distance for text categorization, Brigitte Bigi introduces a tf-idf classifier and a KLD classifier for text categorization. Documents are represented by term vectors: each component corresponds to one of the selected content terms, and its value is the weight of that term in the document. Each term is assumed to occur at least once in at least one training document. In the tf-idf case, each weight combines the term frequency with the inverse document frequency. A lower idf means the term appears in more documents and is consequently a poorer discriminating content word. The weight is zero when the term does not occur in document dj; for the terms that do occur, it equals the count of the term in that document multiplied by its idf, where the idf is the log of the total number of training documents divided by the document frequency of the term. The classifier for the tf-idf scheme is based on cosine similarity, i.e. the sum of the component-wise products of two vectors divided by the product of their magnitudes.
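To make the tf-idf scheme concrete, here is a minimal sketch in Python; the function names and the data layout (documents as lists of tokens) are my own, not the paper's.

```python
import math
from collections import Counter

def idf_weights(training_docs):
    """idf(t) = log(N / df(t)), where df(t) is the number of training documents containing t."""
    n = len(training_docs)
    df = Counter(term for doc in training_docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}

def tfidf_vector(doc, idf):
    """Weight of a term = its count in the document times its idf; terms not in the document get no entry (weight zero)."""
    tf = Counter(doc)
    return {t: tf[t] * idf[t] for t in tf if t in idf}

def cosine(u, v):
    """Sum of component-wise products divided by the product of the vector magnitudes."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0
```

A document is then assigned to the category whose tf-idf vector has the highest cosine similarity with the document's vector.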
The KLD classifier works similarly but measures how different two probability distributions are. The KL divergence of probability distributions P, Q on a finite set X is the sum over x of P(x) log(P(x)/Q(x)), which measures how costly it is to assume the distribution Q when terms are actually drawn from P. The author notes that a better distance measure is the symmetric variant, the sum over x of (P(x) - Q(x)) log(P(x)/Q(x)). With this distance measure, a document can be assigned to the category whose term distribution is closest to it. Note that the document vectors may have a lot of zeros for the terms that don't appear in the document. For these, the author describes a backoff scheme. In general, C is the number of categories and V is the number of terms drawn from all the categories. For each category, a distribution over only the selected terms is computed from the corpus, and the term distribution of a sample document is compared with the distribution of each category. In the backoff scheme, the observed term frequencies are discounted, and all the terms that don't occur in the document are given a small default value, treating them as if they were unknown words. This is done so that the distances don't come out to be infinite.
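As a sketch of this distance and of the backoff default (the epsilon value and the names here are my assumptions, not taken from the paper):

```python
import math

EPSILON = 1e-6  # assumed small default probability for terms treated as unknown words

def kld_distance(p, q, vocabulary, eps=EPSILON):
    """Sum over the vocabulary of (P(x) - Q(x)) * log(P(x) / Q(x)).

    Terms missing from either distribution fall back to the small default eps,
    so no ratio is ever zero or infinite."""
    total = 0.0
    for x in vocabulary:
        px = p.get(x, eps)
        qx = q.get(x, eps)
        total += (px - qx) * math.log(px / qx)
    return total
```

Because each summand (P(x) - Q(x)) log(P(x)/Q(x)) is non-negative, this distance is symmetric and is zero only when the two distributions agree on every term of the vocabulary.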
A document dj is represented by a term vector of probabilities, and a category ci is likewise represented by a term vector of probabilities. The probability P(tk|dj) of a term tk in a document dj is the frequency of that term in the document divided by the total frequency of all terms in the document when the term appears in dj, and zero otherwise (these zeros are the entries the backoff default replaces). Similarly, the probability P(tk|ci) of a term tk in a category ci is the frequency of the term in that category divided by the total frequency of all terms in the category. The resulting distributions for a document and for a category are normalized to unit length, i.e. the default values and the computed ratios add up to 1. The categorization method based on the KL distance then computes the distance between a category and a document as the sum over terms of the difference P(tk|ci) - P(tk|dj) multiplied by the log of the ratio P(tk|ci)/P(tk|dj).
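A sketch of how these distributions might be estimated from raw counts follows; the simple discounting used here just reserves enough probability mass so that the default values and the observed ratios sum to 1, and the paper's exact discounting scheme may differ.

```python
from collections import Counter

def term_distribution(tokens, vocabulary, eps=1e-6):
    """P(t | text): discounted relative frequency for terms occurring in the text,
    the default eps for vocabulary terms that do not, with everything summing to 1."""
    counts = Counter(t for t in tokens if t in vocabulary)  # keep only selected content terms
    total = sum(counts.values())
    missing = sum(1 for t in vocabulary if t not in counts)
    beta = 1.0 - missing * eps  # probability mass left for the observed terms after the defaults
    return {t: beta * counts[t] / total if t in counts else eps for t in vocabulary}
```

The same function can be applied to a single document or to the concatenation of a category's training documents, giving P(tk|dj) and P(tk|ci) respectively.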
Notice that this has four cases: tk appears in both document dj and category ci, tk appears in document dj but not in category ci, tk appears in category ci but not in document dj, and tk appears in neither. In the last case, the term degenerates to an unknown word. Finally, we normalize the distance using the distance computed for the document dj and the distance computed for an empty document.
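Tying the pieces together, here is one possible reading of the classification step; the division by the empty document's distance is my interpretation of the normalization described above, and the helpers come from the earlier sketches.

```python
def categorize(doc_tokens, category_tokens, vocabulary, eps=1e-6):
    """Pick the category whose term distribution is closest to the document's.

    The backoff defaults inside term_distribution cover all four cases (term present
    in both, in only one, or in neither), and each raw distance is normalized by the
    distance of the empty document, whose distribution is all defaults."""
    p_doc = term_distribution(doc_tokens, vocabulary, eps)
    p_empty = term_distribution([], vocabulary, eps)  # every term treated as unknown
    best_category, best_score = None, float("inf")
    for category, tokens in category_tokens.items():
        p_cat = term_distribution(tokens, vocabulary, eps)
        score = kld_distance(p_cat, p_doc, vocabulary, eps) / kld_distance(p_cat, p_empty, vocabulary, eps)
        if score < best_score:
            best_category, best_score = category, score
    return best_category
```

Here category_tokens is assumed to map each category name to the concatenated tokens of its training documents.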
To extract keywords, then, we can add to the empty document the terms that maximize this distance, as sketched below. It is also clear that we need the categories and the training documents for this keyword extraction; we do not want to rely on the test document or its summary itself.
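As a very rough sketch of that idea: the greedy loop, the candidate set, and the choice of the category distribution as the reference the distance is measured against are all my own assumptions, since the paragraph above does not spell them out.

```python
def extract_keywords(cat_tokens, vocabulary, k=10, eps=1e-6):
    """Grow a pseudo-document from the empty document, at each step adding the
    vocabulary term whose inclusion maximizes the distance to the category's
    term distribution; the chosen terms are returned as keywords."""
    p_cat = term_distribution(cat_tokens, vocabulary, eps)
    keywords = []
    for _ in range(k):
        best_term, best_dist = None, float("-inf")
        for t in vocabulary:
            if t in keywords:
                continue
            p_pseudo = term_distribution(keywords + [t], vocabulary, eps)
            dist = kld_distance(p_cat, p_pseudo, vocabulary, eps)
            if dist > best_dist:
                best_term, best_dist = t, dist
        keywords.append(best_term)
    return keywords
```

Whether the distance should be maximized against a single category, against the whole corpus, or in its normalized form is left open above, so this only illustrates the greedy mechanics.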
