Today I wrote a document instead of my post. But I did follow up on my suggestion to use KLD starting from an empty document and adding terms that differentiate it from the background, based on a threshold.
Sunday, November 17, 2013
Saturday, November 16, 2013
In the previous post, we discussed using KLD for narrow-domain short texts. Here we will cover the feature selection techniques used in that discussion. The first technique is Document Frequency (DF), which assigns to each term t the value df_t, the number of texts in the collection in which t occurs. The second technique is term strength, where weights are assigned to each term t based on its occurrence in pairs of similar texts. The third is the transition point, which is found from the vocabulary frequencies of each text.
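As a minimal sketch of the first technique, document frequency can be computed by counting, for each term, the number of texts it occurs in; the toy collection and the threshold of 2 below are hypothetical:

from collections import Counter

def document_frequency(texts):
    # texts: a list of token lists; df[t] = number of texts in which term t occurs
    df = Counter()
    for tokens in texts:
        for term in set(tokens):
            df[term] += 1
    return df

# hypothetical toy collection
texts = [["kld", "distance", "text"], ["text", "cluster"], ["kld", "text"]]
df = document_frequency(texts)
selected = [t for t, count in df.items() if count >= 2]   # keep terms above a DF threshold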
Apart from these three techniques, there was a mention of extracting keywords based on KLD itself, i.e. we find the terms that contribute most to the distance between two probability distributions P and Q. We had proposed two approaches. The first was to select a few terms at a time and keep the ones whose contribution was consistently higher than a threshold. The second was to select one term at a time and add it to an empty document so that the KLD distance stays higher than a threshold.
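A rough sketch of the first approach, scoring each term by its contribution to the symmetric KLD between a document distribution P and a background distribution Q and keeping the terms above a threshold; the probabilities and the threshold are made up for illustration:

import math

def kld_contribution(p, q):
    # contribution of a single term to the symmetric KLD: (p - q) * log(p / q)
    return (p - q) * math.log(p / q)

# hypothetical per-term probabilities for a document (P) and the background (Q)
P = {"kld": 0.05, "cluster": 0.03, "the": 0.20, "distance": 0.04}
Q = {"kld": 0.001, "cluster": 0.002, "the": 0.21, "distance": 0.003}

threshold = 0.01
keywords = [t for t in P if kld_contribution(P[t], Q[t]) > threshold]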
The clustering techniques used by the authors include the agglomerative clustering proposed by Johnson.
This is described by Borgatti as follows:
Let there be N items to be clustered and an N * N distance matrix to begin with:
Assign each item to its own cluster so that we have N clusters, each containing just one item. The distances between the clusters equal the distances between the items they contain.
The closest (most similar) pair of clusters is found and merged into a single cluster, so that we now have one cluster fewer.
The distances between the new cluster and each of the old clusters are computed.
The above two steps of merging and computing distances are repeated until we are left with a single cluster.
Computing the distance between the new cluster and the old clusters can be done in one of the following three ways:
single link - here the distance between one cluster and another is the shortest distance from any member of one cluster to any member of the other cluster. This method is also called the connectedness or minimum method.
complete link - here the distance between one cluster and another is the longest distance from any member of one cluster to any member of the other cluster. This method is also called the diameter or maximum method.
average link - here the distance between one cluster and another is the average distance between any member of one cluster to any member of the other cluster.
Notice that in the N * N matrix the distances along the diagonal, between an element and itself, are zero. As elements are combined into clusters, the cluster names are combinations of the names of the elements they contain.
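A compact sketch of this agglomerative procedure, with the linkage method passed in as a function so that single, complete, and average link are interchangeable; the three-item distance matrix is hypothetical:

def agglomerate(items, dist, linkage=min):
    # start with one singleton cluster per item; record each merge as it happens
    clusters = [(name,) for name in items]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairwise = [dist[a][b] for a in clusters[i] for b in clusters[j]]
                d = linkage(pairwise)          # min = single link, max = complete link
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]     # the cluster name combines the member names
        merges.append((clusters[i], clusters[j], d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# hypothetical symmetric distance matrix over three items
dist = {"a": {"a": 0, "b": 1, "c": 4},
        "b": {"a": 1, "b": 0, "c": 2},
        "c": {"a": 4, "b": 2, "c": 0}}
print(agglomerate(["a", "b", "c"], dist, linkage=min))      # single link
print(agglomerate(["a", "b", "c"], dist, linkage=max))      # complete link
print(agglomerate(["a", "b", "c"], dist,
                  linkage=lambda ds: sum(ds) / len(ds)))    # average link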
KLD and clustering
In a paper on clustering narrow-domain short texts with KLD by Pinto, Benedi & Rosso, there is a good example of such a clustering. Although KLD is used to calculate a distance between two probability distributions, here the authors use it to compare how similar or dissimilar two documents are so that the documents can be clustered. In this study, they use narrow-domain texts to show that the method works even in such a case. The challenge with narrow-domain texts is that term frequencies are very low and term selection has to be done carefully. If we take the vocabulary across all the texts, only a very small percentage of those terms might appear in any given short text, and even then their frequencies may be only one or two. Moreover, a small change in frequency can change the clustering results widely. Therefore this is a good test for the clustering. A real-life example of such a problem space is the categorization of abstracts of scientific papers by repositories. Although authors provide keywords, the keywords are not sufficient to classify the text, and some are often misleading, resulting in poor classification. Moreover, the number of categories is not known beforehand.
In their study, the authors use hierarchical clustering methods, with a symmetric KLD as the divergence measure. There are various ways of expressing KLD:
We have seen the asymmetric version as D(P || Q) = Sum over x of P(x) log(P(x)/Q(x)).
Then we have seen the symmetric one as D(P || Q) = Sum over x of (P(x) - Q(x)) log(P(x)/Q(x)).
There is also D(P || Q) = 1/2 [ D(P || (P + Q)/2) + D(Q || (P + Q)/2) ].
And finally D(P || Q) = max[ D(P || Q), D(Q || P) ].
Since the texts vary in their terms, the frequency of many terms is zero. To avoid zero probabilities, a back-off scheme is used, similar to what we discussed earlier. The document probability of a term is a constant times the probability of term tk in document dj if tk occurs in dj, and a constant epsilon otherwise; i.e. all the words from the vocabulary that don't appear in the document are treated the same as unknown words and given a default value. The probability of a term tk in dj is calculated as P(tk | dj) = tf(tk, dj) / Sum over tx in dj of tf(tx, dj). A document dj is represented by a term vector of probabilities, and the distance between two documents is the KLD measure. Finally, all the term vectors are normalized so that the document vectors are of unit length; the scaling factor for the terms occurring in a document is therefore 1 - sum(epsilon for each term not occurring in the document). Note that the convention has been to use a predetermined vocabulary; we don't invert the problem by building the vocabulary from the documents as we process them. In the case of this study, most of the corpora have several thousand terms, the vocabulary consists of a few thousand terms, and the document texts consist of about a hundred terms.
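A sketch of this representation, assuming a fixed vocabulary, a small epsilon for absent terms, and the scaling factor described above so that each document vector sums to one; the vocabulary and texts are toy examples:

import math

EPSILON = 1e-5   # back-off probability for vocabulary terms absent from a document

def doc_distribution(tokens, vocabulary):
    # term probabilities within the document, scaled so that, together with the
    # epsilons for absent terms, the vector sums to one
    counts = {t: tokens.count(t) for t in vocabulary}
    total = sum(counts.values())
    absent = [t for t in vocabulary if counts[t] == 0]
    beta = 1.0 - EPSILON * len(absent)          # scaling factor for present terms
    return {t: (beta * counts[t] / total if counts[t] else EPSILON) for t in vocabulary}

def symmetric_kld(p, q, vocabulary):
    # symmetric KLD used as the document-to-document distance
    return sum((p[t] - q[t]) * math.log(p[t] / q[t]) for t in vocabulary)

# hypothetical short texts and vocabulary
vocabulary = ["protein", "gene", "cluster", "abstract"]
d1 = doc_distribution("protein gene gene abstract".split(), vocabulary)
d2 = doc_distribution("cluster abstract abstract".split(), vocabulary)
print(symmetric_kld(d1, d2, vocabulary))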
Friday, November 15, 2013
Today let us just focus on an implementation topic for text categorizing software. How do we design a class for a clusterer? Let us take a specific example of a clusterer; we can generalize it later. We want a K-means clusterer that classifies text documents into categories/clusters. Text documents are represented by term vectors. Distances between documents and categories are measured by a distance function or metric. Documents can vary in number and can be quite large. The number of clusters is fixed. Clusters have centers, and the distance to the nearest center for a document can be found. The clusterer has a method called Classify. It reads the input documents, computes the distances, assigns the clusters, adjusts the cluster centers, and repeats until the cluster centers stabilize. The mechanism of the classify method is known. The class needs to be aware of the following:
cluster centers
distance function
document set
With the classify method, there will be an output where each document has been assigned a category label or cluster.
How do we test the classifier? We give it different sets of inputs and check the labels. What can we vary for this classifier? Surely we can vary the sample space: we can include a small or a large number of documents. We can check for the proper assignment of each document. We can vary the number of clusters. We can check the output for different distance functions. We may have to manually classify the samples before we run the clusterer. But how do we check the correctness of the program? We want an invariant, an initial condition, and a final state. How do we check progress, especially when the classify method takes a long time? We could add some supportability, such as progress reporting.
Next we could consider some design factors for flexibility.
We could decouple the distance-calculating function from the clusterer and use a design pattern to switch different distance functions into the class; a strategy pattern fits here.
We could also parameterize the input and the output so that we don't need to know what the vectors are that we cluster. We could create a base class for the clusterer and derive others from it. A sketch along these lines follows.
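A sketch of such a design, with the distance function injected as a strategy; all names here are illustrative rather than taken from any particular library:

import random

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

class KMeansClusterer:
    def __init__(self, num_clusters, distance=euclidean, max_iterations=100):
        self.num_clusters = num_clusters
        self.distance = distance            # strategy: any callable taking two vectors
        self.max_iterations = max_iterations
        self.centers = []

    def classify(self, documents):
        # documents: list of equal-length numeric vectors
        self.centers = random.sample(documents, self.num_clusters)
        labels = [0] * len(documents)
        for _ in range(self.max_iterations):
            # assign each document to its nearest center
            new_labels = [min(range(self.num_clusters),
                              key=lambda c: self.distance(doc, self.centers[c]))
                          for doc in documents]
            if new_labels == labels:
                break                        # assignments (and centers) have stabilized
            labels = new_labels
            # recompute each center as the mean of its assigned documents
            for c in range(self.num_clusters):
                members = [doc for doc, l in zip(documents, labels) if l == c]
                if members:
                    self.centers[c] = [sum(vals) / len(members) for vals in zip(*members)]
        return labels

# usage on toy two-dimensional "documents"
docs = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 4.9]]
clusterer = KMeansClusterer(num_clusters=2)
print(clusterer.classify(docs))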
Pseudo-code in Python syntax for finding KLD, cleaned up so it runs against NLTK and the Brown corpus (assumes the relevant NLTK data - punkt, wordnet, brown - has been downloaded; the symmetric form is used):
import math
import nltk
from nltk.corpus import brown

# 'document' is assumed to be the raw text of the sample document to compare
tokens = nltk.word_tokenize(document)
words = sorted(set(tokens))
porter = nltk.PorterStemmer()
words = [porter.stem(t) for t in words]
wnl = nltk.WordNetLemmatizer()
lemma = [wnl.lemmatize(t) for t in words]
vocabulary = lemma[1:200]

# categories
news = brown.words(categories='news')
editorial = brown.words(categories='editorial')
reviews = brown.words(categories='reviews')
hobbies = brown.words(categories='hobbies')

# term frequency distributions for each category and for the document
fdist_news = nltk.FreqDist(news)
fdist_editorial = nltk.FreqDist(editorial)
fdist_reviews = nltk.FreqDist(reviews)
fdist_hobbies = nltk.FreqDist(hobbies)
fdist_document = nltk.FreqDist(tokens)

EPSILON = 1e-6   # back-off value (see earlier posts) so absent terms do not give log(0) or divide by zero

def P(fdist, term):
    # probability of a term under a frequency distribution, with epsilon back-off
    return fdist[term] / fdist.N() if fdist[term] else EPSILON

def Ptk_document(term): return P(fdist_document, term)
def Ptk_editorial(term): return P(fdist_editorial, term)
def Ptk_reviews(term): return P(fdist_reviews, term)
def Ptk_hobbies(term): return P(fdist_hobbies, term)

# symmetric KLD between the editorial category and the document over the vocabulary
KLD_editorial = sum((Ptk_editorial(term) - Ptk_document(term)) *
                    math.log(Ptk_editorial(term) / Ptk_document(term))
                    for term in vocabulary)
[in progress]
Thursday, November 14, 2013
In the previous post, we discussed the algorithm to differentiate probability distributions. With this, we have seen how to differentiate a sample document against document categories.
Here we will walk through how to implement it. First we take a document collection that is already categorized. We then select a vocabulary V from these documents using lemmatization and stop-word removal. With the terms in the vocabulary, we compute the probability distribution by counting the term frequencies in a document and taking the ratio with respect to the total of all the frequencies. We can do the same for the categories so that we have vectors for both. Computing the vectors and normalizing them prepares us for computing the distance. This is measured using the four cases we discussed, normalized by the distance corresponding to the empty document vector. After computing the KLD distance, we assign the document to the category with the minimum distance.
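The final assignment step might look like this minimal sketch, assuming we already have a kld_distance function that returns the normalized distance between a document vector and a category vector; the names here are illustrative:

def categorize(doc_vector, category_vectors, kld_distance):
    # category_vectors: mapping from category name to its probability vector
    # pick the category whose distribution is closest to the document's
    return min(category_vectors,
               key=lambda name: kld_distance(doc_vector, category_vectors[name]))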
Notice that there are no iterations involved since we are only finding the distances. With the distances, we get a measure of how similar or dissimilar the vectors are.
We have said so far that the document can be assigned the category with which it has the minimum distance. This distance is computed based on the difference between the document's distribution and the category's distribution. Notice that the sample for the distribution can be the category or the document, and we evaluate it on a term-by-term basis.
We will discuss a variation where, given a fairly large vocabulary, we select the keywords that most distinguish a text from a background consisting of general terms. We could choose a subset of the vocabulary and find the terms that repeatedly contribute most to the distance from the background distribution, or we could start out with an empty document and add the terms that keep the distance higher than a threshold.
If we choose the latter approach, we may find it easier to build the subset of keywords we are interested in. At the same time, we may need to experiment with different threshold values. This is not a bad option, given that the number of words in the subset keeps shrinking with higher and higher thresholds. The advantage of this approach is that we keep the background distribution the same; consequently the distance is directly related to the subset of keywords.
This way we have a higher likelihood of reducing the number of selections significantly.
In the former approach, we might have to try out different subsets, and again, since the distribution to compare with remains the same, the words that repeatedly contribute more and more could be chosen as candidates. The sample can be of a fixed size, with only those terms swapped out that don't contribute significantly to the distance, stopping when we have a selection of the desired size. A sketch of this approach follows.
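A sketch of this fixed-size subset approach, assuming we already have per-term probabilities for the document and the background; the subset size and the cutoff are illustrative:

import math

def contribution(term, p_doc, p_background):
    # per-term contribution to the symmetric KLD against the background
    return (p_doc[term] - p_background[term]) * math.log(p_doc[term] / p_background[term])

def refine_subset(candidates, p_doc, p_background, size, cutoff):
    # keep a working subset of fixed size; repeatedly swap out the weakest term
    # for any fresh candidate that beats both the cutoff and the weakest term
    subset = list(candidates[:size])
    for term in candidates[size:]:
        weakest = min(subset, key=lambda t: contribution(t, p_doc, p_background))
        if (contribution(term, p_doc, p_background) > cutoff and
                contribution(term, p_doc, p_background) > contribution(weakest, p_doc, p_background)):
            subset.remove(weakest)
            subset.append(term)
    return subset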
Wednesday, November 13, 2013
In the paper titled Kullback-Leibler distance for text categorization, Brigitte Bigi introduces a tf-idf classifier and a KLD classifier for text categorization. Documents are represented by term vectors, where each component is a weight for one of the selected content terms. Each term is assumed to occur at least once in some document. In the tf-idf case, each weight is calculated as a combination of the term frequency and the inverse document frequency. If the idf of a term is lower, it appears in more documents and is consequently not a good discriminating content word. The weight is zero when the term is not assigned to document dj; for the assigned terms it equals the count of the term in that document times the idf, calculated as the log of the total number of documents in the training set divided by the document frequency of the term. The classifier for a tf-idf scheme is based on cosine similarity, where we sum the component-wise products of two vectors and divide by the product of their magnitudes.
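A minimal sketch of the tf-idf weighting and the cosine-similarity classifier described here; the toy documents and vocabulary are hypothetical:

import math
from collections import Counter

def tfidf_vectors(documents, vocabulary):
    # weight of term t in document d = tf(t, d) * log(N / df(t)); zero if t never occurs
    N = len(documents)
    df = {t: sum(1 for d in documents if t in d) for t in vocabulary}
    vectors = []
    for d in documents:
        tf = Counter(d)
        vectors.append([tf[t] * math.log(N / df[t]) if df[t] else 0.0 for t in vocabulary])
    return vectors

def cosine(a, b):
    # sum of component-wise products divided by the product of the magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

vocab = ["gene", "protein", "market", "stock"]
docs = [["gene", "protein", "gene"], ["market", "stock", "stock"], ["gene", "stock"]]
vecs = tfidf_vectors(docs, vocab)
print(cosine(vecs[0], vecs[2]))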
The KLD classifier works similarly but measures how different two probability distributions are. The KL divergence of probability distributions P, Q on a finite set X is expressed as the sum over X of P(x) log(P(x)/Q(x)). The author mentions that a better distance measure is the symmetric form, the sum of (P(x) - Q(x)) log(P(x)/Q(x)). With this distance measure, we can choose the category whose distribution is closest to the document's. Note that the document vectors may have a lot of zeros for the terms that don't appear in the document. For these, there is a back-off scheme mentioned by the author. In general, C is the number of categories and V is the number of terms from all the categories. For each category, a distribution based only on the selected terms is computed from the corpus. The term distribution of a sample document is compared with the distribution of these categories. In the back-off scheme, term frequencies are discounted and all the terms that don't occur in the document are given a default value, treating them as if they were unknown words. This is done so that the distances don't come out infinite.
A document is represented by a term vector of probabilities dj, while a category is represented by a term vector of probabilities ci. The probability of a term tk in a document dj is computed as P(tk | dj), the ratio of the term's frequency in that document to the total term frequency in that document when the term appears in the document, and a default value otherwise. Similarly, the probability of a term tk in a category ci is the ratio of the term's frequency in that category to the total term frequency in that category. The probability distributions for the documents and the categories are normalized to unit length, i.e. the default values and the computed ratios add up to 1. The categorization method based on the KL distance computes the distance as the sum, over the terms, of the difference between the probability of a term in the category and in the document, multiplied by the log of their ratio.
Notice that this has four cases: tk appears in both document dj and category ci; tk appears in dj but not in ci; tk appears in ci but not in dj; and tk appears in neither. In the last case, the term degenerates to an unknown word. Finally, we normalize the distance using the distance computed for the document dj and the distance computed for an empty document.
So, to extract keywords, we add to the empty document the terms that maximize this distance. It is also clear that we need the categories and documents for this keyword extraction, and we do not want to rely on the test document or its summary alone.
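Putting these pieces together, a sketch of the KLD categorization step with the back-off for absent terms and one possible reading of the empty-document normalization; the epsilon value and the data structures are assumptions rather than the paper's exact choices:

import math

EPSILON = 1e-5   # default probability for a term absent from a document or category

def kld_distance(p_cat, p_doc, vocabulary):
    # symmetric KL distance; EPSILON stands in for the absent-term cases
    total = 0.0
    for t in vocabulary:
        p = p_cat.get(t, EPSILON)    # term absent from the category -> back-off
        q = p_doc.get(t, EPSILON)    # term absent from the document -> back-off
        total += (p - q) * math.log(p / q)   # both absent: (eps - eps) * log(1) = 0
    return total

def categorize(p_doc, categories, vocabulary):
    # normalize each category's distance by its distance to an empty document,
    # then pick the closest category
    empty = {}                       # every term backs off to EPSILON
    scores = {name: kld_distance(p_cat, p_doc, vocabulary) /
                    kld_distance(p_cat, empty, vocabulary)
              for name, p_cat in categories.items()}
    return min(scores, key=scores.get)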