Tuesday, November 12, 2013

While FuzzySKWIC works with a selected finite vocabulary for the documents under consideration, Wartena's paper treats documents as bags of words and attempts to find the topic by clustering those words. Here too we have a term collection T, each occurrence of which can be found in exactly one source document in a collection C. Wartena's approach includes lemmatization, which reduces all inflected forms of verbs, nouns, and adjectives to their lexical form, substantially reducing the variation in the input data. Part-of-speech tagging is also used to distinguish content words from function words. This pre-processing let them select around 160 of the most frequent keywords for their term collection.
In this approach, they consider probability distributions that measure the probability of selecting an occurrence of a term t, the probability of selecting a document from the collection, and the probability of selecting a term-document pair from C x T. They define two conditional probabilities: the probability that a randomly selected occurrence of term t has source d, called the source distribution of t, and the probability that a randomly selected term occurrence from document d is an instance of term t, called the term distribution of d. These let them define the distribution of co-occurring terms, with which they perform clustering to determine the topic.
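The two conditional distributions above can be sketched from raw occurrence counts. The following is a minimal illustration, not Wartena's implementation; the toy corpus and function names are invented for the example. The source distribution of t divides each document's count of t by the total occurrences of t, the term distribution of d normalizes the counts within d, and composing them gives a distribution of terms co-occurring with a given term.

```python
from collections import Counter

# Toy corpus: each document is a bag of (lemmatized) terms.
# The documents and terms here are illustrative only.
docs = {
    "d1": ["cluster", "term", "topic", "term"],
    "d2": ["cluster", "keyword", "topic"],
    "d3": ["keyword", "term", "keyword"],
}

# n(d, t): occurrence counts per document; n(t): total occurrences of t.
n_dt = {d: Counter(ts) for d, ts in docs.items()}
n_t = Counter(t for ts in docs.values() for t in ts)

def source_distribution(t):
    """Probability that a randomly selected occurrence of t has source d."""
    return {d: n_dt[d][t] / n_t[t] for d in docs}

def term_distribution(d):
    """Probability that a randomly selected occurrence in d is term t."""
    total = sum(n_dt[d].values())
    return {t: c / total for t, c in n_dt[d].items()}

def cooccurrence_distribution(z):
    """Distribution of terms co-occurring with z: average the term
    distributions of z's source documents, weighted by the source
    distribution of z."""
    dist = Counter()
    for d, weight in source_distribution(z).items():
        for t, p in term_distribution(d).items():
            dist[t] += weight * p
    return dict(dist)
```

Since each term distribution sums to one and the source-distribution weights sum to one, the resulting co-occurrence distribution is itself a proper probability distribution, which is what makes it usable as input to the clustering step.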
The concern here is that the vocabulary set may not be sufficient to find the topic in a test document or to extract keywords. To evaluate the available words, a different scheme may be needed; word similarities and clustering could be used.
To pick keywords out of a document, we must identify centers for the clusters within the document itself. Is there a limit on the number of words we can cluster? No, not as long as the terms are represented by their weights or vectors. Typically the documents have vectors, not the terms themselves, because the terms only have weights. The document vectors suffer from high dimensionality due to the number of terms in the vocabulary set, but high dimensionality is a problem for the documents, not for the keywords themselves. The weights associated with the keywords can be based on tf-idf or the Jensen-Shannon divergence. We find the most frequent content words, and we can differentiate the keywords from the background based on the Kullback-Leibler divergence and a cutoff.

However, documents and keywords are not treated independently. Often topics have to be categorized and keywords extracted together, because independently they don't give correct information. Take the example of keyword extraction alone. If we were looking merely for the content words and worked only with the test document, reduced to a bag of words, we would be selecting words from this document without any knowledge of the distribution of those words in other documents. That may or may not be satisfactory, since the document corpus gives valuable information about keywords that generally appear together in a category. Perhaps, then, the documents from the corpus could be summarized in a term-attributes lookup table that we pre-populate based on the studied corpus. Then, given the words we encounter in a document, we find the matches and use the corresponding attributes to extract the keywords. WordNet was a start here, but we are talking about the categories the words appear in as attributes, among others. In addition, we are considering applying the Kullback-Leibler divergence with a cutoff to extract keywords.
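The Kullback-Leibler idea above can be sketched concretely. The example below is a hypothetical illustration, not a method from the paper: it compares a document's term distribution against a background distribution from the corpus and keeps the terms whose pointwise contribution to the KL divergence exceeds a cutoff. The smoothing constant and cutoff value are arbitrary choices for the sketch.

```python
import math
from collections import Counter

def distribution(bag):
    """Normalize a bag of terms into a probability distribution."""
    total = len(bag)
    return {t: c / total for t, c in Counter(bag).items()}

def keyword_candidates(doc_terms, corpus_terms, cutoff=0.05):
    """Keep terms whose pointwise KL contribution against the
    background distribution exceeds the cutoff."""
    p = distribution(doc_terms)      # term distribution of the document
    q = distribution(corpus_terms)   # background distribution from the corpus
    scores = {}
    for t, pt in p.items():
        qt = q.get(t, 1e-9)          # smooth terms unseen in the background
        scores[t] = pt * math.log2(pt / qt)
    return {t: s for t, s in scores.items() if s > cutoff}
```

A term that is much more frequent in the document than in the corpus gets a large positive contribution and survives the cutoff, while function-word-like terms that match the background contribute little or negatively and are dropped; this is the sense in which the cutoff differentiates keywords from the background.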
The suggestion is that we use the existing knowledge as well as the differentiation within the given document, focusing on keyword extraction.
