I came across a paper that overlaps with my interest in keyword extraction. The paper is titled Keyword extraction from a single document using word co-occurrence statistical information by Matsuo and Ishizuka. Their algorithm works on a single document without using a corpus. They first extract the frequent terms. Then they count the co-occurrences of each term with the frequent terms. If a term appears selectively with a particular subset of the frequent terms, that term is likely to carry an important meaning. They measure the degree of bias in the co-occurrence distribution with a chi-square goodness-of-fit test, and they show that their algorithm performs about as well as tf-idf with a corpus.
They treat each sentence as a basket of words, ignoring term order and grammatical information. They build a co-occurrence matrix by counting the frequencies of pairwise term co-occurrence. This is a symmetric N x N matrix, where N is the number of distinct terms (which differs from the number of frequent terms G), and the diagonal components are ignored. If a term w appears independently of the frequent terms G, the co-occurrence distribution of w over G resembles the unconditional distribution of G. On the other hand, if w has a semantic relation with some subset of G, the co-occurrence of w with that subset is biased. Since term frequencies can be small, the raw degree of bias is not reliable on its own, so they test the significance of the bias with a chi-square statistic. In this case, the chi-square is defined as

χ²(w) = Σ_{g ∈ G} (freq(w, g) − n_w · p_g)² / (n_w · p_g)

where p_g is the expected probability, equal to the unconditional probability of the frequent term g; n_w is the total number of co-occurrences of term w with the frequent terms G; and n_w · p_g is the expected frequency of co-occurrence.
Terms with high chi-square values are relatively more important in the document than terms with low values.
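To make the mechanics concrete, here is a minimal Python sketch of the idea under some simplifying assumptions: sentences come in pre-tokenized, the frequent-term set G is simply the top 30% of terms by frequency, and p_g is approximated by g's share of all co-occurrences (the paper derives it slightly differently). The function and parameter names are mine, not the paper's.

```python
from collections import Counter
from itertools import combinations

def chi_square_keywords(sentences, top_frac=0.3):
    """Sketch of the co-occurrence chi-square scoring.
    `sentences` is a list of token lists (each sentence is a
    basket of words; order and grammar are ignored)."""
    # Term frequencies over the whole document.
    tf = Counter(t for s in sentences for t in s)

    # Frequent-term set G: here simply the top fraction of terms
    # by frequency (an assumption, not the paper's exact cutoff).
    n_frequent = max(1, int(len(tf) * top_frac))
    G = {t for t, _ in tf.most_common(n_frequent)}

    # Symmetric pairwise co-occurrence counts; combinations over a
    # set never pairs a term with itself, so the diagonal is ignored.
    cooc = Counter()
    for s in sentences:
        for a, b in combinations(set(s), 2):
            cooc[(a, b)] += 1
            cooc[(b, a)] += 1

    # p_g: approximated as g's share of all co-occurrence mass
    # involving a frequent term (a simplification of the paper's
    # unconditional probability).
    g_totals = {g: sum(cooc[(g, t)] for t in tf) for g in G}
    total = sum(g_totals.values()) or 1
    p = {g: g_totals[g] / total for g in G}

    scores = {}
    for w in tf:
        # n_w: total co-occurrences of w with the frequent terms G.
        n_w = sum(cooc[(w, g)] for g in G)
        if n_w == 0:
            continue
        # chi^2(w) = sum_g (freq(w,g) - n_w * p_g)^2 / (n_w * p_g)
        scores[w] = sum(
            (cooc[(w, g)] - n_w * p[g]) ** 2 / (n_w * p[g])
            for g in G if p[g] > 0
        )
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

A biased term (one that co-occurs almost exclusively with a few frequent terms) accumulates large deviations from the expected counts n_w · p_g and so scores high, while a term sprinkled evenly across G scores near zero.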
They also cluster the frequent terms before measuring bias: frequent terms that co-occur very often, or whose co-occurrence distributions are similar, are grouped together, and the chi-square is then computed against clusters rather than individual terms. This keeps related terms (such as variants of the same word) from splitting the co-occurrence counts; a rough sketch of the distribution-based criterion follows.
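The sketch below illustrates the distribution-similarity criterion with Jensen-Shannon divergence and a simple greedy grouping. The threshold and the greedy scheme are placeholders for illustration, not the paper's exact clustering procedure.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions
    given as dicts mapping term -> probability. Used here as the
    similarity criterion for grouping frequent terms."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a):
        # KL(a || m); only terms with positive mass in `a` contribute.
        return sum(a[k] * math.log(a[k] / m[k])
                   for k in keys if a.get(k, 0.0) > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def cluster_frequent_terms(G, dist, threshold=0.1):
    """Greedy pairwise clustering sketch: a frequent term joins the
    first cluster whose representative has a similar co-occurrence
    distribution. `dist[g]` is g's co-occurrence distribution over
    terms; the threshold is a placeholder, not the paper's value."""
    clusters = []
    for g in G:
        for c in clusters:
            if js_divergence(dist[g], dist[c[0]]) < threshold:
                c.append(g)
                break
        else:
            clusters.append([g])
    return clusters
```

After clustering, the co-occurrence counts are summed per cluster and the chi-square statistic is computed against clusters instead of individual frequent terms.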