Thursday, January 23, 2014

Today we will recap some keyword selection techniques:

In this article, we look at different feature selection algorithms. Feature selection metrics can be based on content, frequency, and distribution.

Content-based feature selection can rely on measures such as mutual information (MI) or information gain (IG).

MI measures how much information a feature f contains about a class c. It is given by MI(f, c) = log [ P(f, c) / (P(f) P(c)) ].
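As a quick sketch, we could compute this pointwise MI from raw document counts as follows; the counts, the toy "goal"/sports scenario, and the helper name pointwise_mi are made up for illustration.

```python
from math import log2

def pointwise_mi(n_fc, n_f, n_c, n_total):
    """Pointwise mutual information between a feature f and a class c.

    n_fc    - number of documents containing f and labelled c
    n_f     - number of documents containing f
    n_c     - number of documents labelled c
    n_total - total number of documents
    """
    p_fc = n_fc / n_total
    p_f = n_f / n_total
    p_c = n_c / n_total
    if p_fc == 0:
        return float("-inf")  # f and c never co-occur
    return log2(p_fc / (p_f * p_c))

# Toy counts: "goal" appears in 40 of 50 sports documents and in 45 of 100 documents overall.
print(pointwise_mi(n_fc=40, n_f=45, n_c=50, n_total=100))  # ≈ 0.83 bits
```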

Information Gain measures the number of bits of information obtained about the presence and absence of a class by knowing the presence or absence of a feature.

It is given by IG(f) = Σ_g Σ_t P(t, g) · log [ P(t, g) / (P(t) P(g)) ], where g ranges over the presence and absence of the class and t over the presence and absence of the feature.
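A rough sketch of this sum, computed from a 2x2 contingency table of feature presence against class membership, might look like the following; it uses the same made-up counts as above, and information_gain is a hypothetical helper rather than a library function.

```python
from math import log2

def information_gain(n_fc, n_f, n_c, n_total):
    """IG over a 2x2 contingency table of feature presence vs. class membership.

    Counts have the same meaning as in the MI sketch above.
    """
    joint = {(True, True): n_fc,                          # f present, class c
             (True, False): n_f - n_fc,                   # f present, not c
             (False, True): n_c - n_fc,                   # f absent, class c
             (False, False): n_total - n_f - n_c + n_fc}  # f absent, not c
    ig = 0.0
    for (has_f, in_c), n in joint.items():
        if n == 0:
            continue  # a zero count contributes nothing to the sum
        p_tg = n / n_total
        p_t = (n_f if has_f else n_total - n_f) / n_total
        p_g = (n_c if in_c else n_total - n_c) / n_total
        ig += p_tg * log2(p_tg / (p_t * p_g))
    return ig

print(information_gain(n_fc=40, n_f=45, n_c=50, n_total=100))  # ≈ 0.40 bits
```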

Gain Ratio improves on IG by not overlooking features with low entropy that may be strongly correlated with a class. It does this by normalizing IG with the entropy of the class.
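Under that definition (IG divided by the class entropy), a small sketch could look like the one below; note that other formulations normalize by the entropy of the feature instead, and gain_ratio is again a hypothetical helper using the same made-up counts.

```python
from math import log2

def gain_ratio(n_fc, n_f, n_c, n_total):
    """IG normalised by the entropy of the class variable (counts as in the MI sketch)."""
    joint = {(True, True): n_fc, (True, False): n_f - n_fc,
             (False, True): n_c - n_fc, (False, False): n_total - n_f - n_c + n_fc}
    ig = 0.0
    for (has_f, in_c), n in joint.items():
        if n == 0:
            continue
        p_tg = n / n_total
        p_t = (n_f if has_f else n_total - n_f) / n_total
        p_g = (n_c if in_c else n_total - n_c) / n_total
        ig += p_tg * log2(p_tg / (p_t * p_g))
    p_c = n_c / n_total
    class_entropy = -(p_c * log2(p_c) + (1 - p_c) * log2(1 - p_c))
    return ig / class_entropy

print(gain_ratio(n_fc=40, n_f=45, n_c=50, n_total=100))  # ≈ 0.40, since H(c) = 1 bit here
```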

Term Strength is based on the idea that the most valuable features are shared by related documents.
The weight of a feature is defined as the probability of finding it in a document d given that it also appears in a document d' similar to d.
To calculate TS for a feature f, we first need a threshold on the similarity measure used to judge whether two documents are sufficiently related. One way to set it is to decide how many documents can be considered related to a given one, and then find the average minimum similarity for that number of neighbors over all documents in the collection.
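Here is a minimal sketch of the estimate itself, assuming documents are represented as term sets, Jaccard similarity is used, and the relatedness threshold is simply passed in (the threshold-selection procedure described above is not implemented); term_strength and the toy documents are made up for illustration.

```python
from itertools import permutations

def jaccard(a, b):
    """Similarity of two term sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

def term_strength(docs, sim, threshold):
    """Term Strength estimated over ordered pairs of related documents.

    docs      - list of sets of terms (one set per document)
    sim       - similarity function on two term sets
    threshold - minimum similarity for a pair to count as "related"
    """
    related = [(d, d2) for d, d2 in permutations(docs, 2) if sim(d, d2) >= threshold]
    strength = {}
    for term in set().union(*docs):
        pairs_with_term = [(d, d2) for d, d2 in related if term in d]
        if not pairs_with_term:
            continue  # term never occurs in a related pair
        hits = sum(1 for d, d2 in pairs_with_term if term in d2)
        strength[term] = hits / len(pairs_with_term)
    return strength

docs = [{"match", "goal", "score"}, {"goal", "score", "league"}, {"recipe", "flour"}]
print(term_strength(docs, jaccard, threshold=0.3))  # "goal" and "score" get strength 1.0
```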

Frequency-based keyword selection metrics include tf-idf and the Transition Point.

tf-idf is term frequency multiplied by inverse document frequency. It is a simple and effective measure for keywords.
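A minimal sketch of one common variant (raw-count tf and idf = log(N / df)) is shown below; the tokenized toy documents are made up for illustration, and real pipelines often use smoothed or normalized variants instead.

```python
from collections import Counter
from math import log

def tf_idf(docs):
    """tf-idf weights for each term in each document.

    docs - list of token lists; returns a list of {term: weight} dicts.
    """
    n_docs = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)               # raw term counts in this document
        weights.append({t: tf[t] * log(n_docs / df[t]) for t in tf})
    return weights

docs = [["the", "goal", "was", "a", "great", "goal"],
        ["the", "recipe", "needs", "flour"],
        ["the", "league", "table"]]
print(tf_idf(docs)[0])  # "goal" scores highest, "the" scores zero
```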

Transition Point is based on the idea that medium-frequency terms are closely related to the conceptual content of a document. A term is given a higher weight the closer its frequency is to a particular frequency called the transition point. This is determined by inspecting the vocabulary frequencies of the text and, scanning down from the highest frequency, identifying the lowest frequency that is not repeated (i.e. not shared by more than one term); repeated frequencies are much more common among low-frequency words.
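The sketch below follows the scan just described; the closeness weighting 1 / (1 + |f - TP|) is an arbitrary choice for illustration, and other formulations derive the transition point from the number of words occurring exactly once rather than from this scan.

```python
from collections import Counter

def transition_point_weights(tokens):
    """Weight each term by how close its frequency is to the transition point.

    The transition point is found by walking the distinct frequencies downwards
    from the highest and keeping the lowest frequency held by a single term,
    stopping where frequencies start to repeat.
    """
    freqs = Counter(tokens)                 # term -> frequency
    freq_counts = Counter(freqs.values())   # frequency -> number of terms with it
    tp = max(freqs.values())
    for f in sorted(freq_counts, reverse=True):
        if freq_counts[f] > 1:              # frequencies start repeating here
            break
        tp = f                              # lowest non-repeated frequency so far
    # Illustrative weighting: 1 at the transition point, decaying with distance from it.
    return {term: 1.0 / (1.0 + abs(f - tp)) for term, f in freqs.items()}

tokens = ("the " * 8 + "game " * 5 + "goal " * 3 + "match " * 3 + "a " * 2 + "is " * 2).split()
print(transition_point_weights(tokens))  # "game", closest to the transition point, scores highest
```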
Distribution-wise, it is debatable whether a term spread throughout the document should be included or excluded: on one hand it is a good candidate for an index term, and on the other hand it can be considered overly general.
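To make the notion of spread concrete, one possible (hypothetical) measure is the fraction of equal-sized chunks of the document in which the term occurs:

```python
def spread(tokens, term, n_chunks=10):
    """Fraction of equal-sized document chunks that contain the term.

    A value near 1.0 means the term is spread throughout the document
    (possibly overly general); a low value means it is concentrated in one region.
    """
    size = max(1, len(tokens) // n_chunks)
    chunks = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    return sum(term in chunk for chunk in chunks) / len(chunks)
```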
