Thursday, June 5, 2014

In the previous post we saw that we could represent distributional similarity and contextual similarity with data structures. The first is represented by three graphs We first create a term context feature graph with T set of terms, C the set of context features and edges based on the counts of the terms and the context features. Based on this graph, a first order graph is created, This first order graph is denoted by T , C and the significance score sig. The significance score sig is calculated for each pair (t, c) using a formula for a metric called the Lexicographer's Mutual Information LMI(t,c) = count(t,c).log-base-2(count(t,c)/count(t)count(c)). This is described in the paper by Evert, 2004. To find the significance score, the edges with LMI < 0 are removed and only the p  most significant pairs per term t are retained. Features that are general because they have too many terms are also removed.
The second order similarity graph between terms is defined with the similarity score as the number of salient features the two terms share.  This approach does not count how often the feature occurs with a term but instead uses a significance ranking This makes it more scalable to data as it does not need the full list of features for a term pair at any time.
The contextual similarity is calculated differently.  This is also computed by a ranking method. Given the set of candidate expansions from the second order graph, such that the most similar term in context will be ranked higher and the non-compatible ones should be ranked lower. First the terms and context features are extracted as pairs and for all pairs for the given target word from the document by the holing operator.  Then a new graph is defined with context features CON that belong to the previous graph. For the target word,  the second order similarity graph is looked up to find the top n similar terms and these are added to the CON. Then edges between target word and context features are labeled with the significance score from the first order graph. If the edges are not in the first order graph then they don't get significance score. Then a ranking score for each term is calculated with the harmonic mean and a smoothing factor. Then the entries are re-ordered based on this ranking score.

No comments:

Post a Comment