Monday, October 21, 2013

In today's post we describe the automatic discovery of similar words in a dictionary. In particular we discuss the method proposed by the same authors as in the previous post. This is a refinement over the generalized method provided by Kleinberg for searching the web. These methods could have  been improved by others but their reference here is a good read nevertheless.
First we construct a dictionary graph G where each word is a vertex in G and there is an edge from u to v if v appears in the definition of u. Then associated with a given query word we construct a neighborhood graph which is a subgraph of G whose vertices are those that are pointing to or pointed by the vertex in G.
The problem is simplified to searching for authoritative pages in a subgraph that is focused on the original page. These pages are similar to the hubs in the structure graph.  An authority page is one that categorizes the other pages such as automobile makers for Ford, Toyota and other car makers whereas web pages that list these as home pages are hubs.
The method to find hubs and authorities proceeds by initially setting a weight of 1 to both hub and authority and simultaneously updating them according to a mutually reinforcing rule.  The hub score of the vertex i, xj is set to equal the sum of the authority scores pointed to by i and similarly the authority scores of the vertex j is equal to the sum of the hub squares of all vertices pointing to by j. The hub score and the authority score can be seen as a measure of similarity between vertices in each other's graph.
However hub and authority pages are both not good at discovering synonyms.
This can be improved  by maintaining three scores or as many scores as there are vertices in the structure graph. We then iteratively update the scores simultaneously based on a mutually reinforcing rule : the first scores are set to the sum of the scores of all vertices j pointed to by i, the second scores are set equal to the sum of third scores of vertices pointed to by i and the third scores as the sum of the second squares pointed to by i.  Let M be the adjacency matrix associated with G. The process converges with the list of synonyms appearing as those in decreasing order of Eigen value of M Mtransposed + MtransposedM.

No comments:

Post a Comment