In the paper titled "Merging Word Senses", Bhagwani, Satapathy and Karnick present a semi-supervised approach to learning a similarity metric over WordNet synsets using a graph-based recursive similarity definition. The learnt metric provides sense similarities for all word-sense pairs, from which coarse sense inventories can be derived at arbitrary granularities.
Unlike previous approaches, which generated a coarse sense inventory by merging fine-grained senses, this approach proposes a framework that learns a synset similarity metric. The earlier approaches had two problems. First, they require a stopping criterion for each word, such as the number of final sense classes, which usually cannot be predetermined. Second, inconsistent clusters are obtained because coarse senses are generated independently for each word, and this inconsistent clustering causes transitive-closure errors. The approach discussed in this paper improves upon the coarsening of noun synsets. To learn similarities between synset pairs that do not share a word, the authors use a variant of the SimRank framework that assigns such pairs a non-zero similarity. SimRank is a graph-based similarity measure applicable in any domain with object-to-object relationships; its premise is that two objects are similar if they are related to similar objects. Jeh and Widom (2002) showed that the SimRank equations can be solved to find a similarity between all pairs of objects. For a graph G(V, E), the solution is reached by iterating to a fixed point. Each iteration maintains |V|^2 entries, where S_k(a, b) is the estimate of the similarity between a and b at the k-th iteration. This works well when we have complete information about the objects.
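The fixed-point iteration described above can be sketched in a few lines. This is a minimal illustration of the standard SimRank recurrence on a toy undirected graph; the node names and the decay constant C are made up for the example, not taken from the paper.

```python
# Minimal SimRank sketch: iterate S_k+1(a,b) = C/(|N(a)||N(b)|) * sum of
# S_k(x,y) over neighbor pairs, with S(a,a) fixed at 1. Toy data only.
from itertools import product

def simrank(neighbors, C=0.8, iterations=10):
    """neighbors: dict mapping every node to a list of adjacent nodes."""
    nodes = list(neighbors)
    # S_0(a, b) = 1 if a == b else 0
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        new = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new[(a, b)] = 1.0
            elif neighbors[a] and neighbors[b]:
                total = sum(sim[(x, y)]
                            for x in neighbors[a] for y in neighbors[b])
                new[(a, b)] = C * total / (len(neighbors[a]) * len(neighbors[b]))
            else:
                new[(a, b)] = 0.0
        sim = new
    return sim

# "shore" and "vault" share no edge, but both relate to neighbors of "bank",
# so iteration propagates a non-zero similarity to them.
graph = {
    "bank": ["river", "money"],
    "shore": ["river"],
    "vault": ["money"],
    "river": ["bank", "shore"],
    "money": ["bank", "vault"],
}
scores = simrank(graph)
```

Note that each iteration touches all |V|^2 pairs, matching the storage cost mentioned above.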
In many scenarios this may not be the case, so SimRank is personalized: when the similarities of some pairs are known, the approach fixes them in the set of equations and lets the remaining values be learnt automatically by the system. The mapping labels of the Omega ontology are used as the initial supervised labels. To coarsen WordNet, an undirected graph is constructed whose vertices are the synsets of WordNet and whose edge set E is obtained by thresholding the similarity metric learnt with the personalized SimRank model. The connected components of this graph give a partition over synsets: all senses of a word occurring in the same component are grouped into a single coarse sense. Lowering the threshold yields denser graphs with fewer connected components, and fewer components translate into coarser senses, so the threshold provides a way to control the granularity of the coarse senses.
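The coarsening step above (threshold the learnt similarities, then group synsets by connected component) can be sketched as follows. The synset names and similarity scores are invented for illustration; the paper's actual metric is the one learnt by personalized SimRank.

```python
# Sketch of coarsening: build an undirected graph by thresholding a
# synset-pair similarity table, then take connected components.
from collections import defaultdict

def coarse_senses(similarity, threshold):
    """similarity: dict of (synset_a, synset_b) -> score."""
    adj = defaultdict(set)
    nodes = set()
    for (a, b), s in similarity.items():
        nodes.update((a, b))
        if s >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    # depth-first search for connected components
    seen, components = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        components.append(comp)
    return components

sims = {("bank.n.01", "bank.n.09"): 0.7,
        ("bank.n.01", "bank.n.02"): 0.1}
# A high threshold keeps the graph sparse (more components, finer senses);
# a lower threshold merges more synsets (fewer components, coarser senses).
fine = coarse_senses(sims, threshold=0.9)
coarse = coarse_senses(sims, threshold=0.5)
```

This makes the granularity control concrete: the same similarity table yields different partitions purely by moving the threshold.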
We now look at feature engineering. The features are broadly classified into two groups: those derived from the structure of WordNet and those derived from external corpora. The WordNet-derived features are further subdivided into similarity measures and synset/sense-based features. Among the WordNet similarity measures, the authors used path-based similarity measures. The synset- and sense-based features include the number of lemmas common to two synsets, the maximum polysemy degree among the lemmas shared by the synsets, whether the two synsets belong to the same lexicographer file, the number of common hypernyms, and so on. Hypernyms and hyponyms denote super- and sub-ordinate relationships in the structure of WordNet.
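A few of the synset-pair features named above can be sketched directly. The `Synset` record and the lemma, lexicographer-file, and hypernym values below are toy stand-ins, not real WordNet data; in practice these fields would come from a WordNet API.

```python
# Hedged sketch of three WordNet-derived synset-pair features:
# shared lemma count, same lexicographer file, common hypernym count.
from dataclasses import dataclass

@dataclass(frozen=True)
class Synset:
    lemmas: frozenset      # lemma strings in the synset
    lexfile: str           # lexicographer file name, e.g. "noun.object"
    hypernyms: frozenset   # names of hypernym synsets

def pair_features(s1, s2):
    return {
        "n_shared_lemmas": len(s1.lemmas & s2.lemmas),
        "same_lexfile": int(s1.lexfile == s2.lexfile),
        "n_common_hypernyms": len(s1.hypernyms & s2.hypernyms),
    }

# Two invented noun synsets that share a lemma and a hypernym.
a = Synset(frozenset({"bank", "cant"}), "noun.object", frozenset({"slope"}))
b = Synset(frozenset({"bank"}), "noun.object", frozenset({"slope", "incline"}))
features = pair_features(a, b)
```

Each feature is a simple set intersection or equality check, which is why these features are cheap to compute for every synset pair.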
Polysemous means having more than one sense within a syntactic category. Features derived from external corpora include the score of a synset with respect to the 169 hierarchically organized domain labels available from the eXtended WordNet Domains project. BabelNet is another external resource; it provides translations of noun word senses in six languages.