In Natural Language Processing, WordNet and Roget’s
thesaurus are often used interchangeably.
We argue that a thesaurus can be put to more use.
First, it helps in sampling words with semantically
equivalent content. We achieve this by
looking up a thesaurus entry for each word in the text. A replacement is
chosen at random from the synonyms of each word, although
arguably we could choose the simplest or the most frequently used one.
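A minimal sketch of that sampling step, using a hypothetical toy thesaurus (a real system would draw entries from Roget’s thesaurus or WordNet synsets), might look like:

```python
import random

# Toy thesaurus with hypothetical entries; stands in for Roget's or WordNet.
THESAURUS = {
    "big":   ["large", "huge", "vast"],
    "fast":  ["quick", "rapid", "swift"],
    "smart": ["clever", "bright", "astute"],
}

def sample_equivalent(words, thesaurus):
    """For each word, pick a random synonym from its thesaurus entry;
    keep the word itself if the thesaurus has no entry for it."""
    return [random.choice(thesaurus[w]) if w in thesaurus else w
            for w in words]

text = ["the", "big", "fast", "dog"]
alt = sample_equivalent(text, THESAURUS)
```

Choosing the simplest or most frequent synonym instead would just replace `random.choice` with a deterministic ranking over the entry.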
Second, it helps in finding an alternate text comprising
words from the thesaurus, which we evaluate with lexical centrality. We
perform a random walk on the graph, and the nodes that are visited most
frequently are selected as the alternate text. To avoid duplicates or
duplicated content, the final selection is ranked by maximal marginal
relevance.
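The centrality step can be sketched as follows, on an assumed toy word graph (the maximal-marginal-relevance re-ranking is omitted here; this only shows visit counting):

```python
import random
from collections import Counter

def random_walk_centrality(graph, steps=10000, seed=0):
    """Estimate lexical centrality by counting how often each node is
    visited during a simple random walk on an undirected word graph."""
    rng = random.Random(seed)
    node = next(iter(graph))       # arbitrary start node
    visits = Counter()
    for _ in range(steps):
        visits[node] += 1
        node = rng.choice(graph[node])   # step to a random neighbor
    return visits

# Hypothetical adjacency lists over words from the text and thesaurus.
graph = {
    "cat":    ["feline", "pet"],
    "feline": ["cat", "pet"],
    "pet":    ["cat", "feline", "dog"],
    "dog":    ["pet"],
}
visits = random_walk_centrality(graph)
top = [w for w, _ in visits.most_common(2)]
```

For an undirected walk the visit frequencies approach a distribution proportional to node degree, so well-connected words dominate the alternate text.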
We do not use a bipartite graph between the original text and
the alternate text. Although such graphs have their uses, we argue that they
do not serve our purpose of performing a Laplace transform. Moreover, we see this
as different from bootstrapping, where we start with a few seed relation examples or patterns and
iteratively grow the set of relations and patterns based on their occurrence in a
large corpus. There, a bipartite graph helps represent the mutual
information between instances (hubs) and patterns (authorities).
We could run a PageRank algorithm directly on the
thesaurus subset corresponding to the words in the given text. Hughes and Ramage (2007) showed that this approach
calculates the stationary distribution over the nodes of the thesaurus
graph, biased toward each input word of a given word pair. The divergence
between these distributions is then calculated as a measure of the relatedness of
the two words. Although that approach proved to be a significant
improvement for semantic relatedness, here we have only introduced semantically
equivalent and redundant words and thereby made the computation more
intensive than running PageRank on the original text. The purpose of the
Laplace transform is to create a semantically equivalent, regularized alternative
text. Even if there is no such thing as a canonicalized text, the
transformation provides a semantically alternate text.
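The biased-stationary-distribution idea can be sketched as a personalized PageRank power iteration plus a divergence measure; the graph and names below are hypothetical, not the Hughes and Ramage implementation:

```python
import math

def personalized_pagerank(graph, seed, alpha=0.85, iters=100):
    """Power iteration where the teleport mass is concentrated on one
    seed word, giving a stationary distribution biased toward it."""
    nodes = list(graph)
    rank = {n: (1.0 if n == seed else 0.0) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - alpha) * (1.0 if n == seed else 0.0) for n in nodes}
        for n in nodes:
            share = alpha * rank[n] / len(graph[n])
            for m in graph[n]:
                new[m] += share         # push rank along each edge
        rank = new
    return rank

def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions over the same
    node set; a smaller value suggests more related seed words."""
    def kl(a, b):
        return sum(a[k] * math.log(a[k] / b[k]) for k in a if a[k] > 0)
    m = {k: 0.5 * (p[k] + q[k]) for k in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical thesaurus-subset graph.
graph = {
    "cat":    ["feline", "pet"],
    "feline": ["cat", "pet"],
    "pet":    ["cat", "feline", "dog"],
    "dog":    ["pet"],
}
p_cat = personalized_pagerank(graph, "cat")
p_feline = personalized_pagerank(graph, "feline")
p_dog = personalized_pagerank(graph, "dog")
```

On this toy graph the distributions biased on "cat" and "feline" end up much closer to each other than either is to the one biased on "dog", which is the relatedness signal.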
We do not cluster on the thesaurus subset, because words
will be selected better by clustering directly on the original text, unless we
introduce a thesaurus-based feature into the word vector. One such feature
may be one-to-one mappings of synonyms from the thesaurus for the words chosen
as vector features of the PMI matrix. Arguably, the thesaurus-based feature
could be part of the CRF instead of the PMI matrix. Either way, we could obtain
word embeddings with a thesaurus feature.
Having ruled out the bipartite-graph approach, the clustering
approach, and PageRank on the thesaurus subset, we now try to build
lexical chains and find the words in the overlapping regions between two
chains.
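Finding the overlap between two chains reduces to a set intersection that preserves chain order; the chains below are hypothetical examples, not output of a chain builder:

```python
def chain_overlap(chain_a, chain_b):
    """Return the words shared by two lexical chains, preserving the
    order in which they appear in the first chain."""
    common = set(chain_a) & set(chain_b)
    return [w for w in chain_a if w in common]

# Hypothetical lexical chains, as might be built from related words in a text.
animals = ["dog", "pet", "cat", "feline"]
household = ["house", "pet", "sofa", "cat"]
overlap = chain_overlap(animals, household)
```

The overlapping words are the candidates shared by both topical threads of the text.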
Courtesy: Survey of graphs in NLP by Nastase and Mihalcea
#codingexercise
Given an array : A1[] = 2, 1, 2, 5, 7, 1, 9, 3, 6, 8, 8
A2[] = 2, 1, 8, 3
Sort A1 so that the relative order among the elements is the same as in A2. If an element is not present in A2, append it at the end in sorted order.
o/p : 2, 2, 1, 1, 8, 8, 3, 5, 6, 7, 9
List<int> RelSort(List<int> A1, List<int> A2)
{
    var ret = new List<int>();
    var inA2 = new HashSet<int>();
    A1.Sort();
    for (int i = 0; i < A2.Count; i++)
    {
        int k = A1.IndexOf(A2[i]);
        if (k != -1)
        {
            inA2.Add(A2[i]);
            // A1 is sorted, so equal elements are contiguous.
            for (; k < A1.Count && A1[k] == A2[i]; k++)
                ret.Add(A1[k]);
        }
    }
    // Elements absent from A2 follow at the end, in sorted order.
    for (int j = 0; j < A1.Count; j++)
    {
        if (!inA2.Contains(A1[j]))
            ret.Add(A1[j]);
    }
    return ret;
}
In the above code, we assume that the relative order of the elements is dictated by A2.