Friday, October 7, 2016

In Natural Language Processing, WordNet and Roget’s thesaurus are often used interchangeably. We argue that a thesaurus can be put to more use.
First, it helps in sampling words with semantically equivalent content. We achieve this by finding a word from the thesaurus for each of the words in the text. These replacements are chosen at random from the synonyms of each word, although arguably we could choose the simplest or the most frequently used synonym instead.
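As a minimal sketch, assuming the thesaurus is available as a plain dictionary from a word to its synonym list (a stand-in for a WordNet or Roget’s lookup):

using System;
using System.Collections.Generic;
using System.Linq;

// Replace each word in the text with a randomly chosen synonym; keep the
// word itself when the thesaurus has no entry for it.
static string SampleEquivalent(string[] words, Dictionary<string, List<string>> thesaurus)
{
    var rng = new Random();
    var picked = words.Select(w =>
        thesaurus.TryGetValue(w, out var syns) && syns.Count > 0
            ? syns[rng.Next(syns.Count)]   // random pick; we could instead take the most frequent
            : w);
    return string.Join(" ", picked);
}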
Second, it helps in finding an alternate text composed of words from the thesaurus, which we evaluate with lexical centrality. We perform a random walk on the graph, and the nodes visited most frequently are selected for the alternate text. To avoid duplicate content, the final selection also depends on its maximal marginal relevance.
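A rough sketch of that selection, assuming the word graph is given as an adjacency list; the redundancy check at the end is a crude stand-in for full maximal marginal relevance:

using System;
using System.Collections.Generic;
using System.Linq;

// Random-walk lexical centrality: count visits to each node, then pick the
// k most visited words, skipping any word adjacent to one already chosen
// (a crude redundancy penalty in place of full maximal marginal relevance).
static List<string> CentralWords(Dictionary<string, List<string>> graph, int steps, int k)
{
    var rng = new Random();
    var visits = graph.Keys.ToDictionary(w => w, _ => 0);
    string cur = graph.Keys.First();
    for (int t = 0; t < steps; t++)
    {
        visits[cur]++;
        var nbrs = graph[cur].Where(graph.ContainsKey).ToList();
        cur = nbrs.Count > 0
            ? nbrs[rng.Next(nbrs.Count)]
            : graph.Keys.ElementAt(rng.Next(graph.Count));  // restart on a dead end
    }
    var chosen = new List<string>();
    foreach (var w in visits.OrderByDescending(p => p.Value).Select(p => p.Key))
    {
        if (chosen.Count == k) break;
        if (!chosen.Any(c => graph[c].Contains(w)))         // skip near-duplicates
            chosen.Add(w);
    }
    return chosen;
}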
We don’t use a bipartite graph between the original text and the alternate text. Although such graphs have their uses, we argue that they don’t serve our purpose of performing a Laplace transform. Moreover, we see this as different from bootstrapping, where we start with a few seed relation examples or patterns and iteratively grow the set of relations and patterns based on their occurrences in a large corpus. There, a bipartite graph helps represent the mutual information between instances (hubs) and patterns (authorities).
We could run a PageRank algorithm directly on the thesaurus subset corresponding to the words in the given text. Hughes and Ramage (2007) showed that this approach calculates the stationary distribution of the nodes in the thesaurus graph, biased toward each input word of a given word pair. The divergence between the two distributions is then taken as a measure of the relatedness of the two words. Although that approach has proved a significant improvement for semantic relatedness, here we would only have introduced semantically equivalent, redundant words and thereby made the computation more intensive than running PageRank on the original text. The purpose of the Laplace transform is to create a semantically equivalent, regularized alternate text. Even if there is no such thing as a canonical text, the transformation provides a semantically equivalent alternative.
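A compact power-iteration sketch in the spirit of that idea (not their exact formulation): the teleport mass is concentrated on a single seed word, and comparing the two resulting distributions, say by cosine similarity or a divergence measure, scores the word pair:

using System.Collections.Generic;
using System.Linq;

// PageRank biased on one input word: the (1 - damping) teleport mass all
// returns to the seed, so the stationary distribution reflects that word's
// neighborhood in the thesaurus graph.
static Dictionary<string, double> BiasedPageRank(
    Dictionary<string, List<string>> graph, string seed,
    double damping = 0.85, int iters = 50)
{
    var rank = graph.Keys.ToDictionary(w => w, _ => 1.0 / graph.Count);
    for (int t = 0; t < iters; t++)
    {
        var next = graph.Keys.ToDictionary(w => w, _ => 0.0);
        foreach (var kv in graph)
        {
            var outLinks = kv.Value.Where(next.ContainsKey).ToList();
            foreach (var n in outLinks)
                next[n] += damping * rank[kv.Key] / outLinks.Count;
        }
        next[seed] += 1.0 - damping;   // all teleport probability goes to the seed word
        rank = next;
    }
    return rank;
}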
We don’t cluster on the thesaurus subset because words are selected better by clustering directly on the original text, unless we introduce a word-vector feature based on the thesaurus. One such feature may be the one-to-one mapping of thesaurus synonyms for the words chosen as vector features of the PMI matrix. Arguably, the thesaurus-based feature could be part of the CRF instead of the PMI matrix. Either way, we could achieve a word embedding with a thesaurus feature.
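A speculative sketch of such a feature, assuming a hypothetical one-to-one synonym mapping: while accumulating the co-occurrence counts behind the PMI matrix, a word’s synonym also receives a discounted count, so synonymous features share mass:

using System;
using System.Collections.Generic;

// Accumulate co-occurrence counts for the PMI matrix, crediting the
// thesaurus synonym of each word with a discounted count.
static void CountCooccurrence(
    Dictionary<(string word, string context), double> counts,
    string word, string context,
    Func<string, string> synonym,     // hypothetical one-to-one thesaurus mapping
    double discount = 0.5)
{
    counts.TryGetValue((word, context), out var c);
    counts[(word, context)] = c + 1.0;

    var syn = synonym(word);
    if (syn != null && syn != word)
    {
        counts.TryGetValue((syn, context), out var s);
        counts[(syn, context)] = s + discount;   // partial credit for the synonym
    }
}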
Having ruled out the bipartite graph approach, the clustering approach, and PageRank on the thesaurus subset, we now try to build lexical chains and find the words in the overlapping regions between chains.
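A minimal sketch of that chaining step, assuming a thesaurus-backed relatedness predicate: each word joins the first chain holding a related word, and a word related to more than one chain marks an overlapping region:

using System;
using System.Collections.Generic;
using System.Linq;

// Greedy lexical chaining: a word joins the first chain containing a
// related word; words related to several chains are the overlaps we want.
static (List<HashSet<string>> chains, HashSet<string> overlaps) BuildChains(
    IEnumerable<string> words, Func<string, string, bool> related)
{
    var chains = new List<HashSet<string>>();
    var overlaps = new HashSet<string>();
    foreach (var w in words)
    {
        var hits = chains.Where(c => c.Any(m => related(m, w))).ToList();
        if (hits.Count == 0)
            chains.Add(new HashSet<string> { w });   // start a new chain
        else
        {
            hits[0].Add(w);
            if (hits.Count > 1)
                overlaps.Add(w);                     // w bridges two or more chains
        }
    }
    return (chains, overlaps);
}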

Courtesy: Survey of graphs in NLP by Nastase and Mihalcea

#codingexercise
Given two arrays:
A1[] = 2, 1, 2, 5, 7, 1, 9, 3, 6, 8, 8
A2[] = 2, 1, 8, 3
Sort A1 so that the relative order among its elements is the same as in A2. Elements not present in A2 are appended at the end in sorted order.
Output: 2, 2, 1, 1, 8, 8, 3, 5, 6, 7, 9
using System.Collections.Generic;

static List<int> RelSort(List<int> A1, List<int> A2)
{
    var ret = new List<int>();
    var seen = new HashSet<int>();    // values of A1 that also appear in A2
    A1.Sort();                        // duplicates become adjacent; the tail comes out sorted
    foreach (int v in A2)
    {
        int k = A1.IndexOf(v);        // first occurrence of v in the sorted A1
        if (k != -1)
        {
            seen.Add(v);
            // copy every duplicate of v, preserving A2's relative order
            for (; k < A1.Count && A1[k] == v; k++)
                ret.Add(A1[k]);
        }
    }
    // append the values absent from A2; A1 is already sorted
    foreach (int v in A1)
        if (!seen.Contains(v))
            ret.Add(v);
    return ret;
}
In the above code, we assume that the relative order of elements is dictated by A2.
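For example, with the arrays above:

var A1 = new List<int> { 2, 1, 2, 5, 7, 1, 9, 3, 6, 8, 8 };
var A2 = new List<int> { 2, 1, 8, 3 };
Console.WriteLine(string.Join(", ", RelSort(A1, A2)));
// prints: 2, 2, 1, 1, 8, 8, 3, 5, 6, 7, 9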
