Thursday, September 29, 2016

Today we continue to discuss the relevance of a thesaurus in natural language processing. Many algorithms have been proposed that determine latent similarity based on a word co-occurrence matrix. Yet it is natural to find similarity based on some ontology. WordNet filled this gap very well. Words were arranged in a predefined tree, and their hierarchy gave an organization of the topics they belonged under. Furthermore, it gave similarity measures as distances between words along the edges. The drawbacks were that not all words were found in WordNet, and the organization didn't really help with every context. Where it helped the most was in its association with parts of speech. A thesaurus improves the usefulness in determining similarity by providing all possible synonyms and thus exploring as many semantic connections as possible. In fact, the authors of many NLP algorithms have mentioned that their methods are significantly improved when such an online thesaurus is available, and they state that a thesaurus theoretically does better at semantic closeness than WordNet.
Such a thesaurus, however, was not available in a machine-readable format.
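As a rough, minimal sketch of that edge-distance idea (this is not WordNet's API; the tiny is-a hierarchy, the word choices, and the helper names below are invented purely for illustration), similarity can be taken as the number of edges separating two words in a hand-built tree:

using System;
using System.Collections.Generic;

class EdgeDistanceDemo
{
    // A toy is-a hierarchy; real WordNet is far larger and richer.
    static readonly Dictionary<string, string> Parent = new Dictionary<string, string>
    {
        { "dog", "canine" }, { "wolf", "canine" }, { "canine", "mammal" },
        { "cat", "feline" }, { "feline", "mammal" }, { "mammal", "animal" }
    };

    // Walk from a word up to the root, recording the depth of each ancestor.
    static Dictionary<string, int> Ancestors(string word)
    {
        var depths = new Dictionary<string, int>();
        int d = 0;
        string w = word;
        while (w != null)
        {
            depths[w] = d++;
            w = Parent.TryGetValue(w, out var p) ? p : null;
        }
        return depths;
    }

    // Edge distance: edges from each word up to their lowest common ancestor.
    static int EdgeDistance(string a, string b)
    {
        var da = Ancestors(a);
        var db = Ancestors(b);
        int best = int.MaxValue;
        foreach (var kv in da)
            if (db.ContainsKey(kv.Key))
                best = Math.Min(best, kv.Value + db[kv.Key]);
        return best;
    }

    static void Main()
    {
        Console.WriteLine(EdgeDistance("dog", "wolf"));  // 2: dog -> canine <- wolf
        Console.WriteLine(EdgeDistance("dog", "cat"));   // 4: dog -> canine -> mammal <- feline <- cat
    }
}

Smaller distances indicate closer words; WordNet-style measures refine this raw count, for example by normalizing for depth in the hierarchy.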
Today we have learned to store graphs with trillions of edges. We can partition them over thousands of nodes using technologies such as Apache Giraph. With several thousand worker nodes, we can find relationships between words in parallelized summation forms. For example, if we were to perform k-means clustering of the words based on accepted similarities, the vertices could start out partitioned on different nodes but eventually become tagged with the cluster they belong to. The master would initialize the centroids and share them with local aggregators. These aggregators in turn would tag their vertices with the nearest centroid. The centroids would then be recomputed once the aggregators have shared their results with the master.
This summation-form technique works very well and scales linearly with the size of the graph.
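For a concrete feel of that summation form, here is a minimal single-process sketch (this is not Giraph code; the in-memory partitions merely stand in for worker nodes, and all names and sample vectors are illustrative). Each partition tags its vertices with the nearest centroid and emits partial sums; the master then recombines those partial sums into new centroids:

using System;
using System.Collections.Generic;
using System.Linq;

class KMeansSummationSketch
{
    // One k-means superstep in summation form: each "worker" partition contributes
    // (sum of vectors, count) per cluster, and the master combines the partial sums.
    static double[][] Superstep(List<double[][]> partitions, double[][] centroids)
    {
        int k = centroids.Length, dim = centroids[0].Length;
        var sums = new double[k][];
        var counts = new int[k];
        for (int c = 0; c < k; c++) sums[c] = new double[dim];

        foreach (var partition in partitions)              // each worker's local aggregator
        {
            foreach (var v in partition)
            {
                int nearest = Enumerable.Range(0, k)
                    .OrderBy(c => Distance(v, centroids[c]))
                    .First();                              // tag vertex with nearest centroid
                counts[nearest]++;
                for (int d = 0; d < dim; d++) sums[nearest][d] += v[d];
            }
        }

        // Master recomputes centroids from the aggregated partial sums.
        var next = new double[k][];
        for (int c = 0; c < k; c++)
            next[c] = counts[c] == 0
                ? centroids[c]
                : sums[c].Select(x => x / counts[c]).ToArray();
        return next;
    }

    static double Distance(double[] a, double[] b)
    {
        double s = 0;
        for (int i = 0; i < a.Length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.Sqrt(s);
    }

    static void Main()
    {
        var partitions = new List<double[][]>
        {
            new[] { new[] { 1.0, 1.0 }, new[] { 1.2, 0.8 } },   // worker 1's vertices
            new[] { new[] { 5.0, 5.0 }, new[] { 4.8, 5.2 } }    // worker 2's vertices
        };
        var centroids = new[] { new[] { 0.0, 0.0 }, new[] { 6.0, 6.0 } };  // master's initial guess
        for (int i = 0; i < 5; i++) centroids = Superstep(partitions, centroids);
        Console.WriteLine(string.Join("; ", centroids.Select(c => $"({c[0]:F2}, {c[1]:F2})")));
    }
}

In an actual Giraph job the partial sums would travel through aggregators rather than a local loop, but the arithmetic is the same, which is why the computation parallelizes so cleanly.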
However, ever since the WordNet era, the way this similarity has been used is in finding lexical chains through the text.

A lexical chain was a set of words in the text that contributed to some strand of meaning in the overall discourse. The best of these lexical chains was selected as the overall topic of the text. Naturally, a thesaurus performed better than WordNet in finding these chains; however, both were likely to pick false chains.
This approach was greatly improved by statistical analysis, which considered the text as a bag-of-words model in which words are represented as high-dimensional vectors.
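A minimal sketch of that bag-of-words view (the vocabulary, sample texts, and tokenization here are made up for illustration; real systems use far larger, usually weighted vectors): each text becomes a term-count vector, and similarity becomes the cosine between vectors.

using System;
using System.Collections.Generic;
using System.Linq;

class BagOfWordsSketch
{
    // Turn a text into a term-count vector over a fixed vocabulary.
    static double[] Vectorize(string text, string[] vocabulary)
    {
        var counts = text.ToLowerInvariant()
                         .Split(' ', ',', '.')
                         .GroupBy(w => w)
                         .ToDictionary(g => g.Key, g => g.Count());
        return vocabulary.Select(w => counts.TryGetValue(w, out var c) ? (double)c : 0.0).ToArray();
    }

    // Cosine similarity between two term-count vectors.
    static double Cosine(double[] a, double[] b)
    {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
        return dot / (Math.Sqrt(na) * Math.Sqrt(nb) + 1e-12);
    }

    static void Main()
    {
        var vocab = new[] { "bank", "river", "money", "loan", "water" };
        var a = Vectorize("the bank approved the loan with money", vocab);
        var b = Vectorize("money and a loan from the bank", vocab);
        var c = Vectorize("the river bank and the water", vocab);
        Console.WriteLine("sim(a,b) = {0:F2}, sim(a,c) = {1:F2}", Cosine(a, b), Cosine(a, c));
    }
}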

We will continue this discussion, but first let us look at a coding exercise:
Given a stream of characters, find the first non-repeating character.
A solution is to maintain a doubly linked list that keeps track of the non-repeating characters seen so far, in order, together with direct-indexed lookups so each character's node can be found in constant time.
// requires: using System; using System.Collections.Generic; using System.IO;
static void PrintFirstNonRepeatingChar(TextReader s)
{
    const int MaxChar = 256;
    // Characters seen exactly once so far, in order of first appearance.
    var nr = new LinkedList<char>();
    // present[c] holds c's node in nr while c has been seen exactly once.
    var present = new LinkedListNode<char>[MaxChar];
    // dup[c] becomes true once c has repeated.
    var dup = new bool[MaxChar];

    int current = s.Read();
    while (current != -1)
    {
        char c = (char)current;
        if (!dup[c])
        {
            if (present[c] == null)
            {
                // First occurrence: append to the list and remember its node.
                present[c] = nr.AddLast(c);
            }
            else
            {
                // Repetition: drop it from the candidate list and mark it as a duplicate.
                nr.Remove(present[c]);
                present[c] = null;
                dup[c] = true;
            }
        }
        if (nr.First != null)
            Console.WriteLine("First non-repeated char is {0}", nr.First.Value);
        current = s.Read();
    }
}


The above can also be modified to exit after the first non-repeated character is found.

Also note that a character's state is toggled to duplicate only after its first repetition is seen.

If the number of unique characters is not restricted to ASCII, the fixed-size arrays above do not scale. Consequently, a hash table may be required.
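A minimal sketch of that variant (same algorithm as above, with the fixed-size arrays replaced by a dictionary keyed on the character, so any alphabet works; the sample input is arbitrary):

using System;
using System.Collections.Generic;
using System.IO;

class FirstNonRepeatingUnbounded
{
    static void PrintFirstNonRepeatingChar(TextReader s)
    {
        var nr = new LinkedList<char>();
        // Maps a character to its node in nr while seen once, or to null once it repeats.
        var state = new Dictionary<char, LinkedListNode<char>>();

        for (int current = s.Read(); current != -1; current = s.Read())
        {
            char c = (char)current;
            if (!state.ContainsKey(c))
                state[c] = nr.AddLast(c);            // first occurrence
            else if (state[c] != null)
            {
                nr.Remove(state[c]);                 // second occurrence: no longer a candidate
                state[c] = null;
            }
            if (nr.First != null)
                Console.WriteLine("First non-repeated char is {0}", nr.First.Value);
        }
    }

    static void Main()
    {
        PrintFirstNonRepeatingChar(new StringReader("geeksforgeeks"));
    }
}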
# combination puzzles : https://1drv.ms/w/s!Ashlm-Nw-wnWlBdZZKfVTy5CTc4i 
