Wednesday, September 20, 2017

We continue to review the slides from Stanford that introduce Natural Language Processing via Vector Semantics. We said that a vector representation is useful and opens up new possibilities. For example, it helps compute the similarity between words: "fast" is similar to "rapid", and "tall" is similar to "height". This can help in question answering, say when the question "How tall is Mt. Everest?" is answered by "The height of Mt. Everest is 29029 feet." Similarity of words also helps with plagiarism detection. If two narratives have a few words changed here and there, the similarity of the corresponding words should still be high because they share the same context. When a large number of word vectors are similar, the overall narrative is likely plagiarized.
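As a small sketch of how such word similarity can be computed, the following compares word vectors with cosine similarity; the sample vectors and the helper name cosine_similarity are made up purely for illustration.

import math

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical co-occurrence counts over three context words
fast  = [3, 7, 1]
rapid = [2, 8, 1]
tall  = [9, 0, 4]

print(cosine_similarity(fast, rapid))  # expected to be high
print(cosine_similarity(fast, tall))   # expected to be lower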
Word vectors are also useful when the semantics of a word change over time. Words hold their meaning only in the context of the surrounding words. If their usage changes over time, their meaning also changes. Consequently, word similarity may change based on context. The problem with using a thesaurus in this case is that a thesaurus does not exist for every year, so it cannot relate words that mean one thing today and meant something else yesterday. Moreover, a thesaurus, unlike a dictionary, does not contain all words and phrases, particularly verbs and adjectives.
Therefore, instead of looking up an ontology, we now refer to a distributional model for the meaning of a word, which relies on the context surrounding the given word. A synonym is then a choice of words that share the same context and usage. In fact, we interpret the meanings of unknown words by looking at the surrounding words and their context.
Stanford NLP has shown there are four kinds of vector models:
A sparse vector representation, where a word is represented in terms of its co-occurrences with other words, using a set of weights for those co-occurrences. The weight is usually based on a metric called pointwise mutual information (a small sketch follows this list).
A dense vector representation, which takes one of the following forms:
A representation based on weights associated with other words, where the weights are obtained by dimensionality reduction (singular value decomposition) of the co-occurrence matrix; this is referred to as latent semantic analysis.
Neural-network-based models, where the weights with respect to other words are learned either by predicting a word from its surrounding words or by predicting the surrounding words from the current word.
A set of clusters based on Brown clustering.
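As a minimal sketch of the sparse representation mentioned above, the following computes pointwise-mutual-information weights from a tiny co-occurrence matrix; the counts, words, and variable names are hypothetical and only illustrate the calculation.

import math

# Hypothetical co-occurrence counts: rows are target words, columns are context words
words = ["fast", "rapid", "tall"]
contexts = ["car", "growth", "building"]
counts = [
    [10, 2, 0],   # fast
    [8, 3, 0],    # rapid
    [0, 1, 12],   # tall
]

total = sum(sum(row) for row in counts)
row_sums = [sum(row) for row in counts]
col_sums = [sum(col) for col in zip(*counts)]

def ppmi(i, j):
    # PMI = log2( P(word, context) / (P(word) * P(context)) ), clipped at zero
    if counts[i][j] == 0:
        return 0.0
    p_wc = counts[i][j] / total
    p_w = row_sums[i] / total
    p_c = col_sums[j] / total
    return max(0.0, math.log2(p_wc / (p_w * p_c)))

for i, w in enumerate(words):
    print(w, [round(ppmi(i, j), 2) for j in range(len(contexts))])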
#codingexercise
Find the maximum water retention in a bar chart
Water is retained over a bar of unit width, between the tallest bars to its left and right, up to a depth equal to the difference between the minimum of the left and right maxima and the height of the current bar. Therefore, for each bar we find the maximum height on its left and on its right and calculate the water retained as above. We then accumulate this water retained for each bar across the range of bars. Since we need the maximum on the left and on the right for each bar, we can compute these in two separate passes over all the bars, as the sketch below shows.
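A minimal sketch of this two-pass approach; the function name and the sample inputs are illustrative.

def max_water_retained(heights):
    n = len(heights)
    if n == 0:
        return 0
    # left_max[i]: tallest bar at or before i; right_max[i]: tallest bar at or after i
    left_max = [0] * n
    right_max = [0] * n
    left_max[0] = heights[0]
    for i in range(1, n):
        left_max[i] = max(left_max[i - 1], heights[i])
    right_max[n - 1] = heights[n - 1]
    for i in range(n - 2, -1, -1):
        right_max[i] = max(right_max[i + 1], heights[i])
    # Water above each bar is bounded by the smaller of the two maxima
    total = 0
    for i in range(n):
        total += min(left_max[i], right_max[i]) - heights[i]
    return total

print(max_water_retained([2, 0, 2]))           # 2
print(max_water_retained([3, 0, 1, 0, 4, 2]))  # 8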
