Sunday, September 24, 2017

We continue to review the slides from Stanford that introduce Natural Language Processing via vector semantics. We said that a vector representation is useful and opens up new possibilities, and we saw that a lookup such as a thesaurus does not help. We were first reviewing co-occurrence matrices, which come in several forms: the term-document matrix, the word-word matrix, the word-context matrix, and so on. The term-document matrix holds the count of word w in document d, so each document becomes a count vector. Similarity between words in this case merely indicates that their patterns of occurrence across documents are similar. If we change the scope from documents to some smaller text boundary, we get the word-word matrix, and the similarity it captures improves over that of the term-document matrix. A word-context matrix improves this further, because describing a word in terms of its context comes closer to its meaning and brings out semantic similarity.
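As a minimal sketch of the idea (the toy sentence and the window size of 2 are my own choices, not from the slides), a word-word matrix can be accumulated by sliding a context window over a token stream:

from collections import defaultdict

def word_word_counts(tokens, window=2):
    # Count each word together with its neighbors within a +/- `window`
    # token span; the nested dict is a sparse word-word matrix.
    counts = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[w][tokens[j]] += 1
    return counts

tokens = "the quick brown fox jumps over the lazy dog".split()
print(dict(word_word_counts(tokens)["the"]))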
Instead of raw co-occurrence, we now consider a different measure of association between two words, called positive pointwise mutual information (PPMI). Pointwise mutual information indicates whether two words occur together more often than they would if they were independent. It is defined as the log of their joint probability divided by the product of their individual probabilities: PMI(w, c) = log2( P(w, c) / (P(w) P(c)) ). This value can come out negative or positive. Negative values are not helpful because they carry little significance: the probabilities are small, and the ratio lands in inverse powers of ten that give no meaningful signal. Unrelatedness is also not useful to our analysis. Positive PMI, on the other hand, helps discern whether two words are likely to occur together, so we keep the PMI only when it turns out positive. Computing the PPMI is easy once we have the probabilities, which are based on cumulative word co-occurrence counts.
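A small sketch of that computation, assuming the counts are already collected into a word-by-context matrix C (the toy numbers are made up for illustration):

import numpy as np

def ppmi(C):
    # C: word-by-context count matrix. Convert counts to probabilities,
    # take PMI = log2(P(w,c) / (P(w) * P(c))), and clip negatives to zero.
    total = C.sum()
    p_wc = C / total
    p_w = C.sum(axis=1, keepdims=True) / total
    p_c = C.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))
    return np.maximum(pmi, 0)

C = np.array([[2, 1, 0],
              [1, 4, 1],
              [0, 1, 2]], dtype=float)
print(ppmi(C))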
Today we see that PMI needs to be weighted because it is biased toward infrequent events: rare words receive very high PMI values, which skews the results when the text contains rare words. To make it fairer, the PMI is weighted in one of two ways: by raising the context probabilities to a power, or with add-one smoothing. In the first case, the context probability is computed from counts raised to the power alpha = 0.75, which raises the probability of rare contexts; in the second, a small constant is added to every count before computing probabilities, which nudges them toward the uniform distribution.
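A sketch of the first variant, smoothing the context distribution with alpha = 0.75 as described above (again assuming a count matrix C and reusing numpy):

import numpy as np

def ppmi_alpha(C, alpha=0.75):
    # Context-distribution smoothing: replace P(c) with
    # P_alpha(c) = count(c)^alpha / sum_c count(c)^alpha, which raises
    # the probability of rare contexts and tempers their inflated PMI.
    total = C.sum()
    p_wc = C / total
    p_w = C.sum(axis=1, keepdims=True) / total
    c_alpha = C.sum(axis=0) ** alpha
    p_c_alpha = (c_alpha / c_alpha.sum()).reshape(1, -1)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c_alpha))
    return np.maximum(pmi, 0)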
#codingexercise
Find all the heroes and the superhero in an integer array. A hero is an element that is greater than all the elements to its right. The superhero is an element that is greater than all the elements to both its left and its right.
This problem can be solved by traversing the array from right to left while keeping track of the maximum seen so far. Any element that exceeds the current maximum is greater than everything to its right, so it qualifies as a hero and becomes the new current maximum; any element that does not exceed the current maximum is not a hero. The final value of the current maximum is the global maximum, and if it occurs only once it is also the superhero.
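A minimal Python sketch of this approach (the function name and the sample array are mine):

def heroes_and_superhero(arr):
    # Traverse right to left, tracking the running maximum. An element
    # strictly greater than the running maximum beats everything to its
    # right, so it is a hero. The final maximum is the global maximum;
    # it is the superhero only if no duplicate ties it elsewhere.
    heroes, current_max = [], float("-inf")
    for x in reversed(arr):
        if x > current_max:
            heroes.append(x)
            current_max = x
    superhero = current_max if arr.count(current_max) == 1 else None
    return heroes[::-1], superhero

print(heroes_and_superhero([16, 17, 4, 3, 5, 2]))  # ([17, 5, 2], 17)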
