Saturday, September 23, 2017

We continue to review the slides from Stanford that introduce Natural Language Processing via vector semantics. We said that vector representations are useful and open up new possibilities, and we saw that a lookup such as a thesaurus does not help. We were first reviewing co-occurrence matrices, which come in several forms: the term-document matrix, the word-word matrix, the word-context matrix, and so on. The term-document matrix records the count of word w in document d, so each document becomes a count vector. Similarity between words in this case merely indicates that they tend to occur in the same documents. If we change the scope from documents to some smaller text boundary, we get a word-word matrix, whose similarities improve over those of the term-document matrix. A word-context matrix improves on this further, because describing a word in terms of its contexts comes closer to its meaning and yields semantic similarity.
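To make this concrete, here is a minimal Python sketch of building word-word co-occurrence counts within a fixed window. The toy corpus and the function name are made up for illustration; they are not from the slides.

from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    # Count how often each pair of words co-occurs within `window`
    # positions of each other; `sentences` is a list of token lists.
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

# Toy corpus, made up for illustration
corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]
print(dict(cooccurrence_counts(corpus)["like"]))
# {'i': 2, 'deep': 1, 'learning': 1, 'nlp': 1}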
Some of these matrices can be very sparse, with zeros covering a majority of the cells. This is quite alright, since there are many efficient algorithms for sparse matrices. The size of the window can also be adjusted: the shorter the window, the more syntactic the representation; the longer the window, the more semantic the representation. Instead of raw co-occurrence, we now consider a different measure of association between two words, called positive pointwise mutual information (PPMI). Raw word frequency suffered from being skewed by frequent but less salient words. Pointwise mutual information indicates whether two words occur together more often than they would if they were independent. It is defined as the log of their joint probability divided by the product of their individual probabilities: PMI(w, c) = log2 [ P(w, c) / (P(w) P(c)) ]. This value can come out negative or positive. Negative values are not helpful: the probabilities involved are small, so the ratios fall into inverse powers of ten that carry little significance, and unrelatedness is not useful to our analysis anyway. Positive PMI, on the other hand, helps us discern whether two words are likely to appear together, so we keep the PMI only when it is positive and replace it with zero otherwise. Computing PPMI is easy to do from the probabilities, which are based on cumulative counts of word occurrences.
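A minimal Python sketch of that PPMI computation over a word-by-context count matrix follows; the count matrix is made up for illustration and the helper name is my own.

import numpy as np

def ppmi(counts):
    # counts[i, j] is how often word i appears with context j.
    total = counts.sum()
    p_wc = counts / total                             # joint probabilities
    p_w = counts.sum(axis=1, keepdims=True) / total   # word marginals
    p_c = counts.sum(axis=0, keepdims=True) / total   # context marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0   # zero counts give -inf/nan; drop them
    return np.maximum(pmi, 0.0)    # keep only the positive PMI values

# Small made-up count matrix (rows: words, columns: contexts)
counts = np.array([[0, 1, 2],
                   [3, 0, 1],
                   [1, 2, 0]], dtype=float)
print(ppmi(counts))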
#codingexercise
Count the number of islands (connected groups of 1s) in a sparse binary matrix.
This problem is similar to finding connected components in a graph. In a 2-D matrix, every cell has up to eight neighbors, and a depth-first search explores all of them. When a cell is visited, it is marked so that it is not included in a later traversal. The number of depth-first searches started from unvisited cells equals the number of connected components, which is the number of islands.
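A minimal Python sketch of this approach, assuming the grid is a list of lists of 0s and 1s; the example grid is made up for illustration.

def count_islands(grid):
    # Count connected components of 1s, treating all eight
    # neighbors of a cell as connected.
    if not grid:
        return 0
    rows, cols = len(grid), len(grid[0])
    visited = [[False] * cols for _ in range(rows)]

    def dfs(r, c):
        visited[r][c] = True
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (0 <= nr < rows and 0 <= nc < cols
                        and grid[nr][nc] == 1 and not visited[nr][nc]):
                    dfs(nr, nc)

    islands = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 1 and not visited[r][c]:
                dfs(r, c)      # each new start marks one whole island
                islands += 1
    return islands

grid = [[1, 1, 0, 0, 0],
        [0, 1, 0, 0, 1],
        [1, 0, 0, 1, 1],
        [0, 0, 0, 0, 0],
        [1, 0, 1, 0, 1]]
print(count_islands(grid))  # 5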
