Sunday, September 24, 2017

We continue to review the slides from Stanford that introduce Natural Language Processing via Vector Semantics. We said that vector representation is useful and opens up new possibilities. We saw that a lookup such as a thesaurus does not help. We were first reviewing co-occurrence matrices. These come in many forms such as the term-document matrix, word-word matrix, word-context matrix etc. The term-document matrix is a count of word w in a document d. Each document therefore becomes a count vector. The similarity between words in this case merely indicates that they tend to occur in the same documents. If we change the scope from documents to some other text boundary, we have the word-word matrix. The similarity in this case improves over that in the term-document matrix. A word-context matrix improves this further because the context of a word is closer to its meaning and brings out semantic similarity.
Instead of raw co-occurrence, we now consider a different measure of correlation between two words, called positive pointwise mutual information. Pointwise mutual information indicates whether two words occur together more often than they would if they were independent. It is defined as the log of their joint probability divided by the product of their individual probabilities. This value can come out negative or positive. The negative values are not helpful because they are hard to estimate reliably: the probabilities involved are tiny, and unrelatedness is also not informative for our analysis. On the other hand, positive PMI helps discern whether two words are likely to appear together, so we keep the PMI only when it turns out to be positive. Computing the PPMI is easy to do with probabilities based on cumulative counts of word co-occurrences.
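A minimal sketch of this computation, assuming a small word-context count matrix as input (the names here are illustrative):

import numpy as np

def ppmi(counts):
    # counts[i][j] = number of times word i occurs with context j
    total = counts.sum()
    p_wc = counts / total                               # joint probabilities
    p_w = counts.sum(axis=1, keepdims=True) / total     # word marginals
    p_c = counts.sum(axis=0, keepdims=True) / total     # context marginals
    with np.errstate(divide='ignore'):
        pmi = np.log2(p_wc / (p_w * p_c))
    # keep only the positive values; negative PMI is discarded as unreliable
    return np.maximum(pmi, 0)

counts = np.array([[10., 0., 2.],
                   [ 1., 5., 3.]])
print(ppmi(counts))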
Today we realize that PMI needs to be weighted, because it is biased towards infrequent events. If the words are rare, they get a high PMI value, which skews the results when the text has rare words. To make it fair, the PMI is weighted. This can be achieved either by raising the context probabilities or with add-one smoothing. In the first case the context probabilities are raised to the power alpha = 0.75 and renormalized; in the second case a small constant is added to every count before the probabilities are computed.
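A sketch of the first variant, where the context probabilities are raised to the power alpha = 0.75 and renormalized before the PPMI is computed (again, the names are only illustrative):

import numpy as np

def ppmi_alpha(counts, alpha=0.75):
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    # raise the context counts to the power alpha and renormalize,
    # which lifts the probability mass given to rare contexts
    context_counts = counts.sum(axis=0, keepdims=True)
    p_c_alpha = context_counts ** alpha / (context_counts ** alpha).sum()
    with np.errstate(divide='ignore'):
        pmi = np.log2(p_wc / (p_w * p_c_alpha))
    return np.maximum(pmi, 0)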
#codingexercise
Find all the heroes and the superheroes in an integer array. The heroes are the elements which are greater than all the elements to the right of them. A superhero is the element which is greater than all the elements to the left and the right.
This problem can be solved by traversing the array from the right while keeping track of the current maximum seen so far. Any element that exceeds the current maximum is greater than everything to its right, so it satisfies the criterion for being a hero and becomes the new current maximum. Any element that does not exceed the current maximum is not a hero. The final value of the current maximum is the overall maximum of the array, so it is also the superhero.
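A minimal sketch of this right-to-left scan (the function name is illustrative):

def heroes_and_superhero(arr):
    heroes = []
    current_max = float('-inf')
    # scan from the right; any element larger than everything seen so far
    # is greater than all elements to its right, hence a hero
    for x in reversed(arr):
        if x > current_max:
            heroes.append(x)
            current_max = x
    # the final current_max is the overall maximum, so it is also greater
    # than everything to its left: the superhero (assuming a unique maximum)
    return heroes[::-1], current_max

print(heroes_and_superhero([3, 7, 5, 9, 2, 4]))  # heroes: [9, 4], superhero: 9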

Saturday, September 23, 2017

We continue to review the slides from Stanford that introduce Natural Language Processing via Vector Semantics. We said that vector representation is useful and opens up new possibilities. We saw that a lookup such as a thesaurus does not help. We were first reviewing co-occurrence matrices. These come in many forms such as the term-document matrix, word-word matrix, word-context matrix etc. The term-document matrix is a count of word w in a document d. Each document therefore becomes a count vector. The similarity between words in this case merely indicates that they tend to occur in the same documents. If we change the scope from documents to some other text boundary, we have the word-word matrix. The similarity in this case improves over that in the term-document matrix. A word-context matrix improves this further because the context of a word is closer to its meaning and brings out semantic similarity.
Some of these matrices can be very sparse, with zeros covering a majority of the cells. This is quite alright since there are lots of efficient algorithms for sparse matrices. Similarly, the size of the window can also be adjusted: the shorter the window, the more syntactic the representation; the longer the window, the more semantic the representation. Instead of raw co-occurrence, we now consider a different measure of correlation between two words, called positive pointwise mutual information. Raw word frequency suffers from being skewed by frequent but less salient words. Pointwise mutual information indicates whether two words occur together more often than they would if they were independent. It is defined as the log of their joint probability divided by the product of their individual probabilities. This value can come out negative or positive. The negative values are not helpful because they are hard to estimate reliably: the probabilities involved are tiny, and unrelatedness is also not informative for our analysis. On the other hand, positive PMI helps discern whether two words are likely to appear together, so we keep the PMI only when it turns out to be positive. Computing the PPMI is easy to do with probabilities based on cumulative counts of word co-occurrences.
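A sketch of building such a window-based word-word co-occurrence matrix (the tokenization and window size here are illustrative):

from collections import defaultdict

def cooccurrence(sentences, window=2):
    # counts[w][c] = number of times context word c appears within
    # `window` positions of word w
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, w in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[w][tokens[j]] += 1
    return counts

counts = cooccurrence(["the quick brown fox", "the lazy brown dog"])
print(dict(counts["brown"]))   # context neighbors of "brown" within the window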
#codingexercise
Count the number of islands in a sparse matrix.
This problem is similar to finding connected components in a graph. In a 2D matrix, every cell has up to eight neighbors. A depth first search explores all eight neighbors. When a cell is visited, it is marked so that it is not included in the next traversal. The number of depth first searches initiated equals the number of connected components, that is, the number of islands.
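A minimal sketch of this depth first search, assuming 1 marks land and 0 marks water:

def count_islands(grid):
    rows, cols = len(grid), len(grid[0])
    visited = [[False] * cols for _ in range(rows)]

    def dfs(r, c):
        # visit the cell and explore all eight neighbors
        visited[r][c] = True
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols \
                        and grid[nr][nc] == 1 and not visited[nr][nc]:
                    dfs(nr, nc)

    islands = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 1 and not visited[r][c]:
                dfs(r, c)          # each new search discovers one island
                islands += 1
    return islands

print(count_islands([[1, 1, 0, 0],
                     [0, 1, 0, 1],
                     [0, 0, 0, 1]]))   # 2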

Friday, September 22, 2017

We continue to review the slides from Stanford that introduce Natural Language Processing via Vector Semantics. We said that vector representation is useful and opens up new possibilities. We saw that a lookup such as a thesaurus does not help.
Stanford NLP has shown there are four kinds of vector models.
A sparse vector representation where a word is represented in terms of its co-occurrences with other words, using a set of weights for those co-occurrences. This weight is usually based on a metric called mutual information.
A dense vector representation that involves latent semantic analysis, a neural net, or clusters from the Brown corpus. The dense vector representations share a representation of a word as a vector of numbers, which translates a word into a corresponding point in the vector space. This is called an embedding.
Co-occurrence matrices come in many forms such as the term-document matrix, word-word matrix, word-context matrix etc. The term-document matrix is a count of word w in a document d. Each document therefore becomes a count vector. The similarity between words in this case merely indicates that they tend to occur in the same documents. If we change the scope from documents to some other text boundary, we have the word-word matrix. The similarity in this case improves over that in the term-document matrix. A word-context matrix improves this further because the context of a word is closer to its meaning and brings out semantic similarity.
Co-occurrence between two words has two forms - first order and second order. First-order co-occurrence is syntagmatic association and second-order co-occurrence is paradigmatic association, which means the first one is based on positions whereas the second one is based on having similar neighbors. Note that the vectorization derives from the usage of words, which is why it becomes popular. Another way to look at usage is to canonicalize the text into an Esperanto-like intermediate language where the relations and syntax are more oriented towards natural language processing. Some work has already begun with different kinds of improvements to ontologies that are not restricted to a thesaurus or WordNet, such as FrameNet. All we need to keep in mind here is that there are layers to tackle the problem - usage, vector space, and classification of vectors.
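A small sketch to make the first-order versus second-order distinction concrete, using toy co-occurrence rows (the words and counts are made up for illustration): a first-order association is read directly from a cell of the matrix, while a second-order association compares two words' rows, for instance with cosine similarity:

import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# rows: words, columns: context words (toy counts)
#                  drink  eat  red  sweet
wine = np.array([    8.,  1.,  5.,  2.])
beer = np.array([    7.,  2.,  0.,  1.])

# first-order: wine co-occurs with "drink" often (the cell value itself)
print(wine[0])
# second-order: wine and beer are similar because their neighbor profiles
# are similar, even where they do not co-occur with each other
print(cosine(wine, beer))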

Thursday, September 21, 2017

We continue to review the slides from Stanford that introduce Natural Language Processing via Vector Semantics. We said that vector representation is useful and opens up new possibilities. We saw that a lookup such as a thesaurus does not help.
Stanford NLP has shown there are four kinds of vector models.
A sparse vector representation where a word is represented in terms of its co-occurrences with other words, using a set of weights for those co-occurrences. This weight is usually based on a metric called mutual information.
A dense vector representation that takes one of the following vector models:
A representation based on weights associated with other words, where the weights are computed using conditional probabilities of the occurrences; this is referred to as latent semantic analysis
A neural network based model where the weights with other words are determined first by predicting a word based on the surrounding words and then by predicting the surrounding words based on the current word (a small sketch follows this list)
A set of clusters based on the Brown corpus.
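A minimal gensim sketch of the neural model just described (the toy corpus and parameter values are made up for illustration): sg=0 trains the variant that predicts a word from its surrounding words (CBOW), and sg=1 the variant that predicts the surrounding words from the current word (skip-gram).

from gensim.models import word2vec

corpus = [["the", "quick", "brown", "fox"],
          ["the", "lazy", "brown", "dog"]]

# CBOW: predict a word from the surrounding words
cbow = word2vec.Word2Vec(corpus, size=50, window=2, min_count=1, sg=0)
# skip-gram: predict the surrounding words from the current word
skipgram = word2vec.Word2Vec(corpus, size=50, window=2, min_count=1, sg=1)

print(cbow.wv["brown"][:5])                 # dense embedding for "brown"
print(skipgram.wv.similarity("fox", "dog")) # similarity in the embedding space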
#codingexercise
Find the minimum number of squares whose sum equals a given number n.
We write a few base cases, say up to n = 3, where the answer is n itself.
For n greater than that, each candidate from 4 to n is initialized to the worst case: the candidate represented entirely by unit squares, i.e. the candidate itself.
Next, for each i from 1 while i squared does not exceed the candidate, we recursively calculate the minimum number of squares for the candidate minus i squared and add one to that count. We keep the minimum over all such i. All the results are memoized for easy lookup, so the smallest number of squares ends up in the table entry for n.
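A minimal memoized sketch of this approach (a bottom-up table would work just as well):

def min_squares(n, memo=None):
    if memo is None:
        memo = {}
    if n <= 3:
        return n            # base cases: 1, 2, 3 need that many unit squares
    if n in memo:
        return memo[n]
    best = n                # at worst, n unit squares
    i = 1
    while i * i <= n:
        best = min(best, 1 + min_squares(n - i * i, memo))
        i += 1
    memo[n] = best
    return best

print(min_squares(13))   # 2, since 13 = 4 + 9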

Wednesday, September 20, 2017

We continue to review the slides from Stanford that introduce Natural Language Processing via Vector Semantics. We said that vector representation is useful and opens up new possibilities. For example, it helps compute the similarity between words: "fast" is similar to "rapid", "tall" is similar to "height". This can help in question answering, say as in "How tall is Mt. Everest?" and "The height of Mt. Everest is 29029 feet." Similarity of words also helps with plagiarism detection. If two narratives have a few words changed here and there, the similarity of the words should still be high because they share the same context. When a large number of word vectors are similar, the overall narrative is likely plagiarized.
Word vectors are also useful when the semantics of a word change over time. Words hold their meaning only in the context of the surrounding words. If their usage changes over time, their meaning also changes. Consequently, word similarity may change based on context. The problem with using a thesaurus in this case is that a thesaurus does not exist for every year, so it cannot relate a word that means something today to what it meant yesterday. Moreover, a thesaurus, unlike a dictionary, does not contain all words and phrases, particularly verbs and adjectives.
Therefore, instead of looking up an ontology, we now refer to a distributional model for the meaning of a word, which relies on the context surrounding the given words. A synonym is therefore a choice of words that share the same context and usage. In fact, we interpret the meanings of unknown words by looking at the surrounding words and their context.
Stanford NLP has shown there are four kinds of vector models.
A sparse vector representation where a word is represented in terms of its co-occurrences with other words, using a set of weights for those co-occurrences. This weight is usually based on a metric called mutual information.
A dense vector representation that takes one of the following vector models:
A representation based on weights associated with other words, where the weights are computed using conditional probabilities of the occurrences; this is referred to as latent semantic analysis
A neural network based model where the weights with other words are determined first by predicting a word based on the surrounding words and then by predicting the surrounding words based on the current word
A set of clusters based on the Brown corpus.
#codingexercise
Find the maximum water retention in a bar chart
Water is retained over a bar of unit length between the taller bars on its left and right, up to a depth equal to the difference between the minimum of the left and right maxima and the height of the current bar. Therefore, for each bar we can find the maximum on the left and on the right and calculate the water retained as above, then accumulate this amount over the whole range of bars. Since we need the maximum on the left and on the right for each bar, we can precompute these in two separate passes over all the bars.
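A two-pass sketch of this calculation (the names are illustrative):

def water_retained(bars):
    n = len(bars)
    if n == 0:
        return 0
    # max_left[i] / max_right[i]: tallest bar up to and including i,
    # scanning from the left / from the right
    max_left, max_right = [0] * n, [0] * n
    max_left[0] = bars[0]
    for i in range(1, n):
        max_left[i] = max(max_left[i - 1], bars[i])
    max_right[n - 1] = bars[n - 1]
    for i in range(n - 2, -1, -1):
        max_right[i] = max(max_right[i + 1], bars[i])
    # water above each bar is bounded by the lower of the two maxima
    return sum(min(max_left[i], max_right[i]) - bars[i] for i in range(n))

print(water_retained([3, 0, 2, 0, 4]))   # 7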

Tuesday, September 19, 2017

Today we start reviewing the slides from Stanford that introduce Natural Language Processing via Vector Semantics. This is useful to study when we wonder about the usefulness of graphs in text analysis. It's true that a vector of features is like a row with columns. Tabular data is easy to work with using conventional database and data mining techniques. Moreover, representing a problem in vector space means we can utilize all the matrix analysis techniques that we have known for a long while. In fact, new improvements may be possible as we rediscover the suitability of more and more techniques from this domain. For example, matrix factorization, eigenvalues and eigenvectors, and Gaussian and Laplacian methods come straight out of mathematical textbooks and continue to serve as techniques to try with NLP. Matrix mathematics is also useful to describe a problem succinctly, with notation for an entire matrix. Similarly, vector space helps visualize problems in Cartesian co-ordinates with a relative or absolute point of reference. This is another form of analysis that we have long understood and continue to find useful for explaining ideas. Vector space gives us a way to describe the problem in a space where we can describe and visualize magnitude and direction. Together with matrices, vectors allow us to see transformations.
Graphs came about as an evolution from disconnected and linearly related data. When we establish linear dependencies, we immediately see dependencies that exist beyond just two instances in a linear model, with contributions from neighbors other than those two. Graphs also gave us methods such as traversal, centrality and PageRank that introduced ways to visualize and solve problems. People who work with graph databases become so involved with seeing relationships described as edges between neighbors that they start questioning what purpose relational databases have and why there are even joins between tables. They see graphs as liberating for storing and analyzing relationships. Graphs can now even be worked with in batch NoSQL mode, which lets us scale our operations like never before. In fact, graphs take a long time to create, but once they are created, they can serve useful analysis in a fraction of the time that was spent creating them. Graphs are also supported out of the box by graph databases and work very well with their techniques. Many analysis packages also make it popular to work with graphs.
Vectors can be transformed into graphs based on the similarity between them. Depending on the strength of similarity, we may draw edges between the nodes that represent the vectors. This gives us a way to determine how important a vector is among its neighbors - a very useful notion if we have to select a few vectors from many. Still, the selection can also work by representing vectors as tabular rows and using a classifier that issues tags. So while these different techniques may perform similar tasks, they are not mutually exclusive and can even work with one another. We will start reading on how vector representation improves the notion of what the nodes or entities are.
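A small sketch of that transformation, assuming a handful of word vectors and an arbitrary similarity threshold; the degree of each node then gives a crude measure of how important a vector is among its neighbors:

import numpy as np

def similarity_graph(vectors, threshold=0.8):
    # draw an edge between two vectors when their cosine similarity
    # exceeds the threshold
    names = list(vectors)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            u, v = vectors[a], vectors[b]
            sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
            if sim > threshold:
                edges.append((a, b))
    # degree centrality: how many neighbors each node has
    degree = {n: sum(n in e for e in edges) for n in names}
    return edges, degree

vectors = {"wine": np.array([8., 1., 5.]),
           "beer": np.array([7., 2., 4.]),
           "bread": np.array([1., 9., 0.])}
print(similarity_graph(vectors))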
#codingexercise
Find the largest rectangular area in a histogram.
Each bar of the histogram may serve as the start of a rectangle. Its top left corner acts as the top left of a rectangle whose width grows over the bars to the right for as long as they allow it, with the height limited by the shortest bar encountered so far. Therefore, for each bar in the histogram, we extend such a rectangle to the right and take the maximum enclosed area over all the rectangles formed.
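A quadratic sketch of that idea, extending a rectangle to the right from each bar while tracking the minimum height seen (a stack-based linear-time version also exists):

def largest_rectangle(heights):
    best = 0
    for i in range(len(heights)):
        # rectangle starting at bar i; its height is limited by the
        # shortest bar encountered as the width grows to the right
        min_height = heights[i]
        for j in range(i, len(heights)):
            min_height = min(min_height, heights[j])
            best = max(best, min_height * (j - i + 1))
    return best

print(largest_rectangle([2, 1, 5, 6, 2, 3]))   # 10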

Monday, September 18, 2017

Today we describe a method to do text summarization:
def summarize(self, text):
    from gensim.models import word2vec
    from gensim.models import KeyedVectors

    # load the test sample; Word2Vec expects each sentence as a list of tokens
    sentences = text.split('\n')
    tokenized = [s.split() for s in sentences]
    model = word2vec.Word2Vec(tokenized, size=200, min_count=1)
    # persist and reload the trained vectors through the word2vec binary format
    model.wv.save_word2vec_format('/tmp/vectors.bin', binary=True)
    word_vectors = KeyedVectors.load_word2vec_format('/tmp/vectors.bin', binary=True)

    # get a selection from the text based on centroids from k-means clustering,
    # avoiding a graph based method
    from nltk.cluster.kmeans import KMeansClusterer
    from nltk.cluster.util import cosine_distance
    NUM_CLUSTERS = 3  # user can specify this or we take a default value
    vocab = list(word_vectors.vocab)
    vectors = [word_vectors[w] for w in vocab]
    kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=cosine_distance, repeats=25)
    assigned_clusters = kclusterer.cluster(vectors, assign_clusters=True)
    # translate_to_words is assumed to turn the cluster assignments into a
    # dictionary of {word: importance score}
    pr = translate_to_words(assigned_clusters)

    # render the summary: pick the top scoring words, then score each sentence
    # by the words it shares with that set
    import operator
    sorted_pr = sorted(pr.items(), key=operator.itemgetter(1), reverse=True)
    important = [i[0] for i in sorted_pr][:10]
    scored_sentences = {}
    for sentence in sentences:
        matches = set(sentence.split()).intersection(important)
        score = 0
        for match in matches:
            score += pr[match]
        scored_sentences[sentence] = score
    # take the ten best scoring sentences and restore their original order
    reordered_sentences = [i[0] for i in sorted(scored_sentences.items(), key=operator.itemgetter(1), reverse=True)[:10]]
    ordered_sentences = [x for x in sentences if x in reordered_sentences]
    summary = '\n'.join(ordered_sentences)
    print(summary)
    return ordered_sentences


#codingexercise
Given a sequence of motions in terms of unit displacements - G for moving forward one unit, L for turning left and R for turning right - determine if a robot executing the sequence can stay within a circle.
Solution: We keep track of the current position and the orientation in terms of N, E, W and S. For each move in the sequence, a G displaces x or y (but not both) along the current orientation, while an L or R only updates the orientation. After one pass through the sequence, if the robot is back at the origin, or if it ends up facing a direction other than the one it started with, then repeating the sequence keeps it within a bounded circle.
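A sketch of the simulation and the check described above (the direction encoding is one possible choice):

def stays_in_circle(moves):
    # directions in order: north, east, south, west as (dx, dy)
    directions = [(0, 1), (1, 0), (0, -1), (-1, 0)]
    x = y = 0
    facing = 0                      # start facing north
    for m in moves:
        if m == 'G':
            dx, dy = directions[facing]
            x, y = x + dx, y + dy
        elif m == 'L':
            facing = (facing - 1) % 4
        elif m == 'R':
            facing = (facing + 1) % 4
    # bounded if back at the origin, or if the orientation changed so that
    # repeating the sequence eventually traces a closed loop
    return (x, y) == (0, 0) or facing != 0

print(stays_in_circle("GGLLGG"))   # True
print(stays_in_circle("GG"))       # False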

There is one observation for this coding exercise: we do not need to simulate the sequence an unbounded number of times; if the robot is bounded, repeating the sequence at most four times brings it back to its starting point, which is why the single-pass check above is sufficient.