Friday, September 22, 2017

We continue to review the slides from Stanford that introduce Natural Language Processing via Vector Semantics. We said that vector representation is useful and opens up new possibilities. We saw that a lookup such as a thesaurus does not help.
Stanford NLP has shown there are four kinds of vector models.
A sparse vector representation, where a word is represented in terms of its co-occurrences with other words, together with a set of weights for those co-occurrences. This weight is usually based on a metric called mutual information.
A dense vector representation that involves latent semantic analysis, a neural network or Brown clusters (named after the Brown clustering algorithm, not the Brown corpus). The dense vector representations all represent a word as a short vector of numbers, which maps the word to a point in the vector space. This is called an embedding.
Co-occurrence matrices come in many forms, such as the term-document matrix, the word-word matrix and the word-context matrix. The term-document matrix holds the count of word w in document d, so each document becomes a count vector. Similarity between words in this case merely indicates that they occur in similar documents. If we change the scope from documents to some smaller text boundary, we get a word-word matrix, and the similarity improves over that in the term-document matrix. A word-context matrix improves this further because it describes the word in terms of its context, which is closer to its meaning and brings out semantic similarity.
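The word-word matrix and its weighting can be sketched in a few lines. This is only an illustration: it assumes that "mutual information" means positive pointwise mutual information (PPMI), which is one common choice, and the function names and the window size are made up for this example.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, context word) pair falls within the window."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


def ppmi(counts):
    """Weight each pair by positive pointwise mutual information (assumed weighting)."""
    total = sum(counts.values())
    word_totals = Counter()
    context_totals = Counter()
    for (word, context), n in counts.items():
        word_totals[word] += n
        context_totals[context] += n
    weights = {}
    for (word, context), n in counts.items():
        pmi = math.log2(n * total / (word_totals[word] * context_totals[context]))
        weights[(word, context)] = max(pmi, 0.0)  # negative values are clipped to zero
    return weights


sentences = [["a", "b"], ["a", "b"], ["c", "d"]]
weights = ppmi(cooccurrence_counts(sentences, window=1))
```

Each word's sparse vector is then the row of weights it has against every context word; pairs that never co-occur are simply absent.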
Co-occurrence between two words takes two forms: first order and second order. First-order co-occurrence is syntagmatic association and second-order co-occurrence is paradigmatic association, which means the first is based on the words appearing near each other whereas the second is based on the words having similar neighbors. Note that the vectorization derives from the usage of words, which is why it has become popular. Another way to look at usage is to canonicalize the text into an Esperanto-like language where the relations and syntax are more oriented towards natural language processing. Some work has already begun with different kinds of improvements to ontologies that are not restricted to a thesaurus or WordNet but draw on resources such as FrameNet. All we need to keep in mind here is that there are layers to tackle the problem: usage, vector space and classification of vectors.

Thursday, September 21, 2017

We continue to review the slides from Stanford that introduce Natural Language Processing via Vector Semantics. We said that vector representation is useful and opens up new possibilities. We saw that a lookup such as a thesaurus does not help.
Stanford NLP has shown there are four kinds of vector models.
A sparse vector representation, where a word is represented in terms of its co-occurrences with other words, together with a set of weights for those co-occurrences. This weight is usually based on a metric called mutual information.
A dense vector representation that takes one of the following vector models:
A representation based on weights associated with other words, where the co-occurrence matrix is reduced to fewer dimensions with singular value decomposition; this is referred to as latent semantic analysis.
Neural network based models, where the weights are learned either by predicting a word from the surrounding words or by predicting the surrounding words from the current word.
A set of Brown clusters, which group words by the contexts they appear in; these are named after the Brown clustering algorithm, not the Brown corpus.
#codingexercise
Find the minimum number of squares whose sum equals a given number n
We write a few base cases, say up to n = 3, where the answer is n itself.
For n greater than that, we consider each candidate from 4 to n and initialize its number of squares to the candidate itself. Each number can be represented with the maximum number of squares when those are all unit squares.
Next, for each iterator from 1 up to the square root of the candidate, we recursively calculate the minimum number of squares for the candidate minus the square of the iterator and add one to the count. We update the minimum as we go over each iterator. All the results are memoized for easy lookup. This results in the smallest number of squares being found in the table entry for n.
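The steps above can be sketched as a bottom-up table, which is the memoized recursion unrolled. The function name is just a choice for this illustration.

```python
def min_squares(n):
    """Return the fewest perfect squares that sum to n (n >= 0)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    # table[k] starts at k, the worst case of k unit squares;
    # this also covers the base cases 0 through 3.
    table = list(range(n + 1))
    for candidate in range(4, n + 1):
        i = 1
        while i * i <= candidate:
            # use one square of side i, then the best answer for the remainder
            table[candidate] = min(table[candidate], table[candidate - i * i] + 1)
            i += 1
    return table[n]
```

For example, min_squares(12) is 3 because 12 = 4 + 4 + 4, and min_squares(13) is 2 because 13 = 9 + 4.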

Wednesday, September 20, 2017

We continue to review the slides from Stanford that introduce Natural Language Processing via Vector Semantics. We said that vector representation is useful and opens up new possibilities. For example, it helps compute the similarity between words: "fast" is similar to "rapid", and "tall" is similar to "height". This can help in question answering, as in "How tall is Mt. Everest?" answered by "The height of Mt. Everest is 29029 feet." Similarity of words also helps with detecting plagiarism. If two narratives have a few words changed here and there, the similarity of the words should be high because they share the same context. When a number of word vectors are similar, the overall narrative is likely plagiarized.
Word vectors are also useful when the semantics of a word change over time. Words hold their meaning only in the context of the surrounding words. If their usage changes over time, their meaning also changes. Consequently, word similarity may change based on context. The problem with using a thesaurus in this case is that a thesaurus does not exist for every year to determine the similarity between words that mean something today and meant something else yesterday. Moreover, a thesaurus, unlike a dictionary, does not contain all words and phrases, particularly verbs and adjectives.
Therefore, instead of looking up an ontology, we now refer to a distributional model for the meaning of a word, which relies on the context surrounding the given word. A synonym is therefore a choice of words that share the same context and usage. In fact, we interpret the meanings of unknown words by looking at the surrounding words and their context.
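A small sketch of how similarity falls out of a distributional model: each word is a vector of context counts and similarity is the cosine of the angle between two vectors. The counts below are invented purely for illustration and do not come from any corpus.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical context counts against four context words.
fast = [4, 3, 0, 1]
rapid = [3, 4, 0, 0]
tall = [0, 1, 5, 4]

print(cosine_similarity(fast, rapid) > cosine_similarity(fast, tall))
```

Words that share contexts point in similar directions, so "fast" and "rapid" score higher with each other than either does with "tall" in this toy data.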
Stanford NLP has shown there are four kinds of vector models.
A sparse vector representation, where a word is represented in terms of its co-occurrences with other words, together with a set of weights for those co-occurrences. This weight is usually based on a metric called mutual information.
A dense vector representation that takes one of the following vector models:
A representation based on weights associated with other words, where the co-occurrence matrix is reduced to fewer dimensions with singular value decomposition; this is referred to as latent semantic analysis.
Neural network based models, where the weights are learned either by predicting a word from the surrounding words or by predicting the surrounding words from the current word.
A set of Brown clusters, which group words by the contexts they appear in; these are named after the Brown clustering algorithm, not the Brown corpus.
#codingexercise
Find the maximum water retention in a bar chart
Water is retained over a bar of unit width, between the bars on its left and right, up to a depth equal to the difference between the smaller of the tallest bar on the left and the tallest bar on the right, and the height of the current bar. Therefore for each bar we can find the max on the left and on the right and calculate the water retained as above. We then accumulate the water retained for each bar along the range of bars. Since we need to find the max on the left and on the right for each bar, we can do this in two separate passes over all the bars.
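The two passes described above can be sketched as follows. The function name is only for this illustration, and the running maximum at each position includes the bar itself so that the depth is never negative.

```python
def trapped_water(heights):
    """Total water retained over a bar chart with unit-width bars."""
    n = len(heights)
    if n == 0:
        return 0
    # first pass: tallest bar seen so far from the left, including the current bar
    left_max = [0] * n
    left_max[0] = heights[0]
    for i in range(1, n):
        left_max[i] = max(left_max[i - 1], heights[i])
    # second pass: tallest bar seen so far from the right, including the current bar
    right_max = [0] * n
    right_max[-1] = heights[-1]
    for i in range(n - 2, -1, -1):
        right_max[i] = max(right_max[i + 1], heights[i])
    # water over each bar is bounded by the lower of the two walls
    return sum(min(l, r) - h for l, r, h in zip(left_max, right_max, heights))
```

For the bars [3, 0, 2] the middle bar holds 2 units, bounded by the shorter wall on the right.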

Tuesday, September 19, 2017

Today we start reviewing the slides from Stanford that introduce Natural Language Processing via Vector Semantics. This is useful to study when we wonder about the usefulness of graphs in text analysis. It's true that a vector of features is like a row with columns. Tabular data is easy to work with using conventional database and data mining techniques. Moreover, representing a problem in vector space means we can utilize all the matrix analysis techniques that we have known for a long while. In fact, new improvements may be possible as we rediscover the suitability of more and more techniques from this domain. For example, matrix factorization, eigenvalues and eigenvectors, Gaussians and Laplacians are straight out of mathematical textbooks and continue to serve as techniques to try with NLP. Matrix mathematics is also useful to succinctly describe a problem, with its notation for an entire matrix. Similarly, vector space helps visualize problems in Cartesian coordinates with a relative or absolute point of reference. This is another form of analysis that we have long understood and continue to find useful for explaining ideas. Vector space gives us a way to describe the problem in a space where we can describe and visualize magnitude and direction. Together with matrices, vectors allow us to see transformations.
Graphs came about as an evolution from disconnected and linearly related data. When we establish linear dependencies, we immediately see dependencies that exist beyond just two instances in a linear model, and contributions from neighbors other than those two. Graphs also gave us methods such as traversal, centrality and PageRank that introduced us to new ways to visualize and solve problems. People who work with graph databases become so involved with seeing relationships described as edges between neighbors that they start questioning what purpose relational databases have and why there are even joins between tables. They see graphs as liberating for storing and analyzing relationships. Graphs can now even be worked with in batch NoSQL mode, which lets us scale our operations like never before. In fact, graphs take a long time to create, but once they are created they can serve useful analysis in a fraction of the time that was spent on creating them. Graphs are also supported out of the box by graph databases and work very well with their techniques. Many analysis packages also make it popular to work with graphs.
Vectors can be transformed into graphs based on the similarity between vectors. Depending on the strength of similarity, we may draw edges between the nodes that represent the vectors. This gives us a way to determine how important a vector is among its neighbors, a very useful concept if we have to select a few vectors from the many. Still, the selection can also work by representing vectors as tabular rows and using a classifier that can assign tags. So while these different techniques may perform similar tasks, they are not mutually exclusive and can even work with one another. We will start reading on how vector representation improves the notion of what the nodes or entities are.
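As a rough sketch of turning vectors into a graph, we can connect two vectors whenever their cosine similarity crosses a threshold and then rank nodes by how many neighbors they have. The threshold value, the function names and the use of plain degree as the importance measure are assumptions for this illustration; PageRank or another centrality could be used in its place.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_graph(vectors, threshold=0.8):
    """Build an undirected graph with an edge between every sufficiently similar pair."""
    graph = {name: set() for name in vectors}
    for a, b in combinations(vectors, 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def rank_by_degree(graph):
    """Order nodes from most to least connected."""
    return sorted(graph, key=lambda node: len(graph[node]), reverse=True)


# Hypothetical two-dimensional vectors, only to exercise the functions.
vectors = {"a": [1, 0], "b": [1, 0.1], "c": [0.9, 0.2], "d": [0, 1]}
graph = similarity_graph(vectors, threshold=0.95)
print(rank_by_degree(graph))
```

Here "a", "b" and "c" point in nearly the same direction and end up fully connected, while "d" stays isolated and ranks last.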
#codingexercise
Find the largest rectangular area in a histogram.
Each bar of the histogram may serve as the start of a rectangle. So its top left corner may act as the top left of a rectangle whose width grows over the bars to its right, with the height limited by the shortest bar covered so far. Therefore for each bar in the histogram, we try to draw rectangles starting at that bar and take the maximum enclosed area over all the rectangles formed.
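One way to read the approach above is as a brute-force scan: start at each bar, widen to the right, and let the height shrink to the shortest bar covered. The sketch below follows that reading and runs in quadratic time; the function name is made up for this illustration, and a stack-based method can do the same in linear time.

```python
def largest_rectangle(heights):
    """Largest rectangular area under a histogram with unit-width bars."""
    best = 0
    for start in range(len(heights)):
        min_height = heights[start]
        for end in range(start, len(heights)):
            # the rectangle can only be as tall as the shortest bar it spans
            min_height = min(min_height, heights[end])
            if min_height == 0:
                break  # nothing wider from this start can have any area
            best = max(best, min_height * (end - start + 1))
    return best
```

For the bars [2, 1, 5, 6, 2, 3] the best rectangle spans the bars of height 5 and 6 at height 5, for an area of 10.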

Monday, September 18, 2017

Today we describe a method to do text summarization:
def summarize(self, text):
    import operator
    from gensim.models import KeyedVectors, word2vec
    from nltk.cluster.kmeans import KMeansClusterer
    from nltk.cluster.util import cosine_distance

    # load the test sample: one sentence per line, tokenized on whitespace
    sentences = text.split('\n')
    tokenized = [sentence.split() for sentence in sentences]
    # gensim 4.x parameter names; older releases use size= instead of vector_size=
    # min_count=1 keeps every word, since a short text has few repeats
    model = word2vec.Word2Vec(tokenized, vector_size=200, min_count=1)
    model.wv.save_word2vec_format('/tmp/vectors.bin', binary=True)
    wv = KeyedVectors.load_word2vec_format('/tmp/vectors.bin', binary=True)

    # get selection from text based on centroids selected from k-means clustering, avoiding a graph
    NUM_CLUSTERS = 3  # user can specify this or we take a default value
    vectors = [wv[word] for word in wv.index_to_key]
    kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=cosine_distance, repeats=25)
    assigned_clusters = kclusterer.cluster(vectors, assign_clusters=True)
    # translate_to_words is a helper that is not shown in this post; it is assumed
    # to return a dict that maps each word to an importance score
    pr = translate_to_words(assigned_clusters)

    # render the summary
    sorted_pr = sorted(pr.items(), key=operator.itemgetter(1), reverse=True)
    important = [word for word, _ in sorted_pr[:10]]
    scored_sentences = {}
    for sentence in sentences:
        matches = set(sentence.split()).intersection(important)
        score = 0
        for match in matches:
            score += pr[match]
        scored_sentences[sentence] = score
    reordered_sentences = [i[0] for i in sorted(scored_sentences.items(), key=operator.itemgetter(1), reverse=True)[:10]]
    # keep the selected sentences in their original order
    ordered_sentences = [x for x in sentences if x in reordered_sentences]
    summary = '\n'.join(ordered_sentences)
    print(summary)
    return ordered_sentences


#codingexercise
Given a sequence of motions in terms of unit displacements, with G for moving forward, L for moving to the left and R for moving to the right, determine if a robot executing the sequence can stay within a circle
Solution: We keep track of the current position and the orientation in terms of N, E, W, and S. For each movement in the sequence, we update the current coordinate by displacing x or y, but not both, and we update the orientation on a turn. If the ending position comes back to the origin, we know the robot stays within a circle.

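There is one observation for this coding exercise. If the sequence is repeated, each pass adds the same displacement rotated by the net turn of the pass, which is a multiple of 90 degrees, so the robot either returns to the origin within four passes or drifts away without bound.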
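A minimal simulation of the tracking described in the solution is sketched below. It assumes that L and R are 90 degree turns in place and that only G changes the position; the function name and the passes parameter, which repeats the sequence, are additions for this illustration.

```python
def returns_to_origin(moves, passes=1):
    """Simulate the moves and report whether the robot is back at the origin
    at the end of any of the given number of passes."""
    # headings in counter-clockwise order: N, W, S, E
    dx = [0, -1, 0, 1]
    dy = [1, 0, -1, 0]
    x = y = 0
    heading = 0  # start facing north
    for _ in range(passes):
        for move in moves:
            if move == 'G':
                x += dx[heading]
                y += dy[heading]
            elif move == 'L':
                heading = (heading + 1) % 4
            elif move == 'R':
                heading = (heading - 1) % 4
            else:
                raise ValueError("unknown move: " + move)
        if (x, y) == (0, 0):
            return True
    return False
```

For example, "GLGLGLG" traces a unit square and ends where it started, while "GG" only moves away from the origin.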

Sunday, September 17, 2017

Today we continue reviewing U-SQL. It unifies the benefits of SQL with the expressive power of your own code. This is said to work very well with all kinds of data stores: file, object and relational. U-SQL works on the Azure ecosystem, which involves Azure Data Lake storage as the foundation and the analytics layer over it. The benefit of the Azure storage is that it spans several kinds of data formats and stores.
The presentation for U-SQL explains three scenarios: a cognitive example, a text analysis example and a JSON processing example.
The cognitive example identifies objects in images. This kind of example shows how the entire image processing on image files can be considered custom logic and used with the query language. As long as we define the objects, the input and the logic to analyze the objects, it can be made part of the query to extract the desired output dataset.
The text analysis example is similar, in that we can extract the text prior to performing the analysis. It's interesting to note that the classifier used to tag the text can be written in the R language and is not dependent on the query. Different outputters also produce different output.
JSON processing is another example cited by the presentation, probably because extract-transform-load has become important in analytical processing, whether it is for a cloud data warehouse or big data operations. This "schema later" approach is popular because it decouples producers and consumers, which saves coordination and time-consuming operations between, say, departments. While some applications with query languages such as SnowSQL import the JSON into a columnar table and then execute a query, or declare their own syntax to flatten the JSON, the approach taken by U-SQL is more general purpose, with its extractors, select and outputters that are either built in or customized by the user.
Courtesy U-SQL slide shares
My updates on query improvements : https://1drv.ms/w/s!Ashlm-Nw-wnWsFqBcG-mBhjPLbC8
#codingexercise
Recursive function for a palindrome:
 If string is empty or one character return true
 If string.first() == string.last() return the recursive result on the string with first and last stripped
 If string.first() != string.last() return false
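The three cases above translate directly into a short recursive function. The function name is only a choice for this sketch.

```python
def is_palindrome(s):
    """Recursively check whether s reads the same forwards and backwards."""
    if len(s) <= 1:
        return True  # empty or single character
    if s[0] != s[-1]:
        return False  # boundary characters differ
    return is_palindrome(s[1:-1])  # strip first and last, then recurse
```

For example, is_palindrome("racecar") returns True and is_palindrome("ab") returns False.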

Also an update on classifiers : https://1drv.ms/w/s!Ashlm-Nw-wnWsFzHx5Hrcl633js_ 

Saturday, September 16, 2017

Today we continue reviewing U-SQL. It unifies the benefits of SQL with the expressive power of your own code. This is said to work very well with all kinds of data stores: file, object and relational. U-SQL works on the Azure ecosystem, which involves Azure Data Lake storage as the foundation and the analytics layer over it. The benefit of the Azure storage is that it spans several kinds of data formats and stores.
One of the improvements in this language design is the consideration for single-node versus parallel versus distributed computing. Queries often have to manage parallelism, synchronization and transactions. The language not only has to allow implicit handling by the system but also has to offer explicit constructs to the users. Moreover, execution is no longer just scale-up but also scale-out, and therefore libraries as well as the language need to handle parallelism.
The data processing language is independent of the scale of data, but the data is a part of the language model. Programming languages treat data as something in a store and tie the data and the logic together. This data processing language allows data to change and evolve independently of the application.
U-SQL provides all this for the user with custom operator extensions called UDOs, which are scaled out. These include user-defined extractors, outputters, processors, appliers, combiners and reducers. The scale-out can also be explicitly requested with hint keywords.
UDOs can be written in any .NET language and they can be deployed in the service as an assembly after registering them with a U-SQL script. Therefore UDOs, like SQLCLR, can invoke managed code and other runtimes like Python and R, all with the option to scale out. UDOs cannot interact with one another and are isolated in the scope that they are registered with. The U-SQL script allows these UDOs residing in assemblies to be invoked with the different data processing options such as extract, reduce etc.
One simple example of using UDOs, for the text summarization we talked about earlier with a trimpy Python extension, could look like the following query, which is simplified and only for illustration:
@text = EXTRACT text string
              FROM @"filename"
              USING new Trimpy.Extractor();
@summary = SELECT Trimpy.Summarize(text) AS summary
                       FROM @text;
OUTPUT @summary
           TO "/summary.txt"
           USING Outputters.Text();
This is simple but tasks like text classification or prediction or data mining can also be called via U-SQL.
                               


Courtesy U-SQL slide shares

My take on query improvements : https://1drv.ms/w/s!Ashlm-Nw-wnWsFqBcG-mBhjPLbC8

#codingexercise
Count all palindromic subsequences of a string
We can use a recursive solution to count this as we shrink the string.
If the boundary characters match, we add the counts for the following two ranges
first from start to end - 1
second from start + 1 to end
plus 1 for the match with the current boundary
Otherwise we count the same two ranges again and subtract the count for the range starting at start + 1 and ending at end - 1, because it would have been included twice, once in each range.
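The recurrence above can be sketched with memoization so that each range is counted only once. This counts subsequences by position, so equal strings picked from different positions are counted separately; the function name is made up for this illustration.

```python
from functools import lru_cache


def count_palindromic_subsequences(s):
    """Count non-empty palindromic subsequences of s, distinguished by position."""

    @lru_cache(maxsize=None)
    def count(start, end):
        if start > end:
            return 0
        if start == end:
            return 1  # a single character is a palindrome
        if s[start] == s[end]:
            # both shorter ranges, plus 1 for the pair formed by the boundary itself
            return count(start, end - 1) + count(start + 1, end) + 1
        # the inner range is included in both shorter ranges, so remove one copy
        return count(start, end - 1) + count(start + 1, end) - count(start + 1, end - 1)

    return count(0, len(s) - 1)
```

For "aaa" every non-empty subsequence is a palindrome, so the count is 7; for "abcd" only the four single characters qualify.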

This same logic holds true for substrings if the subsequences can be confirmed to exist in the string.