Tuesday, September 19, 2017

Today we start reviewing the slides from Stanford that introduce Natural Language Processing via vector semantics. This is useful to study when we wonder about the usefulness of graphs in text analysis. It is true that a vector of features is like a row with columns. Tabular data is easy to work with using conventional database and data mining techniques. Moreover, representing a problem in vector space means we can utilize all the matrix analysis techniques that we have known for a long while. In fact, new improvements may be possible as we rediscover the suitability of more and more techniques from this domain. For example, matrix factorization, eigenvalues and eigenvectors, and Gaussian and Laplacian methods come straight out of mathematical textbooks and continue to serve as techniques to try with NLP. Matrix mathematics is also useful for describing a problem succinctly, since its notation can stand for an entire matrix. Similarly, vector space helps visualize problems in Cartesian coordinates with a relative or absolute point of reference. This is another form of analysis that we have long understood and continue to find useful for explaining ideas. Vector space gives us a way to describe a problem in terms of magnitude and direction. Together with matrices, vectors allow us to see transformations.
Graphs came about as an evolution from disconnected and linearly related data. When we move beyond linear dependencies, we immediately see relationships that extend past any two instances in a linear model, with contributions from neighbors other than those two. Graphs also gave us methods such as traversal, centrality, and PageRank that introduced new ways to visualize and solve problems. People who work with graph databases become so accustomed to seeing relationships described as edges between neighbors that they start questioning what purpose relational databases serve and why there are even joins between tables. They see graphs as liberating for storing and analyzing relationships. Graphs can now even be worked with in batch NoSQL mode, which lets us scale our operations like never before. In fact, graphs take a long time to create, but once created, they can serve useful analysis in a fraction of the time that was spent creating them. Graphs are also supported out of the box by graph databases and work very well with their techniques. Many analysis packages also make it popular to work with graphs.
Vectors can be transformed into graphs based on the similarity between them. Depending on the strength of the similarity, we may draw edges between the nodes that represent the vectors. This gives us a way to determine how important a vector is among its neighbors - a very useful concept if we have to select a few vectors from many. Still, the selection can also work by representing vectors as tabular rows and using a classifier that issues tags. So while these different techniques may perform similar tasks, they are not mutually exclusive and can even work with one another. We will start reading on how vector representation improves the notion of what the nodes or entities are.
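As a minimal sketch of that transformation, the following hypothetical helper draws an edge between two vectors whenever their cosine similarity exceeds a threshold; the choice of cosine similarity and the 0.7 threshold are assumptions for illustration, not prescriptions from the slides.

import numpy as np

def similarity_graph(vectors, threshold=0.7):
    # normalize once so dot products become cosine similarities
    normalized = [v / np.linalg.norm(v) for v in vectors]
    edges = []
    for i in range(len(normalized)):
        for j in range(i + 1, len(normalized)):
            # draw an edge when the pair is similar enough
            if np.dot(normalized[i], normalized[j]) > threshold:
                edges.append((i, j))
    return edges

The node degrees in the resulting edge list already hint at which vectors are important among their neighbors.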
#codingexercise
Find the largest rectangular area in a histogram.
Each bar of the histogram may serve as the start of a rectangle. Its top left corner may act as the top left of a rectangle whose width increments over the following bars for as long as they allow it. Therefore, for each bar in the histogram, we try to draw such a rectangle starting at its top left and take the maximum enclosed area over all the rectangles formed.
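A minimal sketch of this quadratic approach follows; as the rectangle widens, its height shrinks to the shortest bar seen so far, so that it stays inside the histogram.

def largest_rectangle_area(heights):
    best = 0
    for start in range(len(heights)):
        # widen the rectangle to the right, lowering it to the shortest bar in the span
        min_height = heights[start]
        for end in range(start, len(heights)):
            min_height = min(min_height, heights[end])
            best = max(best, min_height * (end - start + 1))
    return best

A stack-based scan can reduce this to linear time, but the quadratic version mirrors the description above.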

Monday, September 18, 2017

Today we describe a method to do text summarization:
def summarize(self, text):
    from gensim.models import word2vec

    # load the test sample and tokenize each sentence into words
    raw_sentences = text.split('\n')
    sentences = [sentence.split() for sentence in raw_sentences]
    model = word2vec.Word2Vec(sentences, size=200)
    # round-trip through disk (API of the older gensim releases current at the time)
    model.save_word2vec_format('/tmp/vectors.bin', binary=True)
    model = word2vec.Word2Vec.load_word2vec_format('/tmp/vectors.bin', binary=True)

    # get a selection from the text based on centroids from k-means clustering, avoiding a graph
    from nltk.cluster.kmeans import KMeansClusterer
    from nltk.cluster.util import cosine_distance
    NUM_CLUSTERS = 3  # user can specify this or we take a default value
    words = list(model.vocab.keys())
    vectors = [model[word] for word in words]
    kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=cosine_distance, repeats=25)
    assigned_clusters = kclusterer.cluster(vectors, assign_clusters=True)
    # assumed helper: maps the cluster assignments back to a word -> score dictionary
    pr = translate_to_words(assigned_clusters)

    # render the summary from the sentences that contain the highest-scored words
    import operator
    sorted_pr = sorted(pr.items(), key=operator.itemgetter(1), reverse=True)
    important = [i[0] for i in sorted_pr][:10]
    scored_sentences = {}
    for sentence in raw_sentences:
        matches = set(sentence.split()).intersection(important)
        score = 0
        for match in matches:
            score += pr[match]
        scored_sentences[sentence] = score
    reordered_sentences = [i[0] for i in sorted(scored_sentences.items(), key=operator.itemgetter(1), reverse=True)[:10]]
    ordered_sentences = [x for x in raw_sentences if x in reordered_sentences]
    summary = '\n'.join(ordered_sentences)
    print(summary)
    return ordered_sentences


#codingexercise
Given a sequence of motions in unit displacements - G for moving forward one unit, L for turning left, and R for turning right - determine if a robot executing the sequence can stay within a circle.
Solution: We keep track of the current position and the orientation in terms of N, E, W, and S. For each movement in the sequence, we update the current coordinate by displacing x or y but not both, or we update the orientation. If, after one pass of the sequence, the robot is back at the origin or is no longer facing its starting direction, it stays within a circle.

There is one observation for this coding exercise: we do not need to simulate indefinitely. If the orientation has changed after one pass, repeating the sequence at most four times brings the robot back to its starting position, so a single pass of the sequence is enough to decide.
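A minimal sketch of this check (the helper name is hypothetical):

def robot_is_bounded(sequence):
    # orientations N, E, S, W as unit displacement vectors
    directions = [(0, 1), (1, 0), (0, -1), (-1, 0)]
    x = y = 0
    facing = 0  # start facing north
    for move in sequence:
        if move == 'G':
            x += directions[facing][0]
            y += directions[facing][1]
        elif move == 'L':
            facing = (facing - 1) % 4
        else:  # 'R'
            facing = (facing + 1) % 4
    # bounded if we returned to the origin or the orientation changed
    return (x, y) == (0, 0) or facing != 0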

Sunday, September 17, 2017

Today we continue reviewing U-SQL. It unifies the benefits of SQL with the expressive power of your own code. This is said to work very well with all kinds of data stores – file, object, and relational. U-SQL works on the Azure ecosystem, which involves the Azure Data Lake storage as the foundation and the analytics layer over it. The benefit of the Azure storage is that it spans several kinds of data formats and stores.
The presentation for U-SQL explains three scenarios - a cognitive example, a text analysis example, and JSON processing.
The cognitive example identifies objects in images. This kind of example shows how the entire image processing over image files can be considered custom logic and used within the query language. As long as we define the objects, the input, and the logic to analyze the objects, it can be made part of the query to extract the desired output dataset.
The text analysis example is similar in that we extract the text prior to performing the analysis. It is interesting to note that the classifier used to tag the text can be written in the R language and is not dependent on the query. Different outputters can then render the results in different formats.
JSON processing is another example cited by the presentation, probably because extract-transform-load has become important in analytical processing, whether in a cloud data warehouse or in big data operations. This "schema later" approach is popular because it decouples producers and consumers, which saves coordination and time-consuming operations between, say, departments. While some applications with query languages such as SnowSQL import the JSON into a columnar table and then execute a query, or declare their own syntax to flatten the JSON, the approach taken by U-SQL is more general purpose with its extractors, SELECT, and outputters that are either built-in or customizations that the user can make.
Courtesy U-SQL slide shares
My updates on query improvements : https://1drv.ms/w/s!Ashlm-Nw-wnWsFqBcG-mBhjPLbC8
#codingexercise
Recursive function for a palindrome:
If the string is empty or has one character, return true
If string.first() == string.last(), recurse after stripping the first and last characters
If string.first() != string.last(), return false
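A minimal Python sketch of this recursion:

def is_palindrome(s):
    # an empty or single-character string is a palindrome
    if len(s) <= 1:
        return True
    if s[0] != s[-1]:
        return False
    # strip the matching first and last characters and recurse
    return is_palindrome(s[1:-1])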

Also an update on classifiers : https://1drv.ms/w/s!Ashlm-Nw-wnWsFzHx5Hrcl633js_ 

Saturday, September 16, 2017

Today we continue reviewing U-SQL. It unifies the benefits of SQL with the expressive power of your own code. This is said to work very well with all kinds of data stores – file, object, and relational. U-SQL works on the Azure ecosystem, which involves the Azure Data Lake storage as the foundation and the analytics layer over it. The benefit of the Azure storage is that it spans several kinds of data formats and stores.
One of the improvements in this language design is the consideration for single-node versus parallel versus distributed computing. Queries often have to manage parallelism, synchronization, and transactions. But the language not only has to allow implicit handling by the system; it must also provide explicit constructs for the users. Moreover, execution is no longer just scale-up but also scale-out, and therefore the libraries as well as the language need to handle parallelism.
The data processing language is independent of the scale of data, but the data is a part of the language model. Programming languages treat data as something in a store and tie the data and the logic together. This data processing language allows data to change and evolve independently of the application.
U-SQL provides all this for the user with custom operator extensions called UDOs, which are scaled out. These include user-defined extractors, outputters, processors, appliers, combiners, and reducers. The scale-out can also be explicitly requested with hint keywords.
UDOs can be written in any .NET language, and they can be deployed in the service as an assembly after registering them with a U-SQL script. Therefore UDOs, like SQLCLR, can invoke managed code and other runtimes like Python and R, all with the option to scale out. UDOs cannot interact with one another and are isolated in the scope in which they are registered. The U-SQL script allows these UDOs residing in assemblies to be invoked with the different data processing options such as extract, reduce, and so on.
One simple example uses UDOs for the text summarization we talked about earlier with the Trimpy Python extension. A simpler, illustration-only query would look like the following:
@text = EXTRACT text string
              FROM @"filename"
              USING new Trimpy.Extractor();
@summary = SELECT Trimpy.Summarize(text) AS summary
                       FROM @text;
OUTPUT @summary
           TO "/summary.txt"
           USING Outputters.Text();
This is simple, but tasks like text classification, prediction, or data mining can also be called via U-SQL.
                               


Courtesy U-SQL slide shares

My take on query improvements : https://1drv.ms/w/s!Ashlm-Nw-wnWsFqBcG-mBhjPLbC8

#codingexercise
Count all palindromic subsequences of a string
We can use a recursive solution to count this as we shrink the string.
If the boundary characters match, we count the following two subsequences:
first, from start to end - 1
second, from start + 1 to end
plus 1 for the match at the current boundary.
Otherwise, we count the same two ranges and subtract the count for the range starting at start + 1 and ending at end - 1, because it is included once in each of the two counts.

This same logic holds true for substrings if the counted subsequences can be confirmed to be contiguous in the string.
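A minimal recursive sketch of the recurrence above (without the memoization that would make it efficient):

def count_palindromic_subsequences(s, start, end):
    if start > end:
        return 0
    if start == end:
        return 1  # a single character is a palindrome
    if s[start] == s[end]:
        # the two overlapping ranges, plus one for the match at the boundary
        return (count_palindromic_subsequences(s, start, end - 1)
                + count_palindromic_subsequences(s, start + 1, end) + 1)
    # subtract the inner range counted once in each of the two ranges
    return (count_palindromic_subsequences(s, start, end - 1)
            + count_palindromic_subsequences(s, start + 1, end)
            - count_palindromic_subsequences(s, start + 1, end - 1))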

Friday, September 15, 2017

Today we continue reviewing U-SQL. It unifies the benefits of SQL with the expressive power of your own code. This is said to work very well with all kinds of data stores – file, object, and relational. U-SQL works on the Azure ecosystem, which involves the Azure Data Lake storage as the foundation and the analytics layer over it. The benefit of the Azure storage is that it spans several kinds of data formats and stores.

U-SQL, like T-SQL, provides important benefits as a query language. First and foremost, there is consistency and familiarity in its usage. The learning curve and the onboarding from T-SQL to U-SQL are not very steep. Moreover, there is a lot of thought behind the syntax. It is context independent and defined in a data processing language. It is also composable. It is important for the query language to be precise and accurate while at the same time convenient for the user. There is a writability versus readability separation of concerns. All of these are important considerations in U-SQL. In addition to the syntax, the semantics also matter. U-SQL tries to avoid surprises and complexities. Moreover, as a language it is composable for the user and optimizable for the system.
Courtesy U-SQL slide shares

My take on query improvements : https://1drv.ms/w/s!Ashlm-Nw-wnWsFqBcG-mBhjPLbC8

#codingexercise
Find the smallest window in a containing string that includes all N letters of a search string. We maintain a hashtable for the N letters of the search string, with a linked list of positions of occurrences for each letter. As we scan the containing string, we insert each index into the linked list against that letter in the hash table.
Next, we take each recorded position as a candidate window start. For every other letter in the hash table, we find the position of that letter that is closest to and at or after the candidate's position, and we update the maximum such position found so far. With this maximum minus the candidate's position as the offset, we can form a (start, offset) tuple for every candidate. The tuple that gives the smallest offset is the answer to return.
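A minimal sketch of this approach, with Python lists standing in for the linked lists and bisection standing in for the closest-position lookup:

from bisect import bisect_left

def smallest_window(text, letters):
    # positions[c] records, in scan order, the indices where c occurs in text
    positions = {c: [] for c in set(letters)}
    for index, ch in enumerate(text):
        if ch in positions:
            positions[ch].append(index)
    if any(not plist for plist in positions.values()):
        return None  # some letter of the search string never occurs
    best = None
    # try every recorded occurrence as a candidate window start
    for plist in positions.values():
        for start in plist:
            end = start
            for olist in positions.values():
                # closest occurrence of this letter at or after the candidate start
                j = bisect_left(olist, start)
                if j == len(olist):
                    end = None  # this letter never occurs after the candidate
                    break
                end = max(end, olist[j])
            if end is not None and (best is None or end - start < best[1] - best[0]):
                best = (start, end)
    return text[best[0]:best[1] + 1] if best else None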

Thursday, September 14, 2017

Today we continue reviewing U-SQL. It unifies the benefits of SQL with the expressive power of your own code. This is said to work very well with all kinds of data stores – file, object, and relational. U-SQL works on the Azure ecosystem, which involves the Azure Data Lake storage as the foundation and the analytics layer over it. The benefit of the Azure storage is that it spans several kinds of data formats and stores.
U-SQL is a data processing language. Like SQL, it has declarative language and procedural control flow. The difference between an imperative and a declarative language is that imperative means we tell the system how to get the data, while declarative means we tell the system what we want and the system finds a way to get it; the optimizer decides how to do it best. The difference between procedural and functional is that procedural means operating by changing persistent state with each expression, while functional means transforming input into output with no persistent state. Single objects require control flow and may require parallelism to be made explicit. With a data model, we can use higher-level data abstractions. There is scale-up versus scale-out. Programming languages require data in a store and do not make it part of the language model: data and logic go together, imperative and procedural. Data processing languages, on the other hand, make data part of the language model. They can be both declarative and functional. The data can evolve independently of the application, and the optimizer decides the parallelism and synchronization.
U-SQL, like T-SQL, provides important benefits as a query language. First and foremost, there is consistency and familiarity in its usage. The learning curve and the onboarding from T-SQL to U-SQL are not very steep. Moreover, there is a lot of thought behind the syntax. It is context independent and defined in a data processing language. It is also composable. It is important for the query language to be precise and accurate while at the same time convenient for the user. There is a writability versus readability separation of concerns. All of these are important considerations in U-SQL. In addition to the syntax, the semantics also matter. U-SQL tries to avoid surprises and complexities. Moreover, as a language it is composable for the user and optimizable for the system.
Courtesy U-SQL slide shares

My take on query improvements : https://1drv.ms/w/s!Ashlm-Nw-wnWsFqBcG-mBhjPLbC8

#codingexercise
Given n friends, each one can remain single or can be paired up with some other friend. Each friend can be paired only once, so ordering is irrelevant. Find the total number of ways in which the friends can be paired.
Let us define a recursive method to do this which takes a parameter n.
When n is at most two, the number of ways returned is n.
If the number of friends is greater than two, then the pairings can be formed by leaving one friend single, which contributes the count for n-1 friends. In addition, that friend can pair with any of the other n-1 friends, each such pairing combined with the count for the remaining n-2 friends.
F(n) = F(n-1) + (n-1) F(n-2)
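A minimal recursive sketch of this recurrence:

def count_pairings(n):
    # with 0, 1 or 2 friends there are n ways (for n = 2: both single, or one pair)
    if n <= 2:
        return n
    # friend n stays single, or pairs with one of the other n - 1 friends
    return count_pairings(n - 1) + (n - 1) * count_pairings(n - 2)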

Wednesday, September 13, 2017

Today we continue reviewing U-SQL. It unifies the benefits of SQL with the expressive power of your own code. This is said to work very well with all kinds of data stores – file, object, and relational. U-SQL works on the Azure ecosystem, which involves the Azure Data Lake storage as the foundation and the analytics layer over it. The benefit of the Azure storage is that it spans several kinds of data formats and stores.
U-SQL is a data processing language. Like SQL, it has declarative language and procedural control flow. The difference between an imperative and a declarative language is that imperative means we tell the system how to get the data, while declarative means we tell the system what we want and the system finds a way to get it; the optimizer decides how to do it best. The difference between procedural and functional is that procedural means operating by changing persistent state with each expression, while functional means transforming input into output with no persistent state. Single objects require control flow and may require parallelism to be made explicit. With a data model, we can use higher-level data abstractions. There is scale-up versus scale-out. Programming languages require data in a store and do not make it part of the language model: data and logic go together, imperative and procedural. Data processing languages, on the other hand, make data part of the language model. They can be both declarative and functional. The data can evolve independently of the application, and the optimizer decides the parallelism and synchronization.
Courtesy U-SQL slide shares

My take on query improvements : https://1drv.ms/w/s!Ashlm-Nw-wnWsFqBcG-mBhjPLbC8

#codingexercise

If we have a matrix where each cell has a cost to travel and we can only move right or down, what is the minimum cost to travel from the top left of the matrix to the bottom right corner?

We keep track of the row and column index. The cell values are unit displacement costs.
Therefore, starting from the ending position, we can work our way backwards to the start, using the path that gives the minimum cost:
new cost = cell value + minimum of the recursive cost at the position to the left and the recursive cost at the position above
if we go out of the matrix, the recursive cost is infinite
if we reach the start position, the recursive cost is the value of the cell at the start position.

A similar formulation applies for paths taken in other directions, though cycles then require more care.
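A minimal sketch of the right-or-down recursion described above (memoization, omitted here, would avoid recomputing cells):

def min_cost(grid, i, j):
    # recursive cost of reaching cell (i, j) from the top left (0, 0)
    if i < 0 or j < 0:
        return float('inf')  # out of the matrix
    if i == 0 and j == 0:
        return grid[0][0]  # the start position
    # the predecessor is either the cell to the left or the cell above
    return grid[i][j] + min(min_cost(grid, i, j - 1), min_cost(grid, i - 1, j))

Calling min_cost(grid, len(grid) - 1, len(grid[0]) - 1) gives the answer.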