Saturday, September 30, 2017

Today we continue reviewing the whitepaper - The Chutes and Ladders of Customer Identity - from Gigya, a leader in the Customer Identity and Access Management (CIAM) industry. The title uses an analogy of a well-known game. This is a whitepaper that introduces the delights and the pitfalls of customer experience with identity.
Some of their salient mentions include:
1) The need for a personalized experience from the beginning and the need for a lighter touch.
2) More visitors register themselves when they see value and an easy way in.
3) Customer data can be collected over time and the profile built up on a continuous basis.
4) This data can also be shared or synchronized with downstream marketing, sales and services, subject to compliance with privacy and industry standards.
5) CIAM solutions treat the customer's first-party data as cash. Its value is immediate and it is the most reliable.
6) Transparency is required in how the data is used, and integration between systems should be facilitated, but these systems recognize the importance of first-party data.

One point this CIAM solution makes is that customer loyalty should be cherished and fostered through online engagement, where a feedback loop can be created to make recommendations that suit the customer's behavior and preferences and that keeps refining itself. This is a virtuous cycle of long-term growth in the relationship. Scorecards and profiles maintained with the customer data may help with analytics. What used to be static tables in a database now becomes an accruing analytics platform. We may refer to the AirBnB discussion to see how the storage and analytics evolve. The key takeaway was the possibility of a data pipeline and analytics platform as opposed to databases and service-oriented architecture.

Tomorrow we will start reading another white paper from Gigya.

#codingexercise
count palindrome substrings in a string.
We mentioned the brute force way is to enumerate all substrings and check each for being a palindrome. This is similar to using each letter as the center of a palindrome candidate and expanding it in both directions for even and odd length palindromes. We also mentioned an alternative that exploits the lengths of the palindrome substrings to follow Pascal's triangle. Here we explain the dynamic programming technique. Since this problem of counting palindromes has overlapping subproblems over the ranges i to j of the string, we can use memoization and dynamic programming.
For one or two letters in a range (i,j):
      we can initialize the palindrome count
else if str(i,j) is a palindrome
      we increment the count by 1 and count for (i,j-1) + (i+1,j) - (i+1,j-1) 
else 
      we proceed to count for  (i,j-1) + (i+1,j) - (i+1,j-1)
This count can directly be read and updated in a DP matrix whose rows and columns are both indexed by positions in the string.
Here (i,j-1) + (i+1,j) - (i+1,j-1) means that the first part of the count comes from the substring in the range i to j-1 (leaving out the last letter), the second part comes from the substring in the range i+1 to j (leaving out the first letter), and the last part removes the duplicate contribution of the substring in the range (i+1,j-1) (leaving out both the first and the last letters), which is otherwise included twice.
The above is repeated for every gap occurring between letters from 2 to n, and i, j are selected such that i takes values from 0 to just before the gap and j is the letter after the gap. This way the DP table fills column by column towards the right, so that the values already computed can be reused in the next iteration, as sketched below.
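A minimal sketch of this tabulation in Python (assuming the whole-string count is wanted; the is_pal table tracks whether each range is itself a palindrome):

def count_palindrome_substrings(s):
    n = len(s)
    if n == 0:
        return 0
    # dp[i][j] = number of palindromic substrings within s[i..j]
    dp = [[0] * n for _ in range(n)]
    # is_pal[i][j] = True when s[i..j] is itself a palindrome
    is_pal = [[False] * n for _ in range(n)]
    for i in range(n):
        dp[i][i] = 1            # every single letter is a palindrome
        is_pal[i][i] = True
    for gap in range(1, n):
        for i in range(n - gap):
            j = i + gap
            is_pal[i][j] = s[i] == s[j] and (gap == 1 or is_pal[i + 1][j - 1])
            dp[i][j] = dp[i][j - 1] + dp[i + 1][j] - dp[i + 1][j - 1]
            if is_pal[i][j]:
                dp[i][j] += 1   # count s[i..j] itself
    return dp[0][n - 1]

# example: "aaa" has palindromic substrings a, a, a, aa, aa, aaa -> 6
print(count_palindrome_substrings("aaa"))   # 6

The table fills by increasing gap, exactly as described above, so every value on the right-hand side of the recurrence is already available when it is needed.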

Friday, September 29, 2017

Today we continue reviewing the whitepaper - The Chutes and Ladders of Customer Identity - from Gigya, a leader in the Customer Identity and Access Management (CIAM) industry. The title uses an analogy of a well-known game. This is a whitepaper that introduces the delights and the pitfalls of customer experience with identity.

The chutes for customer identity are demonstrated by
1) annoying ads for past purchases
2) spamming of email inbox with unwanted newsletters
3) more setup and initiation activities prior to transaction

The ladders for customer identity are demonstrated by
1) personalization of ads done right
2) less frequent and more pertinent notifications
3) more integrated and seamless experience with less chores

Some of their salient mentions include:
1) The need for a personalized experience from the beginning and the need for a lighter touch.
2) More visitors register themselves when they see value and an easy way in.
3) Customer data can be collected over time and the profile built up on a continuous basis.
4) This data can also be shared or synchronized with downstream marketing, sales and services, subject to compliance with privacy and industry standards.
5) CIAM solutions treat the customer's first-party data as cash. Its value is immediate and it is the most reliable.
6) Transparency is required in how the data is used, and integration between systems should be facilitated, but these systems recognize the importance of first-party data.
#codingexercise
count palindrome substrings in a string.
We mentioned the brute force way is to enumerate all substrings and check each for being a palindrome. This is similar to using each letter as the center of a palindrome candidate and expanding it in both directions for even and odd length palindromes. We also mentioned an alternative that exploits the lengths of the palindrome substrings to follow Pascal's triangle. Here we explain the dynamic programming technique. Since this problem of counting palindromes has overlapping subproblems over the ranges i to j of the string, we can use memoization and dynamic programming.
For one or two letters in a range (i,j):
      we can initialize the palindrome count
else if str(i,j) is a palindrome
      we increment the count by 1 and count for (i,j-1) + (i+1,j) - (i+1,j-1)
else 
      we proceed to count for  (i,j-1) + (i+1,j) - (i+1,j-1)
This count can directly be read and updated in a DP matrix whose rows and columns are both indexed by positions in the string.

Thursday, September 28, 2017

Today we continue reviewing the whitepaper - The Chutes and Ladders of Customer Identity - from Gigya, a leader in the Customer Identity and Access Management (CIAM) industry. The title uses an analogy of a well-known game. This is a whitepaper that introduces the delights and the pitfalls of customer experience with identity. We read about the personalization experience from the beginning and the need for a lighter touch. More visitors register themselves when they see value and an easy way in. Most of them already know the benefits of setting up an account, such as keeping track of history, making purchases with a single click and so on. The effort involved in setting it up varies depending on whether it is done online or via the download of an app. The login process can be replaced with techniques that are not only harder to hack but also offer superior behavior, such as risk-based authentication, which can be used to combat fraud. Multiple factors of authentication can be added, including one-time passcodes. Also, customer data can be collected over time and the profile built up on a continuous basis. This data can also be shared or synchronized with downstream marketing, sales and services, subject to compliance with privacy and industry standards.
There is an interesting take on the CIAM solutions introduced earlier. They treat the customer's first-party data as cash. Its value is immediate and it is the most reliable. This is why CIAM solutions manage their own data while also allowing information to flow downstream to other parties. Transparency is required in how the data is used, and integration between systems should be facilitated, but these systems recognize the importance of first-party data.
#codingexercise
count palindrome substrings in a string.
We mentioned the brute force way is to enumerate all substrings and check each for being a palindrome. This is similar to using each letter as the center of a palindrome candidate and expanding it in both directions for even and odd length palindromes.
An alternative would be to find disjoint longest palindromes in the string by expanding the string one character at a time. Care must be taken to count the inner palindromes of a palindrome. Finally, we know that the lengths of the palindromes found up to a given center are also palindromic. Therefore we look for the next center based on the lengths we have already found. The suggestion is that the lengths array we construct will have values similar to the palindromic nature of Pascal's triangle.
Therefore the algorithm is:
Initialize the lengths array to the number of possible centers
Set the current center to the first center
Loop while the current center is valid:
       Expand to the left and right simultaneously until we find the largest palindrome around this center
       Fill in the appropriate entry in the longest palindrome lengths array
       Iterate through the longest palindrome array backwards and fill in the corresponding values to the right of the entry for the current center until an unknown value (because we don't have sufficient elements in the string to make an interpretation on the length) is encountered.
       Set the new center to the index of this unknown value
Return the lengths array
Since this problem of counting palindromes has overlapping subproblems over the ranges i to j of the string, we can also use memoization and dynamic programming. By using the lengths array and the Pascal's-triangle observation, we can also get counts directly.
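For comparison, a compact sketch of the expand-around-center counting mentioned above; every letter and every gap between letters is treated as a candidate center:

def count_palindromes_by_center(s):
    n = len(s)
    count = 0
    for center in range(2 * n - 1):
        left = center // 2            # odd centers sit on a letter,
        right = left + center % 2     # even centers sit between two letters
        while left >= 0 and right < n and s[left] == s[right]:
            count += 1                # every successful expansion is one more palindrome
            left -= 1
            right += 1
    return count

print(count_palindromes_by_center("abba"))   # a, b, b, a, bb, abba -> 6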

Wednesday, September 27, 2017

Today we continue reviewing the whitepaper - The Chutes and Ladders of Customer Identity - from Gigya, a leader in the Customer Identity and Access Management (CIAM) industry. The title uses an analogy of a well-known game. This is a whitepaper that introduces the delights and the pitfalls of customer experience with identity.
The chutes for customer identity are demonstrated by
1) annoying ads for past purchases
2) spamming of email inbox with unwanted newsletters
3) more setup and initiation activities prior to transaction
The ladders for customer identity are demonstrated by
1) personalization of ads done right
2) less frequent and more pertinent notifications
3) more integrated and seamless experience with less chores

Yesterday we read about the personalization experience from the beginning and the need for a lighter touch. Today we continue on the theme of making it easier for the customer. More visitors register themselves when they see value and an easy way in. Most of them already know the benefits of setting up an account, such as keeping track of history, making purchases with a single click and so on. The effort involved in setting it up varies depending on whether it is done online or via the download of an app. The login process can be replaced with techniques that are not only harder to hack but also offer superior behavior, such as risk-based authentication, which can be used to combat fraud. Multiple factors of authentication can be added, including one-time passcodes. Also, customer data can be collected over time and the profile built up on a continuous basis. This data can also be shared or synchronized with downstream marketing, sales and services, subject to compliance with privacy and industry standards.
#codingexercise
Given a set of distinct strings, find the shortest superstring that contains each string in the given set as a substring. The distinct strings cannot be substrings of each other.
The method to solve this is as follows:
1) Make a copy of the array to begin with.
2) Find the pair a and b that has the maximum overlap.
3) Replace a and b with the concatenation ab.
4) The size of the array reduces by one; repeat steps 2 and 3 until we are left with only one superstring.
Since we find the maximum overlap at each point, we make a greedy choice for the solution, as sketched below.
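A rough Python sketch of this greedy merge, assuming, as stated, that no string is a substring of another; overlap(a, b) is the longest suffix of a that is also a prefix of b:

def overlap(a, b):
    # longest suffix of a that is also a prefix of b
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def shortest_superstring(strings):
    words = list(strings)                       # work on a copy of the array
    while len(words) > 1:
        best = (-1, 0, 1, "")                   # (overlap length, i, j, merged)
        for i in range(len(words)):
            for j in range(len(words)):
                if i == j:
                    continue
                k = overlap(words[i], words[j])
                if k > best[0]:
                    best = (k, i, j, words[i] + words[j][k:])
        k, i, j, merged = best
        # replace the pair with its concatenation; the array shrinks by one
        words = [w for idx, w in enumerate(words) if idx not in (i, j)] + [merged]
    return words[0]

print(shortest_superstring(["catg", "ctaagt", "gcta", "ttca", "atgcatc"]))

The greedy choice is not guaranteed to be optimal for every input, but it always produces a valid superstring.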

one more coding exercise
count palindrome substrings in a string.
The brute force way is to enumerate all substrings and check each for being a palindrome.
An alternative would be to find disjoint longest palindromes in the string. Care must be taken to count the inner palindromes of a palindrome. Finally, we know that the lengths of the palindromes found up to a given center are also palindromic. Therefore we look for the next center based on the lengths we have already found. Palindrome centers occur at Pascal's-triangle-like positions in a string, so we check these as opposed to every letter as an optimization.

Tuesday, September 26, 2017

Today we continue reviewing the whitepaper - The Chutes and Ladders of Customer Identity - from Gigya, a leader in the Customer Identity and Access Management (CIAM) industry. The title uses an analogy of a well-known game. This is a whitepaper that introduces the delights and the pitfalls of customer experience with identity.
The chutes for customer identity are demonstrated by
1) annoying ads for past purchases
2) spamming of email inbox with unwanted newsletters
3) more setup and initiation activities prior to transaction
The ladders for customer identity are demonstrated by
1) personalization of ads done right
2) less frequent and more pertinent notifications
3) more integrated and seamless experience with less chores
Most Customer Identity and Access Management solutions need to track customers. They can only do this with user consent. While some show all the terms and conditions up front as a one-time but overwhelming consent request, others choose to ask as and when needed, providing a lighter touch.
Similarly, the sign-up process can seem to require all day to get all the details fed into the system, while others merely defer to existing registered partner sites, such as signing in with email or chat applications. Removing the password requirement is touted as one of the best improvements to customer experience, and consequently a lot of attention has been paid by the industry to forming the OpenID standard. What the standard leaves unsaid is how marketers can use it to their advantage, since it focuses on the integration between businesses and stores for the customer. A marketer would like to know:
whether the customer arrived via search, campaign or referral
what device they connected with
what was the tipping point that converted them to a customer
what transactions the customer attempted or executed
what are the recommendations that will interest the customer the most
how to make more money from an engaging and mutually beneficial experience for the customer.
what partners and associates are willing to work with the marketers for this customer

#codingexercise
Get the max sum level of an integer binary tree.
We perform a level-wise traversal of the binary tree and serialize the integers encountered.
Then for every level, we add up the sum and return the level with the highest sum.
The level-wise traversal is done by enqueueing the left and the right siblings in a queue. The levels are demarcated by level indicators, which in our case could be a null node, as sketched below.
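A small sketch of the level-wise traversal with a queue, assuming a hypothetical Node class with val, left and right; instead of a null marker, this version demarcates levels by draining the queue one level at a time:

from collections import deque

class Node:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def max_sum_level(root):
    if root is None:
        return -1
    best_level, best_sum = 0, float('-inf')
    q, level = deque([root]), 0
    while q:
        level_sum = 0
        for _ in range(len(q)):          # process exactly one level
            node = q.popleft()
            level_sum += node.val
            if node.left:  q.append(node.left)
            if node.right: q.append(node.right)
        if level_sum > best_sum:
            best_level, best_sum = level, level_sum
        level += 1
    return best_level

# level 1 (values 7 and 9) has the highest sum in this tree
print(max_sum_level(Node(1, Node(7, Node(4), Node(5)), Node(9))))   # 1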

Monday, September 25, 2017

Today we start reviewing the whitepaper - The Chutes and Ladders of Customer Identity - from Gigya, a leader in the Customer Identity and Access Management (CIAM) industry. The title uses an analogy of a well-known game. This is a whitepaper that introduces the delights and the pitfalls of customer experience with identity. The narration is about an online store, with the recognition that many of the points mentioned in the white paper apply equally well to other businesses. Most customers start out on an online store as an unknown visitor. Businesses rely on sending the right message to the right person at the right time and through the right channels. This multi-channel marketing is a ladder in customer experience. Since customers start out from a variety of devices and applications, it is difficult to know what they want. Identity and preferences help mitigate this barrier. Some examples of interactions that are either annoying and frustrating or enjoyable and satisfying are mentioned here. A customer may buy a pair of jeans a few weeks earlier but may continue to see unwanted ads for the same jeans all over the internet. On the other hand, the same customer may see recommendations for other items that pair well with the jeans, much like a personal fashion consultant. Another example is that of subscriptions to newsletters from the online store that become difficult to manage. On the other hand, a crisp, clear and infrequent newsletter may give relevant information and improve the experience. Similarly, when the user switches devices, he may lose the settings he had made earlier. On the other hand, his login may be integrated with social networking apps and the settings become part of a public profile that easily follows the user between apps and devices. Therefore, there is a trade-off in the customer touchpoints that can attract or spurn the customer. If the business were to capture information about the user as early as possible and in small pieces, a relationship can be established based on the user's preferences. This approach is known as "progressive identity". It offers promotions, coupons or newsletters to the customer without requiring full registration. It obtains consent from the customer throughout their life-cycle. It enables convenient and centralized profile and preference management. Progressive identity is the notion that a customer is not a boolean known or unknown to a store but an accumulation of transparent and engaging experiences. Customer tracking is required to personalize the customer's experience, but instead of displaying all the terms upfront, if consent can be managed as the customer progresses on the website, it can improve the experience. A lightweight registration, such as an email address in exchange for valuable and personalized content, will be preferred by the customer. The customer will be enticed by the content to sign up for a full account. The easier we make this registration and the more augmented the authentication methods, the better the customer experience.
#codingexercise
Convert a BST to a min heap
Traverse the BST in order to get a sorted list
Heapify the sorted list by treating it as a one-based array:
the left child of the element at one-based index i is at index 2i, if that index is within range
the right child of the element at one-based index i is at index 2i+1, if that index is within range, as sketched below
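A short sketch of this conversion, assuming the result is returned as a new complete tree built from the sorted values (the sorted order itself already satisfies the min-heap property):

class Node:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def inorder(root, out):
    if root:
        inorder(root.left, out)
        out.append(root.val)
        inorder(root.right, out)

def bst_to_min_heap(root):
    values = []
    inorder(root, values)                 # sorted list of the BST values
    def build(i):                         # i is a one-based heap index
        if i > len(values):
            return None
        node = Node(values[i - 1])
        node.left = build(2 * i)          # left child of one-based index i
        node.right = build(2 * i + 1)     # right child of one-based index i
        return node
    return build(1)

bst = Node(4, Node(2, Node(1), Node(3)), Node(6, Node(5), Node(7)))
heap = bst_to_min_heap(bst)
print(heap.val, heap.left.val, heap.right.val)   # 1 2 3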

Sunday, September 24, 2017

We continue to review the slides from Stanford that introduce Natural Language Processing via Vector Semantics. We said that vector representation is useful and opens up new possibilities. We saw that a lookup such as a thesaurus does not help. We were first reviewing co-occurrence matrices. These come in many forms, such as the term-document matrix, word-word matrix, word-context matrix etc. The term-document matrix is a count of word w in a document d. Each document therefore becomes a count vector. The similarity between the words in this case merely indicates that their occurrences are similar. If we change the scope from documents to some other text boundary, we have the word-word matrix. The similarity in this case improves over that in the term-document matrix. A word-context matrix improves this further because it describes the word in terms of its context, which is closer to its meaning and brings semantic similarity.
Instead of co-occurrence, we now consider a different correlation between two words, called positive pointwise mutual information. Pointwise mutual information indicates whether two words occur together more often than they would if they were independent. It is defined as the log of their joint probability divided by the product of their individual probabilities. This value can come out to be negative or positive. The negative values are not helpful because they are of little significance: the probabilities are small and the result is in the inverse order of powers of ten, which does not convey meaning with any significance. Unrelatedness is also not helpful to our analysis. On the other hand, positive PMI is helpful to discern whether two words are likely to appear together. We only take the PMI if it turns out to be positive. Computing the PPMI is easy to do with the probabilities, which are based on cumulative word co-occurrence counts.
Today we realize that PMI needs to be weighted, because it is biased towards infrequent events. If the words are rare, they have a high PMI value. This skews the results when the text has rare words. In order to make it fair, the PMI is weighted. This can be achieved either by raising the context probabilities or with add-one smoothing. In the first case, the context probability is raised to the power alpha = 0.75; in the second case, an appropriate smoothing constant is added to the numerator in calculating the probability and applied to the uniform probability as well.
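A minimal sketch of weighted PPMI, assuming counts is a small word-by-context co-occurrence matrix; the context probability is raised to alpha = 0.75 as described:

import numpy as np

def ppmi(counts, alpha=0.75):
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    p_wc = counts / total                         # joint probabilities P(w, c)
    p_w = p_wc.sum(axis=1, keepdims=True)         # word probabilities P(w)
    # context probabilities raised to alpha to damp the bias toward rare contexts
    c_counts = counts.sum(axis=0)
    p_c_alpha = (c_counts ** alpha) / (c_counts ** alpha).sum()
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log2(p_wc / (p_w * p_c_alpha))
    pmi[~np.isfinite(pmi)] = 0.0                  # zero counts contribute nothing
    return np.maximum(pmi, 0.0)                   # keep only the positive part

counts = [[0, 2, 1],       # rows: words, columns: contexts
          [3, 0, 1],
          [1, 1, 4]]
print(np.round(ppmi(counts), 2))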
#codingexercise
Find all the heroes and the superheroes in an integer array. The heroes are the elements which are greater than all the elements to their right. A superhero is the element which is greater than all the elements to its left and right.
This problem can be solved by traversing the array from right to left and keeping track of the current maximum seen so far. If the element picked next exceeds the current maximum, it satisfies the criterion for being a hero and becomes the new current maximum. Any element that does not exceed the current maximum is not a hero. The final value of the current maximum is the superhero, provided it is unique.
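A quick sketch of that right-to-left scan; the running maximum identifies the heroes, and its final value is the superhero when it is unique:

def heroes_and_superhero(arr):
    heroes = []
    current_max = float('-inf')
    for x in reversed(arr):               # scan from the right
        if x > current_max:
            heroes.append(x)              # greater than everything to its right
            current_max = x
    heroes.reverse()
    # the final maximum is the superhero only if it is strictly greater than
    # every other element, i.e. it appears exactly once
    superhero = current_max if arr and arr.count(current_max) == 1 else None
    return heroes, superhero

print(heroes_and_superhero([1, 6, 5, 2, 3]))   # ([6, 5, 3], 6)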

Saturday, September 23, 2017

We continue to review the slides from Stanford that introduce Natural Language Processing via Vector Semantics. We said that vector representation is useful and opens up new possibilities. We saw that a lookup such as a thesaurus does not help. We were first reviewing co-occurrence matrices. These come in many forms, such as the term-document matrix, word-word matrix, word-context matrix etc. The term-document matrix is a count of word w in a document d. Each document therefore becomes a count vector. The similarity between the words in this case merely indicates that their occurrences are similar. If we change the scope from documents to some other text boundary, we have the word-word matrix. The similarity in this case improves over that in the term-document matrix. A word-context matrix improves this further because it describes the word in terms of its context, which is closer to its meaning and brings semantic similarity.
Some of these matrices can be very sparse, with zeros covering a majority of the cells. This is quite alright since there are lots of efficient algorithms for sparse matrices. Similarly, the size of the window can also be adjusted. The shorter the window, the more syntactic the representation; the longer the window, the more semantic the representation. Instead of co-occurrence, we now consider a different correlation between two words, called positive pointwise mutual information. Raw word frequency suffers from being skewed by more frequent and less salient words. Pointwise mutual information indicates whether two words occur together more often than they would if they were independent. It is defined as the log of their joint probability divided by the product of their individual probabilities. This value can come out to be negative or positive. The negative values are not helpful because they are of little significance: the probabilities are small and the result is in the inverse order of powers of ten, which does not convey meaning with any significance. Unrelatedness is also not helpful to our analysis. On the other hand, positive PMI is helpful to discern whether two words are likely to appear together. We only take the PMI if it turns out to be positive. Computing the PPMI is easy to do with the probabilities, which are based on cumulative word co-occurrence counts.
#codingexercise
Count the number of islands in a sparse matrix.
This problem is similar to finding connected components in a graph. In a 2D matrix, every cell has eight neighbors. A depth first search explores all of these eight neighbors. When a cell is visited, it is marked so it is not included in a later traversal. The number of depth first searches started this way gives the number of connected components, which is the number of islands.
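A brief sketch of the eight-neighbor depth first search over a binary grid, marking visited cells by zeroing them out:

def count_islands(grid):
    if not grid:
        return 0
    rows, cols = len(grid), len(grid[0])

    def dfs(r, c):
        if r < 0 or r >= rows or c < 0 or c >= cols or grid[r][c] == 0:
            return
        grid[r][c] = 0                      # mark the cell as visited
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr or dc:                # all eight neighbors
                    dfs(r + dr, c + dc)

    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 1:             # an unvisited land cell starts a new island
                dfs(r, c)
                count += 1
    return count

print(count_islands([[1, 1, 0, 0],
                     [0, 1, 0, 1],
                     [0, 0, 0, 1],
                     [1, 0, 1, 1]]))        # 3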

Friday, September 22, 2017

We continue to review the slides from Stanford that introduce Natural Language Processing via Vector Semantics. We said that vector representation is useful and opens up new possibilities. We saw that a lookup such as a thesaurus does not help.
Stanford NLP has shown there are four kinds of vector models.
A sparse vector representation where a word is represented in terms of its co-occurrences with other words, using a set of weights for those co-occurrences. This weight is usually based on a metric called mutual information.
A dense vector representation that involves latent semantic analysis, a neural net, or clusters from the Brown corpus. The dense vector representations share the idea of representing a word as a vector of numbers, translating the word into a corresponding vector in the vector space. This is called an embedding.
Co-occurrence matrices come in many forms, such as the term-document matrix, word-word matrix, word-context matrix etc. The term-document matrix is a count of word w in a document d. Each document therefore becomes a count vector. The similarity between the words in this case merely indicates that their occurrences are similar. If we change the scope from documents to some other text boundary, we have the word-word matrix. The similarity in this case improves over that in the term-document matrix. A word-context matrix improves this further because it describes the word in terms of its context, which is closer to its meaning and brings semantic similarity.
Co-occurrence between two words has two forms - first order and second order. First-order co-occurrence is syntagmatic association and second-order co-occurrence is paradigmatic association, which means the first one is based on positions whereas the second one is based on similar neighbors. Note that the vectorization derives from the usage of words, which is why it has become popular. Another way to look at usage is to canonicalize the text into an Esperanto-like language where the relations and syntax are more oriented towards natural language processing. Some work has already begun with different kinds of improvements to ontologies that are not restricted to a thesaurus or WordNet, such as FrameNet. All we need to keep in mind here is that there are layers to tackle the problem - usage, vector space, and classification of vectors.

Thursday, September 21, 2017

We continue to review the slides from Stanford that introduce Natural Language Processing via Vector Semantics. We said that vector representation is useful and opens up new possibilities. We saw that a lookup such as a thesaurus does not help.
Stanford NLP has shown there are four kinds of vector models.
A sparse vector representation where a word is represented in terms of its co-occurrences with other words, using a set of weights for those co-occurrences. This weight is usually based on a metric called mutual information.
A dense vector representation that takes one of the following vector models:
A representation based on weights associated with other words, where the weights are computed using conditional probabilities of the occurrences; this is referred to as latent semantic analysis
A neural-network-based model where the weights with other words are determined by first predicting a word based on the surrounding words and then predicting the surrounding words based on the current word
A set of clusters based on the Brown corpus.
#codingexercise
Find the minimum number of squares whose sum equals a given number n
We write a few base cases, say up to n = 3.
For n greater than that, we can initialize the count for each candidate we consider from 4 to n: each number can be represented with, at most, as many squares as it has unit squares.
Next, for each iterator from 1 upward whose square does not exceed the candidate, we recursively calculate the count for the candidate minus the square of the iterator, adding one for the square just used. We update the minimum as we go over each iterator. All the results are memoized for easy lookup. This results in the smallest number of squares being found in the table entry for n, as sketched below.
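A compact memoized sketch, assuming the base cases up to 3 are the numbers themselves:

from functools import lru_cache

@lru_cache(maxsize=None)
def min_squares(n):
    if n <= 3:
        return n                      # 1, 2 and 3 need n unit squares
    best = n                          # worst case: all unit squares
    i = 1
    while i * i <= n:
        best = min(best, 1 + min_squares(n - i * i))
        i += 1
    return best

print(min_squares(12))    # 3  (4 + 4 + 4)
print(min_squares(13))    # 2  (9 + 4)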

Wednesday, September 20, 2017

We continue to review the slides from Stanford that introduce Natural Language Processing via Vector Semantics. We said that vector representation is useful and opens up new possibilities. For example, it helps compute the similarity between words: "fast" is similar to "rapid", "tall" is similar to "height". This can help in question answering, say as in: How tall is Mt. Everest? The height of Mt. Everest is 29,029 feet. Similarity of words also helps with detecting plagiarism. If two narratives have a few words changed here and there, the similarity of the words should be high because they share the same context. When a number of word vectors are similar, the overall narrative is plagiarized.
Word vectors are also useful when the semantics of a word change over time. Words hold their meaning only in the context of the surrounding words. If their usage changes over time, their meaning also changes. Consequently, word similarity may change based on context. The problem with using a thesaurus in this case is that a thesaurus does not exist for every year to determine the similarity between words that mean something today and meant something else yesterday. Moreover, a thesaurus, unlike a dictionary, does not contain all words and phrases, particularly verbs and adjectives.
Therefore, instead of looking up an ontology, we now refer to a distributional model for the meaning of a word, which relies on the context surrounding the given words. A synonym is therefore a choice of words that share the same context and usage. In fact, we interpret the meanings of unknown words by looking at the surrounding words and their context.
Stanford NLP has shown there are four kinds of vector models.
A sparse vector representation where a word is represented in terms of its co-occurrences with other words, using a set of weights for those co-occurrences. This weight is usually based on a metric called mutual information.
A dense vector representation that takes one of the following vector models:
A representation based on weights associated with other words, where the weights are computed using conditional probabilities of the occurrences; this is referred to as latent semantic analysis
A neural-network-based model where the weights with other words are determined by first predicting a word based on the surrounding words and then predicting the surrounding words based on the current word
A set of clusters based on the Brown corpus.
#codingexercise
Find the maximum water retention in a bar chart
Water is retained over a bar of unit length, between the tallest bars to its left and right, up to a depth equal to the difference between the minimum of those two maxima and the height of the current bar. Therefore for each bar we can find the max on the left and on the right and calculate the water retained as above. We then accumulate this retained water over the whole range of bars. Since we need the max on the left and on the right for each bar, we can compute these in two separate passes over all the bars, as sketched below.
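A short sketch of the two-pass computation described above:

def trapped_water(heights):
    n = len(heights)
    if n < 3:
        return 0
    left_max = [0] * n
    right_max = [0] * n
    left_max[0] = heights[0]
    for i in range(1, n):                         # tallest bar up to i from the left
        left_max[i] = max(left_max[i - 1], heights[i])
    right_max[n - 1] = heights[n - 1]
    for i in range(n - 2, -1, -1):                # tallest bar up to i from the right
        right_max[i] = max(right_max[i + 1], heights[i])
    # water above each bar is bounded by the lower of the two maxima
    return sum(min(left_max[i], right_max[i]) - heights[i] for i in range(n))

print(trapped_water([0, 1, 0, 2, 1, 0, 1, 3, 2, 1, 2, 1]))   # 6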

Tuesday, September 19, 2017

Today we start reviewing the slides from Stanford that introduce Natural Language Processing via Vector Semantics. This is useful to study when we wonder about the usefulness of graphs in text analysis. It's true that a vector of features is like a row with columns. Tabular data is easy to work with using conventional database and data mining techniques. Moreover, representing a problem in vector space means we can utilize all the matrix analysis techniques that we have known for a long while. In fact, new improvements may be possible as we rediscover the suitability of more and more techniques from this domain. For example, matrix factorization, eigenvalues and eigenvectors, Gaussians and Laplacians come straight out of mathematical textbooks and continue to serve as techniques to try with NLP. Matrix mathematics is also useful to succinctly describe a problem with notation for an entire matrix. Similarly, vector space helps visualize problems in Cartesian coordinates with a relative or absolute point of reference. This is another form of analysis that we have long understood and continue to find useful to explain ideas. Vector space gives us a way to describe the problem in a space where we can describe and visualize magnitude and direction. Together with matrices, vectors allow us to see transformations.
Graphs came about as an evolution from disconnected and linearly related data. When we establish linear dependencies, we immediately see dependencies that exist beyond just two instances in a linear model and contributions from neighbors other than those two. Graphs also gave us methods such as traversal, centrality and PageRank that introduced us to new ways to visualize and solve problems. People who work with graph databases become so involved with seeing relationships described as edges between neighbors that they start questioning what purpose relational databases have and why there are even joins between tables. They see graphs as liberating for storing and analyzing relationships. Graphs can now even be worked with in batch NoSQL mode, which lets us scale our operations like never before. In fact, graphs take a long time to create, but once they are created, they can serve useful analysis in a fraction of the time that was spent on creating them. Graphs are also supported out of the box by graph databases and work very well with their techniques. Many analysis packages also make it popular to work with graphs.
Vectors can be transformed into graphs based on the similarity between vectors. Depending on the strength of similarity, we may draw edges between the nodes that represent the vectors. This gives us a way to determine how important a vector is among its neighbors - a very useful concept if we have to select a few vectors from the many. Still, the selection can also work by representing vectors as tabular rows with a classifier that can issue tags. So while these different techniques may perform similar tasks, they are not mutually exclusive and can even work with one another. We will start reading on how vector representation improves the notion of what the nodes or entities are.
#codingexercise
Find the largest rectangular area in a histogram.
Each bar of the histogram may serve as the limiting height of a rectangle. Treating its top edge as the top of the rectangle, the width may extend over neighboring bars as long as they are at least as tall. Therefore, for each bar in the histogram, we try to draw such a rectangle and take the maximum enclosed area over all the rectangles formed, as sketched below.
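A simple sketch matching that description: each bar is taken as the limiting height and widened left and right while its neighbors are at least as tall (a stack-based variant would bring this down to linear time):

def largest_rectangle_area(heights):
    n = len(heights)
    best = 0
    for i, h in enumerate(heights):
        left = i
        while left > 0 and heights[left - 1] >= h:     # extend while neighbors allow it
            left -= 1
        right = i
        while right < n - 1 and heights[right + 1] >= h:
            right += 1
        best = max(best, h * (right - left + 1))
    return best

print(largest_rectangle_area([2, 1, 5, 6, 2, 3]))   # 10 (the bars of height 5 and 6)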

Monday, September 18, 2017

Today we describe a method to do text summarization:
def summarize(self, text):
    from gensim.models import word2vec

    # load the test sample and train a small word2vec model on it
    sentences = text.split('\n')
    tokenized = [sentence.split() for sentence in sentences]   # gensim expects lists of tokens
    model = word2vec.Word2Vec(tokenized, size=200)
    model.save_word2vec_format('/tmp/vectors.bin', binary=True)
    model = word2vec.Word2Vec.load_word2vec_format('/tmp/vectors.bin', binary=True)

    # get selection from text based on centroids selected from k-means classification avoiding graph
    from nltk.cluster.kmeans import KMeansClusterer
    NUM_CLUSTERS = 3  # user can specify this or we take a default value
    kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=model.similarity, repeats=25)
    assigned_clusters = kclusterer.cluster(model, assign_clusters=True)
    pr = translate_to_words(assigned_clusters)   # helper (defined elsewhere) mapping cluster assignments to word scores

    # render the summary: pick the highest scored words and rank sentences by them
    import operator
    sorted_pr = sorted(pr.items(), key=operator.itemgetter(1), reverse=True)
    important = [int(i[0]) for i in sorted_pr][:10]
    scored_sentences = {}
    for sentence in sentences:
        matches = set(sentence.split()).intersection(important)
        score = 0
        for match in matches:
            score += pr[match]
        scored_sentences[sentence] = score
    reordered_sentences = [i[0] for i in sorted(scored_sentences.items(), key=operator.itemgetter(1), reverse=True)[:10]]
    ordered_sentences = [x for x in sentences if x in reordered_sentences]
    summary = '\n'.join(ordered_sentences)
    print(summary)
    return ordered_sentences


#codingexercise
Given a sequence of motions in terms of unit displacements - G for moving forward, L for turning to the left and R for turning to the right - determine if a robot executing the sequence repeatedly stays within a circle.
Solution: We keep track of the current position and the orientation in terms of N, E, W and S. For each instruction in the sequence, we either update the current coordinate by displacing x or y (but not both) or we update the orientation. After one pass of the sequence, if the robot is back at the origin, or if its orientation differs from the starting orientation, the robot stays within a circle.

There is one observation for this coding exercise: by simulation alone, there is no way to be sure that a given sequence converges within a predetermined number of steps, which is why the position and orientation check after a single pass is used.
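A minimal sketch under the interpretation above (G moves one unit forward, L and R turn 90 degrees), checking position and orientation after a single pass:

def is_bounded(instructions):
    # directions in order N, E, S, W as (dx, dy) unit displacements
    directions = [(0, 1), (1, 0), (0, -1), (-1, 0)]
    x, y, facing = 0, 0, 0                 # start at the origin facing north
    for move in instructions:
        if move == 'G':
            dx, dy = directions[facing]
            x, y = x + dx, y + dy
        elif move == 'L':
            facing = (facing - 1) % 4
        elif move == 'R':
            facing = (facing + 1) % 4
    # bounded if back at the origin, or if the orientation changed (the path then
    # repeats in a cycle of at most four passes of the sequence)
    return (x, y) == (0, 0) or facing != 0

print(is_bounded("GGLLGG"))   # True  (returns to the origin)
print(is_bounded("GG"))       # False (keeps moving north)
print(is_bounded("GL"))       # True  (traces a square over four passes)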

Sunday, September 17, 2017

Today we continue reviewing U-SQL. It unifies the benefits of SQL with the expressive power of your own code. This is said to work very well with all kinds of data stores - file, object and relational. U-SQL works on the Azure ecosystem, which involves the Azure Data Lake storage as the foundation and the analytics layer over it. The benefit of the Azure storage is that it spans several kinds of data formats and stores.
The presentation for U-SQL explains three scenarios: a cognitive example, a text analysis example and JSON processing.
The cognitive example identifies objects in images. This kind of example shows how the entire image processing on image files can be considered custom logic and used with the query language. As long as we define the objects, the input and the logic to analyze the objects, it can be made part of the query to extract the desired output dataset.
The text analysis example is similar in that we can extract the text prior to performing the analysis. It's interesting to note that the classifier used to tag the text can be written in the R language and is not dependent on the query. Different outputters also produce different outputs.
JSON processing is another example cited by the presentation, probably because extract-transform-load has become important in analytical processing, whether it is a cloud data warehouse or big data operations. This "schema later" approach is popular because it decouples producers and consumers, which saves coordination and time-consuming operations between, say, departments. While some applications with query languages such as SnowSQL import the JSON into a columnar table and then execute a query, or declare their own syntax to flatten the JSON, the approach taken by U-SQL is more general purpose with its extractors, SELECT expressions and outputters that are either built-in or customized by the user.
Courtesy U-SQL slide shares
My updates on query improvements : https://1drv.ms/w/s!Ashlm-Nw-wnWsFqBcG-mBhjPLbC8
#codingexercise
Recursive function to check for a palindrome:
 If the string is empty or has one character, return true
 If string.first() == string.last(), return the result of recursing on the string with the first and last characters stripped
 If string.first() != string.last(), return false
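The same recursion written out in Python:

def is_palindrome(s):
    if len(s) <= 1:                     # empty or single character
        return True
    if s[0] == s[-1]:
        return is_palindrome(s[1:-1])   # strip the first and last characters
    return False

print(is_palindrome("racecar"))   # True
print(is_palindrome("rocket"))    # False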

Also an update on classifiers : https://1drv.ms/w/s!Ashlm-Nw-wnWsFzHx5Hrcl633js_ 

Saturday, September 16, 2017

Today we continue reviewing U-SQL. It unifies the benefits of SQL with the expressive power of your own code. This is said to work very well with all kinds of data stores - file, object and relational. U-SQL works on the Azure ecosystem, which involves the Azure Data Lake storage as the foundation and the analytics layer over it. The benefit of the Azure storage is that it spans several kinds of data formats and stores.
One of the improvements in this language design is the consideration for single-node versus parallel versus distributed computing. Queries often have to manage parallelism, synchronization and transactions. But the language not only has to allow implicit handling by the system, it also has to enable explicit constructs for the users. Moreover, execution is no longer just scale-up but also scale-out, and therefore the libraries as well as the language need to handle parallelism.
The data processing language is independent of the scale of data, but the data is a part of the language model. Programming languages treat data as something in a store and tie the data and the logic together. This data processing language allows data to change and evolve independently of the application.
U-SQL provides all this for the user with custom operator extensions called UDOs, which are scaled out. They include user-defined extractors, outputters, processors, appliers, combiners and reducers. The scale-out can also be explicitly requested with hint keywords.
UDOs can be written in any .NET language and can be deployed in the service as an assembly after registering it with a U-SQL script. Therefore, like SQLCLR, UDOs can invoke managed code and other runtimes such as Python and R, all with the option to scale out. UDOs cannot interact with one another and are isolated to the scope in which they are registered. The U-SQL script allows these UDOs residing in assemblies to be invoked with the different data processing operations such as extract, reduce and so on.
One simple example of using UDOs for the text summarization that we talked about earlier with the Trimpy Python extension can be illustrated with the following simplified query (only for illustration):
@text = EXTRACT text string
        FROM @"filename"
        USING new Trimpy.Extractor();
@summary = SELECT Trimpy.Summarize(text) AS summary
           FROM @text;
OUTPUT @summary
    TO "/summary.txt"
    USING Outputters.Text();
This is simple, but tasks like text classification, prediction or data mining can also be called via U-SQL.
                               


Courtesy U-SQL slide shares

My take on query improvements : https://1drv.ms/w/s!Ashlm-Nw-wnWsFqBcG-mBhjPLbC8

#codingexercise
Count all palindromic subsequences of a string
We can use a recursive solution to count this as we shrink the string.
If the boundary characters match, we count the following two subsequences:
the first from start to end - 1
the second from start + 1 to end
plus 1 for the palindrome formed by the matching boundary pair
otherwise, we count the same two subsequences again and subtract the count for the range starting at start + 1 and ending at end - 1, because it would have been included twice.

This same logic holds true for substrings if the counted ranges can be confirmed to be contiguous in the string.
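A memoized sketch of the recurrence above, counting non-empty palindromic subsequences (not necessarily distinct):

from functools import lru_cache

def count_palindromic_subsequences(s):
    @lru_cache(maxsize=None)
    def count(i, j):
        if i > j:
            return 0
        if i == j:
            return 1
        if s[i] == s[j]:
            # the two overlapping ranges plus 1; the overlap cancels against the
            # new palindromes formed by wrapping s[i]..s[j] around the inner ones
            return count(i + 1, j) + count(i, j - 1) + 1
        return count(i + 1, j) + count(i, j - 1) - count(i + 1, j - 1)
    return count(0, len(s) - 1)

print(count_palindromic_subsequences("aab"))   # 4: a, a, b, aa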

Friday, September 15, 2017

Today we continue reviewing U-SQL. It unifies the benefits of SQL with the expressive power of your own code. This is said to work very well with all kinds of data stores - file, object and relational. U-SQL works on the Azure ecosystem, which involves the Azure Data Lake storage as the foundation and the analytics layer over it. The benefit of the Azure storage is that it spans several kinds of data formats and stores.

U-SQL, like T-SQL, provides important benefits as a query language. First and foremost, there is consistency and familiarity in its usage. The learning curve and the onboarding from T-SQL to U-SQL are not very steep. Moreover, there is a lot of thought behind the syntax. It is context independent and defined as a data processing language. It is also composable. It is important for the query language to be precise and accurate while at the same time being convenient for the user. There is a separation of concerns between writeability and readability. All of these are important considerations in U-SQL. In addition to the syntax, the semantics also matter. U-SQL tries to avoid surprises and complexities. Moreover, as a language it is composable for the user and optimizable for the system.
Courtesy U-SQL slide shares

My take on query improvements : https://1drv.ms/w/s!Ashlm-Nw-wnWsFqBcG-mBhjPLbC8

#codingexercise
Find the smallest window in a string that contains all the letters of a given search string.
We maintain a hashtable for the N letters of the search string, with a linked list of the positions of occurrences for each letter. As we scan the containing string, we insert each index into the linked list against that letter in the hash table.
Next, taking the minimum of the heads of the linked lists as a candidate start and initializing a max value, for every other letter in the hash table we find the position that is closest to and after the candidate's position and update the max found so far. With this max value minus the candidate's position as the offset, we can form a (start, offset) tuple for every candidate. The tuple that gives the smallest offset is the answer to return.
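For comparison, a sketch of the standard sliding-window technique for the same problem (a different approach from the position-list method described above); the best answer is kept as a (start, offset) pair:

from collections import Counter

def min_window(s, pattern):
    # expand the right edge until all letters of the pattern are covered,
    # then shrink from the left while they remain covered
    need = Counter(pattern)
    missing = len(pattern)
    best = (0, float('inf'))                 # (start, offset)
    left = 0
    for right, ch in enumerate(s):
        if need[ch] > 0:
            missing -= 1
        need[ch] -= 1
        while missing == 0:                  # the window covers the pattern
            if right - left < best[1]:
                best = (left, right - left)
            need[s[left]] += 1
            if need[s[left]] > 0:
                missing += 1
            left += 1
    start, offset = best
    return s[start:start + offset + 1] if offset != float('inf') else ""

print(min_window("ADOBECODEBANC", "ABC"))   # "BANC"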

Thursday, September 14, 2017

Today we continue reviewing U-SQL. It unifies the benefits of SQL with the expressive power of your own code. This is said to work very well with all kinds of data stores - file, object and relational. U-SQL works on the Azure ecosystem, which involves the Azure Data Lake storage as the foundation and the analytics layer over it. The benefit of the Azure storage is that it spans several kinds of data formats and stores.
U-SQL is a data processing language. Like SQL, it has a declarative language and procedural control flow. The difference between imperative and declarative is that imperative means we tell the system how to get the data, while declarative means we tell the system what we want and the system finds a way to get it; the optimizer decides how to do it best. The difference between procedural and functional is that procedural means it operates by changing persistent state with each expression, while functional means it transforms input into output with no persistent state. Single objects require control flow and may require parallelism to be made explicit. With the data model, we can use higher level data abstractions. There is scale-up versus scale-out. Programming languages require data in a store and do not make it part of the language model; data and logic go together, and they are imperative and procedural. Data processing languages, on the other hand, make data part of the language model. They can be both declarative and functional. The data can evolve independently of the application, and the optimizer decides the parallelism and synchronization.
U-SQL, like T-SQL, provides important benefits as a query language. First and foremost, there is consistency and familiarity in its usage. The learning curve and the onboarding from T-SQL to U-SQL are not very steep. Moreover, there is a lot of thought behind the syntax. It is context independent and defined as a data processing language. It is also composable. It is important for the query language to be precise and accurate while at the same time being convenient for the user. There is a separation of concerns between writeability and readability. All of these are important considerations in U-SQL. In addition to the syntax, the semantics also matter. U-SQL tries to avoid surprises and complexities. Moreover, as a language it is composable for the user and optimizable for the system.
Courtesy U-SQL slide shares

My take on query improvements : https://1drv.ms/w/s!Ashlm-Nw-wnWsFqBcG-mBhjPLbC8

#codingexercise
Given n friends, each one can remain single or can be paired up with some other friend. Each friend can be paired only once, so ordering is irrelevant. Find the total number of ways in which the friends can be paired.
Let us define a recursive method to do this which takes a parameter n.
When n is at most two, the number of ways is n.
If the number of friends is greater than two, the arrangements can be formed with the nth friend remaining single, which is the equivalent of calling the recursive method with n-1; in addition, they can be formed by pairing the nth friend with any of the other n-1 friends, which contributes n-1 times the number of ways of arranging the remaining n-2 friends.
F(n) = F(n-1) + (n-1) F(n-2)
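An iterative rendering of this recurrence:

def count_pairings(n):
    # F(n) = F(n-1) + (n-1) * F(n-2), with F(1) = 1 and F(2) = 2
    if n <= 2:
        return n
    prev2, prev1 = 1, 2          # F(1), F(2)
    for k in range(3, n + 1):
        prev2, prev1 = prev1, prev1 + (k - 1) * prev2
    return prev1

print(count_pairings(3))   # 4: {1,2,3 single}, {12,3}, {13,2}, {23,1}
print(count_pairings(4))   # 10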

Wednesday, September 13, 2017

Today we continue reviewing U-SQL. It unifies the benefits of SQL with the expressive power of your own code. This is said to work very well with all kinds of data stores - file, object and relational. U-SQL works on the Azure ecosystem, which involves the Azure Data Lake storage as the foundation and the analytics layer over it. The benefit of the Azure storage is that it spans several kinds of data formats and stores.
U-SQL is a data processing language. Like SQL, it has a declarative language and procedural control flow. The difference between imperative and declarative is that imperative means we tell the system how to get the data, while declarative means we tell the system what we want and the system finds a way to get it; the optimizer decides how to do it best. The difference between procedural and functional is that procedural means it operates by changing persistent state with each expression, while functional means it transforms input into output with no persistent state. Single objects require control flow and may require parallelism to be made explicit. With the data model, we can use higher level data abstractions. There is scale-up versus scale-out. Programming languages require data in a store and do not make it part of the language model; data and logic go together, and they are imperative and procedural. Data processing languages, on the other hand, make data part of the language model. They can be both declarative and functional. The data can evolve independently of the application, and the optimizer decides the parallelism and synchronization.
Courtesy U-SQL slide shares

My take on query improvements : https://1drv.ms/w/s!Ashlm-Nw-wnWsFqBcG-mBhjPLbC8

#codingexercise

If we have a matrix where each cell has a cost to travel and we can only move right or down, what is the minimum cost to travel from the top left of the matrix to the bottom right corner?

We keep track of the row and column index. The cell values are the unit displacement costs.
Therefore, starting from the ending position, we can work our way backwards to the start using the path that gives the minimum cost:
new cost = cell value + minimum of the recursive cost at the position to the left and the recursive cost at the position above
if we go out of the board, the recursive cost is infinite
if we reach the start position, the recursive cost is the value of the cell at the start position.

When movement is allowed in any direction, the grid can contain cycles, so a shortest-path method such as Dijkstra's algorithm is used instead of this simple recursion.
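A memoized sketch of the right/down version, where cost(r, c) is the minimum cost of reaching cell (r, c) from the top left:

from functools import lru_cache

def min_cost_path(grid):
    rows, cols = len(grid), len(grid[0])

    @lru_cache(maxsize=None)
    def cost(r, c):
        if r < 0 or c < 0:
            return float('inf')          # off the board
        if r == 0 and c == 0:
            return grid[0][0]            # the start position
        # the cell was entered either from the left or from above
        return grid[r][c] + min(cost(r, c - 1), cost(r - 1, c))

    return cost(rows - 1, cols - 1)

print(min_cost_path([[1, 3, 1],
                     [1, 5, 1],
                     [4, 2, 1]]))        # 7 via 1 -> 3 -> 1 -> 1 -> 1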

Tuesday, September 12, 2017

Today we continue reviewing U-SQL. It unifies the benefits of SQL with the expressive power of your own code. This is said to work very well with all kinds of data stores - file, object and relational. U-SQL works on the Azure ecosystem, which involves the Azure Data Lake storage as the foundation and the analytics layer over it. The benefit of the Azure storage is that it spans several kinds of data formats and stores.
U-SQL supports big data analytics, which generally has the characteristics that it requires processing of any kind of data, allows the use of custom algorithms, and scales to any size efficiently.
This lets queries be written for a variety of big data analytics. In addition, it supports SQL for big data, which allows querying over structured data. It also enables scaling and parallelization.
U-SQL has declarative SQL and can execute local and remote queries.
If we look at the pattern of separating the query from the data source, we quickly see it's no longer just a consolidation of data sources. It also pushes the query down to the data sources and thus can act as a translator. Projections, filters and joins can now take place where the data resides. This was a design decision that came from the need to support heterogeneous data sources. Moreover, it gives the user a consistent, unified view of the data.
The improvements in the query language include user-defined extractors, user-defined outputters, user-defined processors, user-defined appliers, user-defined combiners and user-defined reducers. It is scaled out with explicit U-SQL syntax such as extract, output, cross apply, process, combine and reduce. These show the additions over T-SQL.
UDOs can be written in any .NET language and registered as assemblies.
My take on query improvements : https://1drv.ms/w/s!Ashlm-Nw-wnWsFqBcG-mBhjPLbC8
#codingexercise
Given a room and a robot that can move in one of four directions, find the size of the room.
find the start by walking left until the robot reaches a wall
from the start, walk up along the wall to reach the top left corner
then walk the boundary of the room, noting the four corners
use the corners to calculate the width and the length, as sketched below
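A rough sketch assuming a hypothetical robot API with a try_move(direction) probe that moves one unit and returns False when a wall blocks it (the API and the SimulatedRobot below are illustrative assumptions, not part of the original exercise):

def measure_room(robot):
    while robot.try_move('left'):       # walk left until the wall
        pass
    while robot.try_move('up'):         # then up to reach the top-left corner
        pass
    width = 1
    while robot.try_move('right'):      # walk the top edge to measure the width
        width += 1
    height = 1
    while robot.try_move('down'):       # walk the right edge to measure the height
        height += 1
    return width, height

class SimulatedRobot:
    # a toy stand-in for testing: a 6 x 4 room with the robot somewhere inside
    def __init__(self, width=6, height=4, x=2, y=1):
        self.w, self.h, self.x, self.y = width, height, x, y
    def try_move(self, d):
        dx, dy = {'left': (-1, 0), 'right': (1, 0), 'up': (0, -1), 'down': (0, 1)}[d]
        nx, ny = self.x + dx, self.y + dy
        if 0 <= nx < self.w and 0 <= ny < self.h:
            self.x, self.y = nx, ny
            return True
        return False

print(measure_room(SimulatedRobot()))   # (6, 4)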

Monday, September 11, 2017

Today we continue reviewing U-SQL. It unifies the benefits of SQL with the expressive power of your own code. This is said to work very well with all kinds of data stores - file, object and relational. U-SQL works on the Azure ecosystem, which involves the Azure Data Lake storage as the foundation and the analytics layer over it. The Azure analytics layer consists of both HDInsight and Azure Data Lake Analytics (ADLA), which target data differently. HDInsight works on managed Hadoop clusters and allows developers to write map-reduce with open source. ADLA is native to Azure and enables C# and SQL over job services. We will also recall that Hadoop was inherently batch processing while the Microsoft stack allowed streaming as well. The benefit of the Azure storage is that it spans several kinds of data formats and stores. ADLA has several other advantages over the managed Hadoop clusters in addition to working with a store for the universe. It enables limitless scale and enterprise grade capabilities with easy data preparation. ADLA is built on Apache YARN, scales dynamically and supports a pay-by-query model. It supports Azure AD for access control, and U-SQL allows programmability like C#.
U-SQL supports big data analytics, which generally has the characteristics that it requires processing of any kind of data, allows the use of custom algorithms, and scales to any size efficiently.
This lets queries be written for a variety of big data analytics. In addition, it supports SQL for big data, which allows querying over structured data. It also enables scaling and parallelization. While Hive supported HiveQL, the Microsoft Sqoop connector enabled SQL over big data, and Apache Calcite became a SQL adapter, U-SQL seems to improve the query language itself. It can unify querying over structured and unstructured data. It has declarative SQL and can execute local and remote queries. It increases productivity and agility. It brings in features from T-SQL, Hive SQL, and SCOPE, which has been Microsoft's internal big data language. U-SQL is extensible and it can be extended with C# and .NET.
If we look at the pattern of separating the query from the data source, we quickly see it's no longer just a consolidation of data sources. It also pushes the query down to the data sources and thus can act as a translator. Projections, filters and joins can now take place where the data resides. This was a design decision that came from the need to support heterogeneous data sources. Moreover, it gives the user a consistent, unified view of the data.

Courtesy : U-SQL slideshare
#codingexercise
Given two strings, find if one string is a subsequence of the other.
solution:
check the lengths of the two strings.
If the subsequence candidate is of length zero, return true
if the input to be matched against is of length zero, return false
If the last characters of both strings match, recurse with the length decremented in both
else recurse with the length decremented in the input only.
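The recursion spelled out in Python:

def is_subsequence(sub, s):
    if len(sub) == 0:
        return True                     # an empty string is a subsequence of anything
    if len(s) == 0:
        return False
    if sub[-1] == s[-1]:
        return is_subsequence(sub[:-1], s[:-1])   # last characters match: shrink both
    return is_subsequence(sub, s[:-1])            # otherwise shrink only the input

print(is_subsequence("ace", "abcde"))   # True
print(is_subsequence("aec", "abcde"))   # False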

Sunday, September 10, 2017

Today we start reviewing U-SQL. It unifies the benefits of SQL with the expressive power of your own code. This is said to work very well with all kinds of data stores - file, object and relational. U-SQL works on the Azure ecosystem, which involves the Azure Data Lake storage as the foundation and the analytics layer over it. The Azure analytics layer consists of both HDInsight and Azure Data Lake Analytics (ADLA), which target data differently. HDInsight works on managed Hadoop clusters and allows developers to write map-reduce with open source. ADLA is native to Azure and enables C# and SQL over job services. We will also recall that Hadoop was inherently batch processing while the Microsoft stack allowed streaming as well. The benefit of the Azure storage is that it spans several kinds of data formats and stores. ADLA has several other advantages over the managed Hadoop clusters in addition to working with a store for the universe. It enables limitless scale and enterprise grade capabilities with easy data preparation. ADLA is built on Apache YARN, scales dynamically and supports a pay-by-query model. It supports Azure AD for access control, and U-SQL allows programmability like C#.
U-SQL supports big data analytics, which generally has the characteristics that it requires processing of any kind of data, allows the use of custom algorithms, and scales to any size efficiently.
This lets queries be written for a variety of big data analytics. In addition, it supports SQL for big data, which allows querying over structured data. It also enables scaling and parallelization. While Hive supported HiveQL, the Microsoft Sqoop connector enabled SQL over big data, and Apache Calcite became a SQL adapter, U-SQL seems to improve the query language itself. It can unify querying over structured and unstructured data. It has declarative SQL and can execute local and remote queries. It increases productivity and agility. It brings in features from T-SQL, Hive SQL, and SCOPE, which has been Microsoft's internal big data language. U-SQL is extensible and it can be extended with C# and .NET.
Courtesy : U-SQL slideshare
#codingexercise
Count binary strings of length n in which adjacent set bits appear exactly k times:
Given the length n and the required number k of adjacent set-bit pairs, we solve it recursively by conditioning on the last bit (a DP sketch follows the steps):
1) if n == 1, a single bit cannot form an adjacent pair, so the count is 2 when k == 0 and 0 otherwise
2) if k < 0 or k > n - 1, return no count
3) if the string ends with 0, no new pair is formed: add the count for length n - 1 with k adjacent pairs, for either last bit
4) else (the string ends with 1):
          if the previous bit is 0, add the count for length n - 1 ending in 0 with k adjacent pairs
          if the previous bit is 1, add the count for length n - 1 ending in 1 with k - 1 adjacent pairs
5) return the count
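The same recurrence can be tabulated bottom-up. A minimal C# sketch, where dp[i, j, b] counts the length-i strings with exactly j adjacent "11" pairs whose last bit is b (names are illustrative):

static long CountStrings(int n, int k)
{
    if (n <= 0 || k < 0 || k > n - 1) return 0;
    var dp = new long[n + 1, k + 1, 2];
    dp[1, 0, 0] = 1;   // the string "0"
    dp[1, 0, 1] = 1;   // the string "1"
    for (int i = 2; i <= n; i++)
    {
        for (int j = 0; j <= k; j++)
        {
            // Appending a 0 never creates a new adjacent pair.
            dp[i, j, 0] = dp[i - 1, j, 0] + dp[i - 1, j, 1];
            // Appending a 1 creates a pair only when the previous bit was also 1.
            dp[i, j, 1] = dp[i - 1, j, 0] + (j > 0 ? dp[i - 1, j - 1, 1] : 0);
        }
    }
    return dp[n, k, 0] + dp[n, k, 1];
}

For example, CountStrings(5, 2) counts length-5 strings such as 11100 and 01110 that contain exactly two adjacent pairs of set bits.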

Saturday, September 9, 2017

We were reviewing the Bing Maps API.
Bing Maps helps you visualize spatial data. Users love to know where a store is or how to get there. And the data does not always need to come from Bing; it can be overlaid on the map. As long as the data has geographical coordinates, it can be used with the maps. Previously this data was sent over as XML; now there is support for a more succinct format, GeoJson.
Recently Bing Maps announced support for drag and drop of GeoJson format data over maps. When an application is written to use Bing Maps, it can load the maps and the GeoJson module. It then listens for drag and drop events. With the FileReader API, each GeoJson file dropped on the map is read and overlaid.
The steps involved for the application include :
Using HTML5 to load the local files
Select the files from a form input
Each file contains a collection of features where each feature is a location.
The location list is then overlaid on the map.
Load the GeoJson module:
Microsoft.Maps.loadModule('Microsoft.Maps.GeoJson', function () {
    // Set up the drag & drop listeners on the map container.
    var dropZone = document.getElementById('myMap');
    dropZone.addEventListener('dragover', handleDragOver, false);
    dropZone.addEventListener('drop', handleFileSelect, false);
});

// Inside the drop handler, once a FileReader has read the dropped file's text:
var shapes = Microsoft.Maps.GeoJson.read(geoJsonText);
myMap.entities.push(shapes);
#codingexercise
Prune a binary tree of root-to-leaf paths whose sum is less than k.
We use a recursive post-order traversal and eliminate the leaves on paths whose sum is less than k; nodes higher up may then become leaves and are pruned the same way (a sketch follows).
Traverse the left subtree and get the best (maximum) root-to-leaf sum on the left.
Traverse the right subtree and get the best root-to-leaf sum on the right.
If the maximum of the left and right sums, which already include the path sum down to the root, is less than k, delete the root and return null.

Otherwise return the root.
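A minimal C# sketch of this pruning, assuming a simple Node class with Val, Left and Right fields (the names are illustrative):

using System;

class Node { public int Val; public Node Left, Right; }

// sumSoFar is the path sum from the original root down to this node's parent;
// best returns the maximum root-to-leaf sum passing through this node.
static Node Prune(Node node, int k, int sumSoFar, out int best)
{
    if (node == null) { best = sumSoFar; return null; }

    int leftBest, rightBest;
    node.Left = Prune(node.Left, k, sumSoFar + node.Val, out leftBest);
    node.Right = Prune(node.Right, k, sumSoFar + node.Val, out rightBest);

    best = Math.Max(leftBest, rightBest);
    // If even the best path through this node falls short of k, drop the node;
    // nodes higher up may then become leaves and are checked the same way.
    return best < k ? null : node;
}

// Called as: root = Prune(root, k, 0, out _);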

Friday, September 8, 2017

Today we continue reviewing the Bing Maps API.
These APIs include map controls and services that can be used to make maps part of your application. These APIs are an authoritative source for geospatial features. They offer static and interactive maps, geocoding, routing and traffic data. Any spatial data such as store locations can be queried and stored with these APIs. Typically an account and a key are needed to use the APIs. The key serves as a license to utilize the services, and keys are classified by use for public websites, private websites and enterprise assets.
The Bing Maps dev center provides account management functionality. An account is needed to create a key and to manage data sources.
Bing Maps helps you visualize spatial data. Users love to know where a store is or how to get there. And the data does not always need to come from Bing; it can be overlaid on the map. As long as the data has geographical coordinates, it can be used with the maps. Previously this data was sent over as XML; now there is support for a more succinct format, GeoJson.
Recently Bing Maps announced support for drag and drop of GeoJson format data over maps.


#codingexercise
Prune a binary tree of root-to-leaf paths whose sum is greater than k.
We use a recursive post-order traversal and eliminate the leaves on paths whose sum is greater than k; nodes higher up may then become leaves and are checked the same way.
Traverse the left subtree of the root and prune it.
Traverse the right subtree of the root and prune it.
If the root is now a leaf and the path sum including it is greater than k, delete the root and return null.
Otherwise return the root (a sketch follows).
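This mirrors the earlier pruning sketch with the comparison flipped; a minimal C# version, again assuming a simple Node class with Val, Left and Right fields. Per the steps above, a node that becomes a leaf is kept when the path down to it still stays within k.

// sumSoFar is the path sum from the original root down to this node's parent.
static Node PruneOver(Node node, int k, int sumSoFar)
{
    if (node == null) return null;

    int sum = sumSoFar + node.Val;
    node.Left = PruneOver(node.Left, k, sum);
    node.Right = PruneOver(node.Right, k, sum);

    // A (possibly newly formed) leaf whose path sum exceeds k is deleted.
    if (node.Left == null && node.Right == null && sum > k)
        return null;
    return node;
}

// Called as: root = PruneOver(root, k, 0);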



Thursday, September 7, 2017

Today we continue reviewing the Bing Maps API.
These APIs include map controls and services that can be used to make maps part of your application. These APIs are an authoritative source for geospatial features. They offer static and interactive maps, geocoding, routing and traffic data. Any spatial data such as store locations can be queried and stored with these APIs. Typically an account and a key are needed to use the APIs. The key serves as a license to utilize the services, and keys are classified by use for public websites, private websites and enterprise assets.
The Bing Maps dev center provides account management functionality. An account is needed to create a key and to manage data sources.
The API options from Bing Maps platform can be listed as follows: 
1) The V8 Web Control: This is one of the most universal mapping controls available on the Bing Maps platform, with support for almost every type of browser and web application.
2) The Windows 10 Universal Windows Platform, which helps us build map apps for a variety of Windows devices
3) The Windows Presentation Foundation, which enables rich user experiences in desktop applications, including touch controls
4) The REST-based services that facilitate tasks such as geocoding, reverse geocoding, routing and static imagery
5) The spatial data services that offer three key functionalities: batch geocoding, point-of-interest data and the ability to store and expose spatial data. Imagine geographical data columns added to almost all points of interest on the world map.
6) The developer resources such as documentation on APIs and SDKs.

#codingexercise
Rearrange the characters of a string such that no two adjacent characters are the same.
Solution: Any data structure that stores the distinct letters along with their frequencies helps here, and ordering the letters by decreasing frequency is exactly what we need, so a priority queue or max-heap is a natural fit.
Build a priority queue of letters and their counts.
Take the most frequently occurring letter, write it down, decrement its count, and temporarily hold it out of the queue until the next iteration completes.
Repeat as above; if at some point no letter can be written without violating the constraint, report the input as invalid. Otherwise add the held-out letter and its remaining count back to the priority queue (a sketch follows).
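A minimal C# sketch of this greedy approach, assuming .NET 6's PriorityQueue (a min-heap, so counts are negated to pop the most frequent letter first); it returns null when no valid arrangement exists.

using System.Collections.Generic;
using System.Text;

static string Rearrange(string s)
{
    var counts = new Dictionary<char, int>();
    foreach (var ch in s)
        counts[ch] = counts.TryGetValue(ch, out var n) ? n + 1 : 1;

    var heap = new PriorityQueue<char, int>();
    foreach (var kv in counts)
        heap.Enqueue(kv.Key, -kv.Value);        // negate so the largest count is dequeued first

    var result = new StringBuilder();
    char held = '\0';
    int heldCount = 0;                          // the letter written last round, kept out of the heap

    while (heap.Count > 0)
    {
        heap.TryDequeue(out var c, out var negCount);
        result.Append(c);

        if (heldCount > 0)                      // re-admit the letter held out in the previous round
            heap.Enqueue(held, -heldCount);

        held = c;
        heldCount = -negCount - 1;
    }

    // If a count is still pending when the heap runs dry, the constraint cannot be met.
    return heldCount > 0 ? null : result.ToString();
}

For example, Rearrange("aab") can return "aba", while Rearrange("aaab") returns null.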

Wednesday, September 6, 2017

Today we continue reviewing the Bing Maps API.
These APIs include map controls and services that can be used to make maps part of your application. These APIs are an authoritative source for geospatial features. They offer static and interactive maps, geocoding, routing and traffic data. Any spatial data such as store locations can be queried and stored with these APIs. Typically an account and a key are needed to use the APIs. The key serves as a license to utilize the services, and keys are classified by use for public websites, private websites and enterprise assets.
The Bing Maps dev center provides account management functionality. An account is needed to create a key and to manage data sources.
The Bing Maps Services include REST services and spatial data services. The REST services provide geocoding, routing, elevation, imagery etc. and traffic incident information. The data services provide batch geocoding and spatial data source query and management.
The web controls available from Bing Maps include support for WPF as well as native libraries for Android and perhaps iOS. The geocoding API helps geocode a large batch of addresses, and the FindBy APIs help with querying.
The REST services automatically come with transaction accounting that helps determine billable and non-billable transactions.
The API options from Bing Maps platform can be listed as follows: 
1) The V8 Web Control: This is one of the most universal mapping controls available on the Bing Maps platform, with support for almost every type of browser and web application.
2) The Windows 10 Universal Windows Platform, which helps us build map apps for a variety of Windows devices
3) The Windows Presentation Foundation, which enables rich user experiences in desktop applications, including touch controls
4) The REST-based services that facilitate tasks such as geocoding, reverse geocoding, routing and static imagery
5) The spatial data services that offer three key functionalities: batch geocoding, point-of-interest data and the ability to store and expose spatial data. Imagine geographical data columns added to almost all points of interest on the world map.
6) The developer resources such as documentation on APIs and SDKs.

#codingexercise
Given a range [L, R], find the count of numbers having a prime number of set bits in their binary representation.
Solution:
Initialize a small set of primes up to the word size; since a 32-bit number has at most 32 set bits, the popcount can simply be checked against this set (or tested by trial division with these primes).
Iterate through the numbers from L to R inclusive.
Count the set bits of each number and check whether that count is prime.
Since the numbers are contiguous, the set-bit counts need not be recomputed from scratch each time; for example, count(n) = count(n >> 1) + (n & 1) reuses counts already computed for smaller numbers (a sketch of the direct approach follows).
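A minimal C# sketch of the direct count; the fixed prime set covers every possible popcount of a 32-bit number.

using System.Collections.Generic;

static int CountPrimeSetBits(int L, int R)
{
    var primes = new HashSet<int> { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31 };
    int count = 0;
    for (int n = L; n <= R; n++)
    {
        int bits = 0;
        for (int v = n; v != 0; v &= v - 1)    // Kernighan's trick: clear the lowest set bit
            bits++;
        if (primes.Contains(bits))
            count++;
    }
    return count;
}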