Thursday, September 28, 2017

Today we continue reviewing the whitepaper - the chutes and ladders of customer identity from Gigya, a leader in the Customer Identity and Access Management industry. The title uses an analogy of a well-known game. This is a whitepaper that introduces the delights and the pitfalls in customer experience with identity. We read about the personalization experience from the beginning and the need for a lighter touch. More visitors register themselves when they see value and an easy way in. Most of them already know the benefits of setting up an account, such as keeping track of history, making purchases with a single click, and so on. The effort involved in setting it up varies depending on whether it is done online or via the download of an app. The password-based login process can be replaced with techniques that are not only harder to hack but also enable behavior-driven, risk-based authentication, which can be used to combat fraud. Multiple factors of authentication can be added, including one-time passcodes. Also, customer data can be collected over time and the profile built up on a continuous basis. This data can also be shared or synchronized with downstream marketing, sales and service systems, subject to compliance with privacy and industry standards.
There is an interesting take on the CIAM solutions introduced earlier: they treat first-party customer data as cash. Its value is immediate and it is the most reliable. This is why CIAM solutions manage their own data while also allowing information to flow downstream to other parties. Transparency is required by virtue of the data, and integration between systems should be facilitated, but these systems recognize the importance of first-party data.
#codingexercise
count palindrome substrings in a string.
We mentioned the brute force way is to enumerate all substrings and check each for being a palindrome. This is similar to using each letter as the center of a palindrome candidate and expanding it in both directions, for both even and odd length palindromes.
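A minimal Python sketch of the center-expansion count; there are 2n-1 candidate centers, one per letter for odd lengths and one per gap for even lengths:

def count_palindromic_substrings(s):
    n, count = len(s), 0
    for center in range(2 * n - 1):          # 2n-1 centers: letters and gaps
        left, right = center // 2, (center + 1) // 2
        while left >= 0 and right < n and s[left] == s[right]:
            count += 1                        # each successful expansion is one more palindrome
            left -= 1
            right += 1
    return count

# count_palindromic_substrings("aaa") == 6  ("a", "a", "a", "aa", "aa", "aaa")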
An alternative would be to find disjoint longest palindromes in the string by expanding the string one character at a time. Care must be taken to count the inner palindromes of a palindrome. Finally, we know that the lengths of the palindromes found up to a given center are themselves palindromic around that center. Therefore we look for the next center based on the lengths we have found already. The suggestion is that the lengths array we construct will have values that mirror each other, much like the rows of Pascal's triangle.
Therefore the algorithm is:
Initialize the lengths array to the number of possible centers
Set the current center to the first center
Loop while the current center is valid:
       Expand to the left and right simultaneously until we find the largest palindrome around this center
       Fill in the appropriate entry in the longest palindrome lengths array
       Iterate through the longest palindrome array backwards and fill in the corresponding values to the right of the entry for the current center until an unknown value (because we don't have sufficient elements in the string to make an interpretation on the length) is encountered.
       Set the new center to the index of this unknown value
Return the lengths array
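These steps are in the spirit of Manacher's algorithm. Here is a minimal Python sketch of the standard formulation (the transform and helper names are mine): the string is padded so every palindrome has odd length, each entry p[i] is the palindrome radius around center i, and mirrored lengths are reused so we never re-scan what is already known.

def palindrome_lengths(s):
    t = "^#" + "#".join(s) + "#$"     # sentinels stop the expansion at the ends
    p = [0] * len(t)                  # p[i] = palindrome radius around center i in t
    center = right = 0                # center and right edge of the rightmost palindrome
    for i in range(1, len(t) - 1):
        if i < right:
            # reuse the mirrored length, clipped to the known window
            p[i] = min(right - i, p[2 * center - i])
        # expand past whatever the mirror could not guarantee
        while t[i + p[i] + 1] == t[i - p[i] - 1]:
            p[i] += 1
        if i + p[i] > right:          # this palindrome reaches further right: new center
            center, right = i, i + p[i]
    return p

def count_palindromic_substrings_fast(s):
    # a radius of k around a center contributes (k + 1) // 2 substrings
    return sum((k + 1) // 2 for k in palindrome_lengths(s))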
Since this problem of counting palindromes has overlapping subproblems over the ranges i to j of the string, we can also use memoization and dynamic programming. By using the lengths array and the Pascal's triangle observation, we can also get the counts directly.
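A minimal dynamic programming sketch over the ranges i to j:

def count_palindromes_dp(s):
    n = len(s)
    # pal[i][j] is True when the substring s[i..j] is a palindrome
    pal = [[False] * n for _ in range(n)]
    count = 0
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            # the ends must match, and the inner range must already be a palindrome
            if s[i] == s[j] and (length <= 2 or pal[i + 1][j - 1]):
                pal[i][j] = True
                count += 1
    return count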

Wednesday, September 27, 2017

Today we continue reviewing the whitepaper - the chutes and ladders of customer identity from Gigya, a leader in the Customer Identity and Access Management industry. The title uses an analogy of a well-known game. This is a whitepaper that introduces the delights and the pitfalls in customer experience with identity.
The chutes for customer identity are demonstrated by
1) annoying ads for past purchases
2) spamming of email inbox with unwanted newsletters
3) more setup and initiation activities prior to transaction
The ladders for customer identity are demonstrated by
1) personalization of ads done right
2) less frequent and more pertinent notifications
3) more integrated and seamless experience with fewer chores

Yesterday we read about the personalization experience from the beginning and the need for a lighter touch. Today we continue on the theme of making it easier for the customer. More visitors register themselves when they see value and an easy way in. Most of them already know the benefits of setting up an account, such as keeping track of history, making purchases with a single click, and so on. The effort involved in setting it up varies depending on whether it is done online or via the download of an app. The password-based login process can be replaced with techniques that are not only harder to hack but also enable behavior-driven, risk-based authentication, which can be used to combat fraud. Multiple factors of authentication can be added, including one-time passcodes. Also, customer data can be collected over time and the profile built up on a continuous basis. This data can also be shared or synchronized with downstream marketing, sales and service systems, subject to compliance with privacy and industry standards.
#codingexercise
Given a set of distinct strings, find the shortest superstring that contains each string in the given set as a substring. The distinct strings cannot be substrings of each other.
The method to solve this is as follows:
1) Make a copy of the array to begin with
2) Find the pair a and b that has the maximum overlap
3) Replace a and b with their merge: a concatenated with the non-overlapping suffix of b
4) The size of the array reduces by one; repeat steps 2 and 3 until we are left with only one superstring
Since we find the maximum overlap at each point, we make a greedy choice for the solution.
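A minimal Python sketch of this greedy merge (the helper names are mine, not from the source):

def overlap(a, b):
    # largest k such that the last k characters of a equal the first k of b
    best = 0
    for k in range(1, min(len(a), len(b)) + 1):
        if a[-k:] == b[:k]:
            best = k
    return best

def shortest_superstring(strings):
    words = list(strings)                       # work on a copy of the array
    while len(words) > 1:
        best_k, best_pair = -1, (0, 1)
        for i in range(len(words)):             # find the pair with maximum overlap
            for j in range(len(words)):
                if i != j:
                    k = overlap(words[i], words[j])
                    if k > best_k:
                        best_k, best_pair = k, (i, j)
        i, j = best_pair
        merged = words[i] + words[j][best_k:]   # merge a and b on the overlap
        words = [w for idx, w in enumerate(words) if idx not in (i, j)]
        words.append(merged)
    return words[0] if words else ""

# shortest_superstring(["abc", "cde"]) == "abcde"

Note that the greedy choice is a heuristic: it keeps the superstring short in practice but is not guaranteed to produce the shortest possible superstring for every input.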

one more coding exercise
count palindrome substrings in a string.
The brute force way is to enumerate all substrings and check each for being a palindrome.
An alternative would be to find disjoint longest palindromes in the string. Care must be taken to count the inner palindromes of a palindrome. Finally, we know that the lengths of the palindromes found up to a given center are themselves palindromic around that center. Therefore we look for the next center based on the lengths we have found already. The suggestion is that palindrome centers occur at Pascal-triangle-like positions in the string, so we check those positions, as opposed to every letter, to optimize.

Tuesday, September 26, 2017

Today we continue reviewing the whitepaper - the chutes and ladders of customer identity from Gigya, a leader in the Customer Identity and Access Management industry. The title uses an analogy of a well-known game. This is a whitepaper that introduces the delights and the pitfalls in customer experience with identity.
The chutes for customer identity are demonstrated by
1) annoying ads for past purchases
2) spamming of email inbox with unwanted newsletters
3) more setup and initiation activities prior to transaction
The ladders for customer identity are demonstrated by
1) personalization of ads done right
2) less frequent and more pertinent notifications
3) more integrated and seamless experience with fewer chores
Most Customer Identity and Access Management solutions need to track customers. They can only do this with user consent. While some show all the terms and conditions up front in a single overwhelming consent request, others choose to ask as and when needed, providing a lighter touch.
Similarly, the sign-up process can seem to take all day to get all the details fed into the system, while others merely delegate to existing registered partner sites, such as signing in with an email or chat application. Removing the password requirement is touted as one of the best improvements to customer experience, and consequently a lot of attention has been paid by the industry in forming the OpenID standard. What the standard leaves unsaid is how marketers can use it to their advantage, since it focuses on the integration between businesses and stores on behalf of the customer. A marketer would like to know:
whether the customer arrived via search, campaign or referral
what device they connected with
what was the tipping point that converted them to a customer
what transactions the customer attempted or executed
what are the recommendations that will interest the customer the most
how to make more money from an engaging and mutually beneficial experience for the customer
what partners and associates are willing to work with the marketers for this customer

#codingexercise
Get the max sum level of an integer binary tree.
We perform level wise traversal of the binary tree and serialize the integers encountered.
Then for every level, we add up the sum and return the level with the highest sum.
The level-wise traversal is done by enqueueing the left and the right children in a queue. The levels are demarcated by level indicators, which in our case could be a null node.
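A minimal Python sketch; instead of a null marker, this version demarcates levels by draining the queue one level's worth at a time, which is equivalent (the Node class here is assumed):

from collections import deque

class Node:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def max_sum_level(root):
    if root is None:
        return -1
    q = deque([root])
    level, best_level, best_sum = 0, 0, float("-inf")
    while q:
        level_sum = 0
        for _ in range(len(q)):              # drain exactly one level
            node = q.popleft()
            level_sum += node.val
            if node.left:
                q.append(node.left)
            if node.right:
                q.append(node.right)
        if level_sum > best_sum:             # remember the level with the highest sum
            best_sum, best_level = level_sum, level
        level += 1
    return best_level                        # zero-based level index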

Monday, September 25, 2017

Today we start reviewing the whitepaper - the chutes and ladders of customer identity from Gigya, a leader in the Customer Identity and Access Management industry. The title uses an analogy of a well-known game. This is a whitepaper that introduces the delights and the pitfalls in customer experience with identity. The narration is about an online store, with the recognition that many of the points mentioned in the whitepaper apply equally well to other businesses. Most customers start out on an online store as an unknown visitor. Businesses rely on sending the right message to the right person at the right time and through the right channels. This multi-channel marketing is a ladder in customer experience. Since customers start out from a variety of devices and applications, it is difficult to know what they want. Identity and preferences help mitigate this barrier. Some examples of interactions that are either annoying and frustrating or enjoyable and satisfying are mentioned here. A customer may buy a pair of jeans a few weeks earlier but may continue to see unwanted ads for the same jeans all over the internet. On the other hand, the same customer may see recommendations for other items that pair well with the jeans, much like having a personal fashion consultant. Another example is that of subscriptions to newsletters from the online store that become difficult to manage. On the other hand, a crisp, clear and infrequent newsletter may give relevant information and improve the experience. Similarly, when the user switches devices, he may lose the settings he had made earlier. On the other hand, his login may be integrated with social networking apps so that the settings become part of a public profile that easily follows the user between apps and devices. Therefore there is a trade-off in the customer touchpoints that can attract or spurn the customer. If the business were to capture information about the user as early as possible and in small pieces, a relationship can be established based on the user's preferences. This approach is known as "progressive identity". It offers promotions, coupons or newsletters to the customer without requiring full registration. It obtains consent from the customer throughout their life-cycle. It enables convenient and centralized profile and preference management. Progressive identity is the notion that a customer is not a boolean known-or-unknown for a store but an accumulation of transparent and engaging experiences. Customer tracking is required to personalize the customer's experience, but instead of displaying all the terms upfront, if consent can be managed as the customer progresses on the website, it can improve the experience. A lightweight registration, such as an email address in exchange for valuable and personalized content, will be preferred by the customer. The customer will be enticed by the content to sign up for a full account. The easier we make this registration and the more augmented the authentication methods, the better the customer experience.
#codingexercise
Convert a BST to a min heap
Traverse the BST in InOrder to get a sorted list
Heapify the sorted list, treated as one-based, using:
the left child of the element at index i is the element at index 2i, if that index is within range
the right child of the element at index i is the element at index 2i+1, if that index is within range
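A minimal sketch of one way to do this, assuming we may overwrite the node values in place: writing the sorted keys back in level order satisfies the min-heap ordering because every parent is dequeued before its children (the Node class is assumed, and the original tree shape is kept):

from collections import deque

class Node:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def bst_to_min_heap(root):
    # 1) in-order traversal of a BST yields the keys in sorted order
    ordered = []
    def inorder(node):
        if node:
            inorder(node.left)
            ordered.append(node.val)
            inorder(node.right)
    inorder(root)
    # 2) write the sorted keys back in level order; a parent is always
    #    dequeued before its children, so it receives a smaller key
    it = iter(ordered)
    q = deque([root] if root else [])
    while q:
        node = q.popleft()
        node.val = next(it)
        if node.left:
            q.append(node.left)
        if node.right:
            q.append(node.right)
    return root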

Sunday, September 24, 2017

We continue to review the slides from Stanford that introduce Natural Language Processing via Vector Semantics. We said that vector representation is useful and opens up new possibilities. We saw that a lookup such as a thesaurus does not help. We were first reviewing co-occurrence matrices. These come in many forms, such as the term-document matrix, word-word matrix, word-context matrix, etc. The term-document matrix is a count of word w in a document d. Each document therefore becomes a count vector. The similarity between words in this case merely indicates that their occurrences are similar. If we change the scope from documents to some other text boundary, we have the word-word matrix. The similarity in this case improves over that in the term-document matrix. A word-context matrix improves this further because it describes the word in terms of context, which is closer to its meaning and brings semantic similarity.
Instead of co-occurrence, we now consider a different correlation between two words, called positive pointwise mutual information. Pointwise mutual information indicates whether two words occur together more than they would if they were independent. It is defined as the log of their joint probability divided by the product of their individual probabilities. This value can come out to be negative or positive. The negative values are not helpful because they are of little significance: the probabilities are small and the result is on the order of inverse powers of ten, which does not convey meaning via significance. Unrelatedness is also not helpful to our analysis. On the other hand, positive PMI is helpful to discern whether two words are likely to appear together. We therefore take the PMI only when it turns out to be positive. Computing the PPMI is easy to do with the probabilities, which are based on cumulative word co-occurrence counts.
Today we realize that PMI needs to be weighted, because it is biased toward infrequent events. If the words are rare, they have a high PMI value. This skews the results when the text has rare words. In order to make it fair, the PMI is weighted. This can be achieved either by raising the context probabilities or with add-one smoothing. In the first case, the probability of a rare context is raised to the power alpha = 0.75; in the second case, an appropriate smoothing constant is added to the counts in the numerator when calculating the probability and applied to the uniform probability as well.
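A minimal numpy sketch of PPMI with the alpha-weighted context probabilities described above (the function name is mine):

import numpy as np

def ppmi(counts, alpha=0.75):
    # counts: words x contexts co-occurrence count matrix
    total = counts.sum()
    p_wc = counts / total                         # joint probabilities
    p_w = counts.sum(axis=1, keepdims=True) / total
    ctx = counts.sum(axis=0)
    p_c = ctx ** alpha / (ctx ** alpha).sum()     # raise context probabilities to alpha
    with np.errstate(divide="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))
    return np.maximum(pmi, 0)                     # keep only the positive values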
#codingexercise
Find all the heroes and the superhero in an integer array. The heroes are the elements which are greater than all the elements to their right. The superhero is the element which is greater than all the elements to its left and right.
This problem can be solved by traversing the array from right to left while keeping track of the current maximum seen so far. Any element that exceeds the current maximum is greater than everything to its right, so it satisfies the criteria for being a hero and becomes the new current maximum. Any element that does not exceed the current maximum is not a hero. The final value of the current maximum is the overall maximum, which is also greater than everything to its left, making it the superhero.
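A minimal sketch of the right-to-left scan:

def heroes_and_superhero(arr):
    heroes, cur_max = [], float("-inf")
    # scan right to left: anything larger than all elements to its right is a hero
    for x in reversed(arr):
        if x > cur_max:
            heroes.append(x)
            cur_max = x
    heroes.reverse()                 # restore left-to-right order
    # the overall maximum beats everything on both sides, provided it is unique
    superhero = cur_max if arr and arr.count(cur_max) == 1 else None
    return heroes, superhero

# heroes_and_superhero([3, 1, 2]) == ([3, 2], 3)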

Saturday, September 23, 2017

We continue to review the slides from Stanford that introduce Natural Language Processing via Vector Semantics. We said that vector representation is useful and opens up new possibilities. We saw that a lookup such as a thesaurus does not help. We were first reviewing co-occurrence matrices. These come in many forms, such as the term-document matrix, word-word matrix, word-context matrix, etc. The term-document matrix is a count of word w in a document d. Each document therefore becomes a count vector. The similarity between words in this case merely indicates that their occurrences are similar. If we change the scope from documents to some other text boundary, we have the word-word matrix. The similarity in this case improves over that in the term-document matrix. A word-context matrix improves this further because it describes the word in terms of context, which is closer to its meaning and brings semantic similarity.
Some of these matrices can be very sparse, with zeros covering a majority of the cells. This is quite alright, since there are efficient algorithms for sparse matrices. Similarly, the size of the window can be adjusted: the shorter the window, the more syntactic the representation; the longer the window, the more semantic the representation. Instead of co-occurrence, we now consider a different correlation between two words, called positive pointwise mutual information. Raw word frequency suffers from being skewed by frequent but less salient words. Pointwise mutual information indicates whether two words occur together more than they would if they were independent. It is defined as the log of their joint probability divided by the product of their individual probabilities. This value can come out to be negative or positive. The negative values are not helpful because they are of little significance: the probabilities are small and the result is on the order of inverse powers of ten, which does not convey meaning via significance. Unrelatedness is also not helpful to our analysis. On the other hand, positive PMI is helpful to discern whether two words are likely to appear together. We therefore take the PMI only when it turns out to be positive. Computing the PPMI is easy to do with the probabilities, which are based on cumulative word co-occurrence counts.
#codingexercise
Count the number of islands in a sparse matrix.
This problem is similar to finding connected components in a graph. In a 2D matrix, every cell has up to eight neighbors. A depth first search explores all eight neighbors. When a cell is visited, it is marked so it is not included in the next traversal. The number of depth first searches initiated equals the number of islands, i.e., the number of connected components.
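A minimal Python sketch using an explicit stack for the depth first search; land cells are nonzero:

def count_islands(grid):
    if not grid:
        return 0
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    def dfs(r, c):
        stack = [(r, c)]
        while stack:
            i, j = stack.pop()
            if 0 <= i < rows and 0 <= j < cols and grid[i][j] and not seen[i][j]:
                seen[i][j] = True            # mark the cell so it is not revisited
                for di in (-1, 0, 1):        # push all eight neighbors
                    for dj in (-1, 0, 1):
                        if di or dj:
                            stack.append((i + di, j + dj))
    count = 0
    for i in range(rows):
        for j in range(cols):
            if grid[i][j] and not seen[i][j]:
                count += 1                   # each new search is one island
                dfs(i, j)
    return count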

Friday, September 22, 2017

We continue to review the slides from Stanford that introduce Natural Language Processing via Vector Semantics. We said that vector representation is useful and opens up new possibilities. We saw that a lookup such as a thesaurus does not help.
Stanford NLP has shown there are four kinds of vector models.
A sparse vector representation, where a word is represented in terms of its co-occurrences with other words, using a set of weights for those co-occurrences. This weight is usually based on a metric called mutual information.
A dense vector representation that involves latent semantic analysis, a neural net, or clusters from the Brown corpus. The dense vector representations share the idea of representing a word as a vector of numbers, translating each word into a corresponding vector in the vector space. This is called an embedding.
Co-occurrence matrices come in many forms, such as the term-document matrix, word-word matrix, word-context matrix, etc. The term-document matrix is a count of word w in a document d. Each document therefore becomes a count vector. The similarity between words in this case merely indicates that their occurrences are similar. If we change the scope from documents to some other text boundary, we have the word-word matrix. The similarity in this case improves over that in the term-document matrix. A word-context matrix improves this further because it describes the word in terms of context, which is closer to its meaning and brings semantic similarity.
Co-occurrence between two words has two forms - first order and second order. First-order co-occurrence is syntagmatic association and second-order co-occurrence is paradigmatic association, which means the first is based on positions whereas the second is based on similar neighbors. Note that the vectorization derives from the usage of words, which is why it has become popular. Another way to look at usage is to canonicalize the text into an Esperanto-like interlingua where the relations and syntax are more oriented toward natural language processing. Some work has already begun with different kinds of improvements to ontologies that are not restricted to a thesaurus or WordNet, such as FrameNet. All we need to keep in mind here is that there are layers to tackle the problem - usage, vector space, and classification of vectors.