Monday, September 26, 2016

Today we look at an interesting problem space of metrics reporting.
Almost all application owners want an understanding of their resource usage. These resource usage metrics may apply across clouds, per cloud, or even be as granular as per service. For example, we may have a microservice that provisions a single resource for the application. It may have APIs for create, update and delete that return different status codes. A simple metric may be the success rate in terms of the total calls made, where success is determined by the number of HTTP success status codes returned across all the API calls. Since this kind of metric comes from individual services, the application owner may want to see these metrics across all services. This is best shown in a dashboard with drill-downs to different services.
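To make the example concrete, here is a minimal sketch of that success-rate metric computed from a list of HTTP status codes; the method name and the assumption that the codes come from the service's own logs or instrumentation are illustrative:

double SuccessRate(List<int> statusCodes)
{
    if (statusCodes.Count == 0) return 0.0;
    int successes = 0;
    foreach (var code in statusCodes)
        if (code >= 200 && code < 300) // count the HTTP 2xx family as success
            successes++;
    return (double)successes / statusCodes.Count;
}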

It's important to note that metrics data is inherently time based. Although metrics can be spot measurements, cumulative over time or deltas between time slices, they have an association with the time at which they were measured. Consequently, most metric data is stored in a time series database. Also, previous measurements can be summarized with statistics while the running measurements apply only to the current window of time. Therefore, some of these time-series databases can even be fixed-size.
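As an illustration of such a fixed-size store, here is a minimal sketch of a ring buffer that keeps only the most recent measurements for the current window while summarizing everything older with a running statistic; the capacity and the choice of statistics are assumptions, not a prescription:

class FixedSizeMetricWindow
{
    private readonly (DateTime time, double value)[] buffer;
    private int next, count;
    public double RunningSum { get; private set; } // summary statistic over everything ever added

    public FixedSizeMetricWindow(int capacity) { buffer = new (DateTime, double)[capacity]; }

    public void Add(DateTime time, double value)
    {
        buffer[next] = (time, value);          // overwrite the oldest slot once the buffer is full
        next = (next + 1) % buffer.Length;
        if (count < buffer.Length) count++;
        RunningSum += value;
    }

    public double WindowAverage()              // statistic over the current window only
    {
        double sum = 0;
        for (int i = 0; i < count; i++) sum += buffer[i].value;
        return count == 0 ? 0.0 : sum / count;
    }
}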

There are many different tools available to present such charts from data sources. Almost all such reporting mechanisms want to pull data for their charts. However, they cannot directly call the services to query their metrics because this may affect the functionality and performance of those services. Therefore, they do it passively, by finding information from log trails and database states. Most services have logging configured, and their logs are usually indexed in a time series database of their own. Consequently, the pulled information can come from these logs.

The Grafana stack is one such custom reporting solution. It presents beautiful dashboards with immense flexibility to add and configure different charts from several time series data sources. As long as the metrics data makes its way to a data source configured in Grafana, such as Graphite, InfluxDB or Prometheus, the reporting solution is a breeze.

Generally, this information flow into the reporting database needs to be automated. In the example cited above, the data would be pulled from the log index server and pushed into, say, Graphite. A collection-statistics daemon usually works very well for periodically pulling information from wherever it is available. If the daemon has plugins for custom reporting data sources such as Graphite, or for full-service monitoring solutions such as Nagios or Zabbix, then the data automatically makes its way to charts for the end user.
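For instance, if the destination is Graphite, the push side of such a daemon can be as small as writing a line in Graphite's plaintext protocol ("metric.path value unix-timestamp") to port 2003. A minimal sketch, with an illustrative host name and metric path:

void PushToGraphite(string host, string metricPath, double value)
{
    long timestamp = DateTimeOffset.UtcNow.ToUnixTimeSeconds();
    string line = metricPath + " " + value + " " + timestamp + "\n";
    using (var client = new System.Net.Sockets.TcpClient(host, 2003)) // 2003 is Graphite's plaintext port
    using (var stream = client.GetStream())
    {
        var bytes = System.Text.Encoding.ASCII.GetBytes(line);
        stream.Write(bytes, 0, bytes.Length);
    }
}

// illustrative call; the host and metric path are assumptions:
// PushToGraphite("graphite.example.com", "myapp.provisioner.success_rate", 0.98);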

Since there are different technologies with overlapping functionality, and even more vendors for those technologies, in any IT deployment the team often has to decide how the information flows. Sometimes this is straightforward with an out-of-the-box mechanism from purchased or relicensed software; other times it requires automation to route the data from existing data sources that can neither be replaced, because of operational dependencies, nor be configured to send data directly to the charts. It is in this case that a collection layer may need to be written.

Billing and Monitoring are two aspects of IT operations that often depend on this very same metric data. Therefore, the software products for those operations may very well support reporting out of the box.

This goes to show that there are many ways to tackle the same metrics-to-charts data flow, but the rules of thumb of 1) reusing existing infrastructure and 2) keeping customization, be it automation or dashboards, as minimal as necessary hold true in many contexts.

#codinginterview
In yesterday's post we discussed decoding digits as letters from a stream. It was batch processing. Here we optimize it:
void decodeStreamOptimized(Stream numbers, ref List<string> results)
{
    var last = int.MaxValue; // sentinel: no previous digit yet
    var current = numbers.Read();
    while (current != EOF)
    {
        // decode only the sliding window made of the last digit and the current digit
        var digits = (last == int.MaxValue)
            ? new List<int>() { current.ToInteger() }
            : new List<int>() { last, current.ToInteger() };
        var s = new StringBuilder();
        decode(digits, digits.Count, ref s, ref results);
        // separate the levels between consecutive evaluations
        results.Add(null);
        last = current.ToInteger();
        current = numbers.Read();
    }
}
The result in each evaluation round is a single character, denoted z if the current digit is taken independently or z' if the last and current digits are taken together
Consecutive results R1 and R2 are therefore stitched together as one of two binary selections:
R1zR2z or R2z'

Another way to optimize would be to take a large window and translate before moving to the next.

Sunday, September 25, 2016

We continue our discussion on statistical analysis in Natural Language Processing
Word2vec uses the continuous bag of words model, which predicts a word based on the surrounding words, and the skip-gram model, which predicts the surrounding words based on the current word. The skip-gram model expresses the conditional probability of a context c given a word w, with parameters theta, using the softmax function:
p(c | w; theta) = exp(v'_c . v_w) / sum over all contexts c' in the vocabulary of exp(v'_c' . v_w)
The denominator in the softmax function is this sum of exponentiated dot products ranging over the entire vocabulary, where v_w and v'_w are the input and output vector representations of word w in vocabulary W. This is a heavily optimized expression, but the idea is that each word tries to predict every word around it. Further research by Levy and Goldberg tried to predict the word not based on surrounding words alone but based on context, where context is not just words but anything that contributes to context.
The denominator is also rather large: a summation over hundreds of thousands of contexts using the high-dimensional vectors for each input and output vector representation of word w in vocabulary W is too expensive to compute. While Word2Vec optimizes this with hierarchical softmax, the stated assumption is that selecting a context for a word such that it scores better than, say, the expected value of all the others results in good word embeddings. This follows from the belief that the improvement comes because similar words have similar vectors. Intuitively, this gives us the notion of a thesaurus. People use different words to mean the same thing, and their purports often manifest similarity in the tightness between the word and its context in the sampling window.
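To make the expense of the denominator tangible, here is a minimal sketch of the skip-gram softmax for one (word, context) pair; the vector layout is an assumption, and the full sum over the output vectors of the vocabulary is exactly the part that hierarchical softmax and negative sampling avoid:

double SkipGramSoftmax(double[] vInput, double[][] vOutputs, int contextIndex)
{
    double Dot(double[] a, double[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++) sum += a[i] * b[i];
        return sum;
    }

    // the denominator sums an exponentiated dot product over every word in the vocabulary
    double denominator = 0;
    foreach (var vOut in vOutputs)
        denominator += Math.Exp(Dot(vOut, vInput));

    return Math.Exp(Dot(vOutputs[contextIndex], vInput)) / denominator;
}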
#codingquestion
Given a stream of digits, evaluate the possible decodings if leading, trailing and consecutive zeros are excluded
For example, 121 => ABA , AU, LA
if A= 1, B =2, .. Z = 26

void decode(List<int> digits, int n, ref StringBuilder s, ref List<string> results)
{
    // ToLetter, RemoveLast and Reverse are helper routines assumed from the original post.
    // Digits are consumed from the end, so the accumulated letters are reversed on output.
    if (n == 0) { Console.Write(s); results.Add(s.Reverse()); return; }
    if (n == 1) { s.Append(ToLetter(digits, n - 1, 1)); Console.Write(s); results.Add(s.Reverse()); s.RemoveLast(); return; }
    if (digits[n - 1] > 0)
    {
        s.Append(ToLetter(digits, n - 1, 1));      // take the last digit by itself
        decode(digits, n - 1, ref s, ref results);
        s.RemoveLast();
    }
    if (digits[n - 2] == 1 || (digits[n - 2] == 2 && digits[n - 1] < 7))
    {
        s.Append(ToLetter(digits, n - 2, 2));      // take the last two digits together
        decode(digits, n - 2, ref s, ref results); // recurse on n - 2, not n
        s.RemoveLast();
    }
}

void decodeStream(Stream numbers, ref List<string> results)
{
    var digits = new List<int>();
    var current = numbers.Read();
    while (current != EOF)
    {
        digits.Add(current.ToInteger());
        var s = new StringBuilder();
        decode(digits, digits.Count, ref s, ref results);
        current = numbers.Read();
    }
}


Convert a BST into an inorder singly linked list in place.
Node toList(Node root)
{
    if (root == null) return root;
    var head = root;
    if (root.left != null)
    {
        // flatten the left subtree and splice its tail onto root
        var left = toList(root.left);
        head = left;
        while (left.right != null) left = left.right;
        left.right = root;
    }
    if (root.right != null)
    {
        // flatten the right subtree and hang its head off root (recurse on root.right, not root)
        root.right = toList(root.right);
    }
    return head; // the leftmost node is the head of the inorder list
}
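For a quick sanity check, a minimal usage sketch, assuming a Node with data, left and right fields as used throughout these posts:

class Node
{
    public int data;
    public Node left, right;
    public Node(int d) { data = d; }
}

void Demo()
{
    var root = new Node(2) { left = new Node(1), right = new Node(3) };
    for (var head = toList(root); head != null; head = head.right)
        Console.Write(head.data + " "); // prints 1 2 3
}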

Saturday, September 24, 2016

We continue our discussion on statistical analysis in Natural Language Processing
Levy and Goldberg's paper introduced a departure from Word2Vec in how words and contexts are chosen. They introduced negative sampling, which is the reverse of sampling contexts for words. In the latter case, we were picking, say, a handful of generalizations that we argued were contexts and mapped them to the word in the word-context matrix. Now, for the same number of contexts, we choose others at random that we know are not contexts. The assumption is that the randomly chosen contexts will be the unobserved ones. This means that the negative and the positive samples are selected so that they are as disjoint as possible. Furthermore, the quality of the positive selections improves as their probabilities are higher.

The purpose of negative sampling is to minimize the probability that (word, context) pairs are chosen such that they all appear in the text and consequently have the same values for their vectors, which is not very helpful. By introducing random (word, context) pairs we now have a mix. This has the side effect that the (word, context) pairs chosen as positive samples will indeed be more representative of the text.
But where are these random (word, context) pairs chosen from? Do they come from the corpus or from within the text? Arguably, if a pair comes from outside the sampling window, it is sufficient.
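A minimal sketch of that sampling step under the assumption above: draw contexts uniformly from the vocabulary and keep only those outside the current window. (word2vec itself draws from a smoothed unigram distribution rather than uniformly, so the uniform draw here is a simplification.)

List<string> SampleNegatives(List<string> vocabulary, HashSet<string> window, int k, Random rng)
{
    var negatives = new List<string>();
    while (negatives.Count < k)
    {
        var candidate = vocabulary[rng.Next(vocabulary.Count)]; // uniform draw for simplicity
        if (!window.Contains(candidate))                        // keep only contexts outside the sampling window
            negatives.Add(candidate);
    }
    return negatives;
}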

Find the number of three edge cycles in an undirected graph.
We use depth first search to traverse the graph and generate paths.
DFS (V, E, paths)
  for each vertex v in V:
      v.color = white
      v.d = nil
  time = 0
  for each vertex v in V:
      if v.color == white:
          path = new Path()
          DFS-VISIT(V, E, v, path, ref paths)

DFS-VISIT (V, E, u, path, ref paths)
  time = time + 1
  u.d = time
  u.color = gray
  path.Add(u)
  for each vertex v adjacent to u:
      if v.color == white:
          DFS-VISIT(V, E, v, path, ref paths)
      else:
          if v.color == gray:   // back edge to an ancestor closes a cycle on the current path
              paths.Add(path)
  u.color = black
  time = time + 1
  u.f = time


filter(paths):
    var d = new Dictionary<string, int>()
    foreach path in paths:
        foreach vertex v in path:
            if v.next.next.next == v:              // the cycle closes after exactly three edges
                pathidentifier = stringify(v, 3)   // canonical identifier built from the three vertices
                if d.Contains(pathidentifier) == false:
                    d.Add(pathidentifier, 1)
    return d.Count                                 // the number of distinct three-edge cycles

This cycle detection does not work with breadth first traversal because the back edge test relies on discovery times and ancestry, which are defined only for a depth first traversal. That said, we can use either breadth first or depth first traversal to walk the graph and enumerate paths. Finding cycles from enumerated paths is easy because we look for paths that start from and return to the same vertex.
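As a cross-check, and assuming the graph is given as adjacency sets (a representation not used in the pseudocode above), three-edge cycles can also be counted directly from common neighbors per edge; each triangle is then seen once per edge, hence the division by three:

int countTriangles(Dictionary<int, HashSet<int>> adj)
{
    int total = 0;
    foreach (var u in adj.Keys)
        foreach (var v in adj[u])
        {
            if (u >= v) continue;                  // consider each undirected edge once
            foreach (var w in adj[u])
                if (w != v && adj[v].Contains(w))  // a common neighbor closes a triangle
                    total++;
        }
    return total / 3;                              // each triangle was counted once per edge
}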

Friday, September 23, 2016

We continue our discussion on Statistical framework in natural language processing.
The mutual information of two random variables is a measure of the mutual dependence between them. When the two random variables are taken together, we measure their joint distribution; otherwise we take them independently, which gives the marginal distributions. The pointwise measure for a pair of outcomes is log( p(X,Y) / (p(X) p(Y)) ). If the variables are independent, the numerator and denominator are the same and the value is zero.
Pointwise Mutual information or PMI refers to single events, whereas MI refers to the average of all possible events. MI is the expected value of PMI.
In natural language processing, this is useful because we know that words and contexts are dependent on each other. Therefore we can take the words as one marginal distribution and their contexts as another. This then gives us a matrix of PMIs, one for each cell in the word-context matrix.
This is a pretty large matrix and a sparse one, because words and contexts don't really co-occur in a large number of ways. Matrix factorization can be done with techniques such as Singular Value Decomposition, which we will go into later.
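A minimal sketch of filling that matrix from raw co-occurrence counts, where counts[w][c] is how often context c was observed with word w; the container choices are illustrative:

Dictionary<(string, string), double> PmiMatrix(Dictionary<string, Dictionary<string, int>> counts)
{
    double total = 0;
    var wordTotals = new Dictionary<string, double>();
    var contextTotals = new Dictionary<string, double>();
    foreach (var w in counts)
        foreach (var c in w.Value)
        {
            total += c.Value;
            wordTotals[w.Key] = wordTotals.TryGetValue(w.Key, out var wt) ? wt + c.Value : c.Value;
            contextTotals[c.Key] = contextTotals.TryGetValue(c.Key, out var ct) ? ct + c.Value : c.Value;
        }

    var pmi = new Dictionary<(string, string), double>();
    foreach (var w in counts)
        foreach (var c in w.Value)
            // PMI(w, c) = log( p(w, c) / (p(w) p(c)) ) = log( count(w, c) * total / (count(w) * count(c)) )
            pmi[(w.Key, c.Key)] = Math.Log(c.Value * total / (wordTotals[w.Key] * contextTotals[c.Key]));
    return pmi;
}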
Word2Vec implicitly factorizes these matrices. It does this with weighting, giving more influence to frequent pairs. These weights have an interesting contribution: without them, the original PMI matrix can still be reconstructed, but the weighted and non-weighted objectives result in different generalizations.
By generalizations, we mean whatever we choose for the context, such as prepositions.
When the dimensions are reduced, this plays an even bigger role.
If we change the algorithm, say from word2vec to Levy and Goldberg's approach, we can even expect different results. For example, the word Washington may bring up similar words such as Seattle, Evergreen, etc. on the one hand, while on the other hand the same word might bring up other states such as Oregon, Utah, California, etc.

Let us now look at how negative sampling plays into this. Negative sampling is the reverse of sampling contexts for words. In the latter case, we were picking, say, a handful of generalizations that we argued were contexts and mapped them to the word in the word-context matrix. Now, for the same number of contexts, we choose others at random that we know are not contexts; this is negative sampling.

#codingexercise
Check if 2 binary trees are equal recursively and iteratively
bool isEqual(Node root1, Node root2)
{
    if (root1 == null && root2 == null) return true;
    if (root1 == null) return false;
    if (root2 == null) return false;
    return root1.data == root2.data &&
           isEqual(root1.left, root2.left) &&
           isEqual(root1.right, root2.right);
}
bool IsEqualIterative(Node root1, Node root2)
{
    var current1 = root1;
    var current2 = root2;
    var stk = new Stack<(Node, Node)>(); // one stack of node pairs keeps the two traversals in lockstep

    while (current1 != null || current2 != null || stk.Count > 0)
    {
        if (current1 == null && current2 != null) return false;
        if (current1 != null && current2 == null) return false;

        if (current1 != null)
        {
            if (current1.data != current2.data) return false;
            stk.Push((current1, current2));
            current1 = current1.left;
            current2 = current2.left;
            continue;
        }

        if (stk.Count > 0)
        {
            (current1, current2) = stk.Pop(); // resume at the deepest unfinished pair
            current1 = current1.right;
            current2 = current2.right;
        }
    }
    return true;
}
Although the order of checking does not depend on the type of traversal, it is better and simpler to use preorder traversal.

Thursday, September 22, 2016

Statistical framework is especially relevant in natural language processing.

The purpose of using random variables and a Bayes framework is that it has advantages over the frequentist approach. The focus of the frequentist approach is that the estimated model will be closer to the truth when there is a sufficient amount of data available. In the Bayes framework, this is expressed in terms of probabilities from the training sample. For example, if we want to define a probability distribution over a language vocabulary, then we can define a random variable X(w) where w ranges over the words in the vocabulary. The probability of a word in the vocabulary then becomes p(X = w). Random variables can be either discrete or continuous. In the case of discrete variables, we attach a weight to each element in the sample space, as in a multinomial distribution, and the discrete variable is thus assumed to have an underlying mass function that sums the probabilities over the possible values it can take. The same is done for a continuous random variable, except that we integrate over the sample space and call it a probability density function. Since multiple random variables can be defined on the same sample space, we can describe a joint distribution.
Word2vec uses the CBOW architecture, which predicts a word based on the surrounding words, and the skip-gram architecture, which predicts the surrounding words based on the current word. This is done with the softmax function, summarized in the formulation:
p(w_O | w_I) = exp(v'_w_O . v_w_I) / sum over all words w in the vocabulary of exp(v'_w . v_w_I), where v_w and v'_w are the input and output vector representations of word w in vocabulary W. This is a heavily optimized expression, but the idea is that each word tries to predict every word around it. Further research by Levy and Goldberg tried to predict the word not based on surrounding words alone but based on context, where context is not just words but anything that contributes to context, such as prepositions. This produced a surprisingly different word embedding; the functional similarity seen there is very different from the similarities seen with word2vec. Later, Stanford researchers started to ask what the similarity function should look like, and they came up with an objective function that showed results similar to word2vec. They used a word-context co-occurrence matrix, with some generalizations of context, and arrived at an objective function that looks like a weighted least squares regression model. When they factorize the co-occurrence matrix, they get results similar to word2vec. While word2vec tries to fit a model, predict, find the errors and refine the model with backpropagation, the Stanford objective function counts the word-by-context matrix, takes a log and factorizes it. They explained word2vec and also found a supposedly faster method, though this has been challenged quite a bit by Levy and Goldberg. Levy and Goldberg subsequently did more work analyzing skip-gram with negative sampling. Negative sampling tries to predict contexts that are not there: it takes a sampling window and draws, say, ten negative contexts. Levy and Goldberg counted the number of times the word appears alone, the number of times the context appears alone, and the number of times they appear together. They fit this information into a well-known formula called pointwise mutual information (PMI). They declined to factorize the PMI matrix and instead used the rows of the word-to-context matrix as vectors directly, and their experiments compared the similarity functions.
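For reference, my reading of Levy and Goldberg's matrix-factorization analysis is that skip-gram with k negative samples per positive pair implicitly factorizes a shifted version of this matrix, with each cell
M(w, c) = PMI(w, c) - log k, where PMI(w, c) = log( p(w, c) / (p(w) p(c)) )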
I have still not understood why negative sampling and PMI do not do the same diligence as precision and recall. With precision we include the false positives and with recall we include the false negatives.
We discussed research that uses matrices and high-dimensional vectors with a statistical framework and neural nets, but matrices and graphs go well together. We will review a few graph techniques for finding the correlations that word2vec does. Word2vec is hard to apply because you have to tune different parameters. With graphs and persisted connections, we hope to alleviate that.
#codingexercise
Given an array of n elements, where each element is at most k away from its target position, sort the array in the reverse order
void sortKReverse(ref List<int> A, int k)
{
    int n = A.Count;
    Debug.Assert(n > k);
    var h = new Heap<int>(); // assumed max-heap helper with Add, ReplaceMax and ExtractMax
    // seed the heap with the first k+1 elements; the overall maximum must be among them
    for (int i = 0; i <= k && i < n; i++)
    {
        h.Add(A[i]);
    }
    int next = k + 1;
    for (int j = 0; j < n; j++)
    {
        if (next < n)
        {
            A[j] = h.ReplaceMax(A[next]); // pop the current max and push the incoming element
            next++;
        }
        else
        {
            A[j] = h.ExtractMax();        // drain what is left once the input is exhausted
        }
    }
}

#urlshortener
void printUrlHash(string url)
{
    // the format item index must be 0 for the first argument
    Console.Write("{0:X8}", url.GetHashCode());
}
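One caveat worth noting: GetHashCode is not guaranteed to be stable across processes or runtimes, so a URL shortener that must reproduce the same key later may prefer a cryptographic hash. A minimal sketch, assuming an eight-character key taken from the first four bytes of a SHA-256 digest:

void printStableUrlHash(string url)
{
    using (var sha = System.Security.Cryptography.SHA256.Create())
    {
        var digest = sha.ComputeHash(System.Text.Encoding.UTF8.GetBytes(url));
        // first 4 bytes rendered as 8 hex characters; the key length is an illustrative choice
        Console.Write(BitConverter.ToString(digest, 0, 4).Replace("-", ""));
    }
}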