Saturday, September 24, 2016

We continue our discussion on statistical analysis in natural language processing.
Levy and Goldberg's paper introduced a departure from Word2Vec in how words and contexts are chosen. They analyzed negative sampling, which is the reverse of sampling contexts for words. In the latter case, we pick, say, a handful of generalizations that we argue are contexts and map them to words in the word-context matrix. With negative sampling, for the same number of contexts we also choose others at random that we know are not contexts for the word. The assumption is that the randomly chosen contexts will be unobserved ones, so the negative and positive samples are selected to be as disjoint as possible. Furthermore, the quality of the positive selections improves as their probabilities become higher.

The purpose of negative sampling is to avoid choosing only word-context pairs that all appear in the text, which would push their vectors toward the same values and would not be very helpful. By introducing random word-context pairs we now have a mix. This has the side effect that the word-context pairs chosen as positive samples will indeed be more representative of the text.
But where are these random word-context pairs chosen from? Do they come from the wider corpus or from within the text itself? Arguably, anything that falls outside the sampling window is sufficient.
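As a minimal sketch of that idea, the routine below draws k negative contexts for one observed (word, context) pair by sampling uniformly from the context vocabulary and skipping anything actually observed with the word. The names are assumptions made for illustration, and real implementations typically sample from a smoothed unigram distribution over contexts rather than uniformly.

using System;
using System.Collections.Generic;

static class NegativeSampler
{
    // For one observed (word, context) pair, draw k contexts at random from the
    // context vocabulary, skipping any context actually observed with the word.
    // Assumes the vocabulary is much larger than the set of observed contexts.
    public static List<string> SampleNegatives(string word,
                                               HashSet<string> observedContexts,
                                               IList<string> contextVocabulary,
                                               int k,
                                               Random rng)
    {
        var negatives = new List<string>();
        while (negatives.Count < k)
        {
            var candidate = contextVocabulary[rng.Next(contextVocabulary.Count)];
            if (observedContexts.Contains(candidate) == false)
                negatives.Add(candidate);
        }
        return negatives;
    }
}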

Find the number of three edge cycles in an undirected graph.
We use depth first search to traverse the graph and generate paths.
DFS(V, E, ref paths)
  for each vertex v in V:
      v.color = white
      v.d = nil
  time = 0
  for each vertex v in V:
      if v.color == white:
          path = new Path()
          DFS-VISIT(V, E, v, path, ref paths)

DFS-VISIT(V, E, u, path, ref paths)
  time = time + 1
  u.d = time
  u.color = gray
  path.Add(u)
  for each vertex v adjacent to u:
      if v.color == white:
          DFS-VISIT(V, E, v, path, ref paths)
      else if v.color == gray and v is not the vertex u was reached from:
          // v is an ancestor of u still on the stack, so the path segment from v
          // to u together with the edge (u, v) forms a cycle; record the path
          paths.Add(path.Clone())
  path.Remove(u)
  u.color = black
  time = time + 1
  u.f = time


filter(paths):
  var d = new Dictionary<string, int>()
  foreach path in paths:
      foreach vertex v in path:
          // a three-edge cycle returns to its starting vertex after three hops
          if v.next.next.next == v:
              pathIdentifier = stringify(v, 3)        // canonical key built from the three vertices
              if d.Contains(pathIdentifier) == false:
                  d.Add(pathIdentifier, 1)
  return d.Count


Note that this cycle detection does not carry over directly to breadth first traversal, because the ancestor relationship that the back-edge test relies on comes from depth first search. That said, we can use either breadth first or depth first traversal to enumerate paths, and finding cycles from enumerated paths is easy because we look for paths that start at and return to the same vertex.
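For comparison, here is a minimal self-contained sketch that counts three-edge cycles directly from adjacency sets instead of enumerating DFS paths; the Dictionary<int, HashSet<int>> adjacency representation and the class name TriangleCounter are assumptions made for illustration.

using System;
using System.Collections.Generic;

static class TriangleCounter
{
    // adj maps every vertex of an undirected graph to the set of its neighbours
    // (built symmetrically, so each neighbour is also a key).
    // Ordering the vertices (u < v < w) ensures each triangle is counted exactly once.
    public static int CountTriangles(Dictionary<int, HashSet<int>> adj)
    {
        int count = 0;
        foreach (var u in adj.Keys)
            foreach (var v in adj[u])
                if (u < v)
                    foreach (var w in adj[v])
                        if (v < w && adj[u].Contains(w))
                            count++;
        return count;
    }
}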

Friday, September 23, 2016

We continue our discussion on the statistical framework in natural language processing.
The mutual information (MI) of two random variables is a measure of the mutual dependence between them. When the two random variables are taken together, we measure their joint distribution; taken separately, each has its marginal distribution. For a particular pair of outcomes, the pointwise mutual information is PMI(x, y) = log(p(x, y) / (p(x)p(y))). If the variables are independent, the numerator and denominator are equal and the PMI is zero.
Pointwise mutual information (PMI) refers to single events, whereas MI refers to the average over all possible events: MI is the expected value of PMI.
In natural language processing this is useful because we know that words and contexts are dependent on each other. We can therefore take the words as one marginal distribution and their contexts as another, which gives us a PMI value for each cell of the word-context matrix.
This is a pretty large and sparse matrix, because most word-context pairs never actually co-occur. It can be factorized with techniques such as singular value decomposition (SVD), which we will go into later.
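As a minimal sketch under assumed names, the helper below computes the PMI for one cell of such a matrix from co-occurrence and marginal counts; a pair that was never observed comes out as negative infinity, which the positive-PMI (PPMI) variant would clip to zero.

using System;
using System.Collections.Generic;

static class PmiExample
{
    // cooc[(word, context)] is the joint count #(w, c); wordCount and contextCount are
    // the marginal counts; total is the total number of (word, context) observations.
    public static double Pmi(string word, string context,
                             Dictionary<(string, string), long> cooc,
                             Dictionary<string, long> wordCount,
                             Dictionary<string, long> contextCount,
                             long total)
    {
        if (!cooc.TryGetValue((word, context), out long joint) || joint == 0)
            return double.NegativeInfinity;      // never observed together; PPMI would use 0
        double pJoint = (double)joint / total;
        double pWord = (double)wordCount[word] / total;
        double pContext = (double)contextCount[context] / total;
        return Math.Log(pJoint / (pWord * pContext));
    }
}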
Word2Vec implicitly factorizes such a matrix. It does so with a weighted objective that gives more influence to frequent pairs, and these weights make an interesting contribution: without them the original PMI matrix can still be reconstructed, but the weighted and unweighted objectives result in different generalizations.
By generalizations we mean whatever we choose as the context, such as prepositions.
When the dimensions are reduced, this plays an even bigger role.
If we change the algorithm, say from word2vec to Levy and Goldberg's variant, we can expect different results. For example, with one the word Washington may bring up related words such as Seattle and Evergreen, while with the other the same word might bring up other states such as Oregon, Utah, and California.

Let us now look at how negative sampling plays into this. Negative sampling is the reverse of sampling contexts for words. In the latter case, we pick, say, a handful of generalizations that we argue are contexts and map them to words in the word-context matrix. For the same number of contexts, we then choose others at random that we know are not contexts, and these are the negative samples.

#codingexercise
Check if 2 binary trees are equal recursively and iteratively
bool isEqual(Node root1, Node root2)
{
if (root1 == NULL && root2 == NULL) return true;
if (root1 == NULL) return false;
if (root2 == NULL) return false;
return root1.data == root2.data && isEqual(root1.left, root2.left) && isEqual(root1.right, root2.right);
}
bool IsEqualIterative(Node root1, Node root2)
{
var current1 = root1;
var current2 = root2;
var stk = new Stack<(Node, Node)>(); // one stack holding the matching pair of nodes

while (current1 != null || current2 != null || stk.empty() == false)
{

if (current1 == null && current2 != null) return false;
if (current1 != null && current2 == null) return false;

if (current1 != null)
{
if (current1.data != current2.data) return false;
stk.push((current1, current2));
current1 = current1.left;
current2 = current2.left;
continue;
}

if (stk.empty() == false)
{
(current1, current2) = stk.pop();
current1 = current1.right;
current2 = current2.right;
}

}
return true;
}
Although correctness of the check does not depend on the type of traversal, preorder traversal is the simplest to use.

Thursday, September 22, 2016

A statistical framework is especially relevant in natural language processing.

The purpose of using random variables and a Bayes framework is that it has advantages over the frequentist approach. The focus of the frequentist approach is that the estimated model gets closer to the truth as more data becomes available. In the Bayes framework, this is expressed in terms of probabilities from the training sample. For example, if we want to define a probability distribution over a language vocabulary, we can define a random variable X(w) where w ranges over the words in the vocabulary. The probability of a word in the vocabulary then becomes p(X = w). Random variables can be either discrete or continuous. In the discrete case we attach a weight to each element of the sample space, as in a multinomial distribution; the variable is thus assumed to have an underlying probability mass function whose values sum to one over the possible outcomes it can take. The same is done for a continuous random variable, except that we integrate a probability density function over the sample space. Since multiple random variables can be defined on the same sample space, we can also describe their joint distribution.
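As a minimal sketch of the discrete case, the helper below estimates p(X = w) for every word from a training sample of tokens; the class and method names are assumptions made for illustration.

using System;
using System.Collections.Generic;
using System.Linq;

static class UnigramDistribution
{
    // Estimate p(X = w) for every word in the vocabulary from a training sample of tokens.
    public static Dictionary<string, double> Estimate(IEnumerable<string> tokens)
    {
        var counts = new Dictionary<string, long>();
        long total = 0;
        foreach (var w in tokens)
        {
            counts[w] = counts.TryGetValue(w, out long c) ? c + 1 : 1;
            total++;
        }
        // The mass function assigns a weight to each word; the weights sum to one.
        return counts.ToDictionary(kv => kv.Key, kv => (double)kv.Value / total);
    }
}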
Word2vec uses the CBOW architecture, which predicts a word from its surrounding words, and the skip-gram architecture, which predicts the surrounding words from the current word. The prediction is done with the softmax function, summarized as:
p(wO | wI) = exp(v'wO . vwI) / sum over every word w in the vocabulary W of exp(v'w . vwI), where vw and v'w are the input and output vector representations of word w. This is a heavily optimized expression, but the idea is that each word tries to predict every word around it. Further research by Levy and Goldberg tried to predict the word not just from surrounding words but from contexts, where a context is not only a word but anything that contributes to context, such as a preposition. This produced a surprisingly different word embedding: the functional similarity seen here is very different from the similarities seen with word2vec. Later, Stanford researchers asked what the similarity function should look like, and they came up with an objective function that showed results similar to word2vec. They built a word-context co-occurrence matrix using some generalization of context and arrived at an objective that looks like a weighted least-squares regression model. When they factorize the co-occurrence matrix, they get results similar to word2vec. While word2vec tries to fit a model, predict, find the errors, and refine them with backpropagation, the Stanford objective counts the word-by-context matrix, takes a log, and factorizes it. They explained word2vec and also found a supposedly faster method, though this has been challenged quite a bit by Levy and Goldberg. Levy and Goldberg subsequently did more work analyzing skip-gram with negative sampling. Negative sampling tries to predict contexts that are not there: it takes a sampling window and, say, ten negative contexts per positive one. Levy and Goldberg counted the number of times the word appears alone, the number of times the context appears alone, and the number of times they appear together, and fit this information into a well-known formula, the pointwise mutual information (PMI). They declined to factorize the PMI matrix and instead used the rows of the word-to-context matrix directly as vectors, and their experiments compared the similarity functions.
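For concreteness, here is a minimal numeric sketch of the full softmax above, assuming the vocabulary is indexed by integers and the input and output embeddings are supplied as jagged arrays; the class and method names are assumptions.

using System;

static class SkipGramSoftmax
{
    // vIn[i] and vOut[j] are the input and output vector representations of words i and j.
    // Returns p(w_o | w_i) under the full softmax formulation.
    public static double Probability(double[][] vIn, double[][] vOut, int wi, int wo)
    {
        double numerator = Math.Exp(Dot(vOut[wo], vIn[wi]));
        double denominator = 0.0;
        for (int w = 0; w < vOut.Length; w++)       // sum over the whole vocabulary
            denominator += Math.Exp(Dot(vOut[w], vIn[wi]));
        return numerator / denominator;
    }

    private static double Dot(double[] a, double[] b)
    {
        double sum = 0.0;
        for (int i = 0; i < a.Length; i++)
            sum += a[i] * b[i];
        return sum;
    }
}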
I have still not understood why negative sampling and PMI do not do the same diligence as precision and recall. With precision we include the false positives and with recall we include the false negatives.
We discussed research that uses matrices and high-dimensional vectors within a statistical framework and neural nets, but matrices and graphs also go well together. We will review a few graph techniques for finding the kind of correlations that word2vec finds. Word2vec is hard to apply because you have to tune different parameters; with graphs and persisted connections we hope to alleviate that.
#codingexercise
Given an array of n elements, where each element is at most k away from its target position, sort the array in reverse (descending) order.
void SortKReverse(ref List<int> A, int k)
{
    int n = A.Count;
    Debug.Assert(n > k);
    var h = new Heap<int>();              // assumed max-heap helper with Add, ReplaceMax and ExtractMax
    for (int i = 0; i <= k && i < n; i++) // seed the heap with the first k+1 elements
    {
        h.Add(A[i]);
    }
    int next = k + 1;
    for (int j = 0; j < n; j++)
    {
        if (next < n)
        {
            A[j] = h.ReplaceMax(A[next]); // pop the current max, push the next element
            next++;
        }
        else
        {
            A[j] = h.ExtractMax();        // drain the remaining elements in descending order
        }
    }
}

#urlshortener
void PrintUrlHash(string url)
{
    Console.Write("{0:X8}", url.GetHashCode()); // 8 hex digits of the string's hash code
}
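Building on that, here is a minimal sketch of a shortener that keys an in-memory dictionary on the same 8-hex-digit hash; the class name, the short domain, and the absence of collision handling are assumptions made for illustration.

using System;
using System.Collections.Generic;

static class UrlShortener
{
    private static readonly Dictionary<string, string> map = new Dictionary<string, string>();

    // Store the URL under its 8-hex-digit hash and return the short form.
    // Hash collisions are not handled in this sketch.
    public static string Shorten(string url)
    {
        string key = string.Format("{0:X8}", url.GetHashCode());
        map[key] = url;
        return "https://short.example/" + key;
    }

    // Look up the original URL from the short key; returns null if unknown.
    public static string Resolve(string key)
    {
        return map.TryGetValue(key, out var url) ? url : null;
    }
}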

Wednesday, September 21, 2016

#algorithm discussion continued
Today we continue reading about flow networks. We were looking at maximum flows.
Maximum flows can also be calculated using the "push-relabel" method.
We reviewed the relabel-to-front algorithm next. This algorithm improves the running time of the maximum flow problem by choosing the order of basic operations carefully and managing the network data structure efficiently. We saw what admissible edges are: if a vertex u is overflowing and (u, v) is an admissible edge, then push(u, v) applies; the push does not create any new admissible edges, but it may cause the edge (u, v) to become inadmissible. The notion of admissible edges also applies to the relabel operation.
We also saw how the discharge operation takes place and how the relabel-to-front algorithm proceeds.
Since the relabel-to-front algorithm iterates over vertices other than the source and sink, this is best done in the order of a topological sort of the vertices in the admissible network.

The relabel-to-front algorithm computes a maximum flow. It is in fact a push-relabel method; it uses discharge to perform push and relabel operations only when they apply. The algorithm maintains the following invariant:
the vertex currently being discharged of its excess flow has no vertex earlier in the list with excess flow.
This means that the vertices are processed in topologically sorted order.
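This is not the full relabel-to-front implementation, but here is a minimal sketch of the push and relabel primitives that discharge repeatedly applies, assuming an adjacency-matrix residual network where c holds capacities, f the preflow, h the heights, and e the excess at each vertex.

using System;

static class PushRelabel
{
    // Push applies when u is overflowing, the residual capacity c[u,v] - f[u,v] is
    // positive and h[u] == h[v] + 1; it moves as much excess as the residual edge allows.
    public static void Push(int u, int v, int[,] c, int[,] f, int[] h, int[] e)
    {
        int delta = Math.Min(e[u], c[u, v] - f[u, v]);
        f[u, v] += delta;
        f[v, u] -= delta;
        e[u] -= delta;
        e[v] += delta;
    }

    // Relabel applies when u is overflowing and no residual neighbour sits one level
    // below u; it lifts u just above its lowest residual neighbour.
    public static void Relabel(int u, int n, int[,] c, int[,] f, int[] h)
    {
        int minHeight = int.MaxValue;
        for (int v = 0; v < n; v++)
            if (c[u, v] - f[u, v] > 0)
                minHeight = Math.Min(minHeight, h[v]);
        if (minHeight < int.MaxValue)
            h[u] = minHeight + 1;
    }
}

Discharge would then loop over u's neighbours, pushing wherever admissible and relabeling u once no admissible edge remains, which is what the relabel-to-front ordering above exploits.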

#preorder without recursion
void preorder(node root)
{
var current = root;
var stk = new Stack<node>();

while(current != null || stk.empty() == false)
{
if (current != null)
{
Console.Write(current.data);
stk.push(current);
current = current.left;
continue;
}
if (stk.empty() == false)
{
current = stk.pop();
current = current.right;
}
}
}

#postorder without recursion
void postorder(node root)
{
var current = root;
var stk = new Stack<node>();
node last = null; // the node most recently printed
while (current != null || stk.empty() == false)
{
if (current != null)
{
stk.push(current);
current = current.left;
continue;
}
var peek = stk.peek();
if (peek.right != null && peek.right != last)
{
current = peek.right; // right subtree not yet visited
}
else
{
Console.Write(peek.data);
last = stk.pop();
}
}
}

Given an array of n elements, where each element is at most k away from its target position, sort the array
void SortK(ref List<int> A, int k)
{
    int n = A.Count;
    Debug.Assert(n > k);
    var h = new Heap<int>();              // assumed min-heap helper with Add, ReplaceMin and ExtractMin
    for (int i = 0; i <= k && i < n; i++) // seed the heap with the first k+1 elements
    {
        h.Add(A[i]);
    }
    int next = k + 1;
    for (int j = 0; j < n; j++)
    {
        if (next < n)
        {
            A[j] = h.ReplaceMin(A[next]); // pop the current min, push the next element
            next++;
        }
        else
        {
            A[j] = h.ExtractMin();        // drain the remaining elements in ascending order
        }
    }
}

The above also works for reverse-order sorting, except that we substitute a max-heap for the min-heap.

Tuesday, September 20, 2016

#algorithm discussion continued
Today we continue reading about flow networks. We were looking at maximum flows.
Maximum flows can also be calculated using the "push-relabel" method.
We reviewed the relabel-to-front algorithm next. This algorithm improves the running time of the maximum flow problem by choosing the order of basic operations carefully and managing the network data structure efficiently. We saw what admissible edges are: if a vertex u is overflowing and (u, v) is an admissible edge, then push(u, v) applies; the push does not create any new admissible edges, but it may cause the edge (u, v) to become inadmissible. The notion of admissible edges also applies to the relabel operation.
We also saw how the discharge operation takes place and how the relabel-to-front algorithm proceeds.
Since the relabel-to-front algorithm iterates over vertices other than the source and sink, this is best done in the order of a topological sort of the vertices in the admissible network.
#codingexercise
Print all nodes that are at a distance k from a leaf.
void GetK(Node root, int k, ref List<int> path, ref List<bool> visited, int count)
{
if (root == null) return;
path[count] = root.data;
visited[count] = false;
count += 1;
// at a leaf, print the ancestor k levels up if it has not already been printed
if (root.left == null && root.right == null &&
     count - k - 1 >= 0 && visited[count - k - 1] == false)
{
Console.Write(path[count - k - 1] + " ");
visited[count - k - 1] = true;
}
GetK(root.left, k, ref path, ref visited, count);
GetK(root.right, k, ref path, ref visited, count);
}

The above method visits each node in preorder, recording the root-to-leaf path as it descends.

#Inorder without recursion
void inorder(node root)
{
var current = root;
var stk = new Stack<node>();

while(current != null || stk.empty() == false)
{
if (current != null)
{
stk.push(current);
current = current.left;
continue;
}
if (stk.empty() == false)
{
current = stk.pop();
Console.Write(current.data);
current = current.right;
}
}
}