Saturday, September 24, 2016

We continue our discussion on statistical analysis in Natural Language Processing
Levy's and Goldberg's paper introduced a departure from Word2Vec  in how words and contexts are chosen. They introduced Negative Sampling which is the reverse of sampling contexts for words. In the latter case, we were picking say a handful of generalizations that we argued were contexts and mapped them to word in the word-context matrix. Now for the same number of contexts, we choose others at random that we know are not contexts. The assumption is that the randomly chosen context will be the unobserved ones. This means that the negative and the positive samples are selected so that they are as disjoint as possible. Furthermore the quality of the positive selections improve as their probabilities are higher

The purpose of negative sampling is to minimize the probability that words,contexts pairs are chosen such that they all appear from the text and they consequently have the same values for their vectors which is not very helpful.  By introducing random word,context we now have a mix. This has the side-effect that the word, contexts chosen as positive sampling will indeed be more representative of the text. 
But where are these random word, context pairs chosen from ? Do they come from the corpus or within the text. Arguably, if it is outside the sampling window, it is sufficient.

Find the number of three edge cycles in an undirected graph.
We use depth first search to traverse the graph and generate paths.
DFS ( V, E, paths)
For each vertex v in V
       V.color=white
       V.d = nil
  Time = 0
 For each vertex v in V:
       If v.color == white:
           path = new Path();  
           DFS-Visit (V, E, v, path, paths)
     
 DFS-VISIT (V,E, u, path, ref paths)
  time = time + 1
   u.d = time
   u.color = gray
 
   path.Add(u);
   foreach  vertex v adjacent to u
        If v.color == white
           DFS-VISIT(V,E,v)
         Else
               If v.d  <= u.d < u.f <= v.f  :
                        paths.Add(path); 
u.color = black
time = time + 1
 u.f = time


filter(paths):
var d = new Dictionary<string, int>();
foreach path in paths:
     foreach vertex  v in path:
           if v.next.next.next == v:
               pathidentifier = stringify(v,3)
                if d.contains(pathidentifier) == false:
                    d.Add(pathidentifier, 1);


This topological sort does not work with breadth first traversal because each vertex is able to participate as the starting point only in a depth first traversal. That said we can use either breadth first traversal or depth first traversal to traverse the graph and enumerate paths. Finding cycles from enumerated paths is easy because we look for the start and return to same vertex across all the paths.

No comments:

Post a Comment