Friday, September 23, 2016

We continue our discussion of the statistical framework in natural language processing.
The mutual information (MI) of two random variables is a measure of their mutual dependence. Taken together, the variables are described by their joint distribution; taken separately, each is described by its marginal distribution. For a specific pair of outcomes x and y, the quantity log(p(x,y) / (p(x)p(y))) compares the joint probability with what independence would predict; if the variables are independent, the numerator and denominator are equal and the log is zero.
This per-outcome quantity is the pointwise mutual information (PMI). PMI refers to single events, whereas MI averages over all possible events: MI is the expected value of PMI.
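As a small illustration, here is a minimal C# sketch (C# to match the coding exercise below) that computes the PMI of every outcome pair for a made-up 2x2 joint distribution and then averages them into MI. The numbers are invented purely for illustration:

using System;

class MutualInformationDemo
{
    static void Main()
    {
        // A made-up joint distribution p(x, y) over two binary variables.
        double[,] joint = { { 0.4, 0.1 }, { 0.1, 0.4 } };

        // Marginals: p(x) sums over y, p(y) sums over x.
        double[] px = { joint[0, 0] + joint[0, 1], joint[1, 0] + joint[1, 1] };
        double[] py = { joint[0, 0] + joint[1, 0], joint[0, 1] + joint[1, 1] };

        double mi = 0.0;
        for (int x = 0; x < 2; x++)
        {
            for (int y = 0; y < 2; y++)
            {
                // PMI for a single pair of outcomes.
                double pmi = Math.Log(joint[x, y] / (px[x] * py[y]), 2);
                Console.WriteLine($"PMI({x},{y}) = {pmi:F4}");

                // MI is the expectation of PMI under the joint distribution.
                mi += joint[x, y] * pmi;
            }
        }
        Console.WriteLine($"MI = {mi:F4} bits");
    }
}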
In natural language processing this is useful because we know that words and their contexts are dependent on each other. We can therefore take the words as one marginal distribution and their contexts as the other, which gives us a matrix with a PMI value in each cell of the word-context matrix.
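As a sketch of how those cells could be filled in, the snippet below computes PMI values from a tiny table of (word, context) co-occurrence counts. The counts and vocabulary are invented; in practice they would be collected from a corpus, for example with a sliding window over the text:

using System;
using System.Collections.Generic;

class PmiMatrixDemo
{
    static void Main()
    {
        // Hypothetical (word, context) co-occurrence counts.
        var counts = new Dictionary<(string word, string ctx), double>
        {
            { ("seattle", "rain"),   8 }, { ("seattle", "coffee"), 6 },
            { ("utah",    "desert"), 7 }, { ("utah",    "rain"),   1 },
        };

        double total = 0;
        var wordCounts = new Dictionary<string, double>();
        var ctxCounts = new Dictionary<string, double>();
        foreach (var kv in counts)
        {
            total += kv.Value;
            wordCounts[kv.Key.word] = wordCounts.GetValueOrDefault(kv.Key.word) + kv.Value;
            ctxCounts[kv.Key.ctx] = ctxCounts.GetValueOrDefault(kv.Key.ctx) + kv.Value;
        }

        // PMI(w, c) = log( p(w,c) / (p(w) p(c)) ); pairs that never
        // co-occur are simply absent, which keeps the matrix sparse.
        foreach (var kv in counts)
        {
            double pwc = kv.Value / total;
            double pw = wordCounts[kv.Key.word] / total;
            double pc = ctxCounts[kv.Key.ctx] / total;
            double pmi = Math.Log(pwc / (pw * pc), 2);
            Console.WriteLine($"PMI({kv.Key.word}, {kv.Key.ctx}) = {pmi:F3}");
        }
    }
}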
This is a large and sparse matrix, because most words and contexts never appear together, so most cells stay empty. The matrix can be factorized with techniques such as Singular Value Decomposition (SVD), which we will go into later.
Word2vec can be viewed as implicitly factorizing such a matrix. It does so with a weighted objective, giving more influence to frequent (word, context) pairs. These weights make an interesting contribution: without them the original PMI matrix could still be reconstructed, but the weighted and unweighted objectives result in different generalizations.
Generalization here depends on whatever we choose for the contexts, such as prepositions.
When the dimensions are reduced, the weighting plays an even bigger role.
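Word2vec's actual objective is the skip-gram negative-sampling loss, so the following is only a rough sketch of the weighting idea: one gradient step of a hypothetical weighted least-squares factorization in which a pair's count scales its pull on the low-dimensional vectors. The function name and parameters are invented for illustration:

// One SGD step toward count * (w . c - PMI(w,c))^2: each observed cell
// of the PMI matrix is approximated by a dot product of low-dimensional
// vectors, and the error is weighted by the pair's count, so frequent
// pairs pull harder on the embeddings.
static void Step(double[] w, double[] c, double pmi, double count, double lr)
{
    double dot = 0;
    for (int i = 0; i < w.Length; i++) dot += w[i] * c[i];
    double grad = 2 * count * (dot - pmi);
    for (int i = 0; i < w.Length; i++)
    {
        double wi = w[i];
        w[i] -= lr * grad * c[i];
        c[i] -= lr * grad * wi;
    }
}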
If we change the algorithm or the context definition, say from word2vec's bag-of-words contexts to the dependency-based contexts of Levy and Goldberg, we can even expect different results. For example, the word Washington may bring up similar words such as Seattle and Evergreen under one definition, while under the other the same word might bring up other states such as Oregon, Utah, and California.

Let us now look at how negative sampling plays into this. Negative sampling is the reverse of sampling contexts for words. In the latter case we were picking, say, a handful of observed contexts and mapping them to a word in the word-context matrix. Now, for the same word, we also choose a comparable number of words at random that we know are not its contexts, and these are the negative samples.
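Here is a minimal sketch of drawing negative samples. Word2vec samples negatives from the unigram distribution raised to the 3/4 power, and the sketch follows that convention, but the helper and its parameters are otherwise invented for illustration:

using System;
using System.Collections.Generic;
using System.Linq;

class NegativeSamplingDemo
{
    // Draws k negative contexts for a word, excluding its observed
    // contexts, from the unigram counts raised to the 3/4 power.
    static List<string> SampleNegatives(
        Dictionary<string, double> unigramCounts,
        HashSet<string> observedContexts,
        int k, Random rng)
    {
        var candidates = unigramCounts
            .Where(kv => !observedContexts.Contains(kv.Key))
            .Select(kv => (word: kv.Key, weight: Math.Pow(kv.Value, 0.75)))
            .ToList();
        double totalWeight = candidates.Sum(c => c.weight);

        var negatives = new List<string>();
        for (int i = 0; i < k; i++)
        {
            // Roulette-wheel selection over the smoothed weights.
            double r = rng.NextDouble() * totalWeight;
            foreach (var c in candidates)
            {
                r -= c.weight;
                if (r <= 0) { negatives.Add(c.word); break; }
            }
        }
        return negatives;
    }

    static void Main()
    {
        var counts = new Dictionary<string, double>
        {
            { "rain", 50 }, { "coffee", 30 }, { "desert", 10 }, { "senate", 5 }
        };
        var observed = new HashSet<string> { "rain", "coffee" };
        Console.WriteLine(string.Join(", ",
            SampleNegatives(counts, observed, 3, new Random(42))));
    }
}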

#codingexercise
Check if two binary trees are equal, recursively and iteratively.
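Both versions below assume a minimal Node class along these lines (the exercise leaves it implicit):

class Node
{
    public int data;
    public Node left;
    public Node right;
}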
bool isEqual(Node root1, Node root2)
{
    // Two empty trees are equal; exactly one empty tree is a mismatch.
    if (root1 == null && root2 == null) return true;
    if (root1 == null || root2 == null) return false;
    // Compare the roots, then recurse into both subtrees.
    return root1.data == root2.data &&
           isEqual(root1.left, root2.left) &&
           isEqual(root1.right, root2.right);
}
bool IsEqualIterative(Node root1, Node root2)
{
    var current1 = root1;
    var current2 = root2;
    // We must remember our position in both trees, so we keep a stack
    // for each (a single stack of pairs would also work).
    var stk1 = new Stack<Node>();
    var stk2 = new Stack<Node>();

    while (current1 != null || current2 != null || stk1.Count > 0)
    {
        // A node on one side with no counterpart on the other
        // means the shapes differ.
        if (current1 == null && current2 != null) return false;
        if (current1 != null && current2 == null) return false;

        if (current1 != null)
        {
            // Preorder: compare the current pair, then descend left.
            if (current1.data != current2.data) return false;
            stk1.Push(current1);
            stk2.Push(current2);
            current1 = current1.left;
            current2 = current2.left;
            continue;
        }

        if (stk1.Count > 0)
        {
            // Backtrack to the most recent pair and descend right.
            current1 = stk1.Pop().right;
            current2 = stk2.Pop().right;
        }
    }
    return true;
}
Although the result does not depend on the type of traversal, preorder is the better and simpler choice here, because each pair of nodes is compared as soon as it is visited and mismatches are caught early.
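A quick way to exercise both versions, with two small hand-built trees (invented for illustration):

var a = new Node { data = 1, left = new Node { data = 2 }, right = new Node { data = 3 } };
var b = new Node { data = 1, left = new Node { data = 2 }, right = new Node { data = 3 } };
Console.WriteLine(isEqual(a, b));           // true
Console.WriteLine(IsEqualIterative(a, b));  // true
b.right.data = 4;
Console.WriteLine(isEqual(a, b));           // false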
