A statistical framework is especially relevant in natural language processing.
The purpose of using random variables and a Bayesian framework is that it has advantages over the frequentist approach. The frequentist view is that the estimated model gets closer to the truth when a sufficient amount of data is available; the Bayesian framework instead expresses this uncertainty as probabilities conditioned on the training sample. For example, if we want to define a probability distribution over a language vocabulary, we can define a random variable X whose values w range over the words in the vocabulary. The probability of a word then becomes p(X = w). Random variables can be either discrete or continuous. For a discrete variable we attach a weight to each element of the sample space, as in a multinomial distribution; the variable is thus assumed to have an underlying probability mass function that assigns a probability to each value it can take, and these probabilities sum to one. A continuous random variable is treated the same way, except that we integrate over the sample space and call the function a probability density function. When multiple random variables are defined on the same sample space, we can describe their joint distribution.
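As a rough sketch of the discrete case, the snippet below builds a relative-frequency distribution over a made-up three-word vocabulary and evaluates p(X = w); the vocabulary and counts are illustrative assumptions, not from any particular corpus.
using System;
using System.Collections.Generic;
using System.Linq;

static class WordDistribution
{
    static void Main()
    {
        // Hypothetical word counts from a training sample.
        var counts = new Dictionary<string, int> { { "the", 70 }, { "cat", 20 }, { "sat", 10 } };
        double total = counts.Values.Sum();

        // p(X = w) is the relative frequency of w; the probabilities sum to one.
        var p = counts.ToDictionary(kv => kv.Key, kv => kv.Value / total);

        Console.WriteLine("p(X = cat) = {0:F2}", p["cat"]);   // 0.20
        Console.WriteLine("sum = {0:F2}", p.Values.Sum());    // 1.00
    }
}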
Word2vec uses two architectures: CBOW, which predicts a word from its surrounding words, and skip-gram, which predicts the surrounding words from the current word. The prediction is made with a softmax function, summarized in the following formulation:
p(w_O | w_I) = exp(v'_{w_O}^T v_{w_I}) / \sum_{w \in W} exp(v'_w^T v_{w_I}), where v_w and v'_w are the input and output vector representations of word w in the vocabulary W. This expression is heavily optimized in practice, but the idea is that each word tries to predict every word around it.

Further research by Levy and Goldberg tried to predict a word not from the surrounding words alone but from its context, where context is anything that contributes to it, such as a preposition. This produced surprisingly different word embeddings: the functional similarity seen there is very different from the similarities produced by word2vec. Later, Stanford researchers set out to determine what the similarity function should look like, and they came up with an objective function whose results are similar to word2vec's. They built a word-context co-occurrence matrix using some generalizations of context and arrived at an objective function that looks like a weighted least squares regression model. When they factorize the co-occurrence matrix, they get results similar to word2vec. While word2vec fits a model by predicting, measuring the errors, and refining the weights with backpropagation, the Stanford objective counts the word-by-context matrix, takes its log, and factorizes it. They explained word2vec and also found a supposedly faster method, though that claim has been challenged quite a bit by Levy and Goldberg.

Levy and Goldberg subsequently did more work analyzing skip-gram with negative sampling. Negative sampling tries to predict contexts that are not there: for each observed word-context pair within the sampling window it draws, say, ten negative contexts. Levy and Goldberg counted the number of times the word appears alone, the number of times the context appears alone, and the number of times the two appear together, and fit this information into a well-known formula, the pointwise mutual information (PMI). They declined to factorize the PMI matrix and instead used its rows, mapping words to contexts, directly as vectors, and their experiments compared the similarity functions.
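As a rough illustration of the softmax above, this sketch computes p(w_O | w_I) from small made-up input and output vectors; the three-word vocabulary and the vector values are assumptions for demonstration, not trained embeddings.
using System;
using System.Linq;

static class SkipGramSoftmax
{
    // Dot product of two equal-length vectors.
    static double Dot(double[] a, double[] b) => a.Zip(b, (x, y) => x * y).Sum();

    static void Main()
    {
        string[] vocab = { "dog", "cat", "bone" };

        // Made-up input (v) and output (v') vectors, one per vocabulary word.
        double[][] vIn  = { new[] { 0.2, 0.1 }, new[] { 0.3, -0.4 }, new[] { -0.1, 0.5 } };
        double[][] vOut = { new[] { 0.4, 0.0 }, new[] { 0.1,  0.2 }, new[] { 0.3, -0.2 } };

        int wi = 0; // input word "dog"

        // Softmax over the scores v'_w^T v_{w_I} for every w in the vocabulary.
        double[] scores = vOut.Select(vo => Math.Exp(Dot(vo, vIn[wi]))).ToArray();
        double z = scores.Sum();
        for (int wo = 0; wo < vocab.Length; wo++)
            Console.WriteLine("p({0} | {1}) = {2:F3}", vocab[wo], vocab[wi], scores[wo] / z);
    }
}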
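And as a sketch of the PMI view in the Levy and Goldberg analysis, this snippet computes PMI(w, c) = log( p(w, c) / (p(w) p(c)) ) from made-up co-occurrence counts; the tiny vocabulary and counts are illustrative assumptions, and each row of the resulting matrix would serve directly as that word's vector.
using System;

static class PmiExample
{
    static void Main()
    {
        string[] words = { "dog", "cat" };
        string[] contexts = { "barks", "meows" };

        // Made-up word-context co-occurrence counts (total of 16 pairs).
        double[,] counts = { { 8, 1 }, { 1, 6 } };
        double total = 16;

        for (int w = 0; w < words.Length; w++)
        {
            for (int c = 0; c < contexts.Length; c++)
            {
                // Marginal counts: the word alone and the context alone.
                double wCount = counts[w, 0] + counts[w, 1];
                double cCount = counts[0, c] + counts[1, c];

                // PMI(w, c) = log( p(w, c) / (p(w) * p(c)) )
                double pmi = Math.Log((counts[w, c] / total) / ((wCount / total) * (cCount / total)));
                Console.WriteLine("PMI({0}, {1}) = {2:F2}", words[w], contexts[c], pmi);
            }
        }
    }
}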
I have still not understood why negative sampling and PMI do not apply the same diligence as precision and recall: with precision we account for the false positives, and with recall we account for the false negatives.
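For reference, a minimal sketch of those two measures, with made-up counts of true positives, false positives, and false negatives:
using System;

static class PrecisionRecall
{
    static void Main()
    {
        // Hypothetical retrieval counts.
        double tp = 40, fp = 10, fn = 20;

        Console.WriteLine("precision = {0:F2}", tp / (tp + fp)); // penalizes false positives
        Console.WriteLine("recall    = {0:F2}", tp / (tp + fn)); // penalizes false negatives
    }
}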
We discussed research that uses matrices and high dimensional vectors with a statistical framework and neural nets, but matrices and graphs also go well together. We will review a few graph techniques for finding the kind of correlations that word2vec finds. Word2vec can be hard to apply because you have to tune several different parameters; with graphs and persisted connections we hope to alleviate that.
#codingexercise
Given an array of n elements, where each element is at most k away from its target position, sort the array in reverse (descending) order.
// Requires: using System.Collections.Generic; using System.Diagnostics;
// PriorityQueue<TElement, TPriority> is available in .NET 6+ and is used here
// as a max-heap by negating the priorities.
static void SortKReverse(List<int> A, int k)
{
    int n = A.Count;
    Debug.Assert(n > k);

    var h = new PriorityQueue<int, int>();

    // Seed the heap with the first k + 1 elements; the largest of these
    // is the element that belongs at position 0 of the descending order.
    for (int i = 0; i < k + 1 && i < n; i++)
    {
        h.Enqueue(A[i], -A[i]);
    }

    int next = k + 1;
    for (int j = 0; j < n; j++)
    {
        if (next < n)
        {
            // Push the next element and pop the current maximum in one step.
            A[j] = h.EnqueueDequeue(A[next], -A[next]);
            next++;
        }
        else
        {
            A[j] = h.Dequeue();
        }
    }
}
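A quick usage sketch with a made-up nearly-sorted input (each element at most 2 positions away from its descending position):
var a = new List<int> { 8, 10, 9, 5, 7, 6 };
SortKReverse(a, 2);
Console.WriteLine(string.Join(", ", a)); // 10, 9, 8, 7, 6, 5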
#urlshortener
static void PrintUrlHash(string url)
{
    // Format the hash as an 8-digit hex string; "{0:X8}" refers to the first argument.
    // Note: string.GetHashCode is randomized per process on .NET Core, so this value
    // is not stable across runs; a persistent URL shortener would need a stable hash.
    Console.Write("{0:X8}", url.GetHashCode());
}
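For example (the URL here is just a placeholder):
PrintUrlHash("https://www.example.com/some/long/path"); // prints something like 1A2B3C4D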