We have seen a general discussion of what conditional random fields are, their benefits, their representations, their objective functions, and the optimization of those objectives with techniques such as the Conjugate Gradient (CG) method, which ties into material we have previously studied for text analysis. We will now see how to apply them. At this point we are merely forming hypotheses. We saw that the PMI matrix helped find word relationships using word vectors. We said that the semantic network embedded within the word space can be used to find candidates for positive and negative sampling. We said that the log-exponential form of the PMI matrix defines the undirected graphical model. We said we could replace this model with feature functions, where features are defined in terms of both transitions from the previous state to the current state and observations for the current state. Then we said we could generalize this with parameter vectors and minimize the objective function with the Conjugate Gradient method.
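As a quick refresher on the PMI side, here is a minimal sketch of building a word-context PMI matrix from co-occurrence counts; the toy corpus, window size, and variable names are illustrative assumptions, not taken from the earlier discussion.

```python
import numpy as np
from collections import Counter

# Minimal sketch (assumed setup): build a word-context PMI matrix
# from co-occurrence counts in a toy corpus with a +/-2 word window.
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]

window = 2
pair_counts, word_counts = Counter(), Counter()
for sentence in corpus:
    for i, w in enumerate(sentence):
        word_counts[w] += 1
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if i != j:
                pair_counts[(w, sentence[j])] += 1

vocab = sorted(word_counts)
index = {w: k for k, w in enumerate(vocab)}
total_pairs = sum(pair_counts.values())
total_words = sum(word_counts.values())

# PMI(w, c) = log[ p(w, c) / (p(w) p(c)) ]; negative values are clipped to
# zero, giving the positive PMI (PPMI) matrix commonly used for word vectors.
pmi = np.zeros((len(vocab), len(vocab)))
for (w, c), n in pair_counts.items():
    p_wc = n / total_pairs
    p_w = word_counts[w] / total_words
    p_c = word_counts[c] / total_words
    pmi[index[w], index[c]] = max(0.0, np.log(p_wc / (p_w * p_c)))
```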
As we move from an undirected graphical model to a linear-chain CRF, let us look at reworking not just the PMI matrix but also the semantic embedding. First we modify the recommendations for creating the semantic embedding in the word space, discussed earlier from the paper by Johansson and Nieto Piña, by using a linear-chain CRF and its optimization instead of the steepest gradient descent that they perform. We do this by introducing a set of weights that equal 1 for a given state and 0 otherwise; in other words, we define indicator feature functions. The reason we choose this problem for applying the CRF is that it has a direct minimization problem, which we attempt to solve with a CRF instead. There we were using sense vectors built from a choice of hypernyms and hyponyms, and we applied a distance function to compute similarities between senses. The sense vectors were similar to the word vectors except that they drew their features from the thesaurus. Now we use the same features as states for an underlying sequence model and say that the words are generated by some sequence of states, where the previous state determines the current state and the current state has independent observations. We could do this merely by associating each word with its state. By drawing a sequence model, although less faithful than the generative directed models, we are able to form the CRF directly. This just serves as an example to illustrate how to apply the sequence model directly. The joint distribution that we used to form the sense vectors in the first place is taken as the log probability, so log p(y)p(x|y) becomes lambda_y + sum_j lambda_{y,j} x_j, where the first term is the bias weight and the remaining terms are the feature weights. The two together, when exponentiated and normalized, give the conditional distribution. Therefore it is possible to rewrite the joint distribution in terms of bias and feature weights, after which we already know how to proceed: define feature functions that are non-zero for a single class label, write down the objective function, and apply the Conjugate Gradient method.
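To make the bias-plus-feature-weight form concrete, here is a minimal sketch of the exponentiated and normalized conditional distribution p(y|x); the label count, feature dimensionality, and weight values are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (illustrative values): the log-linear form
#   score(y, x) = lambda_y + sum_j lambda_{y,j} * x_j
# exponentiated and normalized over labels gives p(y | x).
num_labels, num_features = 3, 5                         # assumed sizes
rng = np.random.default_rng(0)
bias = rng.normal(size=num_labels)                      # lambda_y, one per label
weights = rng.normal(size=(num_labels, num_features))   # lambda_{y,j}

def conditional_distribution(x):
    """Return p(y | x) for a single observation vector x."""
    scores = bias + weights @ x          # unnormalized log-probabilities
    scores -= scores.max()               # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum() # normalize over the label set

x = rng.random(num_features)
print(conditional_distribution(x))       # probabilities sum to 1 across labels
```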
The steps being considered for introducing a semantic CRF are as follows (a sketch of the last two steps appears after the list):
1) Generate the sense embedding network in the same word space.
2) Generate a PMI word-context matrix that has both positive and negative samples. Both kinds of samples should include semantic content such as hyponyms and hypernyms.
3) Establish a conditional distribution from the top PMIs and the top senses.
4) Maximize the conditional log-likelihood of this distribution.
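Below is a minimal sketch of steps 3 and 4 under assumed inputs: given feature vectors built from top-PMI contexts and sense labels drawn from the semantic embedding, it fits the bias and feature weights by maximizing the conditional log-likelihood (equivalently, minimizing its negative) with SciPy's Conjugate Gradient optimizer. The data, shapes, and random values are illustrative assumptions, not the original paper's setup.

```python
import numpy as np
from scipy.optimize import minimize

# Minimal sketch (assumed data): X holds feature vectors built from top-PMI
# contexts, y holds sense labels drawn from the semantic embedding.
rng = np.random.default_rng(1)
num_examples, num_features, num_labels = 40, 6, 3
X = rng.random((num_examples, num_features))
y = rng.integers(num_labels, size=num_examples)

def negative_log_likelihood(params):
    """Negative conditional log-likelihood of the log-linear model."""
    bias = params[:num_labels]
    weights = params[num_labels:].reshape(num_labels, num_features)
    scores = bias + X @ weights.T                   # (examples, labels)
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    log_z = np.log(np.exp(scores).sum(axis=1))      # log partition per example
    return -(scores[np.arange(num_examples), y] - log_z).sum()

initial = np.zeros(num_labels + num_labels * num_features)
result = minimize(negative_log_likelihood, initial, method="CG")
print(result.fun)  # minimized negative conditional log-likelihood
```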