Today we continue discussing the conditional random field. A random field is a particular distribution within a family of distributions. We were comparing discriminative and generative modeling. By modeling the conditional distribution directly, we avoid having to model dependencies among the observations, which makes discriminative modeling less prone to violations of the independence assumptions between features. Conditional random fields are discriminative. A CRF is written in terms of feature functions, where each feature function covers either a state transition or a state-observation pair. In the undirected graphical model we had an exponential form over vectors; now instead we use a set of weights that are shared across all the classes. When a weight takes a nonzero value for a single class and zero otherwise, it becomes a feature function. We have now expressed the exponentials in terms of feature functions.
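As a rough illustration of how a shared weight that is nonzero for a single class acts as an indicator feature function, here is a minimal sketch of the state-observation case in Python; the tag names, words, and weights are illustrative assumptions, not notation from Sutton and McCallum:

import math

# A feature function is an indicator: it fires (returns 1.0) for one
# particular (class, observation) combination and is 0 otherwise.
def f_loc_boston(label, word):
    return 1.0 if (label == "LOCATION" and word == "boston") else 0.0

def f_per_capitalized(label, word):
    return 1.0 if (label == "PERSON" and word[0].isupper()) else 0.0

feature_functions = [f_loc_boston, f_per_capitalized]
weights = [1.2, 0.7]   # one shared weight per feature function

def score(label, word):
    # The exponential form: exp(sum_k theta_k * f_k(label, word))
    return math.exp(sum(w * f(label, word)
                        for w, f in zip(weights, feature_functions)))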
In the semantic network this applies directly. We model the conditional probability p(c|w; theta) as the exponential of the dot product of the vector for class c and the vector for word w, divided by the sum of such exponentials over all classes, where c is the semantic class and w is the word encountered in the text or vocabulary. This is an undirected graphical model, and we can recast it with feature functions in just the same way. When we express it in terms of feature functions, the indicator features capture the current word's identity. But we are not restricted to indicator functions: we can exploit richer features of the input, such as prefixes and suffixes, the context surrounding the word, and even named entities. This leads to the general definition of the linear-chain CRF, where the set of weights is replaced by parameter vectors.
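A minimal sketch of that conditional, assuming toy two-dimensional vectors for the classes and words (all names and numbers here are illustrative only):

import math

# Toy embedding vectors; in practice these are the learned parameters theta.
class_vectors = {"PERSON": [0.9, 0.1], "LOCATION": [0.2, 1.1]}
word_vectors  = {"boston": [0.3, 1.0], "alice":  [1.0, 0.2]}

def p_class_given_word(c, w):
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    # exp(theta_c . x_w) normalized over every class c'
    scores = {ci: math.exp(dot(vi, word_vectors[w]))
              for ci, vi in class_vectors.items()}
    return scores[c] / sum(scores.values())

print(p_class_given_word("LOCATION", "boston"))  # about 0.69 with these toy numbers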
In a CRF we allow the score of the transition (i, j) to depend on the current observation vector by adding a feature that conjoins the transition with the observation, and this is commonly used in text applications. Note that in iterations over time steps, the observation argument is written as a vector x_t that contains all the components of the global observations x that are needed for computing features at time t. For example, if the CRF uses the next word as a feature, then x_t is assumed to include the identity of that next word.
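As a sketch, a linear-chain CRF feature that conjoins a transition with the observation might look like the following; the tag names and the capitalization test are my own illustrative choices:

# Sketch of a linear-chain CRF feature that conjoins a tag transition with
# the current observation (tag names and the word test are illustrative).
def f_transition_obs(y_prev, y_curr, x, t):
    # Fires only for the transition OTHER -> LOCATION when the current
    # token is capitalized; x is the whole observation sequence, and the
    # per-step vector x_t could also pull in neighbors such as x[t + 1].
    word = x[t]
    return 1.0 if (y_prev == "OTHER" and y_curr == "LOCATION"
                   and word[0].isupper()) else 0.0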
The parameter vectors are estimated the same way we would estimate any conditional distribution. Given the sequences of inputs and the corresponding sequences of predictions, we estimate the parameters by penalized maximum likelihood, that is, by maximizing the sum over training sequences of the log of the conditional probability of each label sequence given its input sequence, with a penalty term to regularize the parameters.
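A minimal sketch of that objective, assuming a placeholder function log_p that returns the per-sequence conditional log probability and an L2 penalty with variance sigma2 (both are assumptions for illustration, not a fixed API):

# Penalized (L2-regularized) conditional log-likelihood over a training set.
# log_p(labels, inputs, theta) is assumed to return log p(labels | inputs; theta)
# for one sequence; sigma2 controls the strength of the penalty.
def penalized_log_likelihood(data, theta, log_p, sigma2=10.0):
    ll = sum(log_p(labels, inputs, theta) for inputs, labels in data)
    penalty = sum(w * w for w in theta) / (2.0 * sigma2)
    return ll - penalty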
The assumption that the training sequences are independent of one another is something we can relax later.
Courtesy: Sutton and McCallum
#codingexercise
int ResetKthBitFromLastInBinaryRepresentation(int n, int k)
{
    // Binary string of n, most significant bit first, e.g. 13 -> "1101"
    char[] binary = Convert.ToString(n, 2).ToCharArray();
    // The k-th bit from the last sits at index length - k
    int pos = binary.Length - k;
    if (pos >= 0 && binary[pos] == '1')
        binary[pos] = '0';   // reset (clear) the bit
    // Convert the binary string back to an integer
    return Convert.ToInt32(new string(binary), 2);
}
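With the sketch above, ResetKthBitFromLastInBinaryRepresentation(13, 3) clears the third bit from the right: 1101 in binary becomes 1001, which is 9. The same result can be obtained without the string round trip as n & ~(1 << (k - 1)), with k counted 1-indexed from the least significant bit.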