Today we continue our discussion of the conditional random field (CRF). A random field is a particular distribution within a family of distributions. We were discussing discriminative and generative modeling. A generative model is based on the joint distribution of the states and the observations. A discriminative model is based on the conditional distribution of the states given the observations. The main advantage of discriminative modeling is that it is better suited to rich, overlapping features. By modeling the conditional distribution directly, we avoid having to model the dependencies among the observations. This makes discriminative modeling less prone to violations of the independence assumptions between features.
Conditional random fields are discriminative models. They make independence assumptions among the states but not among the observations.
We also saw the advantages of sequence modeling, in which the output variables are arranged in a linear chain. Combining the linear chain with discriminative modeling yields the linear-chain CRF. A CRF is written in terms of feature functions. Each feature function has the form f_z(y_t, y_{t-1}, x_t), which says that each state depends directly only on the previous state and on its own observation. In other words, there is one feature for every state transition and one feature for every state-observation pair, and each feature is an indicator that takes the value 1 when the states (or the state and the observation) equal a particular instance and 0 otherwise.
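To make the indicator features concrete, here is a minimal sketch in Python. The label set, the vocabulary, and all function names are assumptions chosen for illustration, and the normalizer is computed by brute-force enumeration, which is only feasible for toy sequences (practical implementations use the forward algorithm instead).

```python
import itertools
import math

STATES = ["N", "V"]          # hypothetical label set
VOCAB = ["dogs", "bark"]     # hypothetical observation vocabulary

def transition_feature(i, j):
    """Indicator that fires when y_{t-1} == i and y_t == j."""
    return lambda y_t, y_prev, x_t: 1.0 if (y_prev == i and y_t == j) else 0.0

def observation_feature(i, o):
    """Indicator that fires when y_t == i and x_t == o."""
    return lambda y_t, y_prev, x_t: 1.0 if (y_t == i and x_t == o) else 0.0

# One feature per state transition, then one per state-observation pair.
FEATURES = (
    [transition_feature(i, j) for i in STATES for j in STATES]
    + [observation_feature(i, o) for i in STATES for o in VOCAB]
)

def score(y, x, weights):
    """Unnormalized log-score: weighted features summed over all positions."""
    total = 0.0
    for t in range(len(x)):
        y_prev = y[t - 1] if t > 0 else None  # no predecessor at t = 0
        for w, f in zip(weights, FEATURES):
            total += w * f(y[t], y_prev, x[t])
    return total

def conditional_probability(y, x, weights):
    """p(y | x), normalizing over every possible label sequence."""
    z = sum(
        math.exp(score(list(candidate), x, weights))
        for candidate in itertools.product(STATES, repeat=len(x))
    )
    return math.exp(score(y, x, weights)) / z
```

With weights that reward the N-to-V transition and the pairs (N, dogs) and (V, bark), the labeling N, V receives the highest conditional probability for the observation sequence dogs, bark.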
In an undirected graphical model, we normalized the product of a family of local functions, where each local function had an exponential form and represented "sufficient statistics" of the observation and the class. There was a chosen number, z, of such local functions. Instead of using one weight vector per class, we used a single set of weights shared across all the classes. A function that is nonzero only for a single class becomes a feature function, and we can express the logistic regression model as a normalization of the exponentials of the weighted feature functions. The denominator is obtained by summing the numerator over all possible classes. Now we have gone a step further and replaced each feature function with a form that depends on the state transition as well as the state-observation pair.
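The logistic regression form described here can be sketched in a few lines of Python. The class names, the input dimension, and the helper names are assumptions made for illustration; the point is only that every feature function is nonzero for a single class, one weight list is shared by all classes, and the denominator is the numerator summed over the classes.

```python
import math

CLASSES = ["spam", "ham"]   # hypothetical classes
NUM_INPUTS = 2              # hypothetical number of input dimensions

def make_feature(target_class, j):
    """f(y, x) = x[j] when y is target_class, else 0 (nonzero for one class only)."""
    return lambda y, x: x[j] if y == target_class else 0.0

def make_bias(target_class):
    """Bias feature: 1 when y is target_class, else 0."""
    return lambda y, x: 1.0 if y == target_class else 0.0

# A single list of feature functions, matched by a single shared weight list.
FEATURES = (
    [make_feature(c, j) for c in CLASSES for j in range(NUM_INPUTS)]
    + [make_bias(c) for c in CLASSES]
)

def numerator(y, x, weights):
    """Exponential of the weighted sum of feature functions for class y."""
    return math.exp(sum(w * f(y, x) for w, f in zip(weights, FEATURES)))

def predict_proba(x, weights):
    """Normalize by summing the numerator over all possible classes."""
    z = sum(numerator(c, x, weights) for c in CLASSES)
    return {c: numerator(c, x, weights) / z for c in CLASSES}
```

With all weights at zero, every class receives the same probability, because each numerator equals 1.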
In the skip-gram model, we were using an undirected graphical model. We could now instead use semantic feature functions derived from a semantic embedding in the same word space, along with the conditional representation. If we were to do named-entity recognition, we could consider semantic feature functions there as well.
For the semantic feature function, we could use a conditional distribution over the semantic classes.
And we would not even have to do class/label propagation.
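One possible reading of a semantic feature function is sketched below in Python. Everything in it is an assumption for illustration: the two-dimensional vectors are made-up stand-ins for learned word embeddings, the class prototypes are hypothetical, and the softmax over cosine similarities is just one way to obtain a conditional distribution over semantic classes.

```python
import math

# Toy 2-d vectors standing in for learned word embeddings (made-up values).
EMBEDDINGS = {"paris": [0.9, 0.1], "alice": [0.1, 0.9]}
# Hypothetical semantic classes, each represented by a prototype vector.
CLASS_PROTOTYPES = {"LOC": [1.0, 0.0], "PER": [0.0, 1.0]}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def semantic_class_distribution(word):
    """Softmax over cosine similarities, read as p(class | word)."""
    sims = {c: cosine(EMBEDDINGS[word], p) for c, p in CLASS_PROTOTYPES.items()}
    z = sum(math.exp(s) for s in sims.values())
    return {c: math.exp(s) / z for c, s in sims.items()}

def semantic_feature(y_t, y_prev, x_t):
    """Real-valued feature: probability that word x_t belongs to class y_t."""
    return semantic_class_distribution(x_t)[y_t]
```

The feature keeps the (y_t, y_{t-1}, x_t) argument order used for the CRF feature functions, so in principle it could sit alongside the indicator features in a named-entity tagger; here it ignores the previous state.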
Courtesy: Sutton and McCallum
