Wednesday, October 12, 2016

Today we continue discussing the conditional random field. We were discussing graphical models and their application in the naive Bayes classifier and the maximum entropy classifier, both of which are used in natural language processing. The naive Bayes classifier is a generative, directed model: it is built from the conditional probability of each feature given the class, under the assumption that the features are independent given the class. The maximum entropy classifier is its discriminative counterpart: it models the log probability of each class as a linear function of the features. Instead of using one weight vector per class in the latter model, we define a set of feature functions that is shared across all the classes; each feature function is non-zero only for a single class. There is one feature function for each feature weight, which equals the value of the feature when the class matches and zero otherwise, and one for each bias weight, which equals 1 when the class matches and zero otherwise. This feature-function notation carries over to the conditional random field.
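As a sketch of this notational trick, following the presentation in Sutton and McCallum (the symbols x_j for features, y for the class, θ for weights, and f for feature functions are their notation, not something defined earlier in these notes):

```latex
% Maximum entropy / multinomial logistic regression: one bias \theta_y and
% one weight \theta_{y,j} per class y and feature x_j
p(y \mid x) = \frac{1}{Z(x)} \exp\Big\{ \theta_y + \sum_{j=1}^{K} \theta_{y,j}\, x_j \Big\}

% Feature functions that are non-zero only for a single class y':
% one per feature weight and one per bias weight
f_{y',j}(y, x) = \mathbf{1}\{y' = y\}\, x_j
\qquad
f_{y'}(y, x) = \mathbf{1}\{y' = y\}

% Re-indexing all of these as f_k with weights \theta_k gives the log-linear form
p(y \mid x) = \frac{1}{Z(x)} \exp\Big\{ \sum_{k} \theta_k f_k(y, x) \Big\}
```

Written this way, the same log-linear template extends from a single class variable to a sequence of states, which is exactly what the conditional random field does.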
While classifiers predict only a single class variable, graphical models can be used to model many variables that are interdependent. When the output variables are arranged in a sequence or linear chain, we get a sequence model. This is the approach taken by the hidden Markov model (HMM). An HMM models a sequence of observations by assuming that there is an underlying sequence of states. For example, in the named entity recognition task, we try to identify proper names and classify them as person, location, organization, or other. Each state depends only on the previous state and is independent of all of its earlier ancestors. An HMM further assumes that each observation depends only on the current state. The model is therefore specified by three probability distributions: the distribution over initial states, the transition distribution from one state to the next, and the observation distribution of an observation given the current state.
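A minimal sketch of the joint distribution this gives, assuming states y_t and observations x_t for t = 1, …, T (this notation is not fixed in the notes above), with the initial-state distribution folded in as p(y_1 | y_0):

```latex
% Hidden Markov model: transition distribution times observation distribution
p(\mathbf{y}, \mathbf{x}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)
```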
When we want to include interdependent features of the observations in a generative model, this is usually done in one of two ways: enhance the model to represent dependencies among the inputs, or make simplifying independence assumptions. The first approach is difficult because we have to maintain tractability. The second approach can hurt performance. The difference in behavior between such a model and its conditional counterpart is largely due to the fact that one is generative and the other is discriminative.
Generative means the model is based on the joint distribution of the states and the observations; discriminative means the model is based on the conditional distribution of the states given the observations. Two models that differ only in this respect form a generative-discriminative pair: naive Bayes and maximum entropy are one such pair, and the HMM and the linear-chain conditional random field are another.
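As a sketch of the distinction for a single class y and observation x:

```latex
% Generative: model the joint and recover the conditional via Bayes' rule
p(y, x) = p(y)\, p(x \mid y), \qquad
p(y \mid x) = \frac{p(y, x)}{\sum_{y'} p(y', x)}

% Discriminative: model p(y \mid x) directly, without modeling p(x) at all
```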


Generative models have benefits of their own, but the main advantage of discriminative modeling is that it is better suited to rich, overlapping features. By modeling the conditional distribution directly, we avoid having to model the dependencies among the observations. This makes discriminative modeling much less sensitive to violations of the independence assumptions between the features.
Conditional random fields make independence assumptions among the states, but not among the observations.
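A minimal sketch of the linear-chain conditional random field this leads to, reusing the feature-function notation above (again following Sutton and McCallum rather than anything defined in these notes):

```latex
% Linear-chain CRF: a conditional, log-linear model over the whole state sequence
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
  \prod_{t=1}^{T} \exp\Big\{ \sum_{k} \theta_k f_k(y_t, y_{t-1}, x_t) \Big\}

% The normalizer sums over all possible state sequences
Z(\mathbf{x}) = \sum_{\mathbf{y}} \prod_{t=1}^{T}
  \exp\Big\{ \sum_{k} \theta_k f_k(y_t, y_{t-1}, x_t) \Big\}
```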

Courtesy: Sutton and McCallum
