Tuesday, October 11, 2016

Today we discuss conditional random fields. When an input has multiple, diverse features that depend on each other, one way to represent them is with a conditional random field. We can see this clearly in relational data, which have two characteristics: first, each entity has a rich set of features, and second, there are statistical dependencies between the entities that we wish to model. These relationships and dependencies are naturally represented by a graph. Graph-based models have been used to represent the joint probability distribution p(y, x), where the variables y are the attributes of the entities that we wish to predict and the input variables x are our observed knowledge about the entities.

However, if we model the joint distribution, then modeling p(x) requires capturing complex dependencies among the inputs. Modeling these dependencies can lead to intractable models, but ignoring them can reduce performance. A solution is to model the conditional distribution p(y|x) directly, which is sufficient for classification, and this is exactly what a conditional random field (CRF) does. A CRF is simply a conditional distribution p(y|x) with an associated graphical structure. Because the model is conditional, dependencies among the input variables x do not need to be represented explicitly, and we can use rich, overlapping input features.
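To make this concrete, here is a minimal sketch (my own illustration, not code from the tutorial) of a tiny linear-chain CRF over three tokens and two labels. The emission weights W, transition weights T, and feature vectors x are made-up numbers, and p(y|x) is computed by brute-force enumeration rather than the forward algorithm; the point is that the normalizer Z(x) sums only over label sequences for the given input, so p(x) is never modeled.

import itertools
import numpy as np

labels = [0, 1]                       # e.g. 0 = OTHER, 1 = ENTITY
x = np.array([[0.2, 1.0],             # observed feature vectors, one per token
              [1.5, 0.3],
              [0.1, 0.9]])
W = np.array([[ 0.5, -0.2],           # emission weights, label x feature (made up)
              [-0.3,  0.8]])
T = np.array([[ 0.4, -0.1],           # transition weights, label -> label (made up)
              [-0.1,  0.6]])

def score(y_seq):
    # Unnormalized log-score of one label sequence for this particular x.
    s = sum(W[y_seq[t]] @ x[t] for t in range(len(x)))
    s += sum(T[y_seq[t - 1], y_seq[t]] for t in range(1, len(x)))
    return s

# Z(x) sums over all label sequences, but only for the observed x.
all_seqs = list(itertools.product(labels, repeat=len(x)))
Z = sum(np.exp(score(y_seq)) for y_seq in all_seqs)

y = (1, 0, 1)
print("p(y|x) =", np.exp(score(y)) / Z)   # conditional probability of one labeling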
We now review current training and inference techniques for conditional random fields. Unlike linear-chain models, general CRFs can capture long-distance dependencies between labels. For example, when a noun is repeated several times in a text, each mention may contribute different evidence about the entity, and since all mentions of the same entity should receive the same label, it helps to model them jointly. The skip-chain CRF is one such model: it jointly performs segmentation and collective labeling of extracted mentions.
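As a concrete, hypothetical illustration of the skip-chain idea, the sketch below builds the pairwise factor scopes for one sentence: ordinary chain edges between neighbouring positions, plus skip edges connecting repeated capitalized tokens so that identical mentions are labeled jointly.

# A small illustration (my own, not the tutorial's): build the edges of a
# skip-chain CRF for one sentence. Chain edges connect neighbouring positions;
# skip edges connect repeated capitalized tokens so their labels are decided jointly.
tokens = ["Speaker", "John", "Smith", "said", "John", "will", "present"]

chain_edges = [(t - 1, t) for t in range(1, len(tokens))]

skip_edges = []
for i in range(len(tokens)):
    for j in range(i + 1, len(tokens)):
        if tokens[i] == tokens[j] and tokens[i][0].isupper():
            skip_edges.append((i, j))

print(chain_edges)   # [(0, 1), (1, 2), ..., (5, 6)]
print(skip_edges)    # [(1, 4)] -- the two mentions of "John" share a factor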
When these probability distributions are drawn as graphs, adjacency is based on the factorization: two variables are connected when they appear together in the same local function. The main idea is to represent a distribution over a large number of random variables as a product of local functions, each of which depends on only a small number of variables. For example, a local function may score how compatible a pair of values is, using a similarity measure such as pointwise mutual information (PMI).
This undirected graphical model can also be expressed as a factor graph, a bipartite graph in which each variable node is connected to the factor nodes it participates in. Each factor node represents one local function: it is connected to exactly the variables that function depends on, and it maps an assignment of those variables to a non-negative value.
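A small sketch of this bipartite structure, with invented compatibility values: three binary variables a, b, c, two local functions f1(a, b) and f2(b, c), and the unnormalized joint computed as their product.

import itertools

f1 = lambda a, b: 2.0 if a == b else 0.5     # made-up compatibility values
f2 = lambda b, c: 3.0 if b == c else 1.0

# Bipartite structure made explicit: each factor lists the variable nodes it touches.
factors = [({"a", "b"}, f1), ({"b", "c"}, f2)]

def unnormalized(assign):
    p = 1.0
    for scope, f in factors:
        args = [assign[v] for v in sorted(scope)]   # values of the variables this factor touches
        p *= f(*args)
    return p

Z = sum(unnormalized(dict(zip("abc", vals)))
        for vals in itertools.product([0, 1], repeat=3))
print(unnormalized({"a": 1, "b": 1, "c": 1}) / Z)   # probability of one assignment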
A directed graphical model, also known as a Bayesian network, is based on a directed graph and factorizes into a product of conditional distributions, one for each node given its parents. The term generative model refers to a directed graphical model in which the outputs topologically precede the inputs; in other words, the model first generates the output and then generates the inputs conditioned on it.
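The following sketch (with invented probabilities) shows what "outputs precede inputs" means operationally: the class y is drawn first, and the observed features x are then generated conditioned on it, so the joint factorizes as p(y, x) = p(y) * prod_k p(x_k | y).

import random

p_y = {"person": 0.3, "other": 0.7}                      # prior over the output (made up)
p_x_given_y = {                                          # per-feature conditionals (made up)
    "person": {"capitalized": 0.9, "ends_in_ing": 0.05},
    "other":  {"capitalized": 0.2, "ends_in_ing": 0.30},
}

def generate():
    # Outputs first: sample the class label from its prior ...
    y = random.choices(list(p_y), weights=list(p_y.values()))[0]
    # ... then inputs: generate each observed feature given that class.
    x = {feat: random.random() < prob for feat, prob in p_x_given_y[y].items()}
    return y, x

print(generate())   # e.g. ('other', {'capitalized': False, 'ends_in_ing': True})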
Directed graphical models are useful for classification. The naive Bayes classifier, widely used in natural language processing, is a generative directed model: it is based on a class prior together with the conditional probability of each feature given the class. Its discriminative counterpart is the maximum entropy classifier, which models the log probability of each class directly as a linear function of the input features. Instead of using one weight vector per class in the latter model, a single set of weights can be defined that is shared across all classes by introducing feature functions that are nonzero only for a single class. A feature-weight function takes the value of an input feature when the label equals a particular class and zero otherwise, and a bias-weight function takes the value 1 when the label equals that class and zero otherwise. The notation used for conditional random fields is built from exactly these feature functions.
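Here is a brief sketch of that feature-function view of the maximum entropy classifier, using hypothetical classes and random weights: each class has its own block of weights and a bias, the corresponding feature functions are nonzero only for that class, and p(y|x) is a softmax over the resulting scores.

import numpy as np

classes = ["person", "location", "other"]
x = np.array([1.0, 0.0, 1.0])                 # observed feature vector (hypothetical)

theta = {c: np.random.randn(len(x)) for c in classes}   # one weight block per class
bias  = {c: np.random.randn() for c in classes}

def class_score(c, x):
    # Only the feature functions tied to class c are nonzero, so the score for
    # class c uses just that class's feature weights and its bias weight.
    return theta[c] @ x + bias[c]

scores = np.array([class_score(c, x) for c in classes])
p = np.exp(scores - scores.max())             # softmax with a stability shift
p /= p.sum()
print(dict(zip(classes, p)))                  # p(y|x) for each class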

Courtesy: Sutton and McCallum
