We were discussing parameter vectors with conditional random fields. A random field is a particular distribution within a family of distributions. A CRF is discriminative, which means it models the conditional distribution directly and so does not need to model dependencies among the observations. This makes it less sensitive to violations of independence assumptions among the features. A CRF is written in terms of feature functions. Each feature function typically includes features for state transitions and features for state observations. Feature functions are not limited to indicator functions that take the value one for a particular class instance and zero otherwise; we can exploit richer, real-valued features of the input, with the weights collected into a parameter vector.

The parameter vector is estimated with the same sum-and-log notation we have seen before, in what is called penalized maximum likelihood. To do this, we take the log likelihood, which is the log of the conditional probability of the correct labeling summed over each training input. Since the CRF is defined in terms of conditional probabilities, we can write this log likelihood in terms of the state-transition and state-observation feature functions and the parameter vector. To avoid overfitting when the norm of the weight vector grows too large, we penalize large weights; this is called regularization. A regularization parameter determines the strength of the penalty. Ideally this parameter would be chosen by searching over a range of values; in practice only a few values are usually tried, varying the regularization parameter by factors of 10. Other choices of regularization are possible, but this kind of tuning is well known in statistics and is applied directly.
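To make that concrete, the model and the penalized objective described above can be written out as follows. The notation is the common convention from the CRF literature rather than anything defined earlier in this post: lambda_k are the weights, f_k the feature functions, and sigma^2 controls the strength of the penalty.

p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\Big( \sum_{t} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big), \qquad Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\Big( \sum_{t} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)

\ell(\lambda) = \sum_{i=1}^{N} \log p\big(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}\big) \;-\; \sum_{k} \frac{\lambda_k^2}{2\sigma^2}

Here sigma^2 plays the role of the regularization parameter: larger values penalize the weights less, and in practice one tries a handful of values spaced by factors of 10.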
The conditional log likelihood thus formed can be maximized by taking partial derivatives with respect to the parameters. In this case, the gradient has three components. The first term is the expected value of the feature function under the observed (empirical) distribution. An expected value is the sum over all possible values of a variable, each multiplied by the probability of its occurrence; here the observed feature counts on the training data serve directly as this expectation. The second term is the derivative of the log of the normalization constant, which sums over all possible state sequences. This second term works out to be the expectation of the feature function under the model distribution rather than the observed one. The optimum is reached when the gradient is zero, that is, when the two expectations are equal. This pleasing interpretation is a standard result about maximum likelihood estimation. The third term is simply the derivative of the regularization penalty on the parameter vector and is the easiest of the three to compute. A number of optimization techniques can be chosen to perform this maximization. Because the objective is the logarithm of an exponential-family model plus a concave penalty, it is concave, which makes such techniques well behaved. The simplest would be steepest ascent, but it takes too many iterations. The difficulty is that there can be millions of parameters, so each computational step is expensive. As a result, current techniques rely on approximations at intermediate steps, as in quasi-Newton methods. Conjugate gradient methods also use such approximations and are well suited to these kinds of problems.
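Here is a minimal sketch of that optimization in Python, assuming a toy linear-chain CRF small enough to enumerate every labeling by brute force (so no forward-backward recursion is needed). The label set, feature functions, and training data below are made up for illustration, and scipy's L-BFGS-B routine stands in for the quasi-Newton methods mentioned above.

import itertools
import numpy as np
from scipy.optimize import minimize

LABELS = [0, 1]        # toy state space
SIGMA2 = 10.0          # regularization parameter (variance of the Gaussian penalty)

def features(prev_y, y, x, t):
    # one state-transition feature and one state-observation feature
    return np.array([
        1.0 if prev_y == y else 0.0,   # transition indicator
        1.0 if x[t] == y else 0.0,     # observation indicator
    ])

def score(w, x, ys):
    # unnormalized log score of a full labeling ys for input x
    return sum(w @ features(ys[t - 1] if t else None, ys[t], x, t)
               for t in range(len(x)))

def neg_penalized_loglik(w, data):
    # negative penalized conditional log likelihood and its gradient
    nll, grad = 0.0, np.zeros_like(w)
    for x, y in data:
        # brute-force normalizer Z(x): enumerate every possible labeling
        all_ys = list(itertools.product(LABELS, repeat=len(x)))
        scores = np.array([score(w, x, ys) for ys in all_ys])
        logZ = np.logaddexp.reduce(scores)
        probs = np.exp(scores - logZ)

        # first gradient term: observed (empirical) feature counts
        observed = sum(features(y[t - 1] if t else None, y[t], x, t)
                       for t in range(len(x)))
        # second gradient term: expected feature counts under the model
        expected = sum(p * sum(features(ys[t - 1] if t else None, ys[t], x, t)
                               for t in range(len(x)))
                       for p, ys in zip(probs, all_ys))

        nll -= score(w, x, y) - logZ
        grad -= observed - expected
    # third term: the L2 (Gaussian) penalty on the weights and its derivative
    nll += w @ w / (2 * SIGMA2)
    grad += w / SIGMA2
    return nll, grad

# made-up training data: (observation sequence, gold label sequence)
data = [([0, 0, 1], (0, 0, 1)), ([1, 1, 0], (1, 1, 0))]
w0 = np.zeros(2)
result = minimize(neg_penalized_loglik, w0, args=(data,),
                  jac=True, method="L-BFGS-B")
print(result.x)   # learned transition and observation weights

The gradient returned alongside the objective is exactly the difference of the two expectations plus the regularization term, which is what lets a quasi-Newton routine like this converge in far fewer iterations than plain steepest ascent.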
So far this has been a fairly general discussion of what conditional random fields are, their benefits, how they are represented, their objective functions, and how those objectives are optimized, including techniques such as the conjugate gradient method, which ties back to things we have previously studied for text analysis.