Friday, October 21, 2016

Today we continue the discussion of conditional random fields that we began earlier. We have covered, in general terms, what random fields are, their benefits, their representations, and their objective functions and optimizations, including techniques such as the conjugate gradient (CG) method, which ties in with things we have previously studied for text analysis. We wanted to introduce sense embedding features into the word-context analysis. We can represent the conditional probability of a sense given its context using their corresponding vectors. Then we choose the parameters, numbered 1..d, for each of the (context, sense) pairs. We would like to set the parameters so that the conditional log likelihood is maximized, and we could do this with the conjugate gradient method.
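
As a concrete sketch of scoring a sense against a context with their vectors, the C# snippet below assumes a softmax over dot products of d-dimensional embeddings; the parameterization and all names here are illustrative assumptions, not taken from any particular implementation.

using System;
using System.Linq;

class SenseEmbeddingSketch
{
    // Dot product of two d-dimensional vectors.
    static double Dot(double[] a, double[] b) =>
        a.Zip(b, (x, y) => x * y).Sum();

    // p(sense_i | context) = exp(s_i . c) / sum_j exp(s_j . c)  (assumed softmax form)
    static double[] SenseProbabilities(double[][] senseVecs, double[] contextVec)
    {
        var scores = senseVecs.Select(s => Math.Exp(Dot(s, contextVec))).ToArray();
        double z = scores.Sum();               // partition function
        return scores.Select(s => s / z).ToArray();
    }

    static void Main()
    {
        // Two hypothetical sense vectors and one context vector with d = 3.
        var senses = new[] { new double[] { 0.9, 0.1, 0.0 },
                             new double[] { 0.1, 0.8, 0.3 } };
        var context = new double[] { 1.0, 0.2, 0.1 };
        var probs = SenseProbabilities(senses, context);
        Console.WriteLine(string.Join(", ", probs));
        // The conditional log likelihood to maximize is the sum over observed
        // (context, sense) pairs of log p(sense | context); a conjugate gradient
        // routine would adjust the vector entries 1..d accordingly.
    }
}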
We saw linear chain CRFs. Now we can generalize them and relax the requirement of a dependency only on the previous state. Instead we define CRFs with a general graphical structure. Such structures have been found to be very helpful for relational learning because they relax the iid assumption between entities.
So far, the models we have seen, CRFs included, have been applied with the training and testing data assumed to be independent. However, CRFs can also be used for within-network classification, where we can model probabilistic dependencies between the training and testing data. The generalization from linear chain CRFs to general CRFs amounts to moving from a linear-chain factor graph to a more general factor graph.
The general definition of a conditional random field is as follows: let G be a factor graph over Y. Then p(y|x) is a conditional random field if, for any fixed x, the distribution p(y|x) factorizes according to G. By this definition, every conditional distribution is a CRF, even if a trivial one. If F is the set of factors in G and each factor takes the exponential form, then each conditional distribution can be written as an exponential of weighted feature functions normalized by a partition function, similar to what we have seen before, except that this now applies to each factor in the set F.
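
Written out as a sketch, with notation assumed from the standard CRF presentation rather than spelled out here, the factorization would be:

p(y \mid x) = \frac{1}{Z(x)} \prod_{\Psi_A \in F} \exp\Big( \sum_{k} \theta_{Ak} \, f_{Ak}(x_A, y_A) \Big), \qquad Z(x) = \sum_{y} \prod_{\Psi_A \in F} \exp\Big( \sum_{k} \theta_{Ak} \, f_{Ak}(x_A, y_A) \Big)

Here each factor Ψ_A has its own weights θ_Ak and feature functions f_Ak over the variables x_A and y_A it touches, and Z(x) sums over all assignments to y.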
Models usually depend on parameter tying. For example, in the linear chain case, the same weights are used for the factors at each time step, so the transition parameters are shared across positions. The same applies to a general model and not just a chain model. To do the same, the factors of G are partitioned into C = {C1, C2, ..., Cp}, where each Cp is a clique template whose parameters are tied. A clique template is a set of factors that has a corresponding set of sufficient statistics and parameters. The weighted feature functions in the general definition can then be replaced by a combination of the sufficient statistics and their parameters, as sketched below.
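
With that partition, the tied-parameter form would read (again with assumed notation):

p(y \mid x) = \frac{1}{Z(x)} \prod_{C_p \in \mathcal{C}} \prod_{\Psi_c \in C_p} \exp\big( \theta_p^{\top} f_c(x_c, y_c) \big)

so a single parameter vector θ_p is shared by every factor instantiated from the template Cp.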
#codingexercise
Find the position of an element in a sorted array of infinite size:
int GetPosition(List<int> nums, int val)
{
    // Galloping phase: double the upper bound until it reaches or passes val.
    int low = 0;
    int high = 1;
    int cur = nums[0];
    while (cur < val)
    {
        low = high;
        high = 2 * high;      // widen the window exponentially
        cur = nums[high];     // read the new upper bound
    }
    return BinarySearch(nums, low, high, val);
}

int BinarySearch(List<int> nums, int low, int high, int val)
{
    int mid = (low + high) / 2;
    if (nums[mid] == val) return mid;
    if (low == high) return -1;   // window exhausted without a match
    if (nums[mid] < val)
        return BinarySearch(nums, mid + 1, high, val);
    else
        return BinarySearch(nums, low, mid, val);
}
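
For instance, a quick check of the sketch above, with a finite list standing in for the unbounded array:

var nums = new List<int> { 3, 5, 7, 9, 11, 13, 15, 17 };
Console.WriteLine(GetPosition(nums, 9));  // doubling stops at window [2, 4]; prints 3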
