Sunday, October 16, 2016

We have seen some generalized discussion about what random fields are, their benefits, their representations, the objective functions and their optimizations including techniques such as CG method which ties in things we have previously studied for the purposes of text analysis. We will now see how to apply it. At this point, we are merely forming hypothesis. We realize that the PMI  matrix helped find the word relationships with word vectors. We said that the semantic network embedded within the word-space can be used to find candidates for positive and negative sampling. We said that the log exponential form of the PMI matrix defines the undirected graphical model. We said we could replace this model with feature functions where features are defined in terms of both transitions from previous state to current state and observations for current state. Then we said we could generalize this with parameter vectors and minimize the objective function with Conjugate Gradient method.
As we move from an undirected graphical model, to linear chain CRF, let us take a look at reworking not just the PMI matrix but also the semantic embedding.  First we modify the recommendations for creating the semantic embedding in the word space discussed earlier from the paper by Johannson and Pina by using linear chain CRF and optimization instead of the steepest gradient descent that they perform. We do this by introducing a set of weights which equal 1 for a given state and zero otherwise. In other words, we define feature functions  The reason we choose this problem to apply the CRF is that it has a direct minimization problem which we attempt to solve with CRF instead. There we were using sense vectors from a choice of hypernyms and hyponyms and we applied distance function to compute similarities between senses. The sense vectors were similar to the word vectors except that they drew their features from the thesaurus. Now we use the same features as states for the underlying sequence model and say that the words are based on some sequence of states where the previous state determines the current state and the current state has independent observations. We could do this merely by associating each word with their state.  By drawing a sequence model, although less than perfect from the generative directed models, we are able to form the CRF directly.  This just serves as an example to illustrate how to apply the sequence model directly. The joint distribution that we used to form the sense vectors in the first place is taken as the log probability So p(y)p(y/x) becomes lambda-y + sum lambda-y. x where the first term is the bias weight and the second term is the feature weight. The two together when raised exponentially and normalized becomes the conditional distribution.  Therefore it is possible to rewrite the joint distributions in the form of bias and feature weights after which we already know how to proceed by defining feature functions that are non-zero for a single class label, representing an objective function and applying Conjugate Gradient method.

The Steps for introducing semantic CRF are being considered as follows:
1) Generate the sense embedding network in the same word space
2) Generate a PMI word-context matrix that has both positive and negative samples. Both the samples should include semantic content such as hyponyms and hypernyms.
3) We establish a conditional distribution from the top PMIs and the top senses.
4) Then we maximize the conditional log likelihood of this distribution.


Saturday, October 15, 2016

We were discussing parameter vectors with conditional random fields. A random field is a particular distribution in a family of distributions. It it discriminative which means  it models the conditional distributions directly which avoids the dependencies of the observations.This makes it less prone to violations from the assumptions of independence between the features.A CRF is written in the form of feature functions. Each feature function has a feature for state transition and a feature for state observation. Feature functions are not limited to indicator functions where a set of weights take the value one for a class instance and zero otherwise.  We could exploit other richer features of the input and replace the weights with parameter vectors. The parameter vectors are estimated with the same sum and log notations as we have seen for what is called penalized maximum likelihood. To do this, we take the log likelihood which is a conditional probability of the prediction summed over each of the input. However we showed the CRF as an improvement over the conditional probabilities, so we can write the log likelihood in terms of the state transition and state observation real valued feature functions and parameter vector. To avoid overfitting when the weight vector norms is too large, we penalize large weights and this is called regularizing it. A regularization parameter determines the strength of the penalty.  The choice of this parameter can ideally be found by iterating over the range of parameters. In practice, however, typically only a few are tried out by varying the regularization parameter by a factor of 10. There can be other choice of regularization parameters but this kind of tuning is very well known in statistics and therefore applied directly.
The conditional log likelihood thus formed can be maximized by taking partial derivatives. In this case, it results in three components.  The first term is the expected value of the feature function when the distribution is observed. An expected value is the sum of all possible values of a variable each multiplied by the probability of its occurrence. In this case we use product of feature weights and bias weights. directly for an average.  The second term is the derivative of the log of the normalization constant which sums over all the possible state sequences.This second term is the expectation of the feature function under the model distribution, not the observed one. The optimum solution is when the gradient is zero and the two expectations are equal. This pleasing interpretation is a standard result about maximum likelihood. The third term is merely a regularization of the parameter vectors and can therefore be treated as insignificant. A number of optimization techniques can be chosen to perform this optimization. Because the original objective function we started out with is a logarithm of the exponentials, it is concave which facilitates these techniques. The simplest one would be the steepest ascent but it would take too many iterations. The difficulty is that there are millions of parameters so computation steps are expensive. As a result, current techniques now involve approximations for interim steps such as in quasi-Newton method. Conjugate Gradient method also performs approximations and are suited for these kind of probems.
So far we have seen some pretty generalized discussion about what random fields, their benefits, their representations, the objective functions and their optimizations including techniques such as CG method which ties in things we have previously studied for the purposes of text analysis.

Friday, October 14, 2016

Today we continue discussing the conditional random field.  A random field is a particular distribution in a family of distributions. We were discussing the discriminative and generative modeling.By modeling the conditional distributions directly, we avoid the dependencies of the observations. This makes the discriminative modeling less prone to violations from the assumptions of independence between the features.The conditional random fields are discriminative. A CRF is written in the form of feature functions. Each feature function has a feature for state transition and a feature for state observation. In an undirected graphical model, we had an exponential form of vectors. Instead we now use a set of weights that are shared across all the class. When they take a non zero value for a single class and zero otherwise, they become feature functions. We have now expressed the exponentials in the form of feature functions.
In the semantic network, this applies directly. We model the conditional probability p(c|w;theta) using p(c|w, theta) = exp(vector_for_c.vector_for_w) divided by sum of all such exponentials. where c is the semantic class and w is the word from the text or vocabulary encountered. This is an undirected graphical model and we can refine it with feature functions just the same way. When we express it in terms of feature functions, we are using it to articulate the current word's identity. But we are not restricted to using indicator functions.  We could exploit other richer features of the input, such as prefixes and suffixes, context surrounding the word, and even named entities. This leads to the general definition of linear chain CRFs where the set of weights are replaced by parameter vectors.
In CRF, we allow the score of the tranisititon (i,j) to depend on the current observation vector by adding a feature that chains the transition and the observations  and this is commonly used in text applications. Note that in interations involving time steps, the observation argument is written as a vector xt which contains all the components of all global observations x that are needed for computing features at time t. If the CRF uses the next word as a feature, then the vector for the last word would have included the identity of the next word.

The parameter vectors are estimated the same way as we would any conditional distributions. We take the sequence of inputs and the sequence of predictions and estimate using the penalized maximum likelihood which is a way of saying we take the sum of the log of the conditional distribution of the prediction to the input.

The independence of each sequence is something we can relax later.

Courtesy : Sutton and McCallum
#codingexercise
Int ResetKthBitFromLastInBinaryRepresentation(int n, int k)

{

String binary;

ToBinary(n, ref binary);

Assert(binary.count > n);

If (n-k >= 0 &&  binary[n-k] == 0)

     binary[n-k] = 1;

Int num;

ToInt(binary, ref num, 0);

Return num;

}

Thursday, October 13, 2016

Today we continue discussing the conditional random field.  A random field is a particular distribution in a family of distributions. We were discussing the discriminative and generative modeling. Generative means it is based on the model of  joint distribution of a state with regard to observation. Discriminative means it is based on the model of conditional distribution of a state given the observation. The main advantage of the discriminative modeling is that it is better suited for rich overlapping features. By modeling the conditional distributions directly, we avoid the dependencies of the observations. This makes the discriminative modeling less prone to violations from the assumptions of independence between the features.
The conditional random fields are very much discriminative. They make independence assumptions between the states but not between the observations.
We also saw the advantages of sequence modeling which is an arrangement of linear chain of output variables. we combine both the linear chain as well as the discriminative modeling. This yields a linear chain CRF. A CRF is written in the form of feature functions. Each feature function has the form fz(yt, yt-1,xt) which goes to say that the states are dependent only on the previous state and that they are independent for their observations. In other words there is a feature, which we describe in terms of an indicator which takes the value 1 for the state to equal an instance and zero otherwise, for every state-transition and there is a feature for every state-observation pair.
In an undirected graphical model, we were normalizing a family of local functions or distributions and each local function had an exponential form and represented "sufficient statistics" in terms of observation and class. There were a chosen 'z' number of such local functions. But instead of using one vector per class, we used a set of weights that are shared across all the class. When it is non-zero for one class and 1 for that class, it becomes a feature function and we can express a logistic regression model in a normalization of the exponentials of the weighted feature functions. If we interpret it generatively, we take all possible values of the numerator and sum it to use as the denominator for each component. Now we have gone a step further and substituted each feature function with a form that is dependent on state-transition as well as state-observation.
 If we look at the skip-gram model, we were using the undirected graph model. Instead we could now use semantic feature function from a semantic embedding in the same word space along with conditional representations. If we were to do named-entity recognition, we could also consider semantic feature function there.

If we look at the feature function for the semantic, we could use a conditional distribution for the semantic classes.
And we don't even have to do class /label propagation.
Courtesy : Sutton and McCallum

Wednesday, October 12, 2016

Today we continue discussing the conditional random field. We were discussing the undirected graphical model and its application in naive Bayes classifier and maximum entropy classifier - both of which are used in natural language processing and are examples of a directed graphical model.  The naive Bayes classifier is based on conditional probability of each feature and the maximum entropy classifier is based on the log probability of each class. Instead of using one vector per class in the latter model, a set of weights are defined that is shared across all the classes. These are non-zero only for a single class and are called feature functions. The feature function  can be defined in terms of feature weights and bias weights. The feature weights is an indicator function of the feature which takes the value 1 when the feature equals an instance and zero otherwise. The bias weights similarly take the value 1 for the bias and zero otherwise. The notations used for conditional random field involve feature weights and bias weights.
While classifiers predict only a single class variable, the graphical model could be used to model many variables that are interdependent. When the output variables are arranged in a sequence or a linear chain, we get a sequence model. This is the approach taken by a hidden Markov Model. An HMM models a sequence of observations by assuming that there is a sequence of states. For example in named entity recognition task, we try to identify and classify proper names as person, location, organization and other.  Each state depends only on the previous state. And it is independent of all its ancestors. An HMM assumes that each observation variable depends only on the current state. This model is therefore specified by three probability distributions: the distribution p(y) over initial states, the transition distribution from one state to another and finally the observation distribution of the occurrence with respect to the state.
When we include interdependence between features, we use a generative model. This is usually done in one of two ways - enhance the model to represent dependencies among the inputs, or make simplifying independence assumptions. The first approach is difficult because we have to maintain tractability.  The second approach can hurt performance. The difference in their behaviors is largely due to the fact that one is generative and the other is discriminative.
Generative means it is based on the model of  joint distribution of a state with regard to observation. Discriminative means it is based on the model of conditional distribution of a state given the observation. We can also form a model based on generative-discriminative pair.


The generative model has many benefits. The main advantage of the discriminative modeling is that it is better suited for rich overlapping features. By modeling the conditional distributions directly, we avoid the dependencies of the observations. This makes the discriminative modeling less prone to violations from the assumptions of independence between the features.
The conditional random fields make independence assumptions between the states but not between the observations.

Courtesy : Sutton and McCallum

Tuesday, October 11, 2016

Today we discuss conditional random field.We said when we have diverse multiple features for a vector that depend on each other, one way to represent them is using conditional random fields. We can see this clearly in relational data that have two characteristics. First, each entity has a rich set of features and second there are statistical dependencies between the entites we wish to model. These relationships and dependencies between entities are naturally represented in graph.  Graph based models are have been used to represent the joint probability distribution p(y,x) where the variables y represent the attributes of the entities that we wish to predict and the input variables x represent our observed knowledge about the entities. However if we were to model joint distributions, then modeling the distribution p(x) can include complex dependencies. Modeling these dependencies among inputs can lead to intractable models, but ignoring them can lead to reduced performance. Therefore, a solution may be to directly model the conditional distribution p(y|x), which is sufficient for classification. This is what conditional random fields does. A conditional random field is simply a conditional distribution p(y|x) with an associated graphical structure. The  conditional dependencies among the input variable x now do not need to be represented explicitly and the input can assume to have rich input features.
A conditional random fields (CRF) allows us to directly model the conditional probability distribution p(y|x) which is sufficient for classification
We now review current training and inference techniques for conditional random fields. Unlike linear chain models, CRFs can capture long distance dependencies between labels. For example, when a noun is repeated several times in a text, each mention may contribute a different mention for the entity. Since all mentions have the same labels, it is helpful to use CRFs. Moreover, we can use skip-chain CRF which is a model that jointly performs segmentation and collective labeling of extracted mentions.
When graphs are drawn with these probability distributions, the adjacency is based on the factorizations. The main idea is to represent  a distribution over a large number of random variables by a product of local functions that depend on only a small number of variables. For example, the similarities are based on PMI.
This undirected graphical model can also be expressed as a factor graph which is a bipartite graph in which a variable node is connected to a factor node. A factor can be a set of one or more functions that map the outcomes to the observations.
A directed graphical model also known as Bayesian network, is based on a directed graph that factorize based on conditions. The term generative model is used to refer to a directed graphical model in which the outputs topologically precede the inputs. In other words, it propagates in forward only manner.
A directed graphical model is useful for classification. Both the naive Bayes classifier and the maximum entropy classifier are used in natural language processing and are examples of a directed graphical model.  The naive Bayes classifier is based on conditional probability of each feature and the maximum entropy classifier is based on the log probability of each class. Instead of using one vector per class in the latter model, a set of weights are defined that is shared across all the classes. These are non-zero only for a single class and are called feature functions. The feature function  can be defined in terms of feature weights and bias weights. The feature weights is an indicator function of the feature which takes the value 1 when the feature equals an instance and zero otherwise. The bias weights similarly take the value 1 for the bias and zero otherwise. The notations used for conditional random field involve feature weights and bias weights.

Courtesy : Sutton and McCallum

Monday, October 10, 2016

Considerations for large graph processing 
Facebook engineers who worked on large scale graphs with trillions of edges mentioned the following issues they faced: 
1) They needed to extend the basic pregel model of graph processing. For example, to perform k-means clustering, they had to enable generalizations for the iterations involved such as to make the framework more reusable and the tasks are abstracted per vertex. In this case, first, the vertex compute method calculates the distance to all the centroids while becoming the member of the nearest cluster. Then the centroids are recalculated and at the next outer iteration, the new centroids are available to every vertex. 
2) Pregel model also requires computations that are characterized as "thinking like a vertex" where as some computations need to be executed in a centralized fashion. 
3) Pregel only allows one message type and one message combiner, therefore computations at a vertex needed to fit all in. To allow computations to be distinctive, cleaner and more reusable, they added composable computation. 
4) When messages are distinct, commutative and associative, they can be combined and thus making the message more "aggregatable". For those that cannot be aggregated, they introduced a technique called "superstep splitting" where a fragment of a message is sent and a partial computation updates the state of the vertex value. This is all in memory. 
5) To improve performance and scalability, they had to introduce additional parallelization in the runtime and the shared contexts which they called "WorkerContext". Bottlenecks and overheads such as checkpointing were addressed by scheduling. This was finer level than what the infrastructure provided. 
6) They even optimized the memory utilization of the graph infrastructure because it allowed arbitrary vertex id, vertex value, edge and message classes. They did this with 1) serializing edges with byte array and 2) serializing messages on the server. 
7) They also improved parallelization with sharded aggregators that provided efficient shared state across workers. With this approach, each aggregator gets assigned to a randomly picked worker which then gathers the values, performs the aggregation and distributes the final values to master and other workers. This distributes the load that was otherwise entirely on the master. 
Since the graphs are large, it is partitioned as n parts on m machines. Computations are distributed across workers and aggregators.  
#codingexercise

Void ToBinary(int num, ref string binary)

{

    If (num == 0) return;

    binary = num %2 + binary;

    ToBinary(num/2, ref binary);

}

Void ToInt(string binary, ref int num, int k = 0)

{

    If (String.IsNullOrEmpty(binary)) return;

    Char c = binary.RemoveLast();

    Num += math.pow(2, k)  × c.toInt();

    ToInt(binary, ref num, k + 1);

}


Int ResetKthBit(int n, int k)

{

String binary;

ToBinary(n, ref binary);

Assert(binary.count > n);

If (binary[k] == 0)

     binary[k] = 1;

Int num;

ToInt(binary, ref num, 0);

Return num;

}