Monday, October 17, 2016

#codingexercise
Detect the number of three-node cycles in a graph.
We use depth-first search to traverse the graph and generate the paths on which cycles are found.
DFS (V, E, paths)
  for each vertex v in V
       v.color = white
       v.d = nil
  time = 0
  for each vertex v in V:
       if v.color == white:
           path = new Path();
           DFS-VISIT(V, E, v, path, ref paths)

DFS-VISIT (V, E, u, path, ref paths)
  time = time + 1
  u.d = time
  u.color = gray
  path.Add(u);
  for each vertex v adjacent to u
       if v.color == white
           DFS-VISIT(V, E, v, path, ref paths)
       else if v.color == gray          // back edge: v is an ancestor of u, so the segment of path from v to u is a cycle
           paths.Add(path.Clone());
  path.Remove(u);                       // backtrack
  u.color = black
  time = time + 1
  u.f = time


filter(paths):
    var d = new Dictionary<string, int>();
    foreach path in paths:
        foreach vertex v in path:
            if v.next and v.next.next and v.next.next.next == v:      // three hops return to v: a three-node cycle
                pathIdentifier = stringify(v, 3)                       // canonical key built from the three vertices
                if d.ContainsKey(pathIdentifier) == false:
                    d.Add(pathIdentifier, 1);
    return d.Count                                                     // number of distinct three-node cycles
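
A more direct way to count three-node cycles is to intersect adjacency entries; this is only a hedged sketch (adjacency-matrix triple testing rather than the DFS above, with a made-up example graph), requiring i < j < k so each triangle is counted exactly once:

using System;

class TriangleCount
{
    // Counts three-node cycles (triangles) in an undirected graph given as an
    // adjacency matrix; the ordering i < j < k counts each triangle once.
    static int CountTriangles(bool[,] adj, int n)
    {
        int count = 0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (adj[i, j])
                    for (int k = j + 1; k < n; k++)
                        if (adj[j, k] && adj[i, k])
                            count++;
        return count;
    }

    static void Main()
    {
        // four vertices with edges 0-1, 1-2, 0-2, 2-3 => exactly one triangle (0,1,2)
        int n = 4;
        var adj = new bool[n, n];
        void AddEdge(int u, int v) { adj[u, v] = true; adj[v, u] = true; }
        AddEdge(0, 1); AddEdge(1, 2); AddEdge(0, 2); AddEdge(2, 3);
        Console.WriteLine(CountTriangles(adj, n));   // prints 1
    }
}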


Find the greatest number we can get by rearranging the digits of a number n using at most k shifts (swaps of adjacent digits), as shown below:
Take n=43592169 and k=5
1st swap: 43952169
2nd swap: 49352169
3rd swap: 94352169
4th swap: 94532169
5th swap: 95432169 - final number.

int GetMaxWithKShifts(int n, int k)
{
    var digits = n.ToDigits();            // assumed helper: digits of n, most significant first
    int rem = k;                          // shifts remaining
    for (int i = 0; i < digits.Length && rem > 0; i++)
    {
        // index of the largest digit at or after position i (see the sketch of the helpers below)
        int index = GetMaxDigitIndexNext(digits, i);
        if (index == i || digits[i] == digits[index]) continue;   // already the best digit here
        if (index - i <= rem)
        {
            Shift(digits, i, index);      // rotate digits[i..index] right: costs index - i adjacent shifts
            rem -= index - i;
        }
    }
    return digits.ToInt();                // assumed helper: digits back to an integer
}
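
The two helpers used above are not spelled out in the snippet; here is one possible sketch of them (the names follow the snippet, and the behavior is an assumption chosen to be consistent with the rem accounting above):

// GetMaxDigitIndexNext: index of the largest digit at or after position i,
// taking the leftmost occurrence so the rotation below costs the fewest shifts.
static int GetMaxDigitIndexNext(int[] digits, int i)
{
    int best = i;
    for (int j = i + 1; j < digits.Length; j++)
        if (digits[j] > digits[best])
            best = j;
    return best;
}

// Shift: rotate digits[i..index] right by one place so digits[index] lands at
// position i; this is equivalent to index - i swaps of adjacent digits.
static void Shift(int[] digits, int i, int index)
{
    int moved = digits[index];
    for (int j = index; j > i; j--)
        digits[j] = digits[j - 1];
    digits[i] = moved;
}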

Alternatively, the maximum number may be obtained recursively by trying every swap that puts a larger digit in front and keeping the best arrangement seen. The results of the greedy version above and the recursive version below can differ, because the first counts adjacent shifts while the second counts arbitrary swaps:
void GetMaxWithKSwaps(int k, char[] digits, ref string candidate)
{
    if (k == 0)
        return;
    for (int i = 0; i < digits.Length - 1; i++)
    {
        for (int j = i + 1; j < digits.Length; j++)
        {
            if (digits[i] < digits[j])
            {
                Swap(digits, i, j);                                  // try this swap
                var current = new string(digits);
                if (string.Compare(current, candidate) > 0)          // keep the best arrangement seen so far
                    candidate = current;
                GetMaxWithKSwaps(k - 1, digits, ref candidate);      // one of the k swaps has been spent
                Swap(digits, i, j);                                  // backtrack
            }
        }
    }
}

Or we could enumerate all arrangements and print those that beat the original and are exactly k swaps away:

void Permute(int k, string digits, StringBuilder candidate, bool[] used)
{
    if (candidate.Length == digits.Length)
    {
        // print arrangements that are larger than the original and exactly k swaps away
        // (CountSwaps is an assumed helper that counts the minimum swaps between the two arrangements)
        if (string.Compare(candidate.ToString(), digits) > 0 && CountSwaps(digits, candidate.ToString()) == k)
            Console.Write(candidate.ToString());
        return;
    }
    for (int i = 0; i < digits.Length; i++)
    {
        if (used[i]) continue;
        used[i] = true;
        candidate.Append(digits[i]);
        Permute(k, digits, candidate, used);
        candidate.Length = candidate.Length - 1;   // backtrack: drop the last character
        used[i] = false;
    }
}

Also, given an original sequence and a final sequence, it is possible to say how many swaps are required.

For example, we can say 43592169 and 95432169 are five adjacent swaps apart because

  • the first four positions are out of order: bringing 9 to the front takes three adjacent swaps and bringing 5 to the second position takes two more, exactly as in the walkthrough above (a counting sketch follows below).
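
More generally, the minimum number of adjacent swaps between two arrangements of the same digits equals the number of inversions in the permutation that maps source positions to target positions. A hedged sketch (the method name is made up; duplicate digits are matched left to right, which is optimal for adjacent swaps):

using System;

class SwapDistance
{
    static int CountAdjacentSwaps(string source, string target)
    {
        int n = source.Length;
        var pos = new int[n];                 // pos[i] = target position of source[i]
        var used = new bool[n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (!used[j] && target[j] == source[i]) { pos[i] = j; used[j] = true; break; }

        int inversions = 0;                   // O(n^2) inversion count is fine for short numbers
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (pos[i] > pos[j]) inversions++;
        return inversions;
    }

    static void Main()
    {
        Console.WriteLine(CountAdjacentSwaps("43592169", "95432169"));   // prints 5
    }
}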

Sunday, October 16, 2016

We have seen some generalized discussion about what random fields are, their benefits, their representations, the objective functions and their optimizations, including techniques such as the CG method, which ties in with things we have previously studied for the purposes of text analysis. We will now see how to apply it. At this point, we are merely forming a hypothesis. We recall that the PMI matrix helped find word relationships with word vectors. We said that the semantic network embedded within the word space can be used to find candidates for positive and negative sampling. We said that the log exponential form of the PMI matrix defines the undirected graphical model. We said we could replace this model with feature functions, where features are defined in terms of both transitions from the previous state to the current state and observations for the current state. Then we said we could generalize this with parameter vectors and minimize the objective function with the Conjugate Gradient method.
As we move from an undirected graphical model to a linear chain CRF, let us take a look at reworking not just the PMI matrix but also the semantic embedding. First we modify the recommendations for creating the semantic embedding in the word space, discussed earlier from the paper by Johansson and Nieto Piña, by using a linear chain CRF and its optimization instead of the gradient descent that they perform. We do this by introducing a set of weights which equal 1 for a given state and zero otherwise; in other words, we define feature functions. The reason we choose this problem to apply the CRF to is that it has a direct minimization problem which we attempt to solve with a CRF instead. There we were using sense vectors from a choice of hypernyms and hyponyms, and we applied a distance function to compute similarities between senses. The sense vectors were similar to the word vectors except that they drew their features from the thesaurus.

Now we use the same features as states for an underlying sequence model and say that the words are based on some sequence of states, where the previous state determines the current state and the current state has independent observations. We could do this merely by associating each word with its state. By drawing a sequence model, although a less than perfect fit compared to the generative directed models, we are able to form the CRF directly. This just serves as an example to illustrate how to apply the sequence model directly. The joint distribution that we used to form the sense vectors in the first place is taken as a log probability, so log p(y)p(x|y) becomes lambda-y plus the sum over features of lambda-y,j times x-j, where the first term is the bias weight and the second term holds the feature weights (written out after this paragraph). The two together, when exponentiated and normalized, become the conditional distribution. Therefore it is possible to rewrite the joint distribution in the form of bias and feature weights, after which we already know how to proceed: define feature functions that are non-zero for a single class label, form an objective function, and apply the Conjugate Gradient method.
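For reference, the bias-weight plus feature-weight form referred to above can be written as follows (this is the standard rewriting from Sutton and McCallum, restated here rather than anything specific to the sense-embedding paper):

p(y, x) = \frac{1}{Z} \exp\Big( \lambda_y + \sum_j \lambda_{y,j}\, x_j \Big),
\qquad
p(y \mid x) = \frac{\exp\big( \lambda_y + \sum_j \lambda_{y,j}\, x_j \big)}{\sum_{y'} \exp\big( \lambda_{y'} + \sum_j \lambda_{y',j}\, x_j \big)}

where \lambda_y is the bias weight for class y and \lambda_{y,j} is the weight of feature x_j under class y.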

The steps for introducing a semantic CRF are being considered as follows (a small sketch of the PMI computation in step 2 follows the list):
1) Generate the sense embedding network in the same word space
2) Generate a PMI word-context matrix that has both positive and negative samples. Both the samples should include semantic content such as hyponyms and hypernyms.
3) We establish a conditional distribution from the top PMIs and the top senses.
4) Then we maximize the conditional log likelihood of this distribution.
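
A minimal sketch of the PMI value used in step 2, assuming co-occurrence counts have already been collected from the corpus (the dictionaries and method name here are hypothetical, not from any particular paper):

using System;
using System.Collections.Generic;

static class PmiSketch
{
    // PMI(w, c) = log( P(w, c) / (P(w) * P(c)) ), estimated from counts over all
    // (word, context) pairs; positive entries of this matrix give the positive
    // samples, and negative samples can be drawn with semantic content as well.
    public static double Pmi(
        Dictionary<(string word, string context), long> pairCounts,
        Dictionary<string, long> wordCounts,
        Dictionary<string, long> contextCounts,
        long totalPairs,
        string w, string c)
    {
        if (!pairCounts.TryGetValue((w, c), out long joint) || joint == 0)
            return double.NegativeInfinity;          // the pair never co-occurs
        double pwc = (double)joint / totalPairs;
        double pw = (double)wordCounts[w] / totalPairs;
        double pc = (double)contextCounts[c] / totalPairs;
        return Math.Log(pwc / (pw * pc));
    }
}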


Saturday, October 15, 2016

We were discussing parameter vectors with conditional random fields. A random field is a particular distribution in a family of distributions. It is discriminative, which means it models the conditional distributions directly, and this avoids the dependencies of the observations. This makes it less prone to violations of the assumptions of independence between the features. A CRF is written in the form of feature functions. Each feature function has a feature for state transition and a feature for state observation. Feature functions are not limited to indicator functions, where a set of weights takes the value one for a class instance and zero otherwise. We could exploit other richer features of the input and replace the weights with parameter vectors. The parameter vectors are estimated with the same sum and log notations as we have seen, for what is called penalized maximum likelihood. To do this, we take the log likelihood, which is the conditional probability of the prediction summed over each of the inputs. However, we showed the CRF as an improvement over the plain conditional probabilities, so we can write the log likelihood in terms of the state-transition and state-observation real-valued feature functions and the parameter vector. To avoid overfitting when the norm of the weight vector is too large, we penalize large weights, and this is called regularization. A regularization parameter determines the strength of the penalty. The choice of this parameter can ideally be found by iterating over a range of parameters. In practice, however, typically only a few are tried out by varying the regularization parameter by factors of 10. There can be other choices of regularization, but this kind of tuning is very well known in statistics and is therefore applied directly.
The conditional log likelihood thus formed can be maximized by taking partial derivatives. In this case, it results in three components. The first term is the expected value of the feature function under the observed (empirical) distribution; an expected value is the sum of all possible values of a variable, each multiplied by the probability of its occurrence, and in this case we use the product of feature weights and bias weights directly for this average. The second term is the derivative of the log of the normalization constant, which sums over all the possible state sequences. This second term is the expectation of the feature function under the model distribution, not the observed one. The optimum solution is when the gradient is zero and the two expectations are equal. This pleasing interpretation is a standard result about maximum likelihood. The third term is merely the derivative of the regularization penalty on the parameter vectors and is simple to compute. A number of optimization techniques can be chosen to perform this optimization. Because the original objective function we started out with is a logarithm of exponentials, it is concave, which facilitates these techniques. The simplest one would be steepest ascent, but it would take too many iterations. The difficulty is that there are millions of parameters, so computation steps are expensive. As a result, current techniques involve approximations for interim steps, such as in quasi-Newton methods. The Conjugate Gradient method also performs approximations and is suited for these kinds of problems. The objective and its gradient are written out below.
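For reference, a sketch of the penalized conditional log likelihood and its gradient for a linear chain CRF, in the notation of Sutton and McCallum (N training sequences, feature functions f_k, parameters \theta_k, and a Gaussian regularizer with variance \sigma^2):

\ell(\theta) = \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{k} \theta_k\, f_k\big(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}\big) \;-\; \sum_{i=1}^{N} \log Z\big(x^{(i)}\big) \;-\; \sum_{k} \frac{\theta_k^2}{2\sigma^2}

\frac{\partial \ell}{\partial \theta_k} = \sum_{i,t} f_k\big(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}\big) \;-\; \sum_{i,t} \sum_{y, y'} f_k\big(y, y', x_t^{(i)}\big)\, p\big(y, y' \mid x^{(i)}\big) \;-\; \frac{\theta_k}{\sigma^2}

The first term is the expectation of f_k under the empirical distribution, the second is its expectation under the model, and the third comes from the regularizer; setting the gradient to zero makes the two expectations equal, as described above.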
So far we have seen a fairly generalized discussion about what random fields are, their benefits, their representations, the objective functions and their optimizations, including techniques such as the CG method, which ties in with things we have previously studied for the purposes of text analysis.

Friday, October 14, 2016

Today we continue discussing the conditional random field. A random field is a particular distribution in a family of distributions. We were discussing discriminative and generative modeling. By modeling the conditional distributions directly, we avoid the dependencies of the observations. This makes discriminative modeling less prone to violations of the assumptions of independence between the features. The conditional random fields are discriminative. A CRF is written in the form of feature functions. Each feature function has a feature for state transition and a feature for state observation. In an undirected graphical model, we had an exponential form of vectors. Instead we now use a set of weights that are shared across all the classes. When they take a non-zero value for a single class and zero otherwise, they become feature functions. We have now expressed the exponentials in the form of feature functions.
In the semantic network, this applies directly. We model the conditional probability p(c|w; theta) as p(c|w; theta) = exp(vector_for_c . vector_for_w) divided by the sum of all such exponentials, where c is the semantic class and w is the word from the text or vocabulary encountered. This is an undirected graphical model and we can refine it with feature functions in just the same way. When we express it in terms of feature functions, we are using them to articulate the current word's identity. But we are not restricted to using indicator functions. We could exploit other richer features of the input, such as prefixes and suffixes, the context surrounding the word, and even named entities. This leads to the general definition of linear chain CRFs, where the set of weights is replaced by parameter vectors.
In a CRF, we allow the score of the transition (i, j) to depend on the current observation vector by adding a feature that chains the transition and the observations, and this is commonly used in text applications. Note that in iterations over time steps, the observation argument is written as a vector xt which contains all the components of the global observations x that are needed for computing features at time t. If the CRF uses the next word as a feature, then the vector for the last word would have included the identity of the next word. The resulting distribution is written out below.
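For reference, the linear chain CRF distribution in the notation of Sutton and McCallum (restated here as a standard form, with K feature functions f_k and parameters \theta_k):

p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \exp\Big( \sum_{k=1}^{K} \theta_k\, f_k(y_t, y_{t-1}, x_t) \Big),
\qquad
Z(x) = \sum_{y} \prod_{t=1}^{T} \exp\Big( \sum_{k=1}^{K} \theta_k\, f_k(y_t, y_{t-1}, x_t) \Big)

Each f_k is either a transition feature on (y_t, y_{t-1}) or an observation feature on (y_t, x_t), which matches the description above.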

The parameter vectors are estimated the same way as we would estimate any conditional distribution. We take the sequence of inputs and the sequence of predictions and estimate using penalized maximum likelihood, which is a way of saying we take the sum of the log of the conditional distribution of the prediction given the input.

The independence of each sequence is something we can relax later.

Courtesy : Sutton and McCallum
#codingexercise
int ResetKthBitFromLastInBinaryRepresentation(int n, int k)
{
    string binary;
    ToBinary(n, ref binary);                  // assumed helper: binary string of n, most significant bit first
    Assert(k >= 1 && k <= binary.Length);     // the kth bit from the last must exist
    var bits = binary.ToCharArray();
    int index = bits.Length - k;              // kth bit counted from the last (least significant) end
    if (bits[index] == '1')
        bits[index] = '0';                    // reset (clear) the bit
    int num;
    ToInt(new string(bits), ref num, 0);      // assumed helper: parse the binary string back into an int
    return num;
}
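
The same result can be had without the string round trip, assuming k counts from 1 at the least significant bit and "reset" means clearing the bit, as the function name suggests (a one-line sketch):

static int ResetKthBitFromLast(int n, int k)
{
    return n & ~(1 << (k - 1));   // clear bit k-1; e.g. n = 13 (1101), k = 3 gives 9 (1001)
}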

Thursday, October 13, 2016

Today we continue discussing the conditional random field. A random field is a particular distribution in a family of distributions. We were discussing discriminative and generative modeling. Generative means it is based on a model of the joint distribution of a state together with the observation. Discriminative means it is based on a model of the conditional distribution of a state given the observation. The main advantage of discriminative modeling is that it is better suited for rich overlapping features. By modeling the conditional distributions directly, we avoid the dependencies of the observations. This makes discriminative modeling less prone to violations of the assumptions of independence between the features.
The conditional random fields are very much discriminative. They make independence assumptions between the states but not between the observations.
We also saw the advantages of sequence modeling, which is an arrangement of a linear chain of output variables. We combine both the linear chain and the discriminative modeling. This yields a linear chain CRF. A CRF is written in the form of feature functions. Each feature function has the form fk(yt, yt-1, xt), which goes to say that each state depends only on the previous state and that the observation at each step is tied only to the current state. In other words, there is a feature for every state transition, which we describe in terms of an indicator that takes the value 1 when the states equal a particular pair and zero otherwise, and there is a feature for every state-observation pair.
In an undirected graphical model, we were normalizing a family of local functions or distributions; each local function had an exponential form and represented "sufficient statistics" in terms of observation and class. There were a chosen 'z' number of such local functions. But instead of using one vector per class, we used a set of weights that are shared across all the classes. When a weight is non-zero only for a single class, taking the value 1 for that class and zero otherwise, it becomes a feature function, and we can express a logistic regression model as a normalization of the exponentials of the weighted feature functions (written out below). If we interpret it generatively, we take all possible values of the numerator and sum them to use as the denominator for each component. Now we have gone a step further and substituted each feature function with a form that depends on the state transition as well as the state observation.
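For reference, the logistic regression (maximum entropy) form mentioned above, written with shared feature functions f_k(y, x) that are non-zero only for a single class (a standard form, restated here):

p(y \mid x) = \frac{\exp\Big( \sum_{k} \theta_k\, f_k(y, x) \Big)}{\sum_{y'} \exp\Big( \sum_{k} \theta_k\, f_k(y', x) \Big)}

Interpreting it generatively amounts to summing the numerator over every possible y' to form the denominator, as described above.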
If we look at the skip-gram model, we were using the undirected graphical model. Instead we could now use a semantic feature function from a semantic embedding in the same word space, along with conditional representations. If we were to do named-entity recognition, we could also consider a semantic feature function there.

If we look at the feature function for the semantics, we could use a conditional distribution over the semantic classes.
And we don't even have to do class/label propagation.
Courtesy : Sutton and McCallum

Wednesday, October 12, 2016

Today we continue discussing the conditional random field. We were discussing the undirected graphical model and its application in the naive Bayes classifier and the maximum entropy classifier - both of which are used in natural language processing and are examples of a directed graphical model. The naive Bayes classifier is based on the conditional probability of each feature and the maximum entropy classifier is based on the log probability of each class. Instead of using one vector per class in the latter model, a set of weights is defined that is shared across all the classes. These are non-zero only for a single class and are called feature functions. A feature function can be defined in terms of feature weights and bias weights. The feature weight is an indicator function of the feature, which takes the value 1 when the feature equals an instance and zero otherwise. The bias weights similarly take the value 1 for the bias and zero otherwise. The notation used for conditional random fields involves feature weights and bias weights.
While classifiers predict only a single class variable, the graphical model could be used to model many variables that are interdependent. When the output variables are arranged in a sequence or a linear chain, we get a sequence model. This is the approach taken by a hidden Markov Model. An HMM models a sequence of observations by assuming that there is an underlying sequence of states. For example, in a named entity recognition task, we try to identify and classify proper names as person, location, organization and other. Each state depends only on the previous state, and it is independent of all its earlier ancestors. An HMM assumes that each observation variable depends only on the current state. This model is therefore specified by three probability distributions: the distribution over initial states, the transition distribution from one state to the next, and the observation distribution of each observation with respect to its state. Multiplied out over the sequence, these give the joint distribution shown below.
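For reference, the three distributions combine into the usual HMM joint distribution (a standard form, restated here):

p(y, x) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)

where p(y_1 \mid y_0) is read as the initial state distribution p(y_1).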
When we want to include interdependence between features in a generative model, this is usually done in one of two ways - enhance the model to represent dependencies among the inputs, or make simplifying independence assumptions. The first approach is difficult because we have to maintain tractability. The second approach can hurt performance. The difference in behavior between the two families of models is largely due to the fact that one is generative and the other is discriminative.
Generative means it is based on the model of  joint distribution of a state with regard to observation. Discriminative means it is based on the model of conditional distribution of a state given the observation. We can also form a model based on generative-discriminative pair.


The generative model has many benefits, but the main advantage of discriminative modeling is that it is better suited for rich overlapping features. By modeling the conditional distributions directly, we avoid the dependencies of the observations. This makes discriminative modeling less prone to violations of the assumptions of independence between the features.
The conditional random fields make independence assumptions between the states but not between the observations.

Courtesy : Sutton and McCallum

Tuesday, October 11, 2016

Today we discuss the conditional random field. We said that when we have diverse multiple features for a vector that depend on each other, one way to represent them is using conditional random fields. We can see this clearly in relational data that have two characteristics. First, each entity has a rich set of features, and second, there are statistical dependencies between the entities we wish to model. These relationships and dependencies between entities are naturally represented in a graph. Graph-based models have been used to represent the joint probability distribution p(y,x), where the variables y represent the attributes of the entities that we wish to predict and the input variables x represent our observed knowledge about the entities. However, if we were to model joint distributions, then modeling the distribution p(x) can include complex dependencies. Modeling these dependencies among inputs can lead to intractable models, but ignoring them can lead to reduced performance. Therefore, a solution may be to directly model the conditional distribution p(y|x), which is sufficient for classification. This is what conditional random fields do. A conditional random field is simply a conditional distribution p(y|x) with an associated graphical structure. The conditional dependencies among the input variables x now do not need to be represented explicitly, and the input can be assumed to have rich features.
A conditional random field (CRF) allows us to directly model the conditional probability distribution p(y|x), which is sufficient for classification.
We now review current training and inference techniques for conditional random fields. Unlike linear chain models, general CRFs can capture long-distance dependencies between labels. For example, when a noun is repeated several times in a text, each mention may provide different clues about the entity's label. Since all mentions of the same entity should receive the same label, it is helpful to connect them; a skip-chain CRF is a model that does this while jointly performing segmentation and collective labeling of the extracted mentions.
When graphs are drawn for these probability distributions, adjacency is based on the factorization. The main idea is to represent a distribution over a large number of random variables by a product of local functions that each depend on only a small number of variables. For example, in our earlier word-vector work the similarities were based on PMI.
This undirected graphical model can also be expressed as a factor graph, which is a bipartite graph in which each variable node is connected to the factor nodes whose local functions depend on it; the factorization is written out below.
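For reference, the factorization just described takes the standard form (restated from Sutton and McCallum rather than derived here):

p(y, x) = \frac{1}{Z} \prod_{a \in F} \Psi_a(y_a, x_a),
\qquad
Z = \sum_{y, x} \prod_{a \in F} \Psi_a(y_a, x_a)

where F is the set of factors, each local function \Psi_a depends only on the small subset of variables (y_a, x_a) attached to it in the factor graph, and Z is the normalization constant.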
A directed graphical model, also known as a Bayesian network, is based on a directed graph that factorizes into conditional distributions. The term generative model is used to refer to a directed graphical model in which the outputs topologically precede the inputs. In other words, it propagates in a forward-only manner.
A directed graphical model is useful for classification. Both the naive Bayes classifier and the maximum entropy classifier are used in natural language processing and are examples of a directed graphical model. The naive Bayes classifier is based on the conditional probability of each feature and the maximum entropy classifier is based on the log probability of each class. Instead of using one vector per class in the latter model, a set of weights is defined that is shared across all the classes. These are non-zero only for a single class and are called feature functions. A feature function can be defined in terms of feature weights and bias weights. The feature weight is an indicator function of the feature, which takes the value 1 when the feature equals an instance and zero otherwise. The bias weights similarly take the value 1 for the bias and zero otherwise. The notation used for conditional random fields involves feature weights and bias weights.

Courtesy : Sutton and McCallum