Thursday, October 13, 2016

Today we continue discussing the conditional random field. A random field is a particular distribution within a family of distributions. We were discussing discriminative and generative modeling. A generative model is based on the joint distribution of the states and the observations; a discriminative model is based on the conditional distribution of the states given the observations. The main advantage of discriminative modeling is that it is better suited to rich, overlapping features: by modeling the conditional distribution directly, we avoid having to model dependencies among the observations, which makes the discriminative approach less prone to violations of independence assumptions between features.
Conditional random fields are firmly discriminative: they make independence assumptions among the states but not among the observations.
We also saw the advantages of sequence modeling, in which the output variables are arranged in a linear chain. Combining the linear chain with discriminative modeling yields a linear-chain CRF. A CRF is written in terms of feature functions. Each feature function has the form f_k(y_t, y_{t-1}, x_t), which says that each state depends only on the previous state and on the current observation. In other words, there is a feature for every state transition and a feature for every state-observation pair, each described by an indicator that takes the value 1 when its arguments match a particular configuration and 0 otherwise.
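Written out in the notation of Sutton and McCallum, with \theta_k the weight of feature f_k, the linear-chain CRF is

p(y \mid x) = \frac{1}{Z(x)} \exp\Big\{ \sum_{t=1}^{T} \sum_{k} \theta_k f_k(y_t, y_{t-1}, x_t) \Big\}, \qquad Z(x) = \sum_{y'} \exp\Big\{ \sum_{t=1}^{T} \sum_{k} \theta_k f_k(y'_t, y'_{t-1}, x_t) \Big\},

where the normalizer Z(x) sums over all possible label sequences y'.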
In an undirected graphical model, we normalize a product of local functions, each of which has an exponential form over "sufficient statistics" of the observation and the class, and there are a chosen number of such local functions. Instead of using one weight vector per class, we use a set of weights shared across all the classes; each weight is paired with an indicator that is non-zero only for its class, which turns it into a feature function. A logistic regression model can then be expressed as a normalized exponential of the weighted feature functions; to interpret it generatively, we sum the numerator over all possible values and use that sum as the denominator for each component. The linear-chain CRF goes a step further and substitutes each feature function with one that depends on the state transition as well as the state-observation pair.
If we look at the skip-gram model, we were using an undirected graphical model there as well. We could now add a semantic feature function, derived from a semantic embedding in the same word space, alongside the conditional representations. If we were to do named-entity recognition, we could consider such a semantic feature function there too.

For the semantic feature function, we could use a conditional distribution over the semantic classes.
And we would not even have to do class/label propagation.
Courtesy : Sutton and McCallum

Wednesday, October 12, 2016

Today we continue discussing the conditional random field. We were discussing the directed graphical model and its application to the naive Bayes classifier and the maximum entropy classifier, both of which are used in natural language processing. The naive Bayes classifier is based on the conditional probability of each feature, and the maximum entropy classifier is based on the log probability of each class. Instead of using one vector per class in the latter model, a set of weights is defined that is shared across all the classes. Each weight is paired with an indicator that is non-zero only for a single class, and together they are called feature functions. A feature function can be defined in terms of feature weights and bias weights: the feature weight fires through an indicator that takes the value 1 when the feature equals a particular value for its class and 0 otherwise, and the bias weight similarly takes the value 1 for its class and 0 otherwise. The notation used for conditional random fields is written in terms of these feature weights and bias weights.
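In symbols (following Sutton and McCallum), the feature functions for class y' are f_{y',j}(y, x) = \mathbf{1}_{\{y'=y\}}\, x_j for each feature j and f_{y'}(y, x) = \mathbf{1}_{\{y'=y\}} for the bias, so the maximum entropy classifier can be written compactly as

p(y \mid x) = \frac{1}{Z(x)} \exp\Big\{ \theta_y + \sum_{j} \theta_{y,j}\, x_j \Big\} = \frac{1}{Z(x)} \exp\Big\{ \sum_{k} \theta_k f_k(y, x) \Big\}.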
While classifiers predict only a single class variable, a graphical model can be used to model many variables that are interdependent. When the output variables are arranged in a sequence or a linear chain, we get a sequence model. This is the approach taken by the hidden Markov model (HMM). An HMM models a sequence of observations by assuming that there is an underlying sequence of states. For example, in the named entity recognition task we try to identify and classify proper names as person, location, organization, or other. Each state depends only on the previous state and is independent of all its earlier ancestors. An HMM further assumes that each observation variable depends only on the current state. The model is therefore specified by three probability distributions: the distribution p(y) over initial states, the transition distribution from one state to the next, and the observation distribution of each observation given its state.
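In symbols, the HMM factorizes the joint distribution of states and observations as

p(y, x) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t),

where p(y_1 \mid y_0) is taken to be the initial state distribution p(y_1).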
When we want to include interdependent features in a generative model, we have two choices: enhance the model to represent dependencies among the inputs, or make simplifying independence assumptions. The first approach is difficult because the model must remain tractable. The second approach can hurt performance. The difference in behavior between the two classifiers above is largely due to the fact that one is generative and the other is discriminative.
Generative means the model is based on the joint distribution of the states and the observations; discriminative means it is based on the conditional distribution of the states given the observations. Models can also be formed as generative-discriminative pairs.


The generative model has benefits of its own, but the main advantage of discriminative modeling is that it is better suited to rich, overlapping features. By modeling the conditional distribution directly, we avoid having to model dependencies among the observations, which makes the discriminative approach less prone to violations of independence assumptions between features.
Conditional random fields make independence assumptions among the states but not among the observations.

Courtesy : Sutton and McCallum

Tuesday, October 11, 2016

Today we discuss the conditional random field. We said that when we have diverse, multiple features for a vector that depend on each other, one way to represent them is with conditional random fields. We can see this clearly in relational data, which has two characteristics: first, each entity has a rich set of features, and second, there are statistical dependencies between the entities we wish to model. These relationships and dependencies between entities are naturally represented in a graph. Graph-based models have been used to represent the joint probability distribution p(y, x), where the variables y represent the attributes of the entities that we wish to predict and the input variables x represent our observed knowledge about the entities. However, if we were to model the joint distribution, then modeling the distribution p(x) can require capturing complex dependencies. Modeling these dependencies among the inputs can lead to intractable models, while ignoring them can reduce performance. Therefore, a solution is to directly model the conditional distribution p(y|x), which is sufficient for classification. This is what conditional random fields do. A conditional random field is simply a conditional distribution p(y|x) with an associated graphical structure. The conditional dependencies among the input variables x no longer need to be represented explicitly, and the inputs can have rich features without complicating the model.
A conditional random field (CRF) thus allows us to directly model the conditional probability distribution p(y|x), which is sufficient for classification.
We now review current training and inference techniques for conditional random fields. Unlike linear-chain models, general CRFs can capture long-distance dependencies between labels. For example, when a noun phrase is repeated several times in a text, each mention may contribute different information about the entity; since all the mentions are likely to have the same label, it is helpful to let them share information. The skip-chain CRF is one such model: it jointly performs segmentation and collective labeling of the extracted mentions.
When graphs are drawn for these probability distributions, the adjacency is based on the factorization. The main idea is to represent a distribution over a large number of random variables by a product of local functions that each depend on only a small number of variables. In the word co-occurrence setting, for example, the local similarities are based on PMI.
This undirected graphical model can also be expressed as a factor graph, a bipartite graph in which each variable node is connected to the factor nodes it participates in. A factor is a local function of the variable nodes it is connected to.
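Concretely, the distribution is written as a normalized product of local factors,

p(x, y) = \frac{1}{Z} \prod_{a} \Psi_a(x_a, y_a), \qquad Z = \sum_{x, y} \prod_{a} \Psi_a(x_a, y_a),

where each factor \Psi_a depends only on the small subset of variables (x_a, y_a) it is connected to in the factor graph.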
A directed graphical model, also known as a Bayesian network, is based on a directed graph and factorizes into local conditional distributions. The term generative model refers to a directed graphical model in which the outputs topologically precede the inputs; in other words, no input variable can be a parent of an output.
A directed graphical model is useful for classification. Both the naive Bayes classifier and the maximum entropy classifier are used in natural language processing and are examples of this family. The naive Bayes classifier is based on the conditional probability of each feature, and the maximum entropy classifier is based on the log probability of each class. Instead of using one vector per class in the latter model, a set of weights is defined that is shared across all the classes. Each weight is paired with an indicator that is non-zero only for a single class, and together they are called feature functions. A feature function can be defined in terms of feature weights and bias weights: the feature weight fires when the feature takes a particular value for its class, and the bias weight similarly takes the value 1 for its class and 0 otherwise. The notation used for conditional random fields is written in terms of these feature weights and bias weights.
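For reference, the naive Bayes classifier factorizes the joint distribution as

p(y, \mathbf{x}) = p(y) \prod_{k=1}^{K} p(x_k \mid y),

which is the generative counterpart of the maximum entropy (logistic regression) classifier.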

Courtesy : Sutton and McCallum

Monday, October 10, 2016

Considerations for large graph processing 
Facebook engineers who worked on large scale graphs with trillions of edges mentioned the following issues they faced: 
1) They needed to extend the basic Pregel model of graph processing. For example, to perform k-means clustering they had to generalize the iteration structure so that the framework became more reusable and the work was abstracted per vertex: first, the vertex compute method calculates the distance to all the centroids and makes the vertex a member of the nearest cluster; then the centroids are recalculated, and at the next outer iteration the new centroids are available to every vertex (a sketch of such a per-vertex step appears after this list). 
2) The Pregel model also frames computations as "thinking like a vertex", whereas some computations need to be executed in a centralized fashion. 
3) Pregel allows only one message type and one message combiner, so all the computations at a vertex had to fit into them. To keep computations distinct, cleaner and more reusable, they added composable computation. 
4) When messages are commutative and associative, they can be combined, which makes them "aggregatable". For those that cannot be aggregated, they introduced a technique called "superstep splitting", where a fragment of the message is sent and a partial computation updates the state of the vertex value, all in memory. 
5) To improve performance and scalability, they had to introduce additional parallelization in the runtime and shared contexts, which they called "WorkerContext". Bottlenecks and overheads such as checkpointing were addressed by scheduling at a finer level than what the infrastructure provided. 
6) They also optimized the memory utilization of the graph infrastructure, which allows arbitrary vertex id, vertex value, edge and message classes, by 1) serializing edges into byte arrays and 2) serializing messages on the servers. 
7) They also improved parallelization with sharded aggregators that provided efficient shared state across workers. With this approach, each aggregator gets assigned to a randomly picked worker which then gathers the values, performs the aggregation and distributes the final values to master and other workers. This distributes the load that was otherwise entirely on the master. 
Since the graphs are large, each graph is partitioned into n parts across m machines, and the computations are distributed across workers and aggregators.  
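As a rough illustration of item 1, here is what a per-vertex k-means compute step might look like in a Pregel-style framework. This is only a sketch: the KMeansVertex class and the ICentroidAggregator interface below are hypothetical stand-ins, not the actual Giraph API.

class KMeansVertex
{
    public double[] Position;   // the feature vector held by this vertex
    public int ClusterId;       // index of the nearest centroid found so far

    // Runs once per vertex in each superstep of the inner iteration.
    public void Compute(double[][] centroids, ICentroidAggregator aggregator)
    {
        double best = double.MaxValue;
        for (int c = 0; c < centroids.Length; c++)
        {
            double d = SquaredDistance(Position, centroids[c]);
            if (d < best) { best = d; ClusterId = c; }
        }
        // Contribute this vertex to its chosen cluster; the aggregated sums
        // are used to recompute the centroids for the next outer iteration.
        aggregator.Add(ClusterId, Position);
    }

    static double SquaredDistance(double[] a, double[] b)
    {
        double s = 0;
        for (int i = 0; i < a.Length; i++) { double d = a[i] - b[i]; s += d * d; }
        return s;
    }
}

// Hypothetical shared-state interface: collects per-cluster sums across
// workers so the new centroids can be computed between outer iterations.
interface ICentroidAggregator
{
    void Add(int clusterId, double[] position);
}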
#codingexercise

static void ToBinary(int num, ref string binary)
{
    // Builds the binary string by prepending the remainder of repeated
    // division by two; an input of 0 leaves the string unchanged.
    if (num == 0) return;
    binary = (num % 2) + binary;
    ToBinary(num / 2, ref binary);
}

static void ToInt(string binary, ref int num, int k = 0)
{
    // Consumes one character from the right end of the string per call;
    // k tracks the current bit position, least significant bit first.
    if (string.IsNullOrEmpty(binary)) return;
    char c = binary[binary.Length - 1];
    num += (int)Math.Pow(2, k) * (c - '0');
    ToInt(binary.Substring(0, binary.Length - 1), ref num, k + 1);
}

static int ResetKthBit(int n, int k)
{
    // Clears (resets) the k-th bit of n, where k indexes the binary string
    // from its most significant end.
    string binary = string.Empty;
    ToBinary(n, ref binary);
    if (binary.Length <= k) return n;   // k is out of range; nothing to clear
    var bits = binary.ToCharArray();
    if (bits[k] == '1')
        bits[k] = '0';
    int num = 0;
    ToInt(new string(bits), ref num, 0);
    return num;
}


Sunday, October 9, 2016

Today we use the semantic network embedding as a word vector feature in the collocation context model.
From the skip-gram negative sampling (SGNS) approach, we do this by directly using the semantic network as an indicator for word-context pairs. Recall that in the skip-gram model, the word-context pairs chosen as positive samples were those with a high probability of being found in the corpus, with the contexts chosen from around the word. But while a large number of word-context pairs appear in the corpus, the quality of the selection improves when the contexts are themselves content-full and may lie linearly far away from the focus words. In other words, the similarity has to be topical. We can achieve this directly with the help of a semantic network. The lookup of the semantic network works as follows. We already have a mapping between each sense and the real-valued vector that builds the semantic network. When a word and a context appear as different senses but map to the same node, or are co-located as neighbors in the semantic network, then we know they are similar by topic. The semantic network was already built from the relatedness of the topics; we just need to map those labels back to the words. This is easily done with the word-to-sense mapping that is part of the semantic network embedding. The semantic network already guarantees that the vectors representing the lemmas are related to those corresponding to the senses. All we need to check is whether a word and its contexts are close together or far apart in the semantic network.
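For reference, the standard SGNS training objective for a single word-context pair (w, c) is, roughly,

\log \sigma(\vec{w} \cdot \vec{c}) \; + \; k \cdot \mathbb{E}_{c_N \sim P_n}\big[ \log \sigma(-\vec{w} \cdot \vec{c}_N) \big],

where the first term rewards observed (positive) pairs, the expectation ranges over k negative samples drawn from a noise distribution P_n, and \sigma is the logistic function. The semantic network enters by influencing which pairs are treated as positive and which as negative.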
Why go through the effort of establishing a semantic network and associating words and contexts with their topics? Either approach alone does not suffice. In fact, there is no clear answer as to why SGNS produces good word representations; the semantic network articulates one way to explain it. We just need a threshold on the number of semantically correlated word-context pairs that SGNS will consider, out of all the word-context pairs, and apply it early on. Complex models built with a variety of factors do not serve as a panacea. At the same time, the effort involved here is optional and adds a clear benefit: it immediately boosts the quality of positive sampling in the skip-gram model without even introducing negative sampling.
The purpose of negative sampling, on the other hand, is to prevent all the vectors from ending up similar. That tends to happen when the scalar product of the word and context vectors is a large number, say about 40, because the vectors are similar. Our approach is also at risk from this threat. However, the semantic network also comes in useful for negative sampling. We know that word-context pairs that are far apart in the semantic network embedding have very little topical correlation. So instead of picking random word-context pairs, we can be more methodical in the selection while keeping the same objective. In negative sampling, the word-context pairs do not appear in the corpus; therefore, we fix one element and pick the other from those associated with far-away nodes, so that there are at least a threshold number of such negative samples. This improves the quality of the negative samples as well. Overall, with articulated picks of both positive and negative samples, we improve the overall word representations.
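A minimal sketch of this selection, assuming we already have a lookup from a word to its sense vector in the semantic network embedding (the getSenseVector delegate and the maxSimilarity threshold below are hypothetical, not part of any particular library):

using System;
using System.Collections.Generic;

static class NegativeSampler
{
    // Picks negative contexts for a word by preferring candidates whose sense
    // vectors lie far from the word's sense vector in the semantic embedding.
    public static List<string> PickNegativeContexts(
        string word,
        IEnumerable<string> candidateContexts,
        Func<string, double[]> getSenseVector,   // assumed lookup into the semantic network embedding
        double maxSimilarity,                    // below this similarity a pair counts as unrelated
        int needed)
    {
        var negatives = new List<string>();
        var w = getSenseVector(word);
        foreach (var context in candidateContexts)
        {
            var c = getSenseVector(context);
            if (CosineSimilarity(w, c) < maxSimilarity)
                negatives.Add(context);
            if (negatives.Count >= needed) break;
        }
        return negatives;
    }

    static double CosineSimilarity(double[] a, double[] b)
    {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.Sqrt(na) * Math.Sqrt(nb) + 1e-12);
    }
}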

Saturday, October 8, 2016

Today we present a summary of the paper titled "Embedding a Semantic Network in a Word Space" by Johansson and Nieto Piña.

A semantic network is embedded with vectors that represent the word meanings (senses) of the words in a text. Words that have many meanings are represented by vectors that can be described as a combination of their singular sense vectors, and each singular sense vector is kept similar to its neighbors in the network. This leads to a constrained optimization problem. The distance function is the squared straight-line (Euclidean) distance.
To give an example of the senses of a word, 'mouse' may refer to a 'rat'-like animal or to an 'electronic device'. Furthermore, a 'mouse' is a 'rodent', so it has prominent 'teeth'. The purpose of this vectorization is to find a vector that can represent mouse in the rodent sense.
This paper tries to embed a graph structure in the word space. The approach uses two sources of information: corpus statistics and network structure. It was shown to work well in a classifier that creates lexical units for FrameNet frames. FrameNet is a collection of over 170,000 manually annotated sentences, which provides a unique training dataset for role labeling.
The senses are chosen based on a number of hypernyms and hyponyms from WordNet, although a thesaurus could be used as well.
The algorithm maps each sense sij to a sense embedding, a real-valued vector E(sij), in the same vector space as the lemma embeddings. The lemma and sense embeddings are related through a mix constraint, which is expressed with the occurrence probabilities of the senses as weights on the real-valued sense vectors. This is a way of saying that we count the contexts to form the vectors when a word has more than one sense, with the bonus benefit that it disambiguates the words. The motivation behind the paper is then formalized as minimizing the sum of distances between each sense and its neighbours, while satisfying the mix constraint for each lemma.
The mix constraint keeps the lemma true to the text, while the minimization pulls each sense toward its neighbours in the graph. Words with a single sense are left unchanged.
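Roughly, in the paper's notation, the problem is to choose sense embeddings E(s_ij) and mix weights p_ij that

\min \sum_{i,j} \sum_{n \in N(s_{ij})} w_{ijn}\, \| E(s_{ij}) - E(n) \|^2 \quad \text{subject to} \quad F(l_i) = \sum_{j} p_{ij}\, E(s_{ij}), \quad p_{ij} \ge 0, \quad \sum_{j} p_{ij} = 1,

where F(l_i) is the (fixed) lemma embedding and N(s_{ij}) are the neighbours of sense s_{ij} in the semantic network.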
The algorithm can now be described in the following way. It proceeds in an online fashion by considering one lemma at a time.
1) Pick a lemma and choose some senses from the thesaurus
2) Adjust the embeddings of the senses as well as their mix in order to minimize the loss function, which is the weighted squared distance between each sense and each of its neighbours. The embeddings of the neighbours are kept fixed.
3) Iterate over all the lemmas for a fixed number of passes or until the change falls below a threshold.
The improvement in each iteration does not need to be gradient-based. We already know that the weighted centroid of the neighbours is the likely target, so the residual is the difference between the linear combination of the weighted centroids and the lemma embedding, and the sense embedding for the lemma is a weighted offset from the centroid. If the mix variable is zero, the sense embedding is the centroid, which is to say it is determined completely by the neighbours; if it is 1, it is determined completely by the embedding of the lemma.

Friday, October 7, 2016

In natural language processing, WordNet and Roget's thesaurus are often used interchangeably. We argue that a thesaurus can be put to more use.
First, it helps in sampling words with semantically equivalent content. We achieve this by finding a word from the thesaurus for each of the words in the text. These words are chosen at random from the synonyms pertaining to each word, although arguably we could choose the simplest or the most frequently used.
Second, it helps in finding an alternate text composed of words from the thesaurus, which we measure based on lexical centrality. We perform a random walk on the graph, and the nodes that are visited most frequently are selected for the alternate text. In order to avoid duplicates or duplicate content, the final selection depends on its maximal marginal relevance.
We don't use a bipartite graph between the original text and the alternate text. Although such graphs have their own uses, we argue that they don't serve our purpose of performing a Laplace transform. Moreover, we see this as different from bootstrapping, where we start with a few seed relation examples or patterns and iteratively grow the set of relations and patterns based on their occurrence in a large corpus. There, a bipartite graph helps represent the mutual information between instances (hubs) and patterns (authorities).
We could run a PageRank algorithm directly on the thesaurus subset corresponding to the words in the given text. Hughes and Ramage (2007) showed that this approach calculates the stationary distribution of the nodes in the thesaurus graph, biased on each of the input words in a given word pair; the divergence between these distributions is then a measure of the relatedness of the two words. Although this approach has proved to be a significant improvement for semantic relatedness, we have only introduced semantically equivalent and redundant words, and thereby made it more computationally intensive compared to page-ranking the original text. The purpose of the Laplace transform is to create a semantically equivalent, regularized alternative text. Even if there is no such thing as a canonical text, the transformation provides a semantically equivalent alternative.
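In the usual personalized (biased) PageRank form, the stationary distribution referred to here satisfies

\pi = (1 - \alpha)\, v + \alpha\, P^{\top} \pi,

where v is a personalization vector that concentrates its mass on the node(s) of the input word, P is the transition matrix of the thesaurus graph, and \alpha is the damping factor; the divergence between the \pi obtained for each word of a pair is used as the relatedness measure.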
We don't cluster on the thesaurus subset because the words will be selected better by applying clustering directly on the original text, unless we introduce a feature of the word vector based on the thesaurus. One such feature may be a one-to-one mapping of synonyms from the thesaurus for the words chosen as vector features for the PMI matrix. Arguably, the thesaurus-based feature could be part of the CRF instead of the PMI matrix. Either way, we could achieve the word embedding with a thesaurus feature.
Having ruled out the bipartite graph approach, the clustering approach and page ranking on the thesaurus subset, we now try to build lexical chains and find the words in the overlapping regions between two chains.

Courtesy: Survey of graphs in NLP by Nastase and Mihalcea

#codingexercise
Given an array : A1[] = 2, 1, 2, 5, 7, 1, 9, 3, 6, 8, 8
A2[] = 2, 1, 8, 3
Sort A1 in such a way that the relative order among the elements will be same as those are in A2. If the element is not present in A2, append them at last in sorted order.
o/p : 2, 2, 1, 1, 8, 8, 3, 5, 6, 7, 9
static List<int> RelSort(List<int> A1, List<int> A2)
{
    var ret = new List<int>();
    var seen = new HashSet<int>();
    A1.Sort();   // sorting groups duplicates together and orders the leftovers
    for (int i = 0; i < A2.Count; i++)
    {
        int k = A1.IndexOf(A2[i]);   // first occurrence in the sorted list
        if (k != -1)
        {
            seen.Add(A2[i]);
            for (; k < A1.Count && A1[k] == A2[i]; k++)
                ret.Add(A1[k]);
        }
    }
    // Elements not present in A2 are appended in sorted order.
    for (int j = 0; j < A1.Count; j++)
    {
        if (!seen.Contains(A1[j]))
            ret.Add(A1[j]);
    }
    return ret;
}
In the above code, we assume that the relative order of elements is dictated by A2