Sunday, October 9, 2016

Today we use the semantic network embedding as a word vector feature in the collocation context model.
In the skip-gram negative sampling (SGNS) approach, we do this by taking the semantic network directly as an indicator of (word, context) pairs. Recall that in the skip-gram model, the (word, context) pairs chosen as positive samples are those with a high probability of being found in the corpus, with contexts drawn from around the word. But while a large number of (word, context) pairs can appear in the corpus, the quality of the selection improves when the contexts are themselves content-full and linearly far away from the focus word; in other words, the similarity has to be topical. We can achieve this directly with the help of a semantic network. The lookup works as follows. We already have a mapping between each sense and the real-valued vector that builds the semantic network. When a word and a context appear as different senses but map to the same node, or are co-located as neighbors in the semantic network, we know they are similar by topic. The semantic network was already built from the relatedness of topics; we just need to map those labels back to the words, which is easy because the word-to-sense mapping is part of the semantic network embedding. The embedding already guarantees that the vectors representing the lemmas are related to those of the corresponding senses. All we need to check is whether the word and the context are close together or far apart in the semantic network.
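As a minimal sketch of such a check, the following filter keeps a (word, context) pair only when some sense of the word and some sense of the context coincide or are near neighbors in the network. The ISemanticNetwork interface, its SensesOf and Distance members, and the maxHops threshold are our assumptions for illustration, not part of SGNS itself.

using System.Collections.Generic;
using System.Linq;

// Hypothetical view of the semantic network embedding: nodes are sense ids, SensesOf maps a word
// to its senses, and Distance returns the number of hops between two senses in the network.
interface ISemanticNetwork
{
    IEnumerable<int> SensesOf(string word);
    int Distance(int senseA, int senseB);
}

static class PositiveSampling
{
    // Keep a (word, context) pair as a positive sample only when the pair is topically related,
    // i.e. some sense of the word and some sense of the context are the same node or close neighbors.
    public static IEnumerable<(string Word, string Context)> Filter(
        IEnumerable<(string Word, string Context)> candidates, ISemanticNetwork net, int maxHops = 2)
    {
        return candidates.Where(p =>
            net.SensesOf(p.Word).Any(ws =>
                net.SensesOf(p.Context).Any(cs => net.Distance(ws, cs) <= maxHops)));
    }
}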
Why go through the effort of establishing a semantic network and associating words and contexts with their topics? Because either approach alone does not suffice. In fact, there is no clear answer to why SGNS produces good word representations, and the semantic network articulates one way of explaining it. We just need a threshold on the number of semantically correlated (word, context) pairs that SGNS will keep from all of the candidate pairs, applied early on. Complex models built with a variety of factors are not a panacea, but this extra effort is optional and stands to help: it immediately boosts the quality of positive sampling in the skip-gram model without even introducing negative sampling.
The purpose of negative sampling, on the other hand, is to keep the vectors from all ending up similar, which happens when the scalar product of every word and context vector becomes a large number (say about 40) because the vectors converge. Our approach is also at risk from this. However, the semantic network is useful for negative sampling too. We know that (word, context) pairs that are far apart in the semantic network embedding have very little topical correlation. So instead of picking random (word, context) pairs, we are more methodical in the selection while keeping the same objective. In negative sampling, the (word, context) pair does not appear in the corpus; therefore, we fix one element and pick the other from those associated with far-away nodes, so that there are at least a threshold number of such negative samples. This improves the quality of negative samples as well. Overall, with articulated picks of positive and negative samples, we improve the resulting word representations.
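Continuing the sketch above and reusing the assumed ISemanticNetwork interface (and its usings), negative contexts for a fixed word could be drawn from vocabulary entries whose senses all lie far from the word's senses; the minHops threshold and the count are again our assumptions.

static class NegativeSampling
{
    // Draw up to count negative contexts for the fixed word from vocabulary entries whose senses
    // are all at least minHops away from the word's senses in the semantic network.
    public static IEnumerable<string> NegativesFor(
        string word, IEnumerable<string> vocabulary, ISemanticNetwork net, int minHops, int count)
    {
        int picked = 0;
        foreach (var candidate in vocabulary)
        {
            bool farApart = net.SensesOf(word).All(ws =>
                net.SensesOf(candidate).All(cs => net.Distance(ws, cs) >= minHops));
            if (farApart)
            {
                yield return candidate;
                if (++picked == count) yield break;
            }
        }
    }
}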

Saturday, October 8, 2016

Today we present a summary of the paper titled "Embedding a Semantic Network in Word Space" by Johansson and Nieto Piña.

A semantic network is created with vectors that represent the word meanings (senses, taken from a thesaurus) of the words in a text. Words that have many meanings are represented by vectors that can be described as a combination of their individual sense vectors, and a word with a single sense is kept similar to its neighbors in the network. This leads to a constrained optimization problem. The distance function is the squared straight-line (Euclidean) distance.
To give an example of the senses of a word, the word 'mouse' may refer to a 'rat' or to an 'electronic device'. Furthermore, a 'mouse' is a 'rodent', so it has prominent 'teeth'. The purpose of this vectorization is to find a vector that can represent 'mouse' in the rodent sense.
This paper tries to embed a graph structure in the word space. The approach uses two sources of information: corpus statistics and network structure. It was shown to work well in a classifier that creates lexical units for FrameNet frames. FrameNet is a collection of over 170,000 manually annotated sentences, which provides a unique training dataset for role labeling.
The senses are chosen based on a number of hypernyms and hyponyms from WordNet, although we could use a thesaurus instead.
The algorithm maps each sense sij to a sense embedding, a real-valued vector E(sij) in the same vector space as the lemma embeddings. The lemma and sense embeddings are bound by a mix constraint, which expresses the lemma embedding as a combination of the sense vectors weighted by the occurrence probabilities of the senses. This is a way of saying that we count the contexts to form the vectors when a word has more than one sense, with the bonus benefit that we disambiguate the words. The motivation behind the paper can now be formalized as minimizing the sum of distances between each sense and its neighbours, while satisfying the mix constraint for each lemma.
The mix constraint keeps the lemma true to the text, while the minimization builds the graph structure into the embeddings. It also leaves words with a single meaning unchanged.
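One way to write down the constraint and the objective described above is the following (the notation for the mix weights p_{ij}, the neighbor weights w, and the neighbor set N(s_{ij}) is ours, not necessarily the paper's):

F(l_i) = \sum_j p_{ij} \, E(s_{ij}), \qquad \sum_j p_{ij} = 1, \; p_{ij} \ge 0

\min \; \sum_{i,j} \sum_{s' \in N(s_{ij})} w_{ij,s'} \, \lVert E(s_{ij}) - E(s') \rVert^2 \quad \text{subject to the mix constraints above}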
The algorithm can now be described in the following way. It proceeds in an online fashion by considering one lemma at a time.
1) Pick a lemma and choose some senses from the thesaurus
2) Adjust the embeddings of the senses, as well as their mix, in order to minimize the loss function, which is expressed as the weighted squared distance between the two vectors of every sense-and-neighbor pair. The embeddings of the neighbors are kept fixed.
3) Iterate over all the lemmas for a fixed number of passes, or until the change falls below a threshold.
The improvement in each iteration does not need to be gradient-based. We already know that the weighted centroid of the neighbors is the likely target, so the residual is the difference between the linear combination of weighted centroids and the lemma embedding, and the sense embedding for the lemma is a weighted offset from that centroid. If the mix variable is zero, the sense embedding is the centroid, which is to say it is determined completely by the neighbors; if it is one, it is completely the embedding of the lemma.
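As a minimal sketch of that interpolation for a single sense (the names UpdateSense, mix, and neighborCentroid are ours, and the centroid is assumed to be precomputed from the fixed neighbor embeddings):

// Interpolate a sense embedding between the (fixed) weighted centroid of its network neighbors
// and the lemma embedding, following the mix variable described above.
static double[] UpdateSense(double[] lemma, double[] neighborCentroid, double mix)
{
    var sense = new double[lemma.Length];
    for (int d = 0; d < lemma.Length; d++)
    {
        // mix == 0 -> determined entirely by the neighbors (the centroid)
        // mix == 1 -> identical to the lemma embedding
        sense[d] = mix * lemma[d] + (1 - mix) * neighborCentroid[d];
    }
    return sense;
}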

Friday, October 7, 2016

In Natural Language Processing, WordNet and Roget’s thesaurus are often used interchangeably. We argue that a thesaurus can be put to more use.
First, it helps in sampling words with semantically equivalent content. We achieve this by finding a word from the thesaurus for each of the words in the text. These words are chosen at random from the synonyms of each word, although arguably we could choose the simplest or the most frequently used one.
Second, it helps in finding an alternate text composed of words from the thesaurus, which we select based on lexical centrality. We perform a random walk on the graph, and the nodes that are visited most frequently are selected as the alternate text. In order to avoid duplicates or duplicate content, the final selection depends on its maximal marginal relevance.
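A minimal sketch of that random-walk selection, under assumed inputs (a non-empty word graph given as adjacency lists, a fixed number of steps, and the top-k most-visited words returned; the maximal-marginal-relevance de-duplication step is omitted here):

using System;
using System.Collections.Generic;
using System.Linq;

static class LexicalCentrality
{
    // Run a simple random walk over the word graph and return the k most frequently visited words.
    public static List<string> MostCentral(
        Dictionary<string, List<string>> graph, int steps, int k, int seed = 0)
    {
        var rng = new Random(seed);
        var visits = graph.Keys.ToDictionary(w => w, w => 0);
        var current = graph.Keys.First();
        for (int t = 0; t < steps; t++)
        {
            visits[current]++;
            var neighbors = graph[current];
            current = neighbors.Count > 0
                ? neighbors[rng.Next(neighbors.Count)]
                : graph.Keys.ElementAt(rng.Next(graph.Count));   // restart on a dead end
        }
        return visits.OrderByDescending(kv => kv.Value).Take(k).Select(kv => kv.Key).ToList();
    }
}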
We don’t use a bipartite graph between the original text and the alternate text. Although such graphs have their own uses, we argue that they don’t serve our purpose, which is to perform a Laplace transform. Moreover, we see this as different from bootstrapping, where we start with a few seed relation examples or patterns and iteratively grow the set of relations and patterns based on their occurrence in a large corpus; there, a bipartite graph helps represent the mutual information between instances (hubs) and patterns (authorities).
We could run a PageRank algorithm directly on the thesaurus subset corresponding to the words in the given text. Hughes and Ramage (2007) showed that such an approach calculates the stationary distribution of the nodes in the thesaurus graph, biased on each of the input words in a given word pair, and the divergence between these distributions is taken as a measure of the relatedness of the two words. Although this approach has proved to be a significant improvement for semantic relatedness, here we have only introduced semantically equivalent and redundant words and thereby made it more computationally intensive than page-ranking the original text. The purpose of the Laplace transform is to create a semantically equivalent, regularized alternative text. Even if there is no such thing as a canonical text, the transformation provides a semantically alternate text.
We don’t cluster on the thesaurus subset because the words will be selected better by clustering directly on the original text, unless we introduce a word-vector feature based on the thesaurus. One such feature may be a one-to-one mapping of synonyms from the thesaurus for the words chosen as vector features for the PMI matrix. Arguably the thesaurus-based feature could be part of the CRF instead of the PMI matrix. Either way we could achieve a word embedding with a thesaurus feature.
Having ruled out the bipartite-graph approach, the clustering approach, and page-ranking on the thesaurus subset, we now try to build lexical chains and find the words in the overlapping regions between two chains.

Courtesy: Survey of graphs in NLP by Nastase and Mihalcea

#codingexercise
Given an array : A1[] = 2, 1, 2, 5, 7, 1, 9, 3, 6, 8, 8
A2[] = 2, 1, 8, 3
Sort A1 in such a way that the relative order among the elements is the same as in A2. If an element is not present in A2, append it at the end in sorted order.
o/p : 2, 2, 1, 1, 8, 8, 3, 5, 6, 7, 9
List<int> RelSort(List<int> A1, List<int> A2)
{
    var ret = new List<int>();
    var seen = new HashSet<int>();
    A1.Sort();
    for (int i = 0; i < A2.Count; i++)
    {
        int k = A1.IndexOf(A2[i]);
        if (k != -1)
        {
            seen.Add(A2[i]);
            // copy every occurrence of A2[i] from the sorted A1
            for (; k < A1.Count && A1[k] == A2[i]; k++)
                ret.Add(A1[k]);
        }
    }
    // elements that never appear in A2 follow at the end, already in sorted order
    for (int j = 0; j < A1.Count; j++)
    {
        if (!seen.Contains(A1[j]))
            ret.Add(A1[j]);
    }
    return ret;
}
In the above code, we assume that the relative order of elements is dictated by A2

Thursday, October 6, 2016

We were discussing the use of graphs in natural language processing: One of the frequently used techniques is label propagation. It is a method of self-supervision which allows the labels of a small set of annotated data to spread in a consistent fashion to unlabeled data.
The graph is seeded with known positive and negative samples, and then each node is labeled in proportion to the percentage of times a random walk on the graph ends at that node. Only those nodes that are both similar to the seed nodes and central to the document set get a high score. The graph consists of both sentences and features and is bipartite, as sentences only connect to features and vice versa.

Label propagation has several uses. One application is finding translations for out-of-vocabulary words. Initially a graph is constructed from the available text and phrase table. Each vertex is a phrase type or some information extracted from the text, and it is connected to other vertices with edges that are formed from a similarity measure between the two profiles and filtered by a threshold value. There are three types of vertices: labeled, unlabeled, and out-of-vocabulary. Nodes for which translations are available are annotated with target-side translations and their feature values. The algorithm propagates the translations from labeled nodes to unlabeled nodes, which handles different types of out-of-vocabulary words. The graphs constructed are very large, and the methods are scalable.

Label propagation makes the assumption that the edge weights encode the degree of belief about the similarity of the soft labelings of the connected vertices. Therefore the soft labels for the unlabeled vertices can be computed from the labeled vertices.
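As a minimal sketch of one common iterative scheme for this (a Zhu-and-Ghahramani-style update, with our own names and a simple fixed iteration count): each unlabeled vertex repeatedly takes the weighted average of its neighbors' soft labels while the seed vertices stay clamped.

using System.Collections.Generic;
using System.Linq;

static class LabelPropagation
{
    // weights[u][v] is the similarity-based edge weight between u and v; seeds holds the clamped
    // soft labels of the labeled vertices; every vertex is assumed to have an entry in weights.
    public static Dictionary<int, double> Propagate(
        Dictionary<int, Dictionary<int, double>> weights,
        Dictionary<int, double> seeds,
        int iterations = 50)
    {
        var labels = weights.Keys.ToDictionary(v => v, v => seeds.ContainsKey(v) ? seeds[v] : 0.5);
        for (int it = 0; it < iterations; it++)
        {
            foreach (var v in weights.Keys.Where(u => !seeds.ContainsKey(u)))
            {
                double num = 0, den = 0;
                foreach (var edge in weights[v])
                {
                    num += edge.Value * labels[edge.Key];   // neighbor's current soft label
                    den += edge.Value;
                }
                if (den > 0) labels[v] = num / den;
            }
        }
        return labels;
    }
}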

The graph used with the label propagation method is essentially a node-similarity graph, with the assumption that connected vertices have edges based on some similarity. If we take a thesaurus and apply it directly to this graph, it is rather restrictive. However, if we take the PMI matrix and apply it to the graph, we get better information. The PMI is based on both positive and negative samples and can even include terms from the thesaurus as negative samples. Then, with the similarity graph, we can use a CRF that we build with heterogeneous features of interest, including a feature for the thesaurus. Once we tune the CRF with the similarity graph, we can use it on the text that becomes available.

Courtesy: Survey of graphs in NLP by Nastase and Mihalcea

 #codingexercise
Print the second non-repeating character seen so far in a stream of characters
void PrintSecondNonRepeatingChar(IEnumerable<char> s)
{
    const int MaxChar = 256;
    var nr = new LinkedList<char>();                    // non-repeating characters in arrival order
    var present = new LinkedListNode<char>[MaxChar];    // node in nr for each pending character
    var dup = new bool[MaxChar];                        // true once a character has repeated
    foreach (var current in s)
    {
        if (!dup[current])
        {
            if (present[current] == null)
            {
                present[current] = nr.AddLast(current); // first sighting
            }
            else
            {
                nr.Remove(present[current]);            // second sighting: no longer non-repeating
                present[current] = null;
                dup[current] = true;
            }
        }
        if (nr.Count > 1)
            Console.WriteLine("Second non-repeating char so far is {0}", nr.First.Next.Value);
    }
}

Wednesday, October 5, 2016

We were discussing the use of graphs in natural language processing: we said that when we have multiple diverse features for a vector that depend on each other, one way to represent them is with conditional random fields.
This comes in useful for clustering with word vectors because now we can include features from a thesaurus as well. When we represent organizational structure as a graph, we can use the concept of neighbors to determine equivalence. Two vertices are said to have regular equivalence if they are equally related to equivalent others; that is, they have neighbors that are themselves similar. In such a graph, an isolated vertex is in a class by itself, and vertices that have no other ties with any vertex fall in the same class, provided they have a tie to a vertex in the previous class.

The purpose of these equivalence classes is to show the levels of similarity. In the above example we have three levels of equivalence.


We discussed negative sampling in addition to conditional random fields. The conditional random field helped us keep the dependencies of the observed data implicit by associating a graphical framework only with the conditional probabilities, while the observed input vector can have any number of heterogeneous features such as prefixes, suffixes, and word bigrams. This comes in particularly useful in label propagation and in establishing long-distance dependencies between labels over the graphical framework. Consequently, the thesaurus semantic-similarity information can become a feature of the observed vector while the labels are still allowed to have dependencies with one another. Using the graphical network and the equivalences, we can use not only the labels but also the associated framework. Negative sampling was used to fill out the PMI matrix, but now we can use the graphical framework directly for the same relational data.

We apply label propagation on the same graphical model

#codingexercise:
Given the arrival and departure times of trains that stop at a station, find the minimum number of platforms required so that no train waits

int GetMinPlatforms(List<int> arr, List<int> dep)  // times as integers, e.g. 0930 for 9:30
{
    Debug.Assert(arr.Count == dep.Count);
    int n = arr.Count;
    if (n == 0) return 0;

    arr.Sort();
    dep.Sort();

    int plat = 1;
    int max = 1;
    int i = 1;
    int j = 0;

    while (i < n && j < n)
    {
        if (arr[i] <= dep[j])    // another train arrives before the earliest departure frees a platform
        {
            plat++;
            i++;
            if (plat > max)
                max = plat;
        }
        else                     // the earliest departing train frees its platform
        {
            plat--;
            j++;
        }
    }
    return max;
}

A chef has recently opened a new restaurant with a unique style. The restaurant is divided into K compartments (numbered from 1 to K) and each compartment can be occupied by at most one customer. Each customer that visits the restaurant has a strongly preferred compartment p (1 ≤ p ≤ K), and if that compartment is already occupied, then the customer simply leaves. Now obviously, the chef wants to maximize the total number of customers that dine at his restaurant and so he allows (or disallows) certain customers so as to achieve this task. You are to help him with this. Given a list of N customers with their arrival time, departure time and the preferred compartment, you need to calculate the maximum number of customers that can dine at the restaurant

int GetMaxCustomers(List<int> arr, List<int> dep, List<int> comp)
{
    Debug.Assert(arr.Count == dep.Count && dep.Count == comp.Count);
    int n = arr.Count;
    if (n == 0) return 0;

    // Group the customers by preferred compartment, keeping each (arrival, departure) pair together.
    var byCompartment = new Dictionary<int, List<(int arr, int dep)>>();
    for (int i = 0; i < n; i++)
    {
        if (!byCompartment.ContainsKey(comp[i]))
            byCompartment[comp[i]] = new List<(int arr, int dep)>();
        byCompartment[comp[i]].Add((arr[i], dep[i]));
    }

    // Within each compartment, admit the maximum number of non-overlapping customers
    // (activity selection: sort by departure time and greedily accept).
    int total = 0;
    foreach (var customers in byCompartment.Values)
    {
        customers.Sort((a, b) => a.dep.CompareTo(b.dep));
        int lastDeparture = int.MinValue;
        foreach (var c in customers)
        {
            if (c.arr >= lastDeparture)   // the compartment is free again when this customer arrives
            {
                total++;
                lastDeparture = c.dep;
            }
        }
    }
    return total;
}

Tuesday, October 4, 2016

We were discussing the use of graphs in natural language processing: we said that when we have multiple diverse features for a vector that depend on each other, one way to represent them is with conditional random fields. We can see this clearly in relational data, which has two characteristics: first, each entity has a rich set of features, and second, there are statistical dependencies between the entities we wish to model. These relationships and dependencies between entities are naturally represented in a graph. Graph-based models have been used to represent the joint probability distribution p(y,x), where the variables y represent the attributes of the entities that we wish to predict and the input variables x represent our observed knowledge about the entities. However, if we were to model joint distributions, then modeling the distribution p(x) can involve complex dependencies. Modeling these dependencies among inputs can lead to intractable models, but ignoring them can lead to reduced performance. Therefore, a solution may be to directly model the conditional distribution p(y|x), which is sufficient for classification. This is what a conditional random field does. A conditional random field is simply a conditional distribution p(y|x) with an associated graphical structure. The conditional dependencies among the input variables x now do not need to be represented explicitly, and the input can be assumed to have rich features.
Reference: Sutton and McCallum
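As a reminder of the form (our transcription of the standard definitions), a CRF factorizes the conditional distribution over a set of factors, and the familiar linear-chain special case uses feature functions over adjacent labels:

p(y \mid x) = \frac{1}{Z(x)} \prod_{a} \Psi_a(y_a, x_a), \qquad Z(x) = \sum_{y'} \prod_{a} \Psi_a(y'_a, x_a)

p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_{k} \theta_k\, f_k(y_t, y_{t-1}, x_t) \right)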

#codingexercise

Given a list of integers, segregate positive and negative ones
int Segregate(List<int> A)   // moves negative values to the front; returns the index of the first positive
{
     int i = -1;
     for (int j = 0; j < A.Count; j++)
     {
          if (A[j] < 0)
          {
            i++;
            int t = A[i]; A[i] = A[j]; A[j] = t;   // swap A[i] and A[j]
          }
     }
     return i + 1;
}
Now interleave the positive and negative values alternately in the segregated list

void Interleave(List<int> A, int i, int j)   // i = Segregate(A), j = 0 for positives first
{
  while (i < A.Count && j < i && A[j] < 0)
  {
       int t = A[j]; A[j] = A[i]; A[i] = t;   // bring the next positive to position j
       i++;
       j += 2;
  }
}
The same holds for interleaving with negative values first because in this case we merely start with j = 1

Monday, October 3, 2016

We were discussing graphs and text analysis.
The main difference between the thesaurus and the bag-of-words model is that the former selects words from the thesaurus based on semantic relationships, while the latter selects words based on collocation in contexts.
We can consider this in two steps: 
1) we select the words from the thesaurus just as we would for the lexical chain 
2) we substitute these words in the negative sampling of Goldberg's PMI as discussed here. 
Subramanya, Petrov and Pereira (2010) showed a way to use PMI directly in graphs. We start with a word similarity graph. The similarity graph is used during the training of a CRF on the target domain. Local sequence contexts in the form of n-grams are represented as graph vertices. The part of speech (POS) of a word occurrence is determined by its local context, as can be observed from text. For each n-gram, a set of context features is extracted whose values are the pointwise mutual information (PMI) between the n-gram and its features. Each node is represented by its PMI vector, and the similarity function is the cosine of these vectors. The neighbors of the nodes are used as features for the CRF, and this results in a "word embedding" of the larger contextual information in the model. The graph is then used to discover new features, to propagate adjustments to the weights of known features, and to train the CRF in a semi-supervised manner.
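For reference, the pointwise mutual information between an n-gram u and a context feature f, and the cosine similarity used between the resulting vectors, are the standard definitions:

\mathrm{PMI}(u, f) = \log \frac{p(u, f)}{p(u)\, p(f)}, \qquad \mathrm{sim}(u, v) = \frac{\vec{u} \cdot \vec{v}}{\lVert \vec{u} \rVert\, \lVert \vec{v} \rVert}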
#codingexercise
Given a string and a pattern, find all anagrams of the pattern in the original string. Print every index at which a match was found.

List<int> Match(string text, string pattern)
{
    var h = new Dictionary<char, int>();
    for (int i = 0; i < pattern.Length; i++)
        h[pattern[i]] = h.ContainsKey(pattern[i]) ? h[pattern[i]] + 1 : 1;
    var ret = new List<int>();
    for (int i = 0; i + pattern.Length <= text.Length; i++)
    {
        var m = new Dictionary<char, int>(h);   // fresh copy of the pattern counts for this window
        for (int j = i; j < i + pattern.Length; j++)
            if (m.ContainsKey(text[j]))
                m[text[j]]--;
        if (m.Values.All(x => x == 0))          // every pattern character was matched exactly
            ret.Add(i);
    }
    return ret;
}

We could also do this in linear time:
List<int> GetIndexes(string pattern, string text)
{
    var ret = new List<int>();
    int m = pattern.Length;
    if (m == 0 || text.Length < m) return ret;

    var h = new Dictionary<char, int>();        // character counts of the pattern
    var window = new Dictionary<char, int>();   // character counts of the current window of length m
    for (int i = 0; i < m; i++)
    {
        Increment(h, pattern[i]);
        Increment(window, text[i]);
    }
    for (int i = m; i < text.Length; i++)
    {
        if (Compare(h, window))
            ret.Add(i - m);
        Increment(window, text[i]);             // slide the window one character to the right
        int k = i - m;
        if (--window[text[k]] == 0)
            window.Remove(text[k]);
    }

    // Check the last window in the text
    if (Compare(h, window))
        ret.Add(text.Length - m);
    return ret;
}

void Increment(Dictionary<char, int> d, char c)
{
    if (!d.ContainsKey(c)) d[c] = 0;
    d[c]++;
}

bool Compare(Dictionary<char, int> a, Dictionary<char, int> b)
{
    return a.Count == b.Count && a.All(kv => b.ContainsKey(kv.Key) && b[kv.Key] == kv.Value);
}

Find the minimum distance between integers x and y in an array of integers
int GetMinDist(List<int> A, int x, int y)
{
   int min = int.MaxValue;
   int prev = A.FindIndex(t => t == x || t == y);
   for (int i = 0; i < A.Count; i++)
   {
      if (A[i] == x || A[i] == y)
      {
          if (prev != -1 && A[prev] != A[i] && (i - prev) < min)
          {
             min = i - prev;
          }
          prev = i;
      }
   }
   return min;   // remains int.MaxValue if x and y never appear as a pair
}

Find the second highest score from student performances
select max(score) from students where score not in (select max(score) from students)


The same pattern, with the exclusion nested n-1 times, extends to the nth highest score.