Thursday, October 6, 2016

We were discussing the use of graphs in natural language processing: One of the frequently used techniques is label propagation. It is a semi-supervised method that allows the labels of a small set of annotated data to spread in a consistent fashion to unlabeled data.
The graph is seeded with known positive and negative samples, and each node is then labeled in proportion to the percentage of times a random walk on the graph ends at that node. Only those nodes get a high score that are both similar to the seed nodes and central to the document set. The graph consists of both sentences and features and is bipartite, as sentences only connect to features and vice versa.
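As a rough illustration of this scoring, the following is a minimal sketch (not taken from the survey) that estimates a node's score as the fraction of short random walks, started at seed vertices, that end at that node. The adjacency representation, the seed list and the walk parameters are assumptions for the example.

// hypothetical helper: adjacency[v] lists the neighbors of vertex v in the bipartite graph
Dictionary<int, double> ScoreByRandomWalk(Dictionary<int, List<int>> adjacency, List<int> seeds, int walks, int steps)
{
    var rnd = new Random(0);
    var hits = new Dictionary<int, int>();
    for (int w = 0; w < walks; w++)
    {
        int node = seeds[rnd.Next(seeds.Count)];          // every walk starts at a seed vertex
        for (int s = 0; s < steps; s++)
        {
            var neighbors = adjacency[node];
            if (neighbors.Count == 0) break;
            node = neighbors[rnd.Next(neighbors.Count)];  // hop to a random neighbor
        }
        hits[node] = hits.ContainsKey(node) ? hits[node] + 1 : 1;
    }
    // the score of a vertex is the fraction of walks that end there
    return hits.ToDictionary(kvp => kvp.Key, kvp => (double)kvp.Value / walks);
}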

Label propagation has several uses. One application is finding translations for out-of-vocabulary words. A graph is first constructed from the available text and the phrase table. Each vertex is a phrase type or some information extracted from the text. It is connected to other vertices by edges whose weights come from a similarity measure between the two profiles, filtered by a threshold value. There are three types of vertices: labeled, unlabeled, and out-of-vocabulary. Nodes for which translations are available are annotated with target-side translations and their feature values. The algorithm propagates the translations from labeled nodes to unlabeled nodes, which handles different types of out-of-vocabulary words. The graph constructed this way is very large, but the methods are scalable.

Label propagation makes the assumption that an edge weight encodes the degree of belief about the similarity of the soft labeling for the connected vertices. The soft labels for the unlabeled vertices can therefore be computed from the labeled vertices.
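A minimal sketch of this idea, under the simplifying assumption that labeled vertices stay clamped while each unlabeled vertex repeatedly takes the edge-weight-averaged label distribution of its neighbors; the data structures here are illustrative, and every neighbor is assumed to appear as a key in the edge map.

Dictionary<int, double[]> Propagate(Dictionary<int, List<Tuple<int, double>>> edges, Dictionary<int, double[]> seeds, int numLabels, int iterations)
{
    // seeds keep their label distributions; unlabeled vertices start at all zeros
    var labels = edges.Keys.ToDictionary(v => v, v => seeds.ContainsKey(v) ? (double[])seeds[v].Clone() : new double[numLabels]);
    for (int it = 0; it < iterations; it++)
    {
        var updated = new Dictionary<int, double[]>(labels);
        foreach (var v in edges.Keys)
        {
            if (seeds.ContainsKey(v)) continue;           // labeled vertices are never overwritten
            var acc = new double[numLabels];
            double total = 0;
            foreach (var e in edges[v])                   // e.Item1 = neighbor, e.Item2 = edge weight
            {
                total += e.Item2;
                for (int k = 0; k < numLabels; k++)
                    acc[k] += e.Item2 * labels[e.Item1][k];
            }
            if (total > 0)
                for (int k = 0; k < numLabels; k++) acc[k] /= total;   // weighted average of neighbor labels
            updated[v] = acc;
        }
        labels = updated;
    }
    return labels;
}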

The graph used with the label propagation method is essentially a node similarity graph, with the assumption that connected vertices share an edge based on some similarity measure. If we take a thesaurus and apply it directly to this graph, it is rather restrictive. However, if we take the PMI matrix and apply it to the graph, we get better information. The PMI is based on both positive and negative samples and can even include terms from the thesaurus as negative samples. With the similarity graph, we can then use a CRF we build that has heterogeneous features of interest, including a feature for the thesaurus. Once we tune the CRF with the similarity graph, we can use it on new text as it becomes available.
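For reference, here is a small count-based sketch of the PMI quantity itself. The post's PMI matrix is estimated with positive and negative samples in the word2vec sense, but the log-ratio being estimated is the same; the count dictionaries and key encoding below are illustrative assumptions.

double Pmi(Dictionary<string, int> pairCounts, Dictionary<string, int> wordCounts, Dictionary<string, int> contextCounts, long total, string word, string context)
{
    // PMI(w, c) = log( p(w, c) / (p(w) * p(c)) ), estimated from co-occurrence counts
    var key = word + "|" + context;
    if (!pairCounts.ContainsKey(key) || pairCounts[key] == 0)
        return 0.0;                                   // unseen pair; positive-PMI variants clamp to zero anyway
    double pJoint = (double)pairCounts[key] / total;
    double pWord = (double)wordCounts[word] / total;
    double pContext = (double)contextCounts[context] / total;
    return Math.Log(pJoint / (pWord * pContext));
}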

Courtesy: Survey of graphs in NLP by Nastase and Mihalcea

 #codingexercise
Print the second non-repeating character
void PrintSecondNonRepeatingChar(StreamReader s)
{
    const int MaxChar = 256;
    var nr = new LinkedList<char>();                      // non-repeating characters in order of first arrival
    var present = new LinkedListNode<char>[MaxChar];      // character -> its node in nr, if still non-repeating
    var dup = new bool[MaxChar];                          // character has already repeated
    int current = s.Read();
    while (current != -1)
    {
        char c = (char)current;
        if (dup[c] == false)
        {
            if (present[c] == null)
            {
                present[c] = nr.AddLast(c);               // first occurrence: append to the list
            }
            else
            {
                nr.Remove(present[c]);                    // second occurrence: no longer non-repeating
                present[c] = null;
                dup[c] = true;
            }
        }
        if (nr.Count > 1)
            Console.WriteLine("Second non-repeated char is {0}", nr.First.Next.Value);
        current = s.Read();
    }
}

Wednesday, October 5, 2016

We were discussing the use of graphs in natural language processing: We said that when we have multiple diverse features of a vector that depend on each other, one way to represent them is with conditional random fields.
This comes in useful in clustering with word vectors because now we can include features from a thesaurus as well. When we represent organizational structure as graphs, we can use the concept of neighbors to determine equivalence. Two vertices are said to have regular equivalence if they are equally related to equivalent others; that is, they are vertices whose neighbors are themselves similar. In a graph, an isolated vertex is in a class by itself. Vertices whose only ties are to a vertex in the previous class fall in the same class as each other, and so on.

The purpose of these equivalence classes is to show the levels of similarity. In the above example we have three levels of equivalence.


We discussed negative sampling in addition to conditional random fields. The conditional random field helped us keep the dependencies of the observed data implicit by associating a graphical framework only with the conditional probabilities. The observed input vector can have any number of heterogeneous features, such as prefixes, suffixes, and word bigrams. This is particularly useful in label propagation and in establishing long-distance dependencies between labels over the graphical framework. Consequently, the thesaurus semantic-similarity information can become a feature of the observed vector while the labels are still allowed to have dependencies with one another. Using the graphical network and the equivalences, we can use not only the labels but also the associated framework. Negative sampling was used to fill out the PMI matrix, but now we can use the graphical framework directly for the same relational data.
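A rough sketch of the kind of heterogeneous observation features meant here: prefixes, suffixes, a neighboring-word bigram, and a thesaurus-membership flag folded in as just another feature. The feature names and the thesaurus lookup are illustrative assumptions, not a prescribed feature set.

Dictionary<string, double> ExtractFeatures(string[] sentence, int position, HashSet<string> thesaurusTerms)
{
    var word = sentence[position].ToLowerInvariant();
    var features = new Dictionary<string, double>();
    features["word=" + word] = 1.0;
    features["prefix3=" + word.Substring(0, Math.Min(3, word.Length))] = 1.0;       // prefix feature
    features["suffix3=" + word.Substring(Math.Max(0, word.Length - 3))] = 1.0;      // suffix feature
    if (position > 0)
        features["bigram=" + sentence[position - 1].ToLowerInvariant() + "_" + word] = 1.0;  // word bigram
    if (thesaurusTerms.Contains(word))
        features["in_thesaurus"] = 1.0;              // thesaurus similarity folded in as one more feature
    return features;
}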

We apply label propagation on the same graphical model.

#codingexercise:
Given the arrival and departure times of trains that stop at a station, find the minimum number of platforms required so that no train waits

int GetMinPlatforms(List<time> arr, List<time> dep)
{
    Debug.Assert(arr.Count == dep.Count);
    int n = arr.Count;
    if (n == 0) return 0;

    arr.Sort();
    dep.Sort();

    int plat = 1;   // the first train occupies one platform
    int max = 1;
    int i = 1;
    int j = 0;

    while (i < n && j < n)
    {
        if (arr[i] < dep[j])
        {
            plat++;             // another train arrives before the earliest departure
            i++;
            if (plat > max)
                max = plat;
        }
        else
        {
            plat--;             // a train departs and frees a platform
            j++;
        }
    }
    return max;
}

A chef has recently opened a new restaurant with a unique style. The restaurant is divided into K compartments (numbered from 1 to K) and each compartment can be occupied by at most one customer. Each customer that visits the restaurant has a strongly preferred compartment p (1 ≤ p ≤ K), and if that compartment is already occupied, then the customer simply leaves. Now obviously, the chef wants to maximize the total number of customers that dine at his restaurant and so he allows (or disallows) certain customers so as to achieve this task. You are to help him with this. Given a list of N customers with their arrival time, departure time and the preferred compartment, you need to calculate the maximum number of customers that can dine at the restaurant.

int GetMaxCustomers(List<time> arr, List<time> dep, List<int> comp)
{
    Debug.Assert(arr.Count == dep.Count && dep.Count == comp.Count);
    int n = arr.Count;
    if (n == 0) return 0;

    // Compartments are independent: within each compartment, admit the largest set of
    // non-overlapping stays (activity selection: sort by departure time, take greedily).
    int total = 0;
    foreach (var group in Enumerable.Range(0, n).GroupBy(i => comp[i]))
    {
        bool first = true;
        time lastDeparture = default(time);
        foreach (var i in group.OrderBy(x => dep[x]))
        {
            if (first || arr[i] >= lastDeparture)   // assumes a customer may enter exactly when the previous one leaves
            {
                total++;
                lastDeparture = dep[i];
                first = false;
            }
        }
    }
    return total;
}

Tuesday, October 4, 2016

We were discussing the use of graphs in natural language processing: We said that when we have multiple diverse features of a vector that depend on each other, one way to represent them is with conditional random fields. We can see this clearly in relational data, which have two characteristics. First, each entity has a rich set of features, and second, there are statistical dependencies between the entities we wish to model. These relationships and dependencies between entities are naturally represented in a graph. Graph-based models have been used to represent the joint probability distribution p(y,x), where the variables y represent the attributes of the entities that we wish to predict and the input variables x represent our observed knowledge about the entities. However, if we were to model the joint distribution, then modeling the distribution p(x) can involve complex dependencies. Modeling these dependencies among inputs can lead to intractable models, but ignoring them can lead to reduced performance. Therefore, a solution is to directly model the conditional distribution p(y|x), which is sufficient for classification. This is what conditional random fields do. A conditional random field is simply a conditional distribution p(y|x) with an associated graphical structure. The conditional dependencies among the input variables x now do not need to be represented explicitly, and the input can be assumed to have rich features.
Reference: Sutton and McCallum
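A minimal sketch of the log-linear form a linear-chain CRF gives to p(y|x): the unnormalized score below would still have to be divided by the partition function Z(x), summed over all label sequences, to obtain a probability. The feature-function delegate and the weights are placeholders for illustration.

delegate double FeatureFunction(string prevLabel, string label, string[] x, int t);

double UnnormalizedScore(string[] x, string[] y, List<Tuple<FeatureFunction, double>> weightedFeatures)
{
    double logScore = 0;
    for (int t = 0; t < y.Length; t++)
    {
        var prev = (t == 0) ? "START" : y[t - 1];
        foreach (var wf in weightedFeatures)                  // wf.Item1 = feature function f_k, wf.Item2 = weight lambda_k
            logScore += wf.Item2 * wf.Item1(prev, y[t], x, t);
    }
    return Math.Exp(logScore);                                // p(y|x) = this value divided by Z(x)
}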

#codingexercise

Given a list of integers, segregate positive and negative ones
int Segregate(ref List<int> A)
{
    int i = -1;
    for (int j = 0; j < A.Count; j++)
    {
        if (A[j] < 0)
        {
            i++;
            Swap(A, i, j);   // move the negative value into the next slot at the front
        }
    }
    return i + 1;            // index of the first positive value
}
Now interleave the positive and negative alternatively in a segregated list

void Interleave(ref List<int> A, int i, int j)   // i = Segregate(A), j = 0
{
    int n = A.Count;
    while (i < n && j < i)
    {
        Swap(A, j, i);       // bring the next positive value to the alternating slot
        i++;
        j += 2;
    }
}
The same approach holds for interleaving with negative values first; in that case we merely start with j = 1.

Monday, October 3, 2016

We were discussing graphs and text analysis.
The main difference between the thesaurus model and the bag-of-words model is that the former selects words from the thesaurus based on semantic relationships while the latter selects words based on collocation in contexts.
We can consider this in two steps: 
1) we select the words from the thesaurus just as we would for the lexical chain 
2) we substitute these words in the negative sampling of Goldberg's PMI as discussed here. 
Subramanya, Petrov and Pereira (2010) showed a way to use PMI directly in graphs. We start with a word similarity graph. The similarity graph is used during the training of a CRF on the target domain. Local sequence contexts in the form of n-grams are represented as graph vertices. The part of speech (POS) of a word occurrence is determined by its local context, as can be observed from text. For each n-gram, a set of context features is extracted whose values are the pointwise mutual information (PMI) between the n-gram and its features. Each node is represented by its PMI vector, and the similarity function is the cosine of these vectors. The neighbors of a node are used as features for the CRF, and this results in a "word embedding" of the larger contextual information in the model. The graph is then used to discover new features, to propagate adjustments to the weights of known features, and to train the CRF in a semi-supervised manner.
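A small sketch of the similarity function just described: each vertex carries a sparse vector of PMI values over its context features, and edge weights come from the cosine of these vectors. The sparse dictionary representation is an assumption for illustration.

double Cosine(Dictionary<string, double> a, Dictionary<string, double> b)
{
    double dot = 0, normA = 0, normB = 0;
    foreach (var kvp in a)
    {
        normA += kvp.Value * kvp.Value;
        if (b.ContainsKey(kvp.Key))
            dot += kvp.Value * b[kvp.Key];          // only shared context features contribute
    }
    foreach (var v in b.Values)
        normB += v * v;
    if (normA == 0 || normB == 0) return 0;
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}
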
#codingexercise
Given a String and a pattern find all anagrams of the pattern in the original string. You need to print all the index of the location where the match was found

List<int> Match(string text, string pattern)
{
    var h = new Dictionary<char, int>();
    for (int i = 0; i < pattern.Length; i++)
        h[pattern[i]] = h.ContainsKey(pattern[i]) ? h[pattern[i]] + 1 : 1;
    var ret = new List<int>();
    for (int i = 0; i + pattern.Length <= text.Length; i++)
    {
        var m = new Dictionary<char, int>(h);             // fresh copy of the pattern counts for this window
        for (int j = i; j < i + pattern.Length; j++)
            if (m.ContainsKey(text[j]))
                m[text[j]]--;
        if (m.Values.All(x => x == 0))                    // every pattern character was consumed exactly
            ret.Add(i);
    }
    return ret;
}

We could also do this in linear time with a sliding window of character counts (Compare below is a small helper that checks the two count maps agree, treating zero counts as absent):
List<int> GetIndexes(string pattern, string text)
{
    var h = new Dictionary<char, int>();        // character counts of the pattern
    var window = new Dictionary<char, int>();   // character counts of the current window
    var ret = new List<int>();
    for (int i = 0; i < pattern.Length; i++)
    {
        h[pattern[i]] = h.ContainsKey(pattern[i]) ? h[pattern[i]] + 1 : 1;
        window[text[i]] = window.ContainsKey(text[i]) ? window[text[i]] + 1 : 1;
    }
    for (int i = pattern.Length; i < text.Length; i++)
    {
        if (Compare(h, window))                           // the window ending at i-1 is an anagram
            ret.Add(i - pattern.Length);
        window[text[i]] = window.ContainsKey(text[i]) ? window[text[i]] + 1 : 1;     // character entering the window
        int k = i - pattern.Length;                                                   // character leaving the window
        window[text[k]] = window.ContainsKey(text[k]) ? window[text[k]] - 1 : -1;
    }

    // Check the last window in the text
    if (Compare(h, window))
        ret.Add(text.Length - pattern.Length);
    return ret;
}

Find the minimum distance between integers x and y in an array of integers
int GetMinDist(List<int> A, int x, int y)
{
    int min = int.MaxValue;
    int prev = A.FindIndex(t => t == x || t == y);   // first occurrence of either value
    for (int i = 0; i < A.Count; i++)
    {
        if (A[i] == x || A[i] == y)
        {
            if (prev != -1 && A[prev] != A[i] && (i - prev) < min)
            {
                min = i - prev;
            }
            prev = i;
        }
    }
    return min;
}

Find the second highest score from student performances
select max(score) from students where score not in (select max(score) from students)


The same nested exclusion can be repeated to find the nth highest score.



Sunday, October 2, 2016

Today we read a survey of using graphs in natural language processing. There are many applications of graphs in NLP, ranging from summarization and word embeddings to syntactic and semantic relationships.
Graphs are directly applicable to syntactic and semantic relations. WordNet is a great example. It is a semantic network that is heavily used for word sense disambiguation, semantic similarity, question answering, and other tasks. There have been efforts to expand or improve this ontology by connecting different domains, such as linkeddata.org. This uses the web to connect related data that wasn't previously linked. It connects data on the Semantic Web using URIs and RDF. Graphs have also come to differ from traditional nodes and edges with the introduction of heterogeneous graphs, graphs with multilayered edges, and so on.
A heterogeneous graph is one where the vertices may correspond to different types of entities and the edges to different types of links between vertices of the same or different type. An example of a heterogeneous graph is one consisting of articles, their authors and the bibliographic references. Edges between authors could correspond to coauthorship, collaboration or citation, while edges between authors and their papers represent authorship.
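As a toy sketch of such a heterogeneous bibliographic graph, vertices can carry a type (author or paper) and edges a relation type (authorship, citation, collaboration); the class and member names are illustrative.

enum VertexType { Author, Paper }
enum RelationType { Authorship, Citation, Collaboration }

class Vertex { public string Id; public VertexType Type; }
class Edge { public string From; public string To; public RelationType Relation; }

class HeterogeneousGraph
{
    public Dictionary<string, Vertex> Vertices = new Dictionary<string, Vertex>();
    public List<Edge> Edges = new List<Edge>();

    public void AddVertex(string id, VertexType type)
    {
        Vertices[id] = new Vertex { Id = id, Type = type };
    }

    public void AddEdge(string from, string to, RelationType relation)
    {
        Edges.Add(new Edge { From = from, To = to, Relation = relation });   // endpoints may be of different vertex types
    }
}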
Graphs have been used in unsupervised learning, language identification, part-of-speech induction, and word sense induction. Beyond the traditional application of answering queries, what differentiates graphs from other data structures in NLP is that we now have persisted nodes, edges and features to do tasks such as clustering and to improve statistical relational learning; for example, we can have similarity as edges with the measures as weights. Also, conditional random fields (CRFs) allow us to directly model the conditional probability distribution p(y|x), which is sufficient for classification.
Several different clustering approaches have already been directly applied to NLP. Some of the other most used techniques include min-cut, spectral graph transducer, random-walk-based approaches and label propagation. The last one is a method by which we allow the labels of a small set of annotated data to spread in a consistent fashion to unlabeled data.
We start with a word similarity graph. The similarity graph is used during the training of a CRF on the target domain. Local sequence contexts in the form of n-grams are represented as graph vertices. The part of speech (POS) of a word occurrence is determined by its local context, as can be observed from text. For each n-gram, a set of context features is extracted whose values are the pointwise mutual information (PMI) between the n-gram and its features. Each node is represented by its PMI vector, and the similarity function is the cosine of these vectors. The neighbors of a node are used as features for the CRF, and this results in a "word embedding" of the larger contextual information in the model. The graph is then used to discover new features, to propagate adjustments to the weights of known features, and to train the CRF in a semi-supervised manner.


#codingexercise
Check if a binary tree is a binary search tree
bool isBST(node root)
{
return IsBstHelper(root, INT_MIN, INT_MAX);
}

bool IsBstHelper(node root, int min, int max)
{
 if (root==null) return true;
 if (root.data < min || root.data> max) return false;
return IsBstHelper(root.left, min, root.data-1) &&
           IsBstHelper(root.right, root.data+1, max);
}

Find the size of the largest subtree that is a binary search tree

int GetMaxBST(node root)
{
    if (root == null) return 0;
    if (isBST(root)) return treesize(root);   // treesize counts the nodes in the subtree
    return max(GetMaxBST(root.left), GetMaxBST(root.right));
}

Another tree question: print the left view of a binary tree
void PrintLeftView(node root, int level, ref int maxlevel)
{
    if (root == null) return;
    if (maxlevel < level)
    {
        Console.Write(root.data);            // first node seen at this depth from the left
        maxlevel = level;
    }
    PrintLeftView(root.left, level + 1, ref maxlevel);
    PrintLeftView(root.right, level + 1, ref maxlevel);
}



Saturday, October 1, 2016

Many users are accessing a web site with the following access patterns:
User 1 has access pattern : x->y->z->a->b->c->d->e->f
user 2 has access pattern : z->a->b->c->d
user 3 has access pattern : y->z->a->b->c->d
user 4 has access pattern : a->b->c->d
and the list goes on for many more users, who are finite and numbered.
Now the question is: can we determine the top 3 most occurring k-page sequences?
For the above example the result (k=3) will be: a->b->c, b->c->d, z->a->b.
List<string> GetFrequent(List<List<Pages>> patterns, int k, int n)
{
    var t = new Dictionary<string, int>();                 // k-page sequence -> occurrence count
    for (int i = 0; i < patterns.Count; i++)
    {
        for (int j = 0; j + k <= patterns[i].Count; j++)
        {
            var pages = new StringBuilder();
            for (int l = 0; l < k; l++)
                pages.Append(patterns[i][j + l]);
            var key = pages.ToString();
            t[key] = t.ContainsKey(key) ? t[key] + 1 : 1;
        }
    }
    // sort the sequences by descending count and return the top n
    return t.OrderByDescending(kvp => kvp.Value)
            .Take(n)
            .Select(kvp => kvp.Key)
            .ToList();
}

Given two integers k and n, write a function that prints all the sequences of length k composed of numbers 1,2..n.  These sequences are to be printed in sorted order.
Input: k = 2, n = 3

Output:
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3

void PrintSeqIterative(int n, int k)
{
    var A = Enumerable.Repeat(1, k).ToList();    // start with the smallest sequence 1 1 ... 1
    while (true)
    {
        A.ForEach(x => Console.Write("{0} ", x));
        Console.WriteLine();
        if (GetSuccessor(A, k, n) == 0)
            break;
    }
}
int GetSuccessor(List<int> A, int k, int n)
{
    int p = k - 1;
    while (p >= 0 && A[p] == n) p--;             // rightmost position that can still be incremented
    if (p < 0)
        return 0;                                // no successor: the last sequence was printed
    A[p] = A[p] + 1;
    for (int i = p + 1; i < k; i++)              // reset everything to the right back to 1
        A[i] = 1;
    return 1;
}
This can be made recursive as well.