Sunday, October 2, 2016

Today we read a survey of using graphs in natural language processing. There are many applications of graphs in NLP, ranging from summarization and word embeddings to syntactic and semantic relationships.
Graphs are directly applicable to syntactic and semantic relations. WordNet is a great example. It is a semantic network that is heavily used for word sense disambiguation, semantic similarity, question answering, and other tasks. There have been efforts to expand or improve this ontology by connecting different domains, such as at linkeddata.org, which uses the web to connect related data that wasn't previously linked. It connects data on the Semantic Web using URIs and RDF. Graphs have also moved beyond the traditional nodes and edges with the introduction of heterogeneous graphs, graphs with multilayered edges, and so on.
A heterogeneous graph is one where the vertices may correspond to different types of entities and the edges to different types of links between vertices of the same or different types. An example of a heterogeneous graph is one consisting of articles, their authors and the bibliographic references. Edges between authors could correspond to coauthorship, collaboration or citation, while edges between authors and their papers represent authorship.
Graphs have been used in unsupervised learning for language identification, part-of-speech induction and word sense induction. Beyond the traditional application of graphs for answering queries, what differentiates graphs from other data structures in NLP is that we now have persisted nodes, edges and features on which to perform tasks such as clustering and to improve statistical relational learning. For example, we can represent similarity as edges with the similarity measures as weights. Also, conditional random fields (CRFs) allow us to directly model the conditional probability distribution p(y|x), which is sufficient for classification.
Several different clustering approaches have already been applied directly to NLP. Some of the most used techniques include min-cut, the spectral graph transducer, random-walk-based approaches and label propagation. The last is a method by which the labels of a small set of annotated data are allowed to spread in a consistent fashion to unlabeled data.
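As a rough illustration (my own sketch, not taken from the survey), a minimal label propagation step could look as follows, assuming a dense matrix of edge weights and clamped seed labels; unlabeled nodes repeatedly take the weighted average of their neighbors' label distributions (uses System.Linq):
// w[i][j] holds the symmetric edge weight between nodes i and j;
// labels[i] is a fixed label distribution for seed nodes and null for unlabeled nodes.
double[][] Propagate(double[][] w, double[][] labels, int iterations)
{
    int n = w.Length;
    int k = labels.First(l => l != null).Length;
    var y = Enumerable.Range(0, n)
                      .Select(i => labels[i] ?? Enumerable.Repeat(1.0 / k, k).ToArray())
                      .ToArray();
    for (int it = 0; it < iterations; it++)
        for (int i = 0; i < n; i++)
        {
            if (labels[i] != null) continue;               // seed labels stay clamped
            var acc = new double[k];
            double total = 0;
            for (int j = 0; j < n; j++)
            {
                if (w[i][j] == 0) continue;
                for (int c = 0; c < k; c++) acc[c] += w[i][j] * y[j][c];
                total += w[i][j];
            }
            if (total > 0)
                for (int c = 0; c < k; c++) y[i][c] = acc[c] / total;   // weighted average of neighbors
        }
    return y;
}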
We start with a word similarity graph. The similarity graph is used during training of a CRF on the target domain. Local sequence contexts in the form of n-grams are represented as graph vertices, since the part of speech (POS) of a word occurrence is largely determined by its local context, as can be observed from text. For each n-gram, a set of context features is extracted whose values are the pointwise mutual information (PMI) between the n-gram and those features. Each node is represented by its PMI vector, and the similarity function is the cosine of these vectors. The neighbors of a node are used as features for the CRF, which results in an embedding of the larger contextual information in the model. The graph is then used to discover new features, to propagate adjustments to the weights of known features and to train the CRF in a semi-supervised manner.
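To make the PMI and cosine computations concrete, here is a small sketch of my own (the count parameters are illustrative, not the paper's code) of the two quantities involved:
// PMI between an n-gram and a context feature from co-occurrence counts,
// and cosine similarity between two sparse PMI vectors (the similarity-graph edge weight).
double Pmi(int pairCount, int ngramCount, int featureCount, int total)
{
    // PMI = log( p(ngram, feature) / (p(ngram) * p(feature)) )
    return Math.Log(((double)pairCount * total) / ((double)ngramCount * featureCount));
}

double Cosine(Dictionary<string, double> a, Dictionary<string, double> b)
{
    double dot = 0;
    foreach (var kv in a)
        if (b.ContainsKey(kv.Key)) dot += kv.Value * b[kv.Key];
    double na = Math.Sqrt(a.Values.Sum(v => v * v));
    double nb = Math.Sqrt(b.Values.Sum(v => v * v));
    return (na == 0 || nb == 0) ? 0 : dot / (na * nb);
}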


#codingexercise
Check if a binary tree is a binary search tree
bool isBST(node root)
{
    // every node must fall within the (min, max) range imposed by its ancestors
    return IsBstHelper(root, INT_MIN, INT_MAX);
}

bool IsBstHelper(node root, int min, int max)
{
    if (root == null) return true;
    if (root.data < min || root.data > max) return false;
    return IsBstHelper(root.left, min, root.data - 1) &&
           IsBstHelper(root.right, root.data + 1, max);
}

Find the size of the largest subtree that is a BST

int GetMaxBST(node root)
{
    if (root == null) return 0;
    if (isBST(root)) return treesize(root);   // treesize counts the nodes in the subtree
    return max(GetMaxBST(root.left), GetMaxBST(root.right));
}

Another tree question: print the left view of a binary tree
void printLeftView(node root, int level, ref int maxlevel)
{
    if (root == null) return;
    if (maxlevel < level)
    {
        // the first node reached at a new level is the one visible from the left
        Console.Write(root.data);
        maxlevel = level;
    }
    printLeftView(root.left, level + 1, ref maxlevel);
    printLeftView(root.right, level + 1, ref maxlevel);
}



Saturday, October 1, 2016

Many users access a web site with the following access patterns:
User 1 has access pattern: x->y->z->a->b->c->d->e->f
User 2 has access pattern: z->a->b->c->d
User 3 has access pattern: y->z->a->b->c->d
User 4 has access pattern: a->b->c->d
and the list goes on for many more users, which are finite and numbered.
The question is: can we determine the top three most frequently occurring k-page sequences?
For the above example the result (for k = 3) is: a->b->c, b->c->d, z->a->b.
List<string> GetFrequent(Dictionary<string, int> t, List<List<Pages>> patterns, int k, int n)
{
    // count every contiguous k-page sequence across all access patterns
    for (int i = 0; i < patterns.Count; i++)
    {
        for (int j = 0; j + k <= patterns[i].Count; j++)
        {
            var pages = new StringBuilder();
            for (int l = 0; l < k; l++)
            {
                if (l > 0) pages.Append("->");
                pages.Append(patterns[i][j + l]);
            }
            var key = pages.ToString();
            if (!t.ContainsKey(key)) t[key] = 0;
            t[key]++;
        }
    }
    // order the sequences by descending count and return the top n
    return t.OrderByDescending(kvp => kvp.Value)
            .Take(n)
            .Select(kvp => kvp.Key)
            .ToList();
}

Given two integers k and n, write a function that prints all the sequences of length k composed of numbers 1,2..n.  These sequences are to be printed in sorted order.
Input: k = 2, n = 3

Output:
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3

void PrintSeqIterative(int n, int k)
{
    // start with the smallest sequence: k ones
    var A = Enumerable.Repeat(1, k).ToList();
    while (true)
    {
        A.ForEach(x => Console.Write("{0} ", x));
        Console.WriteLine();
        if (getSuccessor(A, k, n) == 0)
            break;
    }
}
int getSuccessor(List<int> A, int k, int n)
{
    // find the rightmost position that can still be incremented
    int p = k - 1;
    while (p >= 0 && A[p] == n) p--;
    if (p < 0)
        return 0;              // already at the last sequence n n ... n
    A[p] = A[p] + 1;
    for (int i = p + 1; i < k; i++)
        A[i] = 1;              // reset everything to the right of p
    return 1;
}
This can be made recursive as well.
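A recursive variant might look like the following sketch, which builds the sequence position by position and backtracks:
void PrintSeqRecursive(List<int> A, int k, int n)
{
    if (A.Count == k)
    {
        // a complete sequence of length k has been built
        A.ForEach(x => Console.Write("{0} ", x));
        Console.WriteLine();
        return;
    }
    for (int v = 1; v <= n; v++)
    {
        A.Add(v);                     // choose the next value
        PrintSeqRecursive(A, k, n);
        A.RemoveAt(A.Count - 1);      // backtrack
    }
}
// usage: PrintSeqRecursive(new List<int>(), 2, 3);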
 

Friday, September 30, 2016

Yesterday we were discussing the role of thesaurus in natural language processing.
The essence of using a thesaurus is to build, for every significant word of the text, a list of the heads under which it appears, and to look for intersections between these lists. Candidate words that are poor translations are replaced with words or phrases selected from the head that contains the most words from the initial translation. In addition, the thesaurus has been used for word sense disambiguation.
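As a rough sketch of that idea (the HeadsOf lookup below is a hypothetical function returning the set of thesaurus heads a word appears under):
// Tally how many of the text's significant words fall under each thesaurus head;
// the heads shared by the most words are the strongest semantic candidates.
Dictionary<string, int> SharedHeads(List<string> words, Func<string, HashSet<string>> HeadsOf)
{
    var tally = new Dictionary<string, int>();
    foreach (var word in words)
        foreach (var head in HeadsOf(word))
        {
            if (!tally.ContainsKey(head)) tally[head] = 0;
            tally[head]++;
        }
    return tally;   // pick replacements from the head(s) with the highest count
}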
Using the thesaurus, we can also build lexical chains, which are sequences of words tied together by strong semantic relations. Morris and Hirst used them to calculate lexical cohesion. Stairman automated this process. Ellman used lexical chains to reconstruct the representation of a text's meaning from its content.
As an aside, WordNet has helped tremendously with word sense disambiguation. It has also been used in different languages. Due to the limitations of the printed version of the thesaurus, many have turned to WordNet. But if the thesaurus were machine tractable and free, Jarmasz argued in his thesis, it would be better than WordNet. Furthermore, the two can be combined.
It is not straightforward to use the lexical chain words as vector features because they would lose the cohesion of the chain. Similarly, pairwise word distances cannot be used because we don't have the relationships between the words needed to select them as candidates before looking them up in the thesaurus. However, k-means clustering applies directly to words once they are looked up in the thesaurus. Consequently, cluster centroids can be found and all the candidate words tagged to their nearest cluster.
The main difference between the thesaurus and the bag of words model is that the former selects words from the thesaurus for semantic relationships while the latter selects the words based on collocation in contexts.
That said, we can construct vectors from semantically correlated words just as we would from the text.
Levy and Goldberg showed that word embeddings are closely related to pointwise mutual information (PMI), and this is equally applicable to semantically correlated words.
While they chose random contexts in negative sampling, we can substitute lexical chains up to a finite number. Further, if we had a technique to use semantic cluster centroids, we could use them directly instead of lexical chains.

#codingexercise
For a given array a of size n, we create the set a[i], a[a[i]], a[a[a[i]]], ... for each i from 0 to n-1; find the maximum size of such a set.
Let us take an array with two elements holding the binary values 1 and 0 as the first and second elements. Then we can create the sets
a[0], a[a[0]], a[a[a[0]]], ...
a[1], a[a[1]], a[a[a[1]]], ...
which are
1, 0, 1, 0, ...
0, 1, 0, 1, ...
void printNested(List<int> A, int i, int index)
{
    if (i < A.Count)
    {
        if (index < A.Count)
        {
            Console.Write(A[index]);
            // follow the chain a[i], a[a[i]], ...; note this does not terminate on a cycle
            printNested(A, i, A[index]);
        }
        else
        {
            // start the next chain
            printNested(A, i + 1, i + 1);
        }
    }
}

It may be interesting to note that the series can be infinite when the indices form a cycle, as in the example above.

Thursday, September 29, 2016

Today we continue to discuss the relevance of the thesaurus in natural language processing. Many algorithms have been proposed that determine latent similarity based on a word co-occurrence matrix. Yet it is natural to find similarity based on some ontology. WordNet filled this gap very well. Words were arranged in a predefined tree and their hierarchy gave an organization of the topics they belonged under. Furthermore, it gave similarity measures as distances between the words along the edges. The drawbacks were that not all words were found in WordNet and the organization didn't really help with every context. Where it helped the most was in its association with parts of speech. A thesaurus improves the usefulness in determining similarity by providing all possible synonyms and thus exploring as many semantic connections as possible. In fact, the authors of many NLP algorithms have stated that their methods would be significantly improved by such an online thesaurus, and that a thesaurus theoretically does better with semantic closeness than WordNet.
Such a thesaurus was not available in machine-readable format.
Today we have learned to store graphs with trillions of edges. We can partition them across thousands of nodes using technologies such as Apache Giraph. With several thousand worker nodes, we can find relationships between words using parallelized summation forms. For example, if we were to perform k-means clustering of the words based on accepted similarities, the vertices could start out partitioned on different nodes but eventually become tagged with the cluster they belong to. The master would initialize the centroids and share them with local aggregators. These aggregators in turn would tag their vertices to the nearest centroid. The centroids would then be recomputed once the aggregators have shared their results with the master.
This summation form technique works very well, with scalability linear in the size of the graph.
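A sequential sketch of one such iteration is below; in Giraph the assignment loop would run per vertex on the workers and only the per-centroid sums and counts would travel to the master (the word vectors and Euclidean distance here are illustrative assumptions):
// One k-means iteration in summation form: accumulate per-centroid vector sums and
// counts while tagging each word with its nearest centroid, then recompute centroids.
double[][] KMeansStep(double[][] words, double[][] centroids, int[] assignment)
{
    int k = centroids.Length, d = centroids[0].Length;
    var sums = new double[k][];
    var counts = new int[k];
    for (int c = 0; c < k; c++) sums[c] = new double[d];
    for (int i = 0; i < words.Length; i++)
    {
        int best = 0;
        double bestDist = double.MaxValue;
        for (int c = 0; c < k; c++)
        {
            double dist = 0;
            for (int j = 0; j < d; j++)
                dist += (words[i][j] - centroids[c][j]) * (words[i][j] - centroids[c][j]);
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        assignment[i] = best;                       // tag the vertex with its cluster
        for (int j = 0; j < d; j++) sums[best][j] += words[i][j];
        counts[best]++;
    }
    for (int c = 0; c < k; c++)
        if (counts[c] > 0)
            for (int j = 0; j < d; j++) sums[c][j] /= counts[c];   // new centroid = mean
        else
            sums[c] = centroids[c];                 // keep an empty centroid in place
    return sums;                                    // the recomputed centroids
}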
However, since the WordNet era, the way this similarity has been used is in finding lexical chains through the text.

A lexical chain was a set of words in the text that contributed to some strain of meaning in the overall discourse. The best of these lexical chains was selected as the overall topic of the text. Naturally a thesaurus performed better than WordNet in finding these chains; however, both were likely to pick false chains.
This approach was greatly improved by statistical analysis, which considered the text as a bag-of-words model in which words were represented as high-dimensional vectors.

We will continue this discussion, but first let us look at a coding exercise.
Given a stream of characters, find the first non-repeating character.
A solution is to maintain a doubly linked list which keeps track of the nonrepeating characters seen in order.
void printFirstNonRepeatingChar(Stream<char> s)
{
    const int MAX_CHAR = 256;
    // characters seen exactly once so far, in order of first appearance
    var nr = new LinkedList<char>();
    // present[c] holds c's node in nr, or null when c has not been seen (or has repeated)
    var present = new LinkedListNode<char>[MAX_CHAR];
    var dup = new bool[MAX_CHAR];
    var current = s.Read();
    while (current != EOF)
    {
        if (dup[current] == false)
        {
            if (present[current] == null)
            {
                present[current] = nr.AddLast(current);   // first occurrence
            }
            else
            {
                nr.Remove(present[current]);              // second occurrence: drop it
                present[current] = null;
                dup[current] = true;
            }
        }
        if (nr.Count > 0)
            Console.Write("First non-repeated char is {0}\n", nr.First.Value);
        current = s.Read();
    }
}


The above can also be modified to exit after first non-repeated character.

Also note that the state is toggled after the repetition.

If the number of unique characters is not restricted to ASCII, the above approach does not scale; consequently a hash table may be required in place of the fixed-size arrays.
# combination puzzles : https://1drv.ms/w/s!Ashlm-Nw-wnWlBdZZKfVTy5CTc4i 

Wednesday, September 28, 2016

Today we continue discussing metrics and their collection from services.
We observed that one way to keep metrics true is to collect them directly from the service, where the service itself makes a call to push the metric.
One example is that we can count the number of successes from the code by calling
StatsdClient.Metrics.Counter("myapi.post.Success");
Statsd is a client for the Graphite-Grafana stack and makes it easy to push metrics to Graphite so that they show up in charts in Grafana.
Statsd exposes the notion of timers as well.  A timer captures the duration of an operation and we can use it in our constructor and destructor to measure the lifetime of an operation and this can be used with the metrics.
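For instance, a small disposable wrapper (a sketch of mine, not part of the statsd client) can time an operation with a Stopwatch and report the elapsed milliseconds when it is disposed; ReportTimerMs below is a placeholder for whatever timer call the client actually exposes:
// Measures the lifetime of an operation and reports it as a timer metric on Dispose.
class MetricTimer : IDisposable
{
    private readonly string name;
    private readonly System.Diagnostics.Stopwatch watch = System.Diagnostics.Stopwatch.StartNew();

    public MetricTimer(string name) { this.name = name; }

    public void Dispose()
    {
        watch.Stop();
        ReportTimerMs(name, watch.ElapsedMilliseconds);   // placeholder: forward to the statsd client
    }

    private static void ReportTimerMs(string name, long ms)
    {
        Console.WriteLine("{0}: {1} ms", name, ms);       // stand-in for the real timer call
    }
}
// usage: using (new MetricTimer("myapi.post.duration")) { /* handle the request */ }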
Statsd batches up the metrics it pushes to Graphite. Therefore the statsd flush interval must be at least as long as Graphite's finest retention interval, otherwise Graphite will drop data points.
We must also be careful to turn legacyNamespace off in Graphite, otherwise the statsd measurements will show up in two or three different folders.
Statsd is not the only collector. There are statsd, collectd, bucky and logster. Statsd is the original and the most widely used of them all. Collectd is available in EPEL, albeit in older versions. Bucky connects the two: it listens to packets from both statsd and collectd and sends them to Graphite. Logster can tail a log file and send metrics back to Graphite.
Some metrics can be confusing. For example, count and rate are two independent metrics. Count measures the total number of times an event happened.  Rate measures how many times the event happened per unit of time.
Metrics accumulate over time. Retention rules dictate how long the metrics are held in the database files. Moreover, these files are fixed in size.
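For reference, Graphite's retention rules are declared in storage-schemas.conf; an illustrative entry (the section name, pattern and durations here are made up) might look like:
[api_metrics]
pattern = ^myapi\.
retentions = 10s:6h,1m:7d,10m:5y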
Convert a BST into an inorder singly linked list in place.
Node toList(Node root)
{
    if (root == null) return root;
    if (root.left != null)
    {
        // convert the left subtree and attach its rightmost node to the root
        var left = toList(root.left);
        for (; left.right != null; left = left.right);
        left.right = root;
    }
    if (root.right != null)
    {
        // convert the right subtree and attach the root to its leftmost node
        var right = toList(root.right);
        for (; right.left != null; right = right.left);
        root.right = right;
    }
    // the head of the inorder list is the leftmost node of the tree
    return root;
}


Convert a BST into a preorder singly linked list in place.
Node toPreList(Node root)
{
    if (root == null) return root;
    var temp = root.right;           // remember the right subtree before overwriting root.right
    Node left = null;
    if (root.left != null)
    {
        // the left subtree's preorder list follows the root
        left = toPreList(root.left);
        root.right = left;
        for (; left != null && left.right != null; left = left.right);
    }
    if (temp != null)
    {
        // the right subtree's preorder list follows the end of the left list (or the root)
        var right = toPreList(temp);
        if (left != null) left.right = right;
        else root.right = right;
    }
    return root;
}

Convert a BST into a postorder singly linked list in place.
Node toPostList(Node root)
{
    if (root == null) return root;
    Node temp = null;    // head of the combined post-order list
    Node left = null;    // tail of the left sublist
    Node right = null;   // head, then tail, of the right sublist
    if (root.left != null)
    {
        left = toPostList(root.left);
        temp = left;
        for (; left != null && left.right != null; left = left.right);
    }
    if (root.right != null)
    {
        right = toPostList(root.right);
        if (temp == null) temp = right;
        if (left != null) left.right = right;        // append the right sublist after the left one
        for (; right != null && right.right != null; right = right.right);
    }
    // the root comes last in post-order
    if (right != null) right.right = root;
    else if (left != null) left.right = root;
    root.right = null;
    if (temp == null) temp = root;
    return temp;
}

In the above methods, we have used a singly linked list. We can also make a doubly linked list by adding links in the reverse direction:
void ToDLL(Node head, Node prev)
{
    if (head == null) return;
    head.prev = prev;
    ToDLL(head.next, head);
}
Test cases:
NULL,
1,
1->2,
1->2->3

Tuesday, September 27, 2016

True Metrics
In yesterday's post we referred to getting data for metrics from logs and database states, because that information can be pulled without affecting the operations of the system. However, there are a few drawbacks to this approach.
First, most logs make their way to a time series database, also called a log index, and the log files themselves are hardly ever used directly to compute the data for the metrics.
Second, this flow of logs may be interrupted or simply missed. In some cases it goes undetected that the logs are not making their way to the log index.
Third, within the time series database, there is no ownership of logs and consequently events can be deleted.
Fourth, the same goes for duplication of events which tampers with the data for the metrics.
Fifth, the statistics computed from a time series database depend on the criteria used to perform the search, also called the query. This query may not always be accurate and correct for the metric being reported.
These vulnerabilities dilute the effectiveness of a metric and therefore lead not only to inaccurate reports but possibly to misleading charts.
Let's compare this with the data available from the database states.
The states of the services are internal and tightly controlled by the service.
They are transactional in nature and therefore can be counted as true representation of the corresponding events
They are up to date with the most recent activity of the service and hence provide information up to the nearest point in time.
Queries against the database are not only going to be accurate but likely used by the service itself when the service exports metrics directly.
The data is as authoritative as possible because the service itself relies on it.
However, the drawback is that not all information is available in the database particularly those for error conditions. In such cases, the log files are the best option.
The metrics data may be pulled frequently and this might interfere with the operations of the database or the service depending on who is providing the information.
Therefore a solution is to keep the log file and the time series database in sync by controlling the processes that flow the data between them. For example, a syslog drain to the time series database, set up once when the service is deployed, guarantees that the log events will flow to the index. Users of the index may be given read-only permissions.
Lastly, log indexes have a way of exporting data suitable for use with external charting tools. Some log indexes also provide rich out-of-the-box charting. For example, Splunk lets us find the number of calls made against each of the HTTP status codes with a nifty search command as follows:
search source=sourcetype earliest = -7d@d | timechart count by status
Since the 200 value is going to be the predominant value, it can be used as an overlay over the chart so that the other values are clearer, colored and easier to read.
If the status codes are not appearing as a separate field which is a common case, we can use the cluster command with something like
index=_internal source=sourcetype | cluster showcount=t  | table _time, cluster_count, _raw | timechart count by cluster_count
or we can specify rex "HTTP/1.1\" (?<status>\d+)"

#puzzle question
You have a 4x4 matrix with the first row occupied by the numbers 17, 99, 59 and 71.
Find an arrangement of the matrix such that the sum in every direction (row, column and diagonal) is equal.
Answer: Many arrangements would be possible if the occupied cells were scattered all over the matrix. With the first row fixed, we fill in the remaining rows so that no number repeats in any row, column or diagonal. Every row, column and diagonal then contains all four numbers and adds up to the same sum.
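For instance, with each number appearing exactly once in every row, column and diagonal, one such arrangement is:
17 99 59 71
59 71 17 99
71 59 99 17
99 17 71 59
Every row, column and both diagonals sum to 17 + 99 + 59 + 71 = 246.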

#codingexercise
Given a sorted array of numbers, a value k and a value x, find the k numbers nearest to x.

void printKNearest(List<int> A, int x, int k)
{
    int n = A.Count;
    // index of the crossover point: the last element less than or equal to x
    int left = GetTransitionPointByBinarySearch(A, 0, n - 1, x);
    int right = left + 1;
    int count = 0;
    if (left >= 0 && A[left] == x) left--;    // skip x itself
    // expand outward from the crossover, always taking the closer neighbor
    while (left >= 0 && right < n && count < k)
    {
        if (x - A[left] < A[right] - x)
        {
            Console.Write(A[left]);
            left--;
        }
        else
        {
            Console.Write(A[right]);
            right++;
        }
        count++;
    }
    while (count < k && left >= 0)
    {
        Console.Write(A[left]);
        left--;
        count++;
    }
    while (count < k && right <= n - 1)
    {
        Console.Write(A[right]);
        right++;
        count++;
    }
}


One of the key observations is that these metrics could be pushed directly from the APIs of the services.