Saturday, October 1, 2016

Many users access the website, each with an access pattern:
User 1 has access pattern: x->y->z->a->b->c->d->e->f
User 2 has access pattern: z->a->b->c->d
User 3 has access pattern: y->z->a->b->c->d
User 4 has access pattern: a->b->c->d
and so on for a finite, enumerable set of users.
Now the question is: can we determine the top 3 most frequently occurring k-page sequences?
For the above example (k=3), the result is: a->b->c, b->c->d, z->a->b.
List<string> GetFrequent(Dictionary<string, int> t, List<List<Pages>> patterns, int k, int n)
{
    for (int i = 0; i < patterns.Count; i++)
    {
        for (int j = 0; j + k <= patterns[i].Count; j++)
        {
            var pages = string.Join("->", patterns[i].GetRange(j, k));
            if (!t.ContainsKey(pages)) t[pages] = 0;
            t[pages]++;
        }
    }
    var kvps = t.ToList();
    kvps.Sort((pair1, pair2) => pair2.Value.CompareTo(pair1.Value)); // descending by count
    return kvps.Take(n).Select(pair => pair.Key).ToList();
}
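The same counting idea can be sketched in Python (the function name is mine, and the patterns are the example above); this is a quick way to sanity-check the expected top 3:

```python
from collections import Counter

def top_k_page_sequences(patterns, k, n):
    # count every contiguous k-page window across all access patterns
    counts = Counter()
    for pattern in patterns:
        for j in range(len(pattern) - k + 1):
            counts["->".join(pattern[j:j + k])] += 1
    # most_common returns the n highest-count sequences
    return [seq for seq, _ in counts.most_common(n)]

patterns = [list("xyzabcdef"), list("zabcd"), list("yzabcd"), list("abcd")]
print(top_k_page_sequences(patterns, 3, 3))  # → ['a->b->c', 'b->c->d', 'z->a->b']
```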

Given two integers k and n, write a function that prints all the sequences of length k composed of numbers 1,2..n.  These sequences are to be printed in sorted order.
Input: k = 2, n = 3

Output:
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3

void PrintSeqIterative(int n, int k)
{
    var A = Enumerable.Repeat(1, k).ToList(); // start with the smallest sequence 1 1 ... 1
    while (true)
    {
        A.ForEach(x => Console.Write(x + " "));
        Console.WriteLine();
        if (getSuccessor(A, k, n) == 0)
            break;
    }
}
int getSuccessor(List<int> A, int k, int n)
{
    // find the rightmost position that can still be incremented
    int p = k - 1;
    while (p >= 0 && A[p] == n) p--;
    if (p < 0)
        return 0;
    A[p] = A[p] + 1;
    for (int i = p + 1; i < k; i++)
        A[i] = 1;
    return 1;
}
This can be made recursive as well.
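A recursive version is a straightforward backtracking enumeration; here is a sketch in Python (returning the sequences rather than printing them):

```python
def all_sequences(n, k):
    # enumerate all k-length sequences over 1..n in sorted order
    result = []
    def recurse(prefix):
        if len(prefix) == k:
            result.append(prefix[:])
            return
        for v in range(1, n + 1):   # try each value at this position
            prefix.append(v)
            recurse(prefix)
            prefix.pop()            # backtrack
    recurse([])
    return result

print(all_sequences(3, 2))  # 9 sequences, from [1, 1] to [3, 3]
```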
 

Friday, September 30, 2016

Yesterday we were discussing the role of thesaurus in natural language processing.
The essence of the use of a thesaurus is to build, for every significant word of the text, a list of heads under which it appears, and to look for intersections between these lists. Candidates that are poor translations are replaced with words or phrases selected from the head that contains the most words from the initial translation. In addition, the thesaurus has been used for word sense disambiguation.
Using the thesaurus, we can also build lexical chains, which capture strong semantic relations between the words they are made up of. Morris and Hirst used them to calculate lexical cohesion. Stairman automated this process. Ellman used lexical chains to reconstruct the representation of a text's meaning from its content.
As an aside, WordNet helped tremendously with word sense disambiguation. It has also been used in different languages. Due to the limitations of the printed version of the thesaurus, many have turned to WordNet. But if the thesaurus were machine tractable and free, Jarmasz argued in his thesis, it would be better than WordNet. Further, the two can be combined.
It is not straightforward to use the lexical chain words as vector features because they would lose the cohesion in the chain. Similarly, pairwise word distances cannot be used because we don't have the relationships between the words to select them as candidates before looking them up in the thesaurus. However, k-means clustering directly applies to words when they are looked up in the thesaurus. Consequently, cluster centroids can be found and all the candidates can be tagged to the nearest cluster.
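The tagging step can be sketched as follows; the toy vectors and centroids here are made-up placeholders for whatever embedding the thesaurus lookup produces:

```python
def tag_to_nearest_centroid(word_vectors, centroids):
    # assign each word vector to the index of its nearest centroid (Euclidean)
    def dist2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(range(len(centroids)), key=lambda j: dist2(v, centroids[j]))
            for v in word_vectors]

# toy 2-D "word vectors"; real embeddings would come from the thesaurus lookup
vectors = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1)]
centroids = [(0.0, 0.0), (5.0, 5.0)]
print(tag_to_nearest_centroid(vectors, centroids))  # → [0, 0, 1]
```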
The main difference between the thesaurus and the bag of words model is that the former selects words from the thesaurus for semantic relationships while the latter selects the words based on collocation in contexts.
That said, we can construct vectors from semantically correlated words just as we would from the text.
Levy and Goldberg related word embeddings to Pointwise Mutual Information, and this is equally applicable to semantically correlated words.
While they chose random contexts in negative sampling, we can substitute lexical chains up to a finite number. Further, if we had a technique to use semantic cluster centroids, we could use them directly instead of lexical chains.

#codingexercise
For a given array a of size n, we create a set of a[i], a[a[i]], a[a[a[i]]], ... for each i from 0 to n-1; find the max size of such a set.
Let us take a two-element array with binary values 1 and 0 as first and second elements. Then we can create the sets
a[0], a[a[0]], a[a[a[0]]], ...
a[1], a[a[1]], a[a[a[1]]], ...
which are
1, 0, 1, 0, ...
0, 1, 0, 1, ...
void printNested(List<int> A, int i, int index, HashSet<int> visited)
{
    if (i < A.Count)
    {
        if (index < A.Count && !visited.Contains(index))
        {
            visited.Add(index);
            Console.Write(A[index]);
            printNested(A, i, A[index], visited);
        }
        else
        {
            Console.WriteLine();
            printNested(A, i + 1, i + 1, new HashSet<int>());
        }
    }
}

It may be interesting to note that, without a check for already-visited indices, the nested lookups can repeat as an infinite series; taken as a set, however, the values are finite.
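The original question asks for the maximum size of such a set over all starting points; here is a Python sketch that follows each chain until a value repeats (assuming every a[i] is a valid index):

```python
def max_chain_set_size(A):
    # for each start i, follow i -> A[i] collecting distinct values;
    # the chain must eventually cycle, so each set is finite
    best = 0
    for i in range(len(A)):
        seen = set()
        j = i
        while A[j] not in seen:
            seen.add(A[j])
            j = A[j]
        best = max(best, len(seen))
    return best

print(max_chain_set_size([1, 0]))  # → 2
```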

Thursday, September 29, 2016

Today we continue to discuss the relevance of the thesaurus in Natural Language Processing. Many algorithms have been proposed which determine latent similarity based on a word co-occurrence matrix. Yet it is natural to find similarity based on some ontology. WordNet filled this gap very well. Words were arranged in a predefined tree and their hierarchy gave an organization of the topic they belonged under. Furthermore, it gave similarity measures as distances between the words along the edges. The drawbacks were that not all words were found in WordNet, and the organization didn't really help with every context. Where it helped the most was in its association with parts of speech. A thesaurus improves the usefulness in determining similarity by providing all possible synonyms and thus exploring as much semantic connection as possible. In fact, many NLP algorithms have been described by their authors as significantly improved when there is such an online thesaurus. They stated that a thesaurus does better with semantic closeness than WordNet theoretically.
Such a thesaurus was not available in machine readable format.
Today we have learned to store graphs with trillions of edges. We can partition them on thousands of nodes using technologies such as Apache Giraph. With several thousand worker nodes, we can find relationships between words in parallelized summation forms. For example, if we were to perform k-means clustering of the words based on accepted similarities, the vertices could start as partitioned on different nodes but eventually become tagged with the cluster they belong to. The master would initialize the centroids and share it with local aggregators. These aggregators in turn would tag their vertices to the nearest centroid. The centroids would then be recomputed when the aggregators have shared their results with the master.
This summation form technique works very well with linear scalability to the size of graph.
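One round of this summation form can be sketched serially in Python; the nested lists stand in for per-worker partitions, the values are one-dimensional "vertices" for brevity, and a real Giraph job would run the inner loops in parallel:

```python
def kmeans_round(partitions, centroids):
    # each "worker" tags its vertices with the nearest centroid and
    # contributes partial (sum, count) pairs; the master combines them
    k = len(centroids)
    partial = [(0.0, 0)] * k
    for part in partitions:
        for x in part:
            c = min(range(k), key=lambda j: abs(x - centroids[j]))
            s, cnt = partial[c]
            partial[c] = (s + x, cnt + 1)
    return [s / cnt if cnt else centroids[j] for j, (s, cnt) in enumerate(partial)]

print(kmeans_round([[1, 2], [10, 11]], [0, 12]))  # → [1.5, 10.5]
```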
However, ever since the WordNet age, the chief way this similarity has been used is in finding lexical chains through the text.

A lexical chain is a set of words in the text that contributes to some strain of meaning in the overall discourse. The best of these lexical chains was selected as the overall topic of the text. Naturally, a thesaurus performed better than WordNet in finding these chains; however, both were likely to pick false chains.
This approach was greatly improved by statistical analysis, which treated the text as a bag-of-words model where words were represented as high-dimensional vectors.

We will continue this discussion, but first let us look into a coding exercise.
Given a stream of characters, find the first non-repeating character.
A solution is to maintain a doubly linked list which keeps track of the nonrepeating characters seen in order.
void printFirstNonRepeatingChar(Stream<char> s)
{
    const int MAX_CHAR = 256;
    var nr = new LinkedList<char>();                  // non-repeating chars in arrival order
    var present = new LinkedListNode<char>[MAX_CHAR]; // char -> its node in nr, if any
    var dup = new bool[MAX_CHAR];                     // char seen more than once
    var current = s.Read();
    while (current != EOF)
    {
        if (dup[current] == false)
        {
            if (present[current] == null)
            {
                present[current] = nr.AddLast(current); // first sighting
            }
            else
            {
                nr.Remove(present[current]);            // second sighting: no longer a candidate
                present[current] = null;
                dup[current] = true;
            }
        }
        if (nr.Count > 0)
            Console.Write("First non-repeated char is {0}", nr.First.Value);
        current = s.Read();
    }
}


The above can also be modified to exit after first non-repeated character.

Also note that the state is toggled after the repetition.

If the number of unique characters is not restricted to ASCII, the above approach does not scale; consequently, a hash table may be required.
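For the unrestricted alphabet, the same idea can be sketched in Python with an ordered dict standing in for the linked list (the function name is mine; it reports the answer after every character instead of printing):

```python
from collections import OrderedDict

def first_non_repeating(stream):
    # after each character, report the first character seen exactly once so far
    seen_once = OrderedDict()   # preserves arrival order of candidates
    repeated = set()
    out = []
    for ch in stream:
        if ch in repeated:
            pass                        # already disqualified
        elif ch in seen_once:
            del seen_once[ch]           # second sighting: toggle state
            repeated.add(ch)
        else:
            seen_once[ch] = True        # first sighting
        out.append(next(iter(seen_once)) if seen_once else None)
    return out

print(first_non_repeating("aabc"))  # → ['a', None, 'b', 'b']
```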
# combination puzzles : https://1drv.ms/w/s!Ashlm-Nw-wnWlBdZZKfVTy5CTc4i 

Wednesday, September 28, 2016

Today we continue discussing metrics and their collection from services.
We observed that one way to keep metrics true is to collect them directly from the service where the service makes a call to push the metric.
One example of this is that we can count the number of successes from the code by calling
StatsdClient.Metrics.Counter("myapi.post.Success");
Statsd is a client for the Graphite-Grafana stack and makes it easy to push metrics to Graphite so that they show up in the charts in Grafana.
Statsd exposes the notion of timers as well.  A timer captures the duration of an operation and we can use it in our constructor and destructor to measure the lifetime of an operation and this can be used with the metrics.
Statsd batches up the metrics it pushes to Graphite. Therefore the statsd flush interval must be at least as long as Graphite's highest-resolution retention interval; otherwise Graphite will drop data points.
And we must also be careful to turn legacyNamespace off in Graphite; otherwise the statsd measurements will show up in two or three different folders.
Statsd is not the only client. There are also collectd, bucky and logster. Statsd is the original client and the most used of them all. Collectd is available via EPEL on older distributions. Bucky connects the two because it listens to packets from both statsd and collectd and sends them to Graphite. Logster can tail your log file and send metrics back to Graphite.
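These clients ultimately speak the same plain-text StatsD wire format over UDP; a minimal Python sketch (the metric name and endpoint are illustrative, not from any real deployment):

```python
import socket

def statsd_packet(metric, value, kind):
    # kind is "c" for counters, "ms" for timers, "g" for gauges
    return f"{metric}:{value}|{kind}".encode()

def send(metric, value, kind, host="127.0.0.1", port=8125):
    # fire-and-forget: UDP does not block on a missing server
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(statsd_packet(metric, value, kind), (host, port))
    sock.close()

print(statsd_packet("myapi.post.Success", 1, "c"))  # → b'myapi.post.Success:1|c'
```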
Some metrics can be confusing. For example, count and rate are two independent metrics. Count measures the total number of times an event happened.  Rate measures how many times the event happened per unit of time.
Metrics accumulate over time. Retention rules dictate how long the metrics need to be held in the database files. Moreover, these files are fixed size.
Convert a BST into inorder singly linked list in place.
Node toList(Node root)
{
    if (root == null) return null;
    Node head = root;
    if (root.left != null)
    {
        head = toList(root.left);        // head of the inorder list is the leftmost node
        var tail = head;
        while (tail.right != null) tail = tail.right;
        tail.right = root;               // splice root after the left sublist
        root.left = null;
    }
    root.right = toList(root.right);     // the right sublist follows root
    return head;
}
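A Python sketch of the same in-place inorder flattening, with a small tree to check the result (the Node class and names are mine):

```python
class Node:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def to_inorder_list(root):
    # flatten a BST to a singly linked list (right pointers) in inorder
    if root is None:
        return None
    head = root
    if root.left:
        head = to_inorder_list(root.left)   # head is the leftmost node
        tail = head
        while tail.right:
            tail = tail.right
        tail.right = root                   # splice root after the left sublist
        root.left = None
    root.right = to_inorder_list(root.right)
    return head

# build the BST with root 2, children 1 and 3
root = Node(2, Node(1), Node(3))
head = to_inorder_list(root)
vals = []
while head:
    vals.append(head.val)
    head = head.right
print(vals)  # → [1, 2, 3]
```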


Convert a BST into preorder singly linked list in place.
Node toPreList(Node root)
{
    if (root == null) return null;
    var left = root.left;
    var right = root.right;
    root.left = null;
    root.right = toPreList(left);      // the left sublist follows root
    var tail = root;
    while (tail.right != null) tail = tail.right;
    tail.right = toPreList(right);     // the right sublist follows the left sublist
    return root;                       // in preorder, root is the head
}

Convert a BST into postorder singly linked list in place.
Node toPostList(Node root)
{
    if (root == null) return null;
    var left = toPostList(root.left);    // left sublist comes first
    var right = toPostList(root.right);  // then the right sublist
    root.left = null;
    root.right = null;                   // in postorder, root is the tail
    Node head = left ?? right ?? root;
    if (left != null)
    {
        var tail = left;
        while (tail.right != null) tail = tail.right;
        tail.right = right ?? root;      // append right sublist (or root) after left
    }
    if (right != null)
    {
        var tail = right;
        while (tail.right != null) tail = tail.right;
        tail.right = root;               // root goes last
    }
    return head;
}

In the above methods, we have used a singly linked list. We can also make a doubly linked list by adding links in the reverse direction:
void ToDLL(Node head, Node prev)
{
    if (head == null) return;
    head.prev = prev;
    ToDLL(head.next, head);
}
Test cases:
NULL,
1,
1->2,
1->2->3

Tuesday, September 27, 2016

True Metrics
In yesterday's post we referred to getting data for the metrics from logs and database states because the information is pulled without affecting the operations of the system. However there are  a few drawbacks to this approach.
First, most logs make their way to a time series database, also called a log index, and the log files are hardly ever used directly to compute the data for the metrics.
Second, this flow of logs may be interrupted or simply missed. In some cases it goes undetected that the logs are not making their way to a log index.
Third, within the time series database, there is no ownership of logs and consequently events can be deleted.
Fourth, the same goes for duplication of events which tampers with the data for the metrics.
Fifth, the statistics computed from a time series database depends on the criteria used to perform the search also called the query. This query may not always be accurate and correct for the metric to be reported.
These vulnerabilities dilute the effectiveness of a metric and therefore lead to not only inaccurate reports but possibly misleading charts.
Let's compare this with the data available from the database states.
The states of the services are internal and tightly controlled by the service.
They are transactional in nature and therefore can be counted as a true representation of the corresponding events.
They are up to date with the most recent activity of the service and hence provide information up to the nearest point of time.
Queries against the database are not only accurate but likely used by the service itself when the service exports metrics directly.
The data is as authoritative as possible because the service relies on it.
However, the drawback is that not all information is available in the database particularly those for error conditions. In such cases, the log files are the best option.
The metrics data may be pulled frequently and this might interfere with the operations of the database or the service depending on who is providing the information.
Therefore a solution is to keep the log file and the time series database in sync by controlling the processes that flow the data between them. For example, a syslog drain to the time series database that is setup once when the service is deployed guarantees that the log events will flow to the index. Users to the index may be given read only permissions.
Lastly, log indexes have a way of exporting data suitable for use with external charting tools. Some log indexes also provide rich out-of-box charting. For example, Splunk lets us find the number of calls made against each of the http status codes with a nifty search command as follows:
search source=sourcetype earliest=-7d@d | timechart count by status
Since the 200 value is going to be the predominant value, it can be used as an overlay over the chart so that the other values are clearer, colored and easier to read.
If the status codes are not appearing as a separate field which is a common case, we can use the cluster command with something like
index=_internal source=sourcetype | cluster showcount=t  | table _time, cluster_count, _raw | timechart count by cluster_count
or we can specify rex "HTTP/1.1\" (?<status>\d+)"

#puzzle question
You have a 4x4 matrix with the first row occupied by the numbers 17, 99, 59 and 71.
Find an arrangement of the matrix such that the sum in any direction is equal.
Answer: Many arrangements would be possible if the occupied cells were scattered all over the matrix. With the first row fixed, we can fill in the remaining rows so that each of the four numbers appears exactly once per row, once per column, and once on each diagonal. Every vertical, horizontal and diagonal line then adds to the same sum, 17 + 99 + 59 + 71 = 246.
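One such arrangement can be built from cyclic row permutations chosen so that every row, column and both diagonals contain all four numbers; a Python sketch (the particular permutation pattern is one of several that work):

```python
def magic_arrangement(nums):
    a, b, c, d = nums
    # each row, column and both diagonals contain {a, b, c, d} exactly once
    return [[a, b, c, d],
            [c, d, a, b],
            [d, c, b, a],
            [b, a, d, c]]

sq = magic_arrangement([17, 99, 59, 71])
target = 17 + 99 + 59 + 71  # 246
assert all(sum(row) == target for row in sq)
assert all(sum(sq[i][j] for i in range(4)) == target for j in range(4))
assert sum(sq[i][i] for i in range(4)) == target
assert sum(sq[i][3 - i] for i in range(4)) == target
```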

#codingexercise
Given a sorted array of numbers, a value k and a value x, find the k numbers nearest to the value x.

void printKNearest(List<int> A, int x, int k)
{
    int n = A.Count;
    // index of the last element <= x (binary search; helper assumed elsewhere)
    int left = GetTransitionPointByBinarySearch(A, 0, n - 1, x);
    int right = left + 1;
    int count = 0;
    if (left >= 0 && A[left] == x) left--;  // exclude x itself
    while (left >= 0 && right < n && count < k)
    {
        if (x - A[left] < A[right] - x)
        {
            Console.Write(A[left]);
            left--;
        }
        else
        {
            Console.Write(A[right]);
            right++;
        }
        count++;
    }
    while (count < k && left >= 0)
    {
        Console.Write(A[left]);
        left--;
        count++;
    }
    while (count < k && right <= n - 1)
    {
        Console.Write(A[right]);
        right++;
        count++;
    }
}
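The same two-pointer idea can be sketched in Python, using bisect for the transition point (the function name is mine):

```python
import bisect

def k_nearest(A, x, k):
    # return the k values in sorted list A closest to x (x itself excluded)
    right = bisect.bisect_right(A, x)
    left = right - 1
    if left >= 0 and A[left] == x:
        left -= 1
    out = []
    while len(out) < k and (left >= 0 or right < len(A)):
        # take the closer side; ties go to the right neighbor
        if right >= len(A) or (left >= 0 and x - A[left] < A[right] - x):
            out.append(A[left]); left -= 1
        else:
            out.append(A[right]); right += 1
    return out

print(k_nearest([1, 3, 6, 7, 8, 9], 7, 3))  # → [8, 6, 9]
```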


One of the key observations is that these metrics could be pushed directly from the API of the services.

Monday, September 26, 2016

Today we look at an interesting problem space of metrics reporting.
Almost all application owners want to have an understanding of their resource usage. These usage metrics may apply across clouds, per cloud, or even as granular as per service. For example, we may have a microservice that provisions a single resource for the application. It may have APIs for create-update-delete that return different status codes. A simple metric may be the success rate in terms of the total calls made. The success in this case is determined by the number of http success status codes sent across all the API calls. Since this kind of metric comes from individual services, the application owner may want to see these metrics across all services. This is best shown in a dashboard with drill-downs to different services.

It's important to note that metrics data is inherently time based. Although metrics can be spot measurements, cumulative over time or delta between time slices, they have an association with the time that they were measured.  Consequently, most metric data is stored in a time series database. Also, previous measurements can be summarized with statistics while the running measurements apply to only the current window of time. Therefore, some of these time-series databases can even be fix-sized.

There are many different tools available to present such charts from data sources. Almost all such reporting mechanisms want to pull data for their charts. However, they cannot directly call the services to query their metrics because this may affect the functionality and performance of those services. Therefore they do it in a passive manner by finding information from log trails and database states. Most services have logging configured and their logs are usually indexed in a time series database of their own. Consequently the pull information can come from these logs.

The Grafana stack is one such custom reporting solution. It presents beautiful dashboards with immense flexibility to add and configure different charts from several time series data sources. As long as the metrics data makes its way to the database configured with Grafana such as Graphite, InfluxDB, Prometheus, the reporting solution is a breeze.

Generally, this information flow into the reporting database needs to be automated. In the example cited above, the data would be pulled from the log index server and pushed into, say, Graphite. A statistics collection daemon usually works very well at periodically pulling information from wherever it is available. If the daemon has plugins to custom reporting data sources such as Graphite, or to full-service reporting solutions such as Nagios or Zabbix, then the data automatically makes its way to charts for the end user.

Since there are different technologies with overlapping functionalities and even more vendors for those technologies in any IT deployments, workforce often has to decide how the information flows. Sometimes this is straightforward in an out of box mechanism from a purchased or re-licensed software but other times, it requires automation to route the data from existing data sources that can neither be replaced because of their operational dependency nor can they be configured to directly send data to the charts. It's in this case where collection may need to be written.

Billing and Monitoring are two aspects of IT operations that often have dependencies on these very same metric data. Therefore, the software products for those operations may very well support reporting out of the box.

This goes to show that there are many ways to tackle the same problem for the metrics-to-charts data flow, but the rules of thumb of 1) reusing existing infrastructure and 2) keeping customization - be it automation or dashboards - as minimal as necessary, hold true in many contexts.

#codinginterview
In yesterday's post we discussed decoding digits as letters from a stream. It was batch processing. Here we optimize it:
void decodeStreamOptimized(Stream numbers, ref List<string> results)
{
    int last = -1; // no previous digit yet
    var current = numbers.Read();
    while (current != EOF)
    {
        var digits = (last == -1) ? new List<int> { current }
                                  : new List<int> { last, current };
        var s = new StringBuilder();
        decode(digits, digits.Count, ref s, ref results);
        // now separate levels between consecutive evaluations
        results.Add(null);
        last = current;
        current = numbers.Read();
    }
}
The result of each evaluation round is a single character, denoted z if the current digit is taken independently or z' if the last and current digits are taken together.
Consecutive results R1 and R2 are therefore stitched together as one of two binary selections:
R1zR2z or R2z'

Another way to optimize would be to take a large window and translate before moving to the next.

Sunday, September 25, 2016

We continue our discussion on statistical analysis in Natural Language Processing
Word2vec uses a continuous bag-of-words model that predicts a word based on the surrounding words, and a skip-gram model that predicts the surrounding words based on the current word. The skip-gram model expresses the conditional probability of a context c given a word w, with parameters theta, using the softmax function.
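One common way to write that softmax out explicitly (with $v_w$ the input vector of word $w$, $v'_c$ the output vector of context $c$, and $C$ the set of contexts) is:

```latex
p(c \mid w; \theta) = \frac{e^{v'_c \cdot v_w}}{\sum_{c' \in C} e^{v'_{c'} \cdot v_w}}
```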
The denominator in the softmax function is the sum of such exponentials over the entire vocabulary, where vw and vw' are the input and output vector representations of word w in vocabulary W. This is a heavily optimized expression, but the idea is that each word tries to predict every word around it. Further research by Levy and Goldberg tried to predict the word not based on surrounding words but based on context, where context is not just words but anything that contributes to context.
The denominator is also rather large. A summation over hundreds of thousands of contexts using the high-dimensional vectors for each input and output vector representation of word w in vocabulary W is too expensive to compute. While Word2Vec optimizes this with hierarchical softmax, Levy and Goldberg state their assumption that selecting a context for a word such that it is better than, say, the expected value of all others results in good word embeddings. This follows from the belief that the improvement comes because similar words have similar vectors. Intuitively, this gives us the notion of a thesaurus. People use different words to mean the same thing, and their purports often manifest similarity in the tightness between the word and its context in the sampling window.
#codingquestion
Given a stream of digits, evaluate the possible decodings, excluding leading, trailing and consecutive zeros.
For example, 121 => ABA, AU, LA
if A = 1, B = 2, ..., Z = 26.

void decode(List<int> digits, int n, ref StringBuilder s, ref List<string> results)
{
    if (n == 0)
    {
        // s was built from the end of the digits, so reverse it
        results.Add(new string(s.ToString().Reverse().ToArray()));
        return;
    }
    if (digits[n-1] > 0)
    {
        s.Append(ToLetter(digits, n-1, 1)); // take one digit
        decode(digits, n-1, ref s, ref results);
        s.Length -= 1; // backtrack
    }
    if (n > 1 && (digits[n-2] == 1 || (digits[n-2] == 2 && digits[n-1] < 7)))
    {
        s.Append(ToLetter(digits, n-2, 2)); // take two digits as one letter
        decode(digits, n-2, ref s, ref results);
        s.Length -= 1; // backtrack
    }
}
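A compact Python cross-check of the recursion on the example above (the letter-mapping helper is my own construction):

```python
# map "1".."26" to "A".."Z"
LETTERS = {str(i): chr(ord('A') + i - 1) for i in range(1, 27)}

def decodings(digits):
    # all decodings of a digit string where 1->A ... 26->Z
    if not digits:
        return [""]
    out = []
    if digits[0] != "0":                       # take one digit
        out += [LETTERS[digits[0]] + rest for rest in decodings(digits[1:])]
    if len(digits) >= 2 and 10 <= int(digits[:2]) <= 26:   # take two digits
        out += [LETTERS[digits[:2]] + rest for rest in decodings(digits[2:])]
    return out

print(sorted(decodings("121")))  # → ['ABA', 'AU', 'LA']
```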

void decodeStream(Stream numbers, ref List<string> results)
{
    var digits = new List<int>();
    var current = numbers.Read();
    while (current != EOF)
    {
        digits.Add(current.ToInteger());
        var s = new StringBuilder();
        decode(digits, digits.Count, ref s, ref results);
        current = numbers.Read();
    }
}

