Friday, September 30, 2016

Yesterday we were discussing the role of a thesaurus in natural language processing.
The essence of using a thesaurus is to build, for every significant word of the text, a list of the heads under which it appears, and to look for intersections between these lists. Candidates that are poor translations are replaced with words or phrases selected from the head that contains the most words from the initial translation. In addition, the thesaurus has been used for word sense disambiguation.
Using the thesaurus, we can also build lexical chains, which are sequences of words held together by strong semantic relations. Morris and Hirst used them to calculate lexical cohesion, Stairman automated the process, and Ellman used lexical chains to reconstruct the representation of a text's meaning from its content.
As an aside, WordNet helped tremendously with word sense disambiguation, and it has also been used for different languages. Due to the limitations of the printed version of the thesaurus, many have turned to WordNet. But if the thesaurus were machine tractable and free, Jarmasz argued in his thesis, it would be better than WordNet. Further, the two can be combined.
It is not straightforward to use the lexical chain words as vector features because they would lose the cohesion in the chain. Similarly, pairwise word distances cannot be used because we do not have the relationships between the words to select them as candidates before looking them up in the thesaurus. However, k-means clustering applies directly to words once they are looked up in the thesaurus: cluster centroids can be found, and every candidate can then be tagged to the nearest cluster.
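As a sketch of that last step (the word-vector representation and the pre-computed centroids are assumptions here, not something the thesaurus gives us directly), tagging each candidate to its nearest cluster centroid by cosine similarity might look like:

// Assigns a candidate word vector to the nearest k-means centroid by cosine similarity.
int NearestCentroid(double[] word, List<double[]> centroids)
{
    int best = -1;
    double bestSim = double.MinValue;
    for (int i = 0; i < centroids.Count; i++)
    {
        double dot = 0, nw = 0, nc = 0;
        for (int d = 0; d < word.Length; d++)
        {
            dot += word[d] * centroids[i][d];
            nw += word[d] * word[d];
            nc += centroids[i][d] * centroids[i][d];
        }
        double sim = dot / (Math.Sqrt(nw) * Math.Sqrt(nc) + 1e-12);  // cosine similarity
        if (sim > bestSim) { bestSim = sim; best = i; }
    }
    return best;
}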
The main difference between the thesaurus and the bag-of-words model is that the former selects words from the thesaurus for their semantic relationships while the latter selects words based on collocation in contexts.
That said, we can construct vectors from semantically correlated words just as we would from the text.
Levy and Goldberg made heavy use of Pointwise Mutual Information, and it is equally applicable to semantically correlated words.
While they chose random contexts in negative sampling, we can substitute lexical chains up to a finite number. Further, if we had a technique to use semantic cluster centroids, we could use them directly instead of lexical chains.
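A minimal sketch of that substitution (the lexical chain lookup and the cap k are assumptions): instead of drawing k random contexts for a word, draw up to k members of lexical chains the word does not belong to.

// Picks up to k negative contexts for a word from lexical chains the word is not part of.
List<string> NegativeContextsFromChains(string word, List<List<string>> lexicalChains, int k)
{
    var negatives = new List<string>();
    foreach (var chain in lexicalChains)
    {
        if (chain.Contains(word)) continue;          // skip chains the word participates in
        foreach (var candidate in chain)
        {
            if (negatives.Count == k) return negatives;
            negatives.Add(candidate);
        }
    }
    return negatives;
}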

#codingexercise
For a given array a of size n, we create a set of a[i], a[a[i]], a[a[a[i]]], ... where i varies from 0 to n-1. Find the maximum size of such a set.
Let us take an array with two elements, the binary values 1 and 0 as the first and second elements. Then we can create the sets
a[0] , a[a[0]] , a[a[a[0]]] …..
a[1] , a[a[1]] , a[a[a[1]]] …..
Which are
1, 0,1, 0, ...
0, 1, 0, 1, ...
void printNested(List<int> A, int i, int index)
{
    if (i < A.Count)
    {
        if (index < A.Count)
        {
            Console.Write(A[index]);
            // note: this does not terminate when the indices cycle, as in the example above
            printNested(A, i, A[index]);
        }
        else
        {
            printNested(A, i + 1, i + 1);
        }
    }
}

It may be interesting to note that the series can be infinite when the indices cycle, as in the example above.
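For reference, a minimal sketch of the usual way to compute the maximum set size, assuming the array is a permutation of 0..n-1 so that every chain eventually cycles; each element is visited only once:

// Returns the size of the largest set a[i], a[a[i]], a[a[a[i]]], ... over all starting indices.
int MaxNestedSetSize(List<int> A)
{
    int n = A.Count;
    var visited = new bool[n];
    int best = 0;
    for (int i = 0; i < n; i++)
    {
        int size = 0;
        for (int j = i; !visited[j]; j = A[j])       // follow the chain until it closes
        {
            visited[j] = true;
            size++;
        }
        best = Math.Max(best, size);
    }
    return best;
}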

Thursday, September 29, 2016

Today we continue to discuss the relevance of a thesaurus in natural language processing. Many algorithms have been proposed that determine latent similarity based on a word co-occurrence matrix. Yet it is natural to find similarity based on some ontology. WordNet filled this gap very well: words were arranged in a predefined tree and their hierarchy gave an organization of the topics they belonged under. Furthermore, it gave similarity measures as distances between the words along the edges. The drawbacks were that not all words were found in WordNet and the organization didn't really help with every context; where it helped the most was in its association with parts of speech. A thesaurus improves the usefulness in determining similarity by providing all possible synonyms and thus exploring as many semantic connections as possible. In fact, the authors of many NLP algorithms have mentioned that their algorithms would be significantly improved if such an online thesaurus existed. They stated that a thesaurus does better with semantic closeness than WordNet, theoretically.
Such a thesaurus was not available in machine readable format.
Today we have learned to store graphs with trillions of edges. We can partition them over thousands of nodes using technologies such as Apache Giraph. With several thousand worker nodes, we can find relationships between words in parallelized summation forms. For example, if we were to perform k-means clustering of the words based on accepted similarities, the vertices could start out partitioned on different nodes but eventually become tagged with the cluster they belong to. The master would initialize the centroids and share them with local aggregators. These aggregators in turn would tag their vertices to the nearest centroid. The centroids would then be recomputed once the aggregators have shared their results with the master.
This summation-form technique works very well, with linear scalability in the size of the graph.
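A rough sketch of one such iteration in summation form (the partitioning, the word vectors and the aggregator plumbing are assumptions here; real Giraph code would go through its Computation and Aggregator APIs): each worker tags its partition of word vectors to the nearest centroid and emits partial sums, and the master adds the partial sums from all workers and divides by the counts to get the next centroids.

// One worker's share of a k-means iteration: partial sums and counts per centroid.
(double[][] sums, int[] counts) WorkerAggregate(List<double[]> partition, double[][] centroids)
{
    int k = centroids.Length, dim = centroids[0].Length;
    var sums = new double[k][];
    for (int c = 0; c < k; c++) sums[c] = new double[dim];
    var counts = new int[k];
    foreach (var v in partition)
    {
        int nearest = 0;
        double bestDist = double.MaxValue;
        for (int c = 0; c < k; c++)
        {
            double dist = 0;
            for (int d = 0; d < dim; d++) dist += (v[d] - centroids[c][d]) * (v[d] - centroids[c][d]);
            if (dist < bestDist) { bestDist = dist; nearest = c; }
        }
        counts[nearest]++;
        for (int d = 0; d < dim; d++) sums[nearest][d] += v[d];   // partial sums are cheap to merge
    }
    return (sums, counts);
}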
However, since the WordNet age, the way this similarity has been used is in finding lexical chains through the text.

A lexical chain is a set of words in the text that contribute to some strain of meaning in the overall discourse. The best of these lexical chains is selected as the overall topic of the text. Naturally, a thesaurus performed better than WordNet in finding these chains; however, both were likely to pick false chains.
This approach was greatly improved by statistical analysis, which considered the text as a bag-of-words model in which words are represented as high-dimensional vectors.

We will continue this discussion, but first let us look at a coding exercise.
Given a stream of characters, find the first non-repeating character.
A solution is to maintain a doubly linked list which keeps track of the non-repeating characters seen, in order.
void printFirstNonRepeatingChar(Stream<char> s)
{
    const int MaxChar = 256;
    var nr = new LinkedList<char>();                    // non-repeating characters seen, in order
    var present = new LinkedListNode<char>[MaxChar];    // node in nr for a character seen exactly once
    var dup = new bool[MaxChar];                        // true once a character has repeated
    var current = s.Read();
    while (current != EOF)
    {
        if (dup[current] == false)
        {
            if (present[current] == null)
            {
                present[current] = nr.AddLast(current); // first occurrence
            }
            else
            {
                nr.Remove(present[current]);            // second occurrence: no longer a candidate
                present[current] = null;
                dup[current] = true;
            }
        }
        if (nr.Count > 0)
            Console.Write("First non-repeated char is {0}\n", nr.First.Value);
        current = s.Read();
    }
}


The above can also be modified to exit after first non-repeated character.

Also note that the state is toggled after the repetition.

If the number of unique characters is not restricted to ASCII, the above approach does not scale; consequently, a hash table may be required.
# combination puzzles : https://1drv.ms/w/s!Ashlm-Nw-wnWlBdZZKfVTy5CTc4i 

Wednesday, September 28, 2016

Today we continue discussing metrics and their collection from services.
We observed that one way to keep metrics true is to collect them directly from the service where the service makes a call to push the metric.
One example of this is that we can count the number of successes from the code by calling
StatsdClient.Metrics.Counter("myapi.post.Success");
Statsd is a client for the Graphite-Grafana stack and makes it easy to push metrics to Graphite so that they show up in the charts in Grafana.
Statsd exposes the notion of timers as well. A timer captures the duration of an operation; we can start it in a constructor and stop it in a destructor to measure the lifetime of an operation, and report that as a metric.
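A small sketch of that constructor/destructor idea (the Metrics.Timer call and its signature are an assumption here, not necessarily the actual StatsdClient API): wrap the operation in a disposable scope that reports the elapsed milliseconds when it is disposed.

// Measures the lifetime of an operation and reports it as a timer metric on dispose.
public sealed class OperationTimer : IDisposable
{
    private readonly string _name;
    private readonly System.Diagnostics.Stopwatch _watch = System.Diagnostics.Stopwatch.StartNew();

    public OperationTimer(string name) { _name = name; }

    public void Dispose()
    {
        _watch.Stop();
        // hypothetical call; substitute the timer method the statsd client actually exposes
        StatsdClient.Metrics.Timer(_name, (int)_watch.ElapsedMilliseconds);
    }
}

// usage: using (new OperationTimer("myapi.post.duration")) { /* the operation being measured */ }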
Statsd batches up the metrics it pushes to Graphite. Therefore the statsd flush interval must be at least as long as Graphite's finest retention interval, otherwise Graphite will drop data points.
And we must also be careful to turn legacyNamespace off in Graphite, otherwise the statsd measurements will show up in two or three different folders.
Statsd is not the only client; there are also collectd, bucky and logster. Statsd is the original and the most used of them all. Collectd is available, in older versions, from EPEL. Bucky connects the two because it listens for packets from both statsd and collectd and sends them to Graphite. Logster can tail a log file and send metrics back to Graphite.
Some metrics can be confusing. For example, count and rate are two independent metrics. Count measures the total number of times an event happened.  Rate measures how many times the event happened per unit of time.
Metrics accumulate over time. Retention rules dictate how long the metrics are held in the database files; moreover, these files are fixed-size.
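For illustration (the intervals here are arbitrary), Graphite declares these retention rules in storage-schemas.conf, and the first, finest interval is the one the statsd flush interval must not undercut:

[statsd]
pattern = ^stats.*
retentions = 10s:6h,1m:7d,10m:1y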
Convert a BST into inorder singly linked list in place.
Node toList(Node root)
{
    if (root == null) return root;
    if (root.left != null)
    {
        var left = toList(root.left);
        // walk to the rightmost node of the left list and attach root after it
        for ( ; left.right != null; left = left.right);
        left.right = root;
    }
    if (root.right != null)
    {
        var right = toList(root.right);
        // the head of the right list is its leftmost node
        for ( ; right.left != null; right = right.left);
        root.right = right;
    }
    return root;
}


Convert a BST into preorder singly linked list in place.
Node toPreList(Node root)
{
    if (root == null) return root;
    var temp = root.right;             // remember the right subtree before overwriting the pointer
    Node tail = root;                  // tail of the list built so far
    if (root.left != null)
    {
        var left = toPreList(root.left);
        root.right = left;             // root is followed by the left subtree's preorder list
        for (tail = left; tail.right != null; tail = tail.right);
    }
    if (temp != null)
    {
        var right = toPreList(temp);
        tail.right = right;            // then by the right subtree's preorder list
    }
    return root;
}

Convert a BST into postorder singly linked list in place.
Node toPostList(Node root)
{
    if (root == null) return root;
    Node head = null;                  // head of the postorder list
    Node tail = null;                  // tail of the list built so far
    if (root.left != null)
    {
        head = toPostList(root.left);
        for (tail = head; tail.right != null; tail = tail.right);
    }
    if (root.right != null)
    {
        var right = toPostList(root.right);
        if (head == null) head = right; else tail.right = right;
        for (tail = right; tail.right != null; tail = tail.right);
    }
    if (head == null) head = root; else tail.right = root;
    root.right = null;                 // root comes last in postorder
    return head;
}

In the above methods, we have used a singly linked list. We can also make a doubly linked list by adding links in the reverse direction:
void ToDLL(Node head, Node prev)
{
    if (head == null) return;
    head.prev = prev;            // back link to the previous node
    ToDLL(head.next, head);      // the current node is the previous node for the rest of the list
}
Test cases:
NULL,
1,
1->2,
1->2->3

Tuesday, September 27, 2016

True Metrics
In yesterday's post we referred to getting data for the metrics from logs and database states because the information is pulled without affecting the operations of the system. However, there are a few drawbacks to this approach.
First, most logs make their way to a time series database, also called a log index, and the log files are hardly used directly to compute the data for the metrics.
Second, this flow of logs may be interrupted or just plainly missed. In some cases, it goes undetected that the logs are not making their way to the log index.
Third, within the time series database, there is no ownership of the logs, and consequently events can be deleted.
Fourth, the same goes for duplication of events, which tampers with the data for the metrics.
Fifth, the statistics computed from a time series database depend on the criteria used to perform the search, also called the query. This query may not always be accurate and correct for the metric being reported.
These vulnerabilities dilute the effectiveness of a metric and therefore lead to not only inaccurate reports but possibly misleading charts.
Let's compare this with the data available from the database states.
The states of the services are internal and tightly controlled by the service.
They are transactional in nature and therefore can be counted as a true representation of the corresponding events.
They are up to date with the most recent activity of the service and hence provide information up to the nearest point in time.
Queries against the database are not only going to be accurate but likely used by the service itself when the service exports metrics directly.
The data is as authoritative as possible because the service itself relies on it.
However, the drawback is that not all information is available in the database particularly those for error conditions. In such cases, the log files are the best option.
The metrics data may be pulled frequently and this might interfere with the operations of the database or the service depending on who is providing the information.
Therefore a solution is to keep the log file and the time series database in sync by controlling the processes that flow the data between them. For example, a syslog drain to the time series database that is set up once when the service is deployed guarantees that the log events will flow to the index. Users of the index may be given read-only permissions.
Lastly, log indexes have a way of exporting data suitable for use with external charting tools. Some log indexes also provide rich out-of-the-box charting. For example, Splunk lets us find the number of calls made against each of the HTTP status codes with a nifty search command as follows:
search source=sourcetype earliest = -7d@d | timechart count by status
Since the 200 value is going to be the predominant value, it can be used as an overlay over the chart so that the other values are clearer, colored and easier to read.
If the status codes are not appearing as a separate field, which is a common case, we can use the cluster command with something like
index=_internal source=sourcetype | cluster showcount=t  | table _time, cluster_count, _raw | timechart count by cluster_count
or we can specify rex "HTTP/1.1\" (?<status>\d+)"

#puzzle question
You have a 4x4 matrix with the first row occupied by the numbers 17, 99, 59 and 71.
Find an arrangement of the matrix such that the sums in every direction are equal.
Answer: There can be many answers possible if the occupied cells were allowed to be anywhere in the matrix. With the first row fixed, we can fill in the remaining rows so that no number repeats in any row, column or diagonal. Then every row, column and diagonal adds up to the same sum, 17 + 99 + 59 + 71 = 246.
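For example, one arrangement that works, where every row, column and both diagonals add up to 246:

17  99  59  71
59  71  17  99
71  59  99  17
99  17  71  59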

#codingexercise
Given a sorted array of numbers, a value k and a value x, find the k numbers nearest to x.

void printKNearest(List<int> A, int x, int k)
{
    int n = A.Count;
    // index of the crossover point: the largest element less than or equal to x
    int left = GetTransitionPointByBinarySearch(A, 0, n - 1, x);
    int right = left + 1;
    int count = 0;
    if (left >= 0 && A[left] == x) left--;   // exclude x itself
    while (left >= 0 && right < n && count < k)
    {
        if (x - A[left] < A[right] - x)
        {
            Console.Write(A[left]);
            left--;
        }
        else
        {
            Console.Write(A[right]);
            right++;
        }
        count++;
    }
    while (count < k && left >= 0)
    {
        Console.Write(A[left]);
        left--;
        count++;
    }
    while (count < k && right <= n - 1)
    {
        Console.Write(A[right]);
        right++;
        count++;
    }
}


One of the key observations is that these metrics could be pushed directly from the APIs of the services.

Monday, September 26, 2016

Today we look at an interesting problem space of metrics reporting.
Almost all application owners want to have an understanding of their resource usage. These resource usage metrics may apply to different clouds, per cloud, or even be as granular as per service. For example, we may have a microservice that provisions a single resource for the application. It may have APIs for create-update-delete that return different status codes. A simple metric may be to find the success rate in terms of the total calls made. The success in this case is determined by the number of HTTP success status codes sent across all the API calls. Since this kind of metric comes from individual services, the application owner may want to see these metrics across all services. This is best shown in a dashboard with drill-downs to different services.

It's important to note that metrics data is inherently time based. Although metrics can be spot measurements, cumulative over time, or deltas between time slices, they have an association with the time at which they were measured. Consequently, most metric data is stored in a time series database. Also, previous measurements can be summarized with statistics while the running measurements apply only to the current window of time. Therefore, some of these time-series databases can even be fixed-size.

There are many different tools available to present such charts from data sources. Almost all such reporting mechanisms want to pull data for their charts. However, they cannot directly call the services to query their metrics because this may affect the functionality and performance of those services. Therefore they do it in a passive manner by finding information from log trails and database states. Most services have logging configured and their logs are usually indexed in a time series database of their own. Consequently the pull information can come from these logs.

The Grafana stack is one such custom reporting solution. It presents beautiful dashboards with immense flexibility to add and configure different charts from several time series data sources. As long as the metrics data makes its way to the database configured with Grafana such as Graphite, InfluxDB, Prometheus, the reporting solution is a breeze.

Generally, this information flow into the reporting database needs to be automated. In the example cited above, data would be pulled from the log index server and pushed into, say, Graphite. A statistics collection daemon usually works very well for periodically pulling information from wherever it is available. If the daemon has plugins for custom reporting data sources such as Graphite, or for full-service reporting solutions such as Nagios or Zabbix, then the data automatically makes its way to charts for the end user.

Since there are different technologies with overlapping functionalities and even more vendors for those technologies in any IT deployment, the workforce often has to decide how the information flows. Sometimes this is straightforward with an out-of-the-box mechanism from purchased or re-licensed software, but other times it requires automation to route the data from existing data sources that can neither be replaced, because of their operational dependency, nor be configured to send data directly to the charts. It is in this case that the collection may need to be written.

Billing and Monitoring are two aspects of IT operations that often have dependencies on these very same metric data. Therefore, the software products for those operations may very well support reporting out of the box.

This goes to show that there are many ways to tackle the same problem of metrics-to-charts data flow, but the rules of thumb of 1) reusing existing infrastructure and 2) keeping customization, be it automation or dashboards, as minimal as necessary, hold true in many contexts.

#codinginterview
In yesterday's post we discussed decoding digits as letters from a stream. It was batch processing. Here we optimize it:
void decodeStreamOptimized(Stream numbers, ref List<string> results)
{
    var last = int.MaxValue;                  // sentinel: no previous digit yet
    var current = numbers.Read();
    while (current != EOF)
    {
        // decode only the window of the previous and the current digit
        var digits = (last == int.MaxValue)
            ? new List<int>() { current }
            : new List<int>() { last, current };
        var s = new StringBuilder();
        decode(digits, digits.Count, ref s, ref results);
        // a null entry separates consecutive evaluation rounds
        results.Add(null);
        last = current;
        current = numbers.Read();
    }
}
The result in each evaluation round is a single character, denoted by z if the current digit is taken independently or z' if the last and current digits are taken together.
Consecutive results R1 and R2 are therefore stitched together as one of two binary selections:
R1zR2z or R2z'

Another way to optimize would be to take a large window and translate before moving to the next.

Sunday, September 25, 2016

We continue our discussion on statistical analysis in Natural Language Processing
Word2vec uses the continuous bag-of-words model, which predicts a word based on the surrounding words, and the skip-gram model, which predicts the surrounding words based on the current word. The skip-gram model gives the conditional probability of a context c given a word w, with parameters theta, using the softmax function.
The denominator in the softmax function is the sum of such exponentials taken over the entire vocabulary, where v_w and v'_w are the input and output vector representations of word w in the vocabulary W. This is a heavily optimized expression, but the idea is that each word tries to predict every word around it. Further research by Levy and Goldberg tried to predict the word not based on surrounding words alone but based on context, where context is not just words but anything that contributes to context.
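Written out, the skip-gram softmax described above is (with w_I the input word and w_O the predicted context word):

p(w_O \mid w_I) = \frac{\exp({v'_{w_O}}^{\top} v_{w_I})}{\sum_{w=1}^{|W|} \exp({v'_{w}}^{\top} v_{w_I})}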
The denominator is also rather large. A summation over hundreds of thousands of contexts, using the high-dimensional input and output vector representations of each word w in the vocabulary W, is too expensive to compute. While word2vec optimizes this with hierarchical softmax, the stated assumption is that selecting the context for a word so that it does better than, say, the expected value of all the others results in good word embeddings. This follows from the belief that the improvement comes because similar words have similar vectors. Intuitively, this gives us the notion of a thesaurus: people use different words to mean the same thing, and their purports often manifest as similarity in the tightness between the word and its context in the sampling window.
#codingquestion
Given a stream of digits, evaluate the possible decodings if leading, trailing and consecutive zeros are excluded
For example, 121 => ABA , AU, LA
if A= 1, B =2, .. Z = 26

void decode(List<int> digits, int n, ref StringBuilder s, ref List<string> results)
{
    if (n == 0) { Console.Write(s); results.Add(s.Reverse()); return; }
    if (n == 1) { s.Append(ToLetter(digits, n-1, 1)); Console.Write(s); results.Add(s.Reverse()); s.RemoveLast(); return; }
    if (digits[n-1] > 0) {
        // take the last digit by itself
        s.Append(ToLetter(digits, n-1, 1));
        decode(digits, n-1, ref s, ref results);
        s.RemoveLast();
    }
    if (digits[n-2] == 1 || (digits[n-2] == 2 && digits[n-1] < 7)) {
        // take the last two digits together (10 through 26)
        s.Append(ToLetter(digits, n-2, 2));
        decode(digits, n-2, ref s, ref results);
        s.RemoveLast();
    }
}

void decodeStream(Stream numbers, ref List<string> results)
{
    var digits = new List<int>();
    var current = numbers.Read();
    while (current != EOF)
    {
        digits.Add(current.ToInteger());
        var s = new StringBuilder();
        decode(digits, digits.Count, ref s, ref results);
        current = numbers.Read();
    }
}


Convert a BST into inorder singly linked list in place.
Node toList(Node root)
{
    if (root == null) return root;
    if (root.left != null)
    {
        var left = toList(root.left);
        // walk to the rightmost node of the left list and attach root after it
        for ( ; left.right != null; left = left.right);
        left.right = root;
    }
    if (root.right != null)
    {
        var right = toList(root.right);
        // the head of the right list is its leftmost node
        for ( ; right.left != null; right = right.left);
        root.right = right;
    }
    return root;
}

Saturday, September 24, 2016

We continue our discussion on statistical analysis in Natural Language Processing
Levy and Goldberg's paper introduced a departure from word2vec in how words and contexts are chosen. They used negative sampling, which is the reverse of sampling contexts for words. In the latter case, we were picking, say, a handful of generalizations that we argued were contexts and mapped them to the word in the word-context matrix. Now, for the same number of contexts, we choose others at random that we know are not contexts. The assumption is that the randomly chosen contexts will be the unobserved ones. This means that the negative and the positive samples are selected so that they are as disjoint as possible. Furthermore, the quality of the positive selections improves as their probabilities are higher.

The purpose of negative sampling is to avoid the degenerate case in which the word-context pairs all come from the text and consequently all end up with the same values for their vectors, which is not very helpful. By introducing random word-context pairs we now have a mix. This has the side effect that the word-context pairs chosen as positive samples will indeed be more representative of the text.
But where are these random word-context pairs chosen from? Do they come from the whole corpus or from within the text? Arguably, if a pair comes from outside the sampling window, it is sufficient.
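For reference, the skip-gram negative-sampling objective for one observed pair (w, c) with k negative contexts c_1 ... c_k drawn from a noise distribution P_n (typically a smoothed unigram distribution over the whole corpus, i.e. drawn from outside the current window) is commonly written as:

\log \sigma({v'_{c}}^{\top} v_{w}) + \sum_{i=1}^{k} \mathbb{E}_{c_i \sim P_n}\left[ \log \sigma(-{v'_{c_i}}^{\top} v_{w}) \right]

so the observed pair is pushed toward high similarity while the sampled pairs are pushed toward low similarity.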

Find the number of three edge cycles in an undirected graph.
We use depth first search to traverse the graph and generate paths.
DFS(V, E, paths)
  for each vertex v in V
      v.color = white
      v.d = nil
  time = 0
  for each vertex v in V:
      if v.color == white:
          path = new Path()
          DFS-VISIT(V, E, v, path, ref paths)

DFS-VISIT(V, E, u, path, ref paths)
  time = time + 1
  u.d = time
  u.color = gray
  path.Add(u)
  for each vertex v adjacent to u
      if v.color == white
          DFS-VISIT(V, E, v, path, ref paths)
      else if v.color == gray          // back edge (not to u's parent): the path segment from v to u is a cycle
          paths.Add(path.From(v))
  path.Remove(u)
  u.color = black
  time = time + 1
  u.f = time

filter(paths):
  var d = new Dictionary<string, int>()
  foreach path in paths:
      if path.Count == 3:                        // a three-edge cycle has exactly three vertices
          pathidentifier = stringify(path)       // canonical form, e.g. the sorted vertex ids
          if d.ContainsKey(pathidentifier) == false:
              d.Add(pathidentifier, 1)
  return d.Count

This enumeration does not work the same way with breadth-first traversal because each vertex can participate as the starting point of a path only in a depth-first traversal. That said, we can use either breadth-first or depth-first traversal to traverse the graph and enumerate paths. Finding cycles from the enumerated paths is easy because we look for a start and a return to the same vertex across all the paths.

Friday, September 23, 2016

We continue our discussion on Statistical framework in natural language processing.
The mutual information of two random variables is a measure of the mutual dependence between them. When the two random variables are taken together, we measure their joint distribution; otherwise we take them independently, which is termed the marginal distribution. The pointwise measure for a pair of outcomes is log(p(x,y) / (p(x)p(y))). If they are independent, the numerator and denominator are the same and the value is zero.
Pointwise Mutual information or PMI refers to single events, whereas MI refers to the average of all possible events. MI is the expected value of PMI.
In natural language processing this is useful because we know that words and contexts are dependent on each other. Therefore we can take the words as one of the marginal distributions and their contexts as the other. This then gives us a matrix of PMIs, one for each cell in the word-context matrix.
This is a pretty large matrix and a sparse one, because words and contexts do not really appear together in a large number of ways. Matrix factorization can be done with techniques such as Singular Value Decomposition, which we will go into later.
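A minimal sketch of building such a PMI matrix from raw co-occurrence counts (the count collection, the context definition and the sparse representation are assumptions here):

// pmi[(w, c)] = log( p(w, c) / (p(w) * p(c)) ), computed from co-occurrence counts;
// only observed pairs are stored, so the matrix stays sparse.
Dictionary<(string, string), double> ComputePmi(Dictionary<(string, string), long> cooc)
{
    double total = 0;
    var wordCount = new Dictionary<string, double>();
    var contextCount = new Dictionary<string, double>();
    foreach (var kv in cooc)
    {
        total += kv.Value;
        if (!wordCount.ContainsKey(kv.Key.Item1)) wordCount[kv.Key.Item1] = 0;
        if (!contextCount.ContainsKey(kv.Key.Item2)) contextCount[kv.Key.Item2] = 0;
        wordCount[kv.Key.Item1] += kv.Value;
        contextCount[kv.Key.Item2] += kv.Value;
    }
    var pmi = new Dictionary<(string, string), double>();
    foreach (var kv in cooc)
    {
        double pwc = kv.Value / total;
        double pw = wordCount[kv.Key.Item1] / total;
        double pc = contextCount[kv.Key.Item2] / total;
        pmi[kv.Key] = Math.Log(pwc / (pw * pc));     // zero when word and context are independent
    }
    return pmi;
}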
Word2Vec factorizes these matrices. It does this in a weighted way, giving more influence to frequent pairs. These weights have an interesting contribution: without them, the original PMI matrix can still be reconstructed, but the weighted and non-weighted objectives result in different generalizations.
By generalizations, we mean whatever we choose for the context, such as prepositions.
When the dimensions are reduced, this plays an even bigger role.
If we change the algorithm, say from word2vec to Levy and Goldberg's, we can even expect different results. For example, the word Washington may bring up similar words such as Seattle, Evergreen, etc., while on the other hand the same word might bring up other states such as Oregon, Utah, California, etc.

Let us now look at how negative sampling plays into this. Negative sampling is the reverse of sampling contexts for words. In the latter case, we were picking, say, a handful of generalizations that we argued were contexts and mapped them to the word in the word-context matrix. Now, for the same number of contexts, we choose others at random that we know are not contexts; this is negative sampling.

#codingexercise
Check if 2 binary trees are equal recursively and iteratively
bool isEqual(Node root1, Node root2)
{
    if (root1 == null && root2 == null) return true;
    if (root1 == null) return false;
    if (root2 == null) return false;
    return root1.data == root2.data && isEqual(root1.left, root2.left) && isEqual(root1.right, root2.right);
}
bool IsEqualIterative(Node root1, Node root2)
{
    var current1 = root1;
    var current2 = root2;
    var stk = new Stack<Node>(); // one stack holding paired nodes from both trees

    while (current1 != null || current2 != null || stk.Count > 0)
    {
        if (current1 == null && current2 != null) return false;
        if (current1 != null && current2 == null) return false;

        if (current1 != null)
        {
            if (current1.data != current2.data) return false;
            stk.Push(current2);
            stk.Push(current1);
            current1 = current1.left;
            current2 = current2.left;
            continue;
        }

        if (stk.Count > 0)
        {
            current1 = stk.Pop().right;
            current2 = stk.Pop().right;
        }
    }
    return true;
}
Although the order of checking does not depend on the type of traversal, it is better and simpler to use preorder traversal.

Thursday, September 22, 2016

Statistical framework is especially relevant in natural language processing.

The purpose of using random variables and a Bayes framework is that it has advantages over the frequentist approach. The focus of the frequentist approach is that the estimated model will be closer to the truth when there is a sufficient amount of data available. In the Bayes framework, this is expressed in terms of probabilities from the training sample. For example, if we want to define a probability distribution over a language vocabulary, then we can define a random variable X(w) where w ranges over the words in the vocabulary. The probability of a word in the vocabulary then becomes p(X = w). Random variables can be either discrete or continuous. In the case of discrete variables we attach a weight to each element in the sample space, as in a multinomial distribution, and the discrete variable is thus assumed to have an underlying mass function whose probabilities sum to one over the possible values it can take. The same is done for a continuous random variable, except that we integrate over the sample space and call it a probability density function. Since multiple random variables can be defined on the same sample space, we can describe a joint distribution.
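For instance, the discrete case above can be estimated directly from a training sample; a small sketch (the tokenization is assumed) of the maximum-likelihood estimate of p(X = w):

// p(X = w) estimated as the relative frequency of w in the training sample.
Dictionary<string, double> EstimateUnigram(IEnumerable<string> tokens)
{
    var counts = new Dictionary<string, long>();
    long total = 0;
    foreach (var w in tokens)
    {
        if (!counts.ContainsKey(w)) counts[w] = 0;
        counts[w]++;
        total++;
    }
    var p = new Dictionary<string, double>();
    foreach (var kv in counts)
        p[kv.Key] = (double)kv.Value / total;        // the weights sum to one over the vocabulary
    return p;
}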
Word2vec uses the CBOW architecture, which predicts a word based on the surrounding words, and the skip-gram architecture, which predicts the surrounding words based on the current word. This is done with the softmax function and is summarized in the formulation:
p(w_O | w_I) = exp(v'_{w_O} · v_{w_I}) / Σ_w exp(v'_w · v_{w_I}), where v_w and v'_w are the input and output vector representations of word w in vocabulary W. This is a heavily optimized expression, but the idea is that each word tries to predict every word around it. Further research by Levy and Goldberg tried to predict the word not based on surrounding words alone but based on context, where context is not just words but anything that contributes to context, such as prepositions. This produced a surprisingly different word embedding: the functional similarity seen there is very different from the similarities seen with word2vec. Later, Stanford researchers started to investigate what the similarity function should look like, and they came up with an objective function that showed results similar to word2vec. They used a word-context co-occurrence matrix, with some generalizations of context, and arrived at an objective function that looks like a weighted least-squares regression model. When they factorize the co-occurrence matrix, they get results similar to word2vec. While word2vec tries to fit a model, predict, find the errors and refine the model with backpropagation, the Stanford approach counts the word-by-context matrix, takes a log and factorizes it. They explained word2vec and also found a supposedly faster way, though this has been challenged quite a bit by Levy and Goldberg. Levy and Goldberg subsequently did more work analyzing skip-gram with negative sampling. Negative sampling tries to predict contexts that are not there: it takes a sampling window and, say, ten negative contexts. Levy and Goldberg counted the number of times the word appears alone and the number of times the context appears alone versus the two appearing together. They fit this information into a well-known formula called pointwise mutual information (PMI). They declined to factorize the PMI matrix and instead used the word-to-context rows as vectors directly, and their experiments compared the similarity functions.
I have still not understood why negative sampling and PMI do not do the same diligence as precision and recall. With precision we include the false positives and with recall we include the false negatives.
We discussed research that uses matrices and high-dimensional vectors in a statistical framework with neural nets, but matrices and graphs go well together. We will review a few graph techniques for finding the correlations that word2vec finds. Word2vec is hard to apply because you have to tune different parameters. With graphs and persisted connections we hope to alleviate that.
#codingexercise
Given an array of n elements, where each element is at most k away from its target position, sort the array in the reverse order
void sortKReverse(ref List<int> A, int k)
{
    int n = A.Count;
    Debug.Assert(n > k);
    var h = new MaxHeap<int>();               // max-heap over a sliding window of k+1 elements
    for (int i = 0; i < k + 1 && i < n; i++)
    {
        h.Add(A[i]);
    }
    int next = k + 1;                         // index of the next element to push into the heap
    for (int j = 0; j < n; j++)
    {
        if (next < n)
        {
            A[j] = h.ReplaceMax(A[next]);     // pop the current maximum and push the next element
            next++;
        }
        else
        {
            A[j] = h.ExtractMax();            // drain the heap in decreasing order
        }
    }
}

#urlshortener
void printUrlHash(string url)
{
    Console.Write("{0:X8}", url.GetHashCode());
}