Tuesday, November 1, 2016

Today we continue to review the mechanics of Active Directory replication:
We are referring to intrasite replication for the most common use case: changes to items in the Active Directory domain naming context. Such AD replication takes place between DCs as pull operations, not pushes. AD controls and manages the transmission of AD data with replication metadata.
The first item of replication metadata is the USN, the update sequence number, which denotes a version local to a DC. The other DCs keep track of these USNs for each other. To prevent the entire database from being sent over the wire on every change, a second item of metadata, the high watermark vector, is used so that only the incremental changes are requested from a partner. However, this is not all. If a DC has a change that it needs to replicate, and the only things to go on were the USN and the high watermark, then the receiving DC would contact another DC and replicate the same change back that it already replicated, which would create an endless cycle of replication and chew up progressively more bandwidth. To combat this, we need the up-to-dateness vector, also called the UTDV, which is used for propagation dampening and is maintained for every DC replicating a given NC. The UTDV table keeps track of not only the highest USN that each DC has received from its partners but also the highest USN it has received from any other DC replicating that NC. To tell apart a replicated change from an originating change, the metadata carries both the replicating and the originating DC's identity; the replicated change therefore includes the following (a sketch of how this dampens propagation appears after the list):
the GUID of the DC that is replicating the change
the USN from the DC that is replicating the change
the GUID of the DC that originated the change
the USN from the DC that originated the change.
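As a minimal sketch of propagation dampening under these definitions, a DC can consult its up-to-dateness vector before applying an incoming change. The type and member names below are illustrative, not the actual AD implementation:

using System;
using System.Collections.Generic;

// Hypothetical types for illustration; not the real AD replication structures.
class ReplicatedChange
{
    public Guid SourceDcGuid;      // GUID of the DC replicating the change
    public long SourceUsn;         // USN at the replicating DC
    public Guid OriginatingDcGuid; // GUID of the DC that originated the change
    public long OriginatingUsn;    // USN at the originating DC
}

class DomainController
{
    // Up-to-dateness vector: highest originating USN applied, per originating DC.
    private readonly Dictionary<Guid, long> utdVector = new Dictionary<Guid, long>();

    // Returns false (dampens propagation) when the change was already applied.
    public bool ShouldApply(ReplicatedChange change)
    {
        long applied;
        if (utdVector.TryGetValue(change.OriginatingDcGuid, out applied)
            && applied >= change.OriginatingUsn)
            return false; // already saw this originating write; dampen it
        utdVector[change.OriginatingDcGuid] = change.OriginatingUsn;
        return true;
    }
}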
Sometimes you wonder why Active Directory can't be a database.
#codingexercise
Given a string, determine the minimum number of cuts needed to partition it so that every resulting piece is a palindrome.
int minSplit(string val)
{
    int n = val.Length;
    if (n == 0) return 0;
    var C = new int[n, n];   // C[i,j] = minimum cuts needed for substring val[i..j]
    var P = new bool[n, n];  // P[i,j] = true if substring val[i..j] is a palindrome
    for (int i = 0; i < n; i++)
    {
        P[i, i] = true;
        C[i, i] = 0;
    }
    for (int L = 2; L <= n; L++)          // L is the length of the substring
    {
        for (int i = 0; i <= n - L; i++)  // i is the starting index
        {
            int j = i + L - 1;            // j is the ending index
            if (L == 2)
                P[i, j] = (val[i] == val[j]);
            else
                P[i, j] = (val[i] == val[j]) && P[i + 1, j - 1];

            if (P[i, j])
                C[i, j] = 0;
            else
            {
                C[i, j] = int.MaxValue;
                for (int k = i; k < j; k++)
                    C[i, j] = Math.Min(C[i, j], C[i, k] + C[k + 1, j] + 1);
            }
        }
    }
    return C[0, n - 1];
}

Monday, October 31, 2016

Today we take a break to review the mechanics of Active Directory replication:
We are referring to intrasite replication for the most common use case: changes to items in the Active Directory domain naming context. Such AD replication takes place between DCs as pull operations, not pushes.
Each domain controller (DC) stores at minimum three naming contexts (NCs) in its local copy of AD: the Schema NC, the Configuration NC, and the Domain NC, which corresponds to AD data. The last one is usually the one that changes frequently. There are two types of writes: an originating write and a replicated write. Since directory changes can be made at any Active Directory DC, any one DC can be the originating DC. AD controls and manages the transmission of AD data with replication metadata.
The first item of replication metadata is the USN, the update sequence number, which is meaningful only locally at a DC and indicates a sequence number associated with the changes made at that DC.
The other DCs keep track of these USNs for each other. The other DCs are referred to by their GUIDs, so that the information does not change when a DC is renamed or taken offline.
Since an originating change becomes a replicated change at the other DCs and could be propagated perpetually, the cycle is curtailed with the help of an indicator called the high watermark vector. This helps tell apart an originating change from a replicated change and also prevents the same entry from being replicated repeatedly. A minimal sketch of such a watermark-based incremental pull follows.
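As a sketch, assuming the source exposes its change log ordered by USN (all names here are hypothetical, not the real replication RPC):

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types for illustration only.
class ChangeEntry { public long Usn; public string ObjectDn; }

class DestinationDc
{
    // High watermark: highest USN already received from each source DC (by GUID).
    private readonly Dictionary<Guid, long> highWatermark = new Dictionary<Guid, long>();

    // Pull only the changes above the watermark for this source partner.
    public List<ChangeEntry> PullChanges(Guid sourceDcGuid, List<ChangeEntry> sourceLog)
    {
        long watermark;
        highWatermark.TryGetValue(sourceDcGuid, out watermark); // defaults to 0
        var changes = sourceLog.Where(c => c.Usn > watermark)
                               .OrderBy(c => c.Usn)
                               .ToList();
        if (changes.Count > 0)
            highWatermark[sourceDcGuid] = changes[changes.Count - 1].Usn;
        return changes;
    }
}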
#codingexercise
Given an array of integers and a number k, find the maximum of each contiguous subarray of size k.
Yesterday we saw a simple solution with O(n^2) complexity for this exercise.
We could also make it linear by keeping a list of candidate indices whose values are in descending order. As the window moves, indices whose values are smaller than the current number are removed from the tail, and the maximum element, found at the head, is printed. The list keeps track of indices only.
void PrintMax(List<int> nums, int k)
{
    // 'sorted' holds indices whose values are in descending order;
    // the head is always the index of the current window's maximum.
    var sorted = new List<int>();
    if (nums == null || nums.Count == 0 || k == 0 || k > nums.Count) return;
    for (int i = 0; i < k; i++)
    {
        while (sorted.Count > 0 && nums[i] >= nums[sorted[sorted.Count - 1]])
            sorted.RemoveAt(sorted.Count - 1);
        sorted.Add(i);
    }
    for (int i = k; i < nums.Count; i++)
    {
        Console.WriteLine(nums[sorted[0]]); // maximum of the previous window
        while (sorted.Count > 0 && sorted[0] <= i - k)
            sorted.RemoveAt(0);             // drop indices that fell out of the window
        while (sorted.Count > 0 && nums[i] >= nums[sorted[sorted.Count - 1]])
            sorted.RemoveAt(sorted.Count - 1);
        sorted.Add(i);
    }
    Console.WriteLine(nums[sorted[0]]);     // maximum of the last window
}

Sunday, October 30, 2016

Today we continue our discussion on natural language processing with a particular word ontology called FrameNet. We see how it is different from other ontologies, such as WordNet, PropBank, AMR, etc. While AMR introduced a graph-based representation, it did not have the rich semantic grouping that FrameNet had. Now let us look at a specific example of a model that makes use of terms in a graphical data structure, to see what FrameNet may bring to it. To do that, we first look at one such model that does PoS tagging. A similarity graph is constructed and the labels are propagated. A classifier is trained on prelabeled data. Then it is used to decode the labels from the target domain. We saw how a standard label propagation technique works.
Let us now look at one instance of this classifier as a tagger.
We start with a set of labeled source-domain examples that gives pairs of parts-of-speech labels and inputs, and a disparate set of test data that has yet to be labeled. The input is a sequence of words in the sentence and the output is a sequence of labels. The previous label affects the current label, and the current label is independent of all earlier ancestors except for its parent. Therefore this is a good candidate for a Markov chain or a linear CRF. The goal here is to learn a CRF of the form of exponentials of weighted feature functions, where the functions range from 1 to K and the inputs range over some window of the input sequence. Given only the labeled data, the optimal feature weights are given by the sum of conditional probabilities with respect to a specific set of feature weights, with some adjustment based on a trade-off parameter and a regularizer; this we can decide based on training data. The regularizer helps with guiding the learning process. The CRF is based on conditional distributions and not on the actual probabilities of the inputs, so we cannot rely on the untrained data. That is why we use a regularizer, and the graph comes in useful as a smoothness regularizer. Subramanya et al use this technique to train the CRF. They take a set of CRF parameters, and first compute the marginals over the untrained data, which is simply the conditional distribution in this iteration over all the data. The marginals over tokens are then aggregated to the marginals over types, which are used to initialize the graph label distributions. The labels are then propagated. The posteriors from the graph are then used to smooth the state posteriors.
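For reference, a linear-chain CRF of this form can be written in standard notation (not the exact objective from the paper) as

p(y | x; w) = (1 / Z(x)) * exp( sum over positions i and features k of w_k * f_k(y_{i-1}, y_i, x, i) )

where Z(x) normalizes over all label sequences. Supervised training then maximizes the conditional log-likelihood of the labeled pairs minus a penalty such as (lambda/2) * ||w||^2, which is the trade-off parameter and regularizer mentioned above.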
A posterior is a conditional distribution. While the prior represents the uncertainty over our choice of parameter before we have sampled the data, the posterior represents the uncertainty over our choice of parameter after we have sampled the data and attempted to estimate it. The prior is an ordinary probability distribution, while the posterior is a conditional probability distribution because it conditions on the observed data.
The unlabeled data then produces a new set of automatic annotations that can be combined with the labeled data to retrain the CRF.
#codingexercise
Given an array of integers and a number k, find the maximum of each contiguous subarray of size k.
List<int> GetMaxForWindows(List<int> nums, int windowSize)
{
    var ret = new List<int>();
    for (int i = 0; i <= nums.Count - windowSize; i++)
    {
        int max = nums[i];  // scan each window for its maximum
        for (int j = 1; j < windowSize; j++)
        {
            if (nums[i + j] > max)
                max = nums[i + j];
        }
        ret.Add(max);
    }
    return ret;
}

Saturday, October 29, 2016

Today we continue our discussion on natural language processing with a particular word ontology called FrameNet. We see how it is different from other ontologies, such as WordNet, PropBank, AMR, etc. While AMR introduced a graph-based representation, it did not have the rich semantic grouping that FrameNet had. Now let us look at a specific example of a model that makes use of terms in a graphical data structure, to see what FrameNet may bring to it. To do that, we consider how a classifier could be built that creates lexical units for FrameNet frames and maps terms or senses to their corresponding FrameNet frames.
Let us first look at one such model that does POS tagging with the help of graphs. It relies on the assumption that similar words will have similar part-of-speech tags. A similarity graph is constructed and the labels are propagated. Data, both labeled and unlabeled, is represented by vertices in a graph. Graph edges link vertices that are likely to have the same label. Edge weights determine how strongly the labels agree. Usually the number of labels is small, and even smaller is the number of labels that are composites. The number of atomic labels may be small, but there may be a much larger number of ways to combine them. One way to combine them would be to structure them as a tree.
A classifier is trained on prelabeled data. Then it is used to decode the labels from the target domain. The classifier, which here can also be called a tagger, can be based on any suitable model. For example, it can be based on word matrices. However, computations on large matrices do not scale well, especially if inversions are involved. On the other hand, conditional random fields are better suited for these purposes. They scale well because they use only conditional probability distributions. Standard inference and learning techniques as well as standard graph propagation techniques are also scalable. The building blocks for the CRF inference could contribute to the efficiency.
The label propagation works like this (a minimal code sketch follows the steps):
Initialize the labels at all nodes in the network; for a given node, the label at the start is the initialization label.
Set the iteration to 1.
Take the nodes in a random order and assign them to a list.
For each node in that list, use a function to assign a label based on the labels of the adjacent nodes, using the assignments from the previous iteration. For example, this could be the most frequently occurring label among the adjacencies.
If every node satisfies the chosen function, stop the algorithm; otherwise repeat with the next iteration.
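As a minimal sketch of these steps, assuming a graph given as an adjacency list and integer labels (all names here are hypothetical):

using System;
using System.Collections.Generic;
using System.Linq;

static int[] PropagateLabels(List<int>[] adj, int[] initialLabels, int maxIterations)
{
    int n = adj.Length;
    var labels = (int[])initialLabels.Clone(); // step 1: initialize labels
    var rand = new Random();
    for (int iter = 1; iter <= maxIterations; iter++)
    {
        var previous = (int[])labels.Clone();
        // take the nodes in a random order
        var order = Enumerable.Range(0, n).OrderBy(_ => rand.Next()).ToList();
        bool changed = false;
        foreach (int v in order)
        {
            if (adj[v].Count == 0) continue;
            // most frequent label among the neighbors' previous assignments
            int best = adj[v].GroupBy(u => previous[u])
                             .OrderByDescending(g => g.Count())
                             .First().Key;
            if (best != labels[v]) { labels[v] = best; changed = true; }
        }
        if (!changed) break; // every node satisfies the chosen function
    }
    return labels;
}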

#trivia question
What is a keep-alive header?
The keep-alive header provides connection use policies. It's a hop-by-hop header that provides information about a persistent connection. It includes a timeout that determines how long a connection can be idle before it is closed.
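For example, a server might respond with headers like the following (the values are illustrative):

Connection: Keep-Alive
Keep-Alive: timeout=5, max=100

which tells the client that the idle connection will be closed after 5 seconds and that at most 100 requests may be sent on it before the connection is closed.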

Friday, October 28, 2016

Today we continue our discussion on natural language processing with a particular word ontology called FrameNet. We see how it is different from other ontologies, such as WordNet, PropBank, etc. One such representation was AMR, which had extensive documentation. It had PropBank predicate-argument semantics. It had named entities and values with entity linking. Entities and events formed coreference.
It included modality, negations and questions. It had relations between nominals. The content words were canonicalized. It would convert adverbs to adjectives to nouns to verbs wherever possible. Moreover, it represented all of this in a single graph. The AMR assets included a snazzy annotation tool and a corpus of 15,000 AMRs, roughly 270K words, already released, with another 5,000 AMRs, or 150K words, annotated and in the pipeline. AMR differed from FrameNet in that its purpose was cost-effective annotation together with NLP, as opposed to FrameNet's lexicography. FrameNet frames were about rich semantic groupings, while AMR retained PropBank's lexicalized rolesets. FrameNet annotated by using labeled spans in sentences. AMR was represented by a graph in which sentences and words did not directly translate to nodes. A sentence does not have any explicit relationships across frame annotations in FrameNet, whereas AMR comprises predicates with shared arguments, as the example below illustrates.
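For instance, the stock example from the AMR literature represents "The boy wants to go" as a single graph in which the boy node is shared between the two predicates:

(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go-01
            :ARG0 b))

Here want-01 and go-01 are PropBank-style rolesets, and the variable b appears as an argument of both, which is exactly the shared-argument structure that span-based FrameNet annotation does not make explicit.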
We note here the functionality of FrameNet and the advantages of the form of AMR. In general, there are many ways to come up with an ontology or pseudo-ontology for different purposes. However, a graph-based data structure where the nodes are not necessarily terms comes in handy in many cases. Further, a lookup of each term to such graph nodes also helps with directly representing models over the graph nodes instead of over terms, with more static pre-discovered relationships available for incorporation into the models.

#puzzle
Prove that for every prime p greater than 3, the value p^2 - 1 is a multiple of 24.
p^2 - 1 can be written as (p-1)(p+1).
Since p is an odd prime, both of the factors above are even, so each is divisible by 2.
Further, they are consecutive even numbers, so one of them is divisible by 4 as well. That gives us the factors 2 x 2 x 2.
Next, p-1, p, and p+1 are three consecutive numbers, so one of them must be divisible by 3. We know p is not divisible by 3 (it is a prime greater than 3), so the multiple of 3 must be p-1 or p+1. This gives us 3 as a factor.
Consequently (p-1)(p+1) is divisible by 2 x 2 x 2 x 3 = 24 for all primes p > 3. For example, p = 5 gives 24 and p = 7 gives 48.

#codingexercise
Find power(x, n) with logarithmic time complexity and limited storage.
int power(int x, uint n)
{
    int result = 1;
    while (n > 0)
    {
        if ((n & 1) == 1)  // if the lowest bit of n is set, multiply in the current power of x
            result = result * x;
        n = n >> 1;        // reduce n by half
        x = x * x;         // square the base for the next bit
    }
    return result;
}



This can be made recursive as well, but it won't be as storage efficient because of the call stack.

Wednesday, October 26, 2016

Today we continue our discussion on natural language processing with a particular word ontology called FrameNet. We see how it is different from other ontologies. For example, FrameNet differs from WordNet in the lexical information and is, in fact, complementary in many ways. FrameNet has far fewer nouns, verbs, adjectives and adverbs as compared to WordNet. FrameNet frame hierarchies provide better generalizations than WordNet synset hierarchies.
FrameNet also differs from PropBank (Proposition Bank discussed by Palmer et al 2005). PropBank began with labeling verbs and their arguments.
FrameNet assigns names such as communicator, evaluee and reason for corresponding PropBank role labels.
There have been other works to represent semantic information, such as Abstract Meaning Representation (Banarescu et al., LAW 2013). AMR is a graph-based representation of lexical concepts and typed relations between those concepts as denoted by an English sentence.
However, the lexical semantics in FrameNet are much deeper than in AMR.
What AMR means for FrameNet and for our use is that we can represent the frames in a graph where the relationships are persisted. These edges can then be traversed; a small sketch follows. Previously we used graphs with edges established based on similarity measures. Here we establish edges based on the relationships between frames.
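As a minimal sketch of such a frame graph (the types here are illustrative; FrameNet's actual frame-to-frame relations include Inheritance, Using and Subframe):

using System;
using System.Collections.Generic;

// Hypothetical types: frames as nodes, typed frame-to-frame relations as edges.
class FrameNode
{
    public string Name;
    public List<FrameEdge> Edges = new List<FrameEdge>();
}

class FrameEdge
{
    public string Relation;  // e.g. "Inherits from", "Uses", "Subframe of"
    public FrameNode Target;
}

// Walk all frames reachable from a start frame by following the typed edges.
static void Traverse(FrameNode start, HashSet<FrameNode> seen)
{
    if (!seen.Add(start)) return;
    Console.WriteLine(start.Name);
    foreach (var e in start.Edges)
        Traverse(e.Target, seen);
}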

#codingexercise
Find the number of islands of 1s in a binary matrix comprising 0s and 1s. In other words, count the number of connected components in an undirected graph.
int GetCountOfIslands(int[,] m, int row, int col)
{
    var visited = new bool[row, col]; // initialized to false
    int count = 0;
    for (int i = 0; i < row; i++)
        for (int j = 0; j < col; j++)
            if (m[i, j] == 1 && visited[i, j] == false)
            {
                DFS(m, i, j, visited);
                count++;
            }
    return count;
}
void DFS(int[,] m, int row, int col, bool[,] visited)
{
    // row and column offsets for the eight neighbors of a cell
    var radj = new int[] {-1, -1, -1,  0, 0,  1, 1, 1};
    var cadj = new int[] {-1,  0,  1, -1, 1, -1, 0, 1};
    visited[row, col] = true;
    for (int k = 0; k < 8; k++)
        if (withinMatrixAndNotVisited(m, row + radj[k], col + cadj[k], visited))
            DFS(m, row + radj[k], col + cadj[k], visited);
}
bool withinMatrixAndNotVisited(int[,] m, int r, int c, bool[,] visited)
{
    return r >= 0 && r < m.GetLength(0) && c >= 0 && c < m.GetLength(1)
        && m[r, c] == 1 && !visited[r, c];
}
We can also measure the size of each island with a minor modification to the above; a sketch follows.
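As a minimal sketch, assuming the same scaffolding as above, returning the count of visited cells from the DFS gives the size of each island:

int DFSSize(int[,] m, int row, int col, bool[,] visited)
{
    var radj = new int[] {-1, -1, -1,  0, 0,  1, 1, 1};
    var cadj = new int[] {-1,  0,  1, -1, 1, -1, 0, 1};
    visited[row, col] = true;
    int size = 1; // count this cell
    for (int k = 0; k < 8; k++)
        if (withinMatrixAndNotVisited(m, row + radj[k], col + cadj[k], visited))
            size += DFSSize(m, row + radj[k], col + cadj[k], visited);
    return size;
}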