Thursday, June 12, 2014

Tracers in distributed computing.
When there are many peers participating in a task, sometimes the only way to determine the actions taken by each is to send a tracer through the system. This is the same functionality that is so niftily used in networking to discover the participating hosts along a route. I'm referring to the traceroute tool. This tool lists the hosts in the order the IP packet travels through them, each hop revealing its IP address. When we are interested only in source-destination reachability, we use the ping tool. The ping tool sends multiple echo requests to the destination server and calculates the packet loss; if the other side is responding properly, we can collect the responses. The traceroute tool, on the other hand, is a way to find what path is taken to reach the destination: it sends probes with increasing time-to-live values so that each intermediate hop identifies itself in turn. This comes in useful for, say, tracing the path a request takes to reach a DNS server.
In distributed computing, however, we use tracers as markers to see how the diffusion propagates through the system. In gossip, for example, we can use a dummy message to see which participants were involved in relaying it.
Gossip propagates by selecting peers at random. Each peer, when selected and handed a message, can update it and forward it on. The goal in gossip is to spread the information as quickly as possible, and this can happen even over an unreliable network.
Gossip protocols can also be aggregating. To compute an aggregate, the values from the nodes are combined to arrive at a system-wide value. The key idea in aggregation is that the aggregate is computed through fixed-size pairwise exchanges. The exchanges terminate after a few rounds, usually once all-to-all information flow has occurred. Gossip algorithms can also arrange the nodes in a tree and compute aggregates such as sum or count by gossiping in a pattern that matches the tree.
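As a rough single-machine sketch of the pairwise exchange idea (this is a toy simulation of gossip averaging, not a networked implementation; the node count, values, and round count are made up):

#include <cstdio>
#include <random>
#include <vector>

// One round of pairwise gossip averaging: every node picks a random peer and
// the two replace their values with the average, so all values converge
// toward the system-wide mean. Sums and counts follow the same pattern.
void gossipRound(std::vector<double>& values, std::mt19937& rng) {
    std::uniform_int_distribution<size_t> pick(0, values.size() - 1);
    for (size_t i = 0; i < values.size(); ++i) {
        size_t j = pick(rng);
        if (j == i) continue;
        double avg = (values[i] + values[j]) / 2.0;
        values[i] = values[j] = avg;
    }
}

int main() {
    std::vector<double> values = {10, 20, 30, 40};  // one value per node
    std::mt19937 rng(42);
    for (int round = 0; round < 20; ++round) gossipRound(values, rng);
    for (double v : values) printf("%f\n", v);      // all close to 25
}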
Typically, what we stamp during information dissemination such as a gossip protocol is the node id, a timestamp, and the actual message pertaining to the information. In our case the information is a reserved message that tells each node to add its stamp and forward it on.
The node ids could be kept as a sorted list, and each node adds its entry to the list it receives. When a node sees two sorted lists, it merges them.
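A small sketch of what such a tracer might look like (the message layout, the stamp fields, and the merge step here are my own illustrative assumptions, not any particular protocol's format):

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <iterator>
#include <string>
#include <vector>

// Illustrative tracer message: a reserved payload plus the sorted list of
// node ids that have already stamped it, each with a timestamp.
struct TracerStamp {
    uint64_t nodeId;
    uint64_t timestamp;   // e.g., milliseconds since epoch at the stamping node
};

struct TracerMessage {
    std::string payload;              // the reserved "trace me" marker
    std::vector<TracerStamp> stamps;  // kept sorted by nodeId
};

// Each peer adds its own stamp, keeping the list sorted by node id.
void addStamp(TracerMessage& msg, uint64_t nodeId, uint64_t now) {
    TracerStamp s{nodeId, now};
    auto pos = std::lower_bound(msg.stamps.begin(), msg.stamps.end(), s,
        [](const TracerStamp& a, const TracerStamp& b) { return a.nodeId < b.nodeId; });
    if (pos == msg.stamps.end() || pos->nodeId != nodeId)
        msg.stamps.insert(pos, s);
}

// When a node sees two copies of the tracer, it merges their sorted lists.
std::vector<TracerStamp> mergeStamps(const std::vector<TracerStamp>& a,
                                     const std::vector<TracerStamp>& b) {
    std::vector<TracerStamp> out;
    std::merge(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out),
        [](const TracerStamp& x, const TracerStamp& y) { return x.nodeId < y.nodeId; });
    // Drop duplicate node ids, keeping the first stamp seen for each node.
    out.erase(std::unique(out.begin(), out.end(),
        [](const TracerStamp& x, const TracerStamp& y) { return x.nodeId == y.nodeId; }),
        out.end());
    return out;
}

int main() {
    TracerMessage msg{"trace", {}};
    addStamp(msg, 42, 1000);
    addStamp(msg, 7, 1001);
    msg.stamps = mergeStamps(msg.stamps, {{13, 1002}});
    for (const auto& s : msg.stamps)
        std::printf("node %llu at %llu\n",
                    (unsigned long long)s.nodeId, (unsigned long long)s.timestamp);
}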

In today's post I want to take a break and talk about the exception translator.
In Windows, we have a function called _set_se_translator, and it provides a convenient way to handle structured exceptions as C++ typed exceptions. First, a C++ exception wrapper class is defined that can be used to attribute a specific class type to a structured exception. Each time a structured exception is raised, the custom exception translator function that was installed is called by the internal exception handling mechanism and throws the corresponding C++ exception. The translation thus occurs from any kind of structured exception to the type registered.
The translator function, however, should not log to a file, because generally speaking we don't want to delay the exception handling.
In a multithreaded environment, translator functions are maintained separately for each thread, which means we can use thread local storage to save state. At the same time, if this is done for a worker pool where the workers participate in different tasks, the translator may need to be toggled for each such task. Note that _set_se_translator works only when compiling with /EHa.
The translator function registered with _set_se_translator takes an unsigned integer and a pointer to a Win32 _EXCEPTION_POINTERS structure as arguments; these are the values returned by the Win32 intrinsics GetExceptionCode() and GetExceptionInformation().


As an example


#include <windows.h>
#include <eh.h>
#include <cstdio>

// C++ wrapper class used to give the structured exception a C++ type.
class SE_Exception
{
    unsigned int code;
public:
    explicit SE_Exception(unsigned int c = 0) : code(c) {}
    unsigned int getCode() const { return code; }
};

// Translator called by the runtime for every structured exception
// (requires compiling with /EHa).
void translator(unsigned int u, EXCEPTION_POINTERS* p)
{
    printf("In translator\n");
    throw SE_Exception(u);
}

// Raise an access violation so the translator gets exercised.
void func()
{
    int* volatile p = nullptr;
    *p = 0;
}

int main()
{
    _set_se_translator(translator);
    try
    {
        func();
    }
    catch (const SE_Exception& e)
    {
        printf("Caught SE_Exception, code 0x%x\n", e.getCode());
    }
}

Tuesday, June 10, 2014

We discuss bootstrapping in Natural Language Processing. We read from the paper by Ehara, Sato, Oiwa, and Nakagawa and from an NLP blogger. Bootstrapping is a technique in which we use a binary classifier in an iterative manner. In bootstrapping, unlabeled instances are labeled over successive iterations starting from an initial labeled "seed" set. The steps involved are:
Build a classifier with rules to select the positive answers with high precision, and
build a classifier with rules to select the negative answers with high precision.
Run the classifiers on a large data set to collect a labeled set.
Train the classifier on the labeled set.
Run the classifier on the original data.
The choice of the seed set is important, and hence it is chosen based on two considerations: the choice of seed set affects the accuracy, and there is a labeling cost associated with it. The authors present a model called expected model rotation that works well on realistic data. With this model, they propose an "iterative seeding framework". At the end of each iteration, the seeds are assessed for quality and improved in response to the current human labels and data characteristics. This involves a criterion that improves the seeds as well as a score, updated with each iteration, that keeps track of the similarity of the seeds.
In bootstrapping, given a dataset, we can consider the first l instances to be labeled and the remaining u instances to be unlabeled. The dataset consists of m-dimensional feature vectors, and the first l carry labels belonging to a set C of semantic classes. Suppose we were to pick out animal names from the given instances; then the set C consists of the labels animal and not-animal and is binary. The ranking is given by the score vector formed from the difference of the seed vectors, y-animal - y-not-animal. Simple bootstrapping proceeds by setting the score vector with the first classifier rule for the first label and continues until all of the rules for each label have been identified.
If X is the instance-by-feature matrix for the given text, the score vector after c iterations is obtained as ((1/(mn)) X X-transpose)^c y.
To make the score seed dependent, Komachi et al. 2008 proposed writing the score vector as (-L)^c y, where the normalized Laplacian is defined as L = I - D^(-1/2) X X-transpose D^(-1/2). This method is called Laplacian label propagation, and the score vector is given by (I + beta L)^(-1) y.
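As a minimal sketch of the simplified propagation above (the toy feature matrix, the seed vector and the number of rounds are made-up values purely for illustration; this is not the authors' code), we can apply (1/(mn)) X X-transpose to the seed score vector repeatedly:

#include <cstdio>
#include <vector>

using Matrix = std::vector<std::vector<double>>;
using Vector = std::vector<double>;

// out = M * y for a dense square matrix M.
Vector multiply(const Matrix& M, const Vector& y) {
    Vector out(M.size(), 0.0);
    for (size_t i = 0; i < M.size(); ++i)
        for (size_t j = 0; j < y.size(); ++j)
            out[i] += M[i][j] * y[j];
    return out;
}

int main() {
    // Toy instance-by-feature matrix X: n = 3 instances, m = 2 features.
    Matrix X = {{1, 0}, {1, 1}, {0, 1}};
    size_t n = X.size(), m = X[0].size();

    // A = (1/(m*n)) * X * X-transpose, an n x n similarity matrix over instances.
    Matrix A(n, Vector(n, 0.0));
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            for (size_t k = 0; k < m; ++k)
                A[i][j] += X[i][k] * X[j][k] / (m * n);

    // Seed score vector y: instance 0 is the only labeled (positive) seed.
    Vector y = {1.0, 0.0, 0.0};

    // Score after c rounds of propagation: A^c * y.
    int c = 3;
    Vector f = y;
    for (int round = 0; round < c; ++round)
        f = multiply(A, f);

    for (size_t i = 0; i < n; ++i)
        printf("score[%zu] = %f\n", i, f[i]);
}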

We will shortly see how this Laplacian propagation works. The thing to note here is that bootstrapping sits between the purely supervised and purely unsupervised modes, and even though it is said to work well, it doesn't have the same rigor as vector space models or graph-based efforts. The reason we are looking into bootstrapping is merely to see the usage of Laplacian operators.

Let us look further into the paper for the Laplacian operator. The authors propose criteria for iterative seeding. For an unlabeled instance, they denote the goodness of seed as gi and select the instance with the highest goodness of seed as the seed added in the next iteration. Each seed selection criterion defines its own goodness of seed gi, reflecting how much the unlabeled instance influences the model. They show that the Komachi label propagation score can be considered as the margin between each unlabeled data point and the hyperplane obtained by ridge regression, and they denote this with a transformed equation.
Before we discuss the word similarity measures and the OOV term normalization that we wanted to continue from the previous post, I want to take a minute to mention graph centrality, which is another useful concept that can help with word similarity. Centrality is best discussed in sociology, and I refer to the notes of a course at UCLA by Prof. McFarland and Motoki Watabe.
There are three different centrality measures:
1) degree centrality
2) closeness centrality
3) betweenness centrality

Given a graph with n nodes, the degree centrality of a node refers to the number of edges attached to the node. To normalize the score for each node, we divide the number of edges attached to it by n-1.

First, the degree centrality:
If we have a graph A with four vertices that contains only one central node to which all other nodes attach, then that central node has a centrality of 3 and the others have a centrality of 1.
The degree centrality for the overall graph (its centralization) is calculated as [(3-3) + (3-1) + (3-1) + (3-1)] divided by the maximum possible value, (n-1)(n-2), which in this case gives 6/6 = 1.
For a graph B on the same four vertices where each node is connected to every other, as in a square with diagonals, each of the vertices now has a centrality score of 3, and the degree centrality for the overall graph is [(3-3) + (3-3) + (3-3) + (3-3)] / (n-1)(n-2) = 0.
The numerator is the sum of the differences between the maximum centrality score and each individual score.

It may appear that the higher the normalized score, the higher a node's likely relevance in the graph. However, this is not always true; Cook et al. 1983 demonstrated that this is not the case. We will look into the applications of each measure as we define it.
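Here is a small sketch of the degree calculation above; the two adjacency matrices simply encode the four-node star (network A) and the fully connected square with diagonals (network B):

#include <cstdio>
#include <vector>

using Graph = std::vector<std::vector<int>>;  // adjacency matrix

// Degree centralization: sum of (max degree - degree of each node),
// divided by the maximum possible value, (n-1)(n-2).
double degreeCentralization(const Graph& g) {
    int n = (int)g.size();
    std::vector<int> degree(n, 0);
    int maxDegree = 0;
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) degree[i] += g[i][j];
        if (degree[i] > maxDegree) maxDegree = degree[i];
    }
    double numerator = 0.0;
    for (int i = 0; i < n; ++i) numerator += maxDegree - degree[i];
    return numerator / ((n - 1) * (n - 2));
}

int main() {
    // Network A: a star, node 0 connected to the three others.
    Graph A = {{0,1,1,1},{1,0,0,0},{1,0,0,0},{1,0,0,0}};
    // Network B: every node connected to every other node.
    Graph B = {{0,1,1,1},{1,0,1,1},{1,1,0,1},{1,1,1,0}};
    printf("degree centralization of A = %f\n", degreeCentralization(A));  // 1.0
    printf("degree centralization of B = %f\n", degreeCentralization(B));  // 0.0
}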

To calculate the closeness centrality, we find the farness of each node and take its inverse. The farness of a node s is defined as the sum of its distances to all other nodes; the more central a node is, the lower its total distance to all other nodes. Closeness can be taken as a measure of how fast information can spread from a node to all other nodes sequentially.
For network A, the closeness score for the central node is 3 / 3 = 1 and for the other nodes it is 3 / 5.
Since the maximum score in the above is 1,
the closeness centrality for network A is [(1 - 1) + (1 - 3/5) + (1 - 3/5) + (1 - 3/5)] / [(n-1)(n-2)/(2n-3)] = (6/5) / (6/5) = 1,
and for network B, the closeness score for each node is uniformly 3 / 3 = 1,
and the closeness centrality for the overall network B is (0 + 0 + 0 + 0) / (6/5) = 0.
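A similar sketch for closeness (the graphs are tiny, so Floyd-Warshall distances suffice; the normalizing denominator (n-1)(n-2)/(2n-3) is the maximum possible sum of differences, attained by the star):

#include <cstdio>
#include <vector>

using Graph = std::vector<std::vector<int>>;

// Closeness centralization of a small connected, unweighted graph.
double closenessCentralization(const Graph& g) {
    int n = (int)g.size();
    const int INF = 1 << 20;

    // All-pairs shortest path distances via Floyd-Warshall.
    std::vector<std::vector<int>> d(n, std::vector<int>(n, INF));
    for (int i = 0; i < n; ++i) d[i][i] = 0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            if (g[i][j]) d[i][j] = 1;
    for (int k = 0; k < n; ++k)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                if (d[i][k] + d[k][j] < d[i][j]) d[i][j] = d[i][k] + d[k][j];

    // Closeness of node i = (n - 1) / farness(i).
    std::vector<double> closeness(n);
    double maxCloseness = 0.0;
    for (int i = 0; i < n; ++i) {
        int farness = 0;
        for (int j = 0; j < n; ++j) farness += d[i][j];
        closeness[i] = (double)(n - 1) / farness;
        if (closeness[i] > maxCloseness) maxCloseness = closeness[i];
    }

    // Sum of differences from the maximum, normalized by (n-1)(n-2)/(2n-3).
    double numerator = 0.0;
    for (int i = 0; i < n; ++i) numerator += maxCloseness - closeness[i];
    return numerator / ((double)(n - 1) * (n - 2) / (2 * n - 3));
}

int main() {
    Graph A = {{0,1,1,1},{1,0,0,0},{1,0,0,0},{1,0,0,0}};  // star
    Graph B = {{0,1,1,1},{1,0,1,1},{1,1,0,1},{1,1,1,0}};  // complete
    printf("closeness centralization of A = %f\n", closenessCentralization(A));  // 1.0
    printf("closeness centralization of B = %f\n", closenessCentralization(B));  // 0.0
}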
 

To calculate the betweenness centrality, we count the number of times a vertex acts as a bridge on the shortest path between two other nodes. It was introduced as a measure of the control a node has over the dissemination of information in a network.

For network A, the betweenness score for the most connected node is 3 / 3 = 1 and for all the other nodes it is 0.
The maximum score in the above is 3 / 3 = 1.
The betweenness centrality for network A is [(1-1) + (1-0) + (1-0) + (1-0)] / [(n-1)(n-2)/2] = 3 / 3 = 1,
and
for network B the centrality score for each node is the same, namely 0, since any single node can be bypassed.
The maximum score is 0,
and the overall score for network B is 0 / 3 = 0.
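And a sketch for betweenness on the same two networks (the path-counting Floyd-Warshall used here is just a simple way to get exact shortest-path counts on a tiny graph; a production implementation would normally use Brandes' algorithm):

#include <cstdio>
#include <vector>

using Graph = std::vector<std::vector<int>>;

// Normalized betweenness of each node in a small connected, unweighted,
// undirected graph: the fraction of shortest s-t paths passing through the
// node, summed over all pairs and divided by (n-1)(n-2)/2.
std::vector<double> betweenness(const Graph& g) {
    int n = (int)g.size();
    const int INF = 1 << 20;
    std::vector<std::vector<int>> d(n, std::vector<int>(n, INF));
    std::vector<std::vector<double>> sigma(n, std::vector<double>(n, 0.0));
    for (int i = 0; i < n; ++i) { d[i][i] = 0; sigma[i][i] = 1.0; }
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            if (g[i][j]) { d[i][j] = 1; sigma[i][j] = 1.0; }

    // Floyd-Warshall that also counts the number of shortest paths.
    for (int k = 0; k < n; ++k)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                if (i == k || j == k || i == j) continue;
                if (d[i][k] + d[k][j] < d[i][j]) {
                    d[i][j] = d[i][k] + d[k][j];
                    sigma[i][j] = sigma[i][k] * sigma[k][j];
                } else if (d[i][k] + d[k][j] == d[i][j]) {
                    sigma[i][j] += sigma[i][k] * sigma[k][j];
                }
            }

    std::vector<double> score(n, 0.0);
    for (int v = 0; v < n; ++v)
        for (int s = 0; s < n; ++s)
            for (int t = s + 1; t < n; ++t) {
                if (s == v || t == v) continue;
                if (d[s][v] + d[v][t] == d[s][t])
                    score[v] += sigma[s][v] * sigma[v][t] / sigma[s][t];
            }
    for (int v = 0; v < n; ++v)
        score[v] /= (n - 1) * (n - 2) / 2.0;
    return score;
}

int main() {
    Graph A = {{0,1,1,1},{1,0,0,0},{1,0,0,0},{1,0,0,0}};  // star
    Graph B = {{0,1,1,1},{1,0,1,1},{1,1,0,1},{1,1,1,0}};  // complete
    std::vector<double> a = betweenness(A), b = betweenness(B);
    for (int v = 0; v < 4; ++v)
        printf("node %d: A = %.2f, B = %.2f\n", v, a[v], b[v]);
}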


Thus we see that in all three cases, network A is more centralized than network B.
For information flow, network A has high controllability and high efficiency in that there are no redundant edges. At the same time, network A is riskier because everything depends on the central node.
For network B, there is high flexibility and better tolerance to failures. However, the cost from the redundant edges is higher.

We saw different scores for vertices. To determine the admittance matrix of a graph, we have to calculate a different kind of score.


Monday, June 9, 2014

In today's post we look at some of the details of graph-based word similarity measures. While we looked at a word similarity measure in the Saluja and Navratil 2012 paper, I'm going to focus on the normalization of words. We mentioned the holing system in Biemann and Riedl. This holing system is only part of their system, but it describes a way to capture distributional and contextual similarity. The idea here is to represent observations as <t,c> pairs, where t is the term (at a certain offset) and c is the context feature. The position of t in c is denoted by the @ hole symbol. The authors give an example where the dependency (nsub; gave2; I1) can be transformed to <gave2, (nsub; @; I1)>, and this in turn can be represented as <I1, (nsub; gave2; @)>. Note how the context changes with the term and how the holing operator is used: in the second representation the @ symbol marks the position of gave2, and in the third representation it marks the position of I1. The last two representations are the <t,c> pairs.
This paper also uses a term-context-feature bipartite graph, just like the last paper we saw. In addition, it calculates a significance score sig for each pair <t,c> using the counts of the terms, the counts of the context features, and the counts of their co-occurrence. If the significance score is negative, the corresponding edge is not added. Also, features that co-occur with more than a thousand words are removed because they are overly general. This is how the distributional similarity is calculated; the contextual similarity is calculated instead from the second-order graph.
With the sentence "I caught a nasty cold", the authors showed that the holing system produces <cold5, (amod; @; nasty4)> and <cold5, (dobj; caught2; @)>, and the scores included <heat, (dobj; caught; @)> with an LMI score of 42.0 and <weather, (amod; @; nasty)> with a score of 139.4. The word disease came highest in rank for the above sentence.
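As a rough illustration of the holing operation (the Dependency struct and the hole function below are my own simplification for illustration, not the authors' code):

#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// A dependency observation such as (nsub; gave2; I1).
struct Dependency {
    std::string relation;   // e.g., "nsub"
    std::string head;       // e.g., "gave2"
    std::string dependent;  // e.g., "I1"
};

// Split one dependency into two <term, context> pairs, marking the
// position of the term within its context with the @ hole symbol.
std::vector<std::pair<std::string, std::string>> hole(const Dependency& dep) {
    return {
        { dep.head,      "(" + dep.relation + "; @; " + dep.dependent + ")" },
        { dep.dependent, "(" + dep.relation + "; " + dep.head + "; @)" }
    };
}

int main() {
    Dependency d{"nsub", "gave2", "I1"};
    for (const auto& tc : hole(d))
        printf("<%s, %s>\n", tc.first.c_str(), tc.second.c_str());
    // Prints <gave2, (nsub; @; I1)> and <I1, (nsub; gave2; @)>
}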

We will continue this discussion.

Saturday, June 7, 2014

I've been looking at some objective functions for graph-based similarity between words, and I came across "Graph based Unsupervised Learning of Word Similarities using Heterogeneous Feature Types" by Saluja and Navratil. This seems a simple way to express the similarity. The graph they use has words and feature types as vertices, and edges connect two words, two features, or a word and a feature.
To learn the graph, the following objective function is minimized:
Omega(Wv, Wf, Z) = a0 Sum over fp, fq of ( W(fp,fq) - W-initial(fp,fq) )^2
                 + a1 Sum over vi Sum over fp of ( Z(vi,fp) - Z-initial(vi,fp) )^2
                 + a2 Sum over vi, vj Sum over fp, fq of Z(vi,fp) Z(vj,fq) ( W(vi,vj) - W(fp,fq) )^2
                 + a3 Sum over vi, vj Sum over fp, fq of W(vi,vj) W(fp,fq) ( Z(vi,fp) - Z(vj,fq) )^2
where
W(fp,fq) is the current similarity between features fp and fq
W(vi,vj) is the current similarity between words vi and vj
Z(vi,fp) is the current importance weight of feature fp for word vi
a0 to a3 are constants that add up to 1 and set the relative importance of each term
The first two terms minimize the l2-norm between the initial and the current values of W(fp,fq) and Z(vi,fp).
The third term minimizes the difference between the word similarity W(vi,vj) and the feature similarity W(fp,fq), weighted by the importance weights.
The fourth term minimizes the difference between the importance weights Z(vi,fp) and Z(vj,fq), weighted by the similarities.
As you can see, the terms are progressive and enable smoothing over the two graphs.
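A direct, if naive, evaluation of this objective can be sketched as follows (the matrix layout and the toy values in main are assumptions for illustration; the paper optimizes the objective rather than merely evaluating it):

#include <cstdio>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Evaluate the four-term objective above for word-word similarities Wv,
// feature-feature similarities Wf, and importance weights Z (Z[i][p] is the
// weight of feature p for word i). W0f and Z0 are the initial values and
// a0..a3 the relative-importance constants.
double objective(const Matrix& Wv, const Matrix& Wf, const Matrix& Z,
                 const Matrix& W0f, const Matrix& Z0,
                 double a0, double a1, double a2, double a3) {
    size_t nWords = Wv.size(), nFeatures = Wf.size();
    double term0 = 0, term1 = 0, term2 = 0, term3 = 0;

    // a0 term: keep feature-feature similarities close to their initial values.
    for (size_t p = 0; p < nFeatures; ++p)
        for (size_t q = 0; q < nFeatures; ++q)
            term0 += (Wf[p][q] - W0f[p][q]) * (Wf[p][q] - W0f[p][q]);

    // a1 term: keep importance weights close to their initial values.
    for (size_t i = 0; i < nWords; ++i)
        for (size_t p = 0; p < nFeatures; ++p)
            term1 += (Z[i][p] - Z0[i][p]) * (Z[i][p] - Z0[i][p]);

    // a2 term: word and feature similarities should agree where the features
    // matter; a3 term: importance weights should agree where the similarities
    // are high.
    for (size_t i = 0; i < nWords; ++i)
        for (size_t j = 0; j < nWords; ++j)
            for (size_t p = 0; p < nFeatures; ++p)
                for (size_t q = 0; q < nFeatures; ++q) {
                    double dw = Wv[i][j] - Wf[p][q];
                    double dz = Z[i][p] - Z[j][q];
                    term2 += Z[i][p] * Z[j][q] * dw * dw;
                    term3 += Wv[i][j] * Wf[p][q] * dz * dz;
                }

    return a0 * term0 + a1 * term1 + a2 * term2 + a3 * term3;
}

int main() {
    // Toy sizes: 2 words, 2 features; the values are arbitrary.
    Matrix Wv = {{1.0, 0.5}, {0.5, 1.0}};
    Matrix Wf = {{1.0, 0.2}, {0.2, 1.0}};
    Matrix Z  = {{0.7, 0.3}, {0.4, 0.6}};
    printf("omega = %f\n", objective(Wv, Wf, Z, Wf, Z, 0.25, 0.25, 0.25, 0.25));
}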
But can we express this with a different formulation, one that enables Laplacian smoothing and matrix operations?
The Laplacian matrix, sometimes called the admittance matrix or Kirchhoff matrix, of an undirected, unweighted graph G is an n x n symmetric matrix with one row and column for each node, defined by L = D - A, where D is the degree matrix (the diagonal matrix formed from the vertex degrees) and A is the adjacency matrix.
The normalized Laplacian of a graph is given entrywise by
L(i,j) = 1 if i == j and di != 0
       = -1 / sqrt(di * dj) if i and j are adjacent
       = 0 otherwise.
Applied to a function on the vertices, the Laplacian measures to what extent the value at one vertex differs from the values at nearby vertices.
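A small sketch of these two definitions on a toy adjacency matrix (the star graph is used only as an example):

#include <cmath>
#include <cstdio>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Unnormalized Laplacian L = D - A for an undirected, unweighted graph.
Matrix laplacian(const Matrix& A) {
    int n = (int)A.size();
    Matrix L(n, std::vector<double>(n, 0.0));
    for (int i = 0; i < n; ++i) {
        double degree = 0.0;
        for (int j = 0; j < n; ++j) degree += A[i][j];
        for (int j = 0; j < n; ++j) L[i][j] = (i == j ? degree : 0.0) - A[i][j];
    }
    return L;
}

// Normalized Laplacian: 1 on the diagonal for non-isolated vertices,
// -1/sqrt(di * dj) for adjacent i and j, and 0 otherwise.
Matrix normalizedLaplacian(const Matrix& A) {
    int n = (int)A.size();
    std::vector<double> deg(n, 0.0);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) deg[i] += A[i][j];
    Matrix L(n, std::vector<double>(n, 0.0));
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            if (i == j && deg[i] != 0.0) L[i][j] = 1.0;
            else if (A[i][j] != 0.0) L[i][j] = -1.0 / std::sqrt(deg[i] * deg[j]);
        }
    return L;
}

int main() {
    // A four-node star graph as the example.
    Matrix A = {{0,1,1,1},{1,0,0,0},{1,0,0,0},{1,0,0,0}};
    Matrix L = laplacian(A);
    Matrix Ln = normalizedLaplacian(A);
    for (int i = 0; i < 4; ++i)
        printf("L row %d: %5.2f %5.2f %5.2f %5.2f\n", i, L[i][0], L[i][1], L[i][2], L[i][3]);
}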
I'm going to take a break to discuss a topic that I've been asked to look into. A Python library is not able to make an SSL connection to an application that uses libssl. The error message received is 'The handshake operation timed out'. This happens because the server is not completing the handshake in some cases.
We first look at the steps involved in the TLS handshake. TLS and SSL are different in that they don't interoperate; TLS 1.0 is often referred to as SSL 3.1, but the differences don't matter much for our discussion, and TLS 1.0 has a way to fall back to SSL 3.0. The handshake we describe is responsible for the authentication and the key exchange necessary to establish or resume secure sessions. In establishing a secure session, the following steps are involved:
cipher suite negotiation
authentication of the server and optionally the client
session key information exchange
The first step refers to the negotiation of the cipher suite that both the client and the server will use for the duration of the session.
The second step is where the server proves its identity to the client; the client might also need to prove its identity to the server.
The third step refers to the exchange of random numbers and the selection of a special number called the pre-master secret, which helps create the master secret; that in turn is used to derive the MAC secret, the session key used for hashing, and the write key, the session key used for encryption.
After the initial client hello, server hello, and server hello done messages, the pre-master secret is generated by the client, and the master secret is then computed by both the client and the server. The client sends the change cipher spec notification to the server to indicate that subsequent messages will be encrypted.
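To see where such a handshake stalls, a minimal client-side sketch with a bounded receive timeout can help. This assumes OpenSSL and a POSIX socket; the endpoint and timeout are placeholders and error handling is abbreviated:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>
#include <openssl/err.h>
#include <openssl/ssl.h>

int main() {
    // Placeholder endpoint; substitute the server under investigation.
    const char* ip = "127.0.0.1";
    const int port = 443;

    SSL_library_init();
    SSL_load_error_strings();
    SSL_CTX* ctx = SSL_CTX_new(SSLv23_client_method());

    int fd = socket(AF_INET, SOCK_STREAM, 0);

    // Bound the handshake: if the server never completes its side,
    // SSL_connect fails with a read timeout instead of hanging.
    struct timeval tv;
    tv.tv_sec = 10;
    tv.tv_usec = 0;
    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);
    if (connect(fd, (struct sockaddr*)&addr, sizeof(addr)) != 0) {
        perror("connect");
        return 1;
    }

    SSL* ssl = SSL_new(ctx);
    SSL_set_fd(ssl, fd);
    if (SSL_connect(ssl) != 1) {         // performs the full TLS handshake
        ERR_print_errors_fp(stderr);     // e.g., the server never finished
    } else {
        printf("Handshake completed with cipher %s\n", SSL_get_cipher(ssl));
        SSL_shutdown(ssl);
    }

    SSL_free(ssl);
    close(fd);
    SSL_CTX_free(ctx);
    return 0;
}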
Cryptographic algorithms can vary. Some fall in the bucket of the Federal Information Processing Standards, also called the FIPS standards. These can be enabled or disabled with FIPS mode, which is often used to troubleshoot SSL issues. There is also a dedicated set of libraries designed to support cross-platform development of security-enabled client and server applications, with optional support for hardware SSL acceleration on the server side and smart cards on the client side; these are called the Network Security Services (NSS).