Saturday, September 6, 2014

In the post before last, I mentioned applying centrality to keyword extraction. There we said we could find the edges in the graph based on mutual information. Mutual information is based on co-occurrence, which helps when the words are repeated in the same form. But writers seldom repeat their words; they express their import using synonyms and varied language. In such cases, the function of the keywords is as important as their form. Therefore, we could use alternative metrics for pairwise relationships. One such metric could be based on an ontology: we look up two words to see whether they are similar. With WordNet, for example, we get a distance metric. Such a metric adds more relevance to the edges of the graph. Synonyms and antonyms could have a constant measure different from the default for unknown words. By adding a metric for import in addition to co-occurrence, we improve the ranking we get from centrality.
SemanticSimilarity semsim = new SemanticSimilarity();
    float score = semsim.GetScore(word1, word2);
                               
Here is a link to the implementation: http://www.codeproject.com/Articles/11835/WordNet-based-semantic-similarity-measurement
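To combine form and function, each edge could be weighted by both signals. Here is a minimal sketch, reusing the SemanticSimilarity class above; the MutualInformation helper and the equal weighting are assumptions for illustration:

static float GetEdgeWeight(string word1, string word2)
{
    SemanticSimilarity semsim = new SemanticSimilarity();
    float similarity = semsim.GetScore(word1, word2);  // function: WordNet-based similarity
    float mi = MutualInformation(word1, word2);        // form: co-occurrence based (hypothetical helper)
    return 0.5f * mi + 0.5f * similarity;              // assumed equal weighting
}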

A caveat: this is not a substitute for clustering or mining algorithms. For more on mining algorithms, see http://msdn.microsoft.com/en-us/library/ms175595.aspx

and they include the following algorithms:
Classification algorithms
Regression algorithms
Segmentation algorithms
Association algorithms
Sequence analysis algorithms

Friday, September 5, 2014

// find anagrams
using System;
using System.IO;

class MyClass {
    // for each index, print 1 if the two words are anagrams of each other, else 0
    public void check_anagrams(string[] firstWords, string[] secondWords) {
        if (firstWords == null ||
            secondWords == null ||
            firstWords.Length != secondWords.Length ||
            firstWords.Length == 0) return;
        for (int i = 0; i < firstWords.Length; i++)
        {
            char[] left = firstWords[i].ToCharArray();
            char[] right = secondWords[i].ToCharArray();
            Array.Sort<char>(left);
            Array.Sort<char>(right);
            if (new string(left) == new string(right))
                Console.WriteLine("1");
            else
                Console.WriteLine("0");
        }
    }
}

// find matching braces pairs
using System;
using System.Collections;

class MyClass {
    // for each expression, print 1 if all braces are balanced, else 0
    public void check_braces(string[] expressions) {
        foreach (var expression in expressions)
        {
            Stack s = new Stack();
            char[] tokens = expression.ToCharArray();
            int output = 1;
            foreach (var token in tokens)
            {
                switch (token)
                {
                    case '(':
                    case '{':
                    case '[':
                        s.Push(token);
                        break;
                    case ')':
                        if (s.Count == 0 || (char)s.Pop() != '(') output = 0;
                        break;
                    case '}':
                        if (s.Count == 0 || (char)s.Pop() != '{') output = 0;
                        break;
                    case ']':
                        if (s.Count == 0 || (char)s.Pop() != '[') output = 0;
                        break;
                }
            }
            if (s.Count > 0) output = 0; // unmatched opening braces remain
            Console.WriteLine("{0}", output);
        }
    }
}

Find the minimum integer X such that adjusting each element of a given unordered integer array by at most X (up or down) makes the array sorted. The answer is half the largest "drop" between an element and any smaller element that follows it, rounded up, since the larger element can be lowered by X while the smaller one is raised by X.

        static void Main(string[] args)
        {
            int[] v = new int[] { 5, 4, 1, 2, 3, 8 };
            hill(v);
        }
        static public void hill(int[] v)
        {
            if (v == null || v.Length == 0) return;
            // the largest drop is the biggest difference between an element
            // and any smaller element that appears after it
            int prefixMax = v[0];
            int diff = 0;
            for (int i = 1; i < v.Length; i++)
            {
                if (v[i] > prefixMax) prefixMax = v[i];
                if (prefixMax - v[i] > diff) diff = prefixMax - v[i];
            }
            // lowering the higher element by X and raising the lower by X
            // closes a drop of 2X, so X is half the largest drop, rounded up
            Console.WriteLine("{0}", (diff + 1) / 2);
        }
In today's post I discuss applying centrality to keyword extraction, specifically how to extract keywords using eigenvector centrality. This method gives us a way to measure the importance of the keywords. First, the words are represented as nodes in a graph. Then words that are connected to each other are given an edge in the graph. The edges are undirected because we only want the adjacency information. The connection between words is given by mutual information. Mutual information is based on pairwise grouping of words, and this fits right in with our connections. The mutual information is calculated as the probability of the co-occurrence times the logarithm of the ratio of the probability of co-occurrence to the product of the probabilities of the independent occurrences of the pair of words. With a threshold for the mutual information as a percentage of the range that it falls in, we can select the edges for the graph. This gives us a set of nodes and edges, which we represent with an adjacency matrix.
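As a minimal sketch of that calculation, with the probabilities assumed to be already estimated from the corpus:

// mutual information of a word pair: the probability of co-occurrence
// times the log of the ratio of that probability to the product of the
// independent probabilities
static double MutualInformation(double pxy, double px, double py)
{
    if (pxy == 0 || px == 0 || py == 0) return 0;
    return pxy * Math.Log(pxy / (px * py));
}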
Using the adjacency matrix A for a given graph with vertices V and edges E, we find the eigenvector centrality score of a vertex v as x_v = (1/lambda) * sum of x_t over the neighbors t of v, which in vector notation is A x = lambda x. Lambda is an eigenvalue, and many eigenvalue/eigenvector pairs exist; the eigenvector for the largest eigenvalue gives the importance of each node in the network, and it can be found by power iteration.
Power iteration works even for very large matrices and finds the dominant eigenvector of A x = lambda x.
By iteration we mean we start with an approximate eigenvector b0.
In each iteration, we multiply the eigenvector by the adjacency matrix and normalize it. For each iteration, the following is suggested:
initialize vector b, tolerance, and n = norm(b)
while (deviation is greater than tolerance)
    temporary_vector = A * b
    new_n = norm(temporary_vector)
    deviation = abs(new_n - n)
    n = new_n
    b = temporary_vector / n
return vector b
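
A minimal C# sketch of this loop, assuming the adjacency matrix A and a tolerance are given:

static double[] PowerIteration(double[,] A, double tolerance)
{
    int N = A.GetLength(0);
    double[] b = new double[N];
    for (int i = 0; i < N; i++) b[i] = 1.0;   // initial approximation b0
    double n = Math.Sqrt(N);                  // norm of the initial vector
    double deviation = double.MaxValue;
    while (deviation > tolerance)
    {
        double[] tmp = new double[N];         // temporary_vector = A * b
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                tmp[i] += A[i, j] * b[j];
        double newNorm = 0;
        for (int k = 0; k < N; k++) newNorm += tmp[k] * tmp[k];
        newNorm = Math.Sqrt(newNorm);
        deviation = Math.Abs(newNorm - n);    // change in the norm
        n = newNorm;
        for (int k = 0; k < N; k++) b[k] = tmp[k] / n;
    }
    return b; // approximates the dominant eigenvector
}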

Thursday, September 4, 2014

In this post, I talk about a protocol, particularly one called Border Gateway Protocol (BGP), because it is going to come up soon for a discussion. This is the protocol that lets autonomous systems (AS) on the Internet communicate with each other. An autonomous system is a network, or group of networks, under a single administrative control, such as an Internet service provider or a large organization. BGP succeeded the Exterior Gateway Protocol. The protocol can also be used internally within an AS, in which case it is called interior BGP. BGP is currently the exterior routing protocol most widely used by ISPs. It came from the need to provide routing in a decentralized and scalable way. Essentially both interior and exterior gateway protocols are about IP routing; interior routing commonly uses OSPF, while exterior routing uses BGP.
BGP is a path vector protocol that provides routing information with its AS-path attribute. AS numbers (ASNs) identify each registered autonomous system, so routes can be advertised in bulk with the path of ASNs they traverse. The autonomous systems exchange information as peers, and since they are decentralized, their configuration is based on a two-way agreement. BGP exchanges routing information between peers over TCP connections. Peers generally do not import routes whose AS-path already contains their own ASN, because that would cause a loop. The peers that participate in the BGP protocol can be hijacked, which led to some interesting cyber attacks a few years ago.
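That loop-prevention rule can be stated as a minimal sketch; the representation of the AS-path as a list of ASNs is just for illustration, not a real BGP API:

// reject a route whose AS-path already contains our own ASN, since
// importing it would create a routing loop
static bool ShouldImportRoute(int ownAsn, List<int> asPath)
{
    return !asPath.Contains(ownAsn);
}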

When we talk about protocols, we generally see implementations that register handlers for each protocol. Even HTTP has a protocol handler.
The signature of most protocol handlers is very similar to that of, say, a .NET event handler:
public delegate void EventHandler(
 Object sender,
 EventArgs e
);

Different protocols are registered to form a stack of handlers. As a packet flows through this stack, the corresponding protocol headers are stripped off or added, and the packet is sent up for processing or down to the media for relay.
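As a minimal sketch of that idea (the types here are hypothetical, not a real networking API), the stack can be modeled as an ordered list of layers, each stripping its own header on the way up:

using System;
using System.Collections.Generic;

class ProtocolStack
{
    private readonly List<Func<byte[], byte[]>> layers = new List<Func<byte[], byte[]>>();

    // each registered layer removes its own header and returns the payload
    public void Register(Func<byte[], byte[]> stripHeader)
    {
        layers.Add(stripHeader);
    }

    public void Receive(byte[] packet)
    {
        foreach (var strip in layers)
            packet = strip(packet); // remove this layer's header
        // packet now holds the application payload
    }
}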

The reason we have protocol handlers is that we want the processing to be state driven. Note that most protocols have a state diagram, and processing a packet essentially moves it from one state to another. Event handlers therefore fit right in as state handlers. That is why most protocols have handlers associated with them.

One thing I would additionally like to bring up when discussing protocols is that networking protocol handlers are often implemented in the kernel of the operating system. Consequently they are device drivers, and most device drivers have routines to take care of the different state transitions.
Here is another interview question: how do we convert a binary tree to a heap?

// collect the tree's values with a preorder traversal
function visitNode(root, data) {
    if (root) {
        data.push(root.val);
        visitNode(root.left, data);
        visitNode(root.right, data);
    }
}

// build a complete binary tree from a sorted array; placing the values
// in level order makes every parent smaller than its children (a min-heap)
function heapify(data) {
    if (!data || data.length === 0) return null;
    var nodes = data.map(function (v) {
        return { val: v, left: null, right: null };
    });
    // the children of the node at index i are at indices 2i+1 and 2i+2
    for (var i = 0; 2 * i + 1 < nodes.length; i++) {
        nodes[i].left = nodes[2 * i + 1];
        if (2 * i + 2 < nodes.length) nodes[i].right = nodes[2 * i + 2];
    }
    return nodes[0];
}

var data = [];
visitNode(root, data);
data.sort(function (a, b) { return a - b; });
var heapRoot = heapify(data);

Wednesday, September 3, 2014


We return to our discussion of one easy way to find the importance of a node in a network: eigenvector centrality.

For a given graph with vertices V and edges E and adjacency matrix A, the centrality score of a vertex v is x_v = (1/lambda) * sum of x_t over the neighbors t of v, which in vector notation is A x = lambda x. Lambda is an eigenvalue, and many eigenvalue/eigenvector pairs exist; the eigenvector for the largest eigenvalue gives the importance of each node in the network, and it can be found by power iteration.

Power iteration works even for very large matrices and finds the dominant eigenvector of A x = lambda x.
By iteration we mean we start with an approximate eigenvector b0.

In each iteration, we multiply the eigenvector by the adjacency matrix and normalize it. For each iteration, the following is suggested:
// one iteration of the power method: b <- A*b / ||A*b||
{
    double tmp[N];
    for (int i = 0; i < N; i++) {
        tmp[i] = 0;
        for (int j = 0; j < N; j++)
            tmp[i] += A[i][j] * b[j];
    }
    double normalized_square = 0;
    for (int k = 0; k < N; k++)
    {
        normalized_square += tmp[k] * tmp[k];
    }
    double normalized = sqrt(normalized_square);
    for (int k = 0; k < N; k++)
        b[k] = tmp[k] / normalized;
}

We will quickly review the implementation of the eigenvector computation.

The implementation of the power method varies in the normalization step in some cases.
 A working implementation can be found here (https://github.com/ravibeta/csharpexamples/blob/master/PowerMethod/PowerMethod/Program.cs)

Tuesday, September 2, 2014

#include <cstdio>
#include <queue>
using namespace std;

struct Node { const char* data; Node* left; Node* right; };

// print the tree level by level, using a NULL marker to end each level
void printTreeNodesLevelWise(Node* node)
{
    if (!node) return;
    queue<Node*> q;
    q.push(node);
    q.push(NULL); // marks the end of the first level
    while (!q.empty())
    {
        Node* root = q.front();
        q.pop();
        if (root)
        {
            printf("%s ", root->data);
            if (root->left) q.push(root->left);
            if (root->right) q.push(root->right);
        }
        else
        {
            printf("\n");
            if (!q.empty()) q.push(NULL); // mark the end of the next level
        }
    }
}


#include <iostream>
using namespace std;

// count the occurrences of val in a sorted array using divide and conquer
int GetDuplicatesCount(int num[], int val, int start, int end);

int main()
{
   int abc[8] = {1,2,2,3,3,3,5,6};
   cout << GetDuplicatesCount(abc, 3, 0, 7) << endl;
   return 0;
}

int GetDuplicatesCount(int num[], int val, int start, int end)
{
    if (start > end) return 0;
    if (start == end) return num[start] == val ? 1 : 0;
    if (num[start] > val || num[end] < val) return 0; // val cannot be in this range
    if (num[start] == val && num[end] == val) return end - start + 1;

    int mid = (start + end) / 2;
    // narrow the range toward val before splitting
    if (num[start] < val && start < mid) start++;
    if (num[end] > val && mid < end) end--;
    return GetDuplicatesCount(num, val, start, mid) +
           GetDuplicatesCount(num, val, mid + 1, end);
}
The recursion narrows and splits the ranges as follows for val = 3:
{1,2,2,3,3,3,5,6} -> {2,2,3} {3,3,5}
{2,2,3} -> {2} {3} and {3,3,5} -> {3,3} {}
count = 0 + 1 + 2 + 0 = 3