Cluster computing

Friday, September 5, 2014

In today's post I will discuss applying centrality to keyword extraction. In this article, I try to explain how to extract keywords using eigenvector centrality. This method gives us a way to measure the importance of the keywords. First, the words are represented as nodes in a graph. Then words that are connected to each other are given an edge in the graph. The edges are undirected because we only want the adjacency information. The connection between words is given by mutual information. Mutual information is based on pairwise grouping of words and this fits right in with our connections. The mutual information is calculated as the probability of the co-occurrence times the logarithm of the ratio of the probability of co-occurrence to the probabilities of the independent occurrences of the pair of words. With a threshold for the mutual information as a percentage of the range that it falls in, we can select the edges for the graph. This gives us a set of nodes and edges which we represent with an adjacency matrix.

Using the adjacency matrix for a given graph with vertex V and edges E, we find the eigenvector centrality score of the vertex v as 1/ lambda Sum (Xt) which can be written in vector notation as Ax = Lambda X. Lambda is the eigenvalue for which many different eigenvector exist. The highest one gives the maximum importance of the node in the network and is said to be found by power iteration.

Power iteration can work for very large matrix and finds the highest eigenvector for Ax = Lambdax

By iteration we mean we start with an approximate eigenvector b0

In each iteration, we multiply the eigenvector with the adjacency matrix and normalize it. For each iteration, the following is suggested:

Initialize vector b, deviation and n

While ( deviation is greater than tolerance)

Temporary_vector = A * b

Deviation = abs(norm(b)) – n

n = norm(x)

b = Temporary_vector / n

return vector b

Thursday, September 4, 2014

In this post, I talk about the protocol particularly one called Border Gateway Protocol because its gonna come up soon for a discussion. This is the one that lets autonomous systems on the internet communicate to each other. Generally that means countries but it could also be private such as by Internet service providers. It succeeded the Exterior Gateway Protocol . The protocol could also be used internally within the AS in which case its called the interior gateway protocol. BGP is currently the most widely used protocol by ISPs. It came from the need to provide routing in a decentralized and scalable way. Essentially both internal and external gateway protocols are about IP routing. While internal uses the OSPF, the external uses the BGP.
BGP is the path vector protocol that provides the routing information with its AS-path attribute. ASNs were invented to transmit the routes in bulk because the AS were registered and they were numbered. The AS exchange information as peers and they are decentralized so their configuration is based on a two way agreement. BGP exchanges routing information between peers with TCP connections. Peers don't generally import the AS-path that have their own references because that would cause a loop. These peers that participate in the BGP protocols can be hijacked which led to some interesting cyber attacks a few years earlier.

When we talk about protocols, we generally see implementations using handlers registered for protocols. Even HTTP has a protocol handler.
The signature of most protocol handlers is very similar to that of say a .Net Event handler :
public delegate void EventHandler(
Object sender,
EventArgs e
);

Different protocols are registered to form a stack of handlers. As a packet flows through this stack, the corresponding protocol headers are stripped or slapped and sent up for processing or down to the media for relay.

The reason we have the protocol handlers is because we want it to be state driven. Note that most protocols have a state diagram and the processing of the packets is essentially moving it from one state to the other. Event handlers therefore fit right in as state handlers. That is why most of the protocols have handlers associated.

One of the things I would like to additionally bring up when discussing protocols is that the networking protocol handlers are often implemented in the kernel of the operating systems. Consequently they are device drivers and most device drivers have routines to take care of different state transitions.

This is another interview question, How do we convert a binary tree to a heap ?

function visitNode(root, data) {
if (root)
{
data.push(root.val);
visitNode(root.left, data);
visitNode(root.right, data);
}
}

function heapify(data) {
if (data && data.length > 0)
{
int i = 0;
var root = {val: data[0], left : null, right: null};
var parent = null;
for (int i = 0; i <= data.length / 2; i++)
{
root.left = {val: data[2i+1], left : null, right: null};
root.right = {val: data[2i+2], left : null, right: null};
If (parent)
{
if ( i % 2 == 0) parent.right = root;
else parent.left = root;
}
parent = root;
root = {val: data, left : null, right: null};

}

}

data = []
visitNode(root, data);
sort(data);
heapify(data);

Wednesday, September 3, 2014

We revert to our discussion on one easy way to finding the importance of a node in a network.
This is the eigenvector centrality.

For a given graph with vertex V and edges E and adjacency matrix A, the centrality score of the vertex v is computed as 1/ lambda Sum (Xt) which can be written in vector notation as Ax = Lambda X. Lambda is the eigenvalue for which many different eigenvector exist. The highest one gives the maximum importance of the node in the network and is said to be found by power iteration.

Power iteration can work for very large matrix and finds the highest eigenvector for Ax = Lambdax
By iteration we mean we start with an approximate eigenvector b0

In each iteration, we multiply the eigenvector with the adjacency matrix and normalize it. For each iteration, the following is suggested:
{
for (int i = 0; i < N; i++) {
    double tmp[i] = 0;
    for (j = 0; j < N; j++)
         tmp[i] += A[i][j] * b[j]
}
double normalized_square = 0;
for (int k = 0; k < n; k++)
{
     normalized_square += tmp[k] * tmp[k];
}

double normalized = sqrt(normalized_square);
b = tmp/normalized;
}

we will quickly review the implementation of the EigenVector

The implementation for power method varies in the normalization step in some cases.
A working implementation can be found here (https://github.com/ravibeta/csharpexamples/blob/master/PowerMethod/PowerMethod/Program.cs)

Tuesday, September 2, 2014

static void printTreeNodesLevelWise(const Node* node) const
{
    Node* root = node;
    queue<Node*> q = new queue<Node*>();
    if (!root) return;
    q.push_back(root);
    q.push_back(NULL);
    root = q.pop_back();
   while (root)
   {
      if (root->left) q.push_back(root->left);
      if (root->right) q.push_back(root->right);
      root = q.pop_front();
      if (root)
          printf("%s ", root->data);
     else
     {
         printf ("/n");
         q.push_back(root);
         root = q.pop_front();
      }
    }
}

#include <iostream>
using namespace std;
int GetDuplicatesCount(int num[], int val, int start, int end);
int main()
{
   cout << "Hello World" << endl;
   int abc[8] = {1,2,2,3,3,3,5,6};
   printf("%d", GetDuplicatesCount(abc, 3, 0, 7));
   return 0;
}
int GetDuplicatesCount(int num[], int val, int start, int end)
{
    if (start > end) return 0;
    if (start == end && num[start] == val ) return 1;
    if (start == end) return 0;
    if (num[start] > num[end]) return 0;
    if (num[start] == val && num[start] == val) return end - start + 1;

    int mid = (start + end) / 2
    if (num[mid] < val) start = mid;
    if (num[mid] > val) end = mid;
    if (num[start] < val && start < mid) start++;
    if (num[end] > val && mid < end) end--;
    return GetDuplicatesCount(num, val, start, mid) + GetDuplicatesCount(num, val, mid+1, end);
}
{2,2,3} {3,3,5}
{2} {2,3} {3,3}, {5}
{2} {2} {3} {3,3}, {5}

Today I want to discuss a few algorithms from the Algorithms and Data Structures. We will quickly review tree and graph algorithms.
To delete a node from a binary search tree:
we have to consider three cases:
If the current has no right child, then the current's left child becomes the node pointed to by the parent.
If the current's right child has no left child, then the current's right child replaces the current in the tree
If the current's right child has a left child, replace current with the current's right child's left-most node.
To insert a node to a binary search tree, insert as a leaf with the check for the parent.
An AVL is a self balancing binary search tree. AVL trees maintain the property that the height of the left and right subtree must differ by at most 1.
If a node has no sibling, it's height is considered -1 for that subtree.
To insert a node into the AVL tree, insert the node just as in the BST. Then as a next step, if the height is violated, rotate the tree to adjust the height at the node where the violation is detected. Check each node for violation up the parent chain because one rotation may not be sufficient.
If the node with the violation is A, then
If a node is inserted into the left subtree of the left child of A, then there is one rotation
If a node is inserted into the right subtree of the right child of A, then there is another rotation
If a node is inserted into the left subtree of the right child of A, then there is double rotation
If a node is inserted into the right subtree of the left child of A, then there is double rotation.
To delete a node from an AVL tree, is more involved than one rotation because each of the nodes in the parent chain needs to be checked for violations since imbalance may propagate upwards.
An AVL tree comes in useful when subsequent operations access the recently added node such as in the splay trees. Splaying means moving the node to the root by rotations. The tree remains roughly balanced.
A Red-black tree satisfies the additional properties that
Every node is either red or black.
The root and the leaves are black
If a node is red, then its children are black.
For each node, all simple paths from the node to the descendant leaves contain the same number of black nodes.
To insert a node in the red-black tree, insert as in a BST and then fixup the colors with three cases.
case 1 is when we recolor the nodes.
case 2 is when we left rotate the nodes
case 3 is when we right rotate the nodes.
To delete a node in the red-black tree, delete as in a BST and then fixup the colors with four cases
with one more case for the double rotation. Reverse the operations for right and left for the other subtree. In insertion we look up at the parent's sibling and in deletion we look at the sibling of the node and the children of the node for their color.
Graph algorithms are briefly enumerated below.
Breadth first search algorithms traverse reachable vertices with the smallest number of edges. Eg
Dijkstra's single source shortest path algorithm and
Prim's minimum spanning tree
Dijkstra's algorithm extracts the minimum shortest path estimate vertex and adds it to the set of vertices maintained. It then relaxes all the edges outbound from that vertex.
Prim's algorithm builds a minimum spanning tree by adding edges to the tree in a way that is safe and minimum cost for the tree.
Depth first search explores vertex outbound from the most recently added vertex. The nodes are initially white. We color the nodes gray when we discover them and after we have explored the edges, we color them black. The nodes are thus found as in a parenthesis structure and intervals of nodes can either be disjoint, descendant and contained one within the other.
We now look at some data structures.
a hash_map is an associative container that indexes based on the hash of the contained elements as opposed to the map that uses the less than operation on the elements contained. Collisions are resolved to buckets.
A multimap is like a map except that it allows duplicate entries. The range of duplicate entries can be looked up with equal_range. A multiset is a set that allows duplicate keys. Valarrays and bitset are specialized containers.

Monday, September 1, 2014

Some interview questions I came across online from Hanselman's blog and posting quick (and probably not sufficient ) answers below:
Q: From constructor to destructor (taking into consideration Dispose() and the concept of non-deterministic finalization), what the are events fired as part of the ASP.NET System.Web.UI.Page lifecycle. Why are they important? What interesting things can you do at each?
A: Page goes through Request -> Start-> Initialization-> Load->Validation->PostbackEventHandling->Rendering->Unloading.
Events fired are
PreInit() Check IsPostBack property to determine if the page is being processed the first time
Init () raised after all controls have been initialized and used to initialize control properties
InitComplete() raised by the Page object for initialization completion
PreLoad after this the page loads view state for itself and all controls.
Load corresponding to OnLoad event, Page recursively loads all controls.This is used to establish database connections
Control events: eg. TextChanged Button's Click events
LoadComplete: used for anything else to be loaded.
PreRender - Used to make final changes to the contents of the page or its controls.
SaveStateComplete - at this point viewstate has been saved.
Render - Page calls this method on each control
Unload - calls cleanup in the reverse sequence - controls, controls specific database connections, page itself, logging and request specific tasks.

Q: What are ASHX files? What are HttpHandlers? Where can they be configured?
ashx files are handlers, ascx files are controls, asax files are state handlers, asmx is for web services, aspx is for pages
Builtin http handlers target the above ashx aspx asmx trace.axd file extensions.
They are configured in IIS using Add/Edit Application Extension Mapping.
HttpHandlers' ProcessRequest method is invoked to return the response back.

Q: What is needed to configure a new extension for use in ASP.NET? For example, what if I wanted my system to serve ASPX files with a *.jsp extension?
First you need to map the extension in IIS and then map the custom handler to the extension in the application.The handler can also be reused for different extension mappings.

Q: What events fire when binding data to a data grid? What are they good for?
Page.DataBinding and Control.DataBinding methods bind the data to the data source.

Q: Explain how PostBacks work, on both the client-side and server-side. How do I chain my own JavaScript into the client side without losing PostBack functionality?
PostBacks are for posting back to the server some information such as login credentials or a selection from a control to retrieve the display to be shown Checking for the post back helps be efficient int he code. The client side posts back on page level as well as control level and is configured by Auto PostBack property of the controls. Callback events are raised for chaining.

Q: How does ViewState work and why is it either useful or evil?
ViewState is used to persist state across postbacks and is generally used to store any programmatic changes to the page's state. It can be useful for storing information but it can also be misused to stuff anything.The StateBag is like a Hashtable

Q: What is the OO relationship between an ASPX page and its CS/VB code behind file in ASP.NET 1.1? in 2.0?
The code behind is the Page-Controller pattern. You can specify the handlers in the code behind for the declarations of the page.

Q: What happens from the point an HTTP request is received on a TCP/IP port up until the Page fires the On_Load event?
Request goes through the HTTP pipeline ( IIS and ASP.Net) , passed through the http modules and handlers before the response is returned.

Q: How does IIS communicate at runtime with ASP.NET? Where is ASP.NET at runtime in IIS5? IIS6?
The request is dispatched to the asp.net engine via the aspnet_isapi.dll

Q: What is an assembly binding redirect? Where are the places an administrator or developer can affect how assembly binding policy is applied?
assembly binding redirect is useful to resolve an assembly by name at different locations. The locations are determined based on CAS policy. The policy can be set at AppDomain, application , machine and enterprise level. The first is by code and the rest are by config files.

Q: Compare and contrast LoadLibrary(), CoCreateInstance(), CreateObject() and Assembly.Load().
These are module initialization routines for native, COM, .Net and Assembly respectively.