Thursday, September 4, 2014

This is another interview question: how do we convert a binary tree to a heap? One approach is to collect the values with a traversal, sort them, and rebuild a complete tree whose array order is the sorted order.

function visitNode(root, data) {
    // pre-order traversal that collects every value into data
    if (root) {
        data.push(root.val);
        visitNode(root.left, data);
        visitNode(root.right, data);
    }
}

function heapify(data) {
    // data is sorted ascending, so laying it out as a complete binary
    // tree (children of index i at 2i+1 and 2i+2) yields a min-heap
    if (!data || data.length === 0) return null;
    var nodes = data.map(function (val) {
        return { val: val, left: null, right: null };
    });
    for (var i = 0; 2 * i + 1 < data.length; i++) {
        nodes[i].left = nodes[2 * i + 1];
        if (2 * i + 2 < data.length) {
            nodes[i].right = nodes[2 * i + 2];
        }
    }
    return nodes[0];
}

var data = [];
visitNode(root, data);
data.sort(function (a, b) { return a - b; });
var heapRoot = heapify(data);
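
The reason sorting first works: when an array is read as a complete binary tree, the children of index i sit at indices 2i+1 and 2i+2, so a sorted array automatically satisfies the min-heap property. A small C++ check of that array-level property (the function name isMinHeap is mine, just for illustration):

#include <vector>
using namespace std;

// Returns true if data, read as a complete binary tree, satisfies the
// min-heap property: data[i] <= data[2i+1] and data[i] <= data[2i+2].
bool isMinHeap(const vector<int>& data)
{
    for (size_t i = 0; 2 * i + 1 < data.size(); i++)
    {
        if (data[i] > data[2 * i + 1]) return false;
        if (2 * i + 2 < data.size() && data[i] > data[2 * i + 2]) return false;
    }
    return true;   // a sorted array always passes this check
}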

Wednesday, September 3, 2014


We return to our discussion of one easy way to find the importance of a node in a network.
This is the eigenvector centrality.

For a given graph with vertices V, edges E, and adjacency matrix A, the centrality score of a vertex v is x_v = (1/lambda) * sum of the scores x_t of the vertices t adjacent to v, which can be written in vector notation as A x = lambda x. In general there are many eigenvalue/eigenvector pairs; the eigenvector belonging to the largest eigenvalue gives the centrality scores for the nodes in the network, and it can be found by power iteration.

Power iteration works even for very large matrices and finds the dominant eigenvector of A x = lambda x.
We start with an approximate eigenvector b0 and refine it iteratively.

In each iteration, we multiply the current vector by the adjacency matrix and normalize the result. One iteration looks like the following (A, b and N are assumed declared elsewhere):
{
    // one power-iteration step: tmp = A * b, then b = tmp / ||tmp||
    double tmp[N];
    for (int i = 0; i < N; i++) {
        tmp[i] = 0;
        for (int j = 0; j < N; j++)
            tmp[i] += A[i][j] * b[j];
    }

    double normalized_square = 0;
    for (int k = 0; k < N; k++)
        normalized_square += tmp[k] * tmp[k];

    double normalized = sqrt(normalized_square);
    for (int k = 0; k < N; k++)
        b[k] = tmp[k] / normalized;
}

We will quickly review the implementation of the eigenvector computation.

Implementations of the power method vary mainly in the normalization step.
 A working implementation can be found here (https://github.com/ravibeta/csharpexamples/blob/master/PowerMethod/PowerMethod/Program.cs)
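
As a rough, self-contained sketch of the power method (this is illustrative only and is not the linked C# program; the sample matrix, tolerance, and iteration cap below are made-up values):

#include <cmath>
#include <cstdio>
#include <vector>
using namespace std;

// Power iteration: repeatedly apply A to a start vector and normalize.
// For a suitable matrix this converges to the eigenvector of the
// dominant (largest magnitude) eigenvalue.
vector<double> powerIteration(const vector<vector<double> >& A,
                              int maxIterations, double tolerance)
{
    size_t n = A.size();
    vector<double> b(n, 1.0);                  // initial guess b0
    for (int iter = 0; iter < maxIterations; iter++)
    {
        vector<double> tmp(n, 0.0);
        for (size_t i = 0; i < n; i++)         // tmp = A * b
            for (size_t j = 0; j < n; j++)
                tmp[i] += A[i][j] * b[j];

        double norm = 0.0;                     // Euclidean norm of tmp
        for (size_t k = 0; k < n; k++)
            norm += tmp[k] * tmp[k];
        norm = sqrt(norm);

        double change = 0.0;
        for (size_t i = 0; i < n; i++)
        {
            tmp[i] /= norm;                    // normalize
            change += fabs(tmp[i] - b[i]);
        }
        b = tmp;
        if (change < tolerance) break;         // converged
    }
    return b;
}

int main()
{
    // adjacency matrix of a small graph: a triangle plus one pendant vertex
    double raw[4][4] = { {0, 1, 1, 0}, {1, 0, 1, 0}, {1, 1, 0, 1}, {0, 0, 1, 0} };
    vector<vector<double> > A(4, vector<double>(4));
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            A[i][j] = raw[i][j];
    vector<double> centrality = powerIteration(A, 100, 1e-9);
    for (size_t i = 0; i < centrality.size(); i++)
        printf("vertex %d: %f\n", (int)i, centrality[i]);
    return 0;
}

The vertex that sits in the triangle and also connects to the pendant vertex ends up with the highest score, which matches the intuition behind eigenvector centrality.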

Tuesday, September 2, 2014

#include <cstdio>
#include <queue>
using namespace std;

struct Node {
    const char* data;
    Node* left;
    Node* right;
};

// Prints the tree level by level; a NULL marker in the queue
// separates one level from the next.
static void printTreeNodesLevelWise(Node* root)
{
    if (!root) return;
    queue<Node*> q;
    q.push(root);
    q.push(NULL);                           // end of the first level
    while (!q.empty())
    {
        Node* current = q.front();
        q.pop();
        if (current)
        {
            printf("%s ", current->data);
            if (current->left) q.push(current->left);
            if (current->right) q.push(current->right);
        }
        else
        {
            printf("\n");
            if (!q.empty()) q.push(NULL);   // marker for the next level
        }
    }
}


#include <cstdio>
#include <iostream>
using namespace std;
int GetDuplicatesCount(int num[], int val, int start, int end);
int main()
{
    cout << "Hello World" << endl;
    int abc[8] = {1, 2, 2, 3, 3, 3, 5, 6};
    printf("%d", GetDuplicatesCount(abc, 3, 0, 7));
    return 0;
}
// Counts how many times val occurs in the sorted array num[start..end]
// by narrowing the range and splitting it binary-search style.
int GetDuplicatesCount(int num[], int val, int start, int end)
{
    if (start > end) return 0;
    if (start == end) return num[start] == val ? 1 : 0;
    if (num[start] > num[end]) return 0;
    if (num[start] == val && num[end] == val) return end - start + 1;

    int mid = (start + end) / 2;
    if (num[mid] < val) start = mid;
    if (num[mid] > val) end = mid;
    if (num[start] < val && start < mid) start++;
    if (num[end] > val && mid < end) end--;
    return GetDuplicatesCount(num, val, start, mid) + GetDuplicatesCount(num, val, mid + 1, end);
}
The recursion for the example above (counting val = 3) splits the ranges roughly as follows:
{2,2,3} {3,3,5}
{2} {2,3} {3,3} {5}
{2} {2} {3} {3,3} {5}
 
Today I want to discuss a few topics from algorithms and data structures. We will quickly review tree and graph algorithms.
To delete a node from a binary search tree we have to consider three cases (a sketch follows the list):
If the current node has no right child, its left child takes its place as the child of the parent.
If the current node's right child has no left child, that right child replaces the current node in the tree.
If the current node's right child has a left child, replace the current node with the left-most node of that right subtree.
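A minimal C++ sketch of those three cases, assuming a simple node with val, left, and right pointers (the struct and function names below are mine, just for illustration):

struct BstNode {
    int val;
    BstNode* left;
    BstNode* right;
};

// Removes the node holding val from the subtree and returns the new subtree root.
// Case 1: no right child -> the left child takes the node's place.
// Case 2: the right child has no left child -> the right child replaces the node.
// Case 3: otherwise -> the left-most node of the right subtree replaces the node.
BstNode* deleteNode(BstNode* root, int val)
{
    if (!root) return NULL;
    if (val < root->val) { root->left = deleteNode(root->left, val); return root; }
    if (val > root->val) { root->right = deleteNode(root->right, val); return root; }

    if (!root->right)                          // case 1
    {
        BstNode* replacement = root->left;
        delete root;
        return replacement;
    }
    if (!root->right->left)                    // case 2
    {
        BstNode* replacement = root->right;
        replacement->left = root->left;
        delete root;
        return replacement;
    }
    // case 3: detach the left-most node of the right subtree and splice it in
    BstNode* parent = root->right;
    while (parent->left->left) parent = parent->left;
    BstNode* successor = parent->left;
    parent->left = successor->right;
    successor->left = root->left;
    successor->right = root->right;
    delete root;
    return successor;
}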
To insert a node into a binary search tree, walk down from the root comparing keys and insert the node as a leaf under the last parent visited.
An AVL is a self balancing binary search tree. AVL trees maintain the property that the height of the left and right subtree must differ by at most 1.
When computing that balance, an empty (missing) subtree is considered to have height -1.
To insert a node into the AVL tree, insert the node just as in the BST. Then as a next step, if the height is violated, rotate the tree to adjust the height at the node where the violation is detected. Check each node for violation up the parent chain because one rotation may not be sufficient.
If the node with the violation is A, the four cases are as follows (a left-rotation sketch is given after the list):
If a node is inserted into the left subtree of the left child of A, a single (right) rotation fixes it.
If a node is inserted into the right subtree of the right child of A, a single (left) rotation fixes it.
If a node is inserted into the left subtree of the right child of A, a double rotation (right, then left) is needed.
If a node is inserted into the right subtree of the left child of A, a double rotation (left, then right) is needed.
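As a hedged illustration of one of these fixes, a single left rotation (the right-right case) might look like the sketch below; the right rotation is its mirror image, and the double rotations compose the two (the struct and helper names are mine):

#include <algorithm>
using namespace std;

struct AvlNode {
    int val;
    int height;                 // height of the subtree rooted at this node
    AvlNode* left;
    AvlNode* right;
};

int heightOf(AvlNode* n) { return n ? n->height : -1; }   // empty subtree counts as -1

// Single left rotation around x: x's right child y becomes the new subtree root.
AvlNode* rotateLeft(AvlNode* x)
{
    AvlNode* y = x->right;
    x->right = y->left;
    y->left = x;
    x->height = 1 + max(heightOf(x->left), heightOf(x->right));
    y->height = 1 + max(heightOf(y->left), heightOf(y->right));
    return y;                   // caller re-attaches this as the child of x's old parent
}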
Deleting a node from an AVL tree is more involved than a single rotation because every node on the parent chain needs to be checked for violations, since the imbalance may propagate upwards.
A splay tree comes in useful when subsequent operations tend to access recently accessed nodes. Splaying means moving the accessed node to the root by rotations; the tree remains roughly balanced in an amortized sense.
A Red-black tree satisfies the additional properties that
Every node is either red or black.
The root and the leaves (the NIL sentinels) are black.
If a node is red, then its children are black.
For each node, all simple paths from the node to the descendant leaves contain the same number of black nodes.
To insert a node in the red-black tree, insert as in a BST and then fixup the colors with three cases.
case 1 is when we recolor the nodes.
case 2 is when we left rotate the nodes
case 3 is when we right rotate the nodes.
To delete a node from a red-black tree, delete as in a BST and then fix up the colors with four cases,
one of which requires a double rotation. The operations are mirrored (left for right) when the node falls in the other subtree. During insertion we look at the color of the parent's sibling; during deletion we look at the sibling of the node and that sibling's children.
Graph algorithms are briefly enumerated below.
Breadth-first search visits reachable vertices in order of the smallest number of edges from the source. Two algorithms built on a similar frontier-growing idea are:
Dijkstra's single-source shortest path algorithm and
Prim's minimum spanning tree algorithm.
Dijkstra's algorithm repeatedly extracts the vertex with the minimum shortest-path estimate, adds it to the set of finished vertices, and then relaxes all the edges leaving that vertex (see the sketch below).
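A typical textbook sketch of Dijkstra's algorithm in C++, using a priority queue as the min-heap (this is a generic version, not tied to any particular source; edge weights are assumed non-negative):

#include <functional>
#include <limits>
#include <queue>
#include <vector>
using namespace std;

// adj[u] holds (neighbor, weight) pairs; returns shortest distances from source.
vector<long long> dijkstra(const vector<vector<pair<int, long long> > >& adj, int source)
{
    const long long INF = numeric_limits<long long>::max();
    vector<long long> dist(adj.size(), INF);
    // min-heap ordered by the current shortest-path estimate
    priority_queue<pair<long long, int>,
                   vector<pair<long long, int> >,
                   greater<pair<long long, int> > > pq;
    dist[source] = 0;
    pq.push(make_pair(0LL, source));
    while (!pq.empty())
    {
        long long d = pq.top().first;          // extract the minimum-estimate vertex
        int u = pq.top().second;
        pq.pop();
        if (d > dist[u]) continue;             // stale queue entry
        for (size_t k = 0; k < adj[u].size(); k++)
        {
            int v = adj[u][k].first;           // relax every outbound edge (u, v)
            long long w = adj[u][k].second;
            if (dist[u] + w < dist[v])
            {
                dist[v] = dist[u] + w;
                pq.push(make_pair(dist[v], v));
            }
        }
    }
    return dist;
}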
Prim's algorithm builds a minimum spanning tree by repeatedly adding a safe, minimum-weight edge that connects the growing tree to a new vertex.
Depth-first search explores edges out of the most recently discovered vertex. The nodes are initially white; we color a node gray when we discover it and black after all its edges have been explored. The discovery and finish times form a parenthesis structure: the intervals of two vertices are either disjoint or nested, one contained within the other.
We now look at some data structures.
A hash_map is an associative container that indexes elements by their hash, as opposed to a map, which orders elements using the less-than operation. Collisions are resolved into buckets.
A multimap is like a map except that it allows duplicate keys. The range of duplicate entries can be looked up with equal_range (see the example below). A multiset is a set that allows duplicate keys. Valarrays and bitsets are specialized containers.
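A short example of equal_range on a multimap (the key/value data here is made up):

#include <iostream>
#include <map>
#include <string>
using namespace std;

int main()
{
    multimap<string, int> scores;              // duplicate keys are allowed
    scores.insert(make_pair(string("alice"), 90));
    scores.insert(make_pair(string("alice"), 85));
    scores.insert(make_pair(string("bob"), 70));

    // equal_range returns the [first, last) range of entries with the given key
    pair<multimap<string, int>::iterator, multimap<string, int>::iterator> range =
        scores.equal_range("alice");
    for (multimap<string, int>::iterator it = range.first; it != range.second; ++it)
        cout << it->first << " -> " << it->second << endl;
    return 0;
}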

Monday, September 1, 2014

Some interview questions I came across online on Hanselman's blog; I am posting quick (and probably not sufficient) answers below:
Q: From constructor to destructor (taking into consideration Dispose() and the concept of non-deterministic finalization), what the are events fired as part of the ASP.NET System.Web.UI.Page lifecycle. Why are they important? What interesting things can you do at each?
A: Page goes through Request -> Start-> Initialization-> Load->Validation->PostbackEventHandling->Rendering->Unloading.
 Events fired are
PreInit - check the IsPostBack property to determine whether the page is being processed for the first time.
Init - raised after all controls have been initialized; used to initialize control properties.
InitComplete - raised by the Page object when initialization is complete.
PreLoad - after this, the page loads view state for itself and all controls.
Load - corresponds to the OnLoad event; the page recursively loads all controls. This is a common place to establish database connections.
Control events - e.g. a TextBox's TextChanged or a Button's Click event.
LoadComplete - used for anything else that needs to be loaded.
PreRender - used to make final changes to the contents of the page or its controls.
SaveStateComplete - at this point view state has been saved.
Render - the page calls this method on each control.
Unload - cleanup runs in reverse order: controls, control-specific database connections, the page itself, then logging and other request-specific tasks.

Q: What are ASHX files?  What are HttpHandlers?  Where can they be configured?
A: ashx files are HTTP handlers, ascx files are user controls, the asax file (Global.asax) handles application-level events, asmx is for web services, and aspx is for pages.
Built-in HTTP handlers target extensions such as .ashx, .aspx, .asmx, and trace.axd.
Handlers are configured in IIS using Add/Edit Application Extension Mapping and in the application's web.config under the httpHandlers section.
An HttpHandler's ProcessRequest method is invoked to produce the response.


Q: What is needed to configure a new extension for use in ASP.NET? For example, what if I wanted my system to serve ASPX files with a *.jsp extension?
A: First you need to map the extension in IIS, and then map a handler to the extension in the application's configuration. The same handler can be reused for different extension mappings.

Q: What events fire when binding data to a data grid? What are they good for?
A: The control's DataBinding event fires when data binding begins, and a DataGrid additionally raises ItemCreated and ItemDataBound for each row, which are useful for per-row inspection and customization.

Q: Explain how PostBacks work, on both the client-side and server-side. How do I chain my own JavaScript into the client side without losing PostBack functionality?
A: PostBacks post information back to the server, such as login credentials or a selection from a control, so the page can be re-rendered. Checking the IsPostBack property helps keep the code efficient. The client side can post back at both the page and the control level, configured by the AutoPostBack property of the controls. Callback events are raised so your own JavaScript can be chained in without losing PostBack functionality.

Q: How does ViewState work and why is it either useful or evil?
A: ViewState is used to persist state across postbacks and generally stores any programmatic changes to the page's state. It is useful for keeping information between round trips, but it can be misused to stuff in just about anything, which bloats the page. The StateBag behind it is like a Hashtable.

Q: What is the OO relationship between an ASPX page and its CS/VB code behind file in ASP.NET 1.1? in 2.0?
A: The code-behind follows the Page Controller pattern. In ASP.NET 1.1 the ASPX page class inherits from the compiled code-behind class; in 2.0 the two are partial classes compiled together. You specify handlers in the code-behind for the declarations on the page.

Q: What happens from the point an HTTP request is received on a TCP/IP port up until the Page fires the On_Load event?
A: The request goes through the HTTP pipeline (IIS and then ASP.NET), passing through the registered HttpModules and the page's HttpHandler; the page then runs its lifecycle events, including Load, before the response is returned.

Q: How does IIS communicate at runtime with ASP.NET?  Where is ASP.NET at runtime in IIS5? IIS6?
A: The request is dispatched to the ASP.NET engine via aspnet_isapi.dll. In IIS 5 the runtime is hosted in the aspnet_wp.exe worker process; in IIS 6 it runs inside the w3wp.exe application pool worker process.

Q: What is an assembly binding redirect? Where are the places an administrator or developer can affect how assembly binding policy is applied?
A: An assembly binding redirect tells the loader to use a different version of an assembly than the one that was requested. Binding policy can be affected in the application configuration file, in a publisher policy file that ships with the assembly, and in the machine configuration file; an administrator or developer can edit any of these.

Q: Compare and contrast LoadLibrary(), CoCreateInstance(), CreateObject() and Assembly.Load().
A: These load, respectively, a native DLL (LoadLibrary), a COM object by CLSID (CoCreateInstance), a COM object by ProgID via late binding (CreateObject), and a managed .NET assembly (Assembly.Load).

Sunday, August 31, 2014

In the Kullback-Leibler divergence that we mentioned in earlier posts, the divergence was measured word by word as that word's probability weighed against the background distribution. We calculate nw as the count of the term w, and pw as nw divided by the total number of terms in the document, and we measured P(tk|q) as nw / (sum over all terms x in q of nx). When the divergence was greater than a threshold, we selected the term as a keyword. It is not necessary to measure the divergence of each term against the background distribution of a single document, because the metric holds for any two distributions P(x) and Q(x): their divergence is D(P || Q) = sum over x of P(x) * log(P(x) / Q(x)). The term distribution of a sample document is then compared with the distributions of the C categories. The probability of a term tk in a document dj is the ratio of the term's frequency in that document to its overall frequency across all documents when the term appears in the document, and zero otherwise. The term probability distribution across categories is normalized to 1.
The equation we refer to for the divergence comes in many forms.
Most formulations substitute a small default value when a term does not appear in either P or Q, because a zero probability skews the equation (the log ratio becomes undefined). This epsilon probability corresponds to an unknown word.
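As a minimal sketch of that smoothed computation (the epsilon value and the simple substitution below are placeholders for whichever back-off the chosen formula uses):

#include <cmath>
#include <vector>
using namespace std;

// Kullback-Leibler divergence D(P || Q) = sum over x of P(x) * log(P(x) / Q(x)).
// A small epsilon stands in for a zero probability (an unseen word) so that
// the log ratio stays finite.
double klDivergence(const vector<double>& P, const vector<double>& Q, double epsilon)
{
    double divergence = 0.0;
    for (size_t x = 0; x < P.size(); x++)
    {
        double p = P[x] > 0 ? P[x] : epsilon;
        double q = Q[x] > 0 ? Q[x] : epsilon;
        divergence += p * log(p / q);
    }
    return divergence;
}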
Today I will resume some discussion on Keyword extraction.
We discussed co-occurrence of terms as an indicator of the keywords. This has traditionally meant clustering keywords based on similarity. Similarity is often measured based on Jensen-Shannon divergence or Kullback-Leibler divergence. However similarity doesn't give an indication of relevance. Pair-wise co-occurrence or mutual information gives some indication of relevance.
 Sometimes we need to use both or prefer one over the other based on chi-square goodness of fit.
In our case, co-occurrence of a term and a cluster means co-occurrence of the term and any term in the cluster although we could use nearest, farthest terms or the average from the cluster.
What we did was populate a co-occurrence matrix from the top occurring terms and their counts. For each of those terms, we count the co-occurrences with the frequent terms that we have selected. These frequent terms are chosen based on a threshold.
When we classify, we take two terms and find the clusters they belong to. Words don't belong to any cluster initially. They are put in the same cluster based on the mutual information, which is calculated from the ratio of the probability of the terms co-occurring to the product of the individual probabilities of the terms. We translate this into counts and compute each probability from the co-occurrence matrix (a sketch follows).
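A hedged sketch of that mutual-information test (the counts come from the co-occurrence matrix; the clustering threshold is left to the caller, and the function name is mine):

#include <cmath>
using namespace std;

// Pointwise mutual information of terms w1 and w2 estimated from counts:
//   PMI = log( P(w1, w2) / (P(w1) * P(w2)) )
// where P(w1, w2) = cooccur / total and P(w) = count(w) / total.
// A large value suggests the two terms belong in the same cluster.
double pointwiseMutualInformation(long cooccur, long count1, long count2, long total)
{
    if (cooccur == 0 || count1 == 0 || count2 == 0) return 0.0;   // no evidence
    double pJoint = (double)cooccur / total;
    double p1 = (double)count1 / total;
    double p2 = (double)count2 / total;
    return log(pJoint / (p1 * p2));
}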
We measure the cluster quality by calculating the chi-square. This we do by summing over all the components of the chi-square as measured for each word in the frequent terms. Each component is the square of the difference between the observed co-occurrence frequency and the expected frequency and divided by the expected frequency of co-occurrence. The expected frequency is calculated in turn as the combination of the expected probability pg of that word g from the frequent terms and the co-occurrence of the term w with frequent terms denoted by nw.
If the terms have a large chi-square value, then they are relatively more important. If the terms have  a low chi-square value then they are relatively trivial. Chi-square gives a notion of the deviation from the mean indicating the contribution each cluster makes and hence its likelihood to bring out the salient keywords. For a look at the implementation, here it is. We attempted Kullback-Leibler divergence as a method for keyword extraction as well. Here we used one of the formula for the divergence.