Tree Flattening
Question: How do you flatten a binary tree?
A binary tree can be flattened into an array by storing the left and right children of the node at index i at positions 2i and 2i+1 (one-based indexing), as in the case of a heap.
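The heap-style layout can be sketched as follows; this is a minimal illustration using zero-based indexing (children at 2i+1 and 2i+2, equivalent to the one-based 2i and 2i+1 scheme), with a hypothetical Node class that is not part of the original text:

```python
class Node:
    def __init__(self, val, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right

def flatten_to_array(root):
    """Place the node at index i, its left child at 2i+1, right at 2i+2."""
    slots = {}
    def visit(node, i):
        if node is None:
            return
        slots[i] = node.val
        visit(node.left, 2 * i + 1)
        visit(node.right, 2 * i + 2)
    visit(root, 0)
    if not slots:
        return []
    # Missing nodes leave None holes, as in a heap-style array.
    arr = [None] * (max(slots) + 1)
    for i, v in slots.items():
        arr[i] = v
    return arr
```

Note that a sparse tree wastes space in this layout: a degenerate tree of depth d needs an array of length about 2^d.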
A binary tree can also be flattened into a linked list, breadth first, by keeping a queue of nodes to visit.
As you encounter nodes, enqueue them, and as you dequeue each node, enqueue its children. You can keep pointers to the next node at the same level, or you can mark the leftmost node of each level in your linked list so that all the nodes at one level of the tree occur between two such marked nodes.
There are other solutions as well, using recursion or extra storage that keeps track of positions in the tree.
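The breadth-first flattening with a queue can be sketched like this; instead of marker nodes, this sketch returns one sub-list per level, which serves the same purpose of delimiting the nodes at each level (the Node class here is assumed, not from the original text):

```python
from collections import deque

class Node:
    def __init__(self, val, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right

def flatten_by_level(root):
    """Enqueue the root; as each node is dequeued, enqueue its children.
    Each pass over the current queue length yields one level."""
    levels = []
    if root is None:
        return levels
    queue = deque([root])
    while queue:
        level = []
        for _ in range(len(queue)):
            node = queue.popleft()
            level.append(node.val)
            if node.left:
                queue.append(node.left)
            if node.right:
                queue.append(node.right)
        levels.append(level)
    return levels
```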
Design a class library for playing game cards.
A deck of game cards is a collection of distinct cards that lends itself to various groupings and operations. Use a collection that works with LINQ-like methods.
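A rough sketch of such a library, in Python to match the other examples in this post, with list comprehensions standing in for LINQ-style operators; all class and method names here are illustrative:

```python
from itertools import product
import random

class Deck:
    """A deck of 52 distinct cards supporting grouping and filtering."""
    SUITS = ["Clubs", "Diamonds", "Hearts", "Spades"]
    RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10",
             "Jack", "Queen", "King", "Ace"]

    def __init__(self):
        # Each card is a (rank, suit) pair; product guarantees distinctness.
        self.cards = [(r, s) for s, r in product(self.SUITS, self.RANKS)]

    def shuffle(self, rng=random):
        rng.shuffle(self.cards)

    def deal(self, n):
        """Remove and return the top n cards."""
        hand, self.cards = self.cards[:n], self.cards[n:]
        return hand

    def where(self, predicate):
        """LINQ-like filtering over the remaining cards."""
        return [c for c in self.cards if predicate(c)]
```

A usage example: `Deck().where(lambda c: c[1] == "Hearts")` returns the thirteen hearts, much like a LINQ `Where` clause.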
How will you write a find and replace tool?
A find and replace tool takes two input strings: one that is used for searching and the other that is used to replace each occurrence. These are independent operations. When text is to be deleted, the replacement string is empty; in such cases, the rest of the string after the occurrence is moved or copied forward. The search for the next occurrence can resume from the current location, or all occurrences can be found up front.
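The scan described above can be sketched as follows; it replaces every occurrence in one pass, resuming the search just after each match, and an empty replacement string deletes:

```python
def find_and_replace(text, pattern, replacement):
    """Scan left to right; on a match, emit the replacement and resume
    the search immediately after the occurrence."""
    if not pattern:
        return text
    out = []
    i = 0
    while i <= len(text) - len(pattern):
        if text[i:i + len(pattern)] == pattern:
            out.append(replacement)
            i += len(pattern)
        else:
            out.append(text[i])
            i += 1
    out.append(text[i:])  # copy the tail after the last possible match
    return ''.join(out)
```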
There is a very large array in which every number is repeated once, except one number. How will you find that number?
We can maintain a candidate list and an exclusion list. Numbers that we have seen before are moved to the exclusion list and skipped thereafter; the candidate that is never excluded is the answer.
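The candidate/exclusion approach can be sketched with two sets; when every other number appears exactly twice, exactly one candidate survives:

```python
def find_unique(nums):
    """Track candidates and exclusions: a number seen a second time
    moves to the exclusion set and is skipped from then on."""
    candidates = set()
    excluded = set()
    for n in nums:
        if n in excluded:
            continue
        if n in candidates:
            candidates.remove(n)
            excluded.add(n)
        else:
            candidates.add(n)
    # Exactly one candidate remains when every other number repeats once.
    return candidates.pop()
```

When the repeated numbers appear exactly twice, XOR-ing all the elements gives the same answer with O(1) extra space, since the paired values cancel.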
Saturday, April 6, 2013
Friday, April 5, 2013
Interview questions and answers, continued
Question: How do you find the edit distance between strings?
Answer: You calculate the Levenshtein distance between the strings, with memoization. This is how it works:
If one of the input strings is empty, return the length of the other string as the distance D:
D(i, 0) = i
D(0, j) = j
Otherwise, D(i, j) is the minimum of:
1) D(i-1, j) + 1 (i.e. consume a letter of the left string)
2) D(i, j-1) + 1 (i.e. consume a letter of the right string)
3) D(i-1, j-1) + cost, where cost = 1 if str1[i] != str2[j] and cost = 0 otherwise
The same recurrence can also be written recursively over suffixes, as in the code below, with the same three cases; the cost is 1 when the characters do not match and 0 otherwise.
If we took the sample strings "CAT" and "BAT", the suffix distances D(i, j) (the distance between str1[i:] and str2[j:], which is what the code below computes) form the following matrix; the answer is the entry at i = 0, j = 0:

            B (j=0)   A (j=1)   T (j=2)
T (i=2)        2         1         0
A (i=1)        1         0         1
C (i=0)        1         1         2
memo = {}

# i and j are the start indices into str1 and str2; len1 and len2 are the
# lengths of the remaining suffixes.
def LevenshteinDistance(str1, i, len1, str2, j, len2):
    key = (i, len1, j, len2)
    if key in memo:
        return memo[key]
    if len1 == 0:
        return len2
    if len2 == 0:
        return len1
    cost = 0
    if str1[i] != str2[j]:
        cost = 1
    dele = LevenshteinDistance(str1, i + 1, len1 - 1, str2, j, len2) + 1
    inse = LevenshteinDistance(str1, i, len1, str2, j + 1, len2 - 1) + 1
    upda = LevenshteinDistance(str1, i + 1, len1 - 1, str2, j + 1, len2 - 1) + cost
    # Python's built-in min takes all three candidates directly.
    dist = min(dele, inse, upda)
    memo[key] = dist
    return dist

print(LevenshteinDistance("cat", 0, 3, "bat", 0, 3))

prints 1
C++ version:

#include <algorithm>
#include <map>
#include <sstream>
#include <string>
using namespace std;

map<string, int> m;

int getLD(const char* str1, int i, int len1, const char* str2, int j, int len2)
{
    stringstream sout;
    sout << i << "," << len1 << "," << j << "," << len2;
    string key = sout.str();
    if (m.find(key) != m.end())
        return m[key];
    if (len1 == 0)
        return len2;
    if (len2 == 0)
        return len1;
    int cost = (str1[i] != str2[j]) ? 1 : 0;
    int dele = getLD(str1, i + 1, len1 - 1, str2, j, len2) + 1;
    int inse = getLD(str1, i, len1, str2, j + 1, len2 - 1) + 1;
    int upda = getLD(str1, i + 1, len1 - 1, str2, j + 1, len2 - 1) + cost;
    int dist = min(dele, min(inse, upda));  // std::min from <algorithm>
    m[key] = dist;
    return dist;
}
Thursday, April 4, 2013
Interview questions and answers, continued
Another interview question: given an array of circularly sorted (rotated sorted) integers, how will you find a given integer?
In this case one factor is working for us: the integer collection is sorted, and the rotation can be handled because, given the length of the array, a circular index is bounded by the modulo of that length. (If the data were a circular linked list instead, we could confirm that it is circular by checking whether a pointer and a skip-level pointer meet at the same node.)
Starting from the rotation point, the integers in one direction are monotonically increasing, but an arbitrary half of the array may or may not be sorted. So you start with the first and last index positions, find the midpoint, determine which half must contain the value in each iteration, and update the start and end to select that half until the value is found or start and end cross over.
public int FindSorted(int[] input, int val)
{
    int start = 0;
    int end = input.Length - 1;
    while (start < end)
    {
        int mid = start + (end - start) / 2;
        if (input[mid] == val) return mid;
        if (input[start] <= input[mid])
        {
            // The left half is sorted.
            if (input[start] <= val && val < input[mid])
                end = mid - 1;
            else
                start = mid + 1;
        }
        else
        {
            // The right half is sorted.
            if (input[mid] < val && val <= input[end])
                start = mid + 1;
            else
                end = mid - 1;
        }
    }
    if (start < input.Length && input[start] == val) return start;
    if (end >= 0 && input[end] == val) return end;
    return -1;
}
Wednesday, April 3, 2013
Interview questions and answers, continued
This follows my earlier post on interview questions and answers:
Question: If you are given a string of length N and M strings each of length L, how will you find whether each of the M strings occurs in the larger string?
Answer: There are several approaches to this problem.
The first approach is to iterate over each of the M strings and test whether it occurs in the original string.
For each pattern that we match, we can do one of the following:
1) Brute force method: iterate through each character of the original string to see if it is the start of the pattern, then, for the length of the pattern, match the subsequent characters to see if the pattern occurs there.
2) Finite automaton: this is an improvement on the pattern matching part. We build a string matching automaton that lets us complete each pattern match in linear time.
3) All patterns at once: this is an improvement on the iteration over the patterns. Instead of testing each pattern one by one, the progress of every pattern's automaton is tracked as the characters of the original string are iterated; this is the idea behind the Aho-Corasick algorithm.
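Approach 1 can be sketched as a brute-force scan; the automaton-based approaches improve on the inner `occurs` check, which here costs O(N * L) per pattern:

```python
def patterns_in_text(text, patterns):
    """For each pattern, slide it across the text and compare the
    characters at every candidate start position."""
    def occurs(pattern):
        n, l = len(text), len(pattern)
        for start in range(n - l + 1):
            if text[start:start + l] == pattern:
                return True
        return False
    return {p: occurs(p) for p in patterns}
```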
Tuesday, April 2, 2013
stored procedures for semantic or fulltext search
Semantic search stored procedures in SQL Server enable you to register and unregister a pre-populated semantic language statistics database. Full-text search has to be enabled for this to work. To index Office 2010 documents, filter packs are available.
A user interface with a clean search page that lets users search for entities via full-text can be built on a full-text stored procedure. When full-text is enabled, queries can use the CONTAINS or FREETEXT predicates to return the results. The full-text indexing service runs separately and can be expected to stay up to date with all changes to the domain objects, so the stored procedure merely needs to execute a full-text query. The reason we want to put the query in a stored procedure is that it gives better performance and security, which is especially relevant when the search page is vulnerable to SQL injection attacks.
When the search page uses a form to query against one or more fields of domain entities, there are patterns we can use with service calls, ORM queries, or database queries. One pattern, for example when searching for a name and an SSN, is for the UI to build the predicate in the context of the domain entity selected on the screen. Another pattern is to query against one or more entities based on common attributes. The goal with these patterns is to classify and streamline the different access paths to the data. Sometimes a materialized view helps; other times, the ORM may have methods to look up entities via some top-level domain entity.
Search page presentation can be a dedicated page or an inset into various other pages, and the search results should be presented accordingly.
WPF UI control layering
The UI controls for an application are built from custom libraries over .NET, which is in turn built over the Windows framework. Controls that are not visible in one layer should still be accessible from the layers below.
Monday, April 1, 2013
BIRCH review
This is a review of "BIRCH: a new data clustering algorithm and its application".
BIRCH stands for "Balanced Iterative Reducing and Clustering using Hierarchies". It is an efficient and scalable data clustering method. The paper uses it to build an interactive pixel classification tool and to generate the initial codebook for compression. Data clustering refers to the problem of dividing N data points into K groups so that the points within a group are more similar to each other than to points in other groups. This system works well with very large data sets because it first generates a compact summary that retains as much distribution information as possible, and then clusters that data summary. BIRCH also lets other clustering algorithms be applied to its summary.
BIRCH proposes a data structure called a CF-tree to efficiently store the summary as clusters. It builds the tree as it encounters the data points, inserting each point based on one or more distance metrics, and then applies clustering algorithms to the leaf nodes. As opposed to partitional clustering, especially the widely used k-means algorithm, BIRCH is a hierarchical clustering algorithm. In partitioning, we pick initial centers for the clusters, assign every data instance to the closest cluster, and compute new centers for the k clusters. Hierarchical clustering instead produces multiple levels of clusterings, as in a dendrogram: level 1 has n clusters and level n has one cluster. Taking n = 8 instances as an example: first, we calculate the similarity between all possible pairs of data instances, and the most similar pair merges into one cluster, giving k = 7 clusters at level 2. Similarly, two other data points merge at level 3 (k = 6) and again at level 4 (k = 5). At level 5 (k = 4), the most similar of the previously formed clusters merge into a new cluster. At level 6 (k = 3), we calculate the similarity between the new cluster and all remaining clusters. At level 7 there are only two clusters, and at level 8 there is one. Thus the running time of partitional clustering is O(n) per iteration, while hierarchical clustering is O(n^2 log n), but it is the richer output for which the latter is chosen. Both need to keep the data in memory.
BIRCH does incremental and dynamic clustering of incoming objects with only one scan of the data, followed by the application of clustering algorithms to the leaf nodes.
BIRCH clustering works in the agglomerative style, where the cluster centers are not decided initially. Only the maximum number of cluster summaries and the threshold for any single cluster are decided up front. These two factors are chosen because we want to keep the cluster summaries in memory even for arbitrarily large data sets.
The algorithm reads the data points sequentially and proceeds in the following manner:
Step 1: Compute the distance between record r and each of the cluster centers. Let i be the index of the cluster whose center is closest to r.
Step 2: For that i'th cluster, recompute the radius as if record r were inserted into it. If the cluster threshold is not exceeded, insert r into cluster i and proceed to the next record. Otherwise, start a new cluster containing only record r.
Step 3: Check that step 2 does not exceed the maximum number of cluster summaries. If it does, increase the threshold so that existing clusters can accommodate more records, or merge clusters into fewer summaries.
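The three steps can be sketched for 1-D points as follows. This is an illustration, not the paper's CF-tree: each cluster keeps the (count, linear sum, sum of squares) triple of a BIRCH clustering feature, so the centroid and radius are derivable without storing members, and the step 3 here merges the two nearest summaries where the paper raises the threshold and rebuilds the tree. All names are illustrative.

```python
import math

def birch_sketch(points, threshold, max_clusters):
    clusters = []  # each entry: [count, linear_sum, square_sum]

    def centroid(c):
        return c[1] / c[0]

    def radius_with(c, x):
        # RMS distance of members from the centroid if x were inserted.
        n, ls, ss = c[0] + 1, c[1] + x, c[2] + x * x
        return math.sqrt(max(ss / n - (ls / n) ** 2, 0.0))

    for x in points:
        # Step 1: find the cluster whose center is closest to x.
        best = min(clusters, key=lambda c: abs(centroid(c) - x), default=None)
        # Step 2: insert if the radius stays within the threshold ...
        if best is not None and radius_with(best, x) <= threshold:
            best[0] += 1
            best[1] += x
            best[2] += x * x
            continue
        # ... otherwise start a new cluster containing only x.
        clusters.append([1, x, x * x])
        # Step 3: if there are too many summaries, merge the two nearest.
        while len(clusters) > max_clusters:
            clusters.sort(key=centroid)
            i = min(range(len(clusters) - 1),
                    key=lambda k: centroid(clusters[k + 1]) - centroid(clusters[k]))
            a, b = clusters[i], clusters.pop(i + 1)
            a[0] += b[0]; a[1] += b[1]; a[2] += b[2]
    return clusters
```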