Tuesday, April 30, 2013

Access control in middle tier continued

Following up on the earlier post, we discuss row level security further here. We wanted to do the following:
1) Specify the tables that define the label categories and markings.
2) Create roles for these marking values.
3) Add an integer column to the base table.
4) Define the view.

To define the labeling policy, we start by creating a few tables that define the category, marking, marking hierarchy, unique label and so on.
After we define the structure of labels in our policy, we define the roles. We create a role for each possible value of the non-hierarchical categories that use the any or all comparison rule. For hierarchical categories, we nest the roles to model the hierarchy.
We also create a helper view for our controls.

Next we define the changes to the base tables. These are the tables to which we add row level security: we assign a labelID to each row. Creating a non-clustered index on the labelID column will be very helpful. As an aside, note that this column need not be added to the base table; a junction table can be specified instead, with the IDs of the base table rows and the IDs of the labels.

Now we are ready for the last step which is to create the base table view for the end user.
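As a sketch of steps 3 and 4 above, the following assembles the kind of T-SQL involved. The object names (SalesData, v_UserVisibleLabels, v_SalesData) and the pyodbc connection are illustrative assumptions, not names from the MSDN literature, and the label predicate would follow whichever comparison rules the categories use.

import pyodbc  # assumes an ODBC data source for the target SQL Server

# Placeholder object names; the WHERE clause encodes "show only rows whose
# label the current user can read", via the helper view from step 2.
ddl_statements = [
    # Step 3: add the label column to the base table and index it.
    "ALTER TABLE SalesData ADD labelID int NOT NULL DEFAULT 0",
    "CREATE NONCLUSTERED INDEX IX_SalesData_labelID ON SalesData (labelID)",
    # Step 4: the end-user view exposes only rows with a visible label.
    """CREATE VIEW v_SalesData AS
       SELECT s.* FROM SalesData s
       WHERE s.labelID IN (SELECT labelID FROM v_UserVisibleLabels)""",
]

conn = pyodbc.connect("DSN=target;Trusted_Connection=yes")
cursor = conn.cursor()
for statement in ddl_statements:
    cursor.execute(statement)
conn.commit()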

Next let's look at the insert, update and delete operations. Everything we have discussed so far enables us to select the rows permitted by the labels, which is generally allowed for all users, but CRUD operations may or may not be allowed for all users. Besides, these DML statements may not only be ad hoc but may also come from stored procedures. Moreover, security labels may not be allowed to be set with these operations; usually they are set to a default value and then modified separately. Where users are permitted to specify security labels, they can supply them with the DML. In general, updatable views permit editing only of those rows that a user is allowed to see.

When talking about labels, it is important to observe that labels themselves get misused: more than one label could mean the same thing, labels may differ merely in spelling, labels may be concatenated or translated, labels may not be used consistently, and so on. In most of these cases, XML-typing the labels prevents these issues.

Other common tasks with labels include the following:
getting an ID to represent a label - helpful for CRUD operations where the label is set,
checking whether a label already has an ID, and generating one if not,
a variant of the above that checks whether an ID exists without generating a new one, and
translating labels to IDs and back, or finding the label corresponding to a user.

Finally cell level security and encryption could be used in addition to row level security.
This article courtesy of the literature on msdn.

an easier approach to topic analysis

A quick look at how to implement the Li and Yamanishi topic analysis and segmentation approach.
The steps involved are:
1) topic spotting
2) text segmentation
3) topic identification
The input is the given text and the output is a topic structure based on STMs (stochastic topic models).
In step 1, we select keywords from the given text based on the Shannon information of each word:
I(w) = -N(w) log P(w)
where N(w) denotes the frequency of w in the text, P(w) is the probability of the occurrence of w as given by the corpus data, and I(w) is therefore the amount of information conveyed by w.
P(w) can be estimated from tagged corpus data, for example with NLTK's Naive Bayes classifier:

import nltk

# words is a list of (word, label) pairs from a tagged corpus;
# word_features maps a word to a feature dict
featuresets = [(word_features(w), label) for (w, label) in words]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
classifier.classify(word_features(word))
Clustering helps with topic spotting: starting with each word as its own cluster, clusters are merged additively. Text segmentation is an independent step.
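A minimal sketch of the keyword scoring in step 1, assuming plain token lists for the text and the corpus; the smoothing of unseen words is an assumption, not part of the paper.

import math
from collections import Counter

def keyword_scores(text_tokens, corpus_tokens):
    # I(w) = -N(w) log P(w): N(w) from the text, P(w) from the corpus
    n_w = Counter(text_tokens)
    corpus = Counter(corpus_tokens)
    total = float(sum(corpus.values()))
    scores = {}
    for w, freq in n_w.items():
        p_w = corpus.get(w, 1) / total  # crude smoothing for unseen words
        scores[w] = -freq * math.log(p_w)
    return scores

text = "the fed raised interest rates and the markets reacted".split()
corpus = ("the a of and fed rates markets bank money stock " * 100).split()
scores = keyword_scores(text, corpus)
keywords = sorted(scores, key=scores.get, reverse=True)[:5]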

Monday, April 29, 2013

Access Control in middle tier

Let's quickly review some techniques of securing domain objects with user access control.
First there are two different approaches:
1) Resource based : Individual resources are secured using windows ACLs
2) Role based : Users are partitioned into roles, members of a particular role share the same privilege within the application.

It used to be that role based access was preferred over resource based access, because the latter didn't scale and worked best only for resources that could leverage the built-in access control of underlying systems such as Windows. Role based access also enables connection pooling and so on. However, with AD integrations and improvements to database servers, administrators and developers alike now want individual user access to different resources.

For simplicity, let's take a base table for an entity that we want to secure, and say we have a relational mapping to a control table with IDs for privileges: ID 0 means access for no one, ID 1 means access for everyone, and intermediate IDs denote groups in between. The controls table may itself be a view over multiple tables, just as the base table can be a view in which only the records visible to the user are available for insert, update and delete.

Then we perform security at each level of the stack in this way:
Authenticate users within your front end application
Map users to role
Authorize access to operations (not directly to resources) based on role membership
Access the necessary backend resources by using fixed service identities and privileges.
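A minimal sketch of the role-membership check, with made-up role and operation names; a real application would take the roles from the authenticated principal rather than a literal dictionary.

# Hypothetical role map: operations are authorized, resources are not touched directly.
ROLE_OPERATIONS = {
    "reader":  {"view_order"},
    "manager": {"view_order", "approve_order"},
}

def authorize(user_roles, operation):
    # True if any of the user's roles permits the operation
    return any(operation in ROLE_OPERATIONS.get(role, set()) for role in user_roles)

def approve_order(user_roles, order_id):
    if not authorize(user_roles, "approve_order"):
        raise PermissionError("approve_order requires the manager role")
    # ... perform the update using a fixed service identity, not the user's ...

approve_order({"manager"}, 42)  # allowed; a {"reader"} caller would be rejected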

Access can be granted and revoked on data tables as a whole, but if row level security is desired, one could add row level labels that are classified based on the control table.

 

Sunday, April 28, 2013

probability distribution in topic analysis

Probability distributions are useful for the analysis of topics. In Wartena's paper on topic detection, the similarity between terms is defined as a distance between the distributions associated with keywords, obtained by counting co-occurrences in documents. So we elaborate here on the probability distributions useful for topic analysis. Suppose we have a set T of term occurrences, each of which can be found in exactly one source document d in a collection of documents C. We consider the natural probability distributions Q on C X T that measure the probability to randomly select an occurrence of a term, a source document, or both. Let n(d,t) be the number of occurrences of term t in d, n(t) the number of occurrences of t over all documents, and n the total number of term occurrences. Then the probability to randomly select an occurrence of term t is q(t) = n(t)/n on T, and if N(d) is the sum of all term occurrences in d, the conditional probability is q(t|d) = q_d(t) = n(d,t)/N(d) on T.

We can then measure the similarity between two terms with a metric d(i,j) between elements i and j, satisfying non-negativity, identity of indiscernibles, symmetry and the triangle inequality; two or more elements are similar if they are closer. The similarity measure used is the Jensen-Shannon divergence, or information radius, between two distributions p and q: JSD(p || q) = 1/2 D(p || m) + 1/2 D(q || m), where m = 1/2 (p + q) is the mean distribution and D(p || q) is the relative entropy, D(p || q) = Sum_t p_t log(p_t / q_t).
For a developer, implementing a probability distribution such as the above could be broken out into the following cases:
1) Probability distribution Q(d,t) = n(d,t)/n on C X T. Here n(d,t), as mentioned earlier, is the number of occurrences of term t in d, and n is the total number of term occurrences we started our analysis with.
2) Probability distribution Q(d) = N(d)/n on C. Here N(d) is the number of term occurrences in d, defined as the sum of n(d,t) over all terms t in document d.
3) Probability distribution q(t) = n(t)/n on T, where n(t) is the number of occurrences of term t across all documents.
Note that n(t) on T and N(d) on C are similar in that each sums occurrences over its respective collection.
These give the respective conditional probabilities Q(d|t) = Q_t(d) = n(d,t)/n(t) on C and
Q(t|d) = q_d(t) = n(d,t)/N(d) on T.
As a recap of the various terms, we use the following table:
Term         Description
t            a distinct term whose occurrences are being studied (typically representative of topics)
d            one of the documents
C            the collection of documents d1, d2, ..., dm being studied
T            the set of term occurrences t1, t2, ..., tn, each of which is found in exactly one source document
n(d,t)       the number of occurrences of term t in d
n(t)         the cumulative number of occurrences of term t across all documents
n            the total number of term occurrences
N(d)         the cumulative number of term occurrences in d
Q(d,t)       the distribution n(d,t)/n on C X T; pertains to both documents and terms
Q(d)         the distribution N(d)/n on C; pertains to documents
q(t)         the distribution n(t)/n on T; pertains to terms
Q(d|t)       the source distribution of t: the probability that a randomly selected occurrence of term t has source d
Q(t|d)       the term distribution of d: the probability that a randomly selected term occurrence from document d is an instance of term t
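These distributions reduce to a few lines of counting. Here is a sketch over a made-up two-document corpus, with the Jensen-Shannon divergence computed exactly as defined above on the term distributions q_d(t):

import math
from collections import Counter

docs = {"d1": ["tax", "tax", "rates"], "d2": ["rates", "markets"]}  # toy corpus

N = {d: len(ts) for d, ts in docs.items()}             # N(d)
n = sum(N.values())                                    # n, total term occurrences
n_t = Counter(t for ts in docs.values() for t in ts)   # n(t)

def q_term(d):
    # term distribution of d: Q(t|d) = n(d,t)/N(d)
    c = Counter(docs[d])
    return {t: c[t] / float(N[d]) for t in c}

def jsd(p, q):
    # Jensen-Shannon divergence: 1/2 D(p||m) + 1/2 D(q||m), m = (p+q)/2
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in set(p) | set(q)}
    kl = lambda a: sum(v * math.log(v / m[t]) for t, v in a.items() if v > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

print(jsd(q_term("d1"), q_term("d2")))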

Saturday, April 27, 2013

A developer’s guide to topic analysis of documents


Why Topic Analysis?

Topic analysis of a document has many different applications. In my search, I've not found any commercial product that does all of the text processing outlined here. For example, one can construct an index of the text based on topic analysis. Topics can be scoped to text documents or even text blocks. Main topics and sub-topics can be indicated in the index, and one can get a sense of the contents by looking at the topic structure. The methods mentioned here have many applications, such as text mining, text summarization, information extraction, question answering and other text processing.

Why Developers?

The applications mentioned above have potential for use by millions of users; text summarization alone has had sensational news coverage for a product. This document attempts to spell out the implementations involved, which are otherwise buried in the complex equations and models of the research papers. That said, it derives most or all of its ideas from the papers listed in the references section.

Besides, research focuses on finding metrics and measures to evaluate a document for keyword extraction, topic extraction or clustering, whereas developers treat these as interchangeable factors or algorithms. Developers also focus on fast and efficient implementations, such as for clustering. While any text mining document references clustering, a popular technique for information retrieval from documents, this document introduces clustering techniques to developers while making use of them in the context of the goal of topic analysis.

Moreover, developers will find that many different implementations are possible by simply interchanging algorithms and methods. Different techniques may need to be tried out, and none of them yields a perfect solution; research calls this domain a non-convex optimization.

What are the steps involved?

The steps in the text processing involve:

1)      Keyword detection.  There are several well defined methods for keyword extraction.

One of these uses a factor called Term Frequency - Inverse Document Frequency (also known as TF-IDF). A TF-IDF value is calculated for each word in a text, and the words having the largest TF-IDF values are chosen as keywords. Another way to select words is by their Shannon information, which is proportional to the term frequency and to the negative log probability of the word's occurrence in the corpus data. The corpus data is a collection of documents that have already been parsed and tagged and that represents a comprehensive collection of topics.
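For instance, a bare-bones TF-IDF scorer over tokenized documents might look like the following; the +1 damping in the IDF is one common variant, not the only choice:

import math
from collections import Counter

def tf_idf(doc_tokens, corpus):
    # score each word of one document against a corpus of token lists
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for w, f in tf.items():
        df = sum(1 for d in corpus if w in d)        # document frequency of w
        scores[w] = f * math.log(n_docs / (1.0 + df))
    return scores

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["a", "cat", "and", "a", "dog"]]
print(tf_idf(corpus[0], corpus))  # only "sat" scores above 0; widespread words drop out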

There are other similarity or distance measures as well. For example, similarity can be based on the statistical distribution of words in a corpus of texts. These lead to a weighted sum of terms, with the crucial difference that non-negative weights summing to 1 are allowed, which have a natural interpretation as conditional probabilities.

Python’s NLTK framework allows us to experiment with these conditional probabilities with the following sample code from their online documentation.

import nltk

# documents is a list of (document, category) pairs and document_features maps a
# document to a feature dict, as in the NLTK book's document classification example
featuresets = [(document_features(d), c) for (d, c) in documents]

train_set, test_set = featuresets[100:], featuresets[:100]

classifier = nltk.NaiveBayesClassifier.train(train_set)

print nltk.classify.accuracy(classifier, test_set)

 

However, our approach will be based on an input of the word occurrence and co-occurrence data of terms, and an output of a set of keywords in a cluster that indicate the topic of the document. We will get this data from a corpus included with a natural language processing kit.

 

2)      Topic analysis by Clustering:

Clustering involves grouping of data points together based on their co-occurrences or on their similarity measures. If a cluster with a seed word w contains a co-occurring word w1 and another cluster with the seed word w1 has a word w, we can merge the clusters together.  When we cluster based on similarity measures, we are looking for the similarity of a word with that of cluster centers.  We could start with a set of finite clusters and group data points to these clusters adjusting their centers as data points are added to them. Alternatively, we can create a tree of clusters that can be traversed for each data point to find the appropriate cluster where the data point can be inserted. When building a hierarchical cluster, we can treat the data points as one cluster and split them into sub clusters repeatedly in a top down manner or merge them repeatedly from one data point per cluster in a bottom up manner.

 

There is a fast and efficient clustering described as BIRCH which stands for Balanced Iterative Reducing and Clustering using hierarchies. This clustering works in the agglomerative style where the cluster centers are not decided initially. Only the maximum number of cluster summaries and the threshold for any cluster is decided. These two factors are chosen because we want to keep the cluster summaries in memory even for arbitrarily large data sets.

 

The algorithm reads each of the data points sequentially and proceeds in the following manner:

Step 1: Compute the distance between record r and each of the cluster centers. Let i be the index of the cluster whose center is closest to r.

Step 2: For that i'th cluster, re-compute the radius as if record r were inserted into it. If the cluster threshold is not exceeded, insert r and proceed to the next record. If it is, start a new cluster containing only record r.

Step 3: Check that step 2 does not exceed the maximum number of cluster summaries. If it does, increase the threshold so that existing clusters can accommodate more records or can be merged into fewer cluster summaries.

 

The data structure used in BIRCH is a Clustering feature tree (CF-tree). Each entry in the CF-tree represents a cluster of objects and is characterized by a 3-tuple : (N, LS, SS) where N is the number of data points, LS is the linear sum of the N data points,  SS is the sum of squares of the N data points.  CF entries are additive so clusters can be merged incrementally and consistently.
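Because a CF entry is just (N, LS, SS), merging two clusters and recomputing a radius never needs the raw points. A sketch for one-dimensional data (the vector case applies the same arithmetic per coordinate):

import math

def merge(cf1, cf2):
    # CF entries are additive: add component-wise
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2, ls1 + ls2, ss1 + ss2)

def radius(cf):
    # average distance from the centroid: sqrt(SS/N - (LS/N)^2), 1-D case
    n, ls, ss = cf
    c = ls / float(n)
    return math.sqrt(max(ss / float(n) - c * c, 0.0))

a = (3, 6.0, 14.0)   # points {1, 2, 3}: N=3, LS=6, SS=14
b = (2, 9.0, 41.0)   # points {4, 5}:    N=2, LS=9, SS=41
print(radius(merge(a, b)))  # radius of {1,..,5} without revisiting the points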

 

In a CF-tree, each non-leaf node has at most B entries where B stands for branching factor. Each leaf node has at most L CF entries, each of which satisfies threshold T.  Node size for both leaf and non-leaf is determined by input page size P. A leaf node is a collection of CF triplets and links to next and previous leaf nodes. Operations include CF-tree insertion and rebuilding.

 

A CF tree insertion proceeds like this:

First, traverse down from the root recursively to find the appropriate leaf. Next, modify the leaf: if an existing entry can absorb the point, update it; otherwise make a new CF entry. If there is no room for the new entry, split the leaf node, and split parent nodes up the tree as necessary. Lastly, traverse back and update the CFs on the path to any split nodes. To split a node, take the two farthest CFs as seeds of two new nodes and put the remaining CFs (including the new one that caused the split) into the closer node.

 

A CF tree rebuild addresses step 3 mentioned above: if the number of clusters is exceeded, increase the threshold and reinsert the CFs, so that they group into fewer, larger clusters and the tree shrinks. Since rebuilding operates along the height h of the tree, at most h extra pages of memory are required.

 

 

3)      Text Segmentation: A segmentation method called TextTiling segments a text into blocks (paragraphs) in accordance with topic changes within the text. Another is a cluster based approach: we could start with each sentence (or a relatively short candidate of, say, two sentences) as a candidate, merge candidates to form clusters, and place boundaries where the distance between the probability distributions of adjacent blocks is largest, as in the sketch below.
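One way to realize the cluster-based flavor: score the similarity of each pair of adjacent blocks and cut where the score dips to a local minimum. A sketch using plain cosine similarity of word counts; any of the distribution distances discussed earlier could be substituted:

import math
from collections import Counter

def cosine(a, b):
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def boundaries(blocks):
    # cut between blocks i and i+1 where similarity is a local minimum
    sims = [cosine(blocks[i], blocks[i + 1]) for i in range(len(blocks) - 1)]
    return [i + 1 for i in range(len(sims))
            if (i == 0 or sims[i] < sims[i - 1]) and (i == len(sims) - 1 or sims[i] < sims[i + 1])]

blocks = [s.split() for s in ["the fed raised rates", "rates and the markets fell",
                              "the team won the game", "fans cheered the team"]]
print(boundaries(blocks))  # [2]: a cut before the sports blocks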

 

All three steps mentioned above are almost independent of one another, and each admits variations in the algorithms and measures involved.

 

Conclusion:

                There are several methods for topic analysis, often involving the steps indicated in this document. Implementing these methods prepares a product with applications in index generation, topic analysis, summarization and more.

 

 

References:

  • H. Li and K. Yamanishi. Topic Analysis Using a Finite Mixture Model. Information Processing and Management, 39(4):521–541, 2003.
  • Christian Wartena and Rogier Brussee. Topic Detection by Clustering Keywords. http://MultimediaN.NL
  • T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: A New Data Clustering Algorithm and Its Applications. Data Mining and Knowledge Discovery. (A presentation on the same by Zhao Li was also referenced.)
  • Martin Warin. Using WordNet and Semantic Similarity to Disambiguate an Ontology.
  • Anette Hulth. Improved Automatic Keyword Extraction Given More Linguistic Knowledge.
  • S. A. Hojjatoleslami and J. Kittler. Region Growing: A New Approach. IEEE Transactions on Image Processing.

Friday, April 26, 2013

WPF routed events and commands


When controls are added to a form, the events are hooked up in the XAML markup and the handlers are added to the code-behind. When you right-click on the InitializeComponent method and go to its definition, the editor displays the generated code file, where you can see the routed event handlers that were added.
The elements in the form are declared as a hierarchy of elements, which is called the logical tree. Most WPF controls are either ItemsControls or ContentControls, and they can have child elements.
Routed events get routed primarily on the visual tree. They support a routing strategy of Bubble, Tunnel or Direct. Bubble is the most common: events propagate up the visual tree. Tunnel goes in the other direction, starting at the root and going down to the leaf. Direct events behave like normal events and are handled only by the delegate attached to the event.
To walk through how a button click is processed: a user initiates the click with a MouseLeftButtonDown event that happens inside the Image element. That event tunnels down from the root to the Image. If no handler sets the Handled flag to true for the preview event, the event then bubbles up from the Image to the button. The button handles the event, sets the Handled flag to true, and raises its own Click event. Even when writing custom composite controls, the button handles these events to which the template is attached. To enable an element to handle events that are declared in a different element, you can attach events by specifying the element and the event in the markup.
Routed commands give a specific mechanism for hooking up UI controls such as toolbar buttons and menu items to handlers without introducing a lot of tight coupling or repetitive code in the application. Among the advantages of using routed commands as opposed to regular event handlers, the first advantage is that the source elements can be decoupled from the command targets. They do not need direct references to one another. Secondly, the commands will automatically enable or disable all associated UI controls when the handler indicates the command is disabled. Lastly, commands enable you to associate keyboard shortcuts and other forms of input gestures to invoke the command.

Commands are specified with the command property on the control. The command property is supported by a number of controls such as MenuItem, button, RadioButton, CheckBox, hyperlink, and a number of other controls. CommandBinding is specified for an element that acts as the event handler. The CanExecute and Executed properties of a command binding point to methods in the codebehind. The main thing here is that who calls whom is not important.
CanExecute is called to determine whether or not the command should be enabled. When the Executed handler is set and CanExecute is missing, the default behaviour is that the command is enabled.

To make this more concrete and to quickly see the benefits of routed commands, visualize a toolbar with buttons that cut and paste selected text between two text boxes. The conventional approach would be to define click handlers that reference the two text boxes and call the appropriate clipboard actions, which becomes harder and messier when a control is deep inside a custom control. Instead we could just specify a built-in command to do the same. After you select some text in one of the text boxes, the toolbar button becomes enabled, and it works for any text box anywhere in the UI. Under the covers, these routed commands use routed events to route messages between the command invokers and the command handlers. Once a single command handler is invoked, no other handlers will be called.

There are five built-in command classes in WPF: ApplicationCommands, NavigationCommands, EditingCommands, MediaCommands and ComponentCommands. RoutedCommands implement the ICommand interface.

Courtesy: MSDN magazine article by Brian Noyes.

Wednesday, April 24, 2013

R-Tree and clustering to find geographical locations

We discussed the R-Tree previously in this post here. Today we want to use it together with clustering to search restaurants across the nation and find the areas that have the highest number of restaurants with the best customer satisfaction ratings for a particular cuisine. Here we will use the R-Tree to store the geographic co-ordinates of the restaurants and to use the minimum bounding rectangle that comes with it. In addition, we will use agglomerative clustering to group these restaurants into a hierarchy.

Tuesday, April 23, 2013

Implementation of WPF event handlers

WPF event handlers can be scoped to the objects that change. For example, if a collection of items changes and an event handler is registered for OnChange, the handler is invoked for any updates to the items of the collection. However, if the collection reference itself is replaced via new, this event handler subscription is lost. In such a case, it is better to add an event handler to the parent holding the collection, say OnRefreshed. That handler should unsubscribe the old OnChange handler, update the reference to the new collection, and subscribe a new OnChange handler.

Monday, April 22, 2013

Li and Yamanishi stochastic model

Li and Yamanishi propose a stochastic topic model in which W denotes a set of words and K a set of topics, with a probability distribution of words P(W|K). Here the value of P(w|k) is zero if w is not included in k, and the probability of w in W is Sum over k of P(k)P(w|k). One such stochastic topic model (STM) is defined for each block. Word clusters are created from a large data corpus: all words in the vocabulary are treated as seed words, and for each seed word s, matching words w are found in the corpus that co-occur with it; these are grouped into a cluster. The match requires that a value deltaSC exceed a threshold gamma and that P(w|s) > P(w) be satisfied, i.e. that w occur significantly frequently with s. Since the (s,w) pairs are built from a large corpus, they do not have to be rebuilt for every candidate in a given document. From the results of the clustering, select any cluster whose seed word is included among the selected keywords, and merge two clusters ki and kj when the seed word of ki is included in kj and vice versa, merging only if the words belong to different word sets. Lastly, topic segmentation is performed on the n candidates based on similarity measures: find a local minimum S(i) among the similarity measures of consecutive candidates, then find the two most distant candidates P1, P2 adjacent to i, one on either side, such that P1 - S(i) > theta and P2 - S(i) > theta, and segment there.
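A sketch of that merge rule, with clusters represented as seed-word to word-set maps; the clusters below are made up for illustration:

def merge_clusters(clusters):
    # clusters: dict seed_word -> set of co-occurring words (seed included)
    seeds = list(clusters)
    for i, si in enumerate(seeds):
        for sj in seeds[i + 1:]:
            # merge ki and kj when each seed word is contained in the other cluster
            if si in clusters.get(sj, set()) and sj in clusters.get(si, set()):
                clusters[si] = clusters[si] | clusters.pop(sj)
    return clusters

clusters = {"bank": {"bank", "money", "loan"}, "loan": {"loan", "bank", "credit"},
            "game": {"game", "score"}}
print(merge_clusters(clusters))  # "bank" and "loan" collapse into one topic cluster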

Sunday, April 21, 2013

Birch clustering


BIRCH clustering works in the agglomerative style where the cluster centers are not decided initially. Only the maximum
number of cluster summaries and the threshold for any cluster are decided. These two factors are chosen because we want to keep the cluster summaries in memory even for arbitrarily large data sets.
The algorithm reads each of the data points sequentially and proceeds in the following manner:
Step 1: Compute the distance between record r and each of the cluster centers. Let i be the index of the cluster whose center is closest to r.
Step 2: For that i'th cluster, recompute the radius as if record r were inserted into it. If the cluster threshold is not exceeded, insert r and proceed to the next record. If it is, start a new cluster containing only record r.
Step 3: Check that step 2 does not exceed the maximum number of cluster summaries. If it does, increase the threshold so that existing clusters can accommodate more records or can be merged into fewer cluster summaries.
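A toy rendition of the three steps for scalar data, keeping each summary as (N, LS, SS); growing the threshold by doubling in step 3 is a simplification, since the paper leaves the growth policy open:

import math

MAX_CLUSTERS, threshold = 4, 1.0
clusters = []  # each summary is [N, LS, SS]

def radius_if_added(cf, x):
    n, ls, ss = cf[0] + 1, cf[1] + x, cf[2] + x * x
    c = ls / float(n)
    return math.sqrt(max(ss / float(n) - c * c, 0.0))

def insert(x):
    global threshold
    # Step 1: find the nearest cluster center
    if clusters:
        best = min(clusters, key=lambda cf: abs(cf[1] / float(cf[0]) - x))
        # Step 2: absorb x if the recomputed radius stays within the threshold
        if radius_if_added(best, x) <= threshold:
            best[0] += 1; best[1] += x; best[2] += x * x
            return
    clusters.append([1, x, x * x])
    # Step 3: too many summaries -> raise the threshold and merge the nearest pair
    while len(clusters) > MAX_CLUSTERS:
        threshold *= 2
        i, j = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                   key=lambda p: abs(clusters[p[0]][1] / float(clusters[p[0]][0]) -
                                     clusters[p[1]][1] / float(clusters[p[1]][0])))
        a, b = clusters[i], clusters.pop(j)
        a[0] += b[0]; a[1] += b[1]; a[2] += b[2]

for x in [1.0, 1.2, 9.8, 10.1, 5.0, 5.2, 20.0]:
    insert(x)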

Friday, April 19, 2013

Building a search page for a web application.

Let us assume that we have a three tier stack to create the search page for a web application. Next let us abstract the store for the data and focus on the controller and views instead. We will require a view model and some WPF controls. We could consider an editable control that filters the list of entities from the data store for each character input. This control merely has a textbox for search. The ViewModels handle the bindings to the Name attribute of entities. Autocomplete feature could also be added subsequently as a separate handler on each character input. This is possible because we can check against IsEditing and IsEditingAllowed flags on what the controls should do.
Changes will be made not only to the view model but also to the views. Initially this search could appear on a separate page, but it is something that could be shared across different views. Moreover, that view could merely hold a single custom control, since the search control and its results can serve as a template for different domain entities.
The control displaying the search text box is static and can show default hint text. The text control has a DependencyProperty set up for bi-directional binding. Each time the text changes in the control, the handlers evaluate it against the entities enumerated from the datastore. An execute button is not required in this case.
Initially, the search results can all be similar and depend on the type of the results to be returned. This is simpler and lets users reuse the search across different domain entities.
However, the search results need not be homogeneous and can have hierarchical data or grouping. The results view is independent of the search performed and can display a list of items in a tree format based on some pre-determined layout.
If the data store provider were expanded to include full text search that returns results across domain entities, the entities could be described in a grid, but they won't be double-clickable as in the simpler version above. The list items shown in the type-based search mentioned earlier each map to a datastore item and can be double-clicked to open a details view.
Many of the steps mentioned above may already be implemented in the web application for various other purposes. Since the search merely filters the enumeration of datastore items by matching their names against the text input, the functionality is simple; moreover, the search pane and results can follow the same layout as the rest of the application.

Thursday, April 18, 2013

Interview question answers continued

Q: design a file system object model
A: A file system exposes files and folders in a tree form. Both are file system objects that are manageable in terms of resolving their location, securing their access and iterating over their contents. Internally, there is a layering of file manager, file, block and disk based access.

class FileSystemObject { };  // common base: location, security, enumeration

class File : public FileSystemObject  // public inheritance models the is-a relationship
{
public:
    // data operations (types mirror .NET's System.IO for this sketch)
    std::vector<char> ReadAllBytes();
    std::vector<std::string> ReadAllLines();
    std::fstream Read(FileMode m);
    void Encrypt();
    void Decrypt();

    // metadata operations
    void SetAccessControl(FileSecurity s);
    void SetCreationTime(DateTime t);
};

class Directory : public FileSystemObject
{
public:
    // access operations
    std::vector<File> GetFiles();
    std::vector<Directory> GetDirectories();

    // metadata operations
    void SetAccessControl(DirectorySecurity s);
    void SetCreationTime(DateTime t);
};
 

Wednesday, April 17, 2013

interview question answers continued

Q: How do you determine if a binary tree is binary search tree ?
A: // An in-order traversal of a BST must yield strictly increasing values.
// Call with prev pointing at a value initialized to INT_MIN.
bool IsBST(node* root, int* prev)
{
    if (root == NULL) return true;

    if (!IsBST(root->left, prev)) return false;   // left subtree first

    if (*prev >= root->val) return false;         // current must exceed predecessor
    *prev = root->val;

    return IsBST(root->right, prev);              // then the right subtree
}

Tuesday, April 16, 2013

Chapter 24 DB systems textbook

To build a decision tree:
Input : node n, partition D, split selection method S
Output: decision tree for D rooted at node n

Top-Down decision tree induction schema:
BuildTree(Node n, data partition D, split selection method S)
   Apply S to D to find the splitting criterion
   if ( a good splitting criterion is found)
       Create two children nodes n1 and n2 of n
       Partition D into D1 and D2
       BuildTree(n1, D1, S)
       BuildTree(n2, D2, S)
   endif

Scalable version (for large partitions, using AVC groups):
Input : node n, partition D, split selection method S
Output: decision tree for D rooted at node n

Top-Down decision Tree induction Schema:
BuildTree(Node n, data partition D, split selection method S)
(1a) Make a scan over D and construct the Attribute-Value, Classlabel (AVC) group of n in memory.
(1b) Apply S to the AVC group to find the splitting criterion

Clustering: A partitioning clustering algorithm partitions the data into k groups such that some criterion that evaluates the clustering quality is optimized, where k is specified by the user. A hierarchical clustering algorithm generates a sequence of partitions of the records: starting with a partition in which each cluster consists of one single record, the algorithm merges two partitions in each step until one single partition remains in the end.

Monday, April 15, 2013

keyword clustering

Wartena 08 describes topic detection by clustering methods. The paper suggests that the topic for the whole text can be found based on a probabilistic latent semantic analysis: keywords are clustered based on a statistical similarity measure, and the conditional probability distributions that indicate the latent topics are the clusters this approach finds. Clustering is based on a hierarchical top-down method which requires that the centers be recomputed in each step. Two elements are selected that have the largest distance and are used as seeds for the clustering. Next, all other items are assigned to the center closest to one of the two seeds. After all the items are assigned, the centers are recomputed. The new centers serve as new seeds and the process is repeated until the two centers converge. If a resulting cluster is larger than a threshold, the whole procedure is repeated on that cluster, and thus we find a binary tree of clusters. The distance functions used in this paper for two terms t and s are the cosine similarity of the document distributions, the cosine similarity of the tf-idf vectors of the keywords, the Jensen-Shannon divergence between the document distributions, and the Jensen-Shannon divergence between the term distributions pt and ps.
When clustering with any such distance function, the cluster centers could be chosen among the data points and then recomputed in each step, as in the sketch below.
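A sketch of one level of that top-down split for scalar data; abs() stands in for whichever distance function is chosen, and a distribution distance such as the Jensen-Shannon divergence would replace it for keyword distributions:

def bisect_cluster(items, tol=1e-9):
    # seed the two centers with the farthest pair (min and max in one dimension)
    c1, c2 = min(items), max(items)
    while True:
        g1 = [x for x in items if abs(x - c1) <= abs(x - c2)]
        g2 = [x for x in items if abs(x - c1) > abs(x - c2)]
        if not g2:                    # degenerate case: all items coincide
            return g1, g2
        m1, m2 = sum(g1) / float(len(g1)), sum(g2) / float(len(g2))
        if abs(m1 - c1) < tol and abs(m2 - c2) < tol:  # centers converged
            return g1, g2
        c1, c2 = m1, m2

left, right = bisect_cluster([1.0, 1.1, 0.9, 8.0, 8.2, 7.9])
# recurse on any cluster larger than the size threshold to grow the binary tree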

Saturday, April 13, 2013

interview question answers continued

Q: How do you update the game of a black and white coin game such as Othello ?
A: For the move being considered, find the longest sequence of coins of the color opposite to the one being played. If this sequence can be bounded at both ends, place the coin at one of the ends; the choice of end can be based on the side that has an odd number of open spaces. The set of such sequences can be derived from the state of the game and can generally be identified by expanding out from the center.

 Q: Given the root of a binary search tree along with a node inside it, find the successor node for the given node.
A: The structure of a binary search tree lets us determine the successor of a node without ever comparing keys. If the right subtree of node x is non-empty, then the successor of x is the leftmost node of the right subtree. If the right subtree of node x is empty and x has a successor y, then y is the lowest ancestor of x whose left child is also an ancestor of x.
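That logic translates directly into code; a small sketch with a minimal node type, assuming parent pointers since the second case needs them:

class Node(object):
    def __init__(self, key, parent=None):
        self.key, self.left, self.right, self.parent = key, None, None, parent

def leftmost(n):
    while n.left:
        n = n.left
    return n

def successor(x):
    if x.right:                   # case 1: leftmost node of the right subtree
        return leftmost(x.right)
    y = x.parent                  # case 2: climb until x is a left child
    while y and x is y.right:
        x, y = y, y.parent
    return y                      # None means x was the maximum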

Q: Given a lot of intervals [ai, bi], you have to find  a point with the intersection of most number of intervals.
A: We can maintain a count of how many intervals include each point as we traverse the number line from the minimum ai to the maximum bi; the point with the highest count wins.
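Concretely, the counting can be done by sweeping sorted endpoints rather than every integer point; a sketch:

def busiest_point(intervals):
    # return (point, count) with the most overlapping closed intervals [a, b]
    events = []
    for a, b in intervals:
        events.append((a, 1))    # interval opens
        events.append((b, -1))   # interval closes
    events.sort(key=lambda e: (e[0], -e[1]))  # at ties, opens before closes
    best, count = (None, 0), 0
    for point, delta in events:
        count += delta
        if count > best[1]:
            best = (point, count)
    return best

print(busiest_point([(1, 4), (2, 6), (5, 8)]))  # (2, 2)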

Friday, April 12, 2013

interview question answers continued

Q: Give the four basics of object oriented Programming.
A: These are some of the tenets of object oriented programming
1. Encapsulation - define a concept by keeping the data and its operations together
2. Inheritance - defines an "is-a" relationship between derived and base class.
3. Composition - defines a "has-a" relationship between a container and contained class.
4. Polymorphism - the same interface produces different behavior depending on the runtime type.

Q: Given two integers m, n, find the first integer r that is divisible by both m and n.
A: That is the least common multiple of m and n: r = (m * n) / gcd(m, n), where the gcd comes from Euclid's algorithm. For example, for m = 4 and n = 6, gcd = 2 and r = 12.
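A sketch of that computation:

def gcd(a, b):
    # Euclid's algorithm
    while b:
        a, b = b, a % b
    return a

def first_common_multiple(m, n):
    return m * n // gcd(m, n)   # lcm(m, n)

print(first_common_multiple(4, 6))  # 12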

Q: List and compare the various sorting algorithms
A: 1. Bubble sort : make multiple passes, swapping adjacent out-of-order elements. Average O(n^2) complexity.
2. Selection sort : find the minimum and swap it into the first position, then repeat for the rest. Another O(n^2) algorithm.
3. Insertion sort : like arranging cards in a hand, take elements one by one and insert each into position in the sorted prefix. Average O(n^2).
4. Heap sort : treat the array as a heap and repeatedly extract via the heap algorithm. O(n log n) complexity.
5. Merge sort : bottom-up merging of sorted runs. O(n log n) complexity.
6. Quick sort : divide and conquer by partitioning around a pivot value. O(n log n) on average, O(n^2) in the worst case.

Wednesday, April 10, 2013

interview question answers continued

Q: Given an integer array, find whether there exist three integers that add up to zero.
A: This is another integer-array problem where we are interested in a particular sum. Since we are looking for three integers, the naive solution looks O(n^3), but we can do better. First sort the elements; then, for each element we encounter, we know the complement that the remaining pair must sum to, and we can search for that pair with two pointers over the rest of the sorted array. Sorting first is what makes this efficient.
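A sketch of the sorted two-pointer version, which brings the naive O(n^3) scan down to O(n^2):

def three_sum_zero(nums):
    nums = sorted(nums)
    for i in range(len(nums) - 2):
        lo, hi = i + 1, len(nums) - 1   # look for a pair summing to -nums[i]
        while lo < hi:
            s = nums[i] + nums[lo] + nums[hi]
            if s == 0:
                return nums[i], nums[lo], nums[hi]
            if s < 0:
                lo += 1
            else:
                hi -= 1
    return None

print(three_sum_zero([4, -1, 7, -3, 2]))  # (-3, -1, 4)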

Q: Given a set of login and logout pairs of users, find the number of users online at any given point of time.
A: Sort the login and logout events by time and sweep: increment a counter on each login and decrement it on each logout; the counter at any time point is the number of users online then. Keeping the counts in a per-user dictionary additionally gives the number of active sessions for each user.

Q: Another integer array is given in which zeros occur repeatedly. Compact the array by erasing the zeros.
A: The problem hints at using the same storage for the resultant array, so we can lean on multiple passes if we want. We could make one pass to find the indexes of all occurrences of zero, and for each occurrence shift the following elements left, shifting by more than one position at a time when the occurrences are consecutive. Simpler still is a single pass with a write pointer: copy each non-zero element to the next write position and truncate at the end.

Q: Given a set of non-overlapping integer ranges and an input integer, design a data structure to store the ranges and support insert, delete and search operations.
A: Since the ranges are non-overlapping, keeping them sorted by their start in an array or a balanced tree suffices: search is a binary search for the range whose start is the largest not exceeding the input, and insert and delete splice ranges in and out at the found position, tracking where one range ends and another begins.

Q: How can you tell if a given string is of the form aZb, where b is the reverse of a?
A: If a and b may be empty strings, almost all strings satisfy the form; otherwise compare the characters at the start and end indexes, moving both inward on each match, until they meet at Z.
 

Monday, April 8, 2013

interview question answers continued

Q: In Java differentiate between final, finally and finalize ?
A: The keyword final in Java denotes that which cannot be reassigned, overridden or subclassed, depending on what it is applied to. finally marks the block that always runs after try/catch exception handling, and finalize is invoked by the runtime during garbage collection of objects.

Q. Given two arrays, how will you find the intersection ?
A. If the intersection is defined as the set of elements that appear in both arrays in no particular order, and both arrays can be arbitrarily large or have elements from an arbitrary range, then one approach is to sort both arrays and walk down them element by element, putting the common ones in a bag, until one of the arrays is exhausted.

Q. Given two squares by their four corners coordinates, find the coordinates of their intersection.
A. Take the minimum and maximum of all x coordinates and likewise for the y coordinates. If these come from the same square, one square is contained in the other. Otherwise the squares are either overlapping or disjoint, which you can tell from the diagonally opposite corners given by the min and max. If they do overlap, we already have the coordinates of two diagonally opposite corners of the intersection, and the other two corners are points common to both squares.

Q. What is the difference between a hash table and a hash map.
A. A hash table stores values based on the hash computed from the value. A hash map stores key value pairs based on the hash of the key. In C++ a hash map is used where < is not defined or is unsuitable for the intended key. A good hash function is a pre-requisite for both, or there can be collisions for some entries. For example:
template<class T> size_t Hash<T>::operator() (const T& key) const
{
    size_t res = 0;
    size_t len = sizeof(T);
    const char* p = reinterpret_cast<const char*>(&key);
    while (len--) res = (res << 1) ^ *p++;   // fold each byte into the running hash
    return res;
}

Q: What is the significance of the term "dead beef"
A: 0xDEADBEEF is a recognizable hexadecimal fill pattern written into memory so that uninitialized or freed memory is easy to spot in a debugger. Debug runtimes use similar distinctive patterns; for example, the Microsoft debug CRT fills uninitialized heap buffers with 0xCD and no man's land guard bytes with 0xFD, while 0xCC marks uninitialized stack memory.

Q: Write a C code to find the depth of a binary tree.
A: int GetDepth(Node* root)
     {
       if (root == NULL) return 0;
       int left = GetDepth(root->left);
       int right = GetDepth(root->right);
       // a node's depth is one more than that of its deeper subtree
       return 1 + (left > right ? left : right);
     }

Sunday, April 7, 2013

interview question answers continued

Q: Given an array of integers ( both positive and negative in random order ), find the subarray with the largest sum.
A: This has a surprisingly simple solution that keeps track of a local and a global maximum sum. If including the next number does not improve on the number by itself, we reset the running sum to that number; we repeat for the length of the array, recording the best running sum seen.
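That is Kadane's algorithm; a compact sketch:

def max_subarray_sum(nums):
    best = current = nums[0]
    for x in nums[1:]:
        current = max(x, current + x)   # extend the run or restart at x
        best = max(best, current)
    return best

print(max_subarray_sum([2, -8, 3, -2, 4, -10]))  # 5, from the subarray [3, -2, 4]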

Q: How are cookies passed in the HTTP protocol ?
A: The server sends a set-cookie:name=value to the client, in response to which the client sends a cookie: name=value in its request header.

Q: How will you search for a word in a very large database ?
A: There are built-in operators that let you search for text, such as CONTAINS, LIKE and the full-text predicates. If we don't know which table to look in, we can use full text search.

Q: Explain the functioning or working of a search engine like Google :
A: As explained in the anatomy of a search engine paper, here are some of the steps:
1. A URL server sends lists of URLs to be fetched to the crawlers.
2.  Several distributed crawlers download web pages based on this list.
3. The web pages that are fetched are then sent to the store server.
4. The store server then compresses and stores the web pages in a repository.
5. Every web page has an associated ID number called a docID, which is assigned whenever a new URL is parsed out of a web page.
6. An indexer and sorter reads the repository, uncompresses the documents and parses them to index the word occurrences, called hits. The indexer distributes these words into a set of barrels. It also stores the links in an anchors file.
7. The URL resolver reads the anchors file, converts relative URLs into absolute URLs and then into docIDs, and generates a database of URL pairs used to compute page ranks.
8. The sorter takes the barrels sorted by docID and resorts them by wordID to generate the inverted index. The sorter also produces a list of wordIDs and offsets into the inverted index.
9. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher.
10. The searcher is run by a web server and uses the lexicon built by the DumpLexicon together with the inverted index and page ranks to answer queries.

Saturday, April 6, 2013

interview questions answers continued

Tree Flattening
Question : How do you flatten a binary tree ?
A binary tree can be flattened into an array by storing the left and right children of the node at index i at positions 2i and 2i+1, as in the case of a heap.
A binary tree can be flattened into a linked list breadth-first by keeping a queue: as you dequeue each node, append it to the list and enqueue its children. You can keep pointers to the next sibling at the same level, or mark the leftmost node of each level in your linked list so that all the nodes at one level of the tree occur between two marked nodes.
There are other solutions as well, using recursion or extra storage to keep track of positions in the tree.

Design a class library for writing game cards.
A deck of game cards is a collection of distinct cards that lend themselves to various groupings and operations. Use a collection that works with LINQ-like methods.

How will you write a find and replace tool?
A find and replace tool takes two input strings: one used for searching and the other used to replace each occurrence. These are independent operations. When text is to be deleted, the replacement string is empty; in that case the rest of the string after the occurrence is moved or copied. The search for the next occurrence can resume from the current location, or all occurrences can be found up front.

There is a very large array in which all the numbers are repeated once except one number. Tell how you will find that number.
We can XOR all the numbers together: each number that appears twice cancels itself out, leaving the number that appears only once. (Maintaining candidate and exclusion lists, moving numbers seen before into the exclusion list, also works but needs extra storage.)

Friday, April 5, 2013

interview questions answers continued

Question : How do you find the edit distance between strings ?
Answer : You calculate the Levenshtein distance between the strings, with memoization. This is how it works:
If one of the input strings is empty, the distance is the length of the other string:
D(i,0) = i
D(0,j) = j
Otherwise D(i,j) is the minimum of:
1) D(i-1, j) + 1 ( i.e. delete a letter of the left string )
2) D(i, j-1) + 1 ( i.e. insert a letter of the right string )
3) D(i-1, j-1) + cost, where cost = 1 if S1[i] != S2[j] else cost = 0 ( i.e. substitute )

Tail recursion on suffixes could also be used, with the same three kinds of step and the same cost rule.

For the sample inputs "cat" and "bat", the memoized distances between suffixes (row i of "cat" versus column j of "bat") come out as follows:

         j=0 B   j=1 A   j=2 T
i=0 C      1       1       2
i=1 A      1       0       1
i=2 T      2       1       0

For example, the top-left entry is the answer D("cat", "bat") = 1.


memo = {}
# i is the start index into str1 and j the start index into str2;
# len1 and len2 are the lengths of the remaining suffixes
def LevenshteinDistance(str1, i, len1, str2, j, len2):
    key = (i, len1, j, len2)
    if key in memo:
        return memo[key]
    if len1 == 0:
        return len2
    if len2 == 0:
        return len1
    cost = 0
    if str1[i] != str2[j]:
        cost = 1
    dele = LevenshteinDistance(str1, i+1, len1-1, str2, j, len2) + 1
    inse = LevenshteinDistance(str1, i, len1, str2, j+1, len2-1) + 1
    upda = LevenshteinDistance(str1, i+1, len1-1, str2, j+1, len2-1) + cost
    dist = min(dele, inse, upda)   # the builtin three-argument min suffices
    memo[key] = dist
    return dist
print LevenshteinDistance("cat", 0, 3, "bat", 0, 3)

prints 1

C++ version:
#include <map>
#include <sstream>
#include <string>
using namespace std;

map<string, int> m;

int min3(int a, int b, int c)
{
    int d = a;
    if (d > b)
        d = b;
    if (d > c)
        d = c;
    return d;
}

int getLD(char* str1, int i, int len1, char* str2, int j, int len2)
{
    stringstream sout;
    sout << i << "," << len1 << "," << j << "," << len2;
    string key = sout.str();       // memo key built from the four indexes
    if (m.find(key) != m.end())
        return m[key];
    if (len1 == 0)
        return len2;
    if (len2 == 0)
        return len1;
    int cost = 0;
    if (str1[i] != str2[j])
        cost = 1;
    int dele = getLD(str1, i+1, len1-1, str2, j, len2) + 1;
    int inse = getLD(str1, i, len1, str2, j+1, len2-1) + 1;
    int upda = getLD(str1, i+1, len1-1, str2, j+1, len2-1) + cost;
    int dist = min3(dele, inse, upda);
    m[key] = dist;
    return dist;
}










Thursday, April 4, 2013

interview question answers continued

Another interview question: given a circularly sorted (rotated sorted) array of integers, how will you find a given integer?
In this case one factor works for us: the collection is sorted, though rotated by an offset that we could ignore if we knew the length of the array. A circular array index is bounded by the modulo of the length. (For a circular linked list, by contrast, we could confirm circularity by checking whether a regular pointer and a skip-level pointer meet at the same node.)
Starting from any position, the integers in one direction are monotonically increasing or decreasing, while in the other direction they may or may not be. So start with the first and last index positions, find the mid point, decide which half is sorted and whether the value lies in it, and update start and end to select a half in each iteration, until a find or until start and end cross over.

public int FindSorted(int[] input, int val)
{
    int start = 0;
    int end = input.Length - 1;
    while (start < end)
    {
        int mid = start + (end - start) / 2;
        if (input[mid] == val) return mid;
        if (input[start] <= input[mid])
        {
            // the left half is sorted
            if (input[start] <= val && val < input[mid])
                end = mid - 1;
            else
                start = mid + 1;
        }
        else
        {
            // the right half is sorted
            if (input[mid] < val && val <= input[end])
                start = mid + 1;
            else
                end = mid - 1;
        }
    }
    if (input[start] == val) return start;
    if (input[end] == val) return end;
    return -1;
}

Wednesday, April 3, 2013

interview questions answers continued

This follows my earlier post on interview questions and answers:
Question: If you are given a string of length N and M strings each of length L, how will you find whether each of the M strings occurs in the larger string?
Answer: There are several approaches to this problem.
The first approach is to iterate over each of the M strings to see if it occurs in the original string.
For each pattern, we can do one of the following:
1) Brute force method : iterate through each character of the original string to see if it is the start of the pattern, and for the length of the pattern, match the subsequent characters to see if the pattern exists there.
2) Finite automaton : this is an improvement on the pattern-match step. We build a string-matching automaton (in the style of Knuth-Morris-Pratt) that lets us complete each pattern match in linear time.
3) All patterns at once : this is an improvement on iterating over the patterns. Instead of trying them one by one, the match state of every pattern's automaton is advanced together as the characters of the original string are scanned, which is what the Aho-Corasick construction does.

Tuesday, April 2, 2013

stored procedures for semantic or fulltext search

Semantic search stored procedures in SQL Server enable you to register and unregister a pre-populated semantic language statistics database. Full-text search has to be enabled for this to work; to index Office 2010 documents, filter packs are available.
A user interface with a clean search page that lets users search for entities based on full-text can be built over a full text stored procedure. When full-text is enabled, queries can specify CONTAINS or FREETEXT to return the results. The full text indexing service runs separately and can be expected to stay up to date with changes to the domain objects; hence the stored procedure merely needs to execute a full-text query. The reason we want to put the query in a stored procedure is that it gives better performance and security, which is especially relevant when the search page is vulnerable to SQL injection attacks.
When the search page uses a form to query against one or more fields of domain entities, there are patterns we can use with service calls, ORM queries or database queries. One pattern, for example when searching for a name and an SSN, is for the UI to build the predicate in the context of the domain entity selected on the screen. Another pattern is to query against one or more entities based on common attributes. The goal with such patterns is to classify and streamline the different access patterns to the data. Sometimes a materialized view helps; other times the ORM may have methods to look up entities via some top level domain entity.
Search can be presented on a dedicated page or inset into various other pages, and the search results should be presented accordingly.

 

WPF UI control layering

The UI controls for an application are built from custom libraries over .NET, which is in turn built over the underlying windowing framework. Controls that are not visible in a layer should still be accessible from the layers below.

Monday, April 1, 2013

BIRCH review

This is a review of "BIRCH: a new data clustering algorithm and its application".
BIRCH stands for "Balanced Iterative Reducing and Clustering using hierarchies". It is an efficient and scalable data clustering method. It is used to build an interactive pixel classification tool and generating the initial code-book for compression. Data clustering refers to the problem of dividing N data points into K groups so that the data points are This system works well with very large data-sets because it first generates a more compact summary that retains as much distribution information as possible, then clustering the data summary as much as possible. BIRCH lets other clustering algorithms be applied to its summary.
BIRCH proposes a data structure called a CF-tree to efficiently store the summary as clusters. It builds the tree as it encounters the data points, inserting each point based on one or more distance metrics, and then applies clustering algorithms to the leaf nodes. As opposed to partitioning clustering, especially the widely used k-means algorithm, BIRCH is a hierarchical clustering algorithm. In partitioning, we pick initial centers of k clusters, assign every data instance to the closest cluster, and compute new centers of the k clusters. A hierarchical clustering is one with multiple levels of clustering, as in a dendrogram: level 1 has n clusters and level n has one cluster. Taking n = 8 as an example, we first calculate the similarity between all possible pairs of data instances and merge the closest pair, giving one merged cluster at level 2 (k = 7 clusters). Two other closest data points are merged for level 3 (k = 6) and again for level 4 (k = 5). At level 5 (k = 4), the most similar of the previously formed clusters merge into a new cluster. At level 6 (k = 3), we calculate the similarity between the new cluster and all remaining clusters. At level 7 there are only two clusters, and at level 8 there is one. Thus the running time of partition clustering is O(n) per pass while that of hierarchical clustering is O(n^2 lg n), but it's the richer output for which the latter is chosen. Both need to store the data in memory.
BIRCH does incremental and dynamic clustering of incoming objects with only one scan of data and subsequent application of clustering algorithms to leaf nodes.

BIRCH clustering works in the agglomerative style where the cluster centers are not decided initially. Only the maximum
number of cluster summaries and the threshold for any cluster are decided. These two factors are chosen because we want to keep the cluster summaries in memory even for arbitrarily large data sets.
The algorithm reads each of the data points sequentially and proceeds in the following manner:
Step 1: Compute the distance between record r and each of the cluster centers. Let i be the index of the cluster whose center is closest to r.
Step 2: For that i'th cluster, recompute the radius as if record r were inserted into it. If the cluster threshold is not exceeded, insert r and proceed to the next record. If it is, start a new cluster containing only record r.
Step 3: Check that step 2 does not exceed the maximum number of cluster summaries. If it does, increase the threshold so that existing clusters can accommodate more records or can be merged into fewer cluster summaries.