Cluster computing

Wednesday, April 24, 2013

R-Tree and clustering to find geographical locations

We discussed R-Tree previously in this post here. Today we want to use it together with clustering to search for restaurants across the nation to find areas that have highest number of restaurants with best customer satisfaction ratings for a particular cuisine. Here we will use the R-Tree for storing geographic co-ordinates of the restaurants and to use the minimum bounding rectangle that comes with it. In addition, we will be using agglomerative clustering to group these restaurants into a hierarchy

Tuesday, April 23, 2013

Implementation of WPF event handlers

WPF event handlers can be scoped to the objects that are changing. For example, if there's a collection of items that changes and an event handler registered for OnChange then the event handler is invoked for any updates to the items of the collection. However, if the collection reference itself is updated via a new, this event handler subscription is lost. In such a case, it is better to add a event handler to the parent holding the collection and adding a event handler there as say OnRefreshed. This event handler should unsubscribe from the old OnChange event handler, update the new reference to the collection and add the new OnChange event handler.

Monday, April 22, 2013

Hang Yamanishi stochastic model

Hang Yamanishi propose a stochastic topic model where w denotes a set of words and K a set of topics. A probability distribution of words P(W|K). Here the value of P(W|K) will be zero if w is not included in K and the probability w in W is Sum(P(k)P(w/k)). Each such stochastic topic model (STM) is defined one for each block. Word clusters are created from a large data corpus. All words in a vocabulary are treated as seed words and for each seed word, matching words are found from the corpus that co-occur with the seed word and these are grouped in a cluster. The match is based on a value deltaSC which exceeds a threshold gamma and P(w|s) > P(w) is satisfied as that which occurs significantly frequently. Since the (s,w) pairs are built from a large corpus data, this does not have to rebuild for every candidate in a given document. From the results of the clustering, select any cluster topic whose seed word is included among the selected keywords and merge two clusters ki and kj where the seed word of ki is included in kj and vice versa. Merge only if the words belong to different word sets. Lastly, topic segmentation is performed on n candidates based on similarity measures by finding local minima between similarity measures of consecutive candidates say at i and then finding two most distant candidates P1, P2 adjacent to i and on either side such that P1-S(i) > theta and P2-S(i) > theta and then segmenting there.

Sunday, April 21, 2013

Birch clustering

BIRCH clustering works in the agglomerative style where the cluster centers are not decided initially. Only the maximum
number of cluster summaries and the threshold for any cluster is decided. These two factors are chosen because we want to keep the cluster summaries in memory even for arbitrarily large data sets.
The algorithm reads each of the data points sequentially and proceeds in the following manner:
Step 1: Compute the distance between record r and each of the cluster centers. Let i be the cluster index such that the distance between r and Center is the smallest.
Step 2: For that i'th cluster, recompute the radius as if the record r is inserted into it. If the cluster threshold is not exceeeded, we can proceed to the next record. If not, we start a new cluster with only record r
Step 3: Check that the step 2 does not exceed the maximum cluster summaries. If so, increase threshold such that existing clusters can accomodate more records or they can be merged to fewer cluster summaries.

Friday, April 19, 2013

Building a search page for a web application.

Let us assume that we have a three tier stack to create the search page for a web application. Next let us abstract the store for the data and focus on the controller and views instead. We will require a view model and some WPF controls. We could consider an editable control that filters the list of entities from the data store for each character input. This control merely has a textbox for search. The ViewModels handle the bindings to the Name attribute of entities. Autocomplete feature could also be added subsequently as a separate handler on each character input. This is possible because we can check against IsEditing and IsEditingAllowed flags on what the controls should do.
Changes will be made not only to the view model but also to the views. Initially, this search could appear in a separate page but this is something that could be shared across different views. Moreover, this view could merely hold a single custom control. Since the results of the search and the search control can be a template for different domain entities.
The control for displaying the search text box will be static and it can have a default text ink visible. The text control will have a dependencyproperty and setup for bi-directional flow. Each time the text changes in the control, the handlers evaluate against the entities enumerated from the datastore. The execute button is not required in this case.
Initially, the search results can all be similar and dependent on the type of the results to be returned. This is simpler and allows for users to reuse the search across different domain entities.
However, the search results need not be homogoneous and can have hierarchical data or grouping. The results view is independent of the search performed and can display a list of items in a tree format based on some pre-determined format.
If the data store provider were to be expanded to include full text search that returns results across domain entities, then the entities could be described in a grid but they won't be double clickable as in the simpler version above. The list items shown in the type based search mentioned earlier will each map to the datastore item and can be double-clicked to open a details view.
Many of the steps mentioned above may alread be implemented in the web application for various other purposes. Since the search results filter the enumeration of datastore item based on their names against the text input, the functionality is simple and moreover, the form of the search pane and results can follow the same layout as the rest of the application.

Thursday, April 18, 2013

Interview question answers continued

Q: design a file system object model
A: a file system exposes files and folders in a tree form. Both these are file system objects that are manageable in terms of resolving the location, securing their access and iterating over their contents. Internally, their is a layer of file manager, file, block and disk based access.

class File : private FileSystemObject
{
public:
// data operations
byte[] ReadAllBytes();
string[] ReadAllLines();
fstream Read(FileMode m);
void Encrypt();
void Decrypt();

// metadata operations
void SetAccessControl(FileSecurity s)
void SetCreationTime(DateTime t)
}

class Directory : private FileSystemObject
{
public:
// access operations
void GetFiles();
void GetDirectories();
// metadata operations
void SetAccessControl(DirectorySecurity s);
void SetCreationTime(DateTime t)
}

Wednesday, April 17, 2013

interview question answers continued

Q: How do you determine if a binary tree is binary search tree ?
A: bool IsBST( node* root, int* prev)
{
bool ret = true;
if (root == null) return ret;

ret = IsBST(root->left, prev);
if (ret==false) return ret;

if (*prev >= root->val) return false;
*prev = root->val;

ret = IsBST(root->right, prev);
if (ret==false) return ret;

return ret;
}