Thursday, March 14, 2013

Red black trees revisited

Among the data structures in a computer science textbook, red-black trees possibly capture the imagination most, with their colorful transformations during rotations and recolorings of subtrees.
So let's revisit this data structure and its insert and delete operations.
For a red-black tree:
1)  Every node is either red or black.
2) The root node is black.
3) The sentinel nodes are black.
4) If a node is red, then both its children are black.
5) For each node, all paths from that node down to the leaves have the same number of black nodes.
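As a quick illustration of these invariants, here is a minimal sketch that checks properties 4 and 5 by recursion. The Node class and its field names are my own choices for this sketch, not any particular library's API; a None child plays the role of a black sentinel leaf.

class Node:
    def __init__(self, key, color, left=None, right=None):
        self.key = key
        self.color = color      # "red" or "black"
        self.left = left
        self.right = right

def check_rb(node):
    """Return the black-height of the subtree, or raise if a property fails."""
    if node is None:
        return 1                # property 3: sentinel leaves count as black
    if node.color == "red":
        for child in (node.left, node.right):
            if child is not None and child.color == "red":
                raise ValueError("property 4 violated: red node with red child")
    left_bh = check_rb(node.left)
    right_bh = check_rb(node.right)
    if left_bh != right_bh:
        raise ValueError("property 5 violated: unequal black-heights")
    return left_bh + (1 if node.color == "black" else 0)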

Left rotate operation: the right child becomes the new parent of the subtree. The left subtree of this child becomes the right subtree of the displaced node.

Right rotate operation: the left child becomes the new parent of the subtree. The right subtree of this child becomes the left subtree of the displaced node.
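A minimal sketch of both rotations, assuming nodes also carry a parent pointer and the tree object keeps a root reference (my own field names, following the textbook-style pseudocode rather than a library):

def left_rotate(tree, x):
    y = x.right                  # y, the right child, becomes the new parent
    x.right = y.left             # y's left subtree becomes x's right subtree
    if y.left is not None:
        y.left.parent = x
    y.parent = x.parent          # link y where x used to be
    if x.parent is None:
        tree.root = y
    elif x is x.parent.left:
        x.parent.left = y
    else:
        x.parent.right = y
    y.left = x                   # the displaced node x becomes y's left child
    x.parent = y

def right_rotate(tree, x):
    # Mirror image: the left child y becomes the new parent.
    y = x.left
    x.left = y.right
    if y.right is not None:
        y.right.parent = x
    y.parent = x.parent
    if x.parent is None:
        tree.root = y
    elif x is x.parent.right:
        x.parent.right = y
    else:
        x.parent.left = y
    y.right = x
    x.parent = y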

Insert operation: goes very much like a binary search tree insert followed by an RB-tree fixup. To insert a node z into a tree T, we walk down the tree with x starting at the root and y trailing as x's parent. Then we append z as the left or the right child of y and color the new node z red. Then we fix up the tree. Fixup iterates as long as z's parent is red. If z's uncle is red, we recolor the parent, uncle, and grandparent and move the pointer z up to the grandparent. If z and its parent are both red but z's uncle is black, then if z is the right child of z.p we first perform a left rotation on z.p; then we color z's parent black and its grandparent red and perform a right rotation on the grandparent (with left and right exchanged when z's parent is itself a right child). Finally the root is colored black.
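A sketch of the insert and its fixup, building on the Node class and the left_rotate/right_rotate helpers sketched above, and using None in place of the black sentinel (an assumption of this sketch):

def is_red(n):
    return n is not None and n.color == "red"

def rb_insert(tree, z):
    # Ordinary BST insert: walk down with x, trailing parent y.
    y, x = None, tree.root
    while x is not None:
        y = x
        x = x.left if z.key < x.key else x.right
    z.parent = y
    if y is None:
        tree.root = z
    elif z.key < y.key:
        y.left = z
    else:
        y.right = z
    z.left = z.right = None
    z.color = "red"                              # new nodes start out red
    rb_insert_fixup(tree, z)

def rb_insert_fixup(tree, z):
    while is_red(z.parent):
        if z.parent is z.parent.parent.left:
            uncle = z.parent.parent.right
            if is_red(uncle):                    # case 1: recolor, move z up
                z.parent.color = uncle.color = "black"
                z.parent.parent.color = "red"
                z = z.parent.parent
            else:
                if z is z.parent.right:          # case 2: rotate into case 3
                    z = z.parent
                    left_rotate(tree, z)
                z.parent.color = "black"         # case 3: recolor and rotate
                z.parent.parent.color = "red"
                right_rotate(tree, z.parent.parent)
        else:                                    # mirror cases, left/right exchanged
            uncle = z.parent.parent.left
            if is_red(uncle):
                z.parent.color = uncle.color = "black"
                z.parent.parent.color = "red"
                z = z.parent.parent
            else:
                if z is z.parent.left:
                    z = z.parent
                    right_rotate(tree, z)
                z.parent.color = "black"
                z.parent.parent.color = "red"
                left_rotate(tree, z.parent.parent)
    tree.root.color = "black"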

Delete operation: also goes very much like a binary search tree delete followed by an RB-tree fixup. If z is the node to be deleted from a tree T, we maintain a node y as the node either to be removed or to be moved within the tree. We also record y's original color, since y's color could change. We also maintain x as the node that moves into y's original position so that we can transplant it; when z has at most one child, x is z's sole child (or the sentinel). If z has both children, we choose y as the tree minimum of z's right subtree and x as y's right child, so that we may perform the same transplant operation on x as earlier, and then we transplant z and set y's color to z's color. If the original color of y was black, we perform the RB-tree fixup for delete. We begin the fixup with x and iterate while x is not the root and x's color is black. In each iteration we take w as x's sibling and perform rotations and recolorings. Suppose x is the left child. If w is red, we color w black, color x's parent red, and left rotate x's parent. If both of w's children are black, we simply color w red and move x up to its parent. Else if only w's right child is black, we color w's left child black, color w red, right rotate w, and take the new w to the right of x's parent. If w is black and w's right child is red, we set w's color to x's parent's color, color both x's parent and w's right child black, and left rotate x's parent; x then becomes the root, which ends the loop. Since we started with x being the left child, the same cases apply when x is the right child but with left and right exchanged. Finally x is colored black.
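A sketch of the delete fixup only, assuming the transplant-based delete described above has already removed or moved y and handed us x. Because this sketch uses None instead of a real sentinel node, x's parent is passed in explicitly, which is a departure from the textbook version:

def is_black(n):
    return n is None or n.color == "black"

def rb_delete_fixup(tree, x, parent):
    # x may be None (standing in for the sentinel); parent is x's parent.
    while x is not tree.root and is_black(x):
        if x is parent.left:
            w = parent.right                     # sibling of x
            if not is_black(w):                  # case 1: red sibling
                w.color, parent.color = "black", "red"
                left_rotate(tree, parent)
                w = parent.right
            if is_black(w.left) and is_black(w.right):
                w.color = "red"                  # case 2: push the extra black up
                x, parent = parent, parent.parent
            else:
                if is_black(w.right):            # case 3: rotate into case 4
                    w.left.color = "black"
                    w.color = "red"
                    right_rotate(tree, w)
                    w = parent.right
                w.color = parent.color           # case 4: final rotation
                parent.color = "black"
                w.right.color = "black"
                left_rotate(tree, parent)
                x = tree.root
        else:                                    # mirror cases, left/right exchanged
            w = parent.left
            if not is_black(w):
                w.color, parent.color = "black", "red"
                right_rotate(tree, parent)
                w = parent.left
            if is_black(w.right) and is_black(w.left):
                w.color = "red"
                x, parent = parent, parent.parent
            else:
                if is_black(w.left):
                    w.right.color = "black"
                    w.color = "red"
                    left_rotate(tree, w)
                    w = parent.left
                w.color = parent.color
                parent.color = "black"
                w.left.color = "black"
                right_rotate(tree, parent)
                x = tree.root
    if x is not None:
        x.color = "black"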

Data Structures

The heap data structure is very useful for organizing data for constant time retrieval of the maximum or minimum entry and logarithmic time for insertions and removals. The max-heapify method works like this: you take the left child, the right child, and the root of a subtree and find the maximum among the three. If the largest is not the root, you swap it with the root and recurse down into the swapped child's subtree.
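A minimal sketch of max-heapify on a 0-indexed Python list (the indexing convention is my own choice here):

def max_heapify(a, i, n=None):
    """Sink a[i] down so the subtree rooted at i satisfies the max-heap property.
    Assumes the subtrees rooted at its children are already max-heaps."""
    n = len(a) if n is None else n
    while True:
        left, right = 2 * i + 1, 2 * i + 2
        largest = i
        if left < n and a[left] > a[largest]:
            largest = left
        if right < n and a[right] > a[largest]:
            largest = right
        if largest == i:
            return
        a[i], a[largest] = a[largest], a[i]   # swap and continue in that subtree
        i = largest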

Insertion is also logarithmic: if you append the element as the last leaf, it can be floated up to its place.
To build a heap from an arbitrary array, for each index ranging from N/2 down to 1, heapify the subtree rooted at that index.
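Sketches of the two operations just described, reusing max_heapify from above; note that with 0-based indexing the build loop runs from n//2 - 1 down to 0:

def heap_insert(a, key):
    """Append the new key as the last leaf and float it up to its place."""
    a.append(key)
    i = len(a) - 1
    while i > 0 and a[(i - 1) // 2] < a[i]:
        a[i], a[(i - 1) // 2] = a[(i - 1) // 2], a[i]
        i = (i - 1) // 2

def build_max_heap(a):
    """Heapify every internal node, from the last internal node down to the root."""
    for i in range(len(a) // 2 - 1, -1, -1):
        max_heapify(a, i)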

Wednesday, March 13, 2013

web API and mobile application clients

In my experience working on a retail company's API for launching their mobile applications on Android and iOS devices, there were a few items worth recalling:
1) The APIs were external facing and hence they required authentication and authorization over the internet.
2) The APIs could be signed on to from mobile applications as well as desktop applications.
3) The APIs supported both encrypted as well as unencrypted access over Internet.
4) The APIs were versioned and had a versioning policy.
5) The APIs were categorized based on services they offered. They exposed resources for these services.
6) The APIs had a cache for improving performance. The cache used URI hashing as well as periodic refreshes from data sources.
7) The APIs were published through a web proxy that provided monitoring services.
8) The APIs were called by clients with different API keys.
9) The APIs were implemented to access data from different data sources.
10) The APIs had SLAs defined and honored; these covered timeouts and availability.
11) The APIs were performance tested with different workloads.
12) The API URIs were qualified with the resource and query parameters necessary for shopping.
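For instance, a call to a versioned shopping resource with query parameters and a per-client API key might look like the sketch below; the host, path, parameter names, and header are hypothetical, not the actual retail API:

import requests  # hypothetical client call; endpoint, params, and header are made up

response = requests.get(
    "https://api.example-retailer.com/v2/products",   # versioned, categorized resource
    params={"category": "electronics", "page": 1, "pageSize": 20},
    headers={"X-Api-Key": "CLIENT_SPECIFIC_KEY"},     # per-client API key
    timeout=5,                                        # honor the SLA timeout
)
response.raise_for_status()
products = response.json()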

Tuesday, March 12, 2013

keyword extraction using naive Bayes

Keywords could be considered local to the document they appear in. Consequently, keywords are characterized not only by term frequency but also by their appearance in a given document as opposed to others. This has been utilized in papers such as Yasin et al on keyword extraction using naive Bayes to identify whether a word belongs to the class of ordinary words or keywords. The metric is called TFxIDF, which combines term frequency and inverse document frequency: TFxIDF(W, D) = P(word in D is W) x [ -log P(W appears in a document) ]. Assuming feature values are independent, a naive Bayes classifier has been proposed in this paper with the following model:
P(key | T, D, PT, PS) = P(T|key) x P(D|key) x P(PT|key) x P(PS|key) x P(key) / P(T, D, PT, PS), where P(key) denotes the prior probability that the word is a key, P(T|key) denotes the probability of having TFxIDF score T given that the word is a key, P(D|key) denotes the probability of having neighbor distance D to the previous occurrence of the same word given that the word is a key, and P(PT|key) and P(PS|key) denote the probabilities of the remaining position-based features given that the word is a key.
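A toy sketch of this scoring, assuming the per-feature likelihoods have already been estimated; the discretization and the probability tables passed in are invented placeholders, not the paper's values:

import math

def tf_idf(word, doc_tokens, corpus_doc_freq, n_docs):
    """TFxIDF(W, D) = P(word in D is W) * -log P(W appears in a document)."""
    tf = doc_tokens.count(word) / len(doc_tokens)
    df = (corpus_doc_freq.get(word, 0) + 1) / (n_docs + 1)   # smoothed document frequency
    return tf * -math.log(df)

def keyword_score(features, likelihoods, prior_key):
    """Naive Bayes numerator P(T|key) P(D|key) P(PT|key) P(PS|key) P(key).

    `features` maps feature name -> discretized value and `likelihoods`
    maps feature name -> {value: P(value | key)}; both are placeholders here.
    The evidence P(T, D, PT, PS) is constant across words, so it is omitted
    when ranking candidate keywords.
    """
    score = prior_key
    for name, value in features.items():
        score *= likelihoods[name].get(value, 1e-6)          # small floor for unseen values
    return score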

Saturday, March 9, 2013

conceptual schema versus physical schema

The conceptual schema, sometimes called the logical schema, describes the data in terms of the data model of the DBMS. It describes all the relations that are stored in the database. These relations could store information in the form of entities and relationships. For example, entities such as employees and departments could have relationships such as employees working for a department. Each collection of entities and each collection of relationships can be described as a relation, leading to the conceptual schema. The process of arriving at a choice of relations and fields for each relation is called conceptual database design.

The physical schema specifies additional storage details, so it describes further how the conceptual schema is stored on disk. The file organizations used to store the relations and the auxiliary data structures, such as indexes that speed up data retrieval operations, are all part of the physical schema.

Friday, March 8, 2013

Computer Vision 2

There are other techniques to image segmentation. Some of them are enumerated below:
Region growing methods: Here seeds mark each of the objects to be segmented. The regions are iteratively grown by comparing all unallocated neighbouring pixels to the region. The difference between a pixel's intensity value and the region's mean is used as a similarity measure (a minimal sketch appears after this list).
Split and merge methods: This method splits the image to four quadrants and then they can be merged if they are found to be homogeneous.
Partial differential equation based methods: PDE methods involve setting up an equation, such as a curve propagation or Lagrangian equation, that parameterizes the contour. To evolve the contour, also called a snake, derivatives are computed using finite differences, and among similar choices the steepest gradient descent is chosen to minimize the energy. This leads to fast and efficient processing.
Level Set Methods: The level set method was proposed to track moving interfaces. It can be used to efficiently address the problem of curve/surface propagation in an implicit manner. The central idea is to represent the evolving contour using a signed function where its zero level corresponds to the actual contour. Then according to the motion equation of the contour, one can easily derive a similar flow for the implicit surface that when applied to the zero level will reflect the propagation of the contour.
Fast marching methods specify the evolution of a closed curve as a function of time T and speed F(x) in the normal direction at a point x  on the curve. The speed function is specified, and the time at which the contour crosses a point x is obtained by solving the equation.
Graph partitioning methods can effectively be used for image segmentation. In these methods, the image is modeled as a weighted undirected graph.
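As referenced in the region growing item above, here is a minimal sketch with NumPy that grows a single region from one seed by comparing each unallocated neighbour's intensity to the current region mean; the threshold value and 4-connectivity are my own simplifications:

import numpy as np
from collections import deque

def region_grow(image, seed, threshold=10.0):
    """Grow one region from `seed` (row, col) on a 2-D grayscale image.
    A pixel joins the region when |intensity - region mean| < threshold."""
    h, w = image.shape
    mask = np.zeros((h, w), dtype=bool)
    mask[seed] = True
    region_sum, region_count = float(image[seed]), 1
    frontier = deque([seed])
    while frontier:
        r, c = frontier.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):     # 4-connected neighbours
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and not mask[nr, nc]:
                mean = region_sum / region_count
                if abs(float(image[nr, nc]) - mean) < threshold:
                    mask[nr, nc] = True
                    region_sum += float(image[nr, nc])
                    region_count += 1
                    frontier.append((nr, nc))
    return mask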

Thursday, March 7, 2013

Image processing

Edge detection and segmentation is a form of image processing where the changes in intensity at the edges are used to detect lines or curves. The edges identified by edge detection are often disconnected. They are connected to form closed region boundaries, which are then segmented. The simplest method of image segmentation is called the thresholding method. In this method a threshold value is used to turn a grayscale image into a binary image. Clustering methods like the K-means algorithm can also be used to iteratively select thresholds. Here the distance between each pixel and its cluster center is minimized, and the cluster center is recomputed by averaging all of the pixels in the cluster. Compression based methods choose the optimal segmentation based on the coding length of the data: segmentation tries to find patterns in an image, and any regularity in the image can be used to compress it. For example, the contour of a region is represented as a chain code, and the smoother the boundary, the shorter the encoding. Another approach is to use histogram based methods, which are very efficient because they require only one pass over the pixels. The peaks and valleys in the histograms can be used to detect clusters.
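A minimal sketch of the thresholding step with NumPy, plus an iterative threshold selection that is the two-cluster special case of the K-means idea described above; the tolerance and the starting threshold are arbitrary choices for this sketch:

import numpy as np

def threshold_image(gray, t):
    """Turn a grayscale image into a binary image using threshold t."""
    return gray > t

def iterative_threshold(gray, tol=0.5):
    """Pick a threshold by repeatedly averaging the means of the two
    intensity clusters it induces (two-cluster K-means on intensities)."""
    t = gray.mean()
    while True:
        low, high = gray[gray <= t], gray[gray > t]
        if low.size == 0 or high.size == 0:
            return t
        new_t = (low.mean() + high.mean()) / 2.0
        if abs(new_t - t) < tol:
            return new_t
        t = new_t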