Thursday, March 14, 2013

Data Structures

The heap data structure is very useful for organizing data: it offers constant-time retrieval of the maximum or minimum entry and logarithmic-time insertion and deletion. The max-heapify method works like this: you take the left child, the right child, and the root and find the maximum among the three. If the largest is not the root, you swap it with the root and recurse down into the swapped node's subtree.
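
A minimal sketch of max-heapify in Python (array-based and zero-indexed; the function names are mine, not from any particular library):

def max_heapify(a, i, n):
    # Sift a[i] down: compare the node with its left and right children
    # and swap with the larger child until the heap property holds.
    left, right = 2 * i + 1, 2 * i + 2
    largest = i
    if left < n and a[left] > a[largest]:
        largest = left
    if right < n and a[right] > a[largest]:
        largest = right
    if largest != i:
        a[i], a[largest] = a[largest], a[i]
        max_heapify(a, largest, n)  # recurse into the swapped subtree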

Insertion is also logarithmic: insert the element as the last leaf, then float it up until the heap property is restored.
To build a heap from an unordered array, heapify each index from N/2 down to 1; the nodes past N/2 are leaves and are already trivial heaps.
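
Continuing the sketch above, insertion floats the new last element up, and build-heap heapifies bottom-up (N//2 - 1 down to 0 in zero-indexed form):

def heap_insert(a, value):
    # Append as the last leaf, then float it up past smaller parents.
    a.append(value)
    i = len(a) - 1
    while i > 0 and a[(i - 1) // 2] < a[i]:
        a[i], a[(i - 1) // 2] = a[(i - 1) // 2], a[i]
        i = (i - 1) // 2

def build_max_heap(a):
    # Leaves are already heaps, so heapify from the last internal node down.
    for i in range(len(a) // 2 - 1, -1, -1):
        max_heapify(a, i, len(a))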

Wednesday, March 13, 2013

web API and mobile application clients

In my experience working on a retail company's API for launching their mobile applications on Android and iOS devices, there were a few items worth recalling:
1) The APIs were external facing and hence they required authentication and authorization over the internet.
2) The APIs could be signed into from mobile applications as well as desktop applications.
3) The APIs supported both encrypted and unencrypted access over the Internet.
4) The APIs were versioned and had a versioning policy.
5) The APIs were categorized based on services they offered. They exposed resources for these services.
6) The APIs had a cache for improving performance. The cache used URI hashing as well as periodic refreshes from data sources.
7) The APIs were published through a web proxy that provided monitoring services.
8) The APIs were called by clients with different API keys.
9) The APIs were implemented to access data from different data sources.
10) The APIs had SLAs defined and honored; these covered timeouts and availability.
11) The APIs were performance tested with different workloads.
12) The API URIs were qualified with the resource and query parameters necessary for shopping.
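
To illustrate how several of the items above combine on the client side, here is a hypothetical request sketch in Python; the host, version segment, header names, and key are all made up for illustration, not the retailer's actual API:

import requests

# Hypothetical endpoint: versioned URI (item 4) over HTTPS (item 3),
# with an API key and token identifying the client (items 1 and 8).
response = requests.get(
    "https://api.example.com/v1/products",
    params={"category": "shoes", "page": 1},
    headers={
        "X-API-Key": "client-specific-key",
        "Authorization": "Bearer <access-token>",
    },
    timeout=5,  # stay within the SLA timeout (item 10)
)
response.raise_for_status()
products = response.json()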

Tuesday, March 12, 2013

keyword extraction using naive Bayes

Keywords could be considered local to the document they appear in. Consequently, a keyword is characterized not only by its term frequency but also by its appearance in a given document as opposed to others. This has been utilized in papers such as Yasin et al. on keyword extraction using naive Bayes to identify whether a word belongs to the class of ordinary words or keywords. The metric is called TFxIDF, which combines Term Frequency and Inverse Document Frequency: TFxIDF(W, D) = P(word in D is W) x [ -log P(W appears in a document) ]. Assuming feature values are independent, a naive Bayes classifier has been proposed in this paper with the following model:
P(key | T, D, PT, PS) = P(T | key) x P(D | key) x P(PT | key) x P(PS | key) x P(key) / P(T, D, PT, PS), where P(key) denotes the prior probability that the word is a key, P(T | key) denotes the probability of having TFxIDF score T given the word is a key, P(D | key) denotes the probability of having distance D to the previous occurrence of the same word given the word is a key, and P(PT | key) and P(PS | key) likewise denote the probabilities of the features PT and PS given the word is a key.
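
A toy sketch of the TFxIDF part in Python; the corpus and tokenization are invented for illustration, and the scoring is reduced to the TFxIDF feature alone rather than the paper's full four-feature model:

import math

def tf_idf(word, doc_tokens, corpus):
    # P(word in D is W): term frequency within the document.
    tf = doc_tokens.count(word) / len(doc_tokens)
    # -log P(W appears in a document): inverse document frequency.
    df = sum(1 for d in corpus if word in d)
    idf = -math.log(df / len(corpus))
    return tf * idf

corpus = [
    "the heap stores the maximum at the root".split(),
    "the api returns a versioned resource".split(),
    "the heap supports logarithmic insertion".split(),
]
doc = corpus[0]
scores = {w: tf_idf(w, doc, corpus) for w in set(doc)}
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(word, round(score, 3))

Common words like "the" occur in every document, so their IDF term is -log(1) = 0 and they score zero, while document-specific words float to the top.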

Saturday, March 9, 2013

conceptual schema versus physical schema

The conceptual schema, sometimes called the logical schema, is the data model in the DBMS. It describes all the relations that are stored in the database. These relations capture information in the form of entities and relationships. For example, entities such as employees and departments could have relationships such as employees working for a department. Each collection of entities and each collection of relationships can be described as a relation, leading to the conceptual schema. The process of arriving at a choice of relations and fields for each relation is called conceptual database design.

The physical schema specifies additional storage details: it describes how the conceptual schema is stored on disk. The file organizations used to store the relations, and auxiliary data structures such as indexes that speed up data retrieval operations, are all part of the physical schema.
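
A small sketch of the distinction using SQLite from Python; the Employees/Departments relations echo the example above, and the table and column names are illustrative:

import sqlite3

conn = sqlite3.connect(":memory:")

# Conceptual (logical) schema: the relations and their fields.
conn.executescript("""
    CREATE TABLE Departments (did INTEGER PRIMARY KEY, dname TEXT);
    CREATE TABLE Employees   (eid INTEGER PRIMARY KEY, ename TEXT,
                              did INTEGER REFERENCES Departments(did));
""")

# Physical schema: how the relations are stored and accessed on disk,
# e.g. an auxiliary index to speed up lookups of employees by department.
conn.execute("CREATE INDEX idx_employees_did ON Employees(did)")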

Friday, March 8, 2013

Computer Vision 2

There are other techniques to image segmentation. Some of them are enumerated below:
Region growing methods: Here seeds mark each of the objects to be segmented. The regions are iteratively grown by comparing all unallocated neighbouring pixels to the region, using the difference between a pixel's intensity value and the region's mean as a similarity measure (a minimal sketch appears after this list).
Split and merge methods: This method recursively splits the image into four quadrants, which are then merged if they are found to be homogeneous.
Partial differential equation based methods: PDE methods involve setting up an equation, such as a curve propagation or Lagrangian equation, that parameterizes the contours. To evolve the contour, also called a snake, derivatives are computed using finite differences. Between similar choices, the direction of steepest gradient descent is chosen to minimize the energy. This leads to fast and efficient processing.
Level Set Methods: The level set method was proposed to track moving interfaces. It can be used to efficiently address the problem of curve/surface propagation in an implicit manner. The central idea is to represent the evolving contour using a signed function where its zero level corresponds to the actual contour. Then according to the motion equation of the contour, one can easily derive a similar flow for the implicit surface that when applied to the zero level will reflect the propagation of the contour.
Fast marching methods specify the evolution of a closed curve as a function of time T and speed F(x) in the normal direction at a point x on the curve. The speed function is specified, and the time T(x) at which the contour crosses a point x is obtained by solving the Eikonal equation F(x) |∇T(x)| = 1.
Graph partitioning methods can also be used effectively for image segmentation. In these methods, the image is modeled as a weighted undirected graph.
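
As referenced above, a minimal region-growing sketch in Python with NumPy; the seed and tolerance are made up, and a real implementation would grow a region from every seed rather than just one:

import numpy as np
from collections import deque

def region_grow(image, seed, tol=10.0):
    # Grow a single region from one seed: an unallocated neighbouring pixel
    # joins the region when |intensity - region mean| is within tolerance.
    h, w = image.shape
    region = np.zeros((h, w), dtype=bool)
    region[seed] = True
    total, count = float(image[seed]), 1
    frontier = deque([seed])
    while frontier:
        y, x = frontier.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not region[ny, nx]:
                if abs(image[ny, nx] - total / count) <= tol:
                    region[ny, nx] = True
                    total += float(image[ny, nx])
                    count += 1
                    frontier.append((ny, nx))
    return region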

Thursday, March 7, 2013

Image processing

Edge detection and segmentation is a form of image processing where the changes in intensity at the edges are used to detect lines or curves. The edges identified by edge detection are often disconnected; they are connected to form closed region boundaries, which are then segmented. The simplest method of image segmentation is called the thresholding method. Here a threshold value is used to turn a grayscale image into a binary image. Clustering methods such as the K-means algorithm can also be used iteratively for selecting thresholds: the distance between each pixel and its cluster center is minimized, and the cluster center is recomputed by averaging all of the pixels in the cluster. Compression-based methods choose the optimal segmentation based on the coding length of the data: segmentation tries to find patterns in an image, and any regularity in the image can be used to compress it. For example, the contours of an image are represented as a chain code, and the smoother the boundary, the shorter the encoding. Another approach is to use histogram-based methods, which are very efficient because they require only one pass over the pixels. The peaks and valleys in the histogram can be used to detect clusters.
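
A sketch of the iterative (two-cluster, K-means style) threshold selection described above, in Python with NumPy; the stopping tolerance and the random test image are arbitrary choices:

import numpy as np

def iterative_threshold(gray, tol=0.5):
    # Start from the global mean, then alternate: split pixels at the
    # threshold, recompute the two cluster means, and move the threshold
    # to their midpoint until it stops changing.
    t = gray.mean()
    while True:
        below, above = gray[gray <= t], gray[gray > t]
        new_t = (below.mean() + above.mean()) / 2.0
        if abs(new_t - t) < tol:
            return new_t
        t = new_t

gray = np.random.randint(0, 256, size=(64, 64)).astype(float)
binary = gray > iterative_threshold(gray)  # thresholded binary image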

Wednesday, March 6, 2013

REST API Design.

This link talks about REST API design:
http://blog.apigee.com/detail/slides_for_restful_api_design_second_edition_webinar
e.g.: /owners/bob/dogs/search?q=fluffy
/services/data/v1.0/sobjects/Account
/v1/dogs/
/dogs?id=123
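
A hypothetical sketch of the last two URI styles as Flask routes in Python; the handlers and data are invented for illustration, not taken from the webinar:

from flask import Flask, jsonify, request

app = Flask(__name__)
DOGS = {123: {"id": 123, "name": "fluffy"}}

@app.route("/v1/dogs/")
def list_dogs():
    # Collection resource under an explicit version segment.
    return jsonify(list(DOGS.values()))

@app.route("/dogs")
def get_dog():
    # Lookup by query parameter, e.g. /dogs?id=123
    dog = DOGS.get(request.args.get("id", type=int))
    return (jsonify(dog), 200) if dog else ("not found", 404)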