The conceptual schema, sometimes called the logical schema, describes the data model in the DBMS. It describes all the relations that are stored in the database. These relations can capture information in the form of entities and relationships. For example, entities such as employees and departments could participate in relationships such as an employee working for a department. Each collection of entities and each collection of relationships can be described as a relation, leading to the conceptual schema. The process of arriving at a choice of relations, and of fields for each relation, is called conceptual database design.
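For instance, the employees and departments example might be captured with relations such as these (the attribute names here are illustrative, not from the original text):

Employees(eid, name, age)
Departments(did, dname, budget)
Works_In(eid, did)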
The physical schema specifies additional storage details: it describes how the conceptual schema is stored on disk. The file organizations used to store the relations, and auxiliary data structures such as indexes that speed up data retrieval operations, are all part of the physical schema.
Friday, March 8, 2013
Computer Vision 2
There are other techniques for image segmentation. Some of them are enumerated below:
Region growing methods: Here, seeds mark each of the objects to be segmented. The regions are iteratively grown by comparing all unallocated neighbouring pixels to the region. The difference between a pixel's intensity value and the region's mean is used as a similarity measure (a minimal sketch appears after this list).
Split and merge methods: This method splits the image into four quadrants, which can then be merged back if they are found to be homogeneous.
Partial differential equation based methods: PDE methods involve setting up an equation, such as a curve propagation or Lagrangian equation, that parameterizes the contours. The contour, also called a snake, is evolved by computing derivatives using finite differences, and among similar choices the steepest gradient descent is taken to minimize the energy. This leads to fast and efficient processing.
Level set methods: The level set method was proposed to track moving interfaces. It can be used to efficiently address the problem of curve/surface propagation in an implicit manner. The central idea is to represent the evolving contour using a signed function whose zero level corresponds to the actual contour. Then, from the motion equation of the contour, one can derive a similar flow for the implicit surface that, when applied to the zero level, reflects the propagation of the contour.
Fast marching methods specify the evolution of a closed curve as a function of the arrival time T and a speed F(x) in the normal direction at a point x on the curve. The speed function is specified, and the time at which the contour crosses a point x is obtained by solving the resulting equation (in effect, the Eikonal equation F(x)|∇T(x)| = 1).
Graph partitioning methods can also be used effectively for image segmentation. In these methods, the image is modeled as a weighted undirected graph.
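As a rough, hedged sketch of the region-growing idea in C# (the image representation, a 2D byte array of gray values, is an assumption made for illustration):

using System;
using System.Collections.Generic;

static class RegionGrowing
{
    // Grows a region from a seed by admitting 4-connected neighbours whose
    // intensity is within 'tolerance' of the current region mean.
    public static bool[,] Grow(byte[,] image, int seedRow, int seedCol, double tolerance)
    {
        int rows = image.GetLength(0), cols = image.GetLength(1);
        var inRegion = new bool[rows, cols];
        var frontier = new Queue<(int r, int c)>();
        frontier.Enqueue((seedRow, seedCol));
        inRegion[seedRow, seedCol] = true;
        double sum = image[seedRow, seedCol];
        int count = 1;

        int[] dr = { -1, 1, 0, 0 }, dc = { 0, 0, -1, 1 };
        while (frontier.Count > 0)
        {
            var (r, c) = frontier.Dequeue();
            for (int k = 0; k < 4; k++)
            {
                int nr = r + dr[k], nc = c + dc[k];
                if (nr < 0 || nr >= rows || nc < 0 || nc >= cols || inRegion[nr, nc])
                    continue;
                // Similarity measure: distance of the pixel from the region mean.
                if (Math.Abs(image[nr, nc] - sum / count) <= tolerance)
                {
                    inRegion[nr, nc] = true;
                    sum += image[nr, nc];
                    count++;
                    frontier.Enqueue((nr, nc));
                }
            }
        }
        return inRegion;
    }
}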
Thursday, March 7, 2013
Image processing
Edge detection and segmentation is a form of image processing where the changes in intensity at the edges are used to detect lines or curves. The edges identified by edge detection are often disconnected; they are connected to form closed region boundaries, which are then segmented. The simplest method of image segmentation is called the thresholding method. Here, a threshold value is used to turn a grayscale image into a binary image. Clustering methods such as the K-means algorithm can also be used to select thresholds iteratively: the distance between each pixel and its cluster center is minimized, and the cluster center is recomputed by averaging all of the pixels in the cluster. Compression-based methods choose the optimal segmentation based on the coding length of the data: segmentation tries to find patterns in an image, and any regularity in the image can be used to compress it. For example, the contours of an image are represented as a chain code, and the smoother the boundary, the shorter the encoding. Another approach is to use histogram-based methods, which are very efficient because they require only one pass over the pixels. The peaks and valleys in the histogram can be used to detect clusters.
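Here is a minimal C# sketch of the thresholding method described above; the 2D byte-array image representation is an assumption for illustration:

static int[,] ThresholdImage(byte[,] gray, byte threshold)
{
    int rows = gray.GetLength(0), cols = gray.GetLength(1);
    var binary = new int[rows, cols];
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            // Pixels at or above the threshold become foreground (1), the rest background (0).
            binary[r, c] = gray[r, c] >= threshold ? 1 : 0;
    return binary;
}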
Wednesday, March 6, 2013
REST API Design.
This link talks about REST API design:
http://blog.apigee.com/detail/slides_for_restful_api_design_second_edition_webinar
e.g. /owners/bob/dogs/search?q=fluffy
/services/data/v1.0/sobjects/Account
/v1/dogs/
/dogs?id=123
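As a hedged illustration (not from the linked slides), a tiny self-hosted handler for URLs shaped like the examples above could look like this in C#, using HttpListener; the routing logic and JSON bodies are purely illustrative:

using System.Net;

class RestSketch
{
    static void Main()
    {
        // Resource-oriented design: nouns in the path, verbs via HTTP methods.
        var listener = new HttpListener();
        listener.Prefixes.Add("http://localhost:8080/");
        listener.Start();
        while (true)
        {
            HttpListenerContext context = listener.GetContext();
            string path = context.Request.Url.AbsolutePath;  // e.g. "/v1/dogs/" or "/dogs"
            string id = context.Request.QueryString["id"];   // e.g. "123" for /dogs?id=123
            string body = path.TrimEnd('/').EndsWith("/dogs") && id != null
                ? "{\"dog\": \"" + id + "\"}"   // single resource looked up by id
                : "{\"dogs\": []}";              // collection listing
            byte[] bytes = System.Text.Encoding.UTF8.GetBytes(body);
            context.Response.ContentType = "application/json";
            context.Response.OutputStream.Write(bytes, 0, bytes.Length);
            context.Response.Close();
        }
    }
}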
Monday, March 4, 2013
RAID levels
Disks are potential bottlenecks for system performance and storage system reliability. If a disk fails, the data on it is lost. A disk array is used to increase performance and reliability through data striping and redundancy. Instead of having a single copy of the data, redundant information is maintained and carefully organized so that, in the case of a disk failure, it can be used to reconstruct the contents of the failed disk. These redundant array of independent disks organizations are referred to as RAID levels, and each level represents a tradeoff between reliability and performance.
In data striping, the data is segmented into equal-size partitions that are distributed over multiple disks. The size of a partition is called the striping unit. The partitions are usually distributed using a round-robin mechanism.
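A one-line C# sketch of the round-robin layout (the function name is illustrative): logical block i lands on disk i mod N at stripe i / N.

// Round-robin striping: consecutive logical blocks go to consecutive disks.
static (int disk, int stripe) LocateBlock(int logicalBlock, int diskCount)
{
    return (logicalBlock % diskCount, logicalBlock / diskCount);
}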
As for redundancy: while the mean time to failure of a single disk is about a few years, it is smaller for a disk array, since the array contains many more disks that can fail. Hence check disks and parity schemes are used to improve reliability. A check disk contains information that can be used to recover from the failure of any one disk in the array. A group of data disks together with their check disks constitutes a reliability group.
Here is a list of the RAID levels:
Level 0: Non-redundant: A RAID level 0 system uses data striping to increase the maximum available bandwidth. No redundant information is maintained. This is usually the least expensive solution.
Level 1: Mirrored: Instead of having one copy of the data, two identical copies are maintained. This type of redundancy is often called mirroring. Every write to a disk block involves a write on both disks. Reads can be served in parallel by either copy, but this is usually the most expensive solution.
Level 0+1: Striping and Mirroring: This level combines striping with mirroring. As in level 1, read requests can be scheduled to either a disk or its mirror image, and the bandwidth for contiguous blocks is improved by aggregating all the disks.
Level 2: Error-correcting codes: In RAID level 2, the striping unit is a single bit. The redundancy scheme used is the Hamming code. The number of check disks grows logarithmically with the number of data disks.
Level 3: Bit-Interleaved Parity: The redundancy scheme of RAID level 2 is better in terms of cost than RAID level 1, but it keeps more redundant information than is necessary. Instead of using several disks to store a Hamming code that identifies which disk has failed, we rely on the disk controller for that information and use a single check disk with parity information, which is the lowest overhead possible.
Level 4: Block-Interleaved Parity: RAID level 4 has a striping unit of a disk block, instead of a single bit as in RAID level 3. Block-level striping has the advantage that read requests of the size of a disk block can be served entirely by the disk where the requested block resides. The write of a single block still requires a read-modify-write cycle, but only one data disk and the check disk are involved; the parity is updated from the difference between the old data block and the new data block (see the parity sketch after this list).
Level 5: Block-Interleaved Distributed Parity: This level improves upon the previous level by distributing the parity blocks uniformly over all disks, instead of storing them on a single check disk. This has two advantages. First, several write requests can potentially be processed in parallel, since the bottleneck of a single check disk is removed. Second, read requests have a higher degree of parallelism. This level usually has the best performance.
Level 6: P+Q Redundancy: Recovery from the failure of a single disk is not always sufficient in very large disk arrays: a second disk might fail before the first is replaced, and the probability of such a second failure is not negligible. A RAID level 6 system uses Reed-Solomon codes to be able to recover from two simultaneous disk failures.
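The parity arithmetic behind levels 3 through 5 can be sketched in a few lines of C# (illustrative, not tied to any particular controller): the check block is the XOR of the data blocks, and a single-block write updates the parity from the old and new data.

static class ParitySketch
{
    // Check block for RAID levels 3-5: the XOR of all data blocks. Any one
    // lost block can be rebuilt by XOR-ing the survivors with the parity.
    public static byte[] Parity(byte[][] dataBlocks)
    {
        var parity = new byte[dataBlocks[0].Length];
        foreach (var block in dataBlocks)
            for (int i = 0; i < parity.Length; i++)
                parity[i] ^= block[i];
        return parity;
    }

    // Read-modify-write for a single-block update (levels 4 and 5):
    // new parity = old parity XOR old data XOR new data.
    public static byte UpdateParity(byte oldParity, byte oldData, byte newData)
    {
        return (byte)(oldParity ^ oldData ^ newData);
    }
}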
Saturday, March 2, 2013
Page numbers for a book index
After the index words have been chosen, they are usually displayed with their page numbers. The trouble with page numbers is that they depend on the rendering software: the page sizes could change for an electronic document, so the page numbers for the index may need to be regenerated. This change in page size could be tracked with a variety of options, mostly after the index words have been selected. We will consider the trade-offs here. First, the page numbers can be associated with the index words by a linear search for every occurrence of each word; this is probably a naive approach. Second, we already have word offsets from the start of the document for each of the words, so we can translate the offsets to page numbers if we know the last word offset of each page. We can build an integer array where the index gives the page number and the data gives the offset of the last word on that page; then, for each word that we list in the index, we can also print the associated page numbers. Third, we can utilize any information the document rendering system provides, including methods and data structures for page lookup. From these approaches, it should be clear that the best source of page number data is the rendering engine, which is external to the index generator program.
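A minimal C# sketch of the second approach (the array and method names are illustrative): given an array whose entry p holds the offset of the last word on page p, a binary search maps a word offset to its page.

using System;

static class PageLookup
{
    // lastWordOffsetPerPage[p] = offset of the last word on page p, sorted ascending.
    public static int PageForOffset(int[] lastWordOffsetPerPage, int wordOffset)
    {
        int page = Array.BinarySearch(lastWordOffsetPerPage, wordOffset);
        // On a miss, BinarySearch returns the bitwise complement of the first
        // page whose last-word offset exceeds wordOffset - the containing page.
        return page >= 0 ? page : ~page;
    }
}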
For rendering Word documents, or for retrieving the page numbers, we could use either the Microsoft.Office.Interop.Word or the Aspose.Words library.
To determine the renderer, the program could look up the file extension and use the appropriate library. The library can be loaded as the input file is read, or supplied via dependency injection. A wrapper over the libraries that exposes only a page-information method may be sufficient, or a design pattern can be used to abstract away the implementations on a per-file-format basis (see the sketch below).
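A hedged C# sketch of that wrapper; the interface and class names are hypothetical, not from either library:

using System;

public interface IPageInfoProvider
{
    int GetPageNumber(string documentPath, int wordOffset);
}

public static class PageInfoProviderFactory
{
    // Choose the implementation from the file extension, as described above.
    public static IPageInfoProvider ForFile(string path)
    {
        string ext = System.IO.Path.GetExtension(path).ToLowerInvariant();
        if (ext == ".doc" || ext == ".docx")
            return new WordInteropPageInfoProvider(); // wraps Microsoft.Office.Interop.Word
        throw new NotSupportedException(ext);
    }
}

public sealed class WordInteropPageInfoProvider : IPageInfoProvider
{
    public int GetPageNumber(string documentPath, int wordOffset)
    {
        // Delegation to the interop library would go here.
        throw new NotImplementedException();
    }
}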
Friday, March 1, 2013
Web service helper methods
Here are some common service provider helper methods for web services or applications:
WCFSerializationHelpers: This uses a data contract serializer and deserializer to serialize the object and return it as a string. It supports more than one output format, such as XML or JSON. The object should be checked for serializability before performing the operation. This could be tested with:
public static T WcfRoundtripSerialize<T>(T instance) where T : class, new()
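A minimal sketch of how such a round-trip helper could be implemented with DataContractSerializer (the XML case; a JSON variant would use DataContractJsonSerializer). This is an illustration under those assumptions, not the post's original code:

using System.IO;
using System.Runtime.Serialization;

public static class WcfSerializationHelpers
{
    // Serializes with DataContractSerializer and deserializes the result.
    // A successful round trip shows the type is serializable.
    public static T WcfRoundtripSerialize<T>(T instance) where T : class, new()
    {
        var serializer = new DataContractSerializer(typeof(T));
        using (var stream = new MemoryStream())
        {
            serializer.WriteObject(stream, instance);
            stream.Position = 0;
            return (T)serializer.ReadObject(stream);
        }
    }
}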
RestServiceAsyncClientHelpers: This is a helper class to make 'GET' and 'POST' calls asynchronously using methods like HttpWebRequest.BeginGetResponse. XmlSerializer can be used to serialize/deserialize the streams. The HTTP method and a ContentType of 'text/xml' should also be specified.
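A hedged sketch of an asynchronous GET along these lines, using HttpWebRequest.BeginGetResponse; the class and method names are illustrative:

using System;
using System.IO;
using System.Net;

public static class RestServiceAsyncClientHelpers
{
    // Issues a GET asynchronously; the caller receives the response body
    // through a callback once the response arrives.
    public static void GetAsync(string url, Action<string> onResponse)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.ContentType = "text/xml";
        request.BeginGetResponse(asyncResult =>
        {
            using (var response = (HttpWebResponse)request.EndGetResponse(asyncResult))
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                onResponse(reader.ReadToEnd());
            }
        }, null);
    }
}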
SOAPMessageInspector: This is a helper class for dumping SOAP messages or modifying them. It can be hooked into the SOAP stream by implementing SoapExtension. Helper methods can be invoked based on the stage of processing, as discerned from the SOAP message.
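A hedged sketch of such an inspector built on SoapExtension; the class name and the logging are illustrative:

using System;
using System.Web.Services.Protocols;

public class SoapMessageInspector : SoapExtension
{
    public override object GetInitializer(Type serviceType) { return null; }
    public override object GetInitializer(LogicalMethodInfo methodInfo,
                                          SoapExtensionAttribute attribute) { return null; }
    public override void Initialize(object initializer) { }

    // Called once per processing stage; dump or modify the message here.
    public override void ProcessMessage(SoapMessage message)
    {
        switch (message.Stage)
        {
            case SoapMessageStage.BeforeSerialize:
            case SoapMessageStage.AfterDeserialize:
                Console.WriteLine(message.Action); // e.g. log the SOAPAction
                break;
        }
    }
}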