Monday, March 4, 2013

RAID levels

Disks are potential bottlenecks for system performance and storage reliability: if a disk fails, its data is lost. A disk array is used to increase performance and reliability through data striping and redundancy. Instead of keeping a single copy of the data, redundant information is maintained and carefully organized so that, in the case of a disk failure, it can be used to reconstruct the contents of the failed disk. These redundant-array-of-independent-disks organizations are referred to as RAID levels, and each level represents a different tradeoff between reliability and performance.
In data striping, the data is segmented into equal-size partitions that are distributed over multiple disks. The size of a partition is called the striping unit. The partitions are usually distributed in a round-robin fashion.
As for redundancy: while the mean time to failure of a single disk is a few years, it is much shorter for a disk array as a whole, since any one of the disks can fail. Hence check disks and parity schemes are used to improve reliability. A check disk contains information that can be used to recover from the failure of any one disk in the array. A group of data disks together with their check disks constitutes a reliability group.
Here is a list of the RAID levels:
Level 0: Non-redundant: A RAID level 0 system uses data striping to increase the maximum available bandwidth. No redundant information is maintained. This is usually the least expensive solution.
Level 1: Mirrored: Instead of one copy of the data, two copies are maintained. This type of redundancy is often called mirroring. Every write to a disk block involves a write to both disks. Reads can be served in parallel from either copy, but this is usually the most expensive solution.
Level 0+1: Striping and Mirroring: As in level 1, read requests can be scheduled to either a disk or its mirror image, and the bandwidth for contiguous blocks improves because it aggregates over all the disks.
Level 2: Error-correcting codes: In RAID level 2, the striping unit is a single bit. The redundancy scheme used is the Hamming code. The number of check disks grows logarithmically with the number of data disks.
Level 3: Bit-Interleaved Parity: The redundancy scheme of RAID level 2 is cheaper than that of RAID level 1, but it keeps more redundant information than necessary. Instead of using several disks to store a Hamming code that identifies which disk has failed, we rely on the disk controller for that information and use a single check disk with parity information, which is the lowest overhead possible.
Level 4: Block-Interleaved Parity: RAID level 4 has a striping unit of a disk block, instead of a single bit as in RAID level 3. Block-level striping has the advantage that a read request the size of a disk block can be served entirely by the disk where the requested block resides. The write of a single block still requires a read-modify-write cycle, but only one data disk and the check disk are involved; the parity is updated from the difference between the old and the new data block.
Level 5: Block-Interleaved Distributed Parity: This level improves upon the previous one by distributing the parity blocks uniformly over all disks, instead of storing them on a single check disk. This has two advantages. First, several write requests can potentially be processed in parallel, since the bottleneck of a single check disk is removed. Second, read requests have a higher degree of parallelism. This level usually has the best performance.
Level 6: P+Q Redundancy: In very large disk arrays, recovery from the failure of a single disk is not always sufficient: a second disk might fail before the first is replaced, and the probability of this happening is not negligible. A RAID level 6 system uses Reed-Solomon codes to recover from two simultaneous disk failures.
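Levels 3 through 5 all rely on the same XOR parity trick: the parity block is the XOR of all data blocks, and any one missing block is the XOR of the surviving blocks with the parity. A small sketch with made-up two-byte blocks (the function names are illustrative):

```python
from functools import reduce

def parity(blocks):
    """XOR all blocks byte-wise to produce the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def reconstruct(surviving_blocks, parity_block):
    """Recover the failed disk's block: XOR of survivors and parity."""
    return parity(surviving_blocks + [parity_block])

data = [b"\x0f\x0f", b"\xf0\xf0", b"\xaa\xaa"]   # three data disks
p = parity(data)                                  # the check disk
# If disk 1 fails, its contents come back from the other disks plus parity.
assert reconstruct([data[0], data[2]], p) == data[1]
```

The same arithmetic explains the read-modify-write cycle of level 4: the new parity is the old parity XORed with the difference (XOR) of the old and new data block.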

Saturday, March 2, 2013

Page numbers for a book index

After the index words have been chosen, they are usually displayed with their page numbers. The trouble with page numbers is that they depend on the rendering software: for an electronic document the page size can change, and the page numbers in the index may then need to be regenerated. This change in page size can be tracked with a variety of options, mostly after the index words have been selected. We will consider the trade-offs here. First, the page numbers for a word can be found with a linear search over every occurrence of the word; this is the naive approach. Second, we already have the offset from the start of the document for each word, so we can translate offsets to page numbers if we know the offset of the last word on each page. We can build an integer array where the index gives the page number and the value gives the offset of the last word on that page; then for each word listed in the index, we can print the associated page numbers. Third, we can use whatever information the document rendering system provides, including methods and data structures for page lookup. From these approaches, it should be clear that the best source of page-number data is the rendering engine, which is external to the index generator program.
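The second approach - an array holding the offset of the last word on each page - reduces page lookup to a binary search. A hypothetical sketch (names and the 0-based page numbering are illustrative):

```python
from bisect import bisect_left

def page_of(word_offset, last_offsets):
    """last_offsets[i] holds the offset of the last word on page i.
    Return the (0-based) page containing word_offset."""
    return bisect_left(last_offsets, word_offset)

def index_pages(occurrences, last_offsets):
    """Distinct, sorted page numbers for one index word's occurrences."""
    return sorted({page_of(o, last_offsets) for o in occurrences})

last = [99, 199, 299]                 # last word offsets of pages 0, 1, 2
assert index_pages([5, 150, 160, 250], last) == [0, 1, 2]
```

When the rendering changes the page size, only the `last_offsets` array needs to be rebuilt; the word offsets themselves are stable.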

For rendering Word documents, or just for retrieving page numbers, we could use either the Microsoft.Office.Interop.Word or the Aspose.Words library.

To pick a renderer, the program can look up the file extension and choose the appropriate library. The library can be loaded as the input file is read, or supplied via dependency injection. A thin wrapper over the libraries that exposes only the page-lookup method may be sufficient, or a design pattern can be used to abstract away the per-file-format implementations.
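One way to sketch such a wrapper is a small registry keyed by file extension. The class and method names below are hypothetical stand-ins, not the actual Interop or Aspose APIs:

```python
class RendererRegistry:
    """Maps file extensions to renderer factories; a stand-in for
    dependency injection of the page-lookup implementation."""

    def __init__(self):
        self._renderers = {}

    def register(self, extension, factory):
        self._renderers[extension.lower()] = factory

    def for_file(self, filename):
        ext = filename.rsplit(".", 1)[-1].lower()
        try:
            return self._renderers[ext]()       # instantiate on demand
        except KeyError:
            raise ValueError("no renderer registered for ." + ext)

class FakeDocxRenderer:
    """Hypothetical wrapper exposing only what the indexer needs."""
    def last_word_offsets(self):
        return [99, 199, 299]

registry = RendererRegistry()
registry.register("docx", FakeDocxRenderer)
renderer = registry.for_file("Book.DOCX")
```

The index generator then depends only on the wrapper's single method, so swapping Interop for Aspose is a registry change rather than a code change.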

Friday, March 1, 2013

Web service helper methods


Here are some common service provider helper methods for web services or applications:

WCFSerializationHelpers:

This uses a data contract serializer and deserializer to serialize an object and return it as a string. It supports more than one output format, such as XML or JSON. The object should be checked for serializability before performing the operation. This could be tested with:

public static T WcfRoundtripSerialize<T>(T instance) where T : class, new()
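The roundtrip-test idea itself is language-neutral: serialize, deserialize, and compare against the original. A hedged Python sketch, with the standard json module standing in for DataContractSerializer (the names here are illustrative):

```python
import json

def roundtrip_serialize(instance):
    """Serialize to a string and back; json.dumps raises TypeError
    if the object is not serializable, surfacing the problem early."""
    text = json.dumps(instance)    # serialize
    return json.loads(text)        # deserialize

original = {"name": "order-1", "qty": 3}
assert roundtrip_serialize(original) == original
```

An object that survives the roundtrip unchanged is safe to hand to the real serializer; one that raises or comes back different is caught before it reaches the wire.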

RestServiceAsyncClientHelpers:

This is a helper class to make 'GET' and 'POST' calls asynchronously using methods like HttpWebRequest.BeginGetResponse. XmlSerializer can be used to serialize and deserialize the streams. The HTTP method and a ContentType of 'text/xml' should also be specified.
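The shape of such a call can be illustrated outside .NET as well. A hedged Python sketch that only builds the request, with urllib's Request standing in for HttpWebRequest and the asynchronous send left to the caller:

```python
from urllib.request import Request

def build_xml_post(url, xml_body):
    """Prepare a POST carrying an XML payload with the method and
    Content-Type set explicitly; the caller decides how to send it."""
    return Request(url,
                   data=xml_body.encode("utf-8"),
                   headers={"Content-Type": "text/xml"},
                   method="POST")

req = build_xml_post("http://example.com/svc", "<ping/>")
assert req.get_method() == "POST"
```

Separating request construction from transmission keeps the helper testable without a live endpoint.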

SOAPMessageInspector  

This is a helper class to dump SOAP messages or to modify them. It can be hooked into the SOAP stream by implementing SoapExtension. Helper methods can be invoked based on the stage of processing, as discerned from the SOAP message.
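The hook-per-stage idea can be sketched generically. The stage names below mirror SoapExtension's processing stages, but the inspector class itself is a hypothetical stand-in:

```python
class MessageInspector:
    """Callbacks keyed by processing stage, invoked as a message
    passes through; hooks may log the message or rewrite it."""

    STAGES = ("BeforeDeserialize", "AfterDeserialize",
              "BeforeSerialize", "AfterSerialize")

    def __init__(self):
        self._hooks = {stage: [] for stage in self.STAGES}

    def hook(self, stage, fn):
        self._hooks[stage].append(fn)

    def process(self, stage, message):
        for fn in self._hooks[stage]:
            message = fn(message)   # each hook sees the previous result
        return message

inspector = MessageInspector()
inspector.hook("BeforeDeserialize", lambda m: m.upper())
```

Because hooks are chained, a dumping hook and a modifying hook can coexist on the same stage without knowing about each other.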

 

Wednesday, February 27, 2013

A comparison of cloud hosting for web applications

There are quite a few metrics to evaluate and compare cloud hosting as provided by different vendors such as Amazon, Azure, Rackspace, and Uhuru. Some of these are listed here:
1) Setting up the environment. Cloud hosting providers may offer storage, application hosting, web site hosting, and many other features. In order to create a cloud environment, one may need to choose one or more of these. Amazon and Azure, for instance, provide various environment options: some make it easy to deploy run-of-the-mill web applications, while others serve more customized needs, such as different forms of storage - tables, blobs, databases, et al.
2) Deployment. Application deployment is a repeated practice, so the deployment UI should be easy to use. The time taken to deploy, or to go live, is also an important consideration. In this case, it's not just the application developer who wants to be able to bounce the production server and look at the stats, but also the customers who may have to wait while the application is still deploying. Application deployment is probably the single most frequent interaction between a cloud service provider and the application developer.
3) Application deployment options are another big differentiator between vendors. Some vendors allow specifying the region where the servers are located. Some also allow configuring the network and the DNS registration of the application. Some allow remote desktop to the server itself. Others allow configuring security on the servers and the application. Some support more than one way of uploading the software - for example, package-based or source-control-based deployment.
4) Another important consideration is the variety of web applications supported by the provider. For example, some vendors may host .NET applications, while others may support PHP, Ruby on Rails, etc. Deployments for these might also require different servers to be cut as VM slices - different both in operating system and in hosting stack.
5) Ease of use for the end-to-end flow. This is probably the single most important factor in making a provider popular. In that respect, the Uhuru web application hosting experience is a breeze and a delight. However, I haven't looked into .NET application deployment there.
