Wednesday, February 27, 2013

A comparison of cloud hosting for web applications

There are quite a few metrics to evaluate and compare cloud hosting as provided by different vendors such as Amazon, Azure, Rackspace, and Uhuru. Some of these are listed here:
1) Setting up the environment. Cloud hosting providers may offer storage, application hosting, web site hosting, and many other features. To create a cloud environment, one may need to choose among these offerings. Amazon and Azure, for instance, provide various environment options. Some make it easy to deploy run-of-the-mill web applications, while others serve more customized needs such as different forms of storage - tables, blobs, databases, and so on.
2) Deployment. Application deployment is a repeated practice, and hence the deployment UI should emphasize ease of use. The time taken for a deployment to go live is also an important consideration. In this case, it's not just the application developer who wants to be able to bounce the production server and look at the stats, but also the customers who might have to wait while the application is still deploying. Application deployment is probably the single most repeated interaction between a cloud service provider and the application developer.
3) Application deployment options are another big differentiator between the vendors. Some vendors allow specifying the region where the servers are located. Some also allow configuring the network and the DNS registration of the application. Some allow remote desktop access to the server itself. Others allow configuring security on the servers and the application. Some support more than one way of uploading the software - for example, package-based or source-control-based deployment.
4) Another important consideration is the variety of web applications supported by the service provider. For example, some vendors may allow .NET application hosting, while others may support PHP, Ruby on Rails, and so on. Deployments for these might also require different servers to be cut as VM slices - different both in terms of operating system and hosting stack.
5) Ease of use for the end-to-end flow. This is probably the single most important factor in making these services popular. In that respect, the Uhuru web application hosting experience is a breeze and a delight. However, I haven't looked into .NET application deployment there.


Text Search and keyword lookup

Keyword lookup is a form of search that includes cases where the topic words for the provided text may not appear in the text itself. This is usually done with the help of a semantic dictionary or thesaurus and lookups against it. Unfortunately, such a thesaurus is specific to the domain for which the text is written. As an example, let's take a look at the keywords used on the internet. Search engine optimization professionals use keyword research to find and study the actual search terms people enter into search engines when conducting a search. They look up the relevancy of the keywords to a particular site. They find the ranks of the keywords on various search engines for competing sites. They buy sample campaigns for the keywords from search engines and predict the click-through rate. Although search engine optimization is not related to automated keyword discovery, it serves to indicate that the relevancy and choice of keywords is a complex task and concordance alone is not enough. Additional metadata, in the form of distance vectors, connectivity between words, or their hierarchy in WordNet, is required. Building a better dictionary or thesaurus for the domain is one consideration. Selecting the metrics to evaluate the keywords is another. Using the statistics for the keywords encountered so far is a third.
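The thesaurus-based lookup described above can be sketched in a few lines. This is a minimal illustration, not a real keyword-discovery system: the tiny `THESAURUS` mapping here is entirely made up for the example, standing in for a domain-specific semantic dictionary that would map surface words to topic keywords not present in the text.

```python
import re
from collections import Counter

# Hypothetical mini-thesaurus for illustration only: maps words that may
# appear in a text to broader topic keywords that need not appear in it.
THESAURUS = {
    "ranking": "search engine optimization",
    "click-through": "search engine optimization",
    "keywords": "search engine optimization",
    "precision": "information retrieval",
    "recall": "information retrieval",
    "index": "information retrieval",
}

def extract_keywords(text, top_n=3):
    """Score candidate topic keywords by counting how often the
    thesaurus maps a word in the text to each topic."""
    words = re.findall(r"[a-z\-]+", text.lower())
    scores = Counter()
    for w in words:
        if w in THESAURUS:
            scores[THESAURUS[w]] += 1
    return [topic for topic, _ in scores.most_common(top_n)]

sample = "Good keywords improve ranking and click-through rates."
print(extract_keywords(sample))  # ['search engine optimization']
```

Note that the topic keyword returned never occurs in the sample sentence - which is exactly the case that separates keyword lookup from plain concordance search.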

Tuesday, February 26, 2013

FullText revisited

Let's take a look at full text search again. Fulltext is about indexing and searching. Fulltext is executed on a document or a full-text database. Full text search is differentiated from searches based on metadata or on parts of the original texts represented in the databases because it tries to match all of the words in all the documents against the pattern mentioned by the user. It builds a concordance of all the words ever encountered and then executes the search on this catalog. The catalog is refreshed in a background task. Words may be stemmed or filtered before being pushed into the database. This also means that there can be many false positives for a search - an expression used to denote results that are returned but not relevant to the intended search. Clustering techniques based on Bayesian algorithms can help reduce false positives.
Depending on the occurrences of words relevant to the categories, a search term can be placed in one or more of the categories. There is a pair of metrics used to describe search results: precision and recall. Precision measures the fraction of the returned results that are relevant - the quality of the results - while recall measures the fraction of all relevant documents that were actually returned. Some tools aim to improve querying so as to improve the relevancy of the results. These tools utilize different forms of searches such as keyword search, field-restricted search, boolean queries, phrase search, concept search, concordance search, proximity search, regular expressions, fuzzy search, and wildcard search.
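The concordance-then-search flow above, along with the precision and recall metrics, can be sketched as follows. This is a toy inverted index, not a production fulltext engine: the document set is invented for the example, and a real system would also stem words, filter stop words, and refresh the catalog in the background as described.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Build a concordance (inverted index): word -> set of doc ids.
    Words are lowercased; a real engine would also stem and filter."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Boolean AND search: return documents containing every term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for t in terms[1:]:
        result &= index.get(t, set())
    return result

def precision_recall(returned, relevant):
    """Precision: fraction of returned results that are relevant.
    Recall: fraction of relevant documents that were returned."""
    if not returned:
        return 0.0, 0.0
    hits = len(returned & relevant)
    return hits / len(returned), hits / len(relevant)

docs = {
    1: "full text search builds an index of all words",
    2: "the index catalog is refreshed in the background",
    3: "clustering reduces false positives in search",
}
index = build_index(docs)
returned = search(index, "index search")   # docs with both terms
print(returned)                            # {1}
print(precision_recall(returned, {1, 3}))  # (1.0, 0.5)
```

Here the query returns only document 1, so precision is perfect but recall is 0.5 because relevant document 3 (which matches the intent but not both terms) was missed - the kind of gap that concept search and fuzzy search try to close.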

Forensics

Forensic Science has a number of examples of statistical approaches to text search. These include:
Factor analysis:
This method describes variability among observed, correlated variables in terms of a potentially smaller number of unobserved variables called factors. The variations in three or four observed variables mainly reflect the variations in fewer unobserved variables. Factor analysis searches for such joint variations in response to unobserved latent variables.
Bayesian statistics: This method uses degrees of belief, also called Bayesian probabilities, to express a model of the world; the evidence is interpreted in terms of these probabilities.
Poisson distribution: This is used to model the number of events occurring within a fixed interval of time.
Multivariate analysis: This involves observation and analysis of more than one statistical outcome variable at a time.
Discriminant function analysis: This involves predicting a grouping variable based on one or more independent variables.
Cusum analysis: This is a cumulative sum method for finding the two- or three-letter words in a sentence that form a habit of the writer and thus can be used to detect tampered sections.
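The cusum idea in the last item can be sketched as follows. This is a simplified illustration, assuming the habit indicator is the per-sentence proportion of two- and three-letter words: we cumulate each sentence's deviation from the text-wide mean, and a sustained change in the slope of that cumulative sum suggests a section that departs from the writer's habit. The sample sentences are invented for the example.

```python
import re

def short_word_ratios(sentences):
    """Per-sentence proportion of two- and three-letter words,
    a crude habit indicator used in cusum analysis."""
    ratios = []
    for s in sentences:
        words = re.findall(r"[A-Za-z]+", s)
        if not words:
            continue
        short = sum(1 for w in words if 2 <= len(w) <= 3)
        ratios.append(short / len(words))
    return ratios

def cusum(values):
    """Cumulative sum of deviations from the mean; by construction
    it returns to zero at the end, and a sharp bend in between
    marks a stretch that deviates from the overall habit."""
    mean = sum(values) / len(values)
    total, out = 0.0, []
    for v in values:
        total += v - mean
        out.append(total)
    return out

sentences = [
    "He is at the top of his game.",
    "We go to the bar a lot.",
    "Subsequently, investigators discovered numerous discrepancies.",
]
print([round(c, 2) for c in cusum(short_word_ratios(sentences))])
```

The first two sentences are rich in short words while the third has none, so the cumulative sum climbs and then drops abruptly at the final sentence - the kind of discontinuity a forensic examiner would flag as a possible insertion.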