Friday, May 24, 2013

Service Oriented Architecture
As a practice for writing reusable code with a single point of maintenance, we write services. These are used by the various components in our applications or for client calls. Such services typically hide the data providers downstream and serve as a single point of communication for all client needs. This way the server-side code can live in these services and exercise better control over the data and its users. Typically there are many clients and more than one data provider, which justifies the need for a single service to broker in between.
However, when we decide to work with more than one service, we organize our services in a hierarchy where different components deliver different functionalities, but to the outside world there is only one service. This is one of the ways we define a service oriented architecture.
Next we define the scope of this service and elevate it all the way to the enterprise. This is where the true value of such an architecture lies. It abstracts the many clients from the heterogeneous data sources and provides a value proposition for the enterprise.
Communication with these services can be formalized via message based paradigms. Messages enable us to define addresses, bindings and contracts in a way that brings together the benefits of declarative communication protocols and keeps the server and clients focused on the business. This is where WCF, or Windows Communication Foundation, comes into play.
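As a minimal sketch of the address-binding-contract idea, here is a self-hosted WCF service; the IOrderService contract, the port and the implementation are hypothetical placeholders rather than a prescribed design.

using System;
using System.ServiceModel;

[ServiceContract]
public interface IOrderService
{
    [OperationContract]
    string GetOrderStatus(int orderId);
}

public class OrderService : IOrderService
{
    public string GetOrderStatus(int orderId)
    {
        return "Shipped"; // a real implementation would call the downstream data provider
    }
}

class Program
{
    static void Main()
    {
        // Address, binding and contract come together at the endpoint.
        var baseAddress = new Uri("http://localhost:8000/OrderService");
        using (var host = new ServiceHost(typeof(OrderService), baseAddress))
        {
            host.AddServiceEndpoint(typeof(IOrderService), new BasicHttpBinding(), "");
            host.Open();
            Console.WriteLine("Service is running. Press Enter to stop.");
            Console.ReadLine();
            host.Close();
        }
    }
}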
Services enable a myriad of clients to connect. Some of these can be handheld devices such as mobile phones, enabling rich applications to be written. The same services can also be used by desktop applications or over the browser. Clients can be thin or rich while the same service caters to both. The ability to support applications via the browser makes SOA all the more appealing for its ubiquity and availability.
Services abstract the provisioning of the servers and resources required to handle traffic from the web. These servers and resources can be VM slices in the cloud or hardware in data centers. This lowers the cost of operating the services and increases their availability and reach. Services provisioned in the cloud have great appeal because data and applications can be rotated from one server to another with little or no downtime, which simplifies maintenance.
Services also enable rich diagnostics and caller statistics from the HTTP traffic that flows through HTTP proxies. Such reports not only improve the health of the code but also enable monitoring and meeting the needs of online traffic. Diagnostics help identify the specific methods and issues, so less time is spent reproducing problems.
Services written this way are very scalable and can meet the traffic generated by peak events such as anniversary sales or by users from all over the world. Such services can use clusters and support distributed processing. Services also keep the business logic and data tightly encapsulated on the server, so callers cannot reverse-engineer the trade secrets behind what the service offers.
Applications improve the usability of these services and can bring additional traffic to the company.
Services have an enduring appeal across political and business changes, and they can offer incremental value propositions to the company. Finally, services make it easier for functionalities to be switched in and out without disrupting the rest of the system. Even internally, services can be replaced with mocks for testing.
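For example, a service that sits behind an interface can be swapped for a test double; the IInventoryService names below are hypothetical and only illustrate the idea.

// Production implementation calls the real service; the fake stands in for tests.
public interface IInventoryService
{
    int GetStockLevel(string sku);
}

public class InventoryServiceClient : IInventoryService
{
    public int GetStockLevel(string sku)
    {
        // ... call the remote service endpoint here ...
        throw new System.NotImplementedException();
    }
}

public class FakeInventoryService : IInventoryService
{
    public int GetStockLevel(string sku) => 42; // canned response for unit tests
}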
So far from our posts we have seen that there are several tools for text mining. For example, we used machine learning with a tagged corpus and an ontology. A vast collection of text has been studied and prepared in this corpus, and a comprehensive collection of words has been included in the ontology. This gives us a great resource to work with any document. Next we define different distance vectors and use clustering techniques to group and extract keywords and topics. We have refined the distance vectors and data points to be more representative of the content of the text. There have been several ways to measure distance or similarity between words, and we have seen articulations of probability based measures. We have reviewed the ways to cluster these data points and identified methods that we prefer over others.
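As an illustration of such measures (a general sketch, not the specific ones from the earlier posts), cosine similarity over term vectors and the Kullback-Leibler divergence over word distributions take the forms:

\text{cosine}(\vec{u}, \vec{v}) = \frac{\vec{u} \cdot \vec{v}}{\lVert \vec{u} \rVert \, \lVert \vec{v} \rVert}
\qquad
D_{KL}(P \parallel Q) = \sum_{w} P(w) \log \frac{P(w)}{Q(w)}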
We want to remain focused on keyword extraction even though we have seen similar usages in topic analysis and some interesting areas such as text segmentation. We don't want to resort to a large corpus for lightweight application plugins, but we don't mind a large corpus for database searches. We can tolerate processing on the order of O(N^2) for extracting keywords, and we have the luxury of a pipeline of steps to get to them.

Thursday, May 23, 2013

Writing powershell commands
Powershell lets you invoke cmdlets on the command line. A custom cmdlet is an instance of a .NET class. A cmdlet processes its input from an object pipeline instead of text, one object at a time. Cmdlets are decorated with the Cmdlet attribute and named with a verb-noun pair. The class derives from PSCmdlet, which gives you access to the PowerShell runtime. The custom cmdlet class could also derive from Cmdlet, in which case it is more lightweight. Cmdlets don't handle argument parsing and error presentation themselves; PowerShell does this consistently across all cmdlets.
Cmdlets can support ShouldProcess, which gives the class access to the runtime behavior parameters Confirm and WhatIf. Confirm specifies whether user confirmation is required. WhatIf informs the user what changes would have been made when the cmdlet is invoked.
Common methods to override include BeginProcessing, which provides pre-processing functionality for the cmdlet; ProcessRecord, which can be called any number of times (once per pipeline object); EndProcessing for post-processing functionality; and StopProcessing, which is called when the user stops the cmdlet asynchronously.
Cmdlet parameters allow the user to provide input to the cmdlet. This is done by adding properties to the class that implements the cmdlet and decorating them with the Parameter attribute.
ProcessRecord generally does the main per-record work, such as creating new data entries.
Parameters must be explicitly marked as public. Parameters can be positional or named. If a parameter is positional, only the value needs to be provided at the right position in the cmdlet invocation, without the parameter name. In addition, parameters can be marked as mandatory, which means the user must supply a value for them.
Some parameters are reserved and are often referred to as common parameters. Another group of parameters are called the ShouldProcess parameters, which give access to the Confirm and WhatIf runtime support. Parameter sets are also supported by Powershell; they group parameters so that the same cmdlet can support different invocation patterns.
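Here is a minimal sketch of a custom cmdlet with a mandatory positional parameter and ShouldProcess support; Get-GreetingMessage and its parameters are hypothetical names chosen for illustration.

using System.Management.Automation;

[Cmdlet(VerbsCommon.Get, "GreetingMessage", SupportsShouldProcess = true)]
public class GetGreetingMessageCommand : PSCmdlet
{
    // Positional, mandatory parameter: the value can be passed without the name.
    [Parameter(Position = 0, Mandatory = true)]
    public string Name { get; set; }

    // Optional named switch parameter.
    [Parameter]
    public SwitchParameter Shout { get; set; }

    protected override void ProcessRecord()
    {
        // Honors -WhatIf and -Confirm supplied by the runtime.
        if (ShouldProcess(Name, "Generate greeting"))
        {
            var message = "Hello, " + Name;
            WriteObject(Shout.IsPresent ? message.ToUpper() : message);
        }
    }
}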
For exception handling, a try/catch can be added around the method invocations in the class. The catch block should add more context when the error happens. If you don't want to stop the pipeline on an error, report it with WriteError rather than throwing with ThrowTerminatingError, which halts the pipeline.
Results are reported through objects. Powershell takes care of how results are displayed, and there is a lot of flexibility in what you include in your result objects. WriteObject is used to emit the results, which are returned to the pipeline. As with parameters, there should be consistency in the shape of both results and parameters.
There should be support for diagnostics when things go wrong so that problems can be identified and resolved quickly. There is built-in support to send messages to the host application, which could be powershell.exe, and the host displays those messages to the user.
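Continuing the sketch above, ProcessRecord might emit results and diagnostics like this; LookupRecord and RecordNotFoundException are hypothetical names used only for illustration.

protected override void ProcessRecord()
{
    WriteVerbose("Looking up record for " + Name);   // shown when -Verbose is passed

    try
    {
        var result = LookupRecord(Name);              // hypothetical helper
        WriteObject(result);                          // emit to the pipeline
    }
    catch (RecordNotFoundException ex)                // hypothetical exception type
    {
        // Non-terminating error: the pipeline keeps processing other input.
        WriteError(new ErrorRecord(ex, "RecordNotFound",
            ErrorCategory.ObjectNotFound, Name));
    }
}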
Cmdlets can also be grouped so that parameters or results need not be repeated. This is very convenient when fine grained cmdlets are required but they essentially belong to the same group. A snap-in can also be created by deriving from PSSnapIn so that the cmdlets are registered for use. These types are available from the System.Management.Automation namespace. Installing a snap-in is done via InstallUtil.exe, which creates the necessary registry entries. Make sure that System.Management.Automation.dll is available from the SDK or the Global Assembly Cache (GAC).
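A minimal snap-in sketch might look like the following; the name, vendor and description strings are placeholders.

using System.ComponentModel;
using System.Management.Automation;

[RunInstaller(true)]
public class MyCmdletsSnapIn : PSSnapIn
{
    public override string Name => "MyCmdlets";
    public override string Vendor => "Contoso";
    public override string Description => "Registers the custom cmdlets in this assembly.";
}
// Install with InstallUtil.exe MyCmdlets.dll, then load with Add-PSSnapin MyCmdlets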

Wednesday, May 22, 2013

I learned today that expressions and queries should be treated differently. Even though the predicate of a query can be written as an expression tree and vice versa, there are several reasons to prefer one or the other in specific scenarios.
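A small sketch of the distinction, with a hypothetical predicate: a compiled delegate executes in memory, while an expression tree describes the predicate as data that a query provider can inspect and translate.

using System;
using System.Linq.Expressions;

class Demo
{
    static void Main()
    {
        // Delegate: compiled code, runs locally.
        Func<int, bool> isAdultFunc = age => age >= 18;
        Console.WriteLine(isAdultFunc(21)); // True

        // Expression tree: inspectable structure, compiled on demand.
        Expression<Func<int, bool>> isAdultExpr = age => age >= 18;
        Console.WriteLine(isAdultExpr.Body);            // (age >= 18)
        Console.WriteLine(isAdultExpr.Compile()(21));   // True
    }
}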

Tuesday, May 21, 2013

Nltk classifier modules
The nltk decision tree module is a classifier model. A decision tree comprises non-terminal nodes that test conditions on feature values and terminal (leaf) nodes that carry the labels. The tree evaluates to a label for a given token.
The module requires feature names and labeled feature sets. Different thresholds such as the depth cutoff, entropy cutoff and support cutoff can also be specified. Entropy measures the degree of randomness or variation in the label distribution, while support refers to the number of feature sets used for evaluation.
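For reference, the entropy behind the entropy cutoff is the usual information-theoretic measure over the label distribution:

H(L) = -\sum_{l} P(l) \log_2 P(l)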
The term feature refers to some property of an unlabeled token. Typically a token is a word from a text that we have not seen before. If the text has been seen before and has already been labeled, it is a training set. The training set helps train our model so that we can pick labels better for the tokens we encounter. As an example, proper nouns for names may be labeled male or female. We start with a large collection of already tagged names, which we call the training data. We build a model that says, for instance, that if a name ends with a certain set of suffixes, the name is that of a male. Then we run our model on the training data to see how accurate we were, and we adjust the model to improve its accuracy. Next we run the model on test data. If a name from the test data is labeled by this model as male, we know the likelihood of that label being correct.
Properties of labeled tokens are also helpful. We call these joint-features and distinguish them from the features we just talked about by referring to the latter as input-features. So joint-features belong to training data and input-features belong to test data. For some classifiers such as the maxent classifier we refer to these as features and contexts respectively. Maxent stands for the maximum entropy model, where joint-features are required to have numeric values and input-features are mapped to a set of joint-features.
There are other types of classifiers as well. For example, the mallet module uses the external Mallet machine learning package, and the megam module uses the external megam maxent optimization package. The naive Bayes module is a classifier that assigns a probability to each label: P(label|features) is computed as P(label) * P(features|label) / P(features). The 'naive' assumption is that all features are independent. The positivenaivebayes module is a variant of the Bayes classifier that performs binary classification between two complementary classes when we have labeled examples for only one of the classes. Then there are classifiers based exclusively on the corpus they are trained on; the rte_classify module is a simple classifier for the RTE corpus that calculates the overlap in words and named entities between the text and the hypothesis. NLTK also provides classifier wrappers built on top of the scikit-learn machine learning library.
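Written out, the naive Bayes decision applies Bayes' rule together with the independence assumption over the features:

P(label \mid f_1, \ldots, f_n) = \frac{P(label)\, P(f_1, \ldots, f_n \mid label)}{P(f_1, \ldots, f_n)} \approx \frac{P(label) \prod_i P(f_i \mid label)}{P(f_1, \ldots, f_n)}

The chosen label is then the one that maximizes P(label) \prod_i P(f_i \mid label).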

Nltk cluster package
This package contains a number of basic clustering algorithms. Clustering is unsupervised machine learning that groups similar items within a large collection. The package includes k-means clustering, Gaussian EM clustering and group average agglomerative clustering (GAAC).

K-means clustering starts with k arbitrarily chosen means and assigns each vector to the cluster with the closest mean. The centroid of each cluster is then recalculated as the mean of its members. The process is repeated until the clusters stabilize. This may converge to a local optimum, so the method is repeated with other random initial means and the most commonly occurring output is chosen.

Gaussian EM clustering starts with k arbitrarily chosen means, prior probabilities and co-variance matrices, which form the parameters of the Gaussian sources. The membership probabilities are then calculated for each vector in each of the clusters - this is the E-step. The parameters are then updated in the M-step using the maximum likelihood estimate from the cluster membership probabilities. This process continues until the likelihood of the data no longer increases significantly.

GAAC clustering starts with each of the N vectors as a singleton cluster and then iteratively merges the pair of clusters with the closest centroids. This continues until there is only one cluster. The order of merges is useful for recovering a given number of clusters, because earlier merges sit deeper in the resulting dendrogram and the tree can be cut at the appropriate depth.
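A compact statement of the k-means iteration described above (EM replaces the hard assignment with membership probabilities):

\text{Assignment: } c_i = \arg\min_j \lVert x_i - \mu_j \rVert^2
\qquad
\text{Update: } \mu_j = \frac{1}{|S_j|} \sum_{x_i \in S_j} x_i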
Using WixSharp to build an MSI for installing and uninstalling applications
WixSharp makes it easy to author the logic for listing all the dependencies of your application for deployment. It converts the C# code to a wxs file, which in turn is compiled to build the MSI. There are a wide variety of samples in the WixSharp toolkit, some of which require very few lines of code for a wide range of deployment time actions. The appeal of using such a library is being able to get the task done sooner with fewer lines of code. The earlier way of writing and editing the WXS file by hand was error prone and tedious.
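A minimal sketch along the lines of the WixSharp samples; the product name, GUID and file paths are placeholders.

using System;
using WixSharp;

class Script
{
    static void Main()
    {
        // Describe the install directory and the files it should contain.
        var project = new Project("MyProduct",
            new Dir(@"%ProgramFiles%\My Company\My Product",
                new File(@"Files\MyApp.exe"),
                new File(@"Files\MyApp.exe.config")));

        project.GUID = new Guid("6f330b47-2577-43ad-9095-1861ba25889b"); // placeholder upgrade GUID

        // Generates the wxs file and compiles it into an MSI.
        Compiler.BuildMsi(project);
    }
}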

Monday, May 20, 2013

Nifty way to find a file format

Here's something I found on the internet. If you want to know the format of a file and there is little or no documentation of the proprietary format, you can look up the header of the file and its other data structures with the corresponding symbols from the kernel PDB file. You don't need the source code to work out the format.
There are tools that can render the information in the PDB so that it can be navigated through a UI. This is possible via the DIA SDK. The format of PDB files is also not open, hence access to them is via the debugger SDK. The SDK is available via COM, so you may have to register the DIA DLL.
The debuggers make a local cache of the symbols when requested and download them from the symbol server, so you can expect the PDB to be made available by the debugger in the directory you specify.
If you look at the kernel PDB, you will find that the structures we are looking for start with the name MINIDUMP, and these can be walked from the header onwards.
To find the stack trace, we follow the header to the directory entry or stream for the exception and read the exception record and the address where the exception occurred. Both of these are given in the MINIDUMP_EXCEPTION data structure. The exception stream also gives the thread context, and the context gives the processor specific register data. When we walk the stack from the stack pointer, we get the stack trace. We resolve the symbols of the stack trace with the PDBs either explicitly through DIA or implicitly through the managed mdbgeng library of the debugger SDK. The minidump actually has all the information in the file itself. For example, you can list all the loaded modules with the minidump module list as well as the minidump module structures. Module name, function name, line number and stack frame are available via the IMAGEHLP data structures. The various types of streams in the minidump are listed below, followed by a sketch of reading one of them:
Thread list stream given by the MINIDUMP_THREAD_LIST structure
Module list stream given by the MINIDUMP_MODULE_LIST structure
Memory list stream given by the MINIDUMP_MEMORY_LIST structure
Exception stream given by the MINIDUMP_EXCEPTION_STREAM structure
System Info stream given by the MINIDUMP_SYSTEM_INFO structure
ThreadExList stream given by the MINIDUMP_THREAD_EX_LIST structure
Memory64 list stream given by the MINIDUMP_MEMORY64_LIST structure
Comment stream
Handle Data stream given by the MINIDUMP_HANDLE_DATA_STREAM structure
Function table stream given by the MINIDUMP_FUNCTION_TABLE structure
Unloaded module list stream given by the MINIDUMP_UNLOADED_MODULE_LIST structure
Misc Info List stream given by the MINIDUMP_MISC_INFO structure
Thread Info List stream given by the MINIDUMP_THREAD_INFO_LIST structure
Handle operation  list stream given by the MINIDUMP_HANDLE_OPERATION_LIST structure
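Here is a minimal sketch of reading the exception stream with dbghelp's MiniDumpReadDumpStream; the field offsets follow the MINIDUMP_EXCEPTION_STREAM layout, and the path handling and error checks are deliberately sparse.

using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Runtime.InteropServices;

class DumpReader
{
    const uint ExceptionStream = 6; // MINIDUMP_STREAM_TYPE value for the exception stream

    [DllImport("dbghelp.dll", SetLastError = true)]
    static extern bool MiniDumpReadDumpStream(
        IntPtr baseOfDump,
        uint streamNumber,
        out IntPtr dir,            // PMINIDUMP_DIRECTORY
        out IntPtr streamPointer,  // start of the stream data within the mapped file
        out uint streamSize);

    static void Main(string[] args)
    {
        string dumpPath = args[0]; // path to the .dmp file

        using (var mmf = MemoryMappedFile.CreateFromFile(dumpPath, FileMode.Open,
                   null, 0, MemoryMappedFileAccess.Read))
        using (var view = mmf.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read))
        {
            IntPtr baseOfDump = view.SafeMemoryMappedViewHandle.DangerousGetHandle();

            if (MiniDumpReadDumpStream(baseOfDump, ExceptionStream,
                    out IntPtr dir, out IntPtr stream, out uint size))
            {
                // MINIDUMP_EXCEPTION_STREAM: ThreadId (offset 0), alignment (4),
                // then MINIDUMP_EXCEPTION: ExceptionCode (8), ExceptionFlags (12),
                // ExceptionRecord (16), ExceptionAddress (24).
                uint threadId = (uint)Marshal.ReadInt32(stream, 0);
                uint code = (uint)Marshal.ReadInt32(stream, 8);
                long address = Marshal.ReadInt64(stream, 24);
                Console.WriteLine($"Faulting thread {threadId}, exception 0x{code:X8} at 0x{address:X}");
            }
        }
    }
}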
For diagnostics, we may choose to display messages to the output or error stream. More features can be built into the tool that retrieves the stack trace from the dump. This can be done in an extensible manner where the tool runs a set of commands supplied by the user on the command line. Internally we can use a command pattern to implement the different debugger-like functionalities of the tool. The tool can also be deployed via an MSI, which ensures cleanliness during install and uninstall.