In the previous post we discussed out-of-the-box support for data mining in server products, and before that we covered methods of text mining that involve clustering, including the choice among clustering methods. We favored clustering because it lets us evaluate topics and keywords based on similarity measures and because we could not determine predictive parameters for keyword extraction.
If we instead take the approach that keywords carry a predictive parameter in and of themselves as they appear in an input text, then significant optimization and a simpler approach become possible. The parameter could be based on a large training data set or derived by exploring the graphs in a word thesaurus or ontology. That said, if we need to find words similar to those that occur in the input text, we still resort to clustering.
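As a rough illustration of this idea, here is a minimal C# sketch that ranks candidate keywords purely by a precomputed per-word weight. The weight table, class name, and tokenization are hypothetical stand-ins for whatever a trained data set or ontology walk would actually produce.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal sketch: rank candidate keywords by a precomputed per-word weight.
// The weight table is hypothetical; it would come from a trained corpus or an
// ontology walk, but here it is just an in-memory dictionary.
class KeywordScorer
{
    private readonly Dictionary<string, double> _weights;

    public KeywordScorer(Dictionary<string, double> weights)
    {
        _weights = weights;
    }

    public IEnumerable<string> TopKeywords(string text, int count)
    {
        return text
            .Split(new[] { ' ', '\t', '\n', '.', ',', ';' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => w.ToLowerInvariant())
            .Distinct()
            .Where(w => _weights.ContainsKey(w))   // unknown words get no score
            .OrderByDescending(w => _weights[w])   // higher weight = more likely a keyword
            .Take(count);
    }
}
```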
Sunday, May 26, 2013
SQL Server Analysis Services provides the ability to build mining models that make predictions or analyze your data. Mining model content comprises the metadata about the model, statistics about the data, and the patterns discovered by the mining algorithm. Depending on the algorithm used, the content may include regression formulas, definitions of rules and itemsets, or weights and other statistics. The structure of the model content can be browsed with the Microsoft Generic Content Tree Viewer provided in SQL Server Data Tools.
The content of each model is presented as a series of nodes. Nodes can contain counts of cases, statistics, coefficients and formulas, definitions of rules, lateral pointers, and XML fragments representing the data. Nodes are arranged in a tree and display information based on the algorithm used. If a decision tree model is used, the model can contain multiple trees, all connected to the model root. If a neural network model is used, the model may contain one or more networks and a statistics node. There are around thirty different mining content node types.
Mining models can use a variety of algorithms and are classified accordingly: association rule models, clustering models, decision tree models, linear regression models, logistic regression models, naïve Bayes models, neural network models, sequence clustering models, and time series models.
Queries run against these models can make predictions on new data by applying the model, get a statistical summary of the data used for training, extract the patterns and rules that were discovered, extract regression formulas and other calculations, get the cases that fit a pattern, retrieve details about the individual cases used in the model, and retrain a model by adding new data or performing cross-prediction.
One specific mining model is the clustering model, which is represented by a simple tree structure. It has a single parent node that represents the model and its metadata, and the parent node has a flat list of clusters. The nodes carry a count of the cases in the cluster and the distribution of values that distinguish this cluster from other clusters. For example, if we were to describe the distribution of customer demographics, the node distribution table could have attribute names such as age and gender; attribute values such as a number, male, or female; support and probability for discrete value types; and variance for continuous value types. The model content also gives the name of the database that holds the model, the number of clusters in the model, the number of cases that support a given node, and more. In clustering, there is no single predictable attribute in the model.
Analysis Services also provides a clustering algorithm, which is a segmentation algorithm: the cases in a data set are iterated over and separated into clusters that contain similar characteristics. After defining the clusters, the algorithm calculates how well the clusters represent the groups of data points and then redefines the clusters to better represent the data. The clustering behavior can be tuned with parameters such as the maximum number of clusters or the amount of support required to create a cluster.
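As a sketch of how such a model might be defined, the following C# snippet issues a DMX CREATE MINING MODEL statement over ADOMD.NET. The connection string, model name, and columns are assumptions chosen for illustration; CLUSTER_COUNT and MINIMUM_SUPPORT are the Microsoft_Clustering parameters that correspond to the maximum number of clusters and the support required to form a cluster.

```csharp
using Microsoft.AnalysisServices.AdomdClient;

class CreateClusteringModel
{
    static void Main()
    {
        // Connection string and object names are assumptions for illustration.
        using (var conn = new AdomdConnection("Data Source=localhost;Catalog=AdventureWorksDW"))
        {
            conn.Open();

            // Microsoft_Clustering exposes tuning parameters such as CLUSTER_COUNT
            // (maximum number of clusters) and MINIMUM_SUPPORT (cases required per cluster).
            const string dmx = @"
                CREATE MINING MODEL CustomerClusters
                (
                    CustomerKey LONG KEY,
                    Age         LONG CONTINUOUS,
                    Gender      TEXT DISCRETE
                )
                USING Microsoft_Clustering (CLUSTER_COUNT = 10, MINIMUM_SUPPORT = 25)";

            using (var cmd = new AdomdCommand(dmx, conn))
            {
                cmd.ExecuteNonQuery();
            }
        }
    }
}
```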
Data for clustering usually has a single key column, one or more input columns, and possibly predictable columns. Analysis Services also ships the Microsoft Cluster Viewer, which shows the clusters in a diagram.
The model is generally trained on a set of data before it can be used to make predictions. Queries help to make predictions and to get descriptive information on the clusters.
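Assuming a model like the one sketched above has been trained, the following example shows how the content nodes and a cluster prediction might be queried over ADOMD.NET. The object names and the singleton input case are made up for illustration.

```csharp
using System;
using Microsoft.AnalysisServices.AdomdClient;

class QueryClusteringModel
{
    static void Main()
    {
        using (var conn = new AdomdConnection("Data Source=localhost;Catalog=AdventureWorksDW"))
        {
            conn.Open();

            // Browse the model content nodes (metadata and per-cluster statistics).
            const string contentQuery =
                "SELECT NODE_CAPTION, NODE_SUPPORT FROM [CustomerClusters].CONTENT";
            using (var cmd = new AdomdCommand(contentQuery, conn))
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine("{0}: {1} cases", reader.GetValue(0), reader.GetValue(1));
                }
            }

            // Assign a new case to a cluster with the Cluster() and ClusterProbability() functions.
            const string predictionQuery = @"
                SELECT t.CustomerKey, Cluster(), ClusterProbability()
                FROM [CustomerClusters]
                NATURAL PREDICTION JOIN
                (SELECT 1001 AS CustomerKey, 34 AS Age, 'M' AS Gender) AS t";
            using (var cmd = new AdomdCommand(predictionQuery, conn))
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine("Case {0} -> {1} (p = {2})",
                        reader.GetValue(0), reader.GetValue(1), reader.GetValue(2));
                }
            }
        }
    }
}
```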
Courtesy: MSDN
Saturday, May 25, 2013
Expressions and queries work in similar ways. I referred to them in an earlier post, but I will cover them here. Expressions are represented as trees, and expression trees are immutable: if you want to modify an expression tree, you construct a new tree by copying the existing one and replacing nodes. You can nest expressions and define precedence through the structure of the tree, and different parts of the tree can be tagged so that they can be processed differently. The leaves of the tree are usually constants, or null, representing the data over which the expression evaluates. There are around 45 expression tree node types. New operations can be added almost anywhere in the tree, but adding them at the leaves keeps them as close to the data as possible; only some leaves may require the new operation, in which case the change is not pervasive throughout the tree. This is especially helpful because an expression can be used anywhere, nested, and recursive, and because the data and the tree can be arbitrarily large, performance is worth considering.
Queries work similarly, except that an expression can be part of the predicate in a query. Predicate pushdown allows a query to be passed through different systems. Servers typically do not interpret an expression if it is user defined and operates on their data; for the expressions the server does track, the expressions are compiled and given an execution plan. The execution plan helps improve and control execution because the expressions are translated into a language the system can work on. Queries and expressions each have their purpose, are often interchangeable, and there are usually many ways to solve a problem using either or both. You can traverse an expression tree with an expression tree visitor.
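A minimal C# sketch of these ideas: building an expression tree by hand, rewriting it with an ExpressionVisitor (which produces a new tree, since trees are immutable), and compiling the result into delegates. The visitor and lambda here are illustrative only.

```csharp
using System;
using System.Linq.Expressions;

class ExpressionTreeDemo
{
    // A visitor that rewrites every addition node into a multiplication,
    // producing a new tree because expression trees are immutable.
    class AddToMultiplyVisitor : ExpressionVisitor
    {
        protected override Expression VisitBinary(BinaryExpression node)
        {
            if (node.NodeType == ExpressionType.Add)
            {
                return Expression.Multiply(Visit(node.Left), Visit(node.Right));
            }
            return base.VisitBinary(node);
        }
    }

    static void Main()
    {
        // Build the tree for x => x + 2 by hand.
        ParameterExpression x = Expression.Parameter(typeof(int), "x");
        Expression<Func<int, int>> addTwo =
            Expression.Lambda<Func<int, int>>(Expression.Add(x, Expression.Constant(2)), x);

        // Rewrite the tree: the original lambda is untouched and a new one is produced.
        var multiplyByTwo = (Expression<Func<int, int>>)new AddToMultiplyVisitor().Visit(addTwo);

        // Compile() turns each tree into an executable delegate.
        Console.WriteLine(addTwo.Compile()(5));        // 7
        Console.WriteLine(multiplyByTwo.Compile()(5)); // 10
    }
}
```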
Queries that conform to the conventions LINQ proposes can be executed by more than one system, such as the Entity Framework and the database server. LINQ, or Language Integrated Query, defines queries in a way that can be executed against different data stores such as XML, a database server, ADO.NET datasets, and so on. These queries typically take the form of standard query operator methods such as Where, Select, Count, Max, and others. LINQ queries are typically not executed until the query variable is iterated over, which is why lambdas are used. Query syntax is generally more readable than the corresponding method syntax. IQueryable queries are compiled to expression trees, while IEnumerable queries are compiled to delegates. The compiler parses the lambdas in the statement, and LINQ expressions have a Compile method that turns the code represented by an expression tree into an executable delegate. There is an expression tree viewer application in the Visual Studio samples.
LINQ makes queries part of the programming constructs available in the language while hiding the data store they operate on. It is worth mentioning that different data sources may have different requirements or syntax for expressing their queries; LINQ to XML, for example, may require queries written in XPath, which is different from relational queries that look more like the LINQ constructs themselves. Queries against any data store can be captured and replayed independently of the caller that makes them.
LINQ queries and expressions that use lambdas have the benefit that the lambdas are evaluated only when the results are needed.
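A small C# sketch of these points: query syntax, deferred execution, and the difference between a lambda compiled to a delegate (Func) and one captured as an expression tree (Expression&lt;Func&gt;). The data and names are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

class LinqDemo
{
    static void Main()
    {
        var numbers = new List<int> { 1, 2, 3, 4, 5 };

        // Query syntax; nothing executes yet -- the query only runs when iterated.
        IEnumerable<int> evens = from n in numbers
                                 where n % 2 == 0
                                 select n * 10;

        numbers.Add(6); // added before iteration, so it is included in the results

        foreach (int n in evens)
        {
            Console.WriteLine(n); // 20, 40, 60
        }

        // The same predicate as a delegate (IEnumerable path) versus an expression tree (IQueryable path).
        Func<int, bool> asDelegate = n => n % 2 == 0;          // compiled code
        Expression<Func<int, bool>> asTree = n => n % 2 == 0;  // data describing the code

        Console.WriteLine(asDelegate(4));        // True
        Console.WriteLine(asTree.Compile()(4));  // True
    }
}
```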
Friday, May 24, 2013
Service Oriented Architecture
As a practice for reusable, single-point-of-maintenance code, we write services. These are used by the various components in our applications and for client calls. Services typically hide the data providers downstream and serve as a single point of communication for all client needs. This way the server code lives in the services and exercises better control over the data and its users. Typically there are many clients and more than one data provider, justifying the need for a single service to broker in between.
However, when we decide to work with more than one service, we organize the services in a hierarchy where different components deliver different functionalities but, to the outside world, there is only one service. This is one of the ways we define a service-oriented architecture.
Next we define the scope of this service and elevate it all the way to the enterprise. This is where the true value of such an architecture lies: it abstracts the various clients from heterogeneous data sources and provides a value proposition for the enterprise.
Communication with these services can be formalized via message-based paradigms. Messages enable us to define addresses, bindings, and contracts in a way that brings the benefits of declarative communication protocols and keeps the server and clients focused on the business. This is where WCF, or Windows Communication Foundation, comes into play.
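As a sketch of what this looks like in code, the following self-hosted WCF service declares a contract and an endpoint in a few lines. The service name, address, and binding are assumptions chosen for illustration; the same endpoint could equally be declared in configuration.

```csharp
using System;
using System.ServiceModel;

// Contract: what the service promises to callers.
[ServiceContract]
public interface IOrderService
{
    [OperationContract]
    string GetOrderStatus(int orderId);
}

// Implementation: the business logic behind the contract.
public class OrderService : IOrderService
{
    public string GetOrderStatus(int orderId)
    {
        return "Pending"; // placeholder logic for the sketch
    }
}

class Program
{
    static void Main()
    {
        // Address, binding, and contract declared in code.
        using (var host = new ServiceHost(typeof(OrderService), new Uri("http://localhost:8080/orders")))
        {
            host.AddServiceEndpoint(typeof(IOrderService), new BasicHttpBinding(), "");
            host.Open();
            Console.WriteLine("Service is running. Press Enter to stop.");
            Console.ReadLine();
        }
    }
}
```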
Services enable a myriad of clients to connect. Some of these can be handheld devices such as mobile phones, which enable rich applications to be written. The same services can also be used by desktop applications or over the browser; clients can be thin or rich while the same service caters to both. The ability to support applications via the browser makes SOA all the more appealing with its ubiquity and availability.
Services abstract the provisioning of the servers and resources required to handle traffic from the web. These servers and resources can be VM slices in the cloud or extensive deployments in data centers. This lowers the cost of operations for these services and increases their availability and reach. Services provisioned in the cloud have great appeal because data and applications can be rotated from one server to another with little or no downtime, which improves maintenance.
Services also enable rich diagnostics and caller statistics from the HTTP traffic that flows through HTTP proxies. Such reports not only improve the health of the code but also enable monitoring and meeting the needs of online traffic. Diagnostics help identify the specific methods and issues, so little time is spent reproducing problems.
Services written this way are very scalable and can meet the traffic generated by anniversaries or from all over the world. Such services can use clusters and support distributed processing. Services also enable integration of data with code and tight coupling of the business logic so that callers cannot discern the trade secrets of the business behind the services.
Applications improve the usability of these services and can bring additional traffic to the company.
Services have an enduring appeal across political and business changes, and they can offer incremental value propositions to the company. Finally, services make it easier for functionality to be switched in and out without disrupting the rest of the system. Even internally, services can be replaced with mocks for testing.
So far in these posts we have seen that there are several tools for text mining. For example, we used machine learning with a tagged corpus and an ontology: a vast collection of text has been studied and prepared in the corpus, and a comprehensive collection of words has been included in the ontology, which gives us a great resource to work with any document. Next we defined different distance vectors and used clustering techniques to group and extract keywords and topics. We refined the distance vectors and data points to be more representative of the content of the text. There are several ways to measure distance or similarity between words, and we have seen articulations of probability-based measures. We also reviewed the ways we cluster these data points and found methods that we prefer over others.
We want to remain focused on keyword extraction, even though we have seen similar usage in topic analysis and in some interesting areas such as text segmentation. We don't want to resort to a large corpus for lightweight application plugins, but we don't mind a large corpus for database searches. We don't need processing better than O(N^2) over the data to extract keywords, and we have the luxury of a pipeline of steps to get to the keywords.
Thursday, May 23, 2013
Writing PowerShell cmdlets
PowerShell lets you invoke cmdlets on the command line. A custom cmdlet is an instance of a .NET class. A cmdlet processes its input from an object pipeline instead of text, one object at a time. Cmdlets are attributed with a CmdletAttribute and named with a verb-noun pair. The class derives from PSCmdlet, which gives you access to the PowerShell runtime; a custom cmdlet class could also derive from Cmdlet, in which case it is more lightweight. Cmdlets don't handle argument parsing and error handling themselves; the runtime does this consistently across all PowerShell cmdlets.
Cmdlets that support ShouldProcess give the class access to the runtime behavior parameters Confirm and WhatIf. Confirm specifies whether user confirmation is required; WhatIf informs the user what changes would have been made when the cmdlet is invoked.
Common methods to override include BeginProcessing, which provides pre-processing functionality for the cmdlet; ProcessRecord, which can be called any number of times; EndProcessing, for post-processing functionality; and StopProcessing, for when the user stops the cmdlet asynchronously.
Cmdlet parameters allow the user to provide input to the cmdlet. This is done by adding properties to the class that implements the cmdlet and decorating them with a ParameterAttribute.
ProcessRecord generally does the main per-record work, such as creating new entries for data.
Parameters must be explicitly marked as public. Parameters can be positional or named; if a parameter is positional, only the value is provided with the cmdlet invocation. In addition, parameters can be marked as mandatory, which means they must have a value assigned.
Some parameters are reserved and are often referred to as common parameters. Another group, the ShouldProcess parameters, gives access to the Confirm and WhatIf runtime support. PowerShell also supports parameter sets, which are groupings of parameters.
For exception handling, a try/catch can be added around the method invocations; it should add more information about what happened when the error occurred. If you don't want to stop the pipeline on an error, do not throw with ThrowTerminatingError; a non-terminating error can be reported with WriteError instead.
Results are reported through objects. PowerShell is emphatic about the way results are displayed, and there is a lot of flexibility in what you include in your result objects. WriteObject is used to emit the results, which are returned to the pipeline. As with parameters, there should be consistency in the usage of both results and parameters.
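Pulling these pieces together, here is a hypothetical cmdlet sketch in C#: a verb-noun name, a mandatory positional parameter, ShouldProcess support for -Confirm and -WhatIf, the processing overrides, and WriteObject/WriteVerbose for results and diagnostics. The cmdlet name and behavior are invented for illustration.

```csharp
using System.Management.Automation;

// A hypothetical cmdlet, invoked as: Get-Greeting -Name "World" -WhatIf
// SupportsShouldProcess wires up the common -Confirm and -WhatIf parameters.
[Cmdlet(VerbsCommon.Get, "Greeting", SupportsShouldProcess = true)]
public class GetGreetingCommand : PSCmdlet
{
    // A mandatory, positional parameter; the runtime handles parsing and validation.
    [Parameter(Mandatory = true, Position = 0)]
    public string Name { get; set; }

    protected override void BeginProcessing()
    {
        WriteVerbose("Starting Get-Greeting");    // diagnostics go to the host application
    }

    protected override void ProcessRecord()
    {
        // ShouldProcess honors -WhatIf and -Confirm before doing the work.
        if (ShouldProcess(Name, "Greet"))
        {
            WriteObject("Hello, " + Name);        // results are objects sent down the pipeline
        }
    }

    protected override void EndProcessing()
    {
        WriteVerbose("Finished Get-Greeting");
    }
}
```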
There should be support for diagnostics when things go wrong so that the problem can be identified quickly and resolved. There is built-in support to send messages to the host application, which could be powershell.exe, and the host displays the messages from the pipeline.
Cmdlets can also be grouped so that the parameters or results need not be repeated. This is very convenient when fine-grained cmdlets are required but they essentially belong to the same group. A snap-in can also be created with PSSnapIn so that the cmdlets are registered for use; these types are available in the System.Management.Automation namespace. Installing a snap-in is done via InstallUtil.exe, which creates some registry entries. Make sure that System.Management.Automation.dll is available from the SDK or the Global Assembly Cache (GAC).
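A hypothetical snap-in for the cmdlet above might look like the following; the names and vendor are made up, and registration still goes through InstallUtil.exe as described.

```csharp
using System.ComponentModel;
using System.Management.Automation;

// A hypothetical snap-in that registers the cmdlets in this assembly.
// After building, register it with: InstallUtil.exe MyCmdlets.dll
// and load it in PowerShell with:   Add-PSSnapin GreetingSnapIn
[RunInstaller(true)]
public class GreetingSnapIn : PSSnapIn
{
    public override string Name
    {
        get { return "GreetingSnapIn"; }
    }

    public override string Vendor
    {
        get { return "Contoso"; }
    }

    public override string Description
    {
        get { return "Registers the Get-Greeting cmdlet."; }
    }
}
```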