Wednesday, May 29, 2013

The web interface for the example mentioned in the previous post could be a simple list view built with an MVC framework; HTML5 and CSS can be used for the views. The stack trace bucket viewer application could be a visual tool to see and edit individual stack trace records read from dumps, as well as a way to force the producer to retry reading the stack trace from a dump. The dump entries could carry an additional flag to denote state, such as new, in progress, and completed, moving through those states in that order. If the state is reverted, reprocessing is required. If no intermediary states are required, such as for updates, then inserting and deleting the record suffices to trigger reprocessing. The producer service should watch for dump files and keep an association between each dump and its entry in the database. If a dump has no entry in the database, the dump is re-read. The lookup between the database and the dump for processing can be quick since the service can look up the dump by path and filename.
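As a rough sketch, the dump entry could be modeled as below; the class and property names are only illustrative, not an actual schema.

using System;

public enum DumpState { New, InProgress, Completed }

public class DumpEntry
{
    public int Id { get; set; }
    public string Path { get; set; }              // folder that holds the dump
    public string FileName { get; set; }          // together with Path, identifies the dump
    public DumpState State { get; set; }          // reverting this flag triggers reprocessing
    public DateTime LastProcessedUtc { get; set; }
}

// the producer decides whether to re-read a dump by checking for an entry with the same path and filename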
The file watcher and the service bus are often used together. The service bus helps queue the dumps for processing and also helps with error conditions and retries. Queuing goes by other names and implementations as well, such as MSMQ. However, depending on the workload, it may or may not be required. The benefit of queuing is that dumps can be processed asynchronously and retried on failure. This can also be handled by the service itself since it works on one file at a time.
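A minimal sketch of the producer side, assuming a local collection folder and an in-process queue standing in for a full service bus; the folder path and the ProcessDump helper are hypothetical.

using System.Collections.Concurrent;
using System.IO;

var pending = new BlockingCollection<string>();

var watcher = new FileSystemWatcher(@"\\server\dumps", "*.dmp");
watcher.Created += (s, e) => pending.Add(e.FullPath);      // queue each new dump for processing
watcher.EnableRaisingEvents = true;

// the service drains the queue one file at a time
foreach (var dumpPath in pending.GetConsumingEnumerable())
{
    ProcessDump(dumpPath);
}

void ProcessDump(string path)
{
    // hypothetical: read the stack trace from the dump and insert or update the database entry
}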
The table for dumps read and processed can grow arbitrarily large as many different dumps are processed. Depending on the number of dumps processed in a day and the size of the metadata we store for each, the table can grow large enough to require an aging policy and archival of older records. The archival can be batched to the start of every month during a maintenance window. It requires a table similar to the source, possibly in a different database than the live one. The archival stored procedure could read records a few at a time from the source, insert them into the destination, and delete the copied records from the source. If the source is not a single table but a set of related tables, the archival does this step for every table in the order that inserts are allowed; the deletes go in the reverse order since the constraints may need to be handled first. The inserts and deletes would not be expected to fail since we select only the records that are in the source but not yet in the destination. This way we remain in a good state between each incremental move of records, which helps when a large number of records makes the stored procedure run long and prone to interruptions or failures. The archival can resume from where it left off.
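A sketch of the incremental move driven from C#, assuming a single StackTraces table in both a live and an archive database; the connection string, database, table and column names, batch size and age cutoff are all placeholders.

using System;
using System.Data.SqlClient;

const string batchMove = @"
    INSERT INTO Archive.dbo.StackTraces (Id, Source, Bucket, Trace, CreatedUtc)
    SELECT TOP (@batchSize) s.Id, s.Source, s.Bucket, s.Trace, s.CreatedUtc
    FROM Live.dbo.StackTraces s
    WHERE s.CreatedUtc < @cutoff
      AND NOT EXISTS (SELECT 1 FROM Archive.dbo.StackTraces a WHERE a.Id = s.Id);

    DELETE s
    FROM Live.dbo.StackTraces s
    WHERE s.CreatedUtc < @cutoff
      AND EXISTS (SELECT 1 FROM Archive.dbo.StackTraces a WHERE a.Id = s.Id);";

using (var connection = new SqlConnection(connectionString))   // connectionString is a placeholder
{
    connection.Open();
    int moved;
    do
    {
        using (var command = new SqlCommand(batchMove, connection))
        {
            command.Parameters.AddWithValue("@batchSize", 500);
            command.Parameters.AddWithValue("@cutoff", DateTime.UtcNow.AddMonths(-3));
            moved = command.ExecuteNonQuery();   // copies then deletes a small batch; safe to resume if interrupted
        }
    } while (moved > 0);
}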
These services work with files and other Windows resources, so they may require that security is tightened and that dumps are handled only by a service account that has been authorized for reads and writes on the folder. This service account may be different for production and may require full access to all folders and sub-folders. File handling exceptions often affect the success rate of such file-based services. Internally, the same service account should be granted access to the database where the parsed dump information is stored. Exceptions handled by the services could be logged or stored in the database. For the consumer side of the store, users will use their own credentials, and their actions can be authenticated and authorized. This way we can tell apart the changes made by either side.
Since the services and dependencies are hosted separately, they may have to tolerate connectivity failures. From an end-to-end perspective, the file IO operations could all be isolated and made local to the machine with the dumps, while all subsequent processing is against the database.
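One way to tolerate transient connectivity failures is a small retry wrapper around the database calls; this is only a sketch, and the attempt count, backoff and the LookupDumpEntry call are hypothetical.

using System;
using System.Threading;

static T WithRetry<T>(Func<T> operation, int attempts = 3)
{
    for (int i = 1; ; i++)
    {
        try
        {
            return operation();
        }
        catch (Exception)
        {
            if (i >= attempts) throw;                        // give up after the last attempt
            Thread.Sleep(TimeSpan.FromSeconds(5 * i));        // back off before retrying
        }
    }
}

// usage, with a hypothetical lookup call:
// var entry = WithRetry(() => LookupDumpEntry(path, fileName));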

 

Tuesday, May 28, 2013

Cloud computing gives us the ability to develop applications that are virtualized across hardware and software stacks. Applications are no longer monolithic but are sharded into different modules, each of which can reside on a different VM with its own software and hardware stack. Virtual machines, operating systems, server products and hosts can be different for each module. These modules can still present the same experience to a user as if the user were interacting with a single application. Sign-on, for example, could happen only once while the user visits different modules. Application storage, caching and services are now supported on dedicated resources.
If we want to provide APIs for our services, then they can be scoped to services, and different services can meet different needs. APIs can be REST based, and this will expand their reach.
Let us take the example of provisioning a stack trace service that iterates over the dump files in a collection folder and populates a data store with stack traces read from each dump. In this case, we could expect the following APIs from the stackTrace service:
IEnumerable<string> GetStackTrace(Stream dumpFileStream); // returns the stack trace read from a dump stream
IEnumerable<string> ResolveSymbols(IEnumerable<string> stackTrace, IEnumerable<string> symbolPath); // resolves symbols to pretty-print the frames
IEnumerable<string> GetStackTrace(string pathToDumpAndSymbols); // combines the two operations above

Next, for the data table that we populate, called StackTraces, we will have attributes such as source information, bucket information and the stack trace.

So we can enable all LINQ based operations on this entity.
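A sketch of the entity, with illustrative property names, and one of the LINQ operations the consumer might run over it:

using System;
using System.Collections.Generic;
using System.Linq;

var stackTraces = new List<StackTraceRecord>();   // in practice this would come from the data store

// aggregate by bucket and sort by how common each bucket is
var topBuckets = stackTraces
    .GroupBy(s => s.Bucket)
    .Select(g => new { Bucket = g.Key, Count = g.Count() })
    .OrderByDescending(b => b.Count);

public class StackTraceRecord
{
    public int Id { get; set; }
    public string DumpPath { get; set; }          // source information
    public string Bucket { get; set; }            // bucket information
    public string Frames { get; set; }            // the stack trace itself
    public DateTime CollectedUtc { get; set; }
}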

This entity will be displayed by a service or front end that is independent from the stack trace population service. The front end could be read-only, allowing users to aggregate, search and sort stack traces from dumps.

In this case we have therefore separated out the producer and consumer modules of our system, and they are ready to be hosted on different VMs. For example, the producer service could sit on the same server as the collection folder and have large storage, since the dumps can be on the order of gigabytes and their collections can be arbitrarily large. The consumer is more of a three-tier web application and can be hosted on an app server. The data table can be in a cloud datastore on yet another VM or storage account.

Two services and one table can scale to add other functionality, and together they share adequate information in the data table for diagnostics, audit and tracking.

Monday, May 27, 2013

XPath query language

XPath is a query language for XML. XML is structured data where the document is organized as a tree. The relative position of an element with respect to the root is called a path, and there is only one path from that element to the root. The selection criterion for nodes is called a predicate. The different ways of slicing a tree, or the line to follow, are called axes; these can be, for example, parent, child or self, with child being the default. Paths can be nested inside predicates and predicates can be nested inside paths. Queries are expressed as full or partial paths with selection, and they can also be expressed in short forms. The position in the XML tree at which the next processing should take place is tracked with a context node. Nodes can have attributes, namespaces and text. Element positions are 1-based and in document order.
The expressions that denote the path usually describe a starting point, the context node, a line of search if not the child, and other absolute or relative paths. Paths have steps from one level to another and can include a mix or nesting of predicates and grouping via parentheses. Query results are returned in document order, and XPath does not modify the nodes. Standard operators such as union and standard functions such as count, id, sum and starts-with are available to use with the query. XPath queries can return all elements in a document based on the path steps given in the path expressions. Queries can return attributes that occur anywhere as long as the name is matched. Queries can use wildcards, say to denote all children of any element matching a given path. Queries can evaluate selection conditions where the text is compared to a constant, and conditions based on attributes within the elements.
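A small example of these ideas using the XPath support in System.Xml; the document and the element and attribute names are made up for illustration.

using System.Xml;

var doc = new XmlDocument();
doc.LoadXml("<dumps><dump server='A'><bucket>0x80004005</bucket></dump>" +
            "<dump server='B'><bucket>0xC0000005</bucket></dump></dumps>");

// absolute path with a predicate on an attribute; results come back in document order
XmlNodeList fromServerA = doc.SelectNodes("/dumps/dump[@server='A']/bucket");

// wildcard step with a standard function in the predicate
XmlNodeList accessViolations = doc.SelectNodes("//*[starts-with(text(), '0xC')]");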

 

Sunday, May 26, 2013

Various usages of a tree.
1) Use the structure of a tree to find the relative position of a node with respect to the root and to fully qualify that node.
2) Use the structure of a tree to discover siblings that share the same parent.
3) Use the structure of a tree to iterate over the siblings by walking laterally between them.
4) Use the structure of a tree to recursively traverse the tree to repeat the same operations.
5) Use the structure of a binary tree to traverse the tree in preorder, inorder and postorder traversal.
6) Use the structure of a tree to find the common ancestor of two nodes
7) Use the structure of a tree to find the predecessor of a node in a binary search tree
8) Use the structure of a tree to find the successor of a node in a binary search tree
9) Use the structure of a tree to find if a node exists in the binary search tree (see the sketch after this list)
10) Use the structure of a tree to identify a dendrogram among a flat list of data points
11) Use the structure of a tree and a schema definition to validate the data
12) Use the structure of a tree to select the elements from a document
13) Use the structure of a tree to slice it in different ways
14) Use the structure of a tree to nest it as an expression in another
15) Use the structure of a tree to make a clone or copy a sub-tree
16) Use the structure of a tree to visit elements to perform operations without affecting the tree
17) Use the structure of a tree to do breadth first search or depth first search
18) Use the structure of a tree to color nodes as red or black in specific ways for specific tasks.
19) Use the structure of a tree to organize conditions where each element of a collection is evaluated against a tree
20) Use the structure of a tree to cluster and organize large sets of data
21) Use the structure of a tree for efficiently retrieving spatial data
22) Use the structure of a tree for scoping operations via inheritance and composition
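As an illustration of a few of the binary search tree usages above, here is a minimal sketch; it assumes a simple integer-keyed node with no balancing, and the names are only illustrative.

using System;

public class Node
{
    public int Key;
    public Node Left, Right;
}

public static class BinarySearchTree
{
    // usage 9: find whether a key exists in the binary search tree
    public static bool Exists(Node root, int key)
    {
        while (root != null)
        {
            if (key == root.Key) return true;
            root = key < root.Key ? root.Left : root.Right;
        }
        return false;
    }

    // usage 5: in-order traversal visits the keys in sorted order
    public static void InOrder(Node root, Action<int> visit)
    {
        if (root == null) return;
        InOrder(root.Left, visit);
        visit(root.Key);
        InOrder(root.Right, visit);
    }

    // usage 6: lowest common ancestor of two keys in a binary search tree
    public static Node CommonAncestor(Node root, int a, int b)
    {
        while (root != null)
        {
            if (a < root.Key && b < root.Key) root = root.Left;
            else if (a > root.Key && b > root.Key) root = root.Right;
            else return root;
        }
        return null;
    }
}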

Re-assessing our approach to clustering methods

We discussed out-of-the-box support for data mining in server products in the previous post, and prior to that we discussed methods of text mining that involve clustering. We discussed choices of clustering methods. We favored clustering because it lets us evaluate topics and keywords based on similarity measures and because we could not determine predictive parameters for keyword extraction.
If we take the approach that keywords carry a predictive parameter in and of themselves as they appear in an input text, then we can explore significant optimizations and an easier approach. The parameter could be based on a large training data set or on exploring graphs in a word thesaurus or ontology. That said, if we want to find words similar to those that occur in the input text, we still resort to clustering.
SQL Server Analysis Services provides the ability to write mining models to make predictions or analyze your data. Mining model content comprises the metadata about the model, statistics about the data, and patterns discovered by the mining algorithm. The content may include regression formulas, definitions of rules and item sets, or weights and other statistics, depending on the algorithm used. The structure of the model content can be browsed with the Microsoft Generic Content Tree Viewer provided in SQL Server Data Tools.
The content of each model is presented as a series of nodes. Nodes can contain counts of cases, statistics, coefficients and formulas, definitions of rules, lateral pointers, and XML fragments representing the data. Nodes are arranged in a tree and display information based on the algorithm used. If a decision tree model is used, the model can contain multiple trees, all connected to the model root. If a neural network model is used, the model may contain one or more networks and a statistics node. There are around thirty different mining content node types.
The mining models can use a variety of algorithms and are classified as such. These can be association rule models, clustering models, decision tree models, linear regression models, logistic regression models, naïve Bayes models, neural network models, sequence clustering and time series models.
Queries run on these models can make predictions on new data by applying the model, get a statistical summary of the data used for training, extract patterns and rules, extract regression formulas and other calculations, get the cases that fit a pattern, retrieve details about the individual cases used in the model, and retrain a model by adding new data or performing cross-prediction.
One specific mining model is the clustering model, and it is represented by a simple tree structure. It has a single parent node that represents the model and its metadata, and the parent node has a flat list of clusters. The nodes carry a count of the number of cases in the cluster and the distribution of values that distinguish this cluster from other clusters. For example, if we were to describe the distribution of customer demographics, the table for node distribution could have attribute names such as age and gender, attribute values such as a number, male or female, support and probability for discrete value types, and variance for continuous data types. Model content also gives information on the name of the database that holds the model, the number of clusters in the model, the number of cases that support a given node, and more. In clustering, there is no single predictable attribute in the model. Analysis Services also provides a clustering algorithm, which is a segmentation algorithm. The cases in a data set are iterated and separated into clusters that contain similar characteristics. After defining clusters, the algorithm calculates how well the clusters represent the groups of data points and then redefines the clusters to better represent the data. The clustering behavior can be tuned with parameters such as the maximum number of clusters or the amount of support required to create a cluster.
Data for clustering usually has a single key column, one or more input columns, and optionally predictable columns. Analysis Services also ships a Microsoft Cluster Viewer that shows the clusters in a diagram.
The model is generally trained on a set of data before it can be used to make predictions. Queries help to make predictions and to get descriptive information on the clusters.
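As a rough sketch, a prediction against a trained clustering model could be issued over DMX through the AdomdClient library; the server, catalog, model name and input columns below are only placeholders, and the library is assumed to be referenced.

using System;
using Microsoft.AnalysisServices.AdomdClient;

using (var connection = new AdomdConnection("Data Source=localhost;Catalog=MiningDemo"))
{
    connection.Open();
    var command = new AdomdCommand(
        "SELECT Cluster(), ClusterProbability() " +
        "FROM [CustomerClusters] " +
        "NATURAL PREDICTION JOIN " +
        "(SELECT 35 AS [Age], 'F' AS [Gender]) AS t",
        connection);
    using (var reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            // prints the predicted cluster and the probability of membership
            Console.WriteLine("{0} ({1:P0})", reader.GetValue(0), reader.GetValue(1));
        }
    }
}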
Courtesy: MSDN

Saturday, May 25, 2013

Expressions and queries work in similar ways. I referred to them in an earlier post, but I'm trying to cover them here. Expressions are represented as trees, and expression trees are immutable. If you want to modify an expression tree, you construct a new one by copying the existing tree and replacing nodes. You can nest expressions and define the precedence based on the structure of the tree. Different parts of the tree can be tagged so that they can be processed differently. The leaves of the tree are usually the constants or nulls representing the data on which the expression tree evaluates. There are around 45 expression tree node types that can appear in the tree. New operations can be added almost anywhere in the tree; however, adding them at the leaves keeps them as close to the data as possible. Only some leaves may require the new operation, in which case the changes are not pervasive throughout the expression tree. This is especially helpful given that the expression could be used anywhere, nested and recursive. The size of the data used and the size of the tree can be arbitrarily large, so considering performance is helpful. Query works similarly, except that the expression can be part of the predicate in a query. Predicate push-down allows a query to be passed through different systems. The servers typically don't interpret what the expression is if it's user defined and operates on their data. For the ones that the server needs to keep track of, the expressions are compiled and have an execution plan. The execution plan helps to improve and control execution because the expressions are translated into a language that the system can work on. Queries and expressions have their purposes and are often interchangeable, and there are usually many ways to solve a problem using either or both. You can traverse an expression tree with an expression tree visitor.
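A small sketch of these ideas with System.Linq.Expressions: it builds a tree, produces a modified copy through a visitor, and compiles the result into a delegate.

using System;
using System.Linq.Expressions;

// build (x + 2) * 3 as an expression tree; the leaves are the parameter and the constants
ParameterExpression x = Expression.Parameter(typeof(int), "x");
Expression<Func<int, int>> expr =
    Expression.Lambda<Func<int, int>>(
        Expression.Multiply(Expression.Add(x, Expression.Constant(2)), Expression.Constant(3)), x);

// trees are immutable, so the visitor below returns a modified copy; here every constant is doubled
var rewritten = (Expression<Func<int, int>>)new DoubleConstants().Visit(expr);
Func<int, int> compiled = rewritten.Compile();   // now computes (x + 4) * 6
Console.WriteLine(compiled(1));                  // prints 30

class DoubleConstants : ExpressionVisitor
{
    protected override Expression VisitConstant(ConstantExpression node)
    {
        return Expression.Constant((int)node.Value * 2);
    }
}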
Queries, when they conform to the conventions that LINQ proposes, can be executed by more than one system, such as the Entity Framework and the database server. LINQ stands for Language Integrated Query, and it defines queries in a way that can be executed against different data stores such as XML, a database server, ADO.NET datasets and others. These queries typically take the form of standard query operator methods such as Where, Select, Count, Max and so on. Typically, LINQ queries are not executed until the query variable is iterated over, and this is why we use lambdas. Queries are generally more readable than their corresponding method syntax. IQueryable queries are compiled to expression trees, while IEnumerable queries are compiled to delegates. The compilers provide support to parse the lambdas in the statement. LINQ expressions have a Compile method that compiles the code represented by an expression tree into an executable delegate. There is an expression tree viewer application in the Visual Studio samples.
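A short illustration of the difference, using an in-memory array as the stand-in data source: the IQueryable form keeps the predicate as an expression tree a provider could translate, while the IEnumerable form holds a compiled delegate; in both cases nothing runs until the query is iterated.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

int[] numbers = { 1, 2, 3, 4, 5 };

// IEnumerable: the predicate is a compiled delegate
Func<int, bool> isEven = n => n % 2 == 0;
IEnumerable<int> evens = numbers.Where(isEven);                          // not executed yet

// IQueryable: the predicate stays as an expression tree for the provider to inspect
Expression<Func<int, bool>> isEvenExpr = n => n % 2 == 0;
IQueryable<int> evensQuery = numbers.AsQueryable().Where(isEvenExpr);    // still not executed

foreach (var n in evens) Console.WriteLine(n);                           // execution happens here, on iteration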
LINQ queries make queries part of the programming constructs available in a language while they hide the data that they operate on. In this case, it is important to mention that different data sources may have different requirements or syntax for expressing their queries. LINQ to XML, for example, may need the XML queries to be written in XPath. This is different from the relational queries, which are more like the LINQ constructs themselves. Queries against any data store can be captured and replayed independent of the caller that makes these queries.
LINQ queries and expressions that use lambdas have the benefit that the lambdas are evaluated only when the results are needed.