Friday, May 31, 2013

Pixels in a photograph are translated to color codes, so the entire image can be treated as a matrix. With this matrix and matrix-manipulation software such as Matlab, we can experiment with the image to identify edges and regions. We may need to perform smoothing and normalization on the images and apply segmentation algorithms before we can transform the image into something we can work with. We may not be able to eliminate noise entirely, but we can do a lot with the transformed image.
Some examples of such image processing techniques include image enhancement, image restoration and image compression. Image enhancement is the technique by which different features of the image are accentuated so that they are prepared for further analysis. Contrast manipulation, gray-level adjustment, noise reduction, edge detection and sharpening, filtering, interpolation and magnification are all part of enhancing the image. Image restoration works the other way in that it tries to undo the changes to the image by studying the extent and kinds of changes that have happened. Image compression is about storing the image with a reduced number of bits. This is very helpful when image size is a concern, for example in the storage and retrieval of a large number of images for broadcasting, teleconferencing, medical imaging and other transmissions. If you have taken pictures in different formats, you will have noticed the significant reduction in size with the JPEG format. The Joint Photographic Experts Group came up with this format to reduce the image size, among other things.
Among the image enhancement techniques, edge detection is commonly used to quickly identify the objects of interest in a still image. This is helpful in tracking changes to the edges in a series of frames from a moving camera. Such cameras capture over 30 frames per second and the algorithms used for image processing are computationally costly, so we make trade-offs between what we want to detect and the interpretations we draw from it.
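To make the edge detection step concrete, here is a minimal sketch of the Sobel operator over a grayscale image held as a 2-D intensity array. This is only one common edge detector among many; the array contents and the using directive for System are assumed.

static double[,] SobelMagnitude(double[,] gray)
{
    int h = gray.GetLength(0), w = gray.GetLength(1);
    var edges = new double[h, w];
    int[,] gx = { { -1, 0, 1 }, { -2, 0, 2 }, { -1, 0, 1 } };   // horizontal gradient kernel
    int[,] gy = { { -1, -2, -1 }, { 0, 0, 0 }, { 1, 2, 1 } };   // vertical gradient kernel

    for (int y = 1; y < h - 1; y++)
        for (int x = 1; x < w - 1; x++)
        {
            double sx = 0, sy = 0;
            for (int i = -1; i <= 1; i++)
                for (int j = -1; j <= 1; j++)
                {
                    sx += gx[i + 1, j + 1] * gray[y + i, x + j];
                    sy += gy[i + 1, j + 1] * gray[y + i, x + j];
                }
            edges[y, x] = Math.Sqrt(sx * sx + sy * sy);   // gradient magnitude; large values mark edges
        }
    return edges;
}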
Another such application is region growing, where we decompose the picture into regions of interest. In the seeded region growing method, for example, a set of seed points is supplied along with the input image as starting points for the objects to be demarcated. The regions are iteratively grown by comparing the unmarked neighboring pixels and including them in the regions. The difference between a pixel's intensity and the region's mean is used as the measure of similarity. This way pixels are included in the regions and the regions grow.
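A rough sketch of growing a single region from one seed point follows. The image is again a 2-D intensity array, the tolerance is an arbitrary assumption, and the using directives for System and System.Collections.Generic are implied.

static bool[,] GrowRegion(double[,] gray, int seedY, int seedX, double tolerance)
{
    int h = gray.GetLength(0), w = gray.GetLength(1);
    var inRegion = new bool[h, w];
    var frontier = new Queue<int[]>();
    frontier.Enqueue(new[] { seedY, seedX });
    inRegion[seedY, seedX] = true;
    double sum = gray[seedY, seedX];
    int count = 1;
    int[] dy = { -1, 1, 0, 0 }, dx = { 0, 0, -1, 1 };

    while (frontier.Count > 0)
    {
        int[] p = frontier.Dequeue();
        for (int k = 0; k < 4; k++)
        {
            int ny = p[0] + dy[k], nx = p[1] + dx[k];
            if (ny < 0 || ny >= h || nx < 0 || nx >= w || inRegion[ny, nx]) continue;
            // Similarity measure: distance of the pixel intensity from the region's running mean.
            if (Math.Abs(gray[ny, nx] - sum / count) <= tolerance)
            {
                inRegion[ny, nx] = true;
                sum += gray[ny, nx];
                count++;
                frontier.Enqueue(new[] { ny, nx });
            }
        }
    }
    return inRegion;   // mask of pixels belonging to the grown region
}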
Another interesting technique is the balanced histogram thresholding method, in which the image foreground and background are hued differently so that we see the outline of the foreground. The entire image is converted to a histogram of the intensities of all the pixels. The method then tries to find the threshold at which the histogram divides into two groups. It literally balances the histogram to find the threshold value: it weighs which of the two groups is heavier and adjusts the weights iteratively until the histogram balances. This method is appealing for its simplicity, but it does not work well with very noisy images because outliers distort the search for the threshold. We can work around this by ignoring the outliers.
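A sketch of that balancing loop over a 256-bin grayscale histogram, written from the description above; the bin count is an assumption and the bounds checks are added defensively.

static int BalancedThreshold(int[] hist)
{
    int start = 0, end = hist.Length - 1;
    int mid = (start + end) / 2;
    long weightLeft = 0, weightRight = 0;
    for (int i = start; i <= mid; i++) weightLeft += hist[i];
    for (int i = mid + 1; i <= end; i++) weightRight += hist[i];

    while (start <= end)
    {
        if (weightRight > weightLeft)                      // right side is heavier
        {
            weightRight -= hist[end--];
            if ((start + end) / 2 < mid && mid > 0)
            {
                weightRight += hist[mid];                  // shift the balance point left
                weightLeft -= hist[mid];
                mid--;
            }
        }
        else                                               // left side is heavier
        {
            weightLeft -= hist[start++];
            if ((start + end) / 2 >= mid && mid + 1 < hist.Length)
            {
                weightLeft += hist[mid + 1];               // shift the balance point right
                weightRight -= hist[mid + 1];
                mid++;
            }
        }
    }
    return mid;   // intensity threshold separating background from foreground
}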
Thus we have seen some interesting applications of image processing. 

cloud storage

When we deploy applications and databases to the cloud, we have to pay attention to the size of the data stored. Whether the data is in the form of files or in a database, it can be arbitrarily large. When the data exceeds several gigabytes and grows at a considerable rate, the local storage in a VM does not suffice. At that point, a dedicated storage area network (SAN) is usually the norm because it can provide capacity much larger than any local disks. Typically SAN is preferred for production database servers, but this is true even for file shares. There are storage appliances that provide large secondary and tertiary storage while making them available as a mounted share visible to all Active Directory users. This is different from network-attached storage (NAS) that is spread out in the form of VMs. Data in the case of NAS does not reside on a single virtual host and there is management overhead in finding the VM with the data requested.
That means we need to plan the rollout of our software with appropriate configurations. The storage media, whether the storage is local or remote, and the escalation path for incident reports all need to be decided before deployment. This is important not just for planning but for the way the application software is written. For example, data traffic and network chattiness can be reduced by cutting down the round trips and redundancies between the application and its input. Redundancy in operations is often ignored or undetected for want of features. For example, data files are copied from one location to another and then to a third location. If we either reduce the data to just the metadata we are interested in or copy the file only once to the target destination, we avoid that redundancy.

Thursday, May 30, 2013

Unit-testing data provider calls requires that the underlying DataContext be substituted with a test double; then we say it is mocked. The trouble is that no fakes or mocks can be generated for the DataContext type. This is primarily because the DataContext object derives from a base DbContext object rather than implementing an interface. If it implemented an interface, it could be mocked or faked. However, adding an interface that calls out the salient methods for mocking stubs is not usually a one-time solution. The DataContext class is itself auto-generated, and each time it is regenerated the interface would likely have to be added again or the tests won't pass. This adds a lot of maintenance for otherwise unchanging code. There is a solution with Text Template Transformation Toolkit (T4) templates, which come with the Entity Data Model tooling in EF 4.0. The T4 template creates the interface we desire.
We generate this the same way that we generate our data objects from EF. We right-click the Entity Framework model, set the code generation strategy to None, then add a new code generation item and select the ADO.NET Mocking Context Generator.
This way we have enabled testing the data provider classes by mocking the context classes underneath. So our testing can stay limited to this layer without polluting the layers beneath or reaching the database.
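To illustrate what the generated interface buys us, here is a hypothetical sketch; the interface, entity and provider names are made up for illustration and are not the actual output of the T4 template. The usual using directives for System, System.Collections.Generic and System.Linq are assumed.

public interface IStackTraceContext : IDisposable
{
    IQueryable<StackTraceEntry> StackTraces { get; }
    int SaveChanges();
}

public class StackTraceEntry
{
    public int Id { get; set; }
    public string DumpPath { get; set; }
    public string Bucket { get; set; }
}

public class StackTraceProvider
{
    private readonly IStackTraceContext _context;
    public StackTraceProvider(IStackTraceContext context) { _context = context; }

    // Because the provider depends only on the interface, a unit test can pass in
    // a fake context backed by an in-memory list instead of hitting the database.
    public IEnumerable<string> GetBuckets(string dumpPath)
    {
        return _context.StackTraces
                       .Where(s => s.DumpPath == dumpPath)
                       .Select(s => s.Bucket)
                       .ToList();
    }
}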
Testing the data provider class comes in helpful when we want to build the business logic layer on top of it. The business layer can now assume that layers underneath it know how to send the data across.
In the previous example, we considered a server and client for reading stack traces from dumps. If that logic were implemented, both the server and the client side could use EF and the data provider classes described above. The interfaces only help make the code more testable, at which point this common code can be factored out and included as a core project shared by the server and client code. The server and client code are written separately so that they can be developed independently.
As discussed earlier, the server code handles all the data for the population of the tables. The interface to the server is the same whether it is called from a PowerShell client, a UI or a file watcher service. By exposing the object that reads the stack trace from a dump directly in PowerShell, we improve automation. The file watcher service invokes the same interface for each of the files watched. The code could also keep the data local in a local database file that ASP.NET and EF can understand. This way we can even do away with the consumer side for the first phase and add it subsequently in the second phase. At that point the local database can be promoted to a shared database server.
Finally, processing the dump files may involve launching a debugger process. This means there has to be diagnosability of process failures and appropriate messages to the invoker. Since the process invoking the debugger might be handling all exceptions and not passing the exception information across the process boundary, failures to read the dump may be hard to diagnose. If the number of such failures is high enough that there is a backlog of unprocessed or partly processed dumps, the overall success rate of the solution suffers. One simple way to handle this is to stream all exceptions to the error output stream and read it from the invoking process.
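A minimal sketch of invoking the debugger in a child process while capturing its error stream so that failures surface to the invoker; the executable name and arguments are placeholders, and only standard error is redirected to keep the example short.

static int RunDebugger(string dumpPath, out string errors)
{
    var psi = new System.Diagnostics.ProcessStartInfo
    {
        FileName = "cdb.exe",                  // placeholder debugger executable
        Arguments = "-z \"" + dumpPath + "\"",
        RedirectStandardError = true,          // surface child-process failures to the caller
        UseShellExecute = false,
        CreateNoWindow = true
    };

    using (var process = System.Diagnostics.Process.Start(psi))
    {
        errors = process.StandardError.ReadToEnd();
        process.WaitForExit();
        return process.ExitCode;               // non-zero exit code signals a failed read
    }
}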
Here are some ways to test a service that relies on a file watcher and a service bus (a sketch of such a watcher follows the list). A service bus is merely a message delivery mechanism that guarantees messages will not be dropped while the sender and receiver (typically the client and the server) process these messages asynchronously.
1) Start copying a large file into the watched folder that takes a considerable time to copy.
2) Remove 'Everyone' or public access from the security settings of the watched folder.
3) Copy a file with extensions inclusive of the pattern being watched
4) Copy a file that triggers an exception in the service, such as a file with junk contents different from what is expected
5) Send a message that causes an exception in the server. If the server is expecting the path to a newly added file, send a path to which access is denied
6) Check the number of retries for the message in step 5)
7) For messages that repeatedly fail, check that the message made it to the poison queue.
8) Turn off the server that the messages are being received by and check the expiration of the messages and their move to dead letter queue.
9) Check the database for duplicate entries with same file name as the one in the watched folder
10) Check that the database entry for the same file input is refreshed on each attempt.
11) Check if this happens in a transaction where all the steps from moving a file, to clearing the message queue, to purging the database entry all occur in a rollback.
12) Check that the server processing the messages does not hang if the input doesn't contain what the server is looking for.
13) Check that the security settings on the file, folder, queue access, database and other dependencies are changed and don't cause service failures.
14) Check that queue cannot be flooded with unprocessed messages or denial of service attacks.
15) Check that watched folder does not have a backlog of files.
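For reference, the kind of watcher these scenarios exercise might look like the following sketch; the folder, file pattern and the ISendMessage abstraction over the service bus are assumptions for illustration (using directives for System and System.IO implied).

public interface ISendMessage { void Send(string body); }

public class DumpFolderWatcher : IDisposable
{
    private readonly FileSystemWatcher _watcher;
    private readonly ISendMessage _bus;

    public DumpFolderWatcher(string folder, ISendMessage bus)
    {
        _bus = bus;
        _watcher = new FileSystemWatcher(folder, "*.dmp");
        _watcher.Created += OnCreated;
        _watcher.EnableRaisingEvents = true;
    }

    private void OnCreated(object sender, FileSystemEventArgs e)
    {
        try
        {
            // A large file may still be copying (scenario 1) or access may be denied (scenario 2);
            // probe for exclusive access before queuing the path for processing.
            using (File.Open(e.FullPath, FileMode.Open, FileAccess.Read, FileShare.None)) { }
            _bus.Send(e.FullPath);
        }
        catch (IOException)
        {
            // Leave the file for a later sweep of the folder rather than losing it.
        }
        catch (UnauthorizedAccessException)
        {
            // Log and continue; the service should not crash on a permission problem.
        }
    }

    public void Dispose() { _watcher.Dispose(); }
}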

Wednesday, May 29, 2013

The web interface for the example mentioned in the previous post could be a simple list view with the MVC framework. HTML5 and CSS can be used for the views. The stack trace bucket viewer application could be a visual tool to see and edit individual stack trace records read from dumps, as well as a way to force the producer to retry reading the stack trace from a dump. The dump entries could carry an additional flag to denote the state, such as new, in progress and completed, in that order. If the state is reverted, reprocessing is required. If no intermediary states are required, such as for updates, then the insertion and deletion of a record suffices to trigger reprocessing. The producer service should watch for dump files and keep an association between the dump and the entry in the database. If the dump entry is not in the database, the dump is re-read. The lookup between the database and the dump can be quick since the service can look up the dump based on path and filename.
The file watcher and the service bus are often used together. The service bus helps to queue the dumps for processing. It also helps with error conditions and retries. Queuing goes by other names as well, such as MSMQ. However, depending on the workload, this may or may not be required. The benefits of queuing are that items can be processed asynchronously and retried. This can also be handled by the service itself since it works on one file at a time.
The table for dumps read and processed can grow arbitrarily large as many different dumps are processed. Depending on the number of dumps processed in a day and the size of the metadata we store for them, the table can grow large enough to require an aging policy and archiving of older records. The archival can be batched to the start of every month and run during a maintenance window. The archival requires a table similar to the source, possibly in a different database than the live one. The archival stored procedure could read the records a few at a time from the source, insert them into the destination and delete the copied records from the source. If the source is not a single table but a set of related tables, the archival will do this step for every table in the order that inserts are allowed. The order of deletes will be the reverse, since the constraints may need to be handled first. The inserts and deletes are not expected to fail because we select only the records that are in the source but not in the destination. This way we remain in a good state between each incremental move of records, which helps when a large number of records makes the stored procedure run long and become prone to interruptions or failures. The archival can resume from where it left off.
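A sketch of one batch of that archival, issued here from C# with embedded T-SQL for consistency with the other examples; the table and column names, the three-month cutoff and the ArchiveDb database are all hypothetical.

static void ArchiveOneBatch(System.Data.SqlClient.SqlConnection conn, int batchSize)
{
    const string sql = @"
        BEGIN TRANSACTION;
        -- copy a batch of old rows that are not yet in the destination
        INSERT INTO ArchiveDb.dbo.StackTraces (Id, DumpPath, Bucket, CreatedOn)
        SELECT TOP (@batch) s.Id, s.DumpPath, s.Bucket, s.CreatedOn
        FROM   dbo.StackTraces s
        WHERE  s.CreatedOn < DATEADD(month, -3, GETDATE())
          AND  NOT EXISTS (SELECT 1 FROM ArchiveDb.dbo.StackTraces a WHERE a.Id = s.Id);

        -- delete only rows that are confirmed to exist in the destination
        DELETE s
        FROM   dbo.StackTraces s
        WHERE  s.CreatedOn < DATEADD(month, -3, GETDATE())
          AND  EXISTS (SELECT 1 FROM ArchiveDb.dbo.StackTraces a WHERE a.Id = s.Id);
        COMMIT TRANSACTION;";

    using (var cmd = new System.Data.SqlClient.SqlCommand(sql, conn))
    {
        cmd.Parameters.AddWithValue("@batch", batchSize);
        cmd.ExecuteNonQuery();   // the caller repeats this until no rows remain to move
    }
}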
These services work with files and other Windows resources, so they may require that security is tightened and that dumps are handled only by a service account that has been authorized for reads and writes on the folder. This service account may be different for production but may require full access to all folders and sub-folders. File handling exceptions often affect the success rate of such file-based services. Internally, the same service account should be granted access to the database where the parsed dump information is stored. Exceptions handled by the services could be logged or stored in the database. On the consumer side of the store, users will use their own credentials, and their actions can be authenticated and authorized. This way we can tell apart the changes made by either side.
Since the services and dependencies are hosted separately, they may have to tolerate connectivity failures. From an end to end perspective, the file IO operations could all be isolated and made local to the machine with the dumps while all subsequent processing is with the database.

 

Tuesday, May 28, 2013

Cloud computing gives us the ability to develop applications that are virtualized across hardware and software stacks. Applications are no longer monolithic but sharded into different modules, each of which can reside on a different VM with its own software and hardware stack. Virtual machines, operating systems, server products and hosts can differ for each module. These modules can still provide the same experience for a user as if the user were interacting with a single application. Sign-on, for example, could happen only once while the user visits different modules. Application storage, caching and services are now supported on dedicated resources.
If we want to provide APIs for our services, they can be scoped so that different services meet different needs. APIs can be REST based, and this will expand their reach.
Let us take the example of provisioning a stack trace service that iterates over the dump files in a collection folder and populates a data store with stack traces read from each dump. In this case, we could expect the following APIs from the stack trace service:
IEnumerable<string> GetStackTrace(Stream dumpFileStream); // returns the raw stack trace read from a dump stream
IEnumerable<string> ResolveSymbols(IEnumerable<string> stackTrace, IEnumerable<string> symbolPath); // pretty-prints the trace by resolving symbols
IEnumerable<string> GetStackTrace(string pathToDumpAndSymbols); // combines the two operations above

Next for the datatable that we populate called StackTraces, we will have attributes such as source information, bucket information and stack trace.

So we can enable all LINQ based operations on this entity.
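For example, a consumer could run a query like the following; the context and property names are hypothetical stand-ins for the generated EF classes.

using (var context = new StackTraceContext())
{
    var topBuckets = context.StackTraces
        .Where(s => s.CreatedOn > DateTime.UtcNow.AddDays(-7))
        .GroupBy(s => s.Bucket)
        .Select(g => new { Bucket = g.Key, Count = g.Count() })
        .OrderByDescending(x => x.Count)
        .Take(10)
        .ToList();   // the query is sent to the store only at this point
}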

This entity will be displayed by a service or front end that is independent from the stack trace population service. The front end could be read-only, allowing users to aggregate, search and sort stack traces from dumps.

In this case we have therefore separated out the producer and consumer modules of our system, and they are ready to be hosted on different VMs. For example, the producer service could sit on the same server as the collection folder and have large storage, since the dumps can be on the order of gigabytes and their collections can be arbitrarily large. The consumer is more of a three-tier web application and can be hosted on an app server. The data table can be in a cloud data store on yet another VM or storage account.

Two services and one table can scale to add other functionality, but together they have adequate information shared in the data table for diagnostics, audit and tracking.

Monday, May 27, 2013

XPath query language

XPath is a query language for XML. XML is structured data where the document is organized as a tree. The relative position of an element with respect to the root is called a path, and there is only one path from that element to the root. The selection criteria for nodes are called predicates. The different ways of slicing a tree, or the line to follow, are called axes; these can be, for example, parent, child or self, with child being the default. Paths can be nested inside predicates and predicates can be nested inside paths. Queries are expressed as full or partial paths with selection, and they can also be expressed in short forms. The position in the XML tree at which the next processing should take place is tracked with a context node. Nodes can have attributes, namespaces and text. Element positions are 1-based and in document order.
The expressions that denote the path usually describe a starting point (the context node), a line of search if not the child axis, and other absolute or relative paths. Paths have steps from one level to another and can include a mix or nesting of predicates and grouping via parentheses. Query results are returned in document order and XPath does not modify the nodes. Standard operators such as union and standard functions such as count, id, sum and starts-with are available to use with the query. XPath queries can return all elements in a document based on path steps as given in the path expressions. Queries can return attributes that begin anywhere as long as the name is matched. Queries can use wildcards, say to denote all children of any element matching a given path. Queries can evaluate selection conditions where the text is compared to a constant. Queries can evaluate conditions based on attributes within the elements.
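A few of these query shapes, evaluated over an illustrative document with the System.Xml.Linq and System.Xml.XPath APIs; the document and the paths are made up for the example.

var doc = XDocument.Parse(@"
  <catalog>
    <book id='1'><title>XML Basics</title><price>25</price></book>
    <book id='2'><title>XPath In Depth</title><price>40</price></book>
  </catalog>");

// absolute path with a predicate comparing child text to a constant
var expensiveTitles = doc.XPathSelectElements("/catalog/book[price > 30]/title");

// wildcard step: all children of any book element
var allChildren = doc.XPathSelectElements("//book/*");

// predicate on an attribute, and a standard function over a path
var second = doc.XPathSelectElement("//book[@id='2']");
var bookCount = doc.XPathEvaluate("count(/catalog/book)");   // 2, returned as a double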

 

Sunday, May 26, 2013

Various usages of a tree.
1) Use the structure of a tree to find the relative position of a node with the root and to fully qualify that node.
2) Use the structure of a tree to discover siblings that share the same parent.
3) Use the structure of a tree to iterate over the siblings by walking laterally between them.
4) Use the structure of a tree to recursively traverse the tree to repeat the same operations.
5) Use the structure of a binary tree to traverse the tree in preorder, inorder and postorder (a short sketch in code follows this list).
6) Use the structure of a tree to find the common ancestor of two nodes
7) Use the structure of a tree to find the predecessor of a node in a binary search tree
8) Use the structure of a tree to find the successor of a node in a binary search tree
9) Use the structure of a tree to find if a node exists in the binary search tree
10) Use the structure of a tree to identify a dendrogram among a flat list of data points
11) Use the structure of a tree and a schema definition to validate the data
12) Use the structure of a tree to select the elements from a document
13) Use the structure of a tree to slice it in different ways
14) Use the structure of a tree to nest it as an expression in another
15) Use the structure of a tree to make a clone or copy a sub-tree
16) Use the structure of a tree to visit elements to perform operations without affecting the tree
17) Use the structure of a tree to do breadth first search or depth first search
18) Use the structure of a tree to color nodes as red or black in specific ways for specific tasks.
19) Use the structure of a tree to organize conditions where each element of a collection is evaluated against a tree
20) Use the structure of a tree to cluster and organize large sets of data
21) Use the structure of a tree for efficiently retrieving spatial data
22) Use the structure of a tree for scoping operations via inheritance and composition
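A couple of these usages in code: a recursive in-order traversal and a lookup in a binary search tree. The node type is a minimal illustration and the using directive for System is assumed.

public class Node
{
    public int Value;
    public Node Left, Right;
}

public static class TreeOps
{
    // In-order traversal: visits a binary search tree's values in sorted order.
    public static void InOrder(Node node, Action<int> visit)
    {
        if (node == null) return;
        InOrder(node.Left, visit);
        visit(node.Value);
        InOrder(node.Right, visit);
    }

    // Membership test that uses the BST ordering to discard half the tree at each step.
    public static bool Contains(Node node, int value)
    {
        while (node != null)
        {
            if (value == node.Value) return true;
            node = value < node.Value ? node.Left : node.Right;
        }
        return false;
    }
}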

Re-assess our approach in clustering methods

We discussed out-of-the-box support for data mining in server products in the previous post, and prior to that we discussed methods of text mining that involve clustering. We discussed choices of clustering methods. We favored clustering because it lets us evaluate topics and keywords based on similarity measures and because we could not determine predictive parameters for keyword extraction.
If we explore the approach that keywords have a predictive parameter in and of themselves as they appear in an input text, then we can explore significant optimization and an easier approach. The parameter could be based on a large training data set or by exploring graphs in a word thesaurus or ontology. That said, if we want to find words similar to those that occur in the input text, we resort to clustering.
SQL Analysis Services provides the ability to write mining models to make predictions or analyze your data. Mining model content comprises the metadata about the model, statistics about the data, and patterns discovered by the mining algorithm. The content may include regression formulas, definitions of rules and item sets, or weights and other statistics depending on the algorithm used. The structure of the model content can be browsed with the Microsoft Generic Content Tree Viewer provided in SQL Server Data Tools.
The content of each model is presented as a series of nodes. Nodes can contain count of cases, statistics, coefficients and formulas, definition of rules and lateral pointers and XML fragments representing the data. Nodes are arranged in a tree and display information based on the algorithm used. If a decision tree model is used, the model can contain multiple trees, all connected to the model root. If a neural network model is used, the model may contain one or more networks and a statistics node. There are around thirty different mining content node types.
The mining models can use a variety of algorithms and are classified as such. These can be association rule models, clustering models, decision tree models, linear regression models, logistic regression models, naïve Bayes models, neural network models, sequence clustering and time series models.
Queries run on these models can make predictions on new data by applying the model, get a statistical summary of the data used for training, extract patterns and rules, extract regression formulas and other calculations, get the cases that fit a pattern, retrieve details about the individual cases used in the model, and retrain a model by adding new data or performing cross-prediction.
One specific mining model is the clustering model and is represented by a simple tree structure. It has a single parent node that represents the model and its metadata, and each parent node has a flat list of clusters. The nodes carry a count of the number of cases in the cluster and the distribution of values that distinguish this cluster from other clusters. For example, if we were to describe the distribution of customer demographics, the table for node distribution could have attribute names such as age and gender, attribute values such as number, male or female, support and probability for discrete value types and variance for continuous data types. Model content also gives information on the name of the database that has this model, the number of clusters in the model, the number of cases that support a given node and others. In clustering, there's no one predictable attribute in the model. Analysis services also provides a clustering algorithm and this is a segmentation algorithm. The cases in a data set are iterated and separated into clusters that contain similar characteristics. After defining clusters, the algorithm calculates how well the clusters represent  the groups of data points and then redefines the cluster to better represent the data. The clustering behavior can be tuned with parameters such as the maximum number of clusters or changing the amount of support required to create a cluster.
Data for clustering usually have a simple one key column, one or more input columns and other predictable columns. Analysis services also ships a Microsoft Cluster Viewer that shows the clusters in a diagram.
The model is generally trained on a set of data before it can be used to make predictions. Queries help to make predictions and to get descriptive information on the clusters.
Courtesy: MSDN

Saturday, May 25, 2013

Expressions and queries work in similar ways. I referred to them in an earlier post but I'm trying to cover them here. Expressions are represented as trees. Expression trees are immutable; if you want to modify an expression tree, you construct a new one by copying the existing tree and replacing nodes. You can nest expressions and define the precedence based on the structure of the tree. Different parts of the tree can be tagged so that they can be processed differently. The leaves of the tree are usually the constants or nulls representing the data on which the expression tree evaluates. There are around 45 expression tree node types. New operations can be added almost anywhere in the tree; however, adding them at the leaves keeps them as close to the data as possible. Only some leaves may require the new operation, in which case the changes are not pervasive throughout the expression tree. This is especially helpful given that the expression could be used anywhere, nested and recursive. The size of the data used and the size of the tree can be arbitrarily large, so considering performance is helpful. Queries work similarly, except that an expression can be part of the predicate in a query. Predicate push-down allows a query to be passed through different systems. The servers typically don't interpret what the expression is if it is user-defined and operates on their data. For the ones that the server needs to keep track of, these expressions are compiled and have an execution plan. The execution plan helps to improve and control execution because the expressions are translated into a language that the system can work on. Queries and expressions each have their purpose, are often interchangeable, and there are usually many ways to solve a problem using either or both. You can traverse an expression tree with an expression tree visitor.
Queries, when they conform to the conventions that LINQ proposes, can be executed by more than one system, such as the Entity Framework and the database server. LINQ stands for Language Integrated Query, and it defines queries in a way that can be executed against different data stores such as XML, a database server, ADO.NET datasets and so on. These queries typically take the form of standard query operator methods such as Where, Select, Count, Max and others. Typically LINQ queries are not executed until the query variable is iterated over; this is why we use lambdas. Queries are generally more readable than their corresponding method syntax. IQueryable queries are compiled to expression trees while IEnumerable queries are compiled to delegates. The compilers provide support to parse the lambdas in the statement. LINQ expressions have a Compile method that compiles the code represented by an expression tree into an executable delegate. There is an expression tree viewer application in the Visual Studio samples.
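A small illustration of the difference: the same predicate built by hand as an expression tree, compiled to a delegate, and written as a lambda that the compiler turns into an expression tree (using System and System.Linq.Expressions assumed).

ParameterExpression n = Expression.Parameter(typeof(int), "n");
Expression<Func<int, bool>> isEven =
    Expression.Lambda<Func<int, bool>>(
        Expression.Equal(
            Expression.Modulo(n, Expression.Constant(2)),
            Expression.Constant(0)),
        n);

Func<int, bool> compiled = isEven.Compile();   // compiles the tree into an executable delegate
bool result = compiled(10);                    // true

// The same predicate as a lambda; assigned to Expression<Func<...>> it stays a tree,
// assigned to Func<...> it is compiled to a delegate - the IQueryable/IEnumerable split.
Expression<Func<int, bool>> asTree = x => x % 2 == 0;
Func<int, bool> asDelegate = x => x % 2 == 0;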
LINQ queries make the queries part of the programming constructs available in a language while hiding the data that they operate on. In this case, it is important to mention that different data sources may have different requirements or syntax for expressing their queries. LINQ to XML, for example, may need the XML queries to be written in XPath. This is different from relational queries, which are more like the LINQ constructs themselves. Queries against any data store can be captured and replayed independent of the caller that makes them.
LINQ queries and expressions that have Lambdas have the benefit that the Lambdas are evaluated only when the results are needed.

Friday, May 24, 2013

Service Oriented Architecture
As a practice for reusable, single-point-of-maintenance code, we write services. These are used by the various components in our applications or for client calls. These services typically hide the data providers downstream and serve as a single point of communication for all client needs. This way the server code can live in these services and exercise better control over the data and its users. Typically there are many clients and more than one data provider, justifying the need for a single service to broker in between.
However, when we decide to work with more than one service, we organize our services in a hierarchy where different components deliver different functionality but, to the outside world, there is only one service. This is one of the ways we define a service oriented architecture.
Next we define the scope of this service and elevate it all the way to an enterprise. This is where the true value of such an architecture lies. It abstracts several clients from the heterogeneous data sources and provides a value proposition for the enterprise.
Communication with these services can be formalized via message based paradigms. Messages enable us to define address, binding and contracts in a way that brings together the benefits of declarative communication protocols and keep the server and clients focus on the business. This is where WCF or Windows Communication Foundation comes into play.
Services enable a myriad of clients to connect. Some of these can be handheld devices such as mobile phones, enabling rich applications to be written. The same services can also be used by desktop applications or over the browser. Clients can be thin or rich while the same service caters to both. The ability to support applications via the browser makes SOA all the more appealing with its ubiquity and availability.
Services abstract the provisioning of servers and resources that are required to handle the traffic from the web. These servers and resources can be VM slices and in the cloud or with extensive support in data centers. This lowers the cost of operations for these services and increases their availability and reach. Services provisioned in the cloud have a great appeal for rotating data and applications from one server to the other with little or no downtimes thus improving maintenance.
Services also enable rich diagnostics and caller statistics via the http traffic through http proxies. Such reports not only improve the health of the code but also enable monitoring and meeting the needs of online traffic.  Diagnostics help identify the specific methods and issues so little time is spent on reproducing the issues.
Services written this way are very scalable and can meet the traffic generated for anniversaries or from all over the world. Such services can use clusters and support distributed processing. Services also enable integration of data with code and tight coupling of the business logic so that callers cannot interpret the trade secrets of the business offered by the services.
Applications improve the usability of these services and can bring additional traffic to the company.
Services have an enduring appeal across political and business changes and these can serve to offer incremental value propositions to the company. Finally, services make it easier for functionalities to be switched in and out without disrupting the rest of the system. Even internally, services can be replaced with mocks for testing.
So far from our posts we have seen that there are several tools for text mining. For example, we used machine-based learning with a tagged corpus and an ontology. A vast collection of text has been studied and prepared in this corpus, and a comprehensive collection of words has been included in the ontology. This gives us a great resource to work with any document. Next we define different distance vectors and use clustering techniques to group and extract keywords and topics. We have refined the distance vectors and data points to be more representative of the content of the text. There have been several ways to measure distance or similarity between words, and we have seen articulation of probability-based measures. We have reviewed the ways we cluster these data points and found methods that we prefer over others.
We want to remain focused on keyword extraction even though we have seen similar usages in topic analysis and some interesting areas such as text segmentation. We don't want to resort to a large corpus for lightweight application plugins, but we don't mind a large corpus for database searches. We don't demand processing better than O(N^2) in working with the data to extract keywords, and we have the luxury of a pipeline of steps to get to the keywords.

Thursday, May 23, 2013

Writing powershell commands
PowerShell lets you invoke cmdlets on the command line. A custom cmdlet is an instance of a .NET class. A cmdlet processes its input from an object pipeline instead of text, one object at a time. Cmdlets are attributed with a CmdletAttribute and named with a verb-noun pair. The class derives from PSCmdlet, which gives you access to the PowerShell runtime. The custom cmdlet class could also derive from Cmdlet, in which case it is more lightweight. Cmdlets don't handle argument parsing and error handling themselves; these are done consistently across all PowerShell cmdlets.
Cmdlets support the ShouldProcess parameter, which lets the class have access to the runtime behavior parameters Confirm and WhatIf. Confirm specifies whether user confirmation is required. WhatIf informs the user what changes would have been made when the cmdlet is invoked.
Common methods to override include BeginProcessing which provides pre-processing functionality for the cmdlet, ProcessRecord which can be called any number of times, EndProcessing for post-processing functionality and StopProcessing when the user stops the cmdLet asynchronously.
CmdLet parameters allow the user to provide input into the CmdLet. This is done by adding properties to the class that implements the CmdLet and adding ParameterAttribute to them.
ProcessRecord generally does the work of creating new entries for data.
Parameters must be explicitly marked as public. Parameters can be positional or named. If the parameter is positional, only the value is provided with the cmdlet invocation. In addition, parameters can be marked as mandatory, which means they must have a value assigned.
Some parameters are reserved and are often referred to as common parameters. Another group of parameters, called the ShouldProcess parameters, give access to the Confirm and WhatIf runtime support. Parameter sets, which refer to groupings of parameters, are also supported by PowerShell.
For exception handling, a try/catch can be added around the class method invocation; the catch should add more information about the error when it happens. If you don't want to stop the pipeline on error, do not throw with ThrowTerminatingError.
Results are reported through objects. Powershell is emphatic on the way results are displayed and there's a lot of flexibility in what you want to include in your result objects. WriteObject is what is used to emit the results. These results can be returned to the pipeline. As with parameters, there should be consistency in the usage of both results and parameters.
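A skeleton along these lines follows; the verb-noun name, the parameter and the placeholder body are illustrative rather than a real implementation.

using System.Management.Automation;

[Cmdlet(VerbsCommon.Get, "StackTrace", SupportsShouldProcess = true)]
public class GetStackTraceCommand : PSCmdlet
{
    [Parameter(Mandatory = true, Position = 0)]
    public string DumpPath { get; set; }

    protected override void ProcessRecord()
    {
        if (!ShouldProcess(DumpPath, "Read stack trace"))   // honors -WhatIf and -Confirm
            return;
        try
        {
            var frames = new[] { "frame0", "frame1" };      // placeholder for the real dump read
            WriteObject(frames, true);                      // emit results to the pipeline
        }
        catch (System.IO.IOException ex)
        {
            // report the failure without terminating the whole pipeline
            WriteError(new ErrorRecord(ex, "DumpReadFailed", ErrorCategory.ReadError, DumpPath));
        }
    }
}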
There should be support for diagnostics when things go wrong so that the problem can be identified quickly and resolved. There is builtin support to send messages to the host application which could be powershell.exe and that displays the messages to the pipeline.
CmdLets can also be grouped so that the parameters or results need not be repeated. This is very convenient when there are fine grained CmdLets required but they essentially belong to the same group.  A snap in can also be created with PSSnapIn so that the CmdLets are registered for usage. These are available from the System.Management.Automation namespace. Installing a snap in is done via InstallUtil.exe  which creates some registry entries. Make sure that System.Management.Automation.dll is available from the SDK or the Global Assembly Cache (GAC).

Wednesday, May 22, 2013

I learned today that expressions and queries should be treated differently. Even though you can have the predicate in a query as an expression tree and vice versa, there are several reasons to use one or the other specifically in certain scenarios.

Tuesday, May 21, 2013

Nltk classifier modules
The nltk decision tree module is a classifier model. A decision tree comprises non-terminal nodes for conditions on feature values and terminal nodes for the labels. The tree evaluates to a label for a given token.
The module requires feature names and labeled feature sets. Different thresholds such as for depth cut off, entropy cut off and support cut off can also be specified. Entropy refers to degree of randomness or variations in the results while support refers to the number of feature sets used for evaluation.
The term feature is used to refer to some property of an unlabeled token. Typically a token is a word from a text that we have not seen before. If the text has been seen before and has already been labeled, it is a training set. The training set helps train our model so that we can pick the labels better for the tokens we encounter. As an example, proper nouns for names may be labeled male or female. We start with a large collection of already tagged names, which we call training data. We build a model where we say that if the name ends with a certain set of suffixes, the name is that of a male. Then we run our model on the training data to see how accurate we were, and we adjust our model to improve its accuracy. Next we can run our model on test data. If a name from the test data is labeled by this model as male, we know its likelihood of being correct.
The properties of labeled tokens are also helpful. We call these joint-features and we distinguish them from the features we just talked about by referring to the latter as input-features. So joint-features belong to training data and input-features belong to test data. For some classifiers such as the maxent classifier we refer to these as features and contexts respectively. Maxent stands for the maximum entropy model, where joint-features are required to have numeric values and input-features are mapped to a set of joint-features.
There are other types of classifiers as well. For example, the mallet package uses the external Mallet machine learning package. The megam module uses the external megam maxent optimization package. The naïve Bayes module assigns a probability to each label: P(label|features) is computed as P(label) * P(features|label) / P(features). The 'naive' assumption is that all features are independent. The positivenaivebayes module is a variant of the Bayes classifier that performs binary classification based on two complementary classes where we have labeled examples only for one of the classes. Then there are classifiers based exclusively on the corpus that they are trained on. The rte_classify module is a simple classifier for the RTE corpus. It calculates the overlap in words and named entities between text and hypothesis. Most of the classifiers discussed are built on top of the scikit machine learning library.

Nltk cluster package
This module contains a number of basic clustering algorithms. Clustering is unsupervised machine learning to group similar items within a large collection. There are k-means clustering, E-M clustering and group-average agglomerative clustering (GAAC). K-means clustering starts with k arbitrarily chosen means and assigns each vector to the cluster with the closest mean. The centroid of each cluster is then recalculated as the mean of its members. The process is repeated until the clusters stabilize. This may converge to a local maximum, so the method is repeated for other random initial means and the most commonly occurring output is chosen. Gaussian EM clustering starts with k arbitrarily chosen means, prior probabilities and co-variance matrices, which form the parameters for the Gaussian sources. The membership probabilities are then calculated for each vector in each of the clusters - this is the E-step. The parameters are then updated in the M-step using the maximum likelihood estimate from the cluster membership probabilities. This process continues until the likelihood of the data does not significantly increase. GAAC clustering starts with each of the N vectors as singleton clusters and then iteratively merges the pair of clusters with the closest centroids. This continues until there is only one cluster. The order of merges is useful in finding the membership for a given number of clusters because earlier merges are lower than the depth c in the resulting tree.
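A generic sketch of the basic k-means loop described above, written in C# for consistency with the other examples here (the nltk implementation itself is Python). Vectors are plain double arrays, the initial means are picked arbitrarily from the data, and using directives for System and System.Linq are assumed.

static int[] KMeans(double[][] vectors, int k, int iterations)
{
    var rnd = new Random(0);
    double[][] means = vectors.OrderBy(v => rnd.Next()).Take(k).ToArray();   // arbitrary initial means
    var assignment = new int[vectors.Length];

    for (int iter = 0; iter < iterations; iter++)
    {
        // assignment step: put each vector in the cluster with the closest mean
        for (int i = 0; i < vectors.Length; i++)
            assignment[i] = Enumerable.Range(0, k)
                                      .OrderBy(c => Distance(vectors[i], means[c]))
                                      .First();

        // update step: recompute each mean as the centroid of its members
        for (int c = 0; c < k; c++)
        {
            var members = vectors.Where((v, i) => assignment[i] == c).ToArray();
            if (members.Length == 0) continue;                                // keep the old mean
            means[c] = Enumerable.Range(0, vectors[0].Length)
                                 .Select(d => members.Average(v => v[d]))
                                 .ToArray();
        }
    }
    return assignment;   // cluster index per input vector
}

static double Distance(double[] a, double[] b)
{
    double sum = 0;
    for (int i = 0; i < a.Length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
    return Math.Sqrt(sum);
}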
Usage of WixSharp to build MSI for installing and uninstalling applications.
WixSharp makes it easy to author the logic for listing all the dependencies of your application for deployment. It converts the C# code to wxs file which in turn is compiled to build the MSI. There are a wide variety of samples in the WixSharp toolkit. Some of them require very few lines to be written for a wide variety of deployment time actions. The appeal in using such libraries is to be able to get the task done sooner with few lines of code. The earlier way of writing and editing WXS file was error prone and tedious.

Monday, May 20, 2013

nifty way to find file format

Here's something I found on the internet. If you want to know a file format and there is little or no documentation of the proprietary format, you can look up the header of the file and other data structures with the corresponding symbols from the kernel PDB file. You don't need the source to look up the format.
There are tools that can render the information in the PDB so it can be navigated through a UI. This is possible via the DIA SDK. The format of PDB files is also not open, hence access to them is via the debugger SDK. The SDK is available via COM, so you may have to register the DIA DLL.
The debuggers make a local cache of the symbols when requested and they download the symbols from the symbol server, so you can expect the PDB to be made available to you by the debugger in the directory you specify.
If you look at the kernel PDB, you will find that the structures we are looking for start with the name MINIDUMP, and these can be walked from the header onwards.
To find the stack trace, we follow the header to the directory or stream for the exception and read the exception record and the address where the exception occurred. Both of these are given in the MINIDUMP_EXCEPTION data structure. The exception stream also gives the thread context. The context gives the processor-specific register data. When we dump the stack pointer, we get the stack trace. We resolve the symbols of the stack trace with the PDBs either explicitly through DIA or implicitly through the managed mdbgeng library of the debugger SDK. The minidump actually has all the information in the file itself. For example, you can list all the loaded modules with the minidump module list as well as the minidump module structures. Module name, function name, line number and stack frame are available via IMAGEHLP data structures. The various types of streams in the minidump are:
Thread list stream given by the MINIDUMP_THREAD_LIST structure
Module list stream given by the MINIDUMP_MODULE_LIST structure
Memory list stream given by the MINIDUMP_MEMORY_LIST structure
Exception stream given by the MINIDUMP_EXCEPTION_STREAM structure
System Info stream given by the MINIDUMP_SYSTEM_INFO structure
ThreadExList stream given by the MINIDUMP_THREAD_EX_LIST structure
Memory64 list stream given by the MINIDUMP_MEMORY64_LIST structure
Comment stream
Handle Data stream given by the MINIDUMP_HANDLE_DATA_STREAM structure
Function table stream given by the MINIDUMP_FUNCTION_TABLE structure
Unloaded module list stream given by the MINIDUMP_UNLOADED_MODULE_LIST structure
Misc Info List stream given by the MINIDUMP_MISC_INFO structure
Thread Info List stream given by the MINIDUMP_THREAD_INFO_LIST structure
Handle operation  list stream given by the MINIDUMP_HANDLE_OPERATION_LIST structure
For diagnostics, we may choose to display messages to the output or error stream. More features can be built into the tool that retrieves the stack trace from the dump. This can be done in an extensible manner where the tool runs a set of commands from the user by way of the command line. Internally we can use a command pattern to implement the different debugger-like functionalities of the tool. Also, the tool can be deployed via MSI, which ensures cleanliness during install and uninstall.

Sunday, May 19, 2013

Review: Paper on comparison of document clustering techniques by Steinbach, Karypis and Kumar from the University of Minnesota.
This paper compares k-means clustering methods to hierarchical clustering methods. The paper suggests that bisecting k-means technique is better than the standard k-means technique which is in turn better than hierarchical techniques.
Hierarchical clustering has quadratic time complexity whereas k-means has a time complexity that is linear. There are mixed approaches too.
There are two metrics used for cluster quality analysis. Entropy is one which provides a measure of goodness for single level clusters. F-measure is the other which measures the effectiveness of hierarchical clustering.
The bisecting k-means clustering is explained as follows:
Step 1 pick a cluster to split
Step 2 find two clusters using the basic k-means algorithm
Step 3 Repeat step 2 for a fixed number of times and take the split that produces the clustering with the highest overall similarity
Step 4 Repeat step 1, 2 and 3 until the desired number of clusters are reached.
Splitting the largest cluster also works.

Agglomerative hierarchical clustering has the following variations:
1) Intra-cluster similarity: This hierarchical clustering looks at the similarity of all documents in the cluster to the centroid, where the similarity is given as the sum of cosines. The pair of clusters that, when merged, leads to the smallest decrease in similarity is chosen.

2) Centroid similarity technique: This works in a similar way but it takes the similarity distance as the cosine between the centroids of the two clusters.

3) The UPGMA scheme is based on a cluster similarity measure that takes the sum of the cosine distances between documents of different clusters divided by the product of the sizes of the two clusters.

An explanation is given for why agglomerative hierarchical clustering performs poorly when compared with bisecting k-means: the former puts documents of the same class in the same cluster early on, and those decisions are generally not reversed.
 

Saturday, May 18, 2013

IPSEC

I was going to be posting on text indexing but I will make a post on IPSEC before that.

IPSEC is a suite of protocols for securing network connections. IP packets are authenticated and encrypted for the duration of a session. A variety of protocols can be used for authentication and encryption. It provides several controls for network connections and is generally better organized than many other networking protocols. It provides end-to-end IP connectivity between two endpoints, whether host to host, network to network or network to host.
The IPSec suite is an open standard. It uses the following protocols to perform various operations.
Authentication headers (AH): This guarantees that the sender of IP packets is who the packet says it is from and that the packet has not been tampered with. This prevents spoofing and replay attacks. This is achieved by computing a hash value called the Integrity Check Value and a sequence number. The sequence number helps to use a sliding window to determine the packets that are old and can be discarded.
Encapsulating Security Payloads (ESP): This provides confidentiality protection of packets. It supports both encryption and authentication configurations for direct IP connectivity as well as tunnel based connectivity. A tunnel is used to describe communication between two endpoints over a public network such that two endpoints can talk to each other without letting any of the other hosts on the public network know. A common example is when people connect to their office from home. This is implemented by slapping on another IP header over the original. This way the public network routes the packets based on the first header but the source and destination look at the inner IP packets to know that the packets are from each other. ESP unlike AH does not support integrity and authentication for the entire IP Packet.
Security Associations (SA): This is the group of algorithms and parameters, such as keys, being used to encrypt and authenticate a particular flow in one direction. A pair of security associations is required to secure bidirectional traffic. These groupings are well organized and policies are enforced using a policy agent.
There are two modes of transport for IPSEC depending on host to host configuration or those involving network tunnels and are referred to as the transport mode and tunnel mode respectively.
In the transport mode, only the payload of the IP packet is usually encrypted or authenticated and the IP header is preserved. The limitation of this mode is that the IP addresses cannot be translated when the authentication header is used as it will invalidate the hash value.
In the tunnel mode, an entire IP packet is  encrypted and/or authenticated because it is encapsulated into a new IP packet with a header.
The algorithms used to protect the packets include SHA-1 for integrity and authenticity, and Triple DES and AES for confidentiality. The key negotiation for authentication is usually included with the IPSEC implementation from a vendor.
IPSEC implementation in earlier windows was a standalone component separate from windows firewall. This has changed since. IPSEC lets you author in more generic terms a set of rules and settings that define the security policies of your network and are implemented by each and every host on your network. You author these IPSEC policy settings as well as the individual policy or rules with IP filters and filter actions. IP filters define a set of IP traffic.
For example, a computer on the intranet can have the following rules: allow connections with resource servers, allow connections with other intranet computers, but deny connections to everyone else. These are authored as inbound and outbound rules.  Filters are evaluated based on weights. The weights are decided based on source IP address, subnet mask, destination IP address, subnet mask , IP protocol, source port, destination port. The source destination IP address port pairs identify a connection. Along with the filters and filter actions, you can also define the authentication methods such as Kerberos, Active Directory or certificate based.
The policies are written for the domain and are retrieved by the policy agent running on the host computers that want to communicate. These policies are passed to the IKE module, which determines the authentication mechanism from the negotiation settings of IPSEC, determines the secret key, and the protection of transport and tunnel mode traffic. These are then passed as SA parameters to the IPSEC drivers, which use them to protect the traffic. Since the IPSEC driver sits below the application and the TCP/IP network stack, it handles all IP traffic.
After the policies are created, they can be assigned to different AD domains, sites and organizational units, giving you the flexibility to define the scope for your rules and removing the redundancy of having to repeat the rules on each host. Local IPsec policies are overridden by domain-based IPsec policies, and so on.

Friday, May 17, 2013

Dump file format 2 (blog post continued)

Extracting the stack trace is different from resolving the stack trace function pointers with symbols. For the extraction part, we read dump files from external sources. For the resolution part, we read symbols from mostly internal sources unless otherwise provided; the latter can happen offline. There is support via the Debug Interface Access SDK (DIA) and, somewhat more generally, via the debugger client SDK that ships with Debugging Tools for Windows. The latter has an interface in C# as opposed to the COM-based interface of DIA. There are also more debugging features available via the debugger SDK.
The debugger SDK assembly (mdbgeng) requires full trust. When redistributing a package with this assembly, it is probably better to register it in the global assembly cache or distribute it with NuGet. For the most part, we want to focus on preliminary analysis of dumps using streams.
Also we could provide APIs for the functionalities we write, so that any client, powershell or standalone executable can call these.
API design could consider a subset of the debugger sdk as appropriate. The two main methods we are interested in are GetStackTraceFromStream and OutputStackTrace.
The APIs could additionally consider methods for retrieving bucket information, timestamps, and additional details such as the system information on which the dump occurred.
The PowerShell implementation of these APIs is enabled via the appropriate attribute on the methods mentioned above.
Exception handling and return values are limited to very few meaningful messages.
Also, deployment of a standalone tool for this can be MSI based so that install and uninstall is easy. The MSI can be generated with libraries such as WixSharp.

Next blog post will continue on text indexing methods.

Thursday, May 16, 2013

Dump file format

Dump files have specific formats that help in debugging. For example, they store the system information and exception record as the first few fields of data that they carry, and hence at calculable offsets from the start of the file. The exception record has the exception that produced the dump. It also has additional information such as the exception code, which gives the bucket under which this exception falls, such as access violation, array bounds exceeded, divide by zero, invalid operation, or overflow and underflow. Exception records can be chained together to provide additional information on nested exceptions. The exception address gives the address at which the exception occurred and is used for the stack trace.
Exceptions are not always on the first thread, so a display of the stack trace of the first thread may not capture the exception that triggered the dump. This is obtained with the debugger command .ecxr, which sets the context to that of the exception; the stack trace command then gives the desired stack trace. The stack trace can also be manually displayed with the dd command on the ebp or esp register after .ecxr. This can then be resolved against symbols to display function names.
The dump file does not need to be searched for threads. The system information directory and the exception record directory precede all other data, hence the lookup of the exception address is easier. The exception directory is followed by the exception record and the context of the thread. Additional thread info structures can follow as an array of thread-info data structures.
Dump files use relative virtual addresses (RVAs) to point to data members within the file. These are offsets from the start of the file. The format specifies a set of directories that point to the data. Each directory specifies the data type, the data size and the RVA of the location of the data in the dump file. The file layout consists of a header that gives information on the version, signature, number of directories and the directory RVA. This is followed by the set of directories, each of which points to data in the dump data section. The data sections follow this list of directories. The first two data sections are reserved for the system information and the exception stream.
Dump files can be of more than one type. They are categorized by their sizes and can be enumerated as context dumps, system dumps, and complete dumps in order of increasing size. The context dumps range in size from 4 KB to 64 KB, the system dumps range from 64 KB to several MB, and the complete dumps store the entire physical memory along with the 64 KB of context information. The context dumps carry information such as the exception that initiated the crash, the context record of the faulting thread, the module list and thread list (restricted to the faulting ones), the call stack of the faulting thread, 64 bytes of memory above and below the instruction pointer, and as much of the faulting thread's stack memory as fits within the 64 KB limit. The other types of dump include the same essential information but add the complete list of all modules and threads and more memory around the instruction pointers and stacks. When the entire heap is included in the dump file, there is enough debugging information to discern even the values of local variables on the stack. However, that increases the size of the dump considerably.
Dump file bucketing refers to grouping dump files that arose from similar crashes, such as those from a common code defect. The bucketing variables can include the application name, version, and timestamp; the owner application name, version, and timestamp; the module name, version, and timestamp; and the offset into the module. Bucketing helps to determine the priority and severity of the associated code defect.
Dump file structures indicate how to navigate the file for specific information. These are well documented and essentially refer to using RVAs to find information. There are specific structures that represent thread call stack frames.
Note that reading the dump file is a forward-only operation, so streams can be used with dump files to retrieve the stack trace.
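As a minimal sketch under the layout assumptions above, a forward-only read of the header and directory list from a stream could look like this; the type and method names are hypothetical, and the output mirrors the dump check listing below.

using System;
using System.IO;

// Hypothetical forward-only reader that lists the streams in a minidump.
public static class DumpReader
{
    public static void ListStreams(Stream dumpStream)
    {
        using (var reader = new BinaryReader(dumpStream))
        {
            uint signature = reader.ReadUInt32();            // 'MDMP'
            uint version = reader.ReadUInt32();
            uint numberOfStreams = reader.ReadUInt32();
            uint streamDirectoryRva = reader.ReadUInt32();

            // Skip forward to the directory list; no backward seek is needed because
            // the directories follow the header near the start of the file.
            long bytesRead = 16;                             // four uint fields read so far
            long toSkip = streamDirectoryRva - bytesRead;
            if (toSkip > 0) reader.ReadBytes((int)toSkip);

            for (int i = 0; i < numberOfStreams; i++)
            {
                uint streamType = reader.ReadUInt32();
                uint dataSize = reader.ReadUInt32();
                uint rva = reader.ReadUInt32();
                Console.WriteLine("Stream {0}: type {1}, size {2:X8}, RVA {3:X8}", i, streamType, dataSize, rva);
            }
        }
    }
}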

User Mini Dump File: Only registers, stack and portions of memory are available
Symbol search path is: *** Invalid ***
****************************************************************************
* Symbol loading may be unreliable without a symbol search path.           *
* Use .symfix to have the debugger choose a symbol path.                   *
* After setting your symbol path, use .reload to refresh symbol locations. *
****************************************************************************
Executable search path is:
Windows 8 Version 9200 MP (8 procs) Free x64
Product: WinNt, suite: SingleUserTS
Built by: 6.2.9200.16384 (win8_rtm.120725-1247)
Machine Name:
Debug session time: Tue Apr 30 18:37:57.000 2013 (UTC - 7:00)
System Uptime: not available
Process Uptime: 0 days 0:00:45.000
.............................
----- User Mini Dump Analysis
MINIDUMP_HEADER:
Version         A793 (62F0)
NumberOfStreams 10
Flags           1105
                0001 MiniDumpWithDataSegs
                0004 MiniDumpWithHandleData
                0100 MiniDumpWithProcessThreadData
                1000 MiniDumpWithThreadInfo
Streams:
Stream 0: type ThreadListStream (3), size 00000094, RVA 00000410
  3 threads
  RVA 00000414, ID 38, Teb:000007F7BC25E000
  RVA 00000444, ID 3FFC, Teb:000007F7BC25C000
  RVA 00000474, ID 3828, Teb:000007F7BC25A000
Stream 1: type ThreadInfoListStream (17), size 000000CC, RVA 000004A4
  RVA 000004B0, ID 38
  RVA 000004F0, ID 3FFC
  RVA 00000530, ID 3828
Stream 2: type ModuleListStream (4), size 00000C40, RVA 00000570
  29 modules
  RVA 00000574, 000007f7`bd1c0000 - 000007f7`bd2cb000: 'C:\Windows\System32\calc
.exe', 8160
  RVA 000005E0, 000007f8`c31d0000 - 000007f8`c338e000: 'C:\Windows\System32\ntdl
l.dll', 140
  RVA 0000064C, 000007f8`c29d0000 - 000007f8`c2b06000: 'C:\Windows\System32\kern
el32.dll', 140
  RVA 000006B8, 000007f8`c0240000 - 000007f8`c0333000: 'C:\Windows\System32\KERN
ELBASE.dll', 140
  RVA 00000724, 000007f8`c10c0000 - 000007f8`c23a4000: 'C:\Windows\System32\shel
l32.dll', 140
  RVA 00000790, 000007f8`c2530000 - 000007f8`c2580000: 'C:\Windows\System32\shlw
api.dll', 140
  RVA 000007FC, 000007f8`c2c50000 - 000007f8`c2df0000: 'C:\Windows\WinSxS\amd64_
microsoft.windows.gdiplus_6595b64144ccf1df_1.1.9200.16384_none_72771d4ecc1c3a4d\
GdiPlus.dll', 140
  RVA 00000868, 000007f8`c2580000 - 000007f8`c265e000: 'C:\Windows\System32\adva
pi32.dll', 140
  RVA 000008D4, 000007f8`c2b80000 - 000007f8`c2c43000: 'C:\Windows\System32\olea
ut32.dll', 140
  RVA 00000940, 000007f8`be9e0000 - 000007f8`beac6000: 'C:\Windows\System32\uxth
eme.dll', 140
  RVA 000009AC, 000007f8`c2660000 - 000007f8`c27de000: 'C:\Windows\System32\ole3
2.dll', 140
  RVA 00000A18, 000007f8`ba1b0000 - 000007f8`ba419000: 'C:\Windows\WinSxS\amd64_
microsoft.windows.common-controls_6595b64144ccf1df_6.0.9200.16384_none_418c2a697
189c07f\comctl32.dll', 140
  RVA 00000A84, 000007f8`c0e20000 - 000007f8`c0f6c000: 'C:\Windows\System32\user
32.dll', 140
  RVA 00000AF0, 000007f8`c0ce0000 - 000007f8`c0e20000: 'C:\Windows\System32\rpcr
t4.dll', 140
  RVA 00000B5C, 000007f8`ba5b0000 - 000007f8`ba5d0000: 'C:\Windows\System32\winm
m.dll', 140
  RVA 00000BC8, 000007f8`c2890000 - 000007f8`c29d0000: 'C:\Windows\System32\gdi3
2.dll', 140
  RVA 00000C34, 000007f8`c1010000 - 000007f8`c10b5000: 'C:\Windows\System32\msvc
rt.dll', 140
  RVA 00000CA0, 000007f8`c0670000 - 000007f8`c0820000: 'C:\Windows\System32\comb
ase.dll', 140
  RVA 00000D0C, 000007f8`c0f70000 - 000007f8`c0fb8000: 'C:\Windows\System32\sech
ost.dll', 140
  RVA 00000D78, 000007f8`ba040000 - 000007f8`ba072000: 'C:\Windows\System32\WINM
MBASE.dll', 140
  RVA 00000DE4, 000007f8`c0fd0000 - 000007f8`c1009000: 'C:\Windows\System32\imm3
2.dll', 140
  RVA 00000E50, 000007f8`c30b0000 - 000007f8`c31c4000: 'C:\Windows\System32\msct
f.dll', 140
  RVA 00000EBC, 000007f8`ba420000 - 000007f8`ba5aa000: 'C:\Windows\System32\Wind
owsCodecs.dll', 140
  RVA 00000F28, 000007f8`bb410000 - 000007f8`bb431000: 'C:\Windows\System32\dwma
pi.dll', 140
  RVA 00000F94, 000007f8`bffb0000 - 000007f8`bffba000: 'C:\Windows\System32\CRYP
TBASE.dll', 140
  RVA 00001000, 000007f8`bff50000 - 000007f8`bffac000: 'C:\Windows\System32\bcry
ptPrimitives.dll', 1c0
  RVA 0000106C, 000007f8`c05d0000 - 000007f8`c0666000: 'C:\Windows\System32\clbc
atq.dll', 140
  RVA 000010D8, 000007f8`b9b30000 - 000007f8`b9b99000: 'C:\Windows\System32\olea
cc.dll', 140
  RVA 00001144, 000007f8`bf250000 - 000007f8`bf2e6000: 'C:\Windows\System32\SHCo
re.dll', 140
Stream 3: type MemoryListStream (5), size 00000354, RVA 00002D5D
  53 memory ranges
  range#    RVA      Address             Size
       0 000030B1    000007f8`bffb5000   00000000`00000730
       1 000037E1    00000043`da3f0860   00000000`00002000
       2 000057E1    00000043`da3f2bf0   00000000`00000028
       3 00005809    00000043`da3f8c80   00000000`00000008
       4 00005811    00000043`da3f94e0   00000000`00000010
       5 00005821    000007f8`c2572000   00000000`000014a0
       6 00006CC1    00000043`da3fc320   00000000`00000008
       7 00006CC9    00000043`da3fc770   00000000`00000410
       8 000070D9    000007f8`c0e21e3a   00000000`00000100
       9 000071D9    00000043`da4005c0   00000000`00000010
      10 000071E9    00000043`da400620   00000000`00000010
      11 000071F9    000007f8`c2ae1000   00000000`00001920
      12 00008B19    000007f8`c0ff9000   00000000`00001120
      13 00009C39    00000043`da415310   00000000`00000410
      14 0000A049    000007f8`b9b84000   00000000`00002eec
      15 0000CF35    00000043`da446cb0   00000000`00000008
      16 0000CF3D    00000043`da446d70   00000000`00000018
      17 0000CF55    00000043`da446db0   00000000`00000008
      18 0000CF5D    00000043`da44a760   00000000`00000410
      19 0000D36D    000007f8`c1939000   00000000`00000009
      20 0000D376    000007f8`c2975000   00000000`00003d28
      21 0001109E    000007f8`c27aa000   00000000`0000234a
      22 000133E8    000007f8`beaa7000   00000000`00003490
      23 00016878    000007f8`c0ebd000   00000000`00001ac9
      24 00018341    000007f8`c2617000   00000000`000048c6
      25 0001CC07    000007f8`c109e000   00000000`00004bda
      26 000217E1    00000043`da31d7d8   00000000`00002828
      27 00024009    000007f8`c07e4000   00000000`00006e08
      28 0002AE11    000007f8`c3308000   00000000`0000a1d0
      29 00034FE1    000007f8`c2dc4000   00000000`00001c38
      30 00036C19    000007f8`c0654000   00000000`00005790
      31 0003C3A9    000007f8`c316c000   00000000`00001d10
      32 0003E0B9    000007f8`c2dd8000   00000000`00003164
      33 0004121D    000007f8`ba3a6000   00000000`000041c8
      34 000453E5    000007f7`bd259000   00000000`0000517c
      35 0004A561    000007f8`ba588000   00000000`000039d0
      36 0004DF31    000007f8`c2c2c000   00000000`00002204
      37 00050135    00000043`df44f8c8   00000000`00000738
      38 0005086D    000007f8`ba3cc000   00000000`000055b8
      39 00055E25    000007f7`bc254000   00000000`00000388
      40 000561AD    000007f8`bf2d1000   00000000`00001080
      41 0005722D    000007f7`bc25a000   00000000`00006000
      42 0005D22D    000007f8`bf2e0000   00000000`00000009
      43 0005D236    000007f8`c0313000   00000000`00003176
      44 000603AC    000007f8`ba5c4000   00000000`00001694
      45 00061A40    000007f8`c18a4000   00000000`0000e4ac
      46 0006FEEC    000007f8`c0fac000   00000000`00002a08
      47 000728F4    000007f8`bb423000   00000000`00003420
      48 00075D14    000007f8`ba068000   00000000`00002050
      49 00077D64    000007f8`c31d311b   00000000`00000100
      50 00077E64    00000043`deb9f998   00000000`00000668
      51 000784CC    000007f8`c0dfc000   00000000`00001adb
      52 00079FA7    000007f8`bffa4000   00000000`00000ce8
  Total memory: 77bde
Stream 4: type SystemInfoStream (7), size 00000038, RVA 00000098
  ProcessorArchitecture   0009 (PROCESSOR_ARCHITECTURE_AMD64)
  ProcessorLevel          0006
  ProcessorRevision       2A07
  NumberOfProcessors      08
  MajorVersion            00000006
  MinorVersion            00000002
  BuildNumber             000023F0 (9200)
  PlatformId              00000002 (VER_PLATFORM_WIN32_NT)
  CSDVersionRva           000011B0
                            Length: 0
  Product: WinNt, suite: SingleUserTS
Stream 5: type MiscInfoStream (15), size 00000340, RVA 000000D0
Stream 6: type HandleDataStream (12), size 00000EE8, RVA 0007BB39
  95 descriptors, header size is 16, descriptor size is 40
    Handle(0000000000000004,"Directory","\KnownDlls")
    Handle(0000000000000008,"File","")
    Handle(000000000000000C,"File","")
    Handle(0000000000000010,"Key","\REGISTRY\MACHINE\SYSTEM\ControlSet001\Contro
l\SESSION MANAGER")
    Handle(0000000000000014,"ALPC Port","")
    Handle(0000000000000018,"File","")
    Handle(000000000000001C,"Key","\REGISTRY\MACHINE\SYSTEM\ControlSet001\Contro
l\Nls\Sorting\Versions")
    Handle(0000000000000020,"Key","\REGISTRY\MACHINE")
    Handle(0000000000000000,"","")
    Handle(0000000000000028,"Event","")
    Handle(000000000000002C,"Event","")
    Handle(0000000000000030,"Event","")
    Handle(0000000000000034,"Event","")
    Handle(0000000000000038,"Event","")
    Handle(000000000000003C,"Event","")
    Handle(0000000000000000,"","")
    Handle(0000000000000044,"Directory","\Sessions\1\BaseNamedObjects")
    Handle(0000000000000000,"","")
    Handle(000000000000004C,"Event","")
    Handle(0000000000000050,"WindowStation","\Sessions\1\Windows\WindowStations\
WinSta0")
    Handle(0000000000000054,"Desktop","\Default")
    Handle(0000000000000058,"WindowStation","\Sessions\1\Windows\WindowStations\
WinSta0")
    Handle(000000000000005C,"File","")
    Handle(0000000000000000,"","")
    Handle(0000000000000000,"","")
    Handle(0000000000000000,"","")
    Handle(0000000000000000,"","")
    Handle(0000000000000000,"","")
    Handle(0000000000000000,"","")
    Handle(0000000000000000,"","")
    Handle(0000000000000000,"","")
    Handle(0000000000000080,"Semaphore","")
    Handle(0000000000000084,"Semaphore","")
    Handle(0000000000000000,"","")
    Handle(0000000000000000,"","")
    Handle(0000000000000000,"","")
    Handle(0000000000000000,"","")
    Handle(0000000000000000,"","")
    Handle(0000000000000000,"","")
    Handle(0000000000000000,"","")
    Handle(0000000000000000,"","")
    Handle(0000000000000000,"","")
    Handle(0000000000000000,"","")
    Handle(0000000000000000,"","")
    Handle(0000000000000000,"","")
    Handle(0000000000000000,"","")
    Handle(0000000000000000,"","")
    Handle(0000000000000000,"","")
    Handle(00000000000000C4,"Section","")
    Handle(00000000000000C8,"Event","")
    Handle(0000000000000000,"","")
    Handle(0000000000000000,"","")
    Handle(00000000000000D4,"Event","")
    Handle(00000000000000D8,"Key","\REGISTRY\USER\S-1-5-21-2127521184-1604012920
-1887927527-1877126_CLASSES")
    Handle(0000000000000000,"","")
    Handle(00000000000000E0,"ALPC Port","")
    Handle(00000000000000E4,"Key","\REGISTRY\USER\S-1-5-21-2127521184-1604012920
-1887927527-1877126")
    Handle(00000000000000E8,"Section","\Windows\Theme3392824991")
    Handle(00000000000000EC,"Section","\Sessions\1\Windows\Theme2414463033")
    Handle(0000000000000000,"","")
    Handle(0000000000000000,"","")
    Handle(00000000000000F8,"Key","\REGISTRY\MACHINE\SYSTEM\ControlSet001\Contro
l\Nls\Locale")
    Handle(00000000000000FC,"Key","\REGISTRY\MACHINE\SYSTEM\ControlSet001\Contro
l\Nls\Locale\Alternate Sorts")
    Handle(0000000000000100,"Key","\REGISTRY\MACHINE\SYSTEM\ControlSet001\Contro
l\Nls\Language Groups")
    Handle(0000000000000104,"File","")
    Handle(0000000000000108,"Section","")
    Handle(000000000000010C,"Key","\REGISTRY\MACHINE\SYSTEM\ControlSet001\Contro
l\Nls\Sorting\Ids")
    Handle(0000000000000110,"Event","")
    Handle(0000000000000114,"Thread","")
    Handle(0000000000000118,"Event","")
    Handle(000000000000011C,"Mutant","")
    Handle(0000000000000000,"","")
    Handle(0000000000000124,"Event","")
    Handle(0000000000000128,"Event","")
    Handle(000000000000012C,"Event","")
    Handle(0000000000000130,"Event","")
    Handle(0000000000000134,"Event","")
    Handle(0000000000000000,"","")
    Handle(000000000000013C,"Section","\BaseNamedObjects\__ComCatalogCache__")
    Handle(0000000000000140,"File","")
    Handle(0000000000000144,"Key","\REGISTRY\USER\S-1-5-21-2127521184-1604012920
-1887927527-1877126_CLASSES")
    Handle(0000000000000000,"","")
    Handle(000000000000014C,"Event","\KernelObjects\MaximumCommitCondition")
    Handle(0000000000000150,"Key","\REGISTRY\MACHINE\SOFTWARE\Microsoft\WindowsR
untime\CLSID")
    Handle(0000000000000154,"Key","\REGISTRY\MACHINE\SOFTWARE\Classes\Activatabl
eClasses\CLSID")
    Handle(0000000000000158,"Section","\BaseNamedObjects\__ComCatalogCache__")
    Handle(000000000000015C,"Mutant","\Sessions\1\BaseNamedObjects\MSCTF.Asm.Mut
exDefault1")
    Handle(0000000000000160,"Key","\REGISTRY\USER\S-1-5-21-2127521184-1604012920
-1887927527-1877126_CLASSES")
    Handle(0000000000000164,"Event","")
    Handle(0000000000000168,"Event","")
    Handle(000000000000016C,"Thread","")
    Handle(0000000000000170,"Timer","")
    Handle(0000000000000174,"Event","")
    Handle(0000000000000000,"","")
    Handle(0000000000000184,"Section","\Sessions\1\BaseNamedObjects\windows_shel
l_global_counters")
Stream 7: type UnusedStream (0), size 00000000, RVA 00000000
Stream 8: type UnusedStream (0), size 00000000, RVA 00000000
Stream 9: type UnusedStream (0), size 00000000, RVA 00000000

Windows 8 Version 9200 MP (8 procs) Free x64
Product: WinNt, suite: SingleUserTS
Built by: 6.2.9200.16384 (win8_rtm.120725-1247)
Machine Name:
Debug session time: Tue Apr 30 18:37:57.000 2013 (UTC - 7:00)
System Uptime: not available
Process Uptime: 0 days 0:00:45.000
  Kernel time: 0 days 0:00:00.000
  User time: 0 days 0:00:00.000
*** WARNING: Unable to verify timestamp for user32.dll
*** ERROR: Module load completed but symbols could not be loaded for user32.dll
PEB at 000007f7bc254000
Unable to load image C:\Windows\System32\ntdll.dll, Win32 error 0n2
*** WARNING: Unable to verify timestamp for ntdll.dll
*** ERROR: Module load completed but symbols could not be loaded for ntdll.dll
*************************************************************************
***                                                                   ***
***                                                                   ***
***    Either you specified an unqualified symbol, or your debugger   ***
***    doesn't have full symbol information.  Unqualified symbol      ***
***    resolution is turned off by default. Please either specify a   ***
***    fully qualified symbol module!symbolname, or enable resolution ***
***    of unqualified symbols by typing ".symopt- 100". Note that   ***
***    enabling unqualified symbol resolution with network symbol     ***
***    server shares in the symbol path may cause the debugger to     ***
***    appear to hang for long periods of time when an incorrect      ***
***    symbol name is typed or the network symbol server is down.     ***
***                                                                   ***
***    For some commands to work properly, your symbol path           ***
***    must point to .pdb files that have full type information.      ***
***                                                                   ***
***    Certain .pdb files (such as the public OS symbols) do not      ***
***    contain the required information.  Contact the group that      ***
***    provided you with these symbols if you need this command to    ***
***    work.                                                          ***
***                                                                   ***
***    Type referenced: ntdll!_PEB                                    ***
***                                                                   ***
*************************************************************************
error 3 InitTypeRead( nt!_PEB at 000007f7bc254000)...
Finished dump check

Wednesday, May 15, 2013

Security application
In our previous posts we talked about a security administration application that enables domain object based security. We discussed several scenarios, features, and approaches, and in general discussed a UI application that would enable configuration of user and object security. Today we try to improve upon the notion of user role management and its place in this security application. Typically, many web applications leave user management to administrators and to tools outside the application, such as the operating system applets. By integrating user management with that of the system, a lot more features and tools become available for it. Then there are applications like SiteMinder for single sign-on. And there are interoperability tools that let you configure users across platforms. Even that is being pushed to the system level, such as with Active Directory integration, freeing up the application to do more for its business users.
Therefore, unless there is a business need for it, applications don't support these kinds of operations. There might be other reasons to require security, such as when web applications have different membership providers that keep user information in different stores, for example ASP.NET stores, SQL stores, and local file system based stores, which require a common interface for management. Moreover, there may be mobile users who require access that needs to be secured. In such cases, the mobile applications may not be hitting the web application UI but the API interfaces, and those methods may also need to be secured for different users and applications.
Overall, there are reasons for mapping users to objects and methods.
Most times this mapping is dynamic, like a decision tree or a classifier that dynamically groups users and maps them to resources. This can be a policy server where the different policies or classification rules can be registered and maintained. The policies define which groups are associated with which pool of resources. The code to associate users with groups can be a scalar user-defined function that takes incoming users and groups them. These groups have no meaning inside the system other than a scalar value. The resources are what the application knows; they can be classified into organizational units called pools. The users are temporary and can change often. We keep track of the more stable groups and associate users with groups. The groups can have certain privilege levels and are different from roles in that the roles are a subset of the groups, but the groups are what pools of resources are assigned to. By having a dynamic classification mechanism, the users can be switched to one or more groups.
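As a sketch of the scalar grouping function and the group-to-pool mapping described above; the rules, group ids, and pool names are hypothetical placeholders for policies that would be registered with a policy server.

using System.Collections.Generic;

// Hypothetical classifier that reduces an incoming user to a scalar group id.
public static class UserClassifier
{
    public static int AssignGroup(string department, bool isMobileClient)
    {
        if (isMobileClient) return 3;               // mobile users are secured at the API interfaces
        if (department == "Finance") return 2;      // higher privilege group
        return 1;                                   // default group
    }

    // Groups, not individual users, are mapped to pools of resources.
    public static readonly IDictionary<int, string> GroupToResourcePool =
        new Dictionary<int, string>
        {
            { 1, "GeneralPool" },
            { 2, "RestrictedPool" },
            { 3, "ApiPool" }
        };
}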
Policy servers and access control for a user are a complex topic involving many different organizational units. Take IPsec for network access control, for example: there are many parameters for controlling IP security.

Reminder on GC

The reason the Dispose() method has a Boolean parameter is to differentiate between being called by the finalizer and being called by ourselves.
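A minimal sketch of the standard dispose pattern this refers to:

using System;

public class ResourceHolder : IDisposable
{
    private bool disposed;

    public void Dispose()
    {
        Dispose(true);                  // called by ourselves
        GC.SuppressFinalize(this);      // the finalizer no longer needs to run
    }

    ~ResourceHolder()
    {
        Dispose(false);                 // called by the finalizer
    }

    protected virtual void Dispose(bool disposing)
    {
        if (disposed) return;
        if (disposing)
        {
            // Safe to release managed resources only on this path.
        }
        // Release unmanaged resources on both paths.
        disposed = true;
    }
}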

Tuesday, May 14, 2013

Here we discuss an implementation, following from previous posts, for finding topics based on a set of keywords. Let us say we have a function similar() that returns a set of words that co-occur with a given word in the language corpora. Let us say we have selected a set of keyword candidates in a set W.
For each of the words, we find the similar co-occurring words and put them in a cluster. The clusters have a root keyword and all the similar words as leaves. When two clusters share common words, the clusters are merged, so the clusters can be additive. The root word of the combined cluster is the combination of the root words of the individual clusters. Similarly, the leaves of the combined cluster are a combination of the leaves of the individual clusters. We may have to iterate several times until we find that there are no cluster pairs that share similar words.
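A rough C# sketch of this iterative merge, assuming similar() is supplied as a delegate; the cluster representation here is hypothetical.

using System;
using System.Collections.Generic;
using System.Linq;

public class KeywordClusterer
{
    public class Cluster
    {
        public HashSet<string> Roots = new HashSet<string>();
        public HashSet<string> Leaves = new HashSet<string>();
    }

    public static List<Cluster> BuildClusters(
        IEnumerable<string> keywords, Func<string, IEnumerable<string>> similar)
    {
        // One cluster per keyword: the keyword is the root, its co-occurring words the leaves.
        var clusters = keywords.Select(k => new Cluster
        {
            Roots = new HashSet<string> { k },
            Leaves = new HashSet<string>(similar(k))
        }).ToList();

        // Merge any pair of clusters that share a word, and repeat until no such pair remains.
        bool merged = true;
        while (merged)
        {
            merged = false;
            for (int i = 0; i < clusters.Count && !merged; i++)
            {
                for (int j = i + 1; j < clusters.Count && !merged; j++)
                {
                    if (clusters[i].Leaves.Overlaps(clusters[j].Leaves))
                    {
                        clusters[i].Roots.UnionWith(clusters[j].Roots);
                        clusters[i].Leaves.UnionWith(clusters[j].Leaves);
                        clusters.RemoveAt(j);
                        merged = true;
                    }
                }
            }
        }
        return clusters;
    }
}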

Application Settings architecture

This is a review of the application settings architecture from MSDN. A setting specified in a custom settings file and embedded as a resource in the assembly is resolved when called from a console application but not from a test project. Hence this review is for a quick recap of the underlying mechanism.
Settings are strongly typed with either application scope or user scope. The default store for the settings is the local file based system. There is support for adding custom stores by way of SettingsProvider attribute.
SettingsBase provides access to settings through a collection. ApplicationSettingsBase adds higher-level loading and saving operations, support for user-scoped settings, reverting a user's settings to the predefined defaults, upgrading settings from a previous application version, and validation.
Settings use the Windows Forms data binding architecture to provide two-way communication of settings updates between the settings object and components. Embedded resources are pulled up with reflection.
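For reference, the strongly typed wrapper pattern from MSDN looks roughly like this; the setting names and defaults are hypothetical.

using System.Configuration;

// Strongly typed settings wrapper deriving from ApplicationSettingsBase.
public class FormSettings : ApplicationSettingsBase
{
    [UserScopedSetting]
    [DefaultSettingValue("true")]
    public bool ShowToolbar
    {
        get { return (bool)this["ShowToolbar"]; }
        set { this["ShowToolbar"] = value; }
    }

    [ApplicationScopedSetting]
    [DefaultSettingValue("http://localhost/service")]
    public string ServiceUrl
    {
        get { return (string)this["ServiceUrl"]; }
    }
}

User-scoped values can be changed at runtime and persisted with Save(); application-scoped values are read-only to the application.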

Monday, May 13, 2013

To get stack frames from streams instead of dump files

Dump files can be arbitrarily large and they are generally stored in compressed format along with other satellite files. File operations, including extraction and copying over a remote network, can be expensive. If we are interested only in a stack trace, we are probably not interested in these operations. Besides, we rely on the debuggers to give us the stack trace. The debuggers can attach to a process, launch an executable, and open the three different kinds of dump files to give us the stack trace, but they don't work with compressed files or sections of them. While the debuggers have to support a lot of commands from the user, retrieving a specific stack trace requires access only to specific ranges of offsets in the crash dump file. Besides, the stack trace comes from a single thread, so unless all the thread stacks have to be analyzed, only a small part of the file is needed. Here we look at how to retrieve a specific stack trace using a stream instead of files.
Note that getting a stack trace as described here does not require symbols. The symbols help to make the frames user friendly, and that can be done separately from getting the stack trace. Program debug database files and raw stack frames are sufficient to pretty print a stack.
The dump files we are talking about are Microsoft proprietary but the format is helpful for debugging. Retrieving a physical address in a memory dump is easy. The TEB information has the top and bottom of the stack, and a memory dump of that range can give us the stack.
Using streams is an improvement over using files for retrieving this information.
Streams can be written to a local file, so we don't lose any feature we currently have.
Streams allow you to work with specific ranges of offsets, so you don't need the whole file.
With a stream, only the offsets needed for the stack trace have to be fetched.
The debugger SDK available with the debugging tools has both managed and unmanaged APIs to get a stack trace. These APIs instantiate a debugging client which can give a stack trace. However, there is no API supporting a stream yet. This is probably because most debuggers prefer to work on local files, since the round trips for an entire debugging session over a low bandwidth, high latency network are just not preferable. However, for specific operations such as getting a stack trace, this is not a bad idea. In fact, what stream support to GetStackTrace buys us is the ability to save a few more round trips for extraction, save on local storage as well as archive locations, and reduce the file and database footprint.
Both 32-bit and 64-bit dumps require similar operations to retrieve the stack trace. There is additional information in the 64-bit dump files that helps with parsing.
The stack trace, once retrieved, can be made user friendly by looking up the symbols. These symbols are parsed from the program debug database. Modules and offsets are matched with the text and then the stack frames can be printed better. The information need not be retrieved from these files by hand; it can be retrieved with Debug Interface Access, and there's an SDK available on MSDN for the same.
Lastly, with a streamlined operation for retrieving the stack trace that is read-only, involves no file copy, and maintains no data or metadata locally, the stack trace parsing and reporting can be an entirely in-memory operation.

Assembly Settings

In writing applications and libraries using C#, we may have frequently encountered the need to define configuration data as settings. We define this with a settings file kept under the Properties folder of the assembly source, which consequently has the Properties namespace. As different libraries are loaded into the application, each assembly may define its own settings that can be used as is or overridden by the calling application. The settings are compiled into the assembly's resources, which one can view from the assembly. When more than one assembly is referenced in the current application, these settings are resolved by first looking in the local settings file and then in any other settings providers, which derive from the abstract SettingsProvider class. The provider that a wrapper class uses is determined by decorating the wrapper class with the SettingsProviderAttribute.
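As a hedged sketch of that mechanism: a custom provider derives from the abstract SettingsProvider, and the wrapper class opts into it with SettingsProviderAttribute. The class names and the defaults-only behavior below are hypothetical.

using System.Collections.Specialized;
using System.Configuration;

// Minimal custom provider; a real one would read from and persist to its own store.
public class CustomSettingsProvider : SettingsProvider
{
    public override string ApplicationName { get; set; }

    public override void Initialize(string name, NameValueCollection config)
    {
        base.Initialize(name ?? "CustomSettingsProvider", config);
    }

    public override SettingsPropertyValueCollection GetPropertyValues(
        SettingsContext context, SettingsPropertyCollection properties)
    {
        var values = new SettingsPropertyValueCollection();
        foreach (SettingsProperty property in properties)
        {
            // Fall back to the defaults declared on the wrapper class.
            values.Add(new SettingsPropertyValue(property) { SerializedValue = property.DefaultValue });
        }
        return values;
    }

    public override void SetPropertyValues(SettingsContext context, SettingsPropertyValueCollection values)
    {
        // Persist the values to the custom store here.
    }
}

// Wrapper class that uses the custom provider instead of the default local file store.
[SettingsProvider(typeof(CustomSettingsProvider))]
public sealed class LibrarySettings : ApplicationSettingsBase
{
    [ApplicationScopedSetting]
    [DefaultSettingValue("10")]
    public int RetryCount
    {
        get { return (int)this["RetryCount"]; }
    }
}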

Sunday, May 12, 2013

Compiler design review

Programs that are written in a high level programming language by programmers need to be translated to a language that machines can understand. A compiler translates this high level programming language into the low level machine language that is required by the computers.
This translation involves the following:
1) Lexical analysis: This is the part where the compiler divides the text of the program into tokens, each of which corresponds to a symbol such as a variable name, keyword, or number.
2) Syntax analysis: This is the part where the tokens generated in the previous step are 'parsed' and arranged in a tree structure (called the syntax tree) that reflects the structure of the program.
3) Type checking: This is the part where the syntax tree is analyzed to determine if the program violates certain consistency requirements, for example if a variable is used in a context where its type doesn't permit it.
4) Intermediate code generation: This is the part where the program is translated to a simple machine-independent intermediate language.
5) Register allocation: This is the part where the symbolic variable names are translated to numbers, each of which corresponds to a register in the target machine code.
6) Machine code generation: This is the part where the intermediate language is translated to assembly language for a specific architecture.
7) Assembly and linking: This is the part where the assembly language code is translated to a binary representation and the addresses of variables, functions, etc. are determined.
The first three parts are called the frontend and the last three parts form the backend.
There are checks and transformations at each step of the processing, in the order listed above, such that each step passes stronger invariants to the next. The type checker, for instance, can assume the absence of syntax errors.
Lexical analysis is done with regular expressions and precedence rules. Precedence rules are similar to algebraic convention. Regular expressions are transformed into efficient programs using non-deterministic finite automata, which consist of a set of states, including a starting state and a subset of accepting states, and transitions from one state to another on a symbol c. Because they are non-deterministic, compilers use a more restrictive form called a deterministic finite automaton (DFA). This conversion from a language description written as a regular expression into an efficiently executable representation, a DFA, is done by the lexer generator.
Syntax analysis recombines the tokens that the lexical analysis split the input into. This results in a syntax tree which has the tokens as leaves, and their left-to-right sequence is the same as in the input text. As in lexical analysis, we rely on building automata, and in this case the context-free grammars we find can be converted to recursive programs called stack automata. There are two ways to generate such automata: the LL parser (the first L indicates the reading direction and the second L the derivation order) and the SLR parser (S stands for simple).
Symbol tables are used to track the scope and binding of all named objects. A symbol table supports operations such as initializing an empty symbol table, binding a name to an object, looking up a name in the symbol table, entering a new scope, and exiting a scope.
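A small C# sketch of such a scoped symbol table, with hypothetical names, might look like this.

using System.Collections.Generic;

// Scoped symbol table: a stack of dictionaries, one per scope.
public class SymbolTable<T>
{
    private readonly Stack<Dictionary<string, T>> scopes = new Stack<Dictionary<string, T>>();

    public SymbolTable() { EnterScope(); }              // initialize an empty symbol table

    public void EnterScope() { scopes.Push(new Dictionary<string, T>()); }

    public void ExitScope() { scopes.Pop(); }

    public void Bind(string name, T obj) { scopes.Peek()[name] = obj; }

    // Lookup searches the innermost scope first, then the enclosing scopes.
    public bool TryLookup(string name, out T obj)
    {
        foreach (var scope in scopes)
        {
            if (scope.TryGetValue(name, out obj)) return true;
        }
        obj = default(T);
        return false;
    }
}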
Bootstrapping a compiler is interesting because the compiler itself is a program. We resolve this with a quick and dirty compiler or intermediate compilers.
from the textbook on compiler design by Mogensen