Thursday, May 9, 2013

signature file

A signature file is a data structure for text-based database systems that supports efficient evaluation of boolean queries. It contains one index record for each document in the database. This index record is the document's signature, and it has a fixed size of b bits, where b is called the signature width. Which bits are set depends on the words that appear in the document: words are mapped to bits by a hash function, and the bits produced by the hash function are set. The same bit may be set by two different words when the hash function maps both words to it. A document signature matches a query signature if every bit set in the query signature is also set in the document signature.

For a query consisting of a conjunction of terms, we first generate the query signature by applying the hash function to each word in the query. Then we scan the signature file to find all documents whose signatures match the query signature, because every such document is a potential result. Note that a matching document may not actually contain the query words, in which case it is called a false positive. False positives can be expensive, since we have to retrieve the document, parse it, stem it, and then check whether it really contains the query terms. For queries with a disjunction of terms, we generate a list of query signatures and accept a document if its signature matches any of them.

Also, for each query we have to scan the complete signature file, and there are as many records in the signature file as there are documents. To reduce the amount of data retrieved per query, we can vertically partition the signature file into bit slices; such an index is called a bit-sliced signature file. The length of each bit slice is still equal to the number of documents in the database, but for a query signature with q bits set, we only need to retrieve q bit slices. - as read from a textbook on Database Management Systems
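The matching rule above can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's implementation; the signature width, the use of md5 as the hash function, and the sample documents are all assumptions made for the example.

```python
import hashlib

def signature(words, b=64):
    # Map each word to a bit position in [0, b) via a hash function and set it.
    sig = 0
    for w in words:
        bit = int(hashlib.md5(w.encode()).hexdigest(), 16) % b
        sig |= 1 << bit
    return sig

def matches(doc_sig, query_sig):
    # A document signature matches if every bit set in the query is also set in it.
    return doc_sig & query_sig == query_sig

# Hypothetical toy "database" of tokenized documents.
docs = [["database", "index", "boolean", "query"], ["cat", "dog", "fish"]]
sigs = [signature(d) for d in docs]

# Conjunctive query: both terms must (appear to) be present.
q = signature(["database", "query"])
candidates = [i for i, s in enumerate(sigs) if matches(s, q)]
```

Every document that truly contains all the query terms is guaranteed to appear in `candidates`; any extra entries are the false positives discussed above and must be verified by retrieving the document.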

continuing on the previous post

One more concept in the visual rendering of objects is that the objects should redraw themselves on any user input, such as dragging, adding, or deleting them.

Wednesday, May 8, 2013

If you have used tools such as Code Map, you may have liked the display of items as objects that float to new locations as new ones are added and old ones are removed. While the default layout is such that each object is evenly spaced and clustered around the center of the canvas, you can drag any object to a new position, and the connectors that join the moved object to the others are automatically redrawn, avoiding any objects in their path. This post talks about how to implement the logic that makes these objects float into an appealing visual display.

First, each item finds its starting position as it is added to the canvas. This is typically the center of the canvas for the first item. The next item is placed such that the centers of both objects are equidistant from the center. As more and more items are added, the objects occupy the vertices of the corresponding regular polygons. This holds when the objects are all equal in size and appear uniformly spread around the canvas. But sometimes objects are arranged as a hierarchy with multiple levels, each level appearing as a line of items. In that case, the objects occupy positions on the line such that their centers are equidistant. If there were a graph of objects, the node with the maximum number of edges could occupy the center of the canvas, and the other nodes could be arranged around it in increasing circles such that the ones with more edges are closer. No matter how we choose to position the centers, each object can be bounded by a square or rectangle so that some space is left around it when it is rendered. Now we can talk about the connectors, which join the midpoints of the edges of these bounding squares or rectangles.
These connectors are drawn as straight lines when there are no obstructions and as curves when they have to avoid other objects in their path. The curves are drawn with no particular preference for the left or right of the obstruction and are chosen so that the distance covered is minimum, which is easy to find given the x and y coordinates of the start and end of each connector. When an object is moved, all the connectors to that object are redrawn based on its new position; this redrawing does not affect the other objects. The curvature of a connector depends on the size of the obstruction. Both the connectors and the objects use their centers to find positions on the canvas. Since the strategies for spacing out the centers are interchangeable, the implementation could consider using the Strategy design pattern.
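The default placement described above, with the first item at the center and later items at the vertices of a regular polygon around it, can be sketched as follows. This is a hypothetical helper, with the canvas center and radius as assumed parameters.

```python
import math

def layout_positions(n, center=(0.0, 0.0), radius=100.0):
    # Place n objects so that their centers are equidistant from the canvas
    # center, i.e. at the vertices of a regular n-gon.
    cx, cy = center
    if n == 1:
        return [(cx, cy)]  # the first item sits at the center of the canvas
    return [(cx + radius * math.cos(2 * math.pi * i / n),
             cy + radius * math.sin(2 * math.pi * i / n))
            for i in range(n)]
```

Swapping this function for one that spaces centers along a line (for hierarchies) or in concentric circles (for graphs) is what makes the spacing strategies interchangeable.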

Enterprise logging application block

Event logging is required by web applications and servers for diagnostics. Applications choose from a variety of logging libraries such as ELMAH, the Enterprise Library Logging Application Block, etc. ELMAH plugs into the ASP.NET pipeline as an HTTP module and is great for listening in on HTTP requests and responses; however, the Enterprise Library provides a much more robust and rich framework for logging. Applications can write events to the following:
event log
e-mail message
database
message queue
text file
WMI event
and custom locations
The application block provides a consistent interface for logging information to a destination. Application configuration settings determine what the destination is, so one destination can be swapped for another without any modification of the application code.
The same application block can be used within a single application and across the enterprise. A LogEntry is written with context information and is helpful for tracing. Priority, Severity, and Categories, when specified in the LogEntry, can be very helpful for differentiating events and analyzing them. Events sent to the event viewer can be analyzed with tools such as LogParser.
The Logging Application Block can be extended in the following manner:
Create a new custom class and add it to the project
The class should implement the required interfaces
Create the custom object in the Enterprise Library Configuration Console
Specify the custom class as the type name
Specify any custom configuration properties by modifying the attributes of the object

Tuesday, May 7, 2013

Building a file watcher service

Why is a file watcher service a bad idea?
Many applications require the use of a file watcher. Files can be dropped, picked up almost instantaneously for processing, and queued for completion. There are several advantages to this method. First, the files are visible in the explorer, so you don't need any tools to see what the requested item of work was. Second, the files can be arbitrarily large, and they can hold a variety of data types, both structured and semi-structured. Third, the file processing is asynchronous, and there are no dependencies or blocking between the producer and the consumer. Fourth, its simplicity and direct reliance on basic, everyday file operations makes it very popular. Fifth, work that requires delayed or heavy background processing can use a copy of the original file without any contention or dependency on anyone else. Lastly, the system can scale because the tasks are partitioned on data.
Then what could go wrong? First, file locks are notorious for the dreaded "this file cannot be moved because another program is using it" error message. Second, the software that works on different file types may come with its own limitations, such as maximum file size, file conversion or translation, and file handling quirks. Third, file handling relies on native operating system methods and is vulnerable to different kinds of exceptions and errors. In fact, the scheduler or task handling the file operation may have to deal with difficult error handling and exceptions that require retries and user intervention.
So what could be a good replacement? One alternative is to use a database in place of the file store and let the database handle the binary or blob storage as columns or FILESTREAM. This comes with all the benefits of keeping the data together and portable. Another approach is to use a message queue such as MSMQ, which has robust features for error handling and recovery such as retries and dispatch. A third approach is to use services such as WCF that translate requests into messages and let the transport handle reliability and robustness. In fact, such services can scale well in an SOA model.
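The queue-based alternative can be illustrated with a small producer/consumer sketch. This uses Python's in-process queue.Queue as a stand-in for a durable message queue such as MSMQ, so it only shows the shape of the design: work items are handed off through a queue instead of being dropped as files, which sidesteps file locks entirely, and a failed item could simply be re-enqueued for retry. The work-item names are hypothetical.

```python
import queue
import threading

work = queue.Queue()
results = []

def consumer():
    # Drain the queue; a None item is the shutdown sentinel.
    while True:
        item = work.get()
        if item is None:
            break
        results.append(item.upper())  # stand-in for the real processing step

t = threading.Thread(target=consumer)
t.start()
for name in ["report.csv", "image.png"]:  # hypothetical work items
    work.put(name)
work.put(None)
t.join()
```

There is no contention between producer and consumer, and nothing here can fail with a "file in use" error.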

Monday, May 6, 2013

security application design continued

I would like to add the following to the previous posts on the security application:
1) object access control list
2) object lifetime management
3) object permissions view
I have looked at the security application block. It helps to authenticate and authorize users, retrieve roles, and cache user profile information. It solves a lot of application security concerns and is extensible with additional security providers.
However, domain objects are typically not required to have security access control of their own; access is controlled via business logic that is usually pushed down to the database server as stored procedures. Stored procedures are helpful in enterprise cases where their prepared plans improve performance. Besides, object persistence requires a data store, and a data store comes with its own security.
So it may seem that no custom security module is required outside the data store. However, with business logic sometimes spread across the backend, the middle tier, and the front end, there is no single layer in which security can be consolidated. Consequently, validations may be spread out.
Moreover, some checks are done up front, where data is either hidden from the user's view or rendered read-only. Often the control states are based on what the view models allow, and the view models pull their data from the models and, in turn, the data store. Since the check happens as fast and as early as possible, the objects are expected to carry their security information with them at the time the view models are initialized. The objects are instantiated and disposed for the duration of the view model only, and this is typically so short-lived that there is no need for object-based security: security is already declared and available from the data store.
However, let us consider a case where we could do things a little differently. We want a security admin to selectively mark certain data as read-only for a downtime window, so that users of the database cannot modify that data during the downtime even though they have access. The security admin is not interested in making permanent changes. Further, the scope is not at the schema level but at the level of domain objects, often referred to by their names or IDs.
Let us look at how the security admin would selectively disable some objects for all users with existing tools. First, they could apply labels to selected records across the schema to disable them, and later revoke the same changes. These changes could be executed with a stored procedure, giving all the benefits of security control and audit. The changes are also in one place and very easily managed across clients and applications, since they are as close to the data as possible.
However, let's look at services and applications that use more than one data store and integrate across a variety of data providers. These services or applications could keep their own databases into which they read data from the downstream data providers, and that way we could revert to the previous method of applying security labels to a single data store.
That said, let's consider the case where we implement a truly middle-tier, SOA-service-based security layer where objects are turned on or off without necessarily reaching the database. Further, let's say we don't want a hard on/off switch that prevents users from reading or writing, but merely want to tag objects with labels so that we can decide on appropriate action for these objects on a graded scale.
So we are really looking for an object tagger whose tags we can visualize to study such things as usage and access patterns. Then how do we build an object tagger that is non-invasive to the object?
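One possible answer is to keep the tags in an external registry keyed by object identity, so the domain objects themselves need no new fields, interfaces, or base classes. The sketch below is one assumed design, not a finished module; the ObjectTagger class, the label text, and the Account domain object are all hypothetical.

```python
import weakref

class ObjectTagger:
    # Tags are stored outside the objects, keyed by identity; using weak
    # references means a tag disappears when its object is garbage-collected,
    # so the tagger never extends an object's lifetime.
    def __init__(self):
        self._tags = weakref.WeakKeyDictionary()

    def tag(self, obj, label):
        self._tags.setdefault(obj, set()).add(label)

    def tags(self, obj):
        return self._tags.get(obj, set())

    def is_tagged(self, obj, label):
        return label in self.tags(obj)

class Account:
    # A hypothetical domain object; note it knows nothing about tagging.
    pass

tagger = ObjectTagger()
acct = Account()
tagger.tag(acct, "readonly-during-downtime")
```

The application, or the SOA service in front of it, can then consult the tagger before allowing writes and respond on a graded scale rather than with a hard on/off.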

Sunday, May 5, 2013

nltk

Let's quickly review the documentation for nltk.text
1) ContextIndex : a bidirectional index between words and their 'contexts' in a text
methods:
word_similarity_dict : returns a dictionary mapping between words and their 'similarity_scores'
similar_words : returns words from the context
common_contexts: finds contexts where all the words can appear
2) ConcordanceIndex : an index that can tell where in the text the words occur
methods:
print_concordance : prints a concordance for the word
3) TokenSearcher : uses regular expressions to search over tokenized strings
methods:
find_all: finds instances of the regular expressions in the text
4) Text : a wrapper around a sequence of simple string tokens, initialized from a list of tokens
methods:
concordance : prints a concordance for word with the specified context window
collocations : prints collocations derived from text, ignoring stopwords
count: the number of times a given word appears
similar: this gives other words that appear in the same contexts as the specified word
dispersion_plot: shows the distribution of words throughout the text
5) TextCollection : initializes a collection of texts
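The idea behind the concordance index above can be illustrated without nltk itself. The class below is a toy analog written for this post, not nltk's ConcordanceIndex: it records the offsets of every token and then returns a small window of context around each occurrence of a word.

```python
from collections import defaultdict

class ToyConcordanceIndex:
    def __init__(self, tokens):
        self.tokens = tokens
        self.offsets = defaultdict(list)  # word -> list of positions
        for i, tok in enumerate(tokens):
            self.offsets[tok.lower()].append(i)

    def concordance(self, word, window=2):
        # Return one context line per occurrence of the word.
        lines = []
        for i in self.offsets[word.lower()]:
            left = " ".join(self.tokens[max(0, i - window):i])
            right = " ".join(self.tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{self.tokens[i]}] {right}".strip())
        return lines

idx = ToyConcordanceIndex("the cat sat on the mat".split())
```

The same offsets table is also what makes methods like count and dispersion_plot cheap, since every occurrence of a word is already recorded.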