Tuesday, May 7, 2013

Building a file watcher service

Why is a file watcher service a bad idea?
Many applications require a file watcher. Files are dropped and picked up almost instantaneously for processing, then queued for completion. There are several advantages to this method. First, files are visible in the explorer, so you don't need any tools to know what the requested item of work was. Second, the files can be arbitrarily large and can hold a variety of data types, both structured and semi-structured. Third, the file processing is asynchronous and there are no dependencies or blocking between the producer and the consumer. Fourth, its simplicity and direct reliance on basic, everyday file operations makes it very popular. Fifth, the bulk of the processing that requires delayed or heavy background work can operate on a copy of the original file without any contention or dependency on anyone else. Lastly, the system can scale because the tasks are partitioned on data.
Then what could go wrong? First, file locks are notorious for the dreaded "this file cannot be moved because another program is using it" error message. Second, the software that works on different file types may come with its own limitations such as maximum file size, file conversion or translation, and file handling. Third, file handling relies on native operating system methods and is vulnerable to different kinds of exceptions and errors. In fact, the scheduler or task handling the file operation may have to deal with difficult error handling and exceptions that require retries and user intervention.
So what could be a good replacement? One alternative is to use a database in place of the file store and let the database handle the binary or blob storage as columns or FILESTREAM. This comes with all the benefits of keeping the data together and portable. Another approach is to use a message queue such as MSMQ, which has robust features for error handling and recovery such as retries and dispatch. A third approach is to use services such as WCF that translate requests to messages and allow the transport to handle reliability and robustness. In fact, such services can scale well in a SOA model.
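To make the retry handling above concrete, here is a minimal polling-based watcher sketch in Python. The folder names, retry count, and backoff timings are illustrative assumptions, not a production design.

import os, shutil, time

INBOX, WORK = 'inbox', 'work'   # hypothetical drop and processing folders

def poll_once(retries=3):
    for name in os.listdir(INBOX):
        src = os.path.join(INBOX, name)
        for attempt in range(retries):
            try:
                # the move fails while a producer still holds the file open
                shutil.move(src, os.path.join(WORK, name))
                break
            except OSError:
                time.sleep(2 ** attempt)  # back off, then retry the lock

if __name__ == '__main__':
    while True:
        poll_once()
        time.sleep(5)   # poll interval; a real service would also log errors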

Monday, May 6, 2013

security application design continued

I would like to add the following to the previous posts on the security application:
1) object access control list
2) object lifetime management
3) object permissions view
I have looked at the security application block. That is helpful to authenticate and authorize users, retrieve roles, and cache the user profile. It solves a lot of application security and is extensible to add security providers.
However, domain objects either are not required to have security access control or are controlled via business logic that is typically pushed down to the database server as stored procedures. Stored procedures help in enterprise cases where the prepared plans can be reused. Besides, object persistence requires a data store, and the data store comes with its own security.
So it may seem that no custom security modules are required outside the data store. However, with business logic sometimes spread across the backend, middle tier, and front end, there is no one layer in which the security can be consolidated. Consequently, validations may be spread out.
Moreover, some checks are done upfront, where data is either hidden or rendered read-only in the user's view. Often the control states are based on what the view models allow, and they pull their data from the models and in turn the data store. Since the check happens as fast and as early as possible, the objects are expected to carry the security information with them at the time the view models are initialized. The objects are instantiated and disposed for the duration of the view model only, and this is typically so short lived that there is no need for object-based security. Security is already declared and available from the data store.
However, let us consider a case where we could do things a little differently. We want a security admin to be able to selectively mark certain data as read-only for a certain downtime, so that users of the database cannot modify the data during this window even though they have access. The security admin is not interested in making permanent changes. Further, the scoping is not at the schema level but at the level of domain objects, often referred to by their names or ids.
Let us look at how the security admin would selectively disable some objects for all users with existing tools. First, they could apply different labels to selected records across the schema to disable them, and later revoke the same changes. These changes could be executed with a stored procedure, giving all the benefits of security control and audit. The changes are also in one place and very easily managed across clients and applications, since they are as close to the data as possible.
However, let's look at services and applications that use more than one data store and integrate across a variety of data providers. These services or applications could keep their own databases into which they read data from the downstream data providers, and that way we could revert to the previous method where we apply security labels to a single data store.
That said, let's consider the case where we implement a truly middle-tier, SOA-service-based security model where objects are turned on or off without necessarily reaching the database. Further, let's say we don't want a hard on or off that prevents users from reading or writing the objects, but merely want to tag them with labels so that we can decide to take appropriate action on these objects on a graded scale.
So we are really looking for an object tagger that we can visualize to study such things as usages and access patterns. Then how do we build an object tagger that can be non-invasive to the object?
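One non-invasive possibility, sketched below as an external registry keyed on object identity, so that tagged objects need no changes to their own classes; the class and method names here are hypothetical.

import weakref

class ObjectTagger:
    def __init__(self):
        # weak keys let tagged objects be garbage collected normally,
        # so tagging does not interfere with object lifetime management
        self._tags = weakref.WeakKeyDictionary()

    def tag(self, obj, label):
        self._tags.setdefault(obj, set()).add(label)

    def tags(self, obj):
        return self._tags.get(obj, set())

A tagger like this can be queried by the middle tier on each access to apply a graded policy, without the database or the domain objects themselves knowing about the labels.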

Sunday, May 5, 2013

nltk

Let's quickly review the documentation for nltk.text
1) ContextIndex: a bidirectional index between words and their 'contexts' in a text
methods:
word_similarity_dict : returns a dictionary mapping words to their 'similarity scores'
similar_words : returns words that appear in the same contexts as the given word
common_contexts: finds contexts where all the words can appear
2) ConcordanceIndex: an index that can tell where the words occur
methods:
print_concordance : prints a concordance for the word
3) TokenSearcher : uses regular expressions to search over tokenized strings
methods:
findall: finds instances of the regular expression in the text
4) Text: a wrapper around a sequence of simple string tokens, initialized from a simple list of word tokens
methods:
concordance : prints a concordance for word with the specified context window
collocations : prints collocations derived from text, ignoring stopwords
count: the number of times a given word appears
similar: this gives other words that appear in the same contexts as the specified word
dispersion_plot: shows the distribution of words throughout the text
5) TextCollection : initializes a collection of texts and supports corpus-level statistics such as tf-idf
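A quick usage sketch of the Text methods above, assuming nltk is installed and the Gutenberg corpus has been downloaded (nltk.download('gutenberg')):

import nltk
from nltk.text import Text

tokens = nltk.corpus.gutenberg.words('austen-emma.txt')
text = Text(tokens)

text.concordance('surprise')          # occurrences with a context window
text.similar('very')                  # words appearing in similar contexts
text.common_contexts(['very', 'so'])  # contexts shared by both words
print(text.count('Emma'))             # raw frequency of a token
text.collocations()                   # frequent word pairs, minus stopwords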

calculating distance measure

Similarity distance measures between terms require that probabilities and conditional probabilities for the terms be computed. We rely on the corpus text to compute these. In addition, we use a naive Bayes classifier to determine the probability of term occurrences. Some of these probabilities were mentioned in an earlier post, but today we take a look at whether we need to calculate them on the fly as we cluster the terms. Probabilities associated with the corpus text can be calculated in advance of processing a given text. For example, the probability of selecting an occurrence of a term from a source document, given by the number of occurrences of the term in that document divided by the total number of occurrences in the corpus, is something we can calculate and keep.
The distance measure itself is calculated once for each term that we evaluate from the document. If we choose a distance measure like the Jaccard coefficient, then we evaluate the parts corresponding to each term in the pair. The calculation is a bit different when we use cosine similarity (Wartena 2008) between terms, because we now use the sums of the respective probabilities as well as their squares. The distance measure is calculated as one minus the cosine similarity.
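A minimal sketch of that computation, assuming each term is represented by a sparse probability distribution over documents (a dict mapping document ids to probabilities); the representation is an assumption made for illustration:

import math

def cosine_similarity(p, q):
    # dot product over shared documents, normalized by the vector lengths
    dot = sum(p[d] * q.get(d, 0.0) for d in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)

def distance(p, q):
    # the distance is one minus the cosine similarity, as described above
    return 1.0 - cosine_similarity(p, q)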
The terms as well as the measure depend on a summation of the probabilities over all documents. These documents are those from a collection C where each term from the term collection T can be found in exactly one source document. This doesn't mean the other documents cannot have occurrences of the same term, just that this particular instance of the term cannot be in multiple documents. So each occurrence is uniquely identified by the term, document, and position. When we want to find the number of occurrences of a term, we sum the occurrences over all the documents in the collection.
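That identity suggests representing occurrences as (term, document, position) tuples, so a term count is simply a sum over every document; a hedged sketch:

def n_occurrences(term, occurrences):
    # occurrences is an iterable of (term, document, position) tuples
    return sum(1 for (t, d, pos) in occurrences if t == term)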
We also consider Markov chain evolution for finding distributions of co-occurring terms. The first step is calculating the probability of a term occurring in a particular document, given that the term distribution of the occurrences is p. We find this as a sum over all the terms.
If we have a document distribution instead of the term distribution, we similarly compute the probability of finding a document with a particular term occurrence and sum over all the documents. This leads to a weighted average of all the term distributions in the documents.
We can combine the above two when we evaluate the chain twice to get a new distribution, which we use to find the distribution of the co-occurring terms t and z. By that we mean we find the distribution of one term given the first step of finding the distribution of a previous term. This gives an indication of the density of the document rather than the mere occurrence or non-occurrence of a keyword in a document. Otherwise it is similar to the previous model.
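A rough sketch of that two-step evolution, under assumed representations: the conditional distributions are nested dicts, and each step is the weighted average described above.

def step_to_documents(term_dist, p_doc_given_term):
    # P(d) = sum over t of P(d|t) * P(t)
    doc_dist = {}
    for t, pt in term_dist.items():
        for d, pd in p_doc_given_term[t].items():
            doc_dist[d] = doc_dist.get(d, 0.0) + pd * pt
    return doc_dist

def step_to_terms(doc_dist, p_term_given_doc):
    # P(z) = sum over d of P(z|d) * P(d): the co-occurring term distribution
    term_dist = {}
    for d, pd in doc_dist.items():
        for z, pz in p_term_given_doc[d].items():
            term_dist[z] = term_dist.get(z, 0.0) + pz * pd
    return term_dist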
The document collection plays an important role in the evaluation of the probabilities. A good choice of documents and their processing will improve the results from the keyword analysis here. The corpus text is a comprehensive collection of documents and has already been tagged and parsed. While there could be improvements to the corpus text, such as substituting pronouns with their corresponding nouns so that the frequency and distribution of terms improve, the existing set of documents in the corpus, their variance, and their size are sufficient for reasonable results from the term set.

Saturday, May 4, 2013

security application discussion continued ...

The Security application discussed in the previous post will enable the following workflows:
The Security administrator must be able to navigate any role and see the members included. Users care about roles. Additionally, adding and removing these members should be enabled from the list, and their details should be visible by double-clicking the items. There need not be grid lines and tab pages to separate the views. Instead we could use CSS and borderless transitions between views. These views are the same for each member and can include information such as groups, roles, resources, access levels, etc. Mostly, we want the UI to not be boxy but clear and simple, with seamless and smooth transitions. A clear white background is preferable to any other color. So the list of all members in a particular role can be listed on the same page with a white background and no grid borders, and when the user double-clicks a particular member, the details are shown on the same page. Boxes and borders are great when we compartmentalize the UI parts and great for organizing properties on the UI; however, the ask here is for simpler information rendering, with options to bring details onto the same focus area for the user with minimal peripheral changes. A light stationary hue in the peripheral area actually brightens up the canvas, so that the user is drawn towards the simpler format of the information presented somewhere near the center of the page. Technology-wise, this can be based on XAML, Prism, and the .NET stack with little or no other front-end technologies. The application can be simpler and nicer, albeit for security administration.

Another workflow that we could envision for this application, other than adding users to roles as discussed above, is to grant users access to domain objects via both label mapping and the object hierarchy. Users care about their objects. Note however that the premise of the previous discussion was based on row-level granularity and not object access. We could exercise object access and object control outside the database while the database has row-level granularity. Security applications may have workflows to secure both. Now consider the database schema, where each record's row-level label is typically set only once. (You may actually want to forbid changing the labels of those records, because you have evaluated the record for the duration of its existence when roles and all else haven't changed. Updates to the record do not change the identity of the record; on the other hand, changing merely the labels on user input could mean we end up with an inappropriate label, because column constraints may not be able to catch everything. This does not mean labels cannot be changed, and in fact internal methods may exist to take action on the user's behalf.) So now let's look at enforcing object access security, which is probably the primary workflow of this application. As stated earlier, the security admin may want to add security to domain objects and expect it to cascade down to all row-level entries. Objects could propagate permissions both on inheritance and composition, but the preferred way is inheritance, since no traversal is needed. Now, coming back to the application to enable object security in a label-based schema, the solution is to flatten all the derived objects to the same concrete entities and have them all be labeled the same via updateable views. So in effect we will be updating the row-level entries; see the sketch below. Note however that the inheritance-based flow of security is secondary in priority to directly assigning security to individual objects themselves, such as test passes and test results.
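An illustrative sketch of the cascade, with hypothetical names: a label assigned to a parent domain object, say a test pass, flows down to every child, say its test results, before the flattened rows are updated via the views.

def cascade_label(obj, label):
    # assign the label, then propagate it down the object hierarchy
    obj.label = label
    for child in getattr(obj, 'children', []):
        cascade_label(child, label)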

Friday, May 3, 2013

Control table for Label based security model in database

Let's look at some examples of the control table data when using the label security toolkit from SQL Server. In our case, where we are designing a UI application for security management of popular test tools, we will go by the use cases to pick and choose the values to populate in the tables. For example, we know that test tool users want to preserve the integrity of test results. Hence the results may be read-write for data entry but read-only for others. Likewise, read-only results should be filterable, so read-only users should be able to specify tags. Also, test case cloning may be a common operation requiring the use of templates. Similarly, we know that test cases can be used across suites and may be included in different matrices; therefore they should be made available for increasing reuse. Testers may want the ease of defining security up the object hierarchy and expecting it to cascade down. Hence, we use the classifications of reserved, private, protected, and public. Further, we can have compartments of none, readonly, readwrite, and owner.
We may have only one category and one compartment. Categories can be hierarchical, as in our case, but compartments are mutually exclusive. The markings that we have for our category are the classifications mentioned above. Note that the default or guest low-privileged access corresponding to the public marking may not be sufficient for security provisioning of all out-of-box features, and hence it may need to be split or refined into more classifications. The classification hierarchy is expressed in the marking hierarchy table as opposed to the marking table. Next, we have the unique label table that assigns a unique label to a combination of markings and roles.
There will be at least one database role for each possible value of an any-or-all comparison rule of a non-hierarchical category. For hierarchical categories, again there will be one for each possible value, but the roles will also be nested. Some examples of roles are guest, dev, test, production support, reporting, owners, administrator, security administrator, etc.
When using a label-based security model, it is important to note that the labels are assigned directly to each row of the base table. The labels are small, often a short byte or an integer, and can have a non-clustered index on them. Sometimes labels are not kept in the base table but in a junction table of the base identifier and the label marking identifier. The reason this doesn't scale well is that it creates a duplicate column of the identifier. Moreover, if the identifier column is of type guid, then we can't index them efficiently, and performance can suffer with scanning and full comparison between guids. Therefore, it is advisable to spend the one-time effort to add labels at the row level directly to the base table.
Next we define a view with a list of all security labels present in the database that the current user has access to. Users may or may not have access to specifying their labels with insert/update/delete. Also, label syntax and semantics validation can be offloaded to XSD-based checks when labels are represented as typed XML. Representations of labels in XML have an element per category.
We can also create helper functions for looking up labels by id and vice versa, and for resolving whether a user has access to data with a given label.
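A hypothetical sketch of that resolution helper in application code, using the classifications and compartments named above; in practice the check would live close to the data, in the view or a stored procedure.

CLASSIFICATION_RANK = {'public': 0, 'protected': 1, 'private': 2, 'reserved': 3}
COMPARTMENT_RANK = {'none': 0, 'readonly': 1, 'readwrite': 2, 'owner': 3}

def has_access(user_classification, user_compartment, row_label):
    # the user's clearance must dominate the row's label in both dimensions
    classification, compartment = row_label
    return (CLASSIFICATION_RANK[user_classification] >= CLASSIFICATION_RANK[classification]
            and COMPARTMENT_RANK[user_compartment] >= COMPARTMENT_RANK[compartment])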
Thus we have discussed the database level schema changes for enabling row based security.
Next let's continue to look at the design of the UI application for Security.
The landing page of the security UI will have a split view between resources and users. Resource lockdown and user access management require detail views by themselves, but the security admin's job can become easier if the landing page is like a dashboard with all the controls visible. Some examples are EMC Archer and SharePoint applications that make governance easier. The security admin ideally wants to enable mapping between users and roles for a tool. Just as likely, (s)he may want an intuitive UI to define one or more of the base tables with appropriate tags. These tasks cannot be left to a designer or SQL scripts. The UI for a security provisioning application is very much required to make the job easier and visual for a security admin.
Next, role provisioning, promotion, and demotion of user accounts, as well as selecting multiple roles for the same account, must be facilitated with proper UI controls. It would be ideal if there were an illustration of the resultant privileges on sample data based on the admin's selection of roles for a given user. This visual rendering of the final privilege set may reinforce what is expected from the changes made.
Lastly, the changes made by an admin should be in the form of a ticket response, such as for incident management. This ticket is opened whenever a security change is requested, and the actions associated with the changes are documented in the ticket, ideally automatically by the tool. The tickets not only prevent repudiation but also provide an audit trail.
The UI for a security application could open detailed views on any single account or role or label based on double-clicking with the ability to make and save changes. This gives a glimpse of the UI for security provisioning.

A clustering method for finding keywords in a text

Given a distance function between two terms that measures their similarity, we build a tree of clusters, which we traverse to insert each term into the cluster with the nearest center. For that cluster, we recompute the center as if the record r were inserted into it. If the cluster threshold would be exceeded, we start a new cluster instead and proceed to the next record. If the tree grows beyond a maximum number of clusters, because we want to keep only a few clusters, then we can increase the threshold so that clusters can be merged or can accommodate more records.
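A rough sketch of that insertion step, assuming terms are represented as numeric vectors and a plain Euclidean distance stands in for the term similarity distance:

import math

def distance(a, b):
    # Euclidean distance between two term vectors of equal length
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class Cluster:
    def __init__(self, members):
        self.members = members
        self.recompute_center()

    def recompute_center(self):
        # the center is the component-wise mean of the member vectors
        n, dim = len(self.members), len(self.members[0])
        self.center = [sum(m[i] for m in self.members) / n for i in range(dim)]

def insert_term(term, clusters, threshold, max_clusters):
    # insert into the nearest cluster if within threshold, else start a new one
    if clusters:
        nearest = min(clusters, key=lambda c: distance(term, c.center))
        if distance(term, nearest.center) <= threshold:
            nearest.members.append(term)
            nearest.recompute_center()
            return threshold
    clusters.append(Cluster([term]))
    if len(clusters) > max_clusters:
        threshold *= 2  # relax the threshold so clusters can merge later
    return threshold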

B+ Tree:
class Node:
    def __init__(self, data=None, l=None, r=None, center=None):
        self.data = data
        self.l = l
        self.r = r
        self.center = center
        self.next = None
        # chain the left child to the right one, as in a B+ tree leaf level
        if l is not None:
            l.next = r

    def value(self):
        return self.center