Saturday, December 7, 2013

To wrap up the discussion on the test portal, we note the following:
Very little code needs to be written for the page that kicks off a test run and lists the test results.
This works well with a model-view-controller framework.
We can easily add more views for the detailed results of each run with the framework.
We want the results to be published to TFS instead of to files or a database.
The runs are kicked off on a local or remote machine via a service that can listen for requests and invoke mstest. In our case, since we wanted to target the testing of queues, we could use queues for the service to listen on. That said, the overall approach is to use as many of the out-of-the-box solutions as possible rather than customize or write new components. Test controllers and test runs with MTM are well known. The focus here is on our test categorizations and the use of our portal with a variety of tests.
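To make this concrete, here is a minimal sketch of such a listener in C#, assuming an MSMQ queue whose messages carry a test category name; the queue path, assembly name and mstest arguments are illustrative, not a prescribed design.

```csharp
// Minimal sketch of a queue-listening runner service (names and paths are illustrative).
// Assumes an MSMQ queue ".\private$\testruns" whose message body is a test category name.
using System.Diagnostics;
using System.Messaging;

class TestRunListener
{
    static void Main()
    {
        using (var queue = new MessageQueue(@".\private$\testruns"))
        {
            queue.Formatter = new XmlMessageFormatter(new[] { typeof(string) });
            while (true)
            {
                var message = queue.Receive();            // blocks until a run request arrives
                var category = (string)message.Body;      // e.g. "QueueTests"
                var args = string.Format(
                    "/testcontainer:Tests.dll /category:\"{0}\" /resultsfile:\"{0}.trx\"",
                    category);
                using (var mstest = Process.Start("mstest.exe", args))
                {
                    mstest.WaitForExit();                 // the portal picks up the trx afterwards
                }
            }
        }
    }
}
```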
I want to talk a little bit about the UI and database as well. The choice of MVC/MVVM lets us build separate, testable views with no limit on what can go into them. We can even have JavaScript and Ajax for postbacks and use control toolkits. With view models, we can have more than one tab on a page. This gives us the flexibility to have partial views, one for each genre of test, such as web tests, API tests, and queue tests. Then, with the option to customize the page per user, separate views for changing test data, pages to view stats or associate TFS bugs/work items, and charts or email subscriptions, we can make it all the more sophisticated.
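As a rough illustration, a view model backing such a tabbed page might look like the following; the class and property names are hypothetical and only show how one model could feed a partial view per test genre.

```csharp
// Hypothetical view model for a tabbed results page; each genre renders in its own partial view.
using System;
using System.Collections.Generic;

public class TestPortalViewModel
{
    public string SignedOnUser { get; set; }
    public List<TestRunSummary> WebTests { get; set; }
    public List<TestRunSummary> ApiTests { get; set; }
    public List<TestRunSummary> QueueTests { get; set; }
}

public class TestRunSummary
{
    public string TestName { get; set; }
    public string Owner { get; set; }
    public string Outcome { get; set; }
    public DateTime RunDate { get; set; }
}
```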
I also don't want to limit the discussion to any one stack. We should be able to implement this with any stack, be it VS or Selenium based.
As an aside, test is also an area where many ideas are applicable, and these have value elsewhere too. For example, from UI we can borrow the ideas of consolidation, customization, shared or dedicated views, resource grouping and categorization, and layout and organization, and from MVC we can borrow decomposition, testability, and separation of concerns. We can illustrate these ideas with an example for testability, an example for MVC organization, and an idea for representing persistence via an organization of providers. We can even represent incremental builds.
As developers, we look for ideas and implementations everywhere.

Friday, December 6, 2013

I want to talk about a web application for executing test runs and viewing results. Let's say that the tests are authored in VS and that they are categorized by means of attributes. The user is able to select one or more categories of tests for execution. For each run, the corresponding mstest arguments are set and invoked. The web page itself has a progress indicator for the duration of the test execution. The test run is invoked by a command line and is independent of the portal view. The results of the test run could be displayed from a trx file produced by the run or from a database where the results are populated independently. The tests themselves have names, owners, categories and the path to the assembly so they can be found and executed. The test execution and history can also be displayed from files or a database. What we need is the minimal logic to kick off a run and show the results. The results for each run could then show the history of each test. Since the views are only about the test results and the input to kick off a run, the models are for the runs/results and the test/category to be executed.
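A rough sketch of those two models, with assumed property names rather than a prescribed schema, could be as simple as this:

```csharp
// Illustrative models for the portal: what to run (test/category) and what happened (run/result).
using System;

public class TestDefinition
{
    public string Name { get; set; }
    public string Owner { get; set; }
    public string Category { get; set; }      // maps to the test category attribute
    public string AssemblyPath { get; set; }  // where mstest can find the test
}

public class TestRunResult
{
    public int RunId { get; set; }
    public string Category { get; set; }
    public DateTime StartedAt { get; set; }
    public string TrxFilePath { get; set; }   // populated when the run completes
    public int Passed { get; set; }
    public int Failed { get; set; }
}
```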
The views are separated into a list view and a details view for each result. The list view could have additional features such as paging, sorting and searching to make it easier to navigate the test results, but the details page only requires the path to the trx file and a drill-down into the history of individual tests within the category. The web page could be personalized to display the signed-on user and the test runs associated with that user. This requires AD integration for the login if the web pages are to be displayed internally. A domain name registration could help reach this web application by name.
The reason for describing this web application is to see if the implementation can be done with minimal code, leveraging the latest MVC and Web API frameworks to get great working results. This will allow views and pages to be added rapidly. So far we have talked about the controller kicking off the run and returning a default view. Then the views get updated when the results are ready. Since completion depends on when we have a trx file or the results in the database, the two are handled separately.
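A minimal controller sketch along those lines might look like this; ITestRunService is an assumed abstraction (not an existing API), the TestRunResult model is the one sketched above, and only the ASP.NET MVC plumbing is standard.

```csharp
// Sketch of the portal controller: one action kicks off a run, another reports status.
using System.Web.Mvc;

public interface ITestRunService
{
    int KickOff(string category);          // e.g. drops a message on the run queue
    TestRunResult GetResult(int runId);    // null until a trx file or database rows exist
}

public class TestRunController : Controller
{
    private readonly ITestRunService _runs;

    public TestRunController(ITestRunService runs)
    {
        _runs = runs;                      // assumes a DI container supplies the service
    }

    [HttpPost]
    public ActionResult Start(string category)
    {
        int runId = _runs.KickOff(category);
        return RedirectToAction("Status", new { id = runId });
    }

    public ActionResult Status(int id)
    {
        var result = _runs.GetResult(id);
        return result == null ? View("InProgress") : View("Results", result);
    }
}
```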
In other words, the web page caters to the convenience of kicking off the run and waiting for the results. For the process where the results are populated into a database offline, we could use either a custom executable or the option with mstest to publish the results to TFS. The latter is convenient in the sense that TFS already enables web access and navigation, so our portal doesn't need to do a lot.
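For reference, a representative mstest invocation that publishes to TFS might look like the following; the server URL, team project, build name, platform and flavor are placeholders, and the exact arguments depend on the TFS setup.

```
mstest /testcontainer:Tests.dll /category:QueueTests /publish:http://tfsserver:8080/tfs/DefaultCollection /teamproject:TestPortal /publishbuild:"Nightly_20131206.1" /platform:"Any CPU" /flavor:Release
```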
Finally, we display the results with jQuery progress bars, itemized row-level styles, and overall page styling and content layout. The web pages could also have themes and variations based on how the tests are populated. For example, for tests that target, say, MSMQ, the tests could optionally show which queues are targeted.

Thursday, December 5, 2013

In today's post we talk about maximal Kullback-Leibler divergence cluster analysis by Anne Sheehy. In this paper, she discusses a new procedure for performing a cluster analysis along with a proof of a consistency result for the procedure. The clustering of data X1, X2, ..., Xn with distribution P on a sample space X in R, with a set of known partitions K, results in members Vk from the collection of k partitions of the space X. The partition that best describes the clustering structure of the data is defined to be the one which maximizes a criterion. In this case, the criterion is a weighted sum of Kullback-Leibler divergences.
The Kullback-Leibler function K(Q,P) takes the value 0 when m = EpX and increases as m moves away from EpX. The rate of increase of the function will depend on many things besides the distance of m from EpX, including the dispersion of P in the direction in which m is moving.
The value K(Qn-min(A), Pn) differentiates the subset A of {X1, X2, ..., Xn} from the rest of the sample. If the cluster A of the sample is far removed from the rest, it will have a very large value of K(Qn-min(A), Pn).
Here the subsets A are chosen such that P(A) > 0.
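To make the criterion concrete, for discrete distributions Q and P over the same support the divergence is K(Q, P) = sum over i of q_i * log(q_i / p_i). A minimal sketch, assuming both arrays are valid probability vectors over the same support:

```csharp
// Kullback-Leibler divergence K(Q, P) for discrete distributions over the same support.
// Returns 0 when Q equals P and grows as Q moves away from P; P must be positive wherever Q is.
using System;

static class Divergence
{
    public static double KullbackLeibler(double[] q, double[] p)
    {
        double sum = 0.0;
        for (int i = 0; i < q.Length; i++)
        {
            if (q[i] > 0.0)                          // 0 * log(0/p) is taken as 0
                sum += q[i] * Math.Log(q[i] / p[i]);
        }
        return sum;
    }
}
```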
The paper mentions a reference to Blaschke's Selection Theorem.
If a clustering procedure results in k clusters on a given set of data, and all the points in any one of those clusters are removed from the sample and the rest re-clustered, it should give the remaining k-1 of the original clusters. This "cluster omission admissibility" is a characteristic of a good clustering procedure.
In order to select a subset of the original sample, we could consider clustering the probabilities into a set of k partitions or ranges of values. Since the value ranges are bounded between 0 and 1, we can divide the space into k non-overlapping ranges, and we will be able to use one or more of these clusters for different subsets. The goal of the subset is to maximize the KLD criterion, so this lets us choose one among the different subsets based on the value of K(Qn-min(A), Pn).
The probabilities used in the KLD are based on term frequencies and can be taken as the ratio of the sum of the total number of terms in the sentences where a term appears to the total number of terms in the document.
Thus we can apply the term frequency at any scope, whether sentences relative to a document or documents relative to a collection.
The procedure for selection is the same regardless of the scope, and the clustering too can be applied using measures based on different scopes.
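A small sketch of that term probability, assuming the document is already tokenized into sentences; the method and variable names are illustrative.

```csharp
// Sketch of the term probability described above: for each term, the total length of the
// sentences it appears in, divided by the total number of terms in the document.
using System.Collections.Generic;
using System.Linq;

static class TermProbabilities
{
    public static Dictionary<string, double> Compute(IList<string[]> sentences)
    {
        double totalTerms = sentences.Sum(s => s.Length);
        var weights = new Dictionary<string, double>();
        foreach (var sentence in sentences)
        {
            foreach (var term in sentence.Distinct())
            {
                double current;
                weights.TryGetValue(term, out current);
                weights[term] = current + sentence.Length;   // add the sentence length once per term
            }
        }
        return weights.ToDictionary(kv => kv.Key, kv => kv.Value / totalTerms);
    }
}
```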
Today I want to talk about some technologies that are not commonly encountered in the .Net world but are staples in Java shops. The tools are also different for these technologies. For example, Eclipse is a Java-based development environment with many features like Visual Studio. Git is used for source control. It has a command line based on the bash shell. The shell is like a Unix variant, and the version control is distributed and incremental.
Among the technologies usually included with Java are the Apache server, JSP and Tomcat.
Apache is a web server that fueled the growth of the internet. It is most commonly used on Unix-like systems and is released under a permissive open-source license. Apache is supported on a wide variety of platforms. It is implemented as compiled modules which extend the core functionality. Its interfaces support a variety of scripting languages. Transport, authentication, proxying, logging, URL rewriting and filtering are some of the features available via modules. It scales well for performance since it provides a variety of multi-processing modules which allow Apache to be run in a process-based or hybrid mode. Apache strives to reduce latency and increase throughput. Apache is used in the LAMP stack, an acronym for Linux, Apache, MySQL and PHP/Perl, where MySQL (or MongoDB) serves as the database and PHP/Perl/Python are the scripting languages. This stack is open source and is popular for its high performance and high availability.
JSP is a technology that helps create HTML pages or other document types. It is similar to PHP but uses the Java programming language. To deploy and run Java Server Pages, a servlet container such as Tomcat or a similar web server is required.
JSPs are translated into servlets at runtime, and they can be used independently or as the views in a model-view-controller design. The compiled pages use Java bytecode rather than a native binary format.
JBoss is an application server that runs on Java EE; it is now called WildFly. IBM WebSphere Application Server is also used as a web application server, or middleware, for hosting Java-based web applications.
Maven and Jenkins are used for builds and tests. Maven is a build automation tool used primarily for Java projects, though it can be used to build C# and other projects as well. The projects to be built are configured using a project object model, declared in a pom.xml file.
It typically runs the unit tests with the JUnit framework as part of the build.
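For illustration, a minimal pom.xml with a JUnit test dependency might look like the following; the coordinates and versions are placeholders.

```xml
<!-- Minimal illustrative pom.xml; the coordinates are placeholders. -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>sample-app</artifactId>
  <version>1.0-SNAPSHOT</version>
  <packaging>jar</packaging>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>
```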

Wednesday, December 4, 2013

In the previous post we were describing the steps of the algorithm by Matsuo. We mentioned that the number of running terms was taken as Ntotal and that the top 30% of these terms were taken as frequent terms. Then the frequent terms were clustered pairwise when their Jensen-Shannon divergence was above the threshold. This results in C clusters. The number of terms that co-occur with a cluster c, denoted by nc, gives us the expected probability as nc/Ntotal. Then we compute the chi-square value for each term and output the given number of terms with the largest chi-square values.
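A sketch of the chi-square step under these definitions, with assumed container types and names, might look like this:

```csharp
// Sketch of the chi-square scoring step: freq[w][c] holds co-occurrence counts of term w
// with cluster c, expectedProbability[c] is nc/Ntotal, and nw is w's total co-occurrences.
using System;
using System.Collections.Generic;
using System.Linq;

static class KeywordScoring
{
    public static Dictionary<string, double> ChiSquare(
        Dictionary<string, Dictionary<string, int>> freq,
        Dictionary<string, double> expectedProbability)
    {
        var scores = new Dictionary<string, double>();
        foreach (var term in freq)
        {
            double nw = term.Value.Values.Sum();
            double chi2 = 0.0;
            foreach (var cluster in expectedProbability)
            {
                int observed;
                term.Value.TryGetValue(cluster.Key, out observed);
                double expected = nw * cluster.Value;
                if (expected > 0.0)
                    chi2 += Math.Pow(observed - expected, 2) / expected;
            }
            scores[term.Key] = chi2;
        }
        return scores;   // the top-scoring terms are reported as keywords
    }
}
```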
Index terms are usually evaluated based on precision and recall, but this algorithm does not use a corpus. So the algorithm was run on different data sets by different authors, who were also asked to provide five or more terms which they thought were indispensable keywords. Coverage of each method was calculated by taking the ratio of the indispensable terms found within the top 15 terms returned by the algorithm. The results were comparable with tf-idf.
In the previous post, we mentioned an innovative approach to extract the keywords from a single document. It uses word co-occurrences, and we wanted to use Kullback-Leibler divergence, clustering and cosine similarity. We were keen on extracting keywords and associating them with topics simultaneously, repeating the partitioning until the cluster centers stabilize. We wanted the flexibility of topic overlap with fuzzy memberships. We were also interested in a term-attribute table that spanned all the words in a dictionary, with attributes that helped us discern topics, tones and era. Note that we attached the relevance or term weights before we clustered the topics, but we used them together in each iteration. And we wanted to evaluate the clusters with the measures we discussed earlier.
In Matsuo's paper, the extracted keyword quality is improved by selecting a proper set of columns from the co-occurrence matrix. This set of columns is a set of terms or keywords, preferably orthogonal, and it is extracted with clustering. They mention two major approaches to clustering - similarity-based clustering and pairwise clustering. In similarity-based clustering, if terms w1 and w2 have similar distributions of co-occurrence with other terms, w1 and w2 are considered to be in the same cluster. In pairwise clustering, if terms w1 and w2 co-occur frequently, w1 and w2 are considered to be in the same cluster. They found similarity-based clustering to be effective in grouping paraphrases and phrases; the similarity of two distributions is measured statistically by the Jensen-Shannon divergence. On the other hand, they found pairwise clustering to yield relevant terms in the same cluster. Thresholds are determined by preliminary experiments. Proper clustering of frequent terms results in an appropriate chi-square value for each term. The steps involve preprocessing, selection of frequent terms, clustering of frequent terms, calculation of expected probability, calculation of the chi-dash-square value, and output of keywords. Frequent terms are selected as the top 30%. Frequent terms are clustered in pairs whose Jensen-Shannon divergence is above the threshold 0.95 * log 2.
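For the similarity measure, here is a minimal sketch of the standard Jensen-Shannon divergence between two discrete co-occurrence distributions; the 0.95 * log 2 threshold quoted above belongs to the paper's own formulation of the measure, so the thresholding convention should be taken from the paper rather than from this sketch.

```csharp
// Jensen-Shannon divergence between the co-occurrence distributions of two frequent terms,
// used in the similarity-based clustering step; p and q are probability vectors of equal length.
using System;

static class Similarity
{
    public static double JensenShannon(double[] p, double[] q)
    {
        double divergence = 0.0;
        for (int i = 0; i < p.Length; i++)
        {
            double m = 0.5 * (p[i] + q[i]);
            if (p[i] > 0.0) divergence += 0.5 * p[i] * Math.Log(p[i] / m);
            if (q[i] > 0.0) divergence += 0.5 * q[i] * Math.Log(q[i] / m);
        }
        return divergence;
    }
}
```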

Tuesday, December 3, 2013

I came across a paper that has some similarity with my interest in keyword extraction. The paper is titled Keyword extraction from a single document using word co-occurrence statistical information, by Matsuo and Ishizuka. Their algorithm works on a single document without using a corpus. They extract the frequent terms first. Then they count the co-occurrences of each term with the frequent terms. If a term appears selectively with a particular subset of frequent terms, the term is likely to have an important meaning. They measure the degree of bias of the co-occurrence distribution with the chi-square goodness-of-fit measure. They show that their algorithm performs just as well as tf-idf with a corpus.
They treat a sentence as a basket of words, ignoring term order and grammatical information. They build a co-occurrence matrix by counting the frequencies of pairwise term co-occurrence. This matrix is a symmetric N x N matrix where N is the number of different terms, which is different from the number of frequent terms G. They ignore the diagonal components. If a term w appears independently of the frequent terms G, the distribution of co-occurrence of w with G is similar to the unconditional distribution. On the other hand, if there is a semantic relation between w and a subset g of G, then w and g have a biased distribution. Since the term frequency could be small, the degree of bias is not reliable, so they test the significance of the bias using the chi-square statistic. In this case, the chi-square is defined as
chi-square(w) = sum over g in G of (freq(w, g) - nw * pg)^2 / (nw * pg), where
pg is the expected probability, equal to the unconditional probability of the frequent term g,
nw is the total number of co-occurrences of term w with the frequent terms G, and
nw * pg represents the expected frequency of co-occurrence.
Terms with high chi-square value are relatively more important in the document than the ones with low chi-square.
They use clustering.
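As a closing illustration, here is a small sketch of building that co-occurrence matrix from sentences treated as baskets of words; the method and type names are illustrative.

```csharp
// Sketch of the co-occurrence matrix: each sentence is a basket of words, pairs of distinct
// terms in the same sentence are counted, and diagonal entries (a term with itself) are ignored.
using System;
using System.Collections.Generic;
using System.Linq;

static class CoOccurrence
{
    public static Dictionary<Tuple<string, string>, int> Build(IEnumerable<string[]> sentences)
    {
        var counts = new Dictionary<Tuple<string, string>, int>();
        foreach (var sentence in sentences)
        {
            var terms = sentence.Distinct().ToArray();
            for (int i = 0; i < terms.Length; i++)
            {
                for (int j = i + 1; j < terms.Length; j++)
                {
                    // store each pair in a canonical order so the matrix stays symmetric
                    var pair = string.CompareOrdinal(terms[i], terms[j]) < 0
                        ? Tuple.Create(terms[i], terms[j])
                        : Tuple.Create(terms[j], terms[i]);
                    int current;
                    counts.TryGetValue(pair, out current);
                    counts[pair] = current + 1;
                }
            }
        }
        return counts;
    }
}
```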